| Commit message (Collapse) | Author | Age | Files | Lines |
| |
Most of the other users of the fptools build system have migrated to
Cabal, and with the move to darcs we can now flatten the source tree
without losing history, so here goes.
The main change is that the ghc/ subdir is gone, and most of what it
contained is now at the top level. The build system now makes no
pretense at being multi-project; it is just the GHC build system.
No doubt this will break many things, and there will be a period of
instability while we fix the dependencies. A straightforward build
should work, but I haven't yet fixed binary/source distributions.
Changes to the Building Guide will follow, too.
| |
We had to bite the bullet here and add an extra word to every thunk,
to enable running ordinary libraries on SMP. Otherwise, we would have
needed to ship an extra set of libraries with GHC 6.6 in addition to
the two sets we already ship (normal + profiled), and all Cabal
packages would have to be compiled for SMP too. We decided it best
just to take the hit now, making SMP easily accessible to everyone in
GHC 6.6.
Incidentally, although this increases allocation by around 12% on
average, the performance hit is around 5%, and much less if your inner
loop doesn't use any laziness.
| |
Two improvements to the SMP runtime:
- support for 'par', aka sparks. Load balancing is very primitive
right now, but I have seen programs that go faster using par.
- support for backing off when a thread is found to be duplicating
a computation currently underway in another thread. This also
fixes some instability in SMP, because it turned out that when
an update frame points to an indirection, which can happen if
a thunk is under evaluation in multiple threads, then after GC
has shorted out the indirection the update will trash the value.
Now we suspend the duplicate computation to the heap before this
can happen.
Additionally:
- stack squeezing is separate from lazy blackholing, and now only
happens if there's a reasonable amount of squeezing to be done
in relation to the number of words of stack that have to be moved.
This means we won't try to shift 10MB of stack just to save 2
words at the bottom (it probably never happened, but still).
- update frames are now marked when they have been visited by lazy
blackholing, as per the SMP paper.
- cleaned up raiseAsync() a bit.
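The squeezing decision described above amounts to a cost/benefit test. A minimal sketch follows; the function name and the ratio are hypothetical and are not taken from the RTS sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: squeeze only if at least one word is saved
 * for every SQUEEZE_RATIO words that must be moved. Not the RTS value. */
#define SQUEEZE_RATIO 8

/* words_saved:   words reclaimed by collapsing adjacent update frames.
 * words_to_move: words of stack that have to be shifted to close the gap. */
static bool worth_squeezing(size_t words_saved, size_t words_to_move)
{
    if (words_saved == 0) {
        return false;                   /* nothing to gain */
    }
    /* Dividing (rather than multiplying words_saved) avoids overflow. */
    return words_to_move / SQUEEZE_RATIO <= words_saved;
}
```

With these assumed numbers, saving 2 words at the bottom of a 10MB stack (about 1.3 million 8-byte words) is rejected, while saving 100 words for 400 moved is accepted.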
| |
- change the type of StgRun(): now we return the Capability that the
thread currently holds. The return status of the thread is now
stored in cap->r.rRet (a new slot in the reg table).
This was necessary because on return from StgRun(), the current
TSO may be blocked, so it no longer belongs to us. If it is a bound
thread, then the Task may have been already woken up on another
Capability, so the scheduler can't use task->cap to find the
capability it currently owns.
- when shutting down, allow a bound thread to remove its TSO from
the run queue when exiting (eliminates an error condition in
releaseCapability()).
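A sketch of the calling convention this describes, using simplified stand-in types rather than the real RTS definitions. The function run_thread() and its migrate_to parameter are hypothetical; they only model the fact that the capability held on return may differ from the one passed in.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, not the RTS definitions. */
typedef enum { ThreadFinished, ThreadBlocked, ThreadYielding } ThreadStatus;

typedef struct {
    ThreadStatus rRet;          /* return status of the last thread run */
} RegTable;

typedef struct {
    int      no;                /* capability number */
    RegTable r;
} Capability;

/* Hypothetical stand-in for StgRun(): the thread starts on `cap` and may
 * end up holding `migrate_to` instead. The status goes into the reg table
 * of whichever capability is held on return, and that one is returned. */
static Capability *run_thread(Capability *cap, Capability *migrate_to,
                              ThreadStatus outcome)
{
    Capability *held = (migrate_to != NULL) ? migrate_to : cap;
    held->r.rRet = outcome;
    return held;
}
```

The caller reads the status through the returned pointer (`cap = run_thread(...); status = cap->r.rRet;`) instead of consulting task->cap.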
| |
Big re-hash of the threaded/SMP runtime
This is a significant reworking of the threaded and SMP parts of
the runtime. There are two overall goals here:
- To push down the scheduler lock, reducing contention and allowing
more parts of the system to run without locks. In particular,
the scheduler does not require a lock any more in the common case.
- To improve affinity, so that running Haskell threads stick to the
same OS threads as much as possible.
At this point we have the basic structure working, but there are some
pieces missing. I believe it's reasonably stable - the important
parts of the testsuite pass in all the (normal,threaded,SMP) ways.
In more detail:
- Each capability now has a run queue, instead of one global run
queue. The Capability and Task APIs have been completely
rewritten; see Capability.h and Task.h for the details.
- Each capability has its own pool of worker Tasks. Hence, Haskell
threads on a Capability's run queue will run on the same worker
Task(s). As long as the OS is doing something reasonable, this
should mean they usually stick to the same CPU. Another way to
look at this is that we're assuming each Capability is associated
with a fixed CPU.
- What used to be StgMainThread is now part of the Task structure.
Every OS thread in the runtime has an associated Task, and it
can ask for its current Task at any time with myTask().
- removed RTS_SUPPORTS_THREADS symbol, use THREADED_RTS instead
(it is now defined for SMP too).
- The RtsAPI has had to change; we must explicitly pass a Capability
around now. The previous interface assumed some global state.
SchedAPI has also changed a lot.
- The OSThreads API now supports thread-local storage, used to
implement myTask(), although it could be done more efficiently
using gcc's __thread extension when available.
- I've moved some POSIX-specific stuff into the posix subdirectory,
moving in the direction of separating out platform-specific
implementations.
- lots of lock-debugging and assertions in the runtime. In particular,
when DEBUG is on, we catch multiple ACQUIRE_LOCK()s, and there is
also an ASSERT_LOCK_HELD() call.
What's missing so far:
- I have almost certainly broken the Win32 build, will fix soon.
- any kind of thread migration or load balancing. This is high up
the agenda, though.
- various performance tweaks to do
- throwTo and forkProcess still do not work in SMP mode
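As an illustration of the per-capability run queue mentioned above, here is a minimal sketch. The struct layouts and function names are simplified assumptions; the real definitions are in Capability.h and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the RTS structures. */
typedef struct TSO {
    int         id;
    struct TSO *link;           /* next thread in the run queue */
} TSO;

typedef struct {
    TSO *run_queue_hd;          /* each capability owns its own queue, */
    TSO *run_queue_tl;          /* so no global scheduler lock is needed */
} Capability;

/* Append a thread to the end of this capability's run queue. */
static void append_to_run_queue(Capability *cap, TSO *tso)
{
    tso->link = NULL;
    if (cap->run_queue_hd == NULL) {
        cap->run_queue_hd = tso;
    } else {
        cap->run_queue_tl->link = tso;
    }
    cap->run_queue_tl = tso;
}

/* Remove and return the thread at the head, or NULL if the queue is empty. */
static TSO *pop_run_queue(Capability *cap)
{
    TSO *tso = cap->run_queue_hd;
    if (tso != NULL) {
        cap->run_queue_hd = tso->link;
        if (cap->run_queue_hd == NULL) {
            cap->run_queue_tl = NULL;
        }
        tso->link = NULL;
    }
    return tso;
}
```

Threads queued on one capability are never seen by another in this sketch, which matches the note that migration and load balancing are still missing.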
| |
More 64-fixing
| |
64 bit fix
| |
Avoid calling threadPaused() on exit from STG land if we're just
switching to the interpreter, and conversely call threadPaused() in
the interpreter if we're returning to the scheduler for anything other
than switching to STG.
This will probably fix the recent slowdown in GHCi (ioref001 test, for
example). It was broken when we moved the threadPaused() call into
STG from the scheduler, so it only affects the HEAD.
| |
gcc 4.0.0 fix: avoid casted expression as lvalue
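For context, a sketch of the kind of rewrite this implies. The function below is a made-up example, not the code touched by the commit: older gcc accepted a cast on the left of an assignment or increment as an extension, and gcc 4.0 rejects it.

```c
#include <assert.h>
#include <stddef.h>

/* Advance an untyped cursor past one word-sized field.
 *
 * Rejected by gcc 4.0 (a cast yields an rvalue in standard C):
 *     ((unsigned long *)cursor)++;
 *
 * Portable form: convert once into a typed pointer, then modify that. */
static void *skip_word(void *cursor)
{
    unsigned long *p = (unsigned long *)cursor;
    p++;
    return p;
}
```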
| |
type fixup
| |
Warning police (added missing #include)
| |
Cleanup: all (well, most) messages from the RTS now go through the
functions in RtsUtils: barf(), debugBelch() and errorBelch(). The
latter two were previously called belch() and prog_belch()
respectively. See the comments for the right usage of these message
functions.
One reason for doing this is so that we can avoid spurious uses of
stdout/stderr by Haskell apps on platforms where we shouldn't be using
them (eg. non-console apps on Windows).
| |
Merge backend-hacking-branch onto HEAD. Yay!
| |
Tweaks to have RTS (C) sources compile with MSVC. Apart from wibbles
related to the handling of 'inline', changed Schedule.h:POP_RUN_QUEUE()
not to use expression-level statement blocks.
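To show what avoiding expression-level statement blocks can look like, here is a sketch of a pop macro written as a plain statement with an out-parameter. The macro body and the queue variables are assumptions for illustration; the actual definition in Schedule.h is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct TSO { struct TSO *link; } TSO;   /* simplified */

static TSO *run_queue_hd = NULL;
static TSO *run_queue_tl = NULL;

/* GNU-only style (illustrative), where the block yields a value:
 *     #define POP_RUN_QUEUE() ({ TSO *t = run_queue_hd; ...; t; })
 *
 * Portable style accepted by MSVC: a statement macro that stores the
 * popped thread (or NULL) in the variable named by `pt`. */
#define POP_RUN_QUEUE(pt)                       \
    do {                                        \
        TSO *pop_tmp_ = run_queue_hd;           \
        if (pop_tmp_ != NULL) {                 \
            run_queue_hd = pop_tmp_->link;      \
            pop_tmp_->link = NULL;              \
            if (run_queue_hd == NULL) {         \
                run_queue_tl = NULL;            \
            }                                   \
        }                                       \
        (pt) = pop_tmp_;                        \
    } while (0)
```

Call sites change from `t = POP_RUN_QUEUE();` to `POP_RUN_QUEUE(t);`.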
| |
Bound Threads
=============
Introduce a way to use foreign libraries that rely on thread local state
from multiple threads (mainly affects the threaded RTS).
See the file threads.tex in CVS at haskell-report/ffi/threads.tex
(not entirely finished yet) for a definition of this extension. A less formal
description is also found in the documentation of Control.Concurrent.
The changes mostly affect the THREADED_RTS (./configure --enable-threaded-rts),
except for saving & restoring errno on a per-TSO basis, which is also necessary
for the non-threaded RTS (a bugfix).
Detailed list of changes
------------------------
- errno is saved in the TSO object and restored when necessary:
ghc/includes/TSO.h, ghc/rts/Interpreter.c, ghc/rts/Schedule.c
- rts_mainLazyIO is no longer needed; main is no longer a special case:
ghc/includes/RtsAPI.h, ghc/rts/RtsAPI.c, ghc/rts/Main.c, ghc/rts/Weak.c
- passCapability: a new function that releases the capability and "passes"
it to a specific OS thread:
ghc/rts/Capability.h ghc/rts/Capability.c
- waitThread(), scheduleWaitThread() and schedule() get an optional
Capability *initialCapability passed as an argument:
ghc/includes/SchedAPI.h, ghc/rts/Schedule.c, ghc/rts/RtsAPI.c
- Bound Thread scheduling (that's what this is all about):
ghc/rts/Schedule.h, ghc/rts/Schedule.c
- new Primop isCurrentThreadBound#:
ghc/compiler/prelude/primops.txt.pp, ghc/includes/PrimOps.h, ghc/rts/PrimOps.hc,
ghc/rts/Schedule.h, ghc/rts/Schedule.c
- a simple function, rtsSupportsBoundThreads, that returns true if THREADED_RTS
is defined:
ghc/rts/Schedule.h, ghc/rts/Schedule.c
- a new implementation of forkProcess (the old implementation stays in place
for the non-threaded case). Partially broken; works for the standard
fork-and-exec case, but not for much else. A proper forkProcess is
really next to impossible to implement:
ghc/rts/Schedule.c
- Library support for bound threads:
Control.Concurrent.
rtsSupportsBoundThreads, isCurrentThreadBound, forkOS,
runInBoundThread, runInUnboundThread
libraries/base/Control/Concurrent.hs, libraries/base/Makefile,
libraries/base/include/HsBase.h, libraries/base/cbits/forkOS.c (new file)
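The per-TSO errno handling in the first item of the list can be sketched as below. The struct and the two helper names are hypothetical simplifications, not the fields or functions in TSO.h and Schedule.c.

```c
#include <assert.h>
#include <errno.h>

/* Simplified TSO carrying only the saved errno. */
typedef struct {
    int saved_errno;
} TSO;

/* Called when a Haskell thread stops running on this OS thread. */
static void save_errno(TSO *tso)
{
    tso->saved_errno = errno;
}

/* Called just before the Haskell thread runs again, possibly after
 * other threads have made their own system calls in between. */
static void restore_errno(const TSO *tso)
{
    errno = tso->saved_errno;
}
```

Without this, a Haskell thread that is descheduled between a failing foreign call and its errno check could observe a value written by another thread.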
| |
comment wibble
(merge to stable)
| |
Fix an obscure bug: the most general kind of heap check,
HEAP_CHECK_GEN(), is supposed to save the contents of *every* register
known to the STG machine (used in cases where we either can't figure
out which ones are live, or doing so would be too much hassle). The
problem is that it wasn't saving the L1 register.
A slight complication arose in that saving the L1 register pushed the
size of the frame over the 16 words allowed for the size of the bitmap
stored in the frame, so I changed the layout of the frame a bit.
Describing all the registers using a single bitmap is overkill when
only 8 of them can actually be pointers, so now the bitmap is only 8
bits long and we always skip over a fixed number of non-ptr words to
account for all the non-ptr regs. This is all described in StgMacros.h.
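A rough sketch of how a frame with this shape might be scanned. The layout order, the count of non-pointer words, and the bit polarity are all assumptions made for the example; StgMacros.h has the real description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_PTR_REGS     8    /* from the text: at most 8 pointer registers */
#define N_NONPTR_WORDS 6    /* hypothetical number of non-pointer words */

typedef uintptr_t Word;

/* Scan the saved-register area of a frame.
 * Assumed layout: N_PTR_REGS slots described by `bitmap`, followed by
 * N_NONPTR_WORDS words that are never traced.
 * Assumed polarity: bit i set means slot i holds a live pointer.
 * Stores the number of live pointers in *n_ptrs and returns the address
 * of the first word after the area. */
static const Word *scan_saved_regs(const Word *p, uint8_t bitmap,
                                   size_t *n_ptrs)
{
    size_t live = 0;
    for (int i = 0; i < N_PTR_REGS; i++) {
        if (bitmap & (1u << i)) {
            live++;             /* a real GC would follow p[i] here */
        }
    }
    *n_ptrs = live;
    return p + N_PTR_REGS + N_NONPTR_WORDS;    /* skip the fixed tail */
}
```

The point of the design is visible in the return statement: the non-pointer part needs no bitmap bits at all, only a constant skip.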
| |
Off-by-one tidyup.
ALLOC_AP, ALLOC_PAP and MKAP were all being constructed
with size arguments equal to (1+number of args/FVs) in
ByteCodeGen.schemeE, only for Interpreter.c to subtract 1
when fishing out the payloads. This commit drops the
up-and-downery.
Simplification spotted by Andy Moran
| |
Fix some bugs in compacting GC.
Bug 1: When threading the fields of an AP or PAP, we were grabbing the
info table of the function without unthreading it first.
Bug 2: eval_thunk_selector() might accidentally find itself in
to-space when going through indirections in a compacted generation.
We must check for this case and bale out if necessary.
Bug 3: This is somewhat more nasty. When we have an AP or PAP that
points to a BCO, the layout info for the AP/PAP is in the BCO's
instruction array, which is two objects deep from the AP/PAP itself.
The trouble is, during compacting GC, we can only safely look one
object deep from the current object, because pointers from objects any
deeper might have been already updated to point to their final
destinations.
The solution is to put the arity and bitmap info for a BCO into the
BCO object itself. This means BCOs become variable-length, which is a
slight annoyance, but it also means that looking up the arity/bitmap
is quicker. There is a slight reduction in complexity in the byte
code generator due to not having to stuff the bitmap at the front of
the instruction stream.
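The fix for Bug 3 can be pictured as a variable-length object that carries its own layout information. The struct below is a simplified stand-in with made-up field names, not the real StgBCO.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t Word;

/* Simplified BCO: arity and bitmap live in the object itself, so the
 * collector can find them by looking only one object deep. */
typedef struct {
    Word  arity;
    Word  bitmap_words;     /* length of bitmap[] in words */
    void *instrs;           /* separate object, not needed for layout */
    Word  bitmap[];         /* C99 flexible array member */
} BCO;

/* Allocate a BCO with a copy of the given bitmap. Returns NULL on
 * allocation failure; the caller frees the result with free(). */
static BCO *alloc_bco(Word arity, const Word *bitmap, Word bitmap_words)
{
    BCO *bco = malloc(sizeof(BCO) + bitmap_words * sizeof(Word));
    if (bco == NULL) {
        return NULL;
    }
    bco->arity = arity;
    bco->bitmap_words = bitmap_words;
    bco->instrs = NULL;
    if (bitmap_words > 0) {
        memcpy(bco->bitmap, bitmap, bitmap_words * sizeof(Word));
    }
    return bco;
}

/* Object size in words: fixed part plus the variable-length bitmap. */
static size_t bco_size_words(const BCO *bco)
{
    return sizeof(BCO) / sizeof(Word) + bco->bitmap_words;
}
```

The cost mentioned above shows up in bco_size_words(): the size is no longer a constant and has to be computed per object.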
| |
Threaded RTS: improve ccall performance by allocating parameters as a
variable length array instead of using malloc
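A sketch of the technique, assuming a compiler with variable-length array support (C99, or gcc as an extension). The function and type names are invented for the example; `stack_args` stands in for words on the TSO stack.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef long Word;
typedef Word (*ForeignFn)(const Word *args, size_t n_args);

/* Call `fn` with a private copy of the arguments, so the call does not
 * depend on the original location staying put. */
static Word call_with_copied_args(ForeignFn fn, const Word *stack_args,
                                  size_t n_args)
{
    /* VLA on the C stack: no malloc/free pair per call. Use at least
     * one element, since a zero-length VLA is undefined behaviour. */
    Word copy[n_args > 0 ? n_args : 1];
    if (n_args > 0) {
        memcpy(copy, stack_args, n_args * sizeof(Word));
    }
    return fn(copy, n_args);
}

/* Example callee: sums its arguments. */
static Word sum_args(const Word *args, size_t n_args)
{
    Word total = 0;
    for (size_t i = 0; i < n_args; i++) {
        total += args[i];
    }
    return total;
}
```

The trade-off is that the copy now consumes C stack in proportion to the argument count, which is acceptable for the small argument lists typical of foreign calls.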
| |
Fix a potential crash in the threaded RTS by copying ccall arguments
from the TSO stack to a malloced block before doing the call.
(no changes were made for the non-threaded case)
| |
This commit fixes many bugs and limitations in the threaded RTS.
There are still some issues remaining, though.
The following bugs should have been fixed:
- [+] "safe" calls could cause crashes
- [+] yieldToReturningWorker/grabReturnCapability
- It used to deadlock.
- [+] couldn't wake blocked workers
- Calls into the RTS could go unanswered for a long time, and
that includes ordinary callbacks in some circumstances.
- [+] couldn't block on an MVar and expect to be woken up by a signal
handler
- Depending on the exact situation, the RTS shut down or
blocked forever and ignored the signal.
- [+] The locking scheme in RtsAPI.c didn't work
- [+] run_thread label in wrong place (schedule())
- [+] Deadlock in GHC.Handle
- if a signal arrived at the wrong time, an mvar was never
filled again
- [+] Signals delivered to the "wrong" thread were ignored or handled
too late.
Issues:
*) If GC can move TSO objects (I don't know - can it?), then ghci
will occasionally crash when calling foreign functions, because the
parameters are stored on the TSO stack.
*) There is still a race condition lurking in the code
(both threaded and non-threaded RTS are affected):
If a signal arrives after the check for pending signals in
schedule(), but before the call to select() in awaitEvent(),
select() will be called anyway. The signal handler will be
executed much later than expected.
*) For Win32, GHC doesn't yet support non-blocking IO, so while a
thread is waiting for IO, no call-ins can happen. If the RTS is
blocked in awaitEvent, it uses a polling loop on Win32, so call-ins
should work (although the polling loop looks ugly).
*) Deadlock detection is disabled for the threaded rts, because I
don't know how to do it properly in the presence of foreign call-ins
from foreign threads.
This causes the tests conc031, conc033 and conc034 to fail.
*) "safe" is currently treated as "threadsafe". Implementing "safe" in
a way that blocks other Haskell threads is more difficult than was
thought at first. I think it could be done with a few additional lines
of code, but personally, I'm strongly in favour of abolishing the
distinction.
*) Running finalizers at program termination is inefficient - there
are two OS threads passing messages back and forth for every finalizer
that is run. Also (just as in the non-threaded case) the finalizers
are run in parallel to any remaining haskell threads and to any
foreign call-ins that might still happen.
| |
Changes to the way stack checks are handled in GHCi, to fix a rare bug
when a stack check fails in a BCO.
We now aggregate all stack use from case alternatives up to the
enclosing function/thunk BCO, and do a single stack check at the
beginning of that BCO. This simplifies the stack check failure code,
because it doesn't have to cope with the case when a case alternative
needs to restart.
We still employ the trick of doing a fixed stack check before every
BCO, only inserting an actual stack check instruction in the BCO if it
needs more stack than this fixed amount. The fixed stack check is now
only done before running a function/thunk BCO.
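The scheme can be sketched as two small calculations. This is an assumption-laden illustration: the message does not say whether the aggregation is a maximum or a sum, so a maximum over alternatives is assumed here, and the fixed allowance is a made-up number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIXED_STACK_CHECK_WORDS 32   /* hypothetical fixed allowance */

/* Stack needed by a function/thunk BCO: its own use plus the deepest
 * case alternative (assumed: alternatives are exclusive, so take max). */
static size_t bco_stack_need(size_t own_words, const size_t *alt_words,
                             size_t n_alts)
{
    size_t deepest = 0;
    for (size_t i = 0; i < n_alts; i++) {
        if (alt_words[i] > deepest) {
            deepest = alt_words[i];
        }
    }
    return own_words + deepest;
}

/* An explicit stack check instruction is only emitted when the fixed
 * check done before running the BCO would not cover the requirement. */
static bool needs_explicit_check(size_t need_words)
{
    return need_words > FIXED_STACK_CHECK_WORDS;
}
```

Because the check happens once at the start of the enclosing BCO, a failure never has to restart in the middle of a case alternative.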
| |
Correctly describe the stack during a GHCi CCALL instruction to the
RTS. The previous hack, temporarily truncating the stack to the
topmost valid stack frame, didn't work because stack-squeezing tends
to move the stack around before the call.
The right thing to do is correctly describe the chunk of ccall args
with an info table, which is what this change does. We use a RET_DYN
info table with the number of non-ptrs from the CCALL instruction.
| |
Merge the eval-apply-branch on to the HEAD
------------------------------------------
This is a change to GHC's evaluation model in order to ultimately make
GHC more portable and to reduce complexity in some areas.
At some point we'll update the commentary to describe the new state of
the RTS. Pending that, the highlights of this change are:
- No more Su. The Su register is gone, update frames are one
word smaller.
- Slow-entry points and arg checks are gone. Unknown function calls
are handled by automatically-generated RTS entry points (AutoApply.hc,
generated by the program in utils/genapply).
- The stack layout is stricter: there are no "pending arguments" on
the stack any more, the stack is always strictly a sequence of
stack frames.
This means that there's no need for LOOKS_LIKE_GHC_INFO() or
LOOKS_LIKE_STATIC_CLOSURE() any more, and GHC doesn't need to know
how to find the boundary between the text and data segments (BIG WIN!).
- A couple of nasty hacks in the mangler caused by the need to
identify closure ptrs vs. info tables have gone away.
- Info tables are a bit more complicated. See InfoTables.h for the
details.
- As a side effect, GHCi can now deal with polymorphic seq. Some bugs
in GHCi which affected primitives and unboxed tuples are now
fixed.
- Binary sizes are reduced by about 7% on x86. Performance is roughly
similar, some programs get faster while some get slower. I've seen
GHCi perform worse on some examples, but haven't investigated
further yet (GHCi performance *should* be about the same or better
in theory).
- Internally the code generator is rather better organised. I've moved
info-table generation from the NCG into the main codeGen where it is
shared with the C back-end; info tables are now emitted as arrays
of words in both back-ends. The NCG is one step closer to being able
to support profiling.
This has all been fairly thoroughly tested, but no doubt I've messed
up the commit in some way.
| |
Extra arg to suspendThread() and resumeThread(); controls whether an external call is concurrent or not
| |
SMP: hack-and-slash to bring BaseReg into scope
| |
Fix up the interpreter following the recent modifications to
suspendThread/resumeThread. Someone should test that foreign imports
in the interpreter still work.
| |
Fix the large block allocation bug (Yay!)
-----------------------------------------
In order to do this, I had to
1. in each heap-check failure branch, return the amount of heap
actually requested, in a known location (I added another slot
in StgRegTable called HpAlloc for this purpose). This is
useful for other reasons - in particular it makes it possible
to get accurate allocation statistics.
2. In the scheduler, if a heap check fails and we wanted more than
BLOCK_SIZE_W words, then allocate a special large block and place
it in the nursery. The nursery now has to be double-linked so
we can insert the new block in the middle.
3. The garbage collector has to be able to deal with multiple objects
in a large block. It turns out that this isn't a problem as long as
the large blocks only occur in the nursery, because we always copy
objects from the nursery during GC. One small change had to be
made: in evacuate(), we may need to follow the link field from the
block descriptor to get to the block descriptor for the head of a
large block.
4. Various other parts of the storage manager had to be modified
to cope with a nursery containing a mixture of block sizes.
Point (3) causes a slight pessimization in the garbage collector. I
don't see a way to avoid this. Point (1) causes some code bloat (a
rough measurement is around 5%), so to offset this I made the
following change which I'd been meaning to do for some time:
- Store the values of some commonly-used absolute addresses
(eg. stg_update_PAP) in the register table. This lets us use
shorter instruction forms for some absolute jumps and saves some
code space.
- The type of Capability is no longer the same as an StgRegTable.
MainRegTable renamed to MainCapability. See Regs.h for details.
Other minor changes:
- remove individual declarations for the heap-check-failure jump
points, and declare them all in StgMiscClosures.h instead. Remove
HeapStackCheck.h.
Updates to the native code generator to follow.
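Steps 1 and 2 above can be sketched as follows. The block descriptor is a simplified stand-in, BLOCK_SIZE_W is given an arbitrary value, and inserting the large block directly after the current one is an assumption about placement.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE_W 1024   /* hypothetical block size in words */

/* Simplified block descriptor for a doubly-linked nursery. */
typedef struct bdescr {
    size_t         blocks;  /* number of contiguous blocks in the group */
    struct bdescr *link;    /* next block in the nursery */
    struct bdescr *back;    /* previous block (needed for mid-list insert) */
} bdescr;

/* Blocks needed to satisfy a request of hp_alloc words, as reported by
 * the failed heap check. */
static size_t blocks_for(size_t hp_alloc)
{
    return (hp_alloc + BLOCK_SIZE_W - 1) / BLOCK_SIZE_W;
}

/* Insert the large block `big` immediately after `cur`. */
static void insert_after(bdescr *cur, bdescr *big)
{
    big->back = cur;
    big->link = cur->link;
    if (cur->link != NULL) {
        cur->link->back = big;
    }
    cur->link = big;
}
```

The scheduler side would then be roughly: if the requested size exceeds BLOCK_SIZE_W, allocate blocks_for(HpAlloc) contiguous blocks and splice them in with insert_after().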
| |
Change the story about POSIX headers in C compilation.
Until now, all C code in the RTS and library cbits has by default been
compiled with settings for POSIXness enabled, that is:
#define _POSIX_SOURCE 1
#define _POSIX_C_SOURCE 199309L
#define _ISOC9X_SOURCE
If you wanted to negate this, you'd have to define NON_POSIX_SOURCE
before including headers.
This scheme has some bad effects:
* It means that ccall-unfoldings exported via interfaces from a
module compiled with -DNON_POSIX_SOURCE may not compile when
imported into a module which does not -DNON_POSIX_SOURCE.
* It overlaps with the feature tests we do with autoconf.
* It seems to have caused borkage in the Solaris builds for some
considerable period of time.
The New Way is:
* The default changes to not-being-in-Posix mode.
* If you want to force a C file into Posix mode, #include as
the **first** include the new file ghc/includes/PosixSource.h.
Most of the RTS C sources have this include now.
* NON_POSIX_SOURCE is almost totally expunged. Unfortunately
we have to retain some vestiges of it in ghc/compiler so that
modules compiled via C on Solaris using older compilers don't
break.
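A sketch of what a C file in Posix mode looks like under the New Way. The three definitions are the ones quoted above; that PosixSource.h contains exactly these is an assumption, so they are inlined here in place of the #include to keep the example self-contained.

```c
/* Stand-in for: #include "PosixSource.h"
 * This must come before any system header, because the feature-test
 * macros only take effect if defined before the first such include. */
#define _POSIX_SOURCE   1
#define _POSIX_C_SOURCE 199309L
#define _ISOC9X_SOURCE

/* System headers follow the feature-test macros. */
#include <assert.h>
#include <unistd.h>

/* Report the POSIX level this translation unit asked for. */
static long posix_level_requested(void)
{
    return _POSIX_C_SOURCE;
}
```

A file that needs non-POSIX interfaces simply omits the first include, which is now the default.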
| |
Disable debugging machinery which sneaked in with the last commit.
| |
C-side FFI support for Byte/Ptr arrays.
| |
Do suspendThread/resumeThread round ccalls so that ccall_gc is supported.
| |
wibble: add cast to keep gcc happy.
| |
C-side support for FFI in GHCi (foreign import only).
| |
wibble
| |
wibble - drop a ? from a fprintf format string to avoid having it look like a trigraph seq might be present
| |
Implement opcodes bci_TESTLT_F and bci_TESTEQ_F. (Duh.)
| |
RTS support for the ugly tagToEnum# hack. Actually a very general
thing -- just a bytecode unconditional jump, so we can do more general
control-flow in BCOs.
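To illustrate what an unconditional jump adds to a bytecode interpreter, here is a toy stack machine. The opcodes and encoding are hypothetical and unrelated to the real bci_* instruction set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy machine. */
enum { OP_PUSH, OP_ADD, OP_JMP, OP_HALT };

#define STACK_WORDS 16

/* Run `code` until OP_HALT and return the top of the stack.
 * Returns -1 on an unknown opcode. */
static int run(const int *code)
{
    int stack[STACK_WORDS];
    size_t sp = 0;              /* number of words on the stack */
    size_t pc = 0;              /* index of the next instruction */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:
            assert(sp < STACK_WORDS);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_JMP:
            pc = (size_t)code[pc];      /* operand is an absolute target */
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            return -1;
        }
    }
}
```

In the program `PUSH 1; JMP 6; PUSH 100; PUSH 2; ADD; HALT`, the jump skips `PUSH 100`, so the machine adds 1 and 2.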
| |
VoidRep call/return support for the interpreter.
| |
make this compile with profiling on (it probably won't work, though).
| |
debug print wibble
| |
Bite the bullet and make GHCi support non-optional in the RTS. GHC
4.11 should be able to build GHCi without any additional tweaks now.
- the Linker is split into two parts: LinkerBasic.c, containing the
routines required by the rest of the RTS, and Linker.c, containing
the linker proper, which is not referred to from the rest of the RTS.
Only Linker.c requires -ldl, so programs which don't make use of the
linker (everything except GHC, in other words) won't need -ldl.
| |
Check the context_switch flag and yield if set, so that interpreted
code behaves properly in a multi(haskell)threaded environment.
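A minimal sketch of the idea, assuming the flag is polled before each instruction and cleared by the scheduler. Where the real interpreter polls, and all names other than context_switch, are assumptions.

```c
#include <assert.h>
#include <signal.h>

typedef enum { RanToCompletion, Yielded } RunResult;

/* Set asynchronously (for example by a timer) to request a switch. */
static volatile sig_atomic_t context_switch = 0;

/* Execute instructions from *pc up to n_instrs. If a context switch has
 * been requested, stop early and return Yielded with *pc left at the
 * next instruction, so the thread can be resumed later. */
static RunResult interpret(int n_instrs, int *pc)
{
    while (*pc < n_instrs) {
        if (context_switch) {
            return Yielded;     /* hand control back to the scheduler */
        }
        /* ... execute instruction number *pc here ... */
        (*pc)++;
    }
    return RanToCompletion;
}
```

Without the poll, a long-running interpreted loop would never give other Haskell threads a chance to run.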
| |
Implement implicit and explicit stack checks. For details, see recent
commit message for ghc/compiler/ghci/ByteCodeGen.lhs.
| |
Major performance improvements for the bytecode interpreter.
| |
In interpreted code, basic support for routing primop calls through
to functions in PrelPrimopWrappers.lhs.
| |
Add mkApUpd0# primop, used to make sure bytecode-compiled top-level things
are updateable.
| |
Latest bug fixes.
| |
Today's interpreter bug fixes: FP stuff, and unpacking constrs onto stack.