Most of the other users of the fptools build system have migrated to
Cabal, and with the move to darcs we can now flatten the source tree
without losing history, so here goes.
The main change is that the ghc/ subdir is gone, and most of what it
contained is now at the top level. The build system now makes no
pretense at being multi-project, it is just the GHC build system.
No doubt this will break many things, and there will be a period of
instability while we fix the dependencies. A straightforward build
should work, but I haven't yet fixed binary/source distributions.
Changes to the Building Guide will follow, too.
This gives some control over affinity, while we figure out the best
way to automatically schedule threads to make best use of the
available parallelism.
In addition to the primitive, there is also:
GHC.Conc.forkOnIO :: Int -> IO () -> IO ThreadId
where 'forkOnIO i m' creates a thread on Capability (i `rem` N), where
N is the number of available Capabilities set by +RTS -N.
Threads forked by forkOnIO do not automatically migrate to free
Capabilities the way normal threads do. Still, if you're using
forkOnIO exclusively, it's a good idea to use +RTS -qm to disable work
pushing anyway (work pushing takes too much time when the run queues
are large; this is something we need to fix).
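A hedged sketch of using the new primitive, pinning one worker to each
Capability (assuming GHC.Conc also exports numCapabilities; if it
doesn't, hard-code the count to match +RTS -N):

    import Control.Concurrent (myThreadId)
    import Control.Concurrent.MVar
    import GHC.Conc (forkOnIO, numCapabilities)  -- numCapabilities: assumed export

    -- Pin one worker per Capability; run with +RTS -N<n> (and -qm if you
    -- are using forkOnIO exclusively, as noted above).
    main :: IO ()
    main = mapM spawn [0 .. numCapabilities - 1] >>= mapM_ takeMVar
      where
        spawn i = do
          done <- newEmptyMVar
          _ <- forkOnIO i $ do
                 tid <- myThreadId
                 putStrLn ("capability " ++ show i ++ ": " ++ show tid)
                 putMVar done ()
          return done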
There are two new options in the -threaded RTS:
-qm Don't automatically migrate threads between CPUs
-qw Migrate a thread to the current CPU when it is woken up
Previously both of these were effectively off, i.e. threads were
migrated between CPUs willy-nilly, and threads were always migrated to
the current CPU when woken up. This is the first step in tweaking the
scheduling for more effective work balancing; there will no doubt be
more to come.
I think this missing dep is what broke my parallel build
I used make -j2 with ghc-6.4.2.20060323 and got:
------------------------------------------------------------------------
==fptools== make boot -wr --jobserver-fds=3,11 -j;
in /var/tmp/portage/ghc-6.4.2_pre20060323/work/ghc-6.4.2.20060323/ghc/includes
------------------------------------------------------------------------
Creating ghcplatform.h...
Done.
gcc -O -O2 -march=k8 -pipe -Wa,--noexecstack -c mkDerivedConstants.c -o mkDerivedConstants.o
In file included from ghcconfig.h:5,
from Stg.h:42,
from Rts.h:19,
from mkDerivedConstants.c:20:
ghcplatform.h:1:1: unterminated #ifndef
Done.
With this patch applied I can no longer reproduce this build bug.
So I think this patch should be applied to the CVS ghc-6-4-branch too.
This fixes another instance of a subtle SMP bug (see patch "really
nasty bug in SMP").
This is just an assertion, in effect: we should never enter a PAP, but
for convenience we previously attached the PAP apply code to the PAP
info table. The problem with this was that it made it harder to track
down bugs that result in entering a PAP...
Now that we can handle using C argument registers as global registers,
extend the x86_64 register mapping. We now have 5 integer argument
registers, 4 float, and 2 double (all caller-saves). This results in a
reasonable speedup on x86_64.
We now have more stg_ap entry points: stg_ap_*_fast, which take
arguments in registers according to the platform calling convention.
This is faster if the function being called is evaluated and has the
right arity, which is the common case (see the eval/apply paper for
measurements).
We still need the stg_ap_*_info entry points for stack-based
application, such as overflow cases where a function is applied to too
many arguments. The stg_ap_*_fast functions actually just check for
an evaluated function, and if they don't find one, push the args on
the stack and invoke stg_ap_*_info. (This might be slightly slower in
some cases, but not in the common case.)
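A hedged Haskell-level illustration of where the generic apply code
runs (the entry points themselves live in the RTS; the exact stg_ap
variant named in the comment is an assumption):

    -- 'f' is unknown at this call site, so the call goes through the
    -- generic apply machinery (stg_ap_pp_fast: two pointer arguments).
    -- When 'f' is evaluated and has arity 2 -- the common case -- the
    -- fast path applies it directly, arguments in registers.
    apply2 :: (Int -> Int -> Int) -> Int -> Int -> Int
    apply2 f x y = f x y

    main :: IO ()
    main = print (apply2 (+) 1 2)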
There was an integer overflow in the definition of LDV_RECORD_CREATE
when StgWord is 64 bits.
atomicModifyMutVar# was re-using the storage manager mutex (sm_mutex)
to get its atomicity guarantee in SMP mode. But the recent addition
of a call to dirty_MUT_VAR() to implement the read barrier led to a
rare deadlock, because dirty_MUT_VAR() very occasionally needs to
allocate a new block to chain onto the mutable list, which requires
sm_mutex.
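For context, atomicModifyMutVar# is the primitive beneath
Data.IORef.atomicModifyIORef; a minimal sketch of the kind of
operation that relies on its atomicity guarantee:

    import Data.IORef

    -- Atomically bump a shared counter, returning the old value.
    -- Under SMP this must not interleave with other writers, which is
    -- exactly what atomicModifyMutVar# guarantees.
    bump :: IORef Int -> IO Int
    bump ref = atomicModifyIORef ref (\n -> (n + 1, n))

    main :: IO ()
    main = do
      ref <- newIORef 0
      old <- bump ref
      print old          -- 0; the ref now holds 1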
Now, the threaded RTS also includes SMP support. The -smp flag is a
synonym for -threaded. The performance implications of this are small
to negligible, and it results in a code cleanup and reduces the number
of combinations we have to test.
rather than recordMutableGen(); the former works better in SMP
We always assign to BaseReg on return from resumeThread(), but in
cases where BaseReg is not an lvalue (e.g. unregisterised) we need to
disable this assignment. See comments for more details.
We had to bite the bullet here and add an extra word to every thunk,
to enable running ordinary libraries on SMP. Otherwise, we would have
needed to ship an extra set of libraries with GHC 6.6 in addition to
the two sets we already ship (normal + profiled), and all Cabal
packages would have to be compiled for SMP too. We decided it best
just to take the hit now, making SMP easily accessible to everyone in
GHC 6.6.
Incidentally, although this increases allocation by around 12% on
average, the performance hit is around 5%, and much less if your inner
loop doesn't use any laziness.
Along the lines of the clean/dirty arrays and IORefs implemented
recently, now threads are marked clean or dirty depending on whether
they need to be scanned during a minor GC or not. This should speed
up GC when there are lots of threads, especially if most of them are
idle.
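A hedged sketch of the workload this helps: many threads, most of them
idle, whose stacks stay clean between collections:

    import Control.Concurrent

    -- Thousands of idle threads: each blocks on 'gate' and never runs
    -- again, so its stack stays clean and minor GCs can skip it.
    main :: IO ()
    main = do
      gate <- newEmptyMVar :: IO (MVar ())
      sequence_ [ forkIO (readMVar gate) | _ <- [1 .. 10000 :: Int] ]
      -- allocate a little to drive some minor GCs while they sit idle
      print (sum [1 .. 1000000 :: Integer])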
- fix a mixup in Capability.c regarding signals: signals_pending() is not
used in THREADED_RTS
- some cleanups and warning removal while I'm here
Improve the GC behaviour of IORefs (see Ticket #650).
This is a small change to the way IORefs interact with the GC, which
should improve GC performance for programs with plenty of IORefs.
Previously we had a single closure type for mutable variables,
MUT_VAR. Mutable variables were *always* on the mutable list in older
generations, and always traversed on every GC.
Now, we have two closure types: MUT_VAR_CLEAN and MUT_VAR_DIRTY. The
latter is on the mutable list, but the former is not. (NB. this
differs from MUT_ARR_PTRS_CLEAN and MUT_ARR_PTRS_DIRTY, both of which
are on the mutable list). writeMutVar# now implements a write
barrier by calling dirty_MUT_VAR() in the runtime, which changes
MUT_VAR_CLEAN into MUT_VAR_DIRTY and adds the variable to the mutable
list if necessary.
This results in some pretty dramatic speedups for GHC itself. I've
just measured a 30% overall speedup compiling a 31-module program
(anna) with the default heap settings :-D
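A hedged sketch of the access pattern that benefits: long-lived IORefs
that are read far more often than they are written:

    import Data.IORef

    -- Long-lived, read-mostly IORefs: once promoted to an old generation
    -- and not written again, each one stays MUT_VAR_CLEAN and off the
    -- mutable list, so the GC no longer traverses it on every collection.
    main :: IO ()
    main = do
      refs <- mapM newIORef [1 .. 100000 :: Int]
      total <- fmap sum (mapM readIORef refs)  -- reads only: no write barrier
      print total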
Improve the GC behaviour of IOArrays/STArrays
See Ticket #650
This is a small change to the way mutable arrays interact with the
GC, which can have a dramatic effect on performance and makes tricks
with unsafeThaw/unsafeFreeze redundant. Data.HashTable should be
faster now (I haven't measured it yet).
We now have two mutable array closure types, MUT_ARR_PTRS_CLEAN and
MUT_ARR_PTRS_DIRTY. Both are on the mutable list if the array is in
an old generation. writeArray# sets the type to MUT_ARR_PTRS_DIRTY.
The garbage collector can set the type to MUT_ARR_PTRS_CLEAN if it
finds that no element of the array points into a younger generation
(discovering this required a small addition to evacuate(), but rough
tests indicate that it doesn't measurably affect performance).
NOTE: none of this affects unboxed arrays (IOUArray/STUArray), only
boxed arrays (IOArray/STArray).
We could go further and extend the DIRTY bit to be per-block rather
than for the whole array, but for now this is an easy improvement.
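A hedged sketch of what this supersedes (code no longer needs
unsafeFreeze/unsafeThaw tricks to keep a long-lived boxed array cheap
for the GC):

    import Data.Array.IO

    -- A plain boxed IOArray now tracks its own state: writeArray marks
    -- it MUT_ARR_PTRS_DIRTY, and the GC can mark it clean again once no
    -- element points into a younger generation.
    main :: IO ()
    main = do
      arr <- newArray (0, 99999) 0 :: IO (IOArray Int Integer)
      writeArray arr 0 42
      x <- readArray arr 0
      print x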
Default signal handlers weren't being installed; amazing that this has
been broken ever since I rearranged the signal handling code.
MAYBE_GC: we should check alloc_blocks in addition to CurrentNursery,
since some allocateLocal calls don't allocate from the nursery.
remove duplicate definition
for TSO fields, define a Cmm macro TSO_OFFSET_xxx to get the actual
offset including the header and variable parts (we were misusing the
headerless OFFSET_xxx macros in a couple of places).
revert rev. 1.22 again, just in case this is the cause of the
segfaults reported on OpenBSD and SuSE.
Small performance improvement to STM: reduce the size of an atomically
frame from 3 words to 2 words by combining the "waiting" boolean field
with the info pointer, i.e. having two separate info tables/return
addresses for an atomically frame, one for the normal case and one
for the waiting case.
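A hedged sketch of the case that exercises the waiting return address:
a transaction that retries and sleeps until a TVar it read changes:

    import Control.Concurrent.STM

    -- When the slot is empty the transaction retries: the atomically
    -- frame switches to its "waiting" info table while the thread
    -- sleeps until another thread writes the TVar.
    takeSlot :: TVar (Maybe a) -> STM a
    takeSlot tv = do
      m <- readTVar tv
      case m of
        Nothing -> retry
        Just x  -> do writeTVar tv Nothing; return x

    main :: IO ()
    main = do
      tv <- atomically (newTVar (Just "hello"))
      s  <- atomically (takeSlot tv)
      putStrLn s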
oops, undo previous (SMP.h is already included)
#include SMP.h
define wb() and xchg() for non-SMP versions of the RTS
fix some (thankfully harmless) typos
unlockClosure() requires a write barrier for the compiler - write
barriers aren't required for the CPU, but gcc re-orders non-aliasing
writes unless we use an explicit barrier.
This only just showed up when we started compiling the RTS with -O2.
un-revert rev. 1.22, it wasn't the cause of last weekend's breakage
Files missed from STM implementation changes
something has gone wrong; I don't have time right now to find out
exactly what, so revert rev. 1.22 in an attempt to fix it.
Two improvements to the SMP runtime:
- support for 'par', aka sparks. Load balancing is very primitive
right now, but I have seen programs that go faster using par.
- support for backing off when a thread is found to be duplicating
a computation currently underway in another thread. This also
fixes some instability in SMP, because it turned out that when
an update frame points to an indirection, which can happen if
a thunk is under evaluation in multiple threads, then after GC
has shorted out the indirection the update will trash the value.
Now we suspend the duplicate computation to the heap before this
can happen.
Additionally:
- stack squeezing is separate from lazy blackholing, and now only
happens if there's a reasonable amount of squeezing to be done
in relation to the number of words of stack that have to be moved.
This means we won't try to shift 10Mb of stack just to save 2
words at the bottom (it probably never happened, but still).
- update frames are now marked when they have been visited by lazy
blackholing, as per the SMP paper.
- cleaned up raiseAsync() a bit.
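A hedged example of a program that can now go faster with 'par' (this
assumes par and pseq are exported from GHC.Conc at this point; later
they moved to Control.Parallel):

    import GHC.Conc (par, pseq)

    -- Spark the first recursive call so a free Capability can pick it
    -- up; pseq ensures we evaluate 'b' ourselves before summing.
    -- Run with the SMP runtime, e.g. +RTS -N2.
    parFib :: Int -> Int
    parFib n
      | n < 15    = fib n                    -- too small to be worth a spark
      | otherwise = a `par` (b `pseq` (a + b))
      where
        a = parFib (n - 1)
        b = parFib (n - 2)

    fib :: Int -> Int
    fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

    main :: IO ()
    main = print (parFib 30)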
cosmetic
Add wcStore(), a write-combining store if supported
(I tried using it in the update code and only succeeded in making
things slower, but it might come in handy in the future)
Omit the __DISCARD__() call in FB_ if __GNUC__ >= 3. It doesn't
appear to be necessary now, and it prevents some gcc optimisations.
Fix a crash in STM; we were releasing ownership of the transaction too
early in stmWait(), so a TSO could be woken up before we had finished
putting it to sleep properly.
Win32: Use CriticalSections instead of Mutexes, they are *much* faster.
Modify ACQUIRE_LOCK/RELEASE_LOCK for use in .cmm files
- Very simple work-sharing amongst Capabilities: whenever a Capability
detects that it has more than 1 thread in its run queue, it runs
around looking for empty Capabilities, and shares the threads on its
run queue equally with the free Capabilities it finds.
- unlock the garbage collector's mutable lists, by having private
mutable lists per capability (and per generation). The private
mutable lists are moved onto the main mutable lists at each GC.
This pulls the old-generation update code out of the storage manager
mutex, which is one of the last remaining causes of (alleged) contention.
- Fix some problems with synchronising when a GC is required; we
  should synchronise more quickly now.
- change the type of StgRun(): now we return the Capability that the
thread currently holds. The return status of the thread is now
stored in cap->r.rRet (a new slot in the reg table).
This was necessary because on return from StgRun(), the current
TSO may be blocked, so it no longer belongs to us. If it is a bound
thread, then the Task may already have been woken up on another
Capability, so the scheduler can't use task->cap to find the
Capability it currently owns.
- when shutting down, allow a bound thread to remove its TSO from
the run queue when exiting (eliminates an error condition in
releaseCapability()).
Fix build for way "u"