path: root/rts/sm/GCThread.h
Commit message | Author | Age | Files | Lines
* Eliminate zero_static_objects_list() | Simon Marlow | 2015-07-28 | 1 | -2/+5
  Summary: [Revised version of D1076 that was committed and then backed out]

  In a workload with a large amount of code, zero_static_objects_list() takes a significant amount of time, and furthermore it is in the single-threaded part of the GC.

  This patch uses a slightly fiddly scheme for marking objects on the static object lists, using a flag in the low 2 bits that flips between two states to indicate whether an object has been visited during this GC or not. We also have to take into account objects that have not been visited yet, which might appear at any time due to runtime linking.

  Test Plan: validate
  Reviewers: austin, ezyang, rwbarton, bgamari, thomie
  Reviewed By: bgamari, thomie
  Subscribers: thomie
  Differential Revision: https://phabricator.haskell.org/D1106
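A minimal sketch of the flag-flipping idea described in this commit message, assuming a simplified object layout. All names here (`StaticObj`, `FLAG_A`, `begin_gc`, `visit`) are hypothetical and are not the RTS definitions. The point is that flipping one global flag at the start of a GC makes every tag left over from the previous GC read as "not visited", and a tag of 0 covers objects that have never been seen (for example, ones added by runtime linking).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the RTS code. */
#define FLAG_A    ((uintptr_t)1)
#define FLAG_B    ((uintptr_t)2)
#define FLAG_MASK ((uintptr_t)3)

typedef struct StaticObj {
    /* Next object on the list, tagged in the low 2 bits.
     * A tag of 0 means the object has never been visited. */
    uintptr_t link;
} StaticObj;

static uintptr_t current_flag = FLAG_A;

/* Flip the flag at the start of each GC instead of zeroing every link. */
static void begin_gc(void)
{
    current_flag = (current_flag == FLAG_A) ? FLAG_B : FLAG_A;
}

/* An object counts as visited only if its tag matches the current flag. */
static int visited_this_gc(const StaticObj *o)
{
    return (o->link & FLAG_MASK) == current_flag;
}

/* Strip the tag to recover the next pointer. */
static StaticObj *next_of(const StaticObj *o)
{
    return (StaticObj *)(o->link & ~FLAG_MASK);
}

/* Push o onto the list unless it was already visited; returns 1 if added. */
static int visit(StaticObj *o, StaticObj **head)
{
    if (visited_this_gc(o)) return 0;
    o->link = (uintptr_t)*head | current_flag;
    *head = o;
    return 1;
}
```

The sketch relies on `StaticObj` being at least 4-byte aligned so that the low 2 bits of a pointer are free for the tag.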
* Revert "Eliminate zero_static_objects_list()" | Simon Marlow | 2015-07-27 | 1 | -5/+2
  This reverts commit b949c96b4960168a3b399fe14485b24a2167b982.
* Eliminate zero_static_objects_list() | Simon Marlow | 2015-07-22 | 1 | -2/+5
  Summary: In a workload with a large amount of code, zero_static_objects_list() takes a significant amount of time, and furthermore it is in the single-threaded part of the GC.

  This patch uses a slightly fiddly scheme for marking objects on the static object lists, using a flag in the low 2 bits that flips between two states to indicate whether an object has been visited during this GC or not. We also have to take into account objects that have not been visited yet, which might appear at any time due to runtime linking.

  Test Plan: validate
  Reviewers: austin, bgamari, ezyang, rwbarton
  Subscribers: thomie
  Differential Revision: https://phabricator.haskell.org/D1076
* Replace hooks by callbacks in RtsConfig (#8785) | Simon Marlow | 2015-04-07 | 1 | -0/+1
  Summary: Hooks rely on static linking semantics, and are broken by -Bsymbolic which we need when using dynamic linking.

  Test Plan: Built it
  Reviewers: austin, hvr, tibbe
  Differential Revision: https://phabricator.haskell.org/D8
* Revert "rts: add Emacs 'Local Variables' to every .c file" | Simon Marlow | 2014-09-29 | 1 | -8/+0
  This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
* rts: add Emacs 'Local Variables' to every .c file | Austin Seipp | 2014-07-28 | 1 | -0/+8
  This will hopefully help ensure some basic consistency going forward by overriding buffer variables. In particular, it sets the wrap length, the offset to 4, and turns off tabs.

  Signed-off-by: Austin Seipp <austin@well-typed.com>
* Avoid unnecessary clock_gettime() syscalls in GC stats. | Brian Brooks | 2014-07-10 | 1 | -2/+1
  Summary: Avoid unnecessary clock_gettime() syscalls in GC stats.

  Test Plan: Use strace.
  Reviewers: simonmar, austin
  Reviewed By: simonmar, austin
  Subscribers: simonmar, relrod, carter
  Differential Revision: https://phabricator.haskell.org/D39
* Tiny comment on the change from StgWord8 to StgWord | Simon Peyton Jones | 2013-10-03 | 1 | -1/+1
  cf. commit 0b0fec536e35769b64b8bc5397c84138fa512155
* Globally replace "hackage.haskell.org" with "ghc.haskell.org" | Simon Marlow | 2013-10-01 | 1 | -1/+1
* use StgWord not StgWord8 for wakeup | Simon Marlow | 2013-10-01 | 1 | -1/+1
  volatile StgWord8 is not guaranteed to be atomic.
* Ensure gc_thread->wakeup is of type StgWord8. | Austin Seipp | 2013-06-21 | 1 | -1/+1
  rtsBool is defined to only have two inhabitants, which are true (1) and false (0). But the wakeup flag is set to 4 possible values, outside the range of rtsBool. This leads Clang to warn about tautological comparisons.

  Signed-off-by: Austin Seipp <aseipp@pobox.com>
* Simplify the allocation stats accounting | Simon Marlow | 2013-02-14 | 1 | -1/+0
  We were doing it in two different ways and asserting that the results were the same. In most cases they were, but I found one case where they weren't: the GC itself allocates some memory for running finalizers, and this memory was accounted for one way but not the other.

  It was simpler to remove the old way of counting allocation than to try to fix it up, so I did that.
* Hopefully fix breakage on OS X w/ LLVM | Simon Marlow | 2013-01-17 | 1 | -0/+4
  Reordering of includes in GC.c broke on OS X because gctKey is declared in Task.h and is needed in the storage manager. This is really the wrong place for it anyway, so I've moved the gctKey pieces to where they should be.
* Deprecate lnat, and use StgWord instead | Simon Marlow | 2012-09-07 | 1 | -9/+9
  lnat was originally "long unsigned int" but we were using it when we wanted a 64-bit type on a 64-bit machine. This broke on Windows x64, where long == int == 32 bits. Using types of unspecified size is bad, but what we really wanted was a type with N bits on an N-bit machine. StgWord is exactly that.

  lnat was mentioned in some APIs that clients might be using (e.g. StackOverflowHook()), so we leave it defined but with a comment to say that it's deprecated.
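To illustrate "a type with N bits on an N-bit machine", here is a small sketch using the standard `uintptr_t` as a stand-in. The name `Word` and the helper `word_bits` are hypothetical; the real StgWord definition is not shown in this log. Unlike `unsigned long`, which is 32 bits on Windows x64, the type below is checked at compile time to match the pointer size.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for a machine-word type: one word per pointer. */
typedef uintptr_t Word;

/* Fail the build on any platform where the assumption does not hold. */
_Static_assert(sizeof(Word) == sizeof(void *), "Word must be pointer-sized");

/* Number of bits in a Word: 32 on a 32-bit target, 64 on a 64-bit one. */
static unsigned word_bits(void)
{
    return (unsigned)(sizeof(Word) * CHAR_BIT);
}
```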
* Parallelise clearNurseries() in the parallel GC | Simon Marlow | 2012-07-10 | 1 | -0/+1
  The clearNurseries() operation resets the free pointer in each nursery block to the start of the block, emptying the nursery. In the parallel GC this was done on the main GC thread, but that's bad because it accesses the bdescr of every nursery block, and moves all those cache lines onto the CPU of the main GC thread. With large nurseries, this can be especially bad. So instead we want to clear each nursery in its local GC thread.

  Thanks to Andreas Voellmy <andreas.voellmy@gmail.com> for identifying the issue.

  After this change and the previous patch to make the last GC a major one, I see these results for nofib/parallel on 8 cores:

    blackscholes     +0.0%   +0.0%   -3.7%   -3.3%   +0.3%
    coins            +0.0%   +0.0%   -5.1%   -5.0%   +0.4%
    gray             +0.0%   +0.0%   -4.5%   -2.1%   +0.8%
    mandel           +0.0%   -0.0%   -7.6%   -5.1%   -2.3%
    matmult          +0.0%   +5.5%   -2.8%   -1.9%   -5.8%
    minimax          +0.0%   +0.0%  -10.6%  -10.5%   +0.0%
    nbody            +0.0%   -4.4%   +0.0%    0.07   +0.0%
    parfib           +0.0%   +1.0%   +0.5%   +0.9%   +0.0%
    partree          +0.0%   +0.0%   -2.4%   -2.5%   +1.7%
    prsa             +0.0%   -0.2%   +1.8%   +4.2%   +0.0%
    queens           +0.0%   -0.0%   -1.8%   -1.4%   -4.8%
    ray              +0.0%   -0.6%  -18.5%  -17.8%   +0.0%
    sumeuler         +0.0%   -0.0%   -3.7%   -3.7%   +0.0%
    transclos        +0.0%   -0.0%  -25.7%  -26.6%   +0.0%
    --------------------------------------------------------
    Min              +0.0%   -4.4%  -25.7%  -26.6%   -5.8%
    Max              +0.0%   +5.5%   +1.8%   +4.2%   +1.7%
    Geometric Mean   +0.0%   +0.1%   -6.3%   -6.1%   -0.7%
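The operation this commit describes (reset each block's free pointer to the block start) can be sketched as below. The `Block` and `Nursery` types and the function name `clear_nursery` are simplified, hypothetical stand-ins for the real block descriptor and nursery structures. In the parallel scheme, each GC thread would call this on its own nursery only, so the block descriptors it touches stay in that CPU's cache.

```c
#include <stddef.h>

/* Simplified, hypothetical block descriptor. */
typedef struct Block {
    char *start;          /* first byte of the block */
    char *free;           /* next free byte; equals start when empty */
    struct Block *link;   /* next block in the nursery */
} Block;

typedef struct Nursery {
    Block *blocks;        /* chain of blocks owned by one capability */
} Nursery;

/* Empty one nursery by resetting every block's free pointer.
 * Returns the number of blocks visited. */
static size_t clear_nursery(Nursery *n)
{
    size_t cleared = 0;
    for (Block *bd = n->blocks; bd != NULL; bd = bd->link) {
        bd->free = bd->start;
        cleared++;
    }
    return cleared;
}
```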
* New flag +RTS -qi<n>, avoid waking up idle Capabilities to do parallel GC | Simon Marlow | 2011-12-13 | 1 | -0/+1
  This is an experimental tweak to the parallel GC that avoids waking up a Capability to do parallel GC if we know that the capability has been idle for a (tunable) number of GC cycles. The idea is that if you're only using a few Capabilities, there's no point waking up the ones that aren't busy.

  e.g. +RTS -qi3 says "A Capability will participate in parallel GC if it was running at all since the last 3 GC cycles."

  Results are a bit hit and miss, and I don't completely understand why yet. Hence, for now it is turned off by default, and also not documented except in the +RTS -? output.
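One way to read the -qi<n> rule is as a per-capability idle counter compared against the threshold. The sketch below assumes that reading; the `Cap` struct, the function names, and the convention that a threshold of 0 means "feature off, everyone participates" are all hypothetical and not taken from the RTS.

```c
#include <stdbool.h>

/* Hypothetical per-capability state. */
typedef struct Cap {
    unsigned idle;   /* consecutive GC cycles in which this capability did not run */
} Cap;

/* Update the counter once per GC cycle. */
static void note_gc_cycle(Cap *cap, bool ran_since_last_gc)
{
    cap->idle = ran_since_last_gc ? 0 : cap->idle + 1;
}

/* With threshold qi, a capability joins the parallel GC if it ran at some
 * point within the last qi cycles. qi == 0 is assumed to disable the check. */
static bool participates_in_gc(const Cap *cap, unsigned qi)
{
    return qi == 0 || cap->idle < qi;
}
```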
* Time handling overhaul | Simon Marlow | 2011-11-25 | 1 | -3/+3
  Terminology cleanup: the type "Ticks" has been renamed "Time", which is an StgWord64 in units of TIME_RESOLUTION (currently nanoseconds). The terminology "tick" is now used consistently to mean the interval between timer signals.

  The ticker now always ticks in realtime (actually CLOCK_MONOTONIC if we have it). Before it used CPU time in the non-threaded RTS and realtime in the threaded RTS, but I've discovered that the CPU timer has terrible resolution (at least on Linux) and isn't much use for profiling. So now we always use realtime. This should also fix

  The default tick interval is now 10ms, except when profiling where we drop it to 1ms. This gives more accurate profiles without affecting runtime too much (<1%).

  Lots of cleanups - the resolution of Time is now in one place only (Rts.h) rather than having calculations that depend on the resolution scattered all over the RTS. I hope I found them all.
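A small sketch of keeping the resolution in one place, as this commit describes. `Time` and `TIME_RESOLUTION` are the names used in the message; the use of `uint64_t` in place of StgWord64 and the helper names (`ms_to_time`, `time_to_us`, `default_tick_interval`) are assumptions made for illustration.

```c
#include <stdint.h>

/* Time is a 64-bit count of TIME_RESOLUTION units per second. */
typedef uint64_t Time;
#define TIME_RESOLUTION 1000000000ULL   /* nanoseconds */

/* All conversions go through TIME_RESOLUTION, so changing the resolution
 * means changing one constant. Helper names are hypothetical. */
static Time ms_to_time(uint64_t ms)
{
    return ms * (TIME_RESOLUTION / 1000);
}

static uint64_t time_to_us(Time t)
{
    return t / (TIME_RESOLUTION / 1000000);
}

/* Tick interval from the commit message: 10ms normally, 1ms when profiling. */
static Time default_tick_interval(int profiling)
{
    return ms_to_time(profiling ? 1 : 10);
}
```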
* Refactoring and tidy up | Simon Marlow | 2011-04-11 | 1 | -85/+11
  This is a port of some of the changes from my private local-GC branch (which is still in darcs, I haven't converted it to git yet). There are a couple of small functional differences in the GC stats: first, per-thread GC timings should now be more accurate, and secondly we now report average and maximum pause times. e.g. from minimax +RTS -N8 -s:

                                      Tot time (elapsed)  Avg pause  Max pause
    Gen  0      2755 colls,  2754 par   13.16s    0.93s     0.0003s    0.0150s
    Gen  1       769 colls,   769 par    3.71s    0.26s     0.0003s    0.0059s
* A small GC optimisation | Simon Marlow | 2011-02-02 | 1 | -1/+1
  Store the *number* of the destination generation in the Bdescr struct, so that in evacuate() we don't have to deref gen to get it. This is another improvement ported over from my GC branch.
* Change some TARGET tests to HOST tests in the RTS | Ian Lynagh | 2010-07-13 | 1 | -1/+1
  Which was being used seemed to be random
* Fix the symbol visibility pragmas | Simon Marlow | 2010-06-17 | 1 | -2/+2
* GC refactoring, remove "steps" | Simon Marlow | 2009-12-03 | 1 | -22/+23
  The GC had a two-level structure, G generations each of T steps. Steps are for aging within a generation, mostly to avoid premature promotion.

  Measurements show that more than 2 steps is almost never worthwhile, and 1 step is usually worse than 2. In theory fractional steps are possible, so the ideal number of steps is somewhere between 1 and 3. GHC's default has always been 2.

  We can implement 2 steps quite straightforwardly by having each block point to the generation to which objects in that block should be promoted, so blocks in the nursery point to generation 0, and blocks in gen 0 point to gen 1, and so on.

  This commit removes the explicit step structures, merging generations with steps, thus simplifying a lot of code. Performance is unaffected. The tunable number of steps is now gone, although it may be replaced in the future by a way to tune the aging in generation 0.
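The "each block points to its promotion target" scheme can be sketched as follows. The struct layouts, the fixed `N_GENS`, and the function names are hypothetical. Having the oldest generation promote to itself is an assumption of this sketch, since the commit message only says "and so on".

```c
#include <stddef.h>

/* Hypothetical, simplified types. */
typedef struct Generation {
    unsigned no;
} Generation;

typedef struct Block {
    Generation *gen;    /* generation the block belongs to; NULL marks a nursery block here */
    Generation *dest;   /* generation that live objects in this block are promoted to */
} Block;

#define N_GENS 3
static Generation gens[N_GENS] = { {0}, {1}, {2} };

/* Nursery blocks promote into generation 0. */
static void init_nursery_block(Block *bd)
{
    bd->gen = NULL;
    bd->dest = &gens[0];
}

/* Blocks in generation g promote into g+1; the oldest generation is
 * assumed to promote into itself. */
static void init_gen_block(Block *bd, unsigned g)
{
    bd->gen = &gens[g];
    bd->dest = (g + 1 < N_GENS) ? &gens[g + 1] : &gens[g];
}
```

Under this layout an object must survive one collection in the nursery and one in generation 0 before it reaches generation 1, which gives the two-step aging without a separate step structure.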
* add comment: __thread is not supported by gcc on OS X yet | Simon Marlow | 2009-09-10 | 1 | -0/+3
* Omit visibility pragmas on Windows (fixes warnings/validate failures) | Simon Marlow | 2009-09-09 | 1 | -2/+2
* Declare RTS-private prototypes with __attribute__((visibility("hidden"))) | Simon Marlow | 2009-08-05 | 1 | -0/+4
  This has no effect with static libraries, but when the RTS is in a shared library it does two things:

  - it prevents the function from being exposed by the shared library
  - internal calls to the function can use the faster non-PLT calls, because the function cannot be overridden at link time.
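A hedged sketch of what such a declaration can look like with GCC or Clang. The macro name `RTS_PRIVATE_SKETCH` and both functions are hypothetical; the macro expands to nothing on Windows and on other compilers, in line with the later commit that omits visibility pragmas on Windows.

```c
/* Hypothetical macro: hidden visibility where the compiler supports it. */
#if defined(__GNUC__) && !defined(_WIN32)
#define RTS_PRIVATE_SKETCH __attribute__((visibility("hidden")))
#else
#define RTS_PRIVATE_SKETCH
#endif

/* Not exported from a shared library; calls from within the library
 * can bind directly instead of going through the PLT. */
RTS_PRIVATE_SKETCH int private_helper(int x);

int private_helper(int x)
{
    return x + 1;
}

/* Default visibility: part of the library's public interface. */
int public_entry(int x)
{
    return private_helper(x) * 2;
}
```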
* RTS tidyup sweep, first phase | Simon Marlow | 2009-08-02 | 1 | -4/+3
  The first phase of this tidyup is focussed on the header files, and in particular making sure we are exposing publicly exactly what we need to, and no more.

  - Rts.h now includes everything that the RTS exposes publicly, rather than a random subset of it.
  - Most of the public header files have moved into subdirectories, and many of them have been renamed. But clients should not need to include any of the other headers directly, just #include the main public headers: Rts.h, HsFFI.h, RtsAPI.h.
  - All the headers needed for via-C compilation have moved into the stg subdirectory, which is self-contained. Most of the headers for the rest of the RTS APIs have moved into the rts subdirectory.
  - I left MachDeps.h where it is, because it is so widely used in Haskell code.
  - I left a deprecated stub for RtsFlags.h in place. The flag structures are now exposed by Rts.h.
  - Various internal APIs are no longer exposed by public header files.
  - Various bits of dead code and declarations have been removed
  - More gcc warnings are turned on, and the RTS code is more warning-clean.
  - More source files #include "PosixSource.h", and hence only use standard POSIX (1003.1c-1995) interfaces.

  There is a lot more tidying up still to do, this is just the first pass. I also intend to standardise the names for external RTS APIs (e.g. use the rts_ prefix consistently), and declare the internal APIs as hidden for shared libraries.
* SPARC NCG: Add a comment explaining why we can't use a pinned reg for gct | Ben.Lippmeier@anu.edu.au | 2009-04-20 | 1 | -3/+20
  Can't use windowed regs because the window moves during a function call. Can't use the global regs because they're reserved for other purposes.
* Don't use thread local storage on x86/not-Linux | Ian Lynagh | 2009-04-04 | 1 | -2/+2
  With the "On x86, use thread-local storage instead of stealing a reg for gct" patch, on Windows and OS X:

    error: thread-local storage not supported for this target
* On x86, use thread-local storage instead of stealing a reg for gct | Simon Marlow | 2009-04-03 | 1 | -1/+6
  Benchmarks show that using TLS instead of stealing a register is better by a few percent on x86, due to the lack of registers.

  This only affects -threaded; without -threaded we're (now) using static storage for the GC data.
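A minimal sketch of the TLS alternative, using the GCC/Clang `__thread` storage class. The struct `GcThreadSketch` and the helper functions are hypothetical, and only the variable name `gct` comes from the commit message. As the neighbouring commits note, `__thread` was not available on every target at the time.

```c
#include <stddef.h>

/* Hypothetical, cut-down per-thread GC state. */
typedef struct GcThreadSketch {
    unsigned thread_index;
    unsigned long copied;   /* words copied by this GC thread */
} GcThreadSketch;

/* One pointer per OS thread, held in thread-local storage rather than
 * in a register reserved for the whole GC. */
static __thread GcThreadSketch *gct = NULL;

/* Each GC thread installs its own state before doing any work. */
static void set_gct(GcThreadSketch *t)
{
    gct = t;
}

/* Code deep inside the GC reaches the current thread's state via gct. */
static void count_copied(unsigned long words)
{
    gct->copied += words;
}
```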
* in the non-threaded RTS, use a static gc_thread structure | Simon Marlow | 2009-04-03 | 1 | -3/+17
* Use work-stealing for load-balancing in the GC | Simon Marlow | 2009-03-13 | 1 | -4/+6
  New flag: "+RTS -qb" disables load-balancing in the parallel GC (though this is subject to change, I think we will probably want to do something more automatic before releasing this).

  To get the "PARGC3" configuration described in the "Runtime support for Multicore Haskell" paper, use "+RTS -qg0 -qb -RTS".

  The main advantage of this is that it allows us to easily disable load-balancing altogether, which turns out to be important in parallel programs. Maintaining locality is sometimes more important than spreading the work out in parallel GC. There is a side benefit in that the parallel GC should have improved locality even when load-balancing, because each processor prefers to take work from its own queue before stealing from others.
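The "own queue first, steal only if allowed" policy can be sketched with a deliberately simplified, single-threaded model. Everything here is hypothetical: the fixed-size array queues, the names, and the `load_balance` parameter standing in for the effect of -qb. A real work-stealing deque needs synchronisation and usually steals from the opposite end of the queue, which this sketch does not attempt.

```c
#include <stdbool.h>
#include <stddef.h>

#define N_THREADS 2
#define QUEUE_CAP 8

/* Hypothetical per-thread work queue (no synchronisation in this model). */
typedef struct WorkQueue {
    int items[QUEUE_CAP];
    size_t count;
} WorkQueue;

static WorkQueue queues[N_THREADS];

/* Add an item to thread me's own queue; fails if the queue is full. */
static bool push_work(unsigned me, int item)
{
    WorkQueue *q = &queues[me];
    if (q->count == QUEUE_CAP) return false;
    q->items[q->count++] = item;
    return true;
}

/* Remove the most recently pushed item, if any. */
static bool pop_work(WorkQueue *q, int *out)
{
    if (q->count == 0) return false;
    *out = q->items[--q->count];
    return true;
}

/* Prefer local work; steal from others only when load balancing is on. */
static bool find_work(unsigned me, bool load_balance, int *out)
{
    if (pop_work(&queues[me], out)) return true;
    if (!load_balance) return false;
    for (unsigned i = 0; i < N_THREADS; i++) {
        if (i != me && pop_work(&queues[i], out)) return true;
    }
    return false;
}
```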
* Keep the remembered sets local to each thread during parallel GC | Simon Marlow | 2009-01-12 | 1 | -0/+8
  This turns out to be quite vital for parallel programs:

  - The way we discover which threads to traverse is by finding dirty threads via the remembered sets (aka mutable lists).
  - A dirty thread will be on the remembered set of the capability that was running it, and we really want to traverse that thread's stack using the GC thread for the capability, because it is in that CPU's cache. If we get this wrong, we get penalised badly by the memory system.

  Previously we had per-capability mutable lists but they were aggregated before GC and traversed by just one of the GC threads. This resulted in very poor performance particularly for parallel programs with deep stacks.

  Now we keep per-capability remembered sets throughout GC, which also removes a lock (recordMutableGen_sync).
* Don't pin a register for gc_thread on SPARC. | Ben.Lippmeier@anu.edu.au | 2009-01-05 | 1 | -1/+8
  This makes the build work again.
* Use mutator threads to do GC, instead of having a separate pool of GC threads | Simon Marlow | 2008-11-21 | 1 | -4/+3
  Previously, the GC had its own pool of threads to use as workers when doing parallel GC. There was a "leader", which was the mutator thread that initiated the GC, and the other threads were taken from the pool.

  This was simple and worked fine for sequential programs, where we did most of the benchmarking for the parallel GC, but falls down for parallel programs. When we have N mutator threads and N cores, at GC time we would have to stop N-1 mutator threads and start up N-1 GC threads, and hope that the OS schedules them all onto separate cores. In practice it doesn't, as you might expect.

  Now we use the mutator threads to do GC. This works quite nicely, particularly for parallel programs, where each mutator thread scans its own spark pool, which is probably in its cache anyway.

  There are some flag changes:

    -g<n> is removed (-g1 is still accepted for backwards compat). There's no way to have a different number of GC threads than mutator threads now.

    -q1      Use one OS thread for GC (turns off parallel GC)
    -qg<n>   Use parallel GC for generations >= <n> (default: 1)

  Using parallel GC only for generations >=1 works well for sequential programs. Compiling an ordinary sequential program with -threaded and running it with -N2 or more should help if you do a lot of GC. I've found that adding -qg0 (do parallel GC for generation 0 too) speeds up some parallel programs, but slows down some sequential programs. Being conservative, I left the threshold at 1.

  ToDo: document the new options.
* don't steal %ebx for the GC on x86: it's also used by PIC | Simon Marlow | 2008-07-25 | 1 | -1/+3
* comment updates | Simon Marlow | 2008-06-03 | 1 | -2/+8
* declare the GC thread register variable more portably | Simon Marlow | 2008-04-17 | 1 | -2/+29
* pad step_workspace to 64 bytes, to speed up access to gct->steps[] | Simon Marlow | 2008-04-16 | 1 | -1/+5
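One common way to pad a struct to a fixed size is to wrap it in a union with a byte array, as sketched below. The field names and the `Workspace` type are hypothetical and do not reflect the real step_workspace layout. The presumed benefit, which the commit message does not spell out, is that a 64-byte element makes array indexing a simple shift.

```c
#include <stddef.h>

/* Hypothetical payload; the real workspace has different fields. */
typedef struct WorkspaceFields {
    void *todo_bd;
    void *scavd_list;
    size_t n_scavd_blocks;
} WorkspaceFields;

/* Pad each element to exactly 64 bytes. */
typedef union Workspace {
    WorkspaceFields f;
    unsigned char pad[64];
} Workspace;

/* Fail the build if the payload ever outgrows the padding. */
_Static_assert(sizeof(Workspace) == 64, "Workspace must be exactly 64 bytes");

/* Byte offset of element i in an array of workspaces: i * 64. */
static size_t workspace_offset(size_t i)
{
    return i * sizeof(Workspace);
}
```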
* update copyrights in rts/sm | Simon Marlow | 2008-04-16 | 1 | -1/+1
* Reorganisation to fix problems related to the gct register variable | Simon Marlow | 2008-04-16 | 1 | -0/+184
  - GCAux.c contains code not compiled with the gct register enabled, it is callable from outside the GC
  - marking functions are moved to their relevant subsystems, outside the GC
  - mark_root needs to save the gct register, as it is called from outside the GC