path: root/rts/Capability.c
Commit message | Author | Age | Files | Lines
...
* typo | Gabor Greif | 2012-02-27 | 1 | -5/+5
* Allocate pinned object blocks from the nursery, not the global allocator. | Simon Marlow | 2012-02-13 | 1 | -0/+1
    Prompted by a benchmark posted to parallel-haskell@haskell.org by
    Andreas Voellmy <andreas.voellmy@gmail.com>. This program exhibits
    contention for the block allocator when run with -N2 and greater
    without the fix:

        {-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}
        module Main where

        import Control.Monad
        import Control.Concurrent
        import System.Environment
        import GHC.IO
        import GHC.Exts
        import GHC.Conc

        main = do
          [m] <- fmap (fmap read) getArgs
          n <- getNumCapabilities
          ms <- replicateM n newEmptyMVar
          sequence [ forkIO $ busyWorkerB (m `quot` n) >> putMVar mv () | mv <- ms ]
          mapM takeMVar ms

        busyWorkerB :: Int -> IO ()
        busyWorkerB n_loops = go 0
          where go !n | n >= n_loops = return ()
                      | otherwise    =
                  do p <- (IO $ \s ->
                            case newPinnedByteArray# 1024# s of
                              { (# s', mbarr# #) ->
                                   (# s', () #)
                              }
                          )
                     go (n+1)
* Add missing initialisation of cap->disabled | Simon Marlow | 2012-01-16 | 1 | -0/+1
* last_free_capability should never be NULL | Simon Marlow | 2012-01-09 | 1 | -1/+1
* Support for reducing the number of Capabilities with setNumCapabilities | Simon Marlow | 2011-12-15 | 1 | -2/+17
    This patch allows setNumCapabilities to /reduce/ the number of active
    capabilities as well as increase it. This is particularly tricky to
    do, because a Capability is a large data structure and ties into the
    rest of the system in many ways. Trying to clean it all up would be
    extremely error prone.

    So instead, the solution is to mark the extra capabilities as
    "disabled". This has the following consequences:

      - threads on a disabled capability are migrated away by the
        scheduler loop

      - disabled capabilities do not participate in GC
        (see scheduleDoGC())

      - No spark threads are created on this capability
        (see scheduleActivateSpark())

      - We do not attempt to migrate threads *to* a disabled
        capability (see schedulePushWork()).

    So a disabled capability should do no work, and does not participate
    in GC, although it remains alive in other respects. For example, a
    blocked thread might wake up on a disabled capability, and it will
    get quickly migrated to a live capability. A disabled capability can
    still initiate GC if necessary. Indeed, it turns out to be hard to
    migrate bound threads, so we wait until the next GC to do this (see
    comments for details).
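The "disabled" scheme in this commit message can be pictured with a small C sketch. It is not the RTS code: `ToyCap`, `toy_set_num_caps` and `toy_migrate_from_disabled` are hypothetical names, and the only idea taken from the message is that surplus capabilities are flagged rather than freed, with their threads moved to a live capability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the RTS Capability. */
typedef struct {
    unsigned no;         /* capability number */
    bool     disabled;   /* true: should do no work and skip GC */
    unsigned n_threads;  /* runnable threads queued on this capability */
} ToyCap;

/* Keep the first new_n capabilities live and flag the rest as disabled.
 * Nothing is freed, which mirrors the approach the commit describes. */
static void toy_set_num_caps(ToyCap *caps, size_t total, size_t new_n)
{
    assert(new_n >= 1 && new_n <= total);
    for (size_t i = 0; i < total; i++) {
        caps[i].disabled = (i >= new_n);
    }
}

/* Move every thread found on a disabled capability to capability 0
 * (always live, since new_n >= 1). Returns the number of threads moved. */
static unsigned toy_migrate_from_disabled(ToyCap *caps, size_t total)
{
    unsigned moved = 0;
    for (size_t i = 1; i < total; i++) {
        if (caps[i].disabled) {
            moved += caps[i].n_threads;
            caps[0].n_threads += caps[i].n_threads;
            caps[i].n_threads = 0;
        }
    }
    return moved;
}
```

Picking capability 0 as the migration target is an assumption made to keep the sketch short; the real scheduler chooses among the live capabilities.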
* New flag +RTS -qi<n>, avoid waking up idle Capabilities to do parallel GC | Simon Marlow | 2011-12-13 | 1 | -0/+1
    This is an experimental tweak to the parallel GC that avoids waking
    up a Capability to do parallel GC if we know that the capability has
    been idle for a (tunable) number of GC cycles. The idea is that if
    you're only using a few Capabilities, there's no point waking up the
    ones that aren't busy.

    e.g. +RTS -qi3 says "A Capability will participate in parallel GC if
    it was running at all since the last 3 GC cycles."

    Results are a bit hit and miss, and I don't completely understand why
    yet. Hence, for now it is turned off by default, and also not
    documented except in the +RTS -? output.
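One way to read the -qi<n> rule is as a per-capability idle counter that is reset whenever the capability runs. The C sketch below illustrates that reading only; `ToyIdleCap`, `toy_note_ran` and `toy_should_wake_for_gc` are hypothetical names, and treating a threshold of 0 as "feature off" is an assumption of the sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-capability bookkeeping for the idle-GC heuristic. */
typedef struct {
    unsigned idle_gcs;  /* GC cycles seen since this capability last ran */
} ToyIdleCap;

/* Called whenever the capability actually runs Haskell code. */
static void toy_note_ran(ToyIdleCap *cap)
{
    assert(cap != NULL);
    cap->idle_gcs = 0;
}

/* Called once per GC cycle: decide whether to wake this capability,
 * then count the cycle. Threshold 0 means "always wake" (assumption). */
static bool toy_should_wake_for_gc(ToyIdleCap *cap, unsigned qi_threshold)
{
    assert(cap != NULL);
    bool participate = (qi_threshold == 0) || (cap->idle_gcs < qi_threshold);
    cap->idle_gcs++;
    return participate;
}
```

With a threshold of 3, a capability that never runs is woken for three more collections and then left alone until `toy_note_ran` resets its counter.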
* Allow the number of capabilities to be increased at runtime (#3729) | Simon Marlow | 2011-12-06 | 1 | -18/+48
    At present the number of capabilities can only be *increased*, not
    decreased. The latter presents a few more challenges!
* Make forkProcess work with +RTS -N | Simon Marlow | 2011-12-06 | 1 | -7/+10
    Consider this experimental for the time being. There are a lot of
    things that could go wrong, but I've verified that at least it works
    on the test cases we have.

    I also did some API cleanups while I was here. Previously we had:

        Capability * rts_eval (Capability *cap, HaskellObj p, /*out*/HaskellObj *ret);

    but this API is particularly error-prone: if you forget to discard
    the Capability * you passed in and use the return value instead, then
    you're in for subtle bugs with +RTS -N later on. So I changed all
    these functions to this form:

        void rts_eval (/* inout */ Capability **cap,
                       /* in    */ HaskellObj p,
                       /* out   */ HaskellObj *ret)

    It's much harder to use this version incorrectly, because you have to
    pass the Capability in by reference.
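The difference between the two calling conventions can be shown with toy types. The sketch below does not use the real RTS API: `ToyCapability`, `toy_eval_old` and `toy_eval_new` are hypothetical, and "evaluation" is reduced to doubling an integer while pretending the call migrated to another capability.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Capability; two of them exist in this toy. */
typedef struct { int no; } ToyCapability;

static ToyCapability toy_caps[2] = { {0}, {1} };

/* Old style: the capability to use afterwards comes back as the return
 * value. A caller that keeps using its original pointer holds a stale
 * capability, and the compiler does not complain. */
static ToyCapability *toy_eval_old(ToyCapability *cap, int in, int *ret)
{
    assert(cap != NULL && ret != NULL);
    *ret = in * 2;
    return &toy_caps[(cap->no + 1) % 2];  /* pretend we migrated */
}

/* New style: the capability is an in/out parameter, so the caller's
 * own variable is updated in place and cannot be left stale. */
static void toy_eval_new(ToyCapability **cap, int in, int *ret)
{
    assert(cap != NULL && *cap != NULL && ret != NULL);
    *ret = in * 2;
    *cap = &toy_caps[((*cap)->no + 1) % 2];  /* pretend we migrated */
}
```

A caller of `toy_eval_new` writes `toy_eval_new(&cap, 21, &result);` and simply continues with `cap`, which is the property the commit message is after.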
* Fix a scheduling bug in the threaded RTS | Simon Marlow | 2011-12-01 | 1 | -1/+9
    The parallel GC was using setContextSwitches() to stop all the other
    threads, which sets the context_switch flag on every Capability. That
    had the side effect of causing every Capability to also switch
    threads, and since GCs can be much more frequent than context
    switches, this increased the context switch frequency. When context
    switches are expensive (because the switch is between two bound
    threads or a bound and unbound thread), the difference is quite
    noticeable.

    The fix is to have a separate flag to indicate that a Capability
    should stop and return to the scheduler, but not switch threads. I've
    called this the "interrupt" flag.
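The two-flag idea can be sketched in C as follows. This is a single-threaded illustration under stated assumptions: `ToyFlagCap`, `ToyAction` and the `toy_*` functions are hypothetical, and the real RTS flags are read and written concurrently, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-capability flags. */
typedef struct {
    bool context_switch;  /* stop and switch to another thread */
    bool interrupt;       /* stop and return to the scheduler only */
} ToyFlagCap;

typedef enum {
    TOY_KEEP_RUNNING,
    TOY_RETURN_NO_SWITCH,
    TOY_RETURN_AND_SWITCH
} ToyAction;

/* What the timer would do: ask every capability to switch threads. */
static void toy_request_context_switches(ToyFlagCap *caps, size_t n)
{
    for (size_t i = 0; i < n; i++) caps[i].context_switch = true;
}

/* What a GC request would do after the fix: stop everyone, no switch. */
static void toy_interrupt_all(ToyFlagCap *caps, size_t n)
{
    for (size_t i = 0; i < n; i++) caps[i].interrupt = true;
}

/* Polled by the running thread; clears whatever it acts on. */
static ToyAction toy_poll_flags(ToyFlagCap *cap)
{
    assert(cap != NULL);
    if (cap->context_switch) {
        cap->context_switch = false;
        cap->interrupt = false;
        return TOY_RETURN_AND_SWITCH;
    }
    if (cap->interrupt) {
        cap->interrupt = false;
        return TOY_RETURN_NO_SWITCH;
    }
    return TOY_KEEP_RUNNING;
}
```

In this toy, a GC request alone yields `TOY_RETURN_NO_SWITCH`, so frequent collections no longer force the expensive thread switch the commit message describes.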
* Add a new primop: getCCCS# :: State# s -> (# State# s, Addr# #) | Simon Marlow | 2011-11-29 | 1 | -0/+3
    Returns a pointer to the current cost-centre stack when profiling,
    NULL otherwise.
* Make profiling work with multiple capabilities (+RTS -N) | Simon Marlow | 2011-11-29 | 1 | -3/+18
    This means that both time and heap profiling work for parallel
    programs. Main internal changes:

      - CCCS is no longer a global variable; it is now another
        pseudo-register in the StgRegTable struct. Thus every
        Capability has its own CCCS.

      - There is a new built-in CCS called "IDLE", which records ticks
        for Capabilities in the idle state. If you profile a
        single-threaded program with +RTS -N2, you'll see about 50% of
        time in "IDLE".

      - There is appropriate locking in rts/Profiling.c to protect the
        shared cost-centre-stack data structures.

    This patch does enough to get it working, I have cut one big corner:
    the cost-centre-stack data structure is still shared amongst all
    Capabilities, which means that multiple Capabilities will race when
    updating the "allocations" and "entries" fields of a CCS. Not only
    does this give unpredictable results, but it runs very slowly due to
    cache line bouncing.

    It is strongly recommended that you use -fno-prof-count-entries to
    disable the "entries" count when profiling parallel programs. (I
    shall add a note to this effect to the docs).
* Add a clock domain capset and emit wall clock time on rts init | Duncan Coutts | 2011-10-26 | 1 | -2/+6
* fix race condition in yieldCapability() (#5552) | Simon Marlow | 2011-10-24 | 1 | -1/+26
    See comment for details. I've tried quite hard, but haven't been able
    to make a small test case that reproduces the bug.
* small optimisation for the program in #5367: if the worker thread being woken already has its wakeup flag set, don't bother signalling its condition variable again. | Simon Marlow | 2011-08-05 | 1 | -5/+7
* Add new fully-accurate per-spark trace/eventlog events | Duncan Coutts | 2011-07-18 | 1 | -3/+4
    Replaces the existing EVENT_RUN/STEAL_SPARK events with 7 new events
    covering all stages of the spark lifecycle:
      create, dud, overflow, run, steal, fizzle, gc

    The sampled spark events are still available. There are now two event
    classes for sparks, the sampled and the fully accurate. They can be
    enabled/disabled independently. By default +RTS -l includes the
    sampled but not full detail spark events. Use +RTS -lf-p to enable
    the detailed 'f' and disable the sampled 'p' spark events.

    Includes work by Mikolaj <mikolaj.konarski@gmail.com>
* add a missing traceSparkCounters invocation | Mikolaj | 2011-07-18 | 1 | -0/+1
* Add spark counter tracing | Duncan Coutts | 2011-07-18 | 1 | -1/+6
    A new eventlog event containing 7 spark counters/statistics: sparks
    created, dud, overflowed, converted, GC'd, fizzled and remaining.
    These are maintained and logged separately for each capability.
    We log them at startup, on each GC (minor and major) and on shutdown.
* Move allocation of spark pools into initCapability | Duncan Coutts | 2011-07-18 | 1 | -0/+1
    Rather than a separate phase of initSparkPools. It means all the
    spark stuff for a capability is initialised at the same time, which
    then becomes a good place to stick an initial spark trace event.
* Add assertion of the invariant for the spark counters | Duncan Coutts | 2011-07-18 | 1 | -0/+35
    The invariant is:
      created = converted + remaining + gcd + fizzled
    Since sparks move between capabilities, we have to aggregate the
    counters over all capabilities. This in turn means we can only check
    the invariant at stable points where all but one capabilities are
    stopped. We can do this at shutdown time and before and after a
    global synchronised GC.
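Because a spark created on one capability may be converted on another, the invariant only holds for the sums. A minimal C sketch of such an aggregate check, assuming hypothetical `ToySparkCounters` and `ToySparkCap` types (the field names follow the terms in the invariant, not necessarily the RTS structs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-capability spark statistics. */
typedef struct {
    unsigned long created, converted, gcd, fizzled;
} ToySparkCounters;

typedef struct {
    ToySparkCounters counters;
    unsigned long    remaining;  /* sparks still sitting in the pool */
} ToySparkCap;

/* Sum the counters over all capabilities and test
 *   created = converted + remaining + gcd + fizzled
 * Only meaningful when no capability is updating its counters. */
static bool toy_check_spark_invariant(const ToySparkCap *caps, size_t n)
{
    unsigned long created = 0, converted = 0, remaining = 0;
    unsigned long gcd = 0, fizzled = 0;

    assert(caps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        created   += caps[i].counters.created;
        converted += caps[i].counters.converted;
        gcd       += caps[i].counters.gcd;
        fizzled   += caps[i].counters.fizzled;
        remaining += caps[i].remaining;
    }
    return created == converted + remaining + gcd + fizzled;
}
```

In a two-capability example where capability 0 creates ten sparks and capability 1 steals and converts four of them, the check fails for either capability alone and passes for the pair, which is why the aggregation is needed.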
* Classify overflowed sparks separately | Duncan Coutts | 2011-07-18 | 1 | -0/+1
    When you use `par` to make a spark, if the spark pool on the current
    capability is full then the spark is discarded. This represents a
    loss of potential parallelism and it also means there are simply a
    lot of sparks around. Both are things that might be of concern to a
    programmer when tuning a parallel program that uses par.

    The "+RTS -s" stats command now reports overflowed sparks, e.g.
    SPARKS: 100001 (15521 converted, 84480 overflowed, 0 dud, 0 GC'd, 0 fizzled)
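The overflow case can be illustrated with a fixed-size pool that counts discarded sparks. This is a sketch only: `ToyPool`, `TOY_POOL_CAPACITY` and `toy_new_spark` are hypothetical, the pool is a plain array rather than the RTS work-stealing queue, and counting overflowed sparks separately from created ones is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_POOL_CAPACITY 4

/* Hypothetical bounded spark pool with two of the counters. */
typedef struct {
    int           slots[TOY_POOL_CAPACITY];
    size_t        used;
    unsigned long created;     /* sparks that made it into the pool */
    unsigned long overflowed;  /* sparks discarded because it was full */
} ToyPool;

/* Try to add a spark. A full pool discards it and records the loss. */
static bool toy_new_spark(ToyPool *pool, int spark)
{
    assert(pool != NULL && pool->used <= TOY_POOL_CAPACITY);
    if (pool->used == TOY_POOL_CAPACITY) {
        pool->overflowed++;
        return false;
    }
    pool->slots[pool->used++] = spark;
    pool->created++;
    return true;
}
```

A high overflowed count relative to converted, as in the sample output, suggests the program is sparking far more work than the pool can hold.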
* Use a struct for the set of spark counters | Duncan Coutts | 2011-07-18 | 1 | -9/+9
* Change tryStealSpark so it does not consume fizzled sparks | Duncan Coutts | 2011-07-18 | 1 | -2/+10
    We want to count fizzled sparks accurately. Now tryStealSpark returns
    fizzled sparks, and the callers now update the fizzled spark count.
* Add capability sets to the tracing/events system | Duncan Coutts | 2011-05-26 | 1 | -1/+12
    We trace the creation and shutdown of capabilities. All the
    capabilities in the process are assigned to one capability set of
    OS-process type.

    This is a second version of the patch. Includes work by Spencer
    Janssen.
* Rearrange shutdownCapability code slightly | Duncan Coutts | 2011-05-26 | 1 | -22/+35
    This is mostly for the benefit of having sensible places to put
    tracing code later. We want a code path that has somewhere to trace
    (in order):
      (1) starting up all capabilities;
      (2) N * starting up an individual capability;
      (3) N * shutting down an individual capability;
      (4) shutting down all capabilities.
    This has to work in both threaded and non-threaded modes.

    Locations (1) and (2) are provided by initCapabilities and
    initCapability respectively. Previously, there was no location for
    (4) and while shutdownCapability should be usable for (3) it was only
    called in the !THREADED_RTS case.

    Now, shutdownCapability is called unconditionally (and the body is
    conditional on THREADED_RTS) and there is a new shutdownCapabilities
    that calls shutdownCapability in a loop.
* Revert "Add capability sets to the event system. Contains code from Duncan Coutts." | Duncan Coutts | 2011-05-23 | 1 | -4/+0
    This reverts commit 58532eb46041aec8d4cbb48b054cb5b001edb43c.

    Turns out it didn't work on Windows and it'll need some non-trivial
    changes to make it work on Windows. We'll get it in later once that's
    sorted out.
* Add capability sets to the event system. Contains code from Duncan Coutts. | Spencer Janssen | 2011-05-18 | 1 | -0/+4
* Refactoring and tidy up | Simon Marlow | 2011-04-11 | 1 | -28/+18
    This is a port of some of the changes from my private local-GC branch
    (which is still in darcs, I haven't converted it to git yet). There
    are a couple of small functional differences in the GC stats: first,
    per-thread GC timings should now be more accurate, and secondly we
    now report average and maximum pause times. e.g. from
    minimax +RTS -N8 -s:

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0      2755 colls,  2754 par   13.16s    0.93s     0.0003s    0.0150s
      Gen  1       769 colls,   769 par    3.71s    0.26s     0.0003s    0.0059s
* count fizzled and GC'd sparks separately | Simon Marlow | 2010-11-11 | 1 | -1/+2
* count "dud" sparks (expressions that were already evaluated when sparked) | Simon Marlow | 2010-11-01 | 1 | -0/+1
* releaseCapabilityAndQueueWorker: task->stopped should be false (#4850) | Simon Marlow | 2010-12-21 | 1 | -1/+5
* Keep a maximum of 6 spare worker threads per Capability (#4262) | Simon Marlow | 2010-11-25 | 1 | -10/+28
* Make sparks into weak pointers (#2185) | Simon Marlow | 2010-05-25 | 1 | -4/+2
    The new strategies library (parallel-2.0+, preferably 2.2+) is now
    required for parallel programming, otherwise parallelism will be
    lost.
* New implementation of BLACKHOLEs | Simon Marlow | 2010-03-29 | 1 | -61/+2
    This replaces the global blackhole_queue with a clever scheme that
    enables us to queue up blocked threads on the closure that they are
    blocked on, while still avoiding atomic instructions in the common
    case.

    Advantages:

      - gets rid of a locked global data structure and some tricky GC
        code (replacing it with some per-thread data structures and
        different tricky GC code :)

      - wakeups are more prompt: parallel/concurrent performance should
        benefit. I haven't seen anything dramatic in the parallel
        benchmarks so far, but a couple of threading benchmarks do
        improve a bit.

      - waking up a thread blocked on a blackhole is now O(1) (e.g. if
        it is the target of throwTo).

      - less sharing and better separation of Capabilities:
        communication is done with messages, the data structures are
        strictly owned by a Capability and cannot be modified except by
        sending messages.

      - this change will ultimately enable us to do more intelligent
        scheduling when threads block on each other. This is what
        started off the whole thing, but it isn't done yet (#3838).

    I'll be documenting all this on the wiki in due course.
* Use message-passing to implement throwTo in the RTS | Simon Marlow | 2010-03-11 | 1 | -25/+42
    This replaces some complicated locking schemes with message-passing
    in the implementation of throwTo. The benefits are

      - previously it was impossible to guarantee that a throwTo from a
        thread running on one CPU to a thread running on another CPU
        would be noticed, and we had to rely on the GC to pick up these
        forgotten exceptions. This no longer happens.

      - the locking regime is simpler (though the code is about the
        same size)

      - threads can be unblocked from a blocked_exceptions queue
        without having to traverse the whole queue now. It's a rare
        case, but replaces an O(n) operation with an O(1).

      - generally we move in the direction of sharing less between
        Capabilities (aka HECs), which will become important with other
        changes we have planned.

    Also in this patch I replaced several STM-specific closure types with
    a generic MUT_PRIM closure type, which allowed a lot of code in the
    GC and other places to go away, hence the line-count reduction. The
    message-passing changes resulted in about a net zero line-count
    difference.
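The ownership rule behind the message-passing design (only the owning capability touches a thread's state; everyone else posts a message) can be sketched with a per-capability inbox. Everything here is hypothetical: `ToyMessage`, `ToyMsgCap`, `toy_send_message` and `toy_process_inbox` are illustration names, the sketch is single-threaded, and a real inbox would need a lock or atomic operations on the list head.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_THREADS 8

/* Hypothetical "please raise this exception in that thread" message. */
typedef struct ToyMessage {
    struct ToyMessage *link;
    int target_thread;  /* index of a thread owned by the receiver */
    int exception;      /* payload: which exception to raise */
} ToyMessage;

/* Hypothetical capability: owns its threads' state and an inbox. */
typedef struct {
    ToyMessage *inbox;
    int pending_exception[TOY_MAX_THREADS];
} ToyMsgCap;

/* Sender side: never touches the target thread, only the inbox. */
static void toy_send_message(ToyMsgCap *to, ToyMessage *msg)
{
    assert(to != NULL && msg != NULL);
    msg->link = to->inbox;
    to->inbox = msg;
}

/* Owner side: drain the inbox and apply each message to state that
 * only this capability modifies. Returns the number handled. */
static unsigned toy_process_inbox(ToyMsgCap *cap)
{
    unsigned handled = 0;
    ToyMessage *m = cap->inbox;
    cap->inbox = NULL;
    while (m != NULL) {
        ToyMessage *next = m->link;
        assert(m->target_thread >= 0 && m->target_thread < TOY_MAX_THREADS);
        cap->pending_exception[m->target_thread] = m->exception;
        handled++;
        m = next;
    }
    return handled;
}
```

Since the owner applies every message itself, a message cannot go unnoticed the way a cross-CPU write could, which is the first benefit listed in the commit message.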
* Split part of the Task struct into a separate struct InCall | Simon Marlow | 2010-03-09 | 1 | -21/+19
    The idea is that this leaves Tasks and OSThread in one-to-one
    correspondence. The part of a Task that represents a call into
    Haskell from C is split into a separate struct InCall, pointed to by
    the Task and the TSO bound to it. A given OSThread/Task thus always
    uses the same mutex and condition variable, rather than getting a new
    one for each callback. Conceptually it is simpler, although there are
    more types and indirections in a few places now.

    This improves callback performance by removing some of the locks that
    we had to take when making in-calls. Now we also keep the current
    Task in a thread-local variable if supported by the OS and gcc
    (currently only Linux).
* Fix a rare deadlock when the IO manager thread is slow to start up | Simon Marlow | 2010-03-09 | 1 | -1/+9
    This fixes occasional failures of ffi002(threaded1) on a loaded
    machine.
* comment-out an incorrect assertion | Simon Marlow | 2010-01-26 | 1 | -1/+4
* Expose all EventLog events as DTrace probes | Manuel M T Chakravarty | 2009-12-12 | 1 | -6/+5
    - Defines a DTrace provider, called 'HaskellEvent', that provides a
      probe for every event of the eventlog framework.
    - In contrast to the original eventlog, the DTrace probes are
      available in all flavours of the runtime system (DTrace probes have
      virtually no overhead if not enabled); when -DTRACING is defined
      both the regular event log as well as DTrace probes can be used.
    - Currently, Mac OS X only. User-space DTrace probes are implemented
      differently on Mac OS X than in the original DTrace implementation.
      Nevertheless, it shouldn't be too hard to enable these probes on
      other platforms, too.
    - Documentation is at http://hackage.haskell.org/trac/ghc/wiki/DTrace
* remove unused cap->in_gc flag | Simon Marlow | 2009-12-02 | 1 | -1/+0
* Make allocatePinned use local storage, and other refactorings | Simon Marlow | 2009-12-01 | 1 | -0/+1
    This is a batch of refactoring to remove some of the GC's global
    state, as we move towards CPU-local GC.

      - allocateLocal() now allocates large objects into the local
        nursery, rather than taking a global lock and allocating them
        in gen 0 step 0.

      - allocatePinned() was still allocating from global storage and
        taking a lock each time, now it uses local storage.
        (mallocForeignPtrBytes should be faster with -threaded).

      - We had a gen 0 step 0, distinct from the nurseries, which are
        stored in a separate nurseries[] array. This is slightly
        strange. I removed the g0s0 global that pointed to gen 0 step
        0, and removed all uses of it. I think now we don't use gen 0
        step 0 at all, except possibly when there is only one
        generation. Possibly more tidying up is needed here.

      - I removed the global allocate() function, and renamed
        allocateLocal() to allocate().

      - the alloc_blocks global is gone. MAYBE_GC() and doYouWantToGC()
        now check the local nursery only.
* free cap->saved_mut_lists too | Simon Marlow | 2009-12-01 | 1 | -0/+1
    fixes some memory leakage at shutdown
* findSpark: exit if there's a returning foreign call | Simon Marlow | 2009-10-09 | 1 | -1/+1
* Retry pulling from our own spark pool if there was a collision | Simon Marlow | 2009-10-07 | 1 | -21/+24
* Unify event logging and debug tracing. | Simon Marlow | 2009-08-29 | 1 | -14/+6
    - tracing facilities are now enabled with -DTRACING, and -DDEBUG
      additionally enables debug-tracing. -DEVENTLOG has been removed.

    - -debug now implies -eventlog

    - events can be printed to stderr instead of being sent to the
      binary .eventlog file by adding +RTS -v (which is implied by the
      +RTS -Dx options).

    - -Dx debug messages can be sent to the binary .eventlog file by
      adding +RTS -l. This should help debugging by reducing the impact
      of debug tracing on execution time.

    - Various debug messages that duplicated the information in events
      have been removed.
* waitForReturnCapability: fix logic bug | Simon Marlow | 2009-08-31 | 1 | -1/+1
    The check for whether a Capability was free was inverted, which
    harmed performance for callbacks.
* RTS tidyup sweep, first phase | Simon Marlow | 2009-08-02 | 1 | -9/+8
    The first phase of this tidyup is focussed on the header files, and
    in particular making sure we are exposing publicly exactly what we
    need to, and no more.

      - Rts.h now includes everything that the RTS exposes publicly,
        rather than a random subset of it.

      - Most of the public header files have moved into subdirectories,
        and many of them have been renamed. But clients should not need
        to include any of the other headers directly, just #include the
        main public headers: Rts.h, HsFFI.h, RtsAPI.h.

      - All the headers needed for via-C compilation have moved into
        the stg subdirectory, which is self-contained. Most of the
        headers for the rest of the RTS APIs have moved into the rts
        subdirectory.

      - I left MachDeps.h where it is, because it is so widely used in
        Haskell code.

      - I left a deprecated stub for RtsFlags.h in place. The flag
        structures are now exposed by Rts.h.

      - Various internal APIs are no longer exposed by public header
        files.

      - Various bits of dead code and declarations have been removed

      - More gcc warnings are turned on, and the RTS code is more
        warning-clean.

      - More source files #include "PosixSource.h", and hence only use
        standard POSIX (1003.1c-1995) interfaces.

    There is a lot more tidying up still to do, this is just the first
    pass. I also intend to standardise the names for external RTS APIs
    (e.g. use the rts_ prefix consistently), and declare the internal
    APIs as hidden for shared libraries.
* Add and export rts_unsafeGetMyCapability from rts | Duncan Coutts | 2009-06-12 | 1 | -0/+15
    We need this, or something equivalent, to be able to implement
    stgAllocForGMP outside of the rts. That's because we want to use
    allocateLocal which allocates from the given capability without
    having to take any locks. In the gmp primops we're basically in an
    unsafe foreign call, that is a context where we hold a current
    capability. So it's safe for us to use allocateLocal. We just need a
    way to get the current capability. The method to get the current
    capability varies depending on whether we're using the threaded rts
    or not. When stgAllocForGMP is built inside the rts that's ok because
    we can do it conditionally on THREADED_RTS. Outside the rts we need a
    single api we can call without knowing if we're talking to a threaded
    rts or not, hence this addition.
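The point of the addition is a single entry point that hides the threaded versus non-threaded difference from code built outside the RTS. A rough C sketch of that shape, under loud assumptions: `ToyOwnedCap`, `TOY_THREADED_RTS`, `toy_set_current_cap` and `toy_unsafe_get_my_capability` are hypothetical, and using a C11 thread-local pointer is this sketch's own choice, not a description of how the RTS tracks the current capability.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Capability. */
typedef struct { unsigned no; } ToyOwnedCap;

#ifdef TOY_THREADED_RTS
/* Threaded flavour (assumption): each OS thread records which
 * capability it currently holds in a thread-local variable. */
static _Thread_local ToyOwnedCap *toy_current_cap = NULL;

static void toy_set_current_cap(ToyOwnedCap *cap) { toy_current_cap = cap; }
#else
/* Non-threaded flavour: there is exactly one capability. */
static ToyOwnedCap toy_main_cap = { 0 };
#endif

/* One API for both flavours. "Unsafe" because the caller must already
 * hold a capability, e.g. inside an unsafe foreign call. */
ToyOwnedCap *toy_unsafe_get_my_capability(void)
{
#ifdef TOY_THREADED_RTS
    assert(toy_current_cap != NULL);
    return toy_current_cap;
#else
    return &toy_main_cap;
#endif
}
```

Client code calls `toy_unsafe_get_my_capability()` the same way in both builds; only the library behind it is compiled differently.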
* Remove old GUM/GranSim code | Simon Marlow | 2009-06-02 | 1 | -1/+1
* Eventlog support for new event type: create spark. | donnie@darthik.com | 2009-04-03 | 1 | -0/+9
* Add fast event logging | Simon Marlow | 2009-03-17 | 1 | -10/+10
    Generate binary log files from the RTS containing a log of runtime
    events with timestamps. The log file can be visualised in various
    ways, for investigating runtime behaviour and debugging performance
    problems. See for example the forthcoming ThreadScope viewer.

    New GHC option:

      -eventlog (link-time option)
        Enables event logging.

      +RTS -l (runtime option)
        Generates <prog>.eventlog with the binary event information.

    This replaces some of the tracing machinery we already had in the
    RTS: e.g. +RTS -vg for GC tracing (we should do this using the new
    event logging instead).

    Event logging has almost no runtime cost when it isn't enabled,
    though in the future we might add more fine-grained events and this
    might change; hence having a link-time option and compiling a
    separate version of the RTS for event logging. There's a small
    runtime cost for enabling event-logging, for most programs it
    shouldn't make much difference.

    (Todo: docs)