summaryrefslogtreecommitdiff
path: root/rts/Capability.h
Commit message (Collapse)AuthorAgeFilesLines
...
* count "dud" sparks (expressions that were already evaluated when sparked)Simon Marlow2010-11-011-0/+1
|
* Keep a maximum of 6 spare worker threads per Capability (#4262)Simon Marlow2010-11-251-0/+1
|
* Fix the symbol visibility pragmasSimon Marlow2010-06-171-2/+2
|
* Make sparks into weak pointers (#2185)Simon Marlow2010-05-251-1/+1
| | | | | The new strategies library (parallel-2.0+, preferably 2.2+) is now required for parallel programming, otherwise parallelism will be lost.
* Add wiki linksSimon Marlow2010-05-201-0/+3
|
* Change the representation of the MVar blocked queueSimon Marlow2010-04-011-4/+4
| | | | | | | | | | | | | | | | | | | | | The list of threads blocked on an MVar is now represented as a list of separately allocated objects rather than being linked through the TSOs themselves. This lets us remove a TSO from the list in O(1) time rather than O(n) time, by marking the list object. Removing this linear component fixes some pathalogical performance cases where many threads were blocked on an MVar and became unreachable simultaneously (nofib/smp/threads007), or when sending an asynchronous exception to a TSO in a long list of thread blocked on an MVar. MVar performance has actually improved by a few percent as a result of this change, slightly to my surprise. This is the final cleanup in the sequence, which let me remove the old way of waking up threads (unblockOne(), MSG_WAKEUP) in favour of the new way (tryWakeupThread and MSG_TRY_WAKEUP, which is idempotent). It is now the case that only the Capability that owns a TSO may modify its state (well, almost), and this simplifies various things. More of the RTS is based on message-passing between Capabilities now.
* New implementation of BLACKHOLEsSimon Marlow2010-03-291-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces the global blackhole_queue with a clever scheme that enables us to queue up blocked threads on the closure that they are blocked on, while still avoiding atomic instructions in the common case. Advantages: - gets rid of a locked global data structure and some tricky GC code (replacing it with some per-thread data structures and different tricky GC code :) - wakeups are more prompt: parallel/concurrent performance should benefit. I haven't seen anything dramatic in the parallel benchmarks so far, but a couple of threading benchmarks do improve a bit. - waking up a thread blocked on a blackhole is now O(1) (e.g. if it is the target of throwTo). - less sharing and better separation of Capabilities: communication is done with messages, the data structures are strictly owned by a Capability and cannot be modified except by sending messages. - this change will utlimately enable us to do more intelligent scheduling when threads block on each other. This is what started off the whole thing, but it isn't done yet (#3838). I'll be documenting all this on the wiki in due course.
* Use message-passing to implement throwTo in the RTSSimon Marlow2010-03-111-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces some complicated locking schemes with message-passing in the implementation of throwTo. The benefits are - previously it was impossible to guarantee that a throwTo from a thread running on one CPU to a thread running on another CPU would be noticed, and we had to rely on the GC to pick up these forgotten exceptions. This no longer happens. - the locking regime is simpler (though the code is about the same size) - threads can be unblocked from a blocked_exceptions queue without having to traverse the whole queue now. It's a rare case, but replaces an O(n) operation with an O(1). - generally we move in the direction of sharing less between Capabilities (aka HECs), which will become important with other changes we have planned. Also in this patch I replaced several STM-specific closure types with a generic MUT_PRIM closure type, which allowed a lot of code in the GC and other places to go away, hence the line-count reduction. The message-passing changes resulted in about a net zero line-count difference.
* Split part of the Task struct into a separate struct InCallSimon Marlow2010-03-091-1/+1
| | | | | | | | | | | | | | | The idea is that this leaves Tasks and OSThread in one-to-one correspondence. The part of a Task that represents a call into Haskell from C is split into a separate struct InCall, pointed to by the Task and the TSO bound to it. A given OSThread/Task thus always uses the same mutex and condition variable, rather than getting a new one for each callback. Conceptually it is simpler, although there are more types and indirections in a few places now. This improves callback performance by removing some of the locks that we had to take when making in-calls. Now we also keep the current Task in a thread-local variable if supported by the OS and gcc (currently only Linux).
* disable a false assertion, with a comment to explain whySimon Marlow2010-02-161-1/+2
|
* remove unused cap->in_gc flagSimon Marlow2009-12-021-3/+0
|
* Make allocatePinned use local storage, and other refactoringsSimon Marlow2009-12-011-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | This is a batch of refactoring to remove some of the GC's global state, as we move towards CPU-local GC. - allocateLocal() now allocates large objects into the local nursery, rather than taking a global lock and allocating then in gen 0 step 0. - allocatePinned() was still allocating from global storage and taking a lock each time, now it uses local storage. (mallocForeignPtrBytes should be faster with -threaded). - We had a gen 0 step 0, distinct from the nurseries, which are stored in a separate nurseries[] array. This is slightly strange. I removed the g0s0 global that pointed to gen 0 step 0, and removed all uses of it. I think now we don't use gen 0 step 0 at all, except possibly when there is only one generation. Possibly more tidying up is needed here. - I removed the global allocate() function, and renamed allocateLocal() to allocate(). - the alloc_blocks global is gone. MAYBE_GC() and doYouWantToGC() now check the local nursery only.
* Omit visibility pragmas on Windows (fixes warnings/validate failures)Simon Marlow2009-09-091-2/+2
|
* Declare RTS-private prototypes with __attribute__((visibility("hidden")))Simon Marlow2009-08-051-0/+4
| | | | | | | | | | This has no effect with static libraries, but when the RTS is in a shared library it does two things: - it prevents the function from being exposed by the shared library - internal calls to the function can use the faster non-PLT calls, because the function cannot be overriden at link time.
* RTS tidyup sweep, first phaseSimon Marlow2009-08-021-18/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first phase of this tidyup is focussed on the header files, and in particular making sure we are exposinng publicly exactly what we need to, and no more. - Rts.h now includes everything that the RTS exposes publicly, rather than a random subset of it. - Most of the public header files have moved into subdirectories, and many of them have been renamed. But clients should not need to include any of the other headers directly, just #include the main public headers: Rts.h, HsFFI.h, RtsAPI.h. - All the headers needed for via-C compilation have moved into the stg subdirectory, which is self-contained. Most of the headers for the rest of the RTS APIs have moved into the rts subdirectory. - I left MachDeps.h where it is, because it is so widely used in Haskell code. - I left a deprecated stub for RtsFlags.h in place. The flag structures are now exposed by Rts.h. - Various internal APIs are no longer exposed by public header files. - Various bits of dead code and declarations have been removed - More gcc warnings are turned on, and the RTS code is more warning-clean. - More source files #include "PosixSource.h", and hence only use standard POSIX (1003.1c-1995) interfaces. There is a lot more tidying up still to do, this is just the first pass. I also intend to standardise the names for external RTS APIs (e.g use the rts_ prefix consistently), and declare the internal APIs as hidden for shared libraries.
* GHC new build system megapatchIan Lynagh2009-04-261-1/+1
|
* Instead of a separate context-switch flag, set HpLim to zeroSimon Marlow2009-03-131-0/+13
| | | | | | | | | | | | This reduces the latency between a context-switch being triggered and the thread returning to the scheduler, which in turn should reduce the cost of the GC barrier when there are many cores. We still retain the old context_switch flag which is checked at the end of each block of allocation. The idea is that setting HpLim may fail if the the target thread is modifying HpLim at the same time; the context_switch flag is a fallback. It also allows us to "context switch soon" without forcing an immediate switch, which can be costly.
* Keep the remembered sets local to each thread during parallel GCSimon Marlow2009-01-121-3/+5
| | | | | | | | | | | | | | | | | | | | | This turns out to be quite vital for parallel programs: - The way we discover which threads to traverse is by finding dirty threads via the remembered sets (aka mutable lists). - A dirty thread will be on the remembered set of the capability that was running it, and we really want to traverse that thread's stack using the GC thread for the capability, because it is in that CPU's cache. If we get this wrong, we get penalised badly by the memory system. Previously we had per-capability mutable lists but they were aggregated before GC and traversed by just one of the GC threads. This resulted in very poor performance particularly for parallel programs with deep stacks. Now we keep per-capability remembered sets throughout GC, which also removes a lock (recordMutableGen_sync).
* Fix more problems caused by padding in the Capability structureSimon Marlow2008-12-021-3/+1
| | | | Fixes crashes on Windows and Sparc
* Remove the packing I added recently to the Capability structureSimon Marlow2008-11-281-5/+1
| | | | | | The problem is that the packing caused some unaligned loads, which lead to bus errors on Sparc (and reduced performance elsewhere, presumably).
* Use mutator threads to do GC, instead of having a separate pool of GC threadsSimon Marlow2008-11-211-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the GC had its own pool of threads to use as workers when doing parallel GC. There was a "leader", which was the mutator thread that initiated the GC, and the other threads were taken from the pool. This was simple and worked fine for sequential programs, where we did most of the benchmarking for the parallel GC, but falls down for parallel programs. When we have N mutator threads and N cores, at GC time we would have to stop N-1 mutator threads and start up N-1 GC threads, and hope that the OS schedules them all onto separate cores. It practice it doesn't, as you might expect. Now we use the mutator threads to do GC. This works quite nicely, particularly for parallel programs, where each mutator thread scans its own spark pool, which is probably in its cache anyway. There are some flag changes: -g<n> is removed (-g1 is still accepted for backwards compat). There's no way to have a different number of GC threads than mutator threads now. -q1 Use one OS thread for GC (turns off parallel GC) -qg<n> Use parallel GC for generations >= <n> (default: 1) Using parallel GC only for generations >=1 works well for sequential programs. Compiling an ordinary sequential program with -threaded and running it with -N2 or more should help if you do a lot of GC. I've found that adding -qg0 (do parallel GC for generation 0 too) speeds up some parallel programs, but slows down some sequential programs. Being conservative, I left the threshold at 1. ToDo: document the new options.
* Fix regTableToCapability() if gcc introduces paddingSimon Marlow2008-11-191-2/+8
| | | | | Also avoid padding if possible using __attribute__((packed)) Fixes the Windows build
* re-instate counting of sparks convertedSimon Marlow2008-11-061-2/+2
| | | | lost in patch "Run sparks in batches"
* leave out ATTRIBUTE_ALIGNED on Windows, it gives a warningSimon Marlow2008-11-061-1/+4
|
* Run sparks in batches, instead of creating a new thread for each oneSimon Marlow2008-11-061-1/+5
| | | | | Signficantly reduces the overhead for par, which means that we can make use of paralellism at a much finer granularity.
* Move the freeing of Capabilities later in the shutdown sequenceSimon Marlow2008-10-241-2/+2
| | | | | Fixes a bug whereby the Capability has been freed but other Capabilities are still trying to steal sparks from its pool.
* Pad Capabilities and Tasks to 64 bytesSimon Marlow2008-10-231-2/+4
| | | | | This is just good practice to avoid placing two structures heavily accessed by different CPUs on the same cache line
* traverse the spark pools only once during GC rather than twiceSimon Marlow2008-10-221-1/+2
|
* Refactoring and reorganisation of the schedulerSimon Marlow2008-10-221-6/+37
| | | | | | | | | | | | | | | | | Change the way we look for work in the scheduler. Previously, checking to see whether there was anything to do was a non-side-effecting operation, but this has changed now that we do work-stealing. This lead to a refactoring of the inner loop of the scheduler. Also, lots of cleanup in the new work-stealing code, but no functional changes. One new statistic is added to the +RTS -s output: SPARKS: 1430 (2 converted, 1427 pruned) lets you know something about the use of `par` in the program.
* Work stealing for sparksberthold@mathematik.uni-marburg.de2008-09-151-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Spark stealing support for PARALLEL_HASKELL and THREADED_RTS versions of the RTS. Spark pools are per capability, separately allocated and held in the Capability structure. The implementation uses Double-Ended Queues (deque) and cas-protected access. The write end of the queue (position bottom) can only be used with mutual exclusion, i.e. by exactly one caller at a time. Multiple readers can steal()/findSpark() from the read end (position top), and are synchronised without a lock, based on a cas of the top position. One reader wins, the others return NULL for a failure. Work stealing is called when Capabilities find no other work (inside yieldCapability), and tries all capabilities 0..n-1 twice, unless a theft succeeds. Inside schedulePushWork, all considered cap.s (those which were idle and could be grabbed) are woken up. Future versions should wake up capabilities immediately when putting a new spark in the local pool, from newSpark(). Patch has been re-recorded due to conflicting bugfixes in the sparks.c, also fixing a (strange) conflict in the scheduler.
* Move the context_switch flag into the CapabilitySimon Marlow2008-09-191-0/+7
| | | | | Fixes a long-standing bug that could in some cases cause sub-optimal scheduling behaviour.
* Fix race condition in wakeupThreadOnCapability() (#2574)Simon Marlow2008-09-091-5/+4
| | | | | | | | | | | wakeupThreadOnCapbility() is used to signal another capability that there is a thread waiting to be added to its run queue. It adds the thread to the (locked) wakeup queue on the remote capability. In order to do this, it has to modify the TSO's link field, which has a write barrier. The write barrier might put the TSO on the mutable list, and the bug was that it was using the mutable list of the *target* capability, which we do not have exclusive access to. We should be using the current Capabilty's mutable list in this case.
* Capability stopping when waiting for GCberthold@mathematik.uni-marburg.de2008-08-191-0/+3
|
* Undo fix for #2185: sparks really should be treated as rootsSimon Marlow2008-07-231-1/+0
| | | | | Unless sparks are roots, strategies don't work at all: all the sparks get GC'd. We need to think about this some more.
* FIX #2185: sparks should not be treated as roots by the GCSimon Marlow2008-04-241-0/+2
|
* Reorganisation to fix problems related to the gct register variableSimon Marlow2008-04-161-0/+4
| | | | | | | | | - GCAux.c contains code not compiled with the gct register enabled, it is callable from outside the GC - marking functions are moved to their relevant subsystems, outside the GC - mark_root needs to save the gct register, as it is called from outside the GC
* hs_exit()/shutdownHaskell(): wait for outstanding foreign calls to complete ↵Simon Marlow2007-07-241-1/+1
| | | | | | | | | | | | | | | | before returning This is pertinent to #1177. When shutting down a DLL, we need to be sure that there are no OS threads that can return to the code that we are about to unload, and previously the threaded RTS was unsafe in this respect. When exiting a standalone program we don't have to be quite so paranoid: all the code will disappear at the same time as any running threads. Happily exiting a program happens via a different path: shutdownHaskellAndExit(). If we're about to exit(), then there's no need to wait for foreign calls to complete.
* Free more things that we allocate2006-12-16Ian Lynagh2006-12-151-0/+3
|
* STM invariantstharris@microsoft.com2006-10-071-1/+2
|
* Asynchronous exception support for SMPSimon Marlow2006-06-161-0/+4
| | | | | | | | | | | | | | | | | This patch makes throwTo work with -threaded, and also refactors large parts of the concurrency support in the RTS to clean things up. We have some new files: RaiseAsync.{c,h} asynchronous exception support Threads.{c,h} general threading-related utils Some of the contents of these new files used to be in Schedule.c, which is smaller and cleaner as a result of the split. Asynchronous exception support in the presence of multiple running Haskell threads is rather tricky. In fact, to my annoyance there are still one or two bugs to track down, but the majority of the tests run now.
* Reorganisation of the source treeSimon Marlow2006-04-071-0/+250
Most of the other users of the fptools build system have migrated to Cabal, and with the move to darcs we can now flatten the source tree without losing history, so here goes. The main change is that the ghc/ subdir is gone, and most of what it contained is now at the top level. The build system now makes no pretense at being multi-project, it is just the GHC build system. No doubt this will break many things, and there will be a period of instability while we fix the dependencies. A straightforward build should work, but I haven't yet fixed binary/source distributions. Changes to the Building Guide will follow, too.