|  | Commit message (Collapse) | Author | Age | Files | Lines | 
|---|
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Because garbage collector calls `retainerProfile()` and `heapCensus()`,
GC times normally include some of PROF times too. To fix this we have
these lines:
    // heapCensus() is called by the GC, so RP and HC time are
    // included in the GC stats.  We therefore subtract them to
    // obtain the actual GC cpu time.
    stats.gc_cpu_ns      -=  prof_cpu;
    stats.gc_elapsed_ns  -=  prof_elapsed;
These variables are later used for calculating GC time excluding the
final GC (which should be attributed to EXIT).
    exit_gc_elapsed      = stats.gc_elapsed_ns - start_exit_gc_elapsed;
The problem is if we subtract PROF times from `gc_elapsed_ns` and then
subtract `start_exit_gc_elapsed` from the result, we end up subtracting
PROF times twice, because `start_exit_gc_elapsed` also includes PROF
times.
We now subtract PROF times from GC after the calculations for EXIT and
MUT times. The existing assertion that checks
    INIT + MUT + GC + EXIT = TOTAL
now holds. When we subtract PROF numbers from GC, and a new assertion
    INIT + MUT + GC + PROF + EXIT = TOTAL
also holds.
Fixes #15897. New assertions added in this commit also revealed #16102,
which is also fixed by this commit. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| | Reviewers: maoe, bgamari, erikd, simonmar
Reviewed By: simonmar
Subscribers: rwbarton, carter, sjorn3
Differential Revision: https://phabricator.haskell.org/D5287 | 
| | |  | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Test Plan:
I can't validate this because of existing errors with the debug runtime. I'll
see if this introduces any new failures.
Reviewers: simonmar, bgamari, erikd
Reviewed By: simonmar
Subscribers: rwbarton, carter
Differential Revision: https://phabricator.haskell.org/D5337 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Long ago, the stable name table and stable pointer tables were one.
Now, they are separate, and have significantly different
implementations. I believe the time has come to finish the split
that began in #7674.
* Divide `rts/Stable` into `rts/StableName` and `rts/StablePtr`.
* Give each table its own mutex.
* Add FFI functions `hs_lock_stable_ptr_table` and
`hs_unlock_stable_ptr_table` and document them.
  These are intended to replace the previously undocumented
`hs_lock_stable_tables` and `hs_lock_stable_tables`,
  which are now documented as deprecated synonyms.
* Make `eqStableName#` use pointer equality instead of unnecessarily
comparing stable name table indices.
Reviewers: simonmar, bgamari, erikd
Reviewed By: bgamari
Subscribers: rwbarton, carter
GHC Trac Issues: #15555
Differential Revision: https://phabricator.haskell.org/D5084 | 
| | |  | 
| | |  | 
| | |  | 
| | 
| 
| 
| 
| 
| | PARALLEL_HASKELL was long gone, remove references
[skip ci] | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | With a large heap it's possible to build up a lot of finalizers
between GCs.  We've observed GC spending up to 50% of its time running
finalizers.  But there's no reason we have to run finalizers during
GC, and especially no reason we have to block *all* the mutator
threads while *one* GC thread runs finalizers one by one.
I thought about a bunch of alternative ways to handle this, which are
documented along with runSomeFinalizers() in Weak.c.  The approach I
settled on is to have a capability run finalizers if it is idle.  So
running finalizers is like a low-priority background thread. This
requires some minor scheduler changes, but not much.  In the future we
might be able to move more GC work into here (I have my eye on freeing
large blocks, for example).
Test Plan:
* validate
* tested on our system and saw reductions in GC pauses of 40-50%.
Reviewers: bgamari, niteria, osa1, erikd
Reviewed By: bgamari, osa1
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4521 | 
| | |  | 
| | 
| 
| 
| | (the CPP guard is already wrapped with the same guard in line 1549) | 
| | |  | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | `all_tasks_mutex` should be released before calling `releaseCapability_`
in the parent process as `releaseCapability_` spawns worker tasks in
some cases.
Reviewers: bgamari, erikd, simonmar
Subscribers: rwbarton, thomie, carter
GHC Trac Issues: #14538
Differential Revision: https://phabricator.haskell.org/D4460 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| | Reviewers: bgamari, erikd, simonmar
Reviewed By: bgamari
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4390 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Test Plan: Validate
Reviewers: erikd, simonmar
Reviewed By: simonmar
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4374 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
When we use nursery chunks with +RTS -n<size>, when the current nursery
runs out we have to check whether there's another chunk available with
getNewNursery().  There was one place we weren't doing this: the ad-hoc
heap check in scheduleProcessInbox().
The impact of the bug was that we would GC too early when using nursery
chunks, especially in programs that used messages (throwTo between
capabilities could do this, also hs_try_putmvar()).
Test Plan: validate, also local testing in our application
Reviewers: bgamari, niteria, austin, erikd
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3749 | 
| | 
| 
| 
| 
| | The Haskell wrapper already checks this but we should also check it in the RTS
to catch non-Haskell callers. See #13832. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
The problem occurred when
* Threads A & B evaluate the same thunk
* Thread A context-switches, so the thunk gets blackholed
* Thread C enters the blackhole, creates a BLOCKING_QUEUE attached to
  the blackhole and thread A's `tso->bq` queue
* Thread B updates the blackhole with a value, overwriting the BLOCKING_QUEUE
* We GC, replacing A's update frame with stg_enter_checkbh
* Throw an exception in A, which ignores the stg_enter_checkbh frame
Now we have C blocked on A's tso->bq queue, but we forgot to check the
queue because the stg_enter_checkbh frame has been thrown away by the
exception.
The solution and alternative designs are discussed in Note [upd-black-hole].
This also exposed a bug in the interpreter, whereby we were sometimes
context-switching without calling `threadPaused()`.  I've fixed this
and added some Notes.
Test Plan:
* `cd testsuite/tests/concurrent && make slow`
* validate
Reviewers: niteria, bgamari, austin, erikd
Reviewed By: erikd
Subscribers: rwbarton, thomie
GHC Trac Issues: #13751
Differential Revision: https://phabricator.haskell.org/D3630 | 
| | |  | 
| | 
| 
| 
| | Our new CPP linter enforces this. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | The C code in the RTS now gets built with `-Wundef` and the Haskell code
(stages 1 and 2 only) with `-Wcpp-undef`. We now get warnings whereever
`#if` is used on undefined identifiers.
Test Plan: Validate on Linux and Windows
Reviewers: austin, angerman, simonmar, bgamari, Phyx
Reviewed By: bgamari
Subscribers: thomie, snowleopard
Differential Revision: https://phabricator.haskell.org/D3278 | 
| | 
| 
| 
| 
| 
| 
| 
| | This is causing too much platform dependent breakage at the moment. We
will need a more rigorous testing strategy before this can be
merged again.
This reverts commit 7e340c2bbf4a56959bd1e95cdd1cfdb2b7e537c2. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | The C code in the RTS now gets built with `-Wundef` and the Haskell code
(stages 1 and 2 only) with `-Wcpp-undef`. We now get warnings whereever
`#if` is used on undefined identifiers.
Test Plan: Validate on Linux and Windows
Reviewers: austin, angerman, simonmar, bgamari, Phyx
Reviewed By: bgamari
Subscribers: thomie, snowleopard
Differential Revision: https://phabricator.haskell.org/D3278 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| | Moritz Angermann reports mysterious rts crash:
  A: link: internal error: schedule: invalid what_next field
  A:     (GHC version 8.3.20170321 for arm_none_linux_android)
This change prints actual prev_what_next value.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | [skip ci]
There ware some old file names (.lhs, ...) at comments.
* rts/win32/ThrIOManager.c
  - Conc.lhs -> Conc.hs
* rts/PrimOps.cmm
  - ByteCodeLink.lhs -> ByteCodeLink.hs
  - StgMiscClosures.hc -> StgMiscClosures.cmm
* rts/AutoApply.h
  - AutoApply.hc -> AutoApply.cmm
* rts/HeapStackCheck.cmm
  - PrimOps.hc -> PrimOps.cmm
* rts/LdvProfile.h
  - Updates.hc -> Updates.cmm
* rts/Schedule.c
  - StgStartup.hc -> StgStartup.cmm
* rts/Weak.c
  - StgMiscClosures.hc -> StgMiscClosures.cmm
Reviewers: bgamari, austin, erikd, simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D3075 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This changes heap overflow to throw a HeapOverflow exception instead of
killing the process.
Test Plan: GHC CI
Reviewers: simonmar, austin, hvr, erikd, bgamari
Reviewed By: simonmar, bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2790
GHC Trac Issues: #1791 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | When rts is forked it doesn't update toplevel handler, so UserInterrupt
exception is sent to Thread1 that doesn't exist in forked process.
We install toplevel handler when fork so signal will be delivered to the
new main thread.
Fixes #12903
Reviewers: simonmar, austin, erikd, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2770
GHC Trac Issues: #12903 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699 | 
| | |  | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
The problem boils down to global variables: in particular gc_threads[],
which was being modified by a subsequent GC before the previous GC had
finished with it.  The fix is to not use global variables.
This was causing setnumcapabilities001 to fail (again!).  It's an old
bug though.
Test Plan:
Ran setnumcapabilities001 in a loop for a couple of hours.  Before this
patch it had been failing after a few minutes.  Not a very scientific
test, but it's the best I have.
Reviewers: bgamari, austin, fryguybob, niteria, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2654 | 
| | 
| 
| 
| 
| 
| | It's entirely reasonable to set +RTS -qn without setting -N, because the
program might later call setNumCapabilities.  If we disallow it, there's
no way to use -qn on programs that use setNumCapabilities. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | The value of enabled_capabilities can change across a call to
requestSync(), and we were erroneously using an old value, causing
things to go wrong later.  It manifested as an assertion failure, I'm
not sure whether there are worse consequences or not, but we should
get this fix into 8.0.2 anyway.
The failure didn't happen for me because it only shows up on machines
with fewer than 4 processors, due to the new logic to enable -qn
automatically.  I've bumped the test parameter 8 to make it more
likely to exercise that code.
Test Plan: Ran setnumcapabilities001 many times
Reviewers: niteria, austin, erikd, rwbarton, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2617
GHC Trac Issues: #12728 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Setting a -N value that is too large has a dramatic negative effect on
performance, but the new -qn flag can mitigate the worst of the effects
by limiting the number of GC threads.
So now, if you don't explcitly set +RTS -qn, and you set -N larger than
the number of cores (or use setNumCapabilities to do the same), we'll
default -qn to the number of cores.
These are the results from nofib/parallel on my 4-core (2 cores x 2
threads) i7 laptop, comparing -N8 before and after this change.
```
------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
------------------------------------------------------------------------
   blackscholes          +0.0%     +0.0%    -72.5%    -72.0%     +9.5%
          coins          +0.0%     -0.0%    -73.7%    -72.2%     -0.8%
         mandel          +0.0%     +0.0%    -76.4%    -75.4%     +3.3%
        matmult          +0.0%    +15.5%    -26.8%    -33.4%     +1.0%
          nbody          +0.0%     +2.4%     +0.7%     0.076      0.0%
         parfib          +0.0%     -8.5%    -33.2%    -31.5%     +2.0%
        partree          +0.0%     -0.0%    -60.4%    -56.8%     +5.7%
           prsa          +0.0%     -0.0%    -65.4%    -60.4%      0.0%
         queens          +0.0%     +0.2%    -58.8%    -58.8%     -1.5%
            ray          +0.0%     -1.5%    -88.7%    -85.6%     -3.6%
       sumeuler          +0.0%     -0.0%    -47.8%    -46.9%      0.0%
------------------------------------------------------------------------
            Min          +0.0%     -8.5%    -88.7%    -85.6%     -3.6%
            Max          +0.0%    +15.5%     +0.7%    -31.5%     +9.5%
 Geometric Mean          +0.0%     +0.6%    -61.4%    -63.1%     +1.4%
```
Test Plan: validate, nofib/parallel benchmarks
Reviewers: niteria, ezyang, nh2, austin, erikd, trofi, bgamari
Reviewed By: trofi, bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2580
GHC Trac Issues: #9221 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
This is a fast, non-blocking, asynchronous, interface to tryPutMVar that
can be called from C/C++.
It's useful for callback-based C/C++ APIs: the idea is that the callback
invokes hs_try_putmvar(), and the Haskell code waits for the callback to
run by blocking in takeMVar.
The callback doesn't block - this is often a requirement of
callback-based APIs.  The callback wakes up the Haskell thread with
minimal overhead and no unnecessary context-switches.
There are a couple of benchmarks in
testsuite/tests/concurrent/should_run.  Some example results comparing
hs_try_putmvar() with using a standard foreign export:
    ./hs_try_putmvar003 1 64 16 100 +RTS -s -N4     0.49s
    ./hs_try_putmvar003 2 64 16 100 +RTS -s -N4     2.30s
hs_try_putmvar() is 4x faster for this workload (see the source for
hs_try_putmvar003.hs for details of the workload).
An alternative solution is to use the IO Manager for this.  We've tried
it, but there are problems with that approach:
* Need to create a new file descriptor for each callback
* The IO Manger thread(s) become a bottleneck
* More potential for things to go wrong, e.g. throwing an exception in
  an IO Manager callback kills the IO Manager thread.
Test Plan: validate; new unit tests
Reviewers: niteria, erikd, ezyang, bgamari, austin, hvr
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2501 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
This is surprisingly tricky.  There were linked list bugs in the
previous version (D2430) that showed up as a test failure in
setnumcapabilities001 (that's a great stress test!).
This new version uses a different strategy that doesn't suffer from
the problem that @ezyang pointed out in D2430.  We now pre-calculate
how many threads to keep for this capability, and then migrate any
surplus threads off the front of the queue, taking care to account for
threads that can't be migrated.
Test Plan:
1. setnumcapabilities001 stress test with sanity checking (+RTS -DS) turned on:
```
cd testsuite/tests/concurrent/should_run
make TEST=setnumcapabilities001 WAY=threaded1 EXTRA_HC_OPTS=-with-rtsopts=-DS CLEANUP=0
while true; do ./setnumcapabilities001.run/setnumcapabilities001 4 9 2000 || break; done
```
2. The test case from #12419
Reviewers: niteria, ezyang, rwbarton, austin, bgamari, erikd
Subscribers: thomie, ezyang
Differential Revision: https://phabricator.haskell.org/D2441
GHC Trac Issues: #12419 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
If we had 2 threads on the run queue, say [A,B], and B is bound to the
current Task, then we would fail to migrate any threads.  This fixes it
so that we would migrate A in that case.
This will help parallelism a bit in programs that have lots of bound
threads.
Test Plan:
Test program in #12419, which is actually not a great program but it
does behave a bit better after this change.
Reviewers: ezyang, niteria, bgamari, austin, erikd
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2430
GHC Trac Issues: #12419 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
Knowing the length of the run queue in O(1) time is useful: for example
we don't have to traverse the run queue to know how many threads we have
to migrate in schedulePushWork().
Test Plan: validate
Reviewers: ezyang, erikd, bgamari, austin
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2437 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | @simonmar told me that it makes more sense this way.
Test Plan: it still builds
Reviewers: bgamari, austin, simonmar, erikd
Reviewed By: simonmar, erikd
Subscribers: thomie, simonmar
Differential Revision: https://phabricator.haskell.org/D2428 | 
| | |  | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep a separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
```
$ ./Main 15 +RTS -N24 -s -A64m
  Total   time  173.960s  (  7.467s elapsed)
$ ./Main 15 +RTS -N24 -s -A64m --numa
  Total   time  150.836s  (  6.423s elapsed)
```
The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).
According to perf, on this program the number of remote memory accesses
were reduced by more than 50% by using `--numa`.
Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
  actually making the OS calls, which is useful for testing the code
  on non-NUMA systems.
* TODO: I need to add some unit tests
Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2199 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | In addition to more const-correctness fixes this patch fixes an
infelicity of the previous const-correctness patch (995cf0f356) which
left `UNTAG_CLOSURE` taking a `const StgClosure` pointer parameter
but returning a non-const pointer. Here we restore the original type
signature of `UNTAG_CLOSURE` and add a new function
`UNTAG_CONST_CLOSURE` which takes and returns a const `StgClosure`
pointer and uses that wherever possible.
Test Plan: Validate on Linux, OS X and Windows
Reviewers: Phyx, hsyl20, bgamari, austin, simonmar, trofi
Reviewed By: simonmar, trofi
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2231 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | The assertion failure was fairly benign, I think, but this fixes it.
I've been running the test repeatedly for the last 30 mins and it hasn't
triggered.
There are other problems exposed by this test (see #12038), but I've
worked around those in the test itself for now.
I also copied the relevant bits of the parallel library here so that we
don't need parallel for the test to run. | 
| | 
| 
| 
| 
| 
| 
| 
| | It was possible for a thread to read invalid memory after a conflict
when multiple threads were synchronising.
I haven't been successful in constructing a test case that triggers
this, but we have some internal code that ran into it. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | The `nat` type was an alias for `unsigned int` with a comment saying
it was at least 32 bits. We keep the typedef in case client code is
using it but mark it as deprecated.
Test Plan: Validated on Linux, OS X and Windows
Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20
Differential Revision: https://phabricator.haskell.org/D2166 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This function had some pathalogically bad behaviour: if we had 2 threads
on the current capability and 23 other idle capabilities, we would
* grab all 23 capabilities
* migrate one Haskell thread to one of them
* wake up a worker on *all* 23 other capabilities.
This lead to a lot of unnecessary wakeups when using large -N values.
Now, we
* Count how many capabilities we need to wake up
* Start from cap->no+1, so that we don't overload low-numbered capabilities
* Only wake up capabilities that we migrated a thread to (unless we have
  sparks to steal)
This results in a pretty dramatic improvement in our production system. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This allows the GC to use fewer threads than the number of capabilities.
At each GC, we choose some of the capabilities to be "idle", which means
that the thread running on that capability (if any) will sleep for the
duration of the GC, and the other threads will do its work.  We choose
capabilities that are already idle (if any) to be the idle capabilities.
The idea is that this helps in the following situation:
* We want to use a large -N value so as to make use of hyperthreaded
  cores
* We use a large heap size, so GC is infrequent
* But we don't want to use all -N threads in the GC, because that
  thrashes the memory too much.
See docs for usage. | 
| | |  | 
| | 
| 
| 
| 
| 
| 
| 
| | Noticed by uselex.rb:
    removeFromRunQueue: [R]: exported from:
        ./rts/dist/build/Schedule.o
Signed-off-by: Sergei Trofimovich <siarheit@google.com> | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | On AIX, C system headers can redirect the token `stat` via
    #define stat stat64
to provide large-file support. Simply avoiding the use of `stat` as an
identifier eschews macro-replacement.
Differential Revision: https://phabricator.haskell.org/D1566 |