| author | Simon Marlow <marlowsd@gmail.com> | 2016-04-23 21:14:49 +0100 | 
|---|---|---|
| committer | Simon Marlow <marlowsd@gmail.com> | 2016-06-10 21:25:54 +0100 | 
| commit | 9e5ea67e268be2659cd30ebaed7044d298198ab0 (patch) | |
| tree | c395e74ee772ae0d59c852b3cbde743784b08d09 /rts/Schedule.c | |
| parent | b9fa72a24ba2cc3120912e6afedc9280d28d2077 (diff) | |
NUMA support
Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will:
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory
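The capability placement and per-node free lists described in the list above can be sketched roughly as follows. This is an illustrative sketch only, not the RTS code: `Block`, `MAX_NODES`, `cap_to_node`, `free_block` and `alloc_block_on_node` are hypothetical names invented for this example, and the real block layer is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 16   /* hypothetical upper bound chosen for this sketch */

/* Hypothetical stand-in for a block descriptor; it only records the
   NUMA node the block's memory lives on. */
typedef struct Block {
    struct Block *next;
    unsigned      node;
} Block;

/* One free list per NUMA node, so that freed blocks are reused locally. */
static Block *free_list[MAX_NODES];

/* Capability C is placed on node C % N, where N is the number of nodes. */
static unsigned cap_to_node(unsigned cap_no, unsigned n_nodes)
{
    assert(n_nodes > 0 && n_nodes <= MAX_NODES);
    return cap_no % n_nodes;
}

/* Return a block to the free list of the node it belongs to. */
static void free_block(Block *b)
{
    assert(b->node < MAX_NODES);
    b->next = free_list[b->node];
    free_list[b->node] = b;
}

/* Take a block from the given node's free list, or NULL if it is empty.
   A real allocator would fall back to the OS or to another node here. */
static Block *alloc_block_on_node(unsigned node)
{
    assert(node < MAX_NODES);
    Block *b = free_list[node];
    if (b != NULL) {
        free_list[node] = b->next;
        b->next = NULL;
    }
    return b;
}
```

With two nodes, capabilities 0, 2, 4, ... land on node 0 and capabilities 1, 3, 5, ... on node 1, and a capability would then request blocks with `alloc_block_on_node(cap_to_node(cap_no, n_nodes))`.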
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
```
$ ./Main 15 +RTS -N24 -s -A64m
  Total   time  173.960s  (  7.467s elapsed)
$ ./Main 15 +RTS -N24 -s -A64m --numa
  Total   time  150.836s  (  6.423s elapsed)
```
The biggest win here is expected to come from allocating from node-local
memory, so programs using a large -A value (as here) should benefit most.
According to perf, on this program the number of remote memory accesses
was reduced by more than 50% by using `--numa`.
Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
  actually making the OS calls, which is useful for testing the code
  on non-NUMA systems.
* TODO: I need to add some unit tests
Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2199
Diffstat (limited to 'rts/Schedule.c')
| -rw-r--r-- | rts/Schedule.c | 5 | 
1 file changed, 3 insertions, 2 deletions
```diff
diff --git a/rts/Schedule.c b/rts/Schedule.c
index 8a08e35cc3..fca276dc08 100644
--- a/rts/Schedule.c
+++ b/rts/Schedule.c
@@ -726,7 +726,8 @@ schedulePushWork(Capability *cap USED_IF_THREADS,
         } while (n_wanted_caps < n_capabilities-1);
     }
 
-    // Grab free capabilities, starting from cap->no+1.
+    // First grab as many free Capabilities as we can.  ToDo: we should use
+    // capabilities on the same NUMA node preferably, but not exclusively.
     for (i = (cap->no + 1) % n_capabilities, n_free_caps=0;
          n_free_caps < n_wanted_caps && i != cap->no;
          i = (i + 1) % n_capabilities) {
@@ -1134,7 +1135,7 @@ scheduleHandleHeapOverflow( Capability *cap, StgTSO *t )
                                                // nursery has only one
                                                // block.
 
-            bd = allocGroup_lock(blocks);
+            bd = allocGroupOnNode_lock(cap->node,blocks);
             cap->r.rNursery->n_blocks += blocks;
 
             // link the new group after CurrentNursery
```
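The ToDo in the `schedulePushWork` comment (prefer capabilities on the same NUMA node, but not exclusively) is not implemented by this commit. One possible shape for it, purely as a hypothetical sketch, is a two-pass scan over the same circular order the existing loop uses: same-node capabilities first, then the rest. The function `candidate_order` and its parameters are invented for illustration, and the sketch assumes the cap C on node C%N placement from the commit message.

```c
#include <stddef.h>

/* Hypothetical helper: write into `out` the capability numbers that
   capability `self` would consider when pushing work, in preference order.
   Assumes self < n_caps and that `out` has room for n_caps - 1 entries.
   Returns the number of entries written. */
static size_t candidate_order(unsigned self, unsigned n_caps,
                              unsigned n_nodes, unsigned *out)
{
    size_t k = 0;
    if (n_caps == 0 || n_nodes == 0) {
        return 0;
    }
    unsigned my_node = self % n_nodes;

    /* Pass 0 collects capabilities on our own node; pass 1 collects the
       remaining ones, so remote nodes are used only after local ones. */
    for (int pass = 0; pass < 2; pass++) {
        for (unsigned i = (self + 1) % n_caps; i != self;
             i = (i + 1) % n_caps) {
            int same_node = (i % n_nodes) == my_node;
            if ((pass == 0) == same_node) {
                out[k++] = i;
            }
        }
    }
    return k;
}
```

Every other capability still appears exactly once in the result, so work can spill over to remote nodes when the local ones are busy, which matches the "preferably, but not exclusively" wording.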
