summaryrefslogtreecommitdiff
path: root/rts/sm/BlockAlloc.c
Commit message (Collapse)AuthorAgeFilesLines
* Make `PosixSource.h` installed and under `rts/`John Ericson2021-08-091-1/+1
| | | | | | is used outside of the rts so we do this rather than just fish it out of the repo in ad-hoc way, in order to make packages in this repo more self-contained.
* eventlog: Add MEM_RETURN event to give information about fragmentationMatthew Pickering2021-03-081-1/+5
| | | | | | | | | | | | | | See #19357 The event reports the * Current number of megablocks allocated * The number that the RTS thinks it needs * The number is managed to return to the OS When current > need then the difference is returned to the OS, the successful number of returned mblocks is reported by 'returned'. In a fragmented heap current > need but returned < current - need.
* rts: Annotate hopefully "benign" races in freeGroupBen Gamari2020-11-011-0/+25
|
* rts/BlockAlloc: Use relaxed operationsBen Gamari2020-10-241-6/+7
|
* rts: Zero block flags with -DZBen Gamari2020-05-061-0/+12
| | | | | | | Block flags are very useful for determining the state of a block. However, some block allocator users don't touch them, leading to misleading values. Ensure that we zero then when zero-on-gc is set. This is safe and makes the flags more useful during debugging.
* rts/BlockAlloc: Allow aligned allocation requestswip/gc/aligned-block-allocationÖmer Sinan Ağacan2019-10-181-1/+143
| | | | | | | | | | | This implements support for block group allocations which are aligned to an integral number of blocks. This will be used by the nonmoving garbage collector, which uses the block allocator to allocate the segments which back its heap. These segments are a fixed number of blocks in size, with each segment being aligned to the segment size boundary. This allows us to easily find the segment metadata stored at the beginning of the segment.
* Add new debug flag -DZTobias Guggenmos2019-10-031-3/+3
| | | | Zeros heap memory after gc freed it.
* BlockAlloc.c: reuse tail_of functionÖmer Sinan Ağacan2018-05-111-8/+7
|
* Prefer #if defined to #ifdefBen Gamari2017-04-281-1/+1
| | | | Our new CPP linter enforces this.
* Compact RegionsGiovanni Campagna2016-07-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This brings in initial support for compact regions, as described in the ICFP 2015 paper "Efficient Communication and Collection with Compact Normal Forms" (Edward Z. Yang et.al.) and implemented by Giovanni Campagna. Some things may change before the 8.2 release, but I (Simon M.) wanted to get the main patch committed so that we can iterate. What documentation there is is in the Data.Compact module in the new compact package. We'll need to extend and polish the documentation before the release. Test Plan: validate (new test cases included) Reviewers: ezyang, simonmar, hvr, bgamari, austin Subscribers: vikraman, Yuras, RyanGlScott, qnikst, mboes, facundominguez, rrnewton, thomie, erikd Differential Revision: https://phabricator.haskell.org/D1264 GHC Trac Issues: #11493
* NUMA cleanupsSimon Marlow2016-06-171-5/+5
| | | | | - Move the numaMap and nNumaNodes out of RtsFlags to Capability.c - Add a test to tests/rts
* NUMA supportSimon Marlow2016-06-101-134/+216
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: The aim here is to reduce the number of remote memory accesses on systems with a NUMA memory architecture, typically multi-socket servers. Linux provides a NUMA API for doing two things: * Allocating memory local to a particular node * Binding a thread to a particular node When given the +RTS --numa flag, the runtime will * Determine the number of NUMA nodes (N) by querying the OS * Assign capabilities to nodes, so cap C is on node C%N * Bind worker threads on a capability to the correct node * Keep a separate free lists in the block layer for each node * Allocate the nursery for a capability from node-local memory * Allocate blocks in the GC from node-local memory For example, using nofib/parallel/queens on a 24-core 2-socket machine: ``` $ ./Main 15 +RTS -N24 -s -A64m Total time 173.960s ( 7.467s elapsed) $ ./Main 15 +RTS -N24 -s -A64m --numa Total time 150.836s ( 6.423s elapsed) ``` The biggest win here is expected to be allocating from node-local memory, so that means programs using a large -A value (as here). According to perf, on this program the number of remote memory accesses were reduced by more than 50% by using `--numa`. Test Plan: * validate * There's a new flag --debug-numa=<n> that pretends to do NUMA without actually making the OS calls, which is useful for testing the code on non-NUMA systems. * TODO: I need to add some unit tests Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D2199
* rts: Replace `nat` with `uint32_t`Erik de Castro Lopo2016-05-051-9/+9
| | | | | | | | | | | | The `nat` type was an alias for `unsigned int` with a comment saying it was at least 32 bits. We keep the typedef in case client code is using it but mark it as deprecated. Test Plan: Validated on Linux, OS X and Windows Reviewers: simonmar, austin, thomie, hvr, bgamari, hsyl20 Differential Revision: https://phabricator.haskell.org/D2166
* Cleanups related to MAX_FREE_LISTSimon Marlow2016-05-021-19/+24
| | | | | | | | | | | | | | | | - Rename to the (more correct) NUM_FREE_LISTS - NUM_FREE_LISTS should be derived from the block and mblock sizes, not defined manually. It was actually too large by one, which caused a little bit of (benign) extra work in the form of a redundant loop iteration in some cases. - Add some ASSERTs for input preconditions to log_2() and log_2_ceil() - Fix some comments - Fix usage in allocLargeChunk, to account for the fact that log_2_ceil() can return NUM_FREE_LISTS.
* Revert "Revert "Use __builtin_clz() to implement log_1()""U-THEFACEBOOK\smarlow2016-05-021-11/+27
| | | | | | | This reverts commit 546f24e4f8a7c086b1e5afcdda624176610cbcf8. And adds a fix for Windows: we need to use __builtin_clzll() rather than __builtin_clzl(), because StgWord is unsigned long long on Windows.
* Revert "Use __builtin_clz() to implement log_2()"Simon Peyton Jones2016-04-281-21/+11
| | | | This reverts commit 24864ba5587c1a0447beabae90529e8bb4fa117a.
* Use __builtin_clz() to implement log_2()Simon Marlow2016-04-261-11/+21
| | | | A microoptimisation in the block allocator.
* rts: One more Clang-unfriendly CPP usageBen Gamari2015-12-071-3/+3
|
* Two step allocator for 64-bit systemsGiovanni Campagna2015-07-221-3/+11
| | | | | | | | | | | | | | | | | | | | | | | Summary: The current OS memory allocator conflates the concepts of allocating address space and allocating memory, which makes the HEAP_ALLOCED() implementation excessively complicated (as the only thing it cares about is address space layout) and slow. Instead, what we want is to allocate a single insanely large contiguous block of address space (to make HEAP_ALLOCED() checks fast), and then commit subportions of that in 1MB blocks as we did before. This is currently behind a flag, USE_LARGE_ADDRESS_SPACE, that is only enabled for certain OSes. Test Plan: validate Reviewers: simonmar, ezyang, austin Subscribers: thomie, carter Differential Revision: https://phabricator.haskell.org/D524 GHC Trac Issues: #9706
* initGroup: only initialize the first and last blocks of a groupSimon Marlow2015-07-151-15/+11
| | | | | | | | | | | | Summary: Initialising the whole group is expensive and unnecessary. Test Plan: validate Reviewers: austin, bgamari, rwbarton Subscribers: thomie Differential Revision: https://phabricator.haskell.org/D1071
* Revert "rts: add Emacs 'Local Variables' to every .c file"Simon Marlow2014-09-291-8/+0
| | | | This reverts commit 39b5c1cbd8950755de400933cecca7b8deb4ffcd.
* rts: detabify/dewhitespace sm/BlockAlloc.cAustin Seipp2014-08-201-35/+35
| | | | Signed-off-by: Austin Seipp <austin@well-typed.com>
* rts: add Emacs 'Local Variables' to every .c fileAustin Seipp2014-07-281-0/+8
| | | | | | | | This will hopefully help ensure some basic consistency in the forward by overriding buffer variables. In particular, it sets the wrap length, the offset to 4, and turns off tabs. Signed-off-by: Austin Seipp <austin@well-typed.com>
* Make BlockAlloc.c comment slightly more accurate (fixes #8491)Edward Z. Yang2014-04-131-1/+1
| | | | Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
* Clean up block allocator, fixes #8609Edward Z. Yang2013-12-311-4/+30
| | | | Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
* Turn several nats into StgWord to avoid potential integer overflow (#7762)Simon Marlow2013-10-251-12/+13
|
* Allow allocNursery() to allocate single blocks (#7257)Simon Marlow2012-09-211-1/+7
| | | | | | | Forcing large allocations here can creates serious fragmentation in some cases, and since the large allocations are only a small optimisation we should allow the nursery to hoover up small blocks before allocating large chunks.
* Lots of nat -> StgWord changesSimon Marlow2012-09-071-14/+14
|
* Deprecate lnat, and use StgWord insteadSimon Marlow2012-09-071-3/+3
| | | | | | | | | | | | lnat was originally "long unsigned int" but we were using it when we wanted a 64-bit type on a 64-bit machine. This broke on Windows x64, where long == int == 32 bits. Using types of unspecified size is bad, but what we really wanted was a type with N bits on an N-bit machine. StgWord is exactly that. lnat was mentioned in some APIs that clients might be using (e.g. StackOverflowHook()), so we leave it defined but with a comment to say that it's deprecated.
* Some further tweaks to reduce fragmentation when allocating the nurserySimon Marlow2012-09-071-16/+32
|
* Reduce fragmentation when using +RTS -H (with or without a size)Simon Marlow2012-08-211-0/+35
|
* improve debug outputSimon Marlow2012-08-211-1/+1
|
* update debugging code for fragmentationSimon Marlow2011-01-251-2/+3
|
* Implement stack chunks and separate TSO/STACK objectsSimon Marlow2010-12-151-42/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes two changes to the way stacks are managed: 1. The stack is now stored in a separate object from the TSO. This means that it is easier to replace the stack object for a thread when the stack overflows or underflows; we don't have to leave behind the old TSO as an indirection any more. Consequently, we can remove ThreadRelocated and deRefTSO(), which were a pain. This is obviously the right thing, but the last time I tried to do it it made performance worse. This time I seem to have cracked it. 2. Stacks are now represented as a chain of chunks, rather than a single monolithic object. The big advantage here is that individual chunks are marked clean or dirty according to whether they contain pointers to the young generation, and the GC can avoid traversing clean stack chunks during a young-generation collection. This means that programs with deep stacks will see a big saving in GC overhead when using the default GC settings. A secondary advantage is that there is much less copying involved as the stack grows. Programs that quickly grow a deep stack will see big improvements. In some ways the implementation is simpler, as nothing special needs to be done to reclaim stack as the stack shrinks (the GC just recovers the dead stack chunks). On the other hand, we have to manage stack underflow between chunks, so there's a new stack frame (UNDERFLOW_FRAME), and we now have separate TSO and STACK objects. The total amount of code is probably about the same as before. There are new RTS flags: -ki<size> Sets the initial thread stack size (default 1k) Egs: -ki4k -ki2m -kc<size> Sets the stack chunk size (default 32k) -kb<size> Sets the stack chunk buffer size (default 1k) -ki was previously called just -k, and the old name is still accepted for backwards compatibility. These new options are documented.
* fix another sanity error, and refactor/tidy upSimon Marlow2010-12-091-9/+8
|
* sanity: fix places where we weren't filling fresh memory with 0xaaSimon Marlow2010-10-291-0/+2
|
* On Windows, when returning memory to the OS, we try to release itIan Lynagh2010-11-011-0/+2
| | | | as well as decommiting it.
* Return memory to the OS; trac #698Ian Lynagh2010-08-131-0/+34
|
* Cast some more nats to StgWord to be on the safe sideSimon Marlow2010-06-241-3/+13
| | | | And add a comment about the dangers of int overflow
* comments onlySimon Marlow2010-06-241-3/+2
|
* Fix an arithmetic overflow bug causing crashes with multi-GB heapsSimon Marlow2010-06-241-1/+1
|
* rts/sm/BlockAlloc.c: Small comment correction.Marco Túlio Gontijo e Silva2010-05-261-1/+1
|
* GC refactoring, remove "steps"Simon Marlow2009-12-031-2/+2
| | | | | | | | | | | | | | | | | | | | | The GC had a two-level structure, G generations each of T steps. Steps are for aging within a generation, mostly to avoid premature promotion. Measurements show that more than 2 steps is almost never worthwhile, and 1 step is usually worse than 2. In theory fractional steps are possible, so the ideal number of steps is somewhere between 1 and 3. GHC's default has always been 2. We can implement 2 steps quite straightforwardly by having each block point to the generation to which objects in that block should be promoted, so blocks in the nursery point to generation 0, and blocks in gen 0 point to gen 1, and so on. This commit removes the explicit step structures, merging generations with steps, thus simplifying a lot of code. Performance is unaffected. The tunable number of steps is now gone, although it may be replaced in the future by a way to tune the aging in generation 0.
* Refactoring onlySimon Marlow2009-12-021-0/+34
|
* Store a destination step in the block descriptorSimon Marlow2009-11-291-0/+1
| | | | | | | At the moment, this just saves a memory reference in the GC inner loop (worth a percent or two of GC time). Later, it will hopefully let me experiment with partial steps, and simplifying the generation/step infrastructure.
* RTS tidyup sweep, first phaseSimon Marlow2009-08-021-21/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first phase of this tidyup is focussed on the header files, and in particular making sure we are exposinng publicly exactly what we need to, and no more. - Rts.h now includes everything that the RTS exposes publicly, rather than a random subset of it. - Most of the public header files have moved into subdirectories, and many of them have been renamed. But clients should not need to include any of the other headers directly, just #include the main public headers: Rts.h, HsFFI.h, RtsAPI.h. - All the headers needed for via-C compilation have moved into the stg subdirectory, which is self-contained. Most of the headers for the rest of the RTS APIs have moved into the rts subdirectory. - I left MachDeps.h where it is, because it is so widely used in Haskell code. - I left a deprecated stub for RtsFlags.h in place. The flag structures are now exposed by Rts.h. - Various internal APIs are no longer exposed by public header files. - Various bits of dead code and declarations have been removed - More gcc warnings are turned on, and the RTS code is more warning-clean. - More source files #include "PosixSource.h", and hence only use standard POSIX (1003.1c-1995) interfaces. There is a lot more tidying up still to do, this is just the first pass. I also intend to standardise the names for external RTS APIs (e.g use the rts_ prefix consistently), and declare the internal APIs as hidden for shared libraries.
* Fix some bugs in the stack-reducing code (#2571)Simon Marlow2008-09-121-4/+12
|
* when a memory leak is detected, report which blocks are unreachableSimon Marlow2008-09-091-0/+32
|
* fix a tiny bug spotted by gcc 4.3Simon Marlow2008-06-191-1/+1
|
* fix allocated blocks calculation, and add more sanity checksSimon Marlow2008-06-081-10/+24
|