| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In order to make the packages in this repo "reinstallable", we need to
associate source code with a specific packages. Having a top level
`/includes` dir that mixes concerns (which packages' includes?) gets in
the way of this.
To start, I have moved everything to `rts/`, which is mostly correct.
There are a few things however that really don't belong in the rts (like
the generated constants haskell type, `CodeGen.Platform.h`). Those
needed to be manually adjusted.
Things of note:
- No symlinking for sake of windows, so we hard-link at configure time.
- `CodeGen.Platform.h` no longer as `.hs` extension (in addition to
being moved to `compiler/`) so as not to confuse anyone, since it is
next to Haskell files.
- Blanket `-Iincludes` is gone in both build systems, include paths now
more strictly respect per-package dependencies.
- `deriveConstants` has been taught to not require a `--target-os` flag
when generating the platform-agnostic Haskell type. Make takes
advantage of this, but Hadrian has yet to.
|
|
|
|
| |
This enables a registerised build for the riscv64 architecture.
|
|
|
|
|
|
|
|
| |
The algorithm described in the referenced paper uses this slightly
weaker atomic op.
This is the first "exotic" cas we're using. I've added a macro in the
<ORDERING>_OP style to match existing ones.
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
After a few attempts at shoring up the previous implementation, I ended
up turning to the literature and now use the proven implementation,
> N.M. Lê, A. Pop, A.Cohen, and F.Z. Nardelli. "Correct and Efficient
> Work-Stealing for Weak Memory Models". PPoPP'13, February 2013,
> ACM 978-1-4503-1922/13/02.
Note only is this approach formally proven correct under C11 semantics
but it is also proved to be a bit faster in practice.
|
|/ |
|
| |
|
| |
|
| |
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This introduces a concurrent mark & sweep garbage collector to manage the old
generation. The concurrent nature of this collector typically results in
significantly reduced maximum and mean pause times in applications with large
working sets.
Due to the large and intricate nature of the change I have opted to
preserve the fully-buildable history, including merge commits, which is
described in the "Branch overview" section below.
Collector design
================
The full design of the collector implemented here is described in detail
in a technical note
> B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
> Compiler" (2018)
This document can be requested from @bgamari.
The basic heap structure used in this design is heavily inspired by
> K. Ueno & A. Ohori. "A fully concurrent garbage collector for
> functional programs on multicore processors." /ACM SIGPLAN Notices/
> Vol. 51. No. 9 (presented at ICFP 2016)
This design is intended to allow both marking and sweeping
concurrent to execution of a multi-core mutator. Unlike the Ueno design,
which requires no global synchronization pauses, the collector
introduced here requires a stop-the-world pause at the beginning and end
of the mark phase.
To avoid heap fragmentation, the allocator consists of a number of
fixed-size /sub-allocators/. Each of these sub-allocators allocators into
its own set of /segments/, themselves allocated from the block
allocator. Each segment is broken into a set of fixed-size allocation
blocks (which back allocations) in addition to a bitmap (used to track
the liveness of blocks) and some additional metadata (used also used
to track liveness).
This heap structure enables collection via mark-and-sweep, which can be
performed concurrently via a snapshot-at-the-beginning scheme (although
concurrent collection is not implemented in this patch).
Implementation structure
========================
The majority of the collector is implemented in a handful of files:
* `rts/Nonmoving.c` is the heart of the beast. It implements the entry-point
to the nonmoving collector (`nonmoving_collect`), as well as the allocator
(`nonmoving_allocate`) and a number of utilities for manipulating the heap.
* `rts/NonmovingMark.c` implements the mark queue functionality, update
remembered set, and mark loop.
* `rts/NonmovingSweep.c` implements the sweep loop.
* `rts/NonmovingScav.c` implements the logic necessary to scavenge the
nonmoving heap.
Branch overview
===============
```
* wip/gc/opt-pause:
| A variety of small optimisations to further reduce pause times.
|
* wip/gc/compact-nfdata:
| Introduce support for compact regions into the non-moving
|\ collector
| \
| \
| | * wip/gc/segment-header-to-bdescr:
| | | Another optimization that we are considering, pushing
| | | some segment metadata into the segment descriptor for
| | | the sake of locality during mark
| | |
| * | wip/gc/shortcutting:
| | | Support for indirection shortcutting and the selector optimization
| | | in the non-moving heap.
| | |
* | | wip/gc/docs:
| |/ Work on implementation documentation.
| /
|/
* wip/gc/everything:
| A roll-up of everything below.
|\
| \
| |\
| | \
| | * wip/gc/optimize:
| | | A variety of optimizations, primarily to the mark loop.
| | | Some of these are microoptimizations but a few are quite
| | | significant. In particular, the prefetch patches have
| | | produced a nontrivial improvement in mark performance.
| | |
| | * wip/gc/aging:
| | | Enable support for aging in major collections.
| | |
| * | wip/gc/test:
| | | Fix up the testsuite to more or less pass.
| | |
* | | wip/gc/instrumentation:
| | | A variety of runtime instrumentation including statistics
| | / support, the nonmoving census, and eventlog support.
| |/
| /
|/
* wip/gc/nonmoving-concurrent:
| The concurrent write barriers.
|
* wip/gc/nonmoving-nonconcurrent:
| The nonmoving collector without the write barriers necessary
| for concurrent collection.
|
* wip/gc/preparation:
| A merge of the various preparatory patches that aren't directly
| implementing the GC.
|
|
* GHC HEAD
.
.
.
```
|
| | |
|
|/
|
|
|
|
| |
This patch adds support for the s390x architecture for the LLVM code
generator. The patch includes a register mapping of STG registers onto
s390x machine registers which enables a registerised build.
|
|
|
|
|
|
| |
Add StgToCmm module hierarchy. Platform modules that are used in several
other places (NCG, LLVM codegen, Cmm transformations) are put into
GHC.Platform.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here the following changes are introduced:
- A read barrier machine op is added to Cmm.
- The order in which a closure's fields are read and written is changed.
- Memory barriers are added to RTS code to ensure correctness on
out-or-order machines with weak memory ordering.
Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this
is lowered to an instruction that ensures memory reads that occur after said
instruction in program order are not performed before reads coming before said
instruction in program order. On machines with strong memory ordering properties
(e.g. X86, SPARC in TSO mode) no such instruction is necessary, so
MO_ReadBarrier is simply erased. However, such an instruction is necessary on
weakly ordered machines, e.g. ARM and PowerPC.
Weam memory ordering has consequences for how closures are observed and mutated.
For example, consider a closure that needs to be updated to an indirection. In
order for the indirection to be safe for concurrent observers to enter, said
observers must read the indirection's info table before they read the
indirectee. Furthermore, the entering observer makes assumptions about the
closure based on its info table contents, e.g. an INFO_TYPE of IND imples the
closure has an indirectee pointer that is safe to follow.
When a closure is updated with an indirection, both its info table and its
indirectee must be written. With weak memory ordering, these two writes can be
arbitrarily reordered, and perhaps even interleaved with other threads' reads
and writes (in the absence of memory barrier instructions). Consider this
example of a bad reordering:
- An updater writes to a closure's info table (INFO_TYPE is now IND).
- A concurrent observer branches upon reading the closure's INFO_TYPE as IND.
- A concurrent observer reads the closure's indirectee and enters it. (!!!)
- An updater writes the closure's indirectee.
Here the update to the indirectee comes too late and the concurrent observer has
jumped off into the abyss. Speculative execution can also cause us issues,
consider:
- An observer is about to case on a value in closure's info table.
- The observer speculatively reads one or more of closure's fields.
- An updater writes to closure's info table.
- The observer takes a branch based on the new info table value, but with the
old closure fields!
- The updater writes to the closure's other fields, but its too late.
Because of these effects, reads and writes to a closure's info table must be
ordered carefully with respect to reads and writes to the closure's other
fields, and memory barriers must be placed to ensure that reads and writes occur
in program order. Specifically, updates to a closure must follow the following
pattern:
- Update the closure's (non-info table) fields.
- Write barrier.
- Update the closure's info table.
Observing a closure's fields must follow the following pattern:
- Read the closure's info pointer.
- Read barrier.
- Read the closure's (non-info table) fields.
This patch updates RTS code to obey this pattern. This should fix long-standing
SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting
out-of-order execution) and PowerPC. This fixes issue #15449.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This moves all URL references to Trac Wiki to their corresponding
GitLab counterparts.
This substitution is classified as follows:
1. Automated substitution using sed with Ben's mapping rule [1]
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...
2. Manual substitution for URLs containing `#` index
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz
3. Manual substitution for strings starting with `Commentary`
Old: Commentary/XxxYyy...
New: commentary/xxx-yyy...
See also !539
[1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The C code in the RTS now gets built with `-Wundef` and the Haskell code
(stages 1 and 2 only) with `-Wcpp-undef`. We now get warnings whereever
`#if` is used on undefined identifiers.
Test Plan: Validate on Linux and Windows
Reviewers: austin, angerman, simonmar, bgamari, Phyx
Reviewed By: bgamari
Subscribers: thomie, snowleopard
Differential Revision: https://phabricator.haskell.org/D3278
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This both says what we mean and silences a bunch of spurious CPP linting
warnings. This pragma is supported by all CPP implementations which we
support.
Reviewers: austin, erikd, simonmar, hvr
Reviewed By: simonmar
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3482
|
|
|
|
|
|
|
|
| |
This is causing too much platform dependent breakage at the moment. We
will need a more rigorous testing strategy before this can be
merged again.
This reverts commit 7e340c2bbf4a56959bd1e95cdd1cfdb2b7e537c2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The C code in the RTS now gets built with `-Wundef` and the Haskell code
(stages 1 and 2 only) with `-Wcpp-undef`. We now get warnings whereever
`#if` is used on undefined identifiers.
Test Plan: Validate on Linux and Windows
Reviewers: austin, angerman, simonmar, bgamari, Phyx
Reviewed By: bgamari
Subscribers: thomie, snowleopard
Differential Revision: https://phabricator.haskell.org/D3278
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use C compiler builtins for atomic SMP primitives. This saves a lot
of CPP ifdefs.
Add test for atomic xchg:
Test if __sync_lock_test_and_set() builtin stores the second argument.
The gcc manual says the actual value stored is implementation defined.
Test Plan: validate and eyeball generated assembler code
Reviewers: kgardas, simonmar, hvr, bgamari, austin, erikd
Reviewed By: simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2233
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The SMP primitives were missing appropriate memory barriers
(sync, isync instructions) on all PowerPCs.
Use the built-ins _sync_* provided by gcc and clang. This
reduces code size significantly.
Remove broken mark for concprog001 on powerpc64. The referenced
ticket number (11259) was wrong.
Test Plan: validate on powerpc and ARM
Reviewers: erikd, austin, simonmar, bgamari, hvr
Reviewed By: bgamari, hvr
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2225
GHC Trac Issues: #12070
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unfortunately (for inline `__asm__()` uses), IBM's `as` doesn't seem to support
local labels[1] like GNU `as` does so we need to workaround this when on AIX.
[1]: https://sourceware.org/binutils/docs/as/Symbol-Names.html#Symbol-Names
Turns out this also addresses the long-standing bug #485
Reviewed By: bgamari, trommler
Differential Revision: https://phabricator.haskell.org/D2029
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We are unable to produce load/store barriers for pre-ARMv7 targets.
Phab:D894 added dummy cases to SMP.h for these barriers to prevent the
build from failing under the assumption that there are no SMP-capable
devices of this vintage. However, #10433 points out that it is more
correct to simply set NOSMP for such targets.
Tested By: rwbarton
Test Plan: Validate
Reviewers: erikd, rwbarton, austin
Reviewed By: rwbarton
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1704
GHC Trac Issues: #10433
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some headers used `new` as parameter name, which is reserved word in
C++. This patch changes these names to `new_`.
Test Plan: validate
Reviewers: austin, ezyang, bgamari, simonmar
Reviewed By: simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1107
GHC Trac Issues: #10700
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Extend the PowerPC 32-bit native code generator for "64-bit
PowerPC ELF Application Binary Interface Supplement 1.9" by
Ian Lance Taylor and "Power Architecture 64-Bit ELF V2 ABI Specification --
OpenPOWER ABI for Linux Supplement" by IBM.
The latter ABI is mainly used on POWER7/7+ and POWER8
Linux systems running in little-endian mode. The code generator
supports both static and dynamic linking. PowerPC 64-bit
code for ELF ABI 1.9 and 2 is mostly position independent
anyway, and thus so is all the code emitted by the code
generator. In other words, -fPIC does not make a difference.
rts/stg/SMP.h support is implemented.
Following the spirit of the introductory comment in
PPC/CodeGen.hs, the rest of the code is a straightforward
extension of the 32-bit implementation.
Limitations:
* Code is generated only in the medium code model, which
is also gcc's default
* Local symbols are not accessed directly, which seems to
also be the case for 32-bit
* LLVM does not work, but this does not work on 32-bit either
* Must use the system runtime linker in GHCi, because the
GHC linker for "static" object files (rts/Linker.c) for
PPC 64-bit is not implemented. The system runtime
(dynamic) linker works.
* The handling of the system stack (register 1) is not ELF-
compliant so stack traces break. Instead of allocating a new
stack frame, spill code should use the "official" spill area
in the current stack frame and deallocation code should restore
the back chain
* DWARF support is missing
Fixes #9863
Test Plan: validate (on powerpc, too)
Reviewers: simonmar, trofi, erikd, austin
Reviewed By: trofi
Subscribers: bgamari, arnons1, kgardas, thomie
Differential Revision: https://phabricator.haskell.org/D629
GHC Trac Issues: #9863
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the previous implementation, the `stlxr` instruction clobbered
the value that was supposed to be returned by the the `xchg`
function.
Signed-off-by: Erik de Castro Lopo <erikd@mega-nerd.com>
Test Plan: build on aarch64
Reviewers: austin, bgamari, rwbarton
Subscribers: bgamari, thomie
Differential Revision: https://phabricator.haskell.org/D932
|
|
|
|
|
|
|
|
|
|
|
|
| |
As Reid mentioned in a comment on D894, the case fixed by this revision
likely isn't really correct, because old ARM binaries could run on newer
machines, meaning we need to detect at runtime whether we need a proper
barrier.
But in the mean time, this actually stops the build from failing - which
is better off. So we'll just remember this when we fix it in the future.
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Assuming there is no real SMP systems on these CPUs
I've added only compiler barrier (otherwise write_barrier
and friends need to be fixed as well).
Patch also fixes build breakage reported in #10244.
Signed-off-by: Sergei Trofimovich <siarheit@google.com>
Reviewers: rwbarton, nomeata, austin
Reviewed By: nomeata, austin
Subscribers: bgamari, thomie
Differential Revision: https://phabricator.haskell.org/D894
GHC Trac Issues: #10244
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
For compatibility with ARM machines from pre v6, the RTS provides
implementations of certain atomic operations. Previously, these
were only included in the threaded RTS.
But ghc (the library) contains the code in compiler/cbits/genSym.c, which
uses these operations if there is more than one capability. But there is only
one libHSghc, so the linker wants to resolve these symbols in every case.
By providing these operations in all RTSs, the linker is happy. The only
downside is a small amount of dead code in the non-threaded RTS on old ARM
machines.
Test Plan: It helped here.
Reviewers: bgamari, austin
Reviewed By: bgamari, austin
Subscribers: carter, thomie
Differential Revision: https://phabricator.haskell.org/D564
GHC Trac Issues: #8951
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
git history does not contain state where 'WITHSMP' macro was ever defined.
It should have always been '!NOSMP'.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Test Plan: build tested
Reviewers: austin, simonmar
Reviewed By: austin, simonmar
Subscribers: simonmar, relrod, carter
Differential Revision: https://phabricator.haskell.org/D33
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
*p is both read and written to by the cmpxchg instruction, and therefore
should be given the '+' constraint modifier.
(In GCC's extended ASM language, '+' means that the operand is both read
and written to whereas '=' means that it is only written to.)
Otherwise, the compiler is allowed to rewrite something like
SpinLock lock;
initSpinLock(&lock); /* sets lock = 1 */
ACQUIRE_SPIN_LOCK(&lock);
into
SpinLock lock;
ACQUIRE_SPIN_LOCK(&lock);
because according to the asm statement, the previous value of 'lock' is
not important.
|
|\ |
|
| |
| |
| |
| | |
#define. (fixes #8077)
|
| | |
|
|/ |
|
|
|
|
|
|
|
| |
When the bootstrap compiler does not include this patch, you must add this line
to mk/build.mk, otherwise the ARM architecture cannot be detected due to a
-undef option given to the C pre-processor.
SRC_HC_OPTS = -pgmP 'gcc -E -traditional'
|
|
|
|
|
|
| |
This patch fixes RTS' xchg and cas functions. On ARMv7 it is recommended
to add memory barrier after using ldrex/strex for implementing atomic
lock or operation.
|
|
|
|
|
|
| |
This patch provides implementation of ARMv7 specific memory barriers.
It uses dmb sy isn (or shortly dmb) for store/load and load/load barriers
and dmb st isn for store/store barrier.
|
| |
|
| |
|
|
|
|
|
|
| |
This is the Stephen Blackheath's GHC/ARM registerised port
which is using modified version of LLVM and which provides
basic registerised build functionality
|
| |
|
|
|
|
|
| |
This is more pleasant than having the C generator check whether the
function it's calling is cas, and not generate a prototype if so.
|
| |
|
| |
|