| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
| |
Previously we would leave the card table of new arrays uninitialized.
This wasn't a soundness issue: at worst we would end up doing
unnecessary scavenging during GC, after which the card table would be
reset. That being said, it seems worth initializing this properly to
avoid both unnecessary work and non-determinism.
Fixes #19143.
|
|
|
|
|
|
| |
But only when profiling or DEBUG are enabled.
Fixes #17572.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously `TICK_BUMP_BY` was defined as
```c
#define TICK_BUMP_BY(ctr,n) CLong[ctr] = CLong[ctr] + n
```
Yet the tickers themselves were defined as `StgInt`s. This happened to
work out correctly on Linux, where `CLong` is 64-bits. However, it
failed on Windows, where `CLong` is 32-bits, resulting in #18782.
Fixes #18783.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We don't want to save both Fn and Dn register sets on x86-64 as they are
aliased to the same arch register (XMMn).
Moreover, when SAVE_STGREGS was used in conjunction with `jump foo [*]`
which makes a set of Cmm registers alive so that they cover all arch
registers used to pass parameter, we could have Fn, Dn and XMMn alive at
the same time. It made the LLVM code generator choke (see #17920).
Now `SAVE_REGS/RESTORE_REGS` and `jump foo [*]` use the same set of
registers.
|
|
|
|
|
|
|
|
|
|
|
| |
The code is just more confusing than it needs to be. We don't need to mix
the threaded check with the ldv profiling check since ldv's init already
checks for this. Hence they can be two separate checks. Taking the sanity
checking into account is also cleaner via DebugFlags.sanity. No need for
checking the DEBUG define.
The ZERO_SLOP_FOR_LDV_PROF and ZERO_SLOP_FOR_SANITY_CHECK definitions the
old code had also make things a lot more opaque IMO so I removed those.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This extends the non-moving collector to allow concurrent collection.
The full design of the collector implemented here is described in detail
in a technical note
B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
Compiler" (2018)
This extension involves the introduction of a capability-local
remembered set, known as the /update remembered set/, which tracks
objects which may no longer be visible to the collector due to mutation.
To maintain this remembered set we introduce a write barrier on
mutations which is enabled while a concurrent mark is underway.
The update remembered set representation is similar to that of the
nonmoving mark queue, being a chunked array of `MarkEntry`s. Each
`Capability` maintains a single accumulator chunk, which it flushed
when it (a) is filled, or (b) when the nonmoving collector enters its
post-mark synchronization phase.
While the write barrier touches a significant amount of code it is
conceptually straightforward: the mutator must ensure that the referee
of any pointer it overwrites is added to the update remembered set.
However, there are a few details:
* In the case of objects with a dirty flag (e.g. `MVar`s) we can
exploit the fact that only the *first* mutation requires a write
barrier.
* Weak references, as usual, complicate things. In particular, we must
ensure that the referee of a weak object is marked if dereferenced by
the mutator. For this we (unfortunately) must introduce a read
barrier, as described in Note [Concurrent read barrier on deRefWeak#]
(in `NonMovingMark.c`).
* Stable names are also a bit tricky as described in Note [Sweeping
stable names in the concurrent collector] (`NonMovingSweep.c`).
We take quite some pains to ensure that the high thread count often seen
in parallel Haskell applications doesn't affect pause times. To this end
we allow thread stacks to be marked either by the thread itself (when it
is executed or stack-underflows) or the concurrent mark thread (if the
thread owning the stack is never scheduled). There is a non-trivial
handshake to ensure that this happens without racing which is described
in Note [StgStack dirtiness flags and concurrent marking].
Co-Authored-by: Ömer Sinan Ağacan <omer@well-typed.com>
|
|
|
|
|
|
|
| |
Until 0472f0f6a92395d478e9644c0dbd12948518099f there was a meaningful
host vs target distinction (though it wasn't used right, in genapply).
After that, they did not differ in meaningful ways, so it's best to just
only keep one.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here the following changes are introduced:
- A read barrier machine op is added to Cmm.
- The order in which a closure's fields are read and written is changed.
- Memory barriers are added to RTS code to ensure correctness on
out-or-order machines with weak memory ordering.
Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this
is lowered to an instruction that ensures memory reads that occur after said
instruction in program order are not performed before reads coming before said
instruction in program order. On machines with strong memory ordering properties
(e.g. X86, SPARC in TSO mode) no such instruction is necessary, so
MO_ReadBarrier is simply erased. However, such an instruction is necessary on
weakly ordered machines, e.g. ARM and PowerPC.
Weam memory ordering has consequences for how closures are observed and mutated.
For example, consider a closure that needs to be updated to an indirection. In
order for the indirection to be safe for concurrent observers to enter, said
observers must read the indirection's info table before they read the
indirectee. Furthermore, the entering observer makes assumptions about the
closure based on its info table contents, e.g. an INFO_TYPE of IND imples the
closure has an indirectee pointer that is safe to follow.
When a closure is updated with an indirection, both its info table and its
indirectee must be written. With weak memory ordering, these two writes can be
arbitrarily reordered, and perhaps even interleaved with other threads' reads
and writes (in the absence of memory barrier instructions). Consider this
example of a bad reordering:
- An updater writes to a closure's info table (INFO_TYPE is now IND).
- A concurrent observer branches upon reading the closure's INFO_TYPE as IND.
- A concurrent observer reads the closure's indirectee and enters it. (!!!)
- An updater writes the closure's indirectee.
Here the update to the indirectee comes too late and the concurrent observer has
jumped off into the abyss. Speculative execution can also cause us issues,
consider:
- An observer is about to case on a value in closure's info table.
- The observer speculatively reads one or more of closure's fields.
- An updater writes to closure's info table.
- The observer takes a branch based on the new info table value, but with the
old closure fields!
- The updater writes to the closure's other fields, but its too late.
Because of these effects, reads and writes to a closure's info table must be
ordered carefully with respect to reads and writes to the closure's other
fields, and memory barriers must be placed to ensure that reads and writes occur
in program order. Specifically, updates to a closure must follow the following
pattern:
- Update the closure's (non-info table) fields.
- Write barrier.
- Update the closure's info table.
Observing a closure's fields must follow the following pattern:
- Read the closure's info pointer.
- Read barrier.
- Read the closure's (non-info table) fields.
This patch updates RTS code to obey this pattern. This should fix long-standing
SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting
out-of-order execution) and PowerPC. This fixes issue #15449.
Co-Authored-By: Ben Gamari <ben@well-typed.com>
|
|
|
|
|
|
|
| |
The primop stgFloatToWord32 was sign-extending the 32-bit word, resulting
in weird negative Word32s. Zero-extend them instead.
Closes #16617.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This moves all URL references to Trac Wiki to their corresponding
GitLab counterparts.
This substitution is classified as follows:
1. Automated substitution using sed with Ben's mapping rule [1]
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...
2. Manual substitution for URLs containing `#` index
Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz
3. Manual substitution for strings starting with `Commentary`
Old: Commentary/XxxYyy...
New: commentary/xxx-yyy...
See also !539
[1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As #15571 reports, eager blackholing breaks sanity checks as we can't
zero the payload when eagerly blackholing (because we'll be using the
payload after blackholing), but by the time we blackhole a previously
eagerly blackholed object (in `threadPaused()`) we don't have the
correct size information for the object (because the object's type
becomes BLACKHOLE when we eagerly blackhole it) so can't properly zero
the slop.
This problem can be solved for AP_STACK eager blackholing (which unlike
eager blackholing in general, is not optional) by zeroing the payload
after entering the stack. This patch implements this idea.
Fixes #15571.
Test Plan:
Previously concprog001 when compiled and run with sanity checks
ghc-stage2 Mult.hs -debug -rtsopts
./Mult +RTS -DS
was failing with
Mult: internal error: checkClosure: stack frame
(GHC version 8.7.20180821 for x86_64_unknown_linux)
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
thic patch fixes this panic. The test still panics, but it runs for a while
before panicking (instead of directly panicking as before), and the new problem
seems unrelated:
Mult: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 296
(GHC version 8.7.20180919 for x86_64_unknown_linux)
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
The new problem will be fixed in another diff.
I also tried slow validate (which requires D5164): this does not introduce any
new failures.
Reviewers: simonmar, bgamari, erikd
Reviewed By: simonmar
Subscribers: rwbarton, carter
GHC Trac Issues: #15571
Differential Revision: https://phabricator.haskell.org/D5165
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It seems like we currently support string literals in Cmm, so we can use
__LINE__ CPP macro in assertion macros. This improves error messages
that previously looked like
ASSERTION FAILED: file (null), line 1302
(null) part now shows the actual file name.
Also inline some single-use string literals in PrimOps.cmm.
Reviewers: bgamari, simonmar, erikd
Reviewed By: bgamari
Subscribers: rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4862
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This feature has some very serious correctness issues (#14310),
introduces a great deal of complexity, and hasn't seen wide usage.
Consequently we are removing it, as proposed in Proposal #77 [1]. This
is heavily based on a patch from fryguybob.
Updates stm submodule.
[1] https://github.com/ghc-proposals/ghc-proposals/pull/77
Test Plan: Validate
Reviewers: erikd, simonmar, hvr
Reviewed By: simonmar
Subscribers: rwbarton, thomie, carter
GHC Trac Issues: #14310
Differential Revision: https://phabricator.haskell.org/D4760
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
get/setAllocationCounter didn't take into account allocations in the
current block. This was known at the time, but it turns out to be
important to have more accuracy when using these in a fine-grained
way.
Test Plan:
New unit test to test incrementally larger allocaitons. Before I got
results like this:
```
+0
+0
+0
+0
+0
+4096
+0
+0
+0
+0
+0
+4064
+0
+0
+4088
+4056
+0
+0
+0
+4088
+4096
+4056
+4096
```
Notice how the results aren't always monotonically increasing. After
this patch:
```
+344
+416
+488
+560
+632
+704
+776
+848
+920
+992
+1064
+1136
+1208
+1280
+1352
+1424
+1496
+1568
+1640
+1712
+1784
+1856
+1928
+2000
+2072
+2144
```
Reviewers: hvr, erikd, simonmar, jrtc27, trommler
Reviewed By: simonmar
Subscribers: trommler, jrtc27, rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4363
|
|
|
|
| |
Our new CPP linter enforces this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While looking at #13615 I noticed that there was this strange open-coded
memcpy in the definition of the cloneArray macro. I don't see why this
should be preferable to memcpy.
Test Plan: Validate, particularly focusing on array operations
Reviewers: simonmar, tibbe, austin, alexbiehl
Reviewed By: tibbe, alexbiehl
Subscribers: alexbiehl, rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3504
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This both says what we mean and silences a bunch of spurious CPP linting
warnings. This pragma is supported by all CPP implementations which we
support.
Reviewers: austin, erikd, simonmar, hvr
Reviewed By: simonmar
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3482
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fixes some cases of wrong stacks being generated by the profiler.
For background and details on the fix see
`Note [Evaluating functions with profiling]` in `rts/Apply.cmm`.
This does have an impact on allocations for some programs when
profiling. nofib results:
```
k-nucleotide +0.0% +8.8% +11.0% +11.0% 0.0%
puzzle +0.0% +12.5% 0.244 0.246 0.0%
typecheck 0.0% +8.7% +16.1% +16.2% 0.0%
------------------------------------------------------------------------
--------
Min -0.0% -0.0% -34.4% -35.5% -25.0%
Max +0.0% +12.5% +48.9% +49.4% +10.6%
Geometric Mean +0.0% +0.6% +2.0% +1.8% -0.3%
```
But runtimes don't seem to be affected much, and the examples I looked
at were completely legitimate. For example, in puzzle we have this:
```
position :: ItemType -> StateType -> BankType
position Bono = bonoPos
position Edge = edgePos
position Larry = larryPos
position Adam = adamPos
```
where the identifiers on the rhs are all record selectors. Previously
the profiler gave a stack that looked like
```
position
bonoPos
...
```
i.e. `bonoPos` was at the same level of the call stack as `position`,
but now it looks like
```
position
bonoPos
...
```
I used the normaliser from the testsuite to diff the profiling output
from other nofib programs and they all looked better.
Test Plan:
* the broken test passes
* validate
* compiled and ran all of nofib, measured perf, diff'd several .prof
files
Reviewers: niteria, erikd, austin, scpmw, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2804
GHC Trac Issues: #5654, #10007
|
|
|
|
|
|
|
|
|
|
|
|
| |
Test Plan: Validate on lots of platforms
Reviewers: erikd, simonmar, austin
Reviewed By: erikd, simonmar
Subscribers: michalt, thomie
Differential Revision: https://phabricator.haskell.org/D2699
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adjust `CmmParse.y` to parse the `cmpxchg{8, 16, 32, 64}` instructions
and use the 32 respectively the 64 bit variant in `Primops.cmm`. This
effectively eliminates the compare-and-swap ccall to the rts.
Based off the mailing list question from @osa1
(https://mail.haskell.org/pipermail/ghc-devs/2016-July/012506.html).
Reviewers: simonmar, austin, erikd, bgamari, trommler
Reviewed By: erikd, bgamari, trommler
Subscribers: carter, trommler, osa1, thomie
Differential Revision: https://phabricator.haskell.org/D2431
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep a separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
```
$ ./Main 15 +RTS -N24 -s -A64m
Total time 173.960s ( 7.467s elapsed)
$ ./Main 15 +RTS -N24 -s -A64m --numa
Total time 150.836s ( 6.423s elapsed)
```
The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).
According to perf, on this program the number of remote memory accesses
were reduced by more than 50% by using `--numa`.
Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
actually making the OS calls, which is useful for testing the code
on non-NUMA systems.
* TODO: I need to add some unit tests
Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2199
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
it seems that this closure type has not been in use since 5d52d9, so all
this is dead and untested code. This removes it. Some of the code might
be useful for a counting indirection as described in #10613, so when
implementing that, have a look at what this commit removes.
Test Plan: validate on harbormaster
Reviewers: austin, bgamari, simonmar
Reviewed By: simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D1821
|
|
|
|
|
|
|
|
|
|
| |
Rename StgArrWords to StgArrBytes (see Trac #8552)
Reviewed By: austin
Differential Revision: https://phabricator.haskell.org/D1233
GHC Trac Issues: #8552
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Alignment needs to be a compile-time constant. Previously the code
generators had to jump through hoops to ensure this was the case as the
alignment was passed as a CmmExpr in the arguments list. Now we take
care of this up front.
This fixes #8131.
Authored-by: Reid Barton <rwbarton@gmail.com>
Dusted-off-by: Ben Gamari <ben@smart-cactus.org>
Tests for T8131
Test Plan: Validate
Reviewers: rwbarton, austin
Reviewed By: rwbarton, austin
Subscribers: bgamari, carter, thomie
Differential Revision: https://phabricator.haskell.org/D624
GHC Trac Issues: #8131
|
|
|
|
|
|
|
| |
This reverts commit 35672072b4091d6f0031417bc160c568f22d0469.
Conflicts:
compiler/main/DriverPipeline.hs
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
In preparation for indirecting all references to closures,
we rename _closure to _static_closure to ensure any old code
will get an undefined symbol error. In order to reference
a closure foobar_closure (which is now undefined), you should instead
use STATIC_CLOSURE(foobar). For convenience, a number of these
old identifiers are macro'd.
Across C-- and C (Windows and otherwise), there were differing
conventions on whether or not foobar_closure or &foobar_closure
was the address of the closure. Now, all foobar_closure references
are addresses, and no & is necessary.
CHARLIKE/INTLIKE were not changed, simply alpha-renamed.
Part of remove HEAP_ALLOCED patch set (#8199)
Depends on D265
Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
Test Plan: validate
Reviewers: simonmar, austin
Subscribers: simonmar, ezyang, carter, thomie
Differential Revision: https://phabricator.haskell.org/D267
GHC Trac Issues: #8199
|
|
|
|
| |
Signed-off-by: Austin Seipp <austin@well-typed.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The two new primops with the type-signatures
resizeMutableByteArray# :: MutableByteArray# s -> Int#
-> State# s -> (# State# s, MutableByteArray# s #)
shrinkMutableByteArray# :: MutableByteArray# s -> Int#
-> State# s -> State# s
allow to resize MutableByteArray#s in-place (when possible), and are useful
for algorithms where memory is temporarily over-allocated. The motivating
use-case is for implementing integer backends, where the final target size of
the result is either N or N+1, and only known after the operation has been
performed.
A future commit will implement a stateful variant of the
`sizeofMutableByteArray#` operation (see #9447 for details), since now the
size of a `MutableByteArray#` may change over its lifetime (i.e before
it gets frozen or GCed).
Test Plan: ./validate --slow
Reviewers: ezyang, austin, simonmar
Reviewed By: austin, simonmar
Differential Revision: https://phabricator.haskell.org/D133
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These array types are smaller than Array# and MutableArray# and are
faster when the array size is small, as they don't have the overhead
of a card table. Having no card table reduces the closure size with 2
words in the typical small array case and leads to less work when
updating or GC:ing the array.
Reduces both the runtime and memory allocation by 8.8% on my insert
benchmark for the HashMap type in the unordered-containers package,
which makes use of lots of small arrays. With tuned GC settings
(i.e. `+RTS -A6M`) the runtime reduction is 15%.
Fixes #8923.
|
|
|
|
|
|
| |
This should reduce code size when there's little to gain from inlining
these primops, while still retaining the inlining benefit when the
size of the copy is known statically.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The inline allocation version is 69% faster than the out-of-line
version, when cloning an array of 16 unit elements on a 64-bit
machine.
Comparing the new and the old primop implementations isn't
straightforward. The old version had a missing heap check that I
discovered during the development of the new version. Comparing the
old and the new version would requiring fixing the old version, which
in turn means reimplementing the equivalent of MAYBE_CG in StgCmmPrim.
The inline allocation threshold is configurable via
-fmax-inline-alloc-size which gives the maximum array size, in bytes,
to allocate inline. The size does not include the closure header size.
Allowing the same primop to be either inline or out-of-line has some
implication for how we lay out heap checks. We always place a heap
check around out-of-line primops, as they may allocate outside of our
knowledge. However, for the inline primops we only allow allocation
via the standard means (i.e. virtHp). Since the clone primops might be
either inline or out-of-line the heap check layout code now consults
shouldInlinePrimOp to know whether a primop will be inlined.
|
|
|
|
|
|
|
|
|
|
| |
We were passing the function address to stg_gc_prim_p in R9, which was
wrong because the call was a high-level call and didn't declare R9 as
a parameter. Passing R9 as an argument is the right way, but
unfortunately that exposed another bug: we were using the same macro
in some low-level Cmm, where it is illegal to call functions with
arguments (see Note [syntax of cmm files]). So we now have low-level
variants of STK_CHK() and STK_CHK_P() for use in low-level Cmm code.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* the new StgCmmArgRep module breaks a dependency cycle; I also
untabified it, but made no real changes
* updated the documentation in the wiki and change the user guide to
point there
* moved the allocation enters for ticky and CCS to after the heap check
* I left LDV where it was, which was before the heap check at least
once, since I have no idea what it is
* standardized all (active?) ticky alloc totals to bytes
* in order to avoid double counting StgCmmLayout.adjustHpBackwards
no longer bumps ALLOC_HEAP_ctr
* I resurrected the SLOW_CALL counters
* the new module StgCmmArgRep breaks cyclic dependency between
Layout and Ticky (which the SLOW_CALL counters cause)
* renamed them SLOW_CALL_fast_<pattern> and VERY_SLOW_CALL
* added ALLOC_RTS_ctr and _tot ticky counters
* eg allocation by Storage.c:allocate or a BUILD_PAP in stg_ap_*_info
* resurrected ticky counters for ALLOC_THK, ALLOC_PAP, and
ALLOC_PRIM
* added -ticky and -DTICKY_TICKY in ways.mk for debug ways
* added a ticky counter for total LNE entries
* new flags for ticky: -ticky-allocd -ticky-dyn-thunk -ticky-LNE
* all off by default
* -ticky-allocd: tracks allocation *of* closure in addition to
allocation *by* that closure
* -ticky-dyn-thunk tracks dynamic thunks as if they were functions
* -ticky-LNE tracks LNEs as if they were functions
* updated the ticky report format, including making the argument
categories (more?) accurate again
* the printed name for things in the report include the unique of
their ticky parent as well as if they are not top-level
|
| |
|
|
|
|
|
| |
Vector values are now always passed on the stack. This isn't particularly
efficient, but it will have to do for now.
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, threads blocked on an STM retry would be sent a wakeup
message each time an unpark was requested. This could result in the
accumulation of a large number of wake-up messages, which would slow
wake-up once the sleeping thread is finally scheduled.
Here, we introduce a new closure type, STM_AWOKEN, which marks a TSO
which has been sent a wake-up message, allowing us to send only one
wakeup.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
x86-64.
On x86-64 F and D registers are both drawn from SSE registers, so there is no
reason not to draw them from the same pool of available SSE registers. This
means that whereas previously a function could only receive two Double arguments
in registers even if it did not have any Float arguments, now it can receive up
to 6 arguments that are any mix of Float and Double in registers.
This patch breaks the LLVM back end. The next patch will fix this breakage.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The main change here is that the Cmm parser now allows high-level cmm
code with argument-passing and function calls. For example:
foo ( gcptr a, bits32 b )
{
if (b > 0) {
// we can make tail calls passing arguments:
jump stg_ap_0_fast(a);
}
return (x,y);
}
More details on the new cmm syntax are in Note [Syntax of .cmm files]
in CmmParse.y.
The old syntax is still more-or-less supported for those occasional
code fragments that really need to explicitly manipulate the stack.
However there are a couple of differences: it is now obligatory to
give a list of live GlobalRegs on every jump, e.g.
jump %ENTRY_CODE(Sp(0)) [R1];
Again, more details in Note [Syntax of .cmm files].
I have rewritten most of the .cmm files in the RTS into the new
syntax, except for AutoApply.cmm which is generated by the genapply
program: this file could be generated in the new syntax instead and
would probably be better off for it, but I ran out of enthusiasm.
Some other changes in this batch:
- The PrimOp calling convention is gone, primops now use the ordinary
NativeNodeCall convention. This means that primops and "foreign
import prim" code must be written in high-level cmm, but they can
now take more than 10 arguments.
- CmmSink now does constant-folding (should fix #7219)
- .cmm files now go through the cmmPipeline, and as a result we
generate better code in many cases. All the object files generated
for the RTS .cmm files are now smaller. Performance should be
better too, but I haven't measured it yet.
- RET_DYN frames are removed from the RTS, lots of code goes away
- we now have some more canned GC points to cover unboxed-tuples with
2-4 pointers, which will reduce code size a little.
|