| Commit message | Author | Age | Files | Lines |
| |
submodule updates: nofib, haddock
|
| |
This patch removes all CafInfo predictions and various hacks to preserve
predicted CafInfos from the compiler, and assigns final CafInfos to
interface Ids after code generation. SRT analysis is extended to support
static data, and the Cmm generator is modified to allow generating
static_link fields after SRT analysis.
This also fixes `-fcatch-bottoms`, which introduces error calls in case
expressions in CorePrep. CorePrep runs *after* CoreTidy (which is where
CafInfos were previously decided), so those calls could turn previously
non-CAFFY things into CAFFY.
Fixes #17648
Fixes #9718
Evaluation
==========
NoFib
-----
Boot with: `make boot mode=fast`
Run: `make mode=fast EXTRA_RUNTEST_OPTS="-cachegrind" NoFibRuns=1`
--------------------------------------------------------------------------------
Program Size Allocs Instrs Reads Writes
--------------------------------------------------------------------------------
CS -0.0% 0.0% -0.0% -0.0% -0.0%
CSD -0.0% 0.0% -0.0% -0.0% -0.0%
FS -0.0% 0.0% -0.0% -0.0% -0.0%
S -0.0% 0.0% -0.0% -0.0% -0.0%
VS -0.0% 0.0% -0.0% -0.0% -0.0%
VSD -0.0% 0.0% -0.0% -0.0% -0.5%
VSM -0.0% 0.0% -0.0% -0.0% -0.0%
anna -0.1% 0.0% -0.0% -0.0% -0.0%
ansi -0.0% 0.0% -0.0% -0.0% -0.0%
atom -0.0% 0.0% -0.0% -0.0% -0.0%
awards -0.0% 0.0% -0.0% -0.0% -0.0%
banner -0.0% 0.0% -0.0% -0.0% -0.0%
bernouilli -0.0% 0.0% -0.0% -0.0% -0.0%
binary-trees -0.0% 0.0% -0.0% -0.0% -0.0%
boyer -0.0% 0.0% -0.0% -0.0% -0.0%
boyer2 -0.0% 0.0% -0.0% -0.0% -0.0%
bspt -0.0% 0.0% -0.0% -0.0% -0.0%
cacheprof -0.0% 0.0% -0.0% -0.0% -0.0%
calendar -0.0% 0.0% -0.0% -0.0% -0.0%
cichelli -0.0% 0.0% -0.0% -0.0% -0.0%
circsim -0.0% 0.0% -0.0% -0.0% -0.0%
clausify -0.0% 0.0% -0.0% -0.0% -0.0%
comp_lab_zift -0.0% 0.0% -0.0% -0.0% -0.0%
compress -0.0% 0.0% -0.0% -0.0% -0.0%
compress2 -0.0% 0.0% -0.0% -0.0% -0.0%
constraints -0.0% 0.0% -0.0% -0.0% -0.0%
cryptarithm1 -0.0% 0.0% -0.0% -0.0% -0.0%
cryptarithm2 -0.0% 0.0% -0.0% -0.0% -0.0%
cse -0.0% 0.0% -0.0% -0.0% -0.0%
digits-of-e1 -0.0% 0.0% -0.0% -0.0% -0.0%
digits-of-e2 -0.0% 0.0% -0.0% -0.0% -0.0%
dom-lt -0.0% 0.0% -0.0% -0.0% -0.0%
eliza -0.0% 0.0% -0.0% -0.0% -0.0%
event -0.0% 0.0% -0.0% -0.0% -0.0%
exact-reals -0.0% 0.0% -0.0% -0.0% -0.0%
exp3_8 -0.0% 0.0% -0.0% -0.0% -0.0%
expert -0.0% 0.0% -0.0% -0.0% -0.0%
fannkuch-redux -0.0% 0.0% -0.0% -0.0% -0.0%
fasta -0.0% 0.0% -0.0% -0.0% -0.0%
fem -0.0% 0.0% -0.0% -0.0% -0.0%
fft -0.0% 0.0% -0.0% -0.0% -0.0%
fft2 -0.0% 0.0% -0.0% -0.0% -0.0%
fibheaps -0.0% 0.0% -0.0% -0.0% -0.0%
fish -0.0% 0.0% -0.0% -0.0% -0.0%
fluid -0.1% 0.0% -0.0% -0.0% -0.0%
fulsom -0.0% 0.0% -0.0% -0.0% -0.0%
gamteb -0.0% 0.0% -0.0% -0.0% -0.0%
gcd -0.0% 0.0% -0.0% -0.0% -0.0%
gen_regexps -0.0% 0.0% -0.0% -0.0% -0.0%
genfft -0.0% 0.0% -0.0% -0.0% -0.0%
gg -0.0% 0.0% -0.0% -0.0% -0.0%
grep -0.0% 0.0% -0.0% -0.0% -0.0%
hidden -0.0% 0.0% -0.0% -0.0% -0.0%
hpg -0.1% 0.0% -0.0% -0.0% -0.0%
ida -0.0% 0.0% -0.0% -0.0% -0.0%
infer -0.0% 0.0% -0.0% -0.0% -0.0%
integer -0.0% 0.0% -0.0% -0.0% -0.0%
integrate -0.0% 0.0% -0.0% -0.0% -0.0%
k-nucleotide -0.0% 0.0% -0.0% -0.0% -0.0%
kahan -0.0% 0.0% -0.0% -0.0% -0.0%
knights -0.0% 0.0% -0.0% -0.0% -0.0%
lambda -0.0% 0.0% -0.0% -0.0% -0.0%
last-piece -0.0% 0.0% -0.0% -0.0% -0.0%
lcss -0.0% 0.0% -0.0% -0.0% -0.0%
life -0.0% 0.0% -0.0% -0.0% -0.0%
lift -0.0% 0.0% -0.0% -0.0% -0.0%
linear -0.1% 0.0% -0.0% -0.0% -0.0%
listcompr -0.0% 0.0% -0.0% -0.0% -0.0%
listcopy -0.0% 0.0% -0.0% -0.0% -0.0%
maillist -0.0% 0.0% -0.0% -0.0% -0.0%
mandel -0.0% 0.0% -0.0% -0.0% -0.0%
mandel2 -0.0% 0.0% -0.0% -0.0% -0.0%
mate -0.0% 0.0% -0.0% -0.0% -0.0%
minimax -0.0% 0.0% -0.0% -0.0% -0.0%
mkhprog -0.0% 0.0% -0.0% -0.0% -0.0%
multiplier -0.0% 0.0% -0.0% -0.0% -0.0%
n-body -0.0% 0.0% -0.0% -0.0% -0.0%
nucleic2 -0.0% 0.0% -0.0% -0.0% -0.0%
para -0.0% 0.0% -0.0% -0.0% -0.0%
paraffins -0.0% 0.0% -0.0% -0.0% -0.0%
parser -0.1% 0.0% -0.0% -0.0% -0.0%
parstof -0.1% 0.0% -0.0% -0.0% -0.0%
pic -0.0% 0.0% -0.0% -0.0% -0.0%
pidigits -0.0% 0.0% -0.0% -0.0% -0.0%
power -0.0% 0.0% -0.0% -0.0% -0.0%
pretty -0.0% 0.0% -0.3% -0.4% -0.4%
primes -0.0% 0.0% -0.0% -0.0% -0.0%
primetest -0.0% 0.0% -0.0% -0.0% -0.0%
prolog -0.0% 0.0% -0.0% -0.0% -0.0%
puzzle -0.0% 0.0% -0.0% -0.0% -0.0%
queens -0.0% 0.0% -0.0% -0.0% -0.0%
reptile -0.0% 0.0% -0.0% -0.0% -0.0%
reverse-complem -0.0% 0.0% -0.0% -0.0% -0.0%
rewrite -0.0% 0.0% -0.0% -0.0% -0.0%
rfib -0.0% 0.0% -0.0% -0.0% -0.0%
rsa -0.0% 0.0% -0.0% -0.0% -0.0%
scc -0.0% 0.0% -0.3% -0.5% -0.4%
sched -0.0% 0.0% -0.0% -0.0% -0.0%
scs -0.0% 0.0% -0.0% -0.0% -0.0%
simple -0.1% 0.0% -0.0% -0.0% -0.0%
solid -0.0% 0.0% -0.0% -0.0% -0.0%
sorting -0.0% 0.0% -0.0% -0.0% -0.0%
spectral-norm -0.0% 0.0% -0.0% -0.0% -0.0%
sphere -0.0% 0.0% -0.0% -0.0% -0.0%
symalg -0.0% 0.0% -0.0% -0.0% -0.0%
tak -0.0% 0.0% -0.0% -0.0% -0.0%
transform -0.0% 0.0% -0.0% -0.0% -0.0%
treejoin -0.0% 0.0% -0.0% -0.0% -0.0%
typecheck -0.0% 0.0% -0.0% -0.0% -0.0%
veritas -0.0% 0.0% -0.0% -0.0% -0.0%
wang -0.0% 0.0% -0.0% -0.0% -0.0%
wave4main -0.0% 0.0% -0.0% -0.0% -0.0%
wheel-sieve1 -0.0% 0.0% -0.0% -0.0% -0.0%
wheel-sieve2 -0.0% 0.0% -0.0% -0.0% -0.0%
x2n1 -0.0% 0.0% -0.0% -0.0% -0.0%
--------------------------------------------------------------------------------
Min -0.1% 0.0% -0.3% -0.5% -0.5%
Max -0.0% 0.0% -0.0% -0.0% -0.0%
Geometric Mean -0.0% -0.0% -0.0% -0.0% -0.0%
--------------------------------------------------------------------------------
Program Size Allocs Instrs Reads Writes
--------------------------------------------------------------------------------
circsim -0.1% 0.0% -0.0% -0.0% -0.0%
constraints -0.0% 0.0% -0.0% -0.0% -0.0%
fibheaps -0.0% 0.0% -0.0% -0.0% -0.0%
gc_bench -0.0% 0.0% -0.0% -0.0% -0.0%
hash -0.0% 0.0% -0.0% -0.0% -0.0%
lcss -0.0% 0.0% -0.0% -0.0% -0.0%
power -0.0% 0.0% -0.0% -0.0% -0.0%
spellcheck -0.0% 0.0% -0.0% -0.0% -0.0%
--------------------------------------------------------------------------------
Min -0.1% 0.0% -0.0% -0.0% -0.0%
Max -0.0% 0.0% -0.0% -0.0% -0.0%
Geometric Mean -0.0% +0.0% -0.0% -0.0% -0.0%
Manual inspection of programs in testsuite/tests/programs
---------------------------------------------------------
I built these programs with a bunch of dump flags and `-O` and compared
STG, Cmm, and Asm dumps and file sizes.
(Below, the numbers in parentheses show the number of modules in each program.)
These programs have identical compiler outputs (same .hi and .o sizes, and
identical STG, Cmm, and Asm dumps):
- Queens (1), andre_monad (1), cholewo-eval (2), cvh_unboxing (3),
andy_cherry (7), fun_insts (1), hs-boot (4), fast2haskell (2),
jl_defaults (1), jq_readsPrec (1), jules_xref (1), jtod_circint (4),
jules_xref2 (1), lennart_range (1), lex (1), life_space_leak (1),
bargon-mangler-bug (7), record_upd (1), rittri (1), sanders_array (1),
strict_anns (1), thurston-module-arith (2), okeefe_neural (1),
joao-circular (6), 10queens (1)
Programs with different compiler outputs:
- jl_defaults (1): GHC HEAD marks a lot of top-level `[Int]` closures as
  CAFFY for no reason. With this patch we no longer make them CAFFY and
  generate fewer SRT entries. For some reason Main.o is slightly larger
  with this patch (1.3%), while the executable sizes are the same. (I'd
  expect both to be smaller.)
- launchbury (1): Same as jl_defaults: top-level `[Int]` closures marked
  as CAFFY for no reason. Similarly, `Main.o` is 1.4% larger but the
  executable sizes are the same.
- galois_raytrace (13): Differences are in the Parse module. There are a
lot, but some of the changes are caused by the fact that for some
reason (I think a bug) GHC HEAD marks the dictionary for `Functor
Identity` as CAFFY. Parse.o is 0.4% larger, the executable size is the
same.
- north_array: We now generate fewer SRT entries because some of the array
  primops used in this program, like `NewArrayOp`, get eliminated during
  Stg-to-Cmm and turn some CAFFY things into non-CAFFY. Main.o gets about
  2.5% larger (9,224 bytes, up from 9,000 bytes); executable sizes are the
  same.
- seward-space-leak: Difference in this program is better shown by this
smaller example:
    module Lib where

    data CDS
      = Case [CDS] [(Int, CDS)]
      | Call CDS CDS

    instance Eq CDS where
      Case sels1 rets1 == Case sels2 rets2 =
          sels1 == sels2 && rets1 == rets2
      Call a1 b1 == Call a2 b2 =
          a1 == a2 && b1 == b2
      _ == _ =
          False
In this program GHC HEAD builds a new SRT for the recursive group of
`(==)`, `(/=)` and the dictionary closure. Then `/=` points to `==`
in its SRT field, and `==` uses the SRT object as its SRT. With this
patch we use the closure for `/=` as the SRT and add `==` there. Then
`/=` gets an empty SRT field and `==` points to `/=` in its SRT
field.
This change looks fine to me.
Main.o gets 0.07% larger, executable sizes are identical.
head.hackage
------------
head.hackage's CI script builds 428 packages from Hackage using this
patch with no failures.
Compiler performance
--------------------
The compiler perf tests report that the compiler allocates slightly more
(the worst case observed so far is 4%). However, most programs in the test
suite are small, single-file programs. To benchmark compiler performance
on something more realistic I built Cabal (the library, 236 modules)
with different optimisation levels. For the "max residency" row I ran
GHC with `+RTS -s -A100k -i0 -h` for more accurate numbers. Other rows
are generated with just `-s`. (This is because `-i0` causes GC to run
much more frequently, and as a result "bytes copied" gets inflated by
more than 25x in some cases.)
* -O0
| | GHC HEAD | This MR | Diff |
| --------------- | -------------- | -------------- | ------ |
| Bytes allocated | 54,413,350,872 | 54,701,099,464 | +0.52% |
| Bytes copied | 4,926,037,184 | 4,990,638,760 | +1.31% |
| Max residency | 421,225,624 | 424,324,264 | +0.73% |
* -O1
| | GHC HEAD | This MR | Diff |
| --------------- | --------------- | --------------- | ------ |
| Bytes allocated | 245,849,209,992 | 246,562,088,672 | +0.28% |
| Bytes copied | 26,943,452,560 | 27,089,972,296 | +0.54% |
| Max residency | 982,643,440 | 991,663,432 | +0.91% |
* -O2
| | GHC HEAD | This MR | Diff |
| --------------- | --------------- | --------------- | ------ |
| Bytes allocated | 291,044,511,408 | 291,863,910,912 | +0.28% |
| Bytes copied | 37,044,237,616 | 36,121,690,472 | -2.49% |
| Max residency | 1,071,600,328 | 1,086,396,256 | +1.38% |
Extra compiler allocations
--------------------------
Runtime allocations of programs are as reported above (NoFib section).
The compiler now allocates more than before. The main source of extra
allocation compared to the base commit is the new SRT algorithm
(GHC.Cmm.Info.Build). Below is some of the extra work we do with this
patch; the numbers were generated by a profiled stage 2 compiler when
building a pathological case (the test 'ManyConstructors') with '-O2':
- We now sort the final STG for a module, which means traversing the
  entire program, generating the free variable set for each top-level
  binding, doing SCC analysis, and re-ordering the program (a small
  sketch of this kind of dependency sort follows this list). In
  ManyConstructors this step allocates 97,889,952 bytes.
- We now do SRT analysis on static data, which in a program like
ManyConstructors causes analysing 10,000 bindings that we would
previously just skip. This step allocates 70,898,352 bytes.
- We now maintain an SRT map for the entire module as we compile Cmm
groups:
    data ModuleSRTInfo = ModuleSRTInfo
      { ...
      , moduleSRTMap :: SRTMap
      }
(SRTMap is just a strict Map from the 'containers' library)
This map gets an entry for most bindings in a module (exceptions are
THUNKs and CAFFY static functions). For ManyConstructors this map
gets 50015 entries.
- Once we're done with code generation we generate a NameSet from SRTMap
for the non-CAFFY names in the current module. This set gets the same
number of entries as the SRTMap.
- Finally we update CafInfos in ModDetails for the non-CAFFY Ids, using
the NameSet generated in the previous step. This usually does the
least amount of allocation among the work listed here.
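As a rough illustration of the dependency sort mentioned in the first item
above, here is a small sketch using toy types (not GHC's actual STG
representation) and containers' Data.Graph:

```
import Data.Graph (flattenSCC, stronglyConnComp)
import qualified Data.Set as S

-- Toy stand-ins for STG top-level bindings.
type Name = String
data Binding = Binding { bindName :: Name, freeVars :: S.Set Name }

-- Order top-level bindings so that definitions come before their uses,
-- keeping mutually recursive bindings together (one SCC each).
sortTopLevel :: [Binding] -> [Binding]
sortTopLevel binds = concatMap flattenSCC sccs
  where
    defined = S.fromList (map bindName binds)
    sccs = stronglyConnComp
      [ (b, bindName b, S.toList (freeVars b `S.intersection` defined))
      | b <- binds ]
```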
The only place where this patch does less work is the CAF analysis in
the tidying pass (CoreTidy). However, that doesn't save us much, as the
pass still needs to traverse the whole program and update IdInfos for
other reasons. The only thing we no longer do here is the `hasCafRefs`
pass over the RHS of bindings, which is a stateless pass that returns a
boolean value, so it doesn't allocate much.
(The metric changes below are all increased allocations.)
Metric changes
--------------
Metric Increase:
ManyAlternatives
ManyConstructors
T13035
T14683
T1969
T9961
|
| |
incomplete-uni-patterns and incomplete-record-updates will be in -Wall at a
future date, so prepare for that by disabling those warnings on files that
trigger them.
|
| |
This is part two of fixing #17334.
There are two parts to this commit:
- A bugfix for computing loop levels
- A bugfix of basic block invariants in the NCG.
-----------------------------------------------------------
In the first bug we ended up with a CFG of the sort: [A -> B -> C]
This was represented via maps as fromList [(A,B),(B,C)] and later
transformed into an adjacency array. However, the transformation did
not include block C in the array (since we only looked at the keys of
the map).
This was still fine until we tried to look up the successors of C and
ended up reading outside of the array bounds.
In order to prevent this in the future I refactored the code to include
all nodes as keys in the map representation, and made this an invariant
which is checked in a few places.
Overall I expect this to make the code more robust, as now any failed
lookup represents an error, versus failed lookups sometimes being
expected and sometimes not.
In terms of performance this makes some things cheaper (getting a list
of all nodes) and others more expensive (adding a new edge). Overall
this adds up to no noticeable performance difference.
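A minimal sketch of the invariant described above (made-up types, not GHC's
actual CFG code): every node appears as a key in the successor map even when
it has no successors, so a failed lookup always indicates a bug.

```
import qualified Data.Map.Strict as M

type Node = Int
type CFG  = M.Map Node [Node]

addEdge :: Node -> Node -> CFG -> CFG
addEdge from to cfg =
  M.insertWith (++) from [to]                      -- record the edge
    (M.insertWith (\_new old -> old) to [] cfg)    -- ensure the target exists as a key

allNodes :: CFG -> [Node]
allNodes = M.keys                                  -- complete by construction

successors :: Node -> CFG -> [Node]
successors n cfg =
  case M.lookup n cfg of
    Just ss -> ss
    Nothing -> error "successors: node not in CFG" -- now always a genuine error
```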
-----------------------------------------------------------
Part 2: When the NCG generated a new basic block, it did
not always insert a NEWBLOCK meta instruction in the stream which
caused a quite subtle bug.
During instruction selection a statement `s`
in a block B with control of the sort: B -> C
will sometimes result in control
flow of the sort:
        ┌ < ┐
        v   ^
  B ->  B1  ┴ -> C
as is the case for some atomic operations.
Now to keep the CFG in sync when introducing B1 we clearly
want to insert it between B and C. However there is
a catch when we have to deal with self loops.
We might start with code and a CFG of these forms:
  loop:
      stmt1               ┌ < ┐
      ....                v   ^
      stmtX              loop ┘
      stmtY
      ....
      goto loop:
Now we introduce B1:
                          ┌ ─ ─ ─ ─ ─┐
  loop:                   │  ┌ < ┐   │
      instrs              v  │   ^   │
      ....               loop ┴ B1 ┴ ┘
      instrsFromX
      stmtY
      goto loop:
This is simple, all outgoing edges from loop now simply
start from B1 instead and the code generator knows which
new edges it introduced for the self loop of B1.
Disaster strikes if the statement Y follows the same pattern.
If we apply the same rule that all outgoing edges change then
we end up with:
  loop ─> B1 ─> B2 ┬─┐
    │     │    └─<┤  │
    │     └───<───┘  │
    └───────<────────┘
This is problematic. The edge B1->B1 is modified as expected.
However the modification is wrong!
The assembly in this case looked like this:
  _loop:
      <instrs>
  _B1:
      ...
      cmpxchgq ...
      jne _B1
      <instrs>
      <end _B1>
  _B2:
      ...
      cmpxchgq ...
      jne _B2
      <instrs>
      jmp loop
There is no edge _B2 -> _B1 here. It's still a self loop onto _B1.
The problem here is that really B1 should be two basic blocks.
Otherwise we have control flow in the *middle* of a basic block.
A contradiction!
So to account for this we add yet another basic block marker:
  _B:
      <instrs>
  _B1:
      ...
      cmpxchgq ...
      jne _B1
      jmp _B1'
  _B1':
      <instrs>
      <end _B1>
  _B2:
      ...
Now when inserting B2 we will only look at the outgoing edges of B1' and
everything will work out nicely.
You might also wonder why we don't insert jumps at the end of _B1'. There is
no way another block ends up jumping to the labels _B1 or _B2 since they are
essentially invisible to other blocks. View them as control flow labels local
to the basic block if you'd like.
Not doing this ultimately caused (part 2 of) #17334.
|
| |
These kinds of imports are necessary in some cases such as
importing instances of typeclasses or intentionally creating
dependencies in the build system, but '-Wunused-imports' can't
detect when they are no longer needed. This commit removes the
unused ones currently in the code base (not including test files
or submodules), with the hope that doing so may increase
parallelism in the build system by removing unnecessary
dependencies.
|
| |
ghc-pkg needs to be aware of platforms so it can figure out which
subdirectory within the user package db to use. This is admittedly
roundabout, but maybe Cabal could use the same notion of a platform as
GHC to good effect too.
|
| |
The splitter is an evil Perl script that processes assembler code.
Its job can be done better by the linker's --gc-sections flag. GHC
passes this flag to the linker whenever -split-sections is passed on
the command line.
This is based on @DemiMarie's D2768.
Fixes Trac #11315
Fixes Trac #9832
Fixes Trac #8964
Fixes Trac #8685
Fixes Trac #8629
|
| |
It never really encoded an invariant:
* The linear register allocator just did partial pattern matches
* The graph allocator just set it to (Just mapEmpty) for Nothing
So I changed LiveInfo to directly contain the map.
Further, natCmmTopToLive, which filled in Nothing, is no longer exported.
Instead we now call cmmTopLiveness, which changes the type AND fills
in the map.
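Roughly, the shape of the change is the following (placeholder types, not
GHC's actual LiveInfo definition):

```
import qualified Data.Map.Strict as M

type BlockId = Int
type RegSet  = [String]   -- placeholder for the real register set type

-- Before: the block-liveness map was optional, forcing partial matches
-- (or a faked Just mapEmpty) on every consumer.
newtype LiveInfoOld = LiveInfoOld (Maybe (M.Map BlockId RegSet))

-- After: the map is always present (possibly empty), so lookups are total.
newtype LiveInfoNew = LiveInfoNew (M.Map BlockId RegSet)

liveAtEntry :: BlockId -> LiveInfoNew -> RegSet
liveAtEntry bid (LiveInfoNew m) = M.findWithDefault [] bid m
```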
|
| |
Summary:
This patch implements a new code layout algorithm.
It has been tested for x86 and is disabled on other platforms.
Performance varies slightly by CPU/machine but in general seems to be better
by around 2%.
Nofib shows only small differences of about +/- ~0.5% overall, depending on
flags/machine; performance in other benchmarks improved significantly.
Other benchmarks include at least the benchmarks of: aeson, vector, megaparsec, attoparsec,
containers, text and xeno.
While the magnitude of the gains differed, three different CPUs were tested, with
all getting faster although to differing degrees. I tested: Sandy Bridge (Xeon), Haswell,
Skylake
* Library benchmark results summarized:
* containers: ~1.5% faster
* aeson: ~2% faster
* megaparsec: ~2-5% faster
* xml library benchmarks: 0.2%-1.1% faster
* vector-benchmarks: 1-4% faster
* text: 5.5% faster
On average GHC compile times go down, as the speedup from GHC itself being
compiled with the new layout outweighs the overhead introduced by running the
new layout algorithm.
Things this patch does:
* Move the code responsible for block layout into its own module.
* Move the NcgImpl Class into the NCGMonad module.
* Extract a control flow graph from the input cmm.
* Update this cfg to keep it in sync with changes during
asm codegen. This has been tested on x64 but should work on x86.
Other platforms still use the old codelayout.
* Assign weights to the edges in the CFG based on type and limited static
analysis which are then used for block layout.
* Once we have the final code layout eliminate some redundant jumps.
  In particular turn a sequence of:
        jne .foo
        jmp .bar
      foo:
  into
        je .bar
      foo:
        ..
Test Plan: ci
Reviewers: bgamari, jmct, jrtc27, simonmar, simonpj, RyanGlScott
Reviewed By: RyanGlScott
Subscribers: RyanGlScott, trommler, jmct, carter, thomie, rwbarton
GHC Trac Issues: #15124
Differential Revision: https://phabricator.haskell.org/D4726
|
| |
Summary:
On Windows one is not allowed to drop the stack by more than a page size.
The reason for this is that the OS only allocates enough stack up to what
the TEB specifies. After that a guard page is placed and the rest of the
virtual address space is unmapped.
The intention is that doing stack allocations will cause you to hit the
guard which will then map the next page in and move the guard. This is
done to prevent what in the Linux world is known as stack clash
vulnerabilities https://access.redhat.com/security/cve/cve-2017-1000364.
There are modules in GHC for which the liveness analysis thinks the
reserved 8KB of spill slots isn't enough. One being DynFlags and the
other being Cabal.
Though I think the Cabal one is likely a bug:
```
4d6544: 81 ec 00 46 00 00 sub $0x4600,%esp
4d654a: 8d 85 94 fe ff ff lea -0x16c(%ebp),%eax
4d6550: 3b 83 1c 03 00 00 cmp 0x31c(%ebx),%eax
4d6556: 0f 82 de 8d 02 00 jb 4ff33a <_cLpg_info+0x7a>
4d655c: c7 45 fc 14 3d 50 00 movl $0x503d14,-0x4(%ebp)
4d6563: 8b 75 0c mov 0xc(%ebp),%esi
4d6566: 83 c5 fc add $0xfffffffc,%ebp
4d6569: 66 f7 c6 03 00 test $0x3,%si
4d656e: 0f 85 a6 d7 02 00 jne 503d1a <_cLpb_info+0x6>
4d6574: 81 c4 00 46 00 00 add $0x4600,%esp
```
It allocates nearly 18KB of spill slots for a simple 4 line function
and doesn't even use them. Note that this doesn't happen on x64 or
when making a validate build, only when making a build without the
validate settings from build.mk.
This and the allocation in DynFlags means the stack allocation will jump
over the guard page into unmapped memory areas and GHC or an end program
segfaults.
The pagesize on x86 Windows is 4KB which means we hit it very easily for
these two modules, which explains the total DOA of GHC 32bit for the past
3 releases and the "random" segfaults on Windows.
```
0:000> bp 00503d29
0:000> gn
Breakpoint 0 hit
WARNING: Stack overflow detected. The unwound frames are extracted from outside
normal stack bounds.
eax=03b6b9c9 ebx=00dc90f0 ecx=03cac48c edx=03cac43d esi=03b6b9c9 edi=03abef40
eip=00503d29 esp=013e96fc ebp=03cf8f70 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
setup+0x103d29:
00503d29 89442440 mov dword ptr [esp+40h],eax ss:002b:013e973c=????????
WARNING: Stack overflow detected. The unwound frames are extracted from outside
normal stack bounds.
WARNING: Stack overflow detected. The unwound frames are extracted from outside
normal stack bounds.
0:000> !teb
TEB at 00384000
ExceptionList: 013effcc
StackBase: 013f0000
StackLimit: 013eb000
```
This doesn't fix the liveness analysis but does fix the allocations, by
emitting a function call to `__chkstk_ms` when doing allocations larger
than a page. This makes sure the stack is probed every page so the kernel
maps in the next page.
`__chkstk_ms` is provided by `libGCC`, which is under the
`GNU runtime exclusion license`, so it's safe to link against it, even for
proprietary code. (Technically we already do since we link compiled C code in.)
For allocations smaller than a page we drop the stack and probe the new address.
This avoids the function call and still makes sure we hit the guard if needed.
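A simplified sketch of that allocation strategy (made-up instruction type,
not GHC's actual native code generator):

```
data Instr
  = SubSp      Int   -- drop the stack pointer by n bytes
  | Probe            -- touch the new top of stack so a guard page can be hit
  | CallChkStk Int   -- call __chkstk_ms to probe every page of a large drop
  deriving Show

pageSize :: Int
pageSize = 4096      -- x86 Windows page size mentioned above

allocStack :: Int -> [Instr]
allocStack bytes
  | bytes > pageSize = [CallChkStk bytes, SubSp bytes]  -- probe page by page, then drop
  | otherwise        = [SubSp bytes, Probe]             -- small drop: probe once
```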
PS: In case anyone is wondering why we didn't notice this before, it's because we
only test x86_64 and on Windows 10. On x86_64 the page size is 8KB and also the
kernel is a bit more lenient on Windows 10 in that it seems to catch the segfault
and resize the stack if it was unmapped:
```
0:000> t
eax=03b6b9c9 ebx=00dc90f0 ecx=03cac48c edx=03cac43d esi=03b6b9c9 edi=03abef40
eip=00503d2d esp=013e96fc ebp=03cf8f70 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
setup+0x103d2d:
00503d2d 8b461b mov eax,dword ptr [esi+1Bh] ds:002b:03b6b9e4=03cac431
0:000> !teb
TEB at 00384000
ExceptionList: 013effcc
StackBase: 013f0000
StackLimit: 013e9000
```
Likely Windows 10 has a guard page larger than previous versions.
This fixes the stack allocations, and as soon as I get the time I will look at
the liveness analysis. I find it highly unlikely that a simple Cabal function
requires ~2200 spill slots.
Test Plan: ./validate
Reviewers: simonmar, bgamari
Reviewed By: bgamari
Subscribers: AndreasK, rwbarton, thomie, carter
GHC Trac Issues: #15154
Differential Revision: https://phabricator.haskell.org/D4917
|
| |
The main job of this commit is to track more accurately the scope
of tyvars introduced by user-written foralls. For example, it would
be possible to have something like this:
forall a. Int -> (forall k (b :: k). Proxy '[a, b]) -> Bool
In that type, a's kind must be k, but k isn't in scope. We had a
terrible way of doing this before (not worth repeating or describing
here, but see the old tcImplicitTKBndrs and friends), but now
we have a principled approach: make an Implication when kind-checking
a forall. Doing so then hooks into the existing machinery for
preventing skolem-escape, performing floating, etc. This also means
that we bump the TcLevel whenever going into a forall.
The new behavior is done in TcHsType.scopeTyVars, but see also
TcHsType.tc{Im,Ex}plicitTKBndrs, which have undergone significant
rewriting. There are several Notes near there to guide you. Of
particular interest there is that Implication constraints can now
have skolems that are out of order; this situation is reported in
TcErrors.
A major consequence of this is a slightly tweaked process for type-
checking type declarations. The new Note [Use SigTvs in kind-checking
pass] in TcTyClsDecls lays it out.
The error message for dependent/should_fail/TypeSkolEscape has become
noticeably worse. However, this is because the code in TcErrors goes to
some length to preserve pre-8.0 error messages for kind errors. It's time
to rip off that plaster and get rid of much of the kind-error-specific
error messages. I tried this, and doing so led to a lovely error message
for TypeSkolEscape. So: I'm accepting the error message quality regression
for now, but will open up a new ticket to fix it, along with a larger
error-message improvement I've been pondering. This applies also to
dependent/should_fail/{BadTelescope2,T14066,T14066e}, polykinds/T11142.
Other minor changes:
- isUnliftedTypeKind didn't look for tuples and sums. It does now.
- check_type used check_arg_type on both sides of an AppTy. But the left
side of an AppTy isn't an arg, and this was causing a bad error message.
I've changed it to use check_type on the left-hand side.
- Some refactoring around when we print (TYPE blah) in error messages.
The changes decrease the times when we do so, to good effect.
Of course, this is still all controlled by
-fprint-explicit-runtime-reps
Fixes #14066 #14749
Test cases: dependent/should_compile/{T14066a,T14749},
dependent/should_fail/T14066{,c,d,e,f,g,h}
|
| |
This switches the compiler/ component to get compiled with
-XNoImplicitPrelude and a `import GhcPrelude` is inserted in all
modules.
This is motivated by the upcoming "Prelude" re-export of
`Semigroup((<>))`, which would cause lots of name clashes in every
module which also imports `Outputable`.
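For illustration, under the new scheme a module header might look roughly
like this (module name is made up; the extension is set component-wide via
-XNoImplicitPrelude rather than per file):

```
{-# LANGUAGE NoImplicitPrelude #-}
module GHC.Hypothetical where       -- made-up module name

import GhcPrelude                   -- GHC's replacement for the implicit Prelude

import Outputable                   -- no longer clashes with a Prelude (<>) re-export
```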
Reviewers: austin, goldfire, bgamari, alanz, simonmar
Reviewed By: bgamari
Subscribers: goldfire, rwbarton, thomie, mpickering, bgamari
Differential Revision: https://phabricator.haskell.org/D3989
|
| |
This copies the subset of Hoopl's functionality needed by GHC to
`cmm/Hoopl` and removes the dependency on the Hoopl package.
The main motivation for this change is the confusing/noisy interface
between GHC and Hoopl:
- Hoopl has `Label` which is GHC's `BlockId` but different than
GHC's `CLabel`
- Hoopl has `Unique` which is different than GHC's `Unique`
- Hoopl has `Unique{Map,Set}` which are different than GHC's
`Uniq{FM,Set}`
- GHC has its own specialized copy of `Dataflow`, so `cmm/Hoopl` is
needed just to filter the exposed functions (filter out some of the
Hoopl's and add the GHC ones)
With this change, we'll be able to simplify this significantly.
It'll also be much easier to do invasive changes (Hoopl is a public
package on Hackage with users that depend on the current behavior)
This should introduce no changes in functionality - it merely
copies the relevant code.
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: ./validate
Reviewers: austin, bgamari, simonmar
Reviewed By: bgamari, simonmar
Subscribers: simonpj, kavon, rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3616
|
| |
This refactoring makes it more obvious when we are constructing
a Node for the digraph rather than a less useful 3-tuple.
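Roughly, the change has this shape (field names here are illustrative, not
necessarily the exact definitions in Digraph):

```
-- Before: graph vertices were bare 3-tuples of (payload, key, dependencies).
type OldNode key payload = (payload, key, [key])

-- After: a named record makes construction sites self-documenting.
data Node key payload = DigraphNode
  { node_payload      :: payload  -- the vertex itself
  , node_key          :: key      -- its unique key
  , node_dependencies :: [key]    -- keys of the vertices it points to
  }
```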
Reviewers: austin, goldfire, bgamari, simonmar, dfeuer
Reviewed By: dfeuer
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3414
|
| |
The fundamental problem with `type UniqSet = UniqFM` is that `UniqSet`
has a key invariant `UniqFM` does not. For example, `fmap` over
`UniqSet` will generally produce nonsense.
* Upgrade `UniqSet` from a type synonym to a newtype.
* Remove unused and shady `extendVarSet_C` and `addOneToUniqSet_C`.
* Use cached unique in `tyConsOfType` by replacing
`unitNameEnv (tyConName tc) tc` with `unitUniqSet tc`.
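A small stand-alone illustration of the invariant in question, using an
IntMap in place of GHC's UniqFM (the names below are stand-ins, not GHC's
real API):

```
import qualified Data.IntMap.Strict as IM

-- Stand-in for GHC's Uniquable class.
class Uniquable a where
  getKey :: a -> Int

-- With `type UniqSet a = UniqFM a`, any map operation is accepted, including
-- ones (like fmap) that change elements without changing keys and so break
-- the invariant "key == getKey element". A newtype hides the raw map behind
-- an API that maintains the invariant.
newtype UniqSet a = UniqSet (IM.IntMap a)

unitUniqSet :: Uniquable a => a -> UniqSet a
unitUniqSet x = UniqSet (IM.singleton (getKey x) x)

addOneToUniqSet :: Uniquable a => UniqSet a -> a -> UniqSet a
addOneToUniqSet (UniqSet m) x = UniqSet (IM.insert (getKey x) x m)

elementOfUniqSet :: Uniquable a => a -> UniqSet a -> Bool
elementOfUniqSet x (UniqSet m) = IM.member (getKey x) m
```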
Reviewers: austin, hvr, goldfire, simonmar, niteria, bgamari
Reviewed By: niteria
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D3146
|
| |
This continues removal of `BlockId` module in favor of Hoopl's `Label`.
Most of the changes here are mechanical, apart from the orphan
`Outputable` instances for `LabelMap` and `LabelSet`. For now I've
moved them to `cmm/Hoopl`, since it's already trying to manage all
imports from Hoopl (to avoid any collisions).
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: validate
Reviewers: bgamari, austin, simonmar
Reviewed By: simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2800
|
| |
It seems that `BlockId` module could simply go away in favor
of Hoopl's `Label`. This is the first step to do that.
In a few places I had to add some type signatures, but most of
them seem to help with code readability.
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: ./validate
Reviewers: austin, simonmar, bgamari
Reviewed By: bgamari
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2765
|
| |
These actually are complete, but due to the use of pattern guards, the
compiler does not see that. Refactor the code so that it does.
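The kind of rewrite involved looks roughly like this (a made-up example, not
the actual code touched by this commit):

```
-- Before: complete in practice, but (at the time) the coverage checker could
-- not see through the pattern guards and warned about a missing case.
describe :: Maybe Int -> String
describe x
  | Just n  <- x = "got " ++ show n
  | Nothing <- x = "got nothing"

-- After: the same logic as direct pattern matches, which the checker
-- recognises as exhaustive.
describe' :: Maybe Int -> String
describe' (Just n) = "got " ++ show n
describe' Nothing  = "got nothing"
```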
Differential Revision: https://phabricator.haskell.org/D2574
|
| |
This documents nondeterminism in code generation and removes
the nondeterministic ufmToList function. In the future someone
will have to use nonDetEltsUFM (with proper explanation)
or pprUFM.
|
| |
This documents nondeterminism in code generation and removes
the nondeterministic ufmToList function. In the future someone
will have to use nonDetUFMToList (with proper explanation)
or pprUFMWithKeys.
|
| |
We want to remove the `Ord Unique` instance because there's
no way to implement it in a deterministic way and it's too
easy to use by accident.
We sometimes compute SCC for datatypes whose Ord instance
is implemented in terms of Unique. The Ord constraint on
SCC is just an artifact of some internal data structures.
We can have an alternative implementation with a data
structure that uses Uniquable instead.
This does exactly that and I'm pleased that I didn't have
to introduce any duplication to do that.
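The idea, sketched with containers' Data.Graph and a stand-in for Uniquable
(so no Ord instance on the payload itself is needed):

```
import Data.Graph (SCC, stronglyConnComp)

-- Stand-in for GHC's Uniquable: the only thing SCC needs is an Int key.
class HasUnique a where
  uniqueKey :: a -> Int

-- SCC over (node, outgoing neighbours) pairs, keyed by the unique rather
-- than by an Ord instance on the nodes themselves.
sccFrom :: HasUnique a => [(a, [a])] -> [SCC a]
sccFrom edges =
  stronglyConnComp [ (v, uniqueKey v, map uniqueKey vs) | (v, vs) <- edges ]
```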
Test Plan:
./validate
I looked at performance tests and it's a tiny bit better.
Reviewers: bgamari, simonmar, ezyang, austin, goldfire
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2359
GHC Trac Issues: #4012
|
| |
Summary:
In the past the canonical way for constructing an SDoc string literal was the
composition `ptext . sLit`. But for some time now we have function `text` that
does the same. Plus it has some rules that optimize its runtime behaviour.
This patch takes all uses of `ptext . sLit` in the compiler and replaces them
with calls to `text`. The main benefits of this patch are cleaner (shorter) code
and fewer dependencies between modules, because many modules now do not need to
import `FastString`. I don't expect any performance benefits - we mostly use
SDocs to report errors and it seems there is little to be gained here.
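The rewrite at each call site is mechanical; an illustrative (made-up)
example:

```
-- Inside GHC (SDoc combinators from Outputable):
--
--   before:  msg = ptext (sLit "Kind mismatch")
--   after:   msg = text "Kind mismatch"
```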
Test Plan: ./validate
Reviewers: bgamari, austin, goldfire, hvr, alanz
Subscribers: goldfire, thomie, mpickering
Differential Revision: https://phabricator.haskell.org/D1784
|
| |
George's new exhaustiveness checker now realizes these are impossible.
Yay!
|
| |
Summary:
Introduced in 6c7b41cc2b24f533697a62bf1843507ae043fc97.
I checked the rest of that commit, and this is all that was left to revert.
Test Plan: x
Reviewers: ezyang, austin
Reviewed By: ezyang, austin
Subscribers: simonmar, ezyang, carter, thomie
Differential Revision: https://phabricator.haskell.org/D241
|
| |
Summary:
In this example we ended up with some code that was only reachable via
an info table, because a branch had been optimised away by the native
code generator. The register allocator then got confused because it
was only considering the first block of the proc to be an entry point,
when actually any of the info tables are entry points.
Test Plan: validate
Reviewers: simonpj, austin
Subscribers: simonmar, relrod, carter
Differential Revision: https://phabricator.haskell.org/D88
|
| |
A previous fix to this was wrong: f5879acd018494b84233f26fba828ce376d0f81d
and left some unreachable code behind. So rather than try to be clever and
do this at the same time as the strongly-connected-component analysis, I'm
doing a separate reachability pass first.
|
| |
In some cases, the layout of the LANGUAGE/OPTIONS_GHC lines has been
reorganized, while following the convention, to
- place `{-# LANGUAGE #-}` pragmas at the top of the source file, before
any `{-# OPTIONS_GHC #-}`-lines.
- Moreover, if the list of language extensions fits into a single
  `{-# LANGUAGE ... #-}` line (shorter than 80 characters), keep it on one
  line. Otherwise split into `{-# LANGUAGE ... #-}` lines for each
  individual language extension. In both cases, try to keep the
  enumeration alphabetically ordered.
  (The latter layout is preferable as it's more diff-friendly.)
While at it, this also replaces obsolete `{-# OPTIONS ... #-}` pragma
occurrences by `{-# OPTIONS_GHC ... #-}` pragmas.
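For illustration, the preferred layouts described above (module name and
extension list are made up):

```
-- A short extension list fits on one line, LANGUAGE before OPTIONS_GHC:
--
--   {-# LANGUAGE BangPatterns, CPP, MagicHash #-}
--   {-# OPTIONS_GHC -fno-warn-orphans #-}
--
-- A longer list gets one LANGUAGE pragma per extension, alphabetically:
{-# LANGUAGE BangPatterns        #-}
{-# LANGUAGE CPP                 #-}
{-# LANGUAGE DeriveFunctor       #-}
{-# LANGUAGE FlexibleContexts    #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}
module Example where
```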
|
| |
This checks that all the required extensions are enabled for the
inferred type signature.
Updates binary and vector submodules.
|
| |
The problem with unreachable code is that it might refer to undefined
registers. This happens accidentally: a block can be orphaned by an
optimisation, for example when the result of a comparison becomes
known.
The register allocator panics when it finds an undefined register,
because they shouldn't occur in generated code. So we need to also
discard unreachable code to prevent this panic being triggered by
optimisations.
The register allocator already does a strongly-connected component
analysis, so it ought to be easy to make it discard unreachable code
as part of that traversal. It turns out that we need a different
variant of the SCC algorithm to do that (see Digraph); however, the new
variant also generates slightly better code by putting the blocks
within a loop in a better order for register allocation.
|
| |
This removes the OldCmm data type and the CmmCvt pass that converts
new Cmm to OldCmm. The backends (NCGs, LLVM and C) have all been
converted to consume new Cmm.
The main difference between the two data types is that conditional
branches in new Cmm have both true/false successors, whereas in OldCmm
the false case was a fallthrough. To generate slightly better code we
occasionally need to invert a conditional to ensure that the
branch-not-taken becomes a fallthrough; this was previously done in
CmmCvt, and it is now done in CmmContFlowOpt.
We could go further and use the Hoopl Block representation for native
code, which would mean that we could use Hoopl's postorderDfs and
analyses for native code, but for now I've left it as is, using the
old ListGraph representation for native code.
|
| |
All Cmm procedures now include the set of global registers that are live on
procedure entry, i.e., the global registers used to pass arguments to the
procedure. Only global registers that are used to pass arguments are included in
this list.
|
| |
Proc-point splitting is only required by backends that do not support
having proc-points within a code block (that is, everything except the
native backend, i.e. LLVM and C).
Not doing proc-point splitting saves some compilation time, and might
produce slightly better code in some cases.
|
| |
We can now get the Platform from the DynFlags inside an SDoc, so we
no longer need to pass the Platform in.
|
| |
And some knock-on changes
|
| |
CmmTop -> CmmDecl
CmmPgm -> CmmGroup
|