|  | Commit message (Collapse) | Author | Age | Files | Lines | 
|---|
| | |  | 
| | 
| | submodule updates: nofib, haddock | 
| | 
| | This patch removes all CafInfo predictions and various hacks to preserve
predicted CafInfos from the compiler and assigns final CafInfos to
interface Ids after code generation. SRT analysis is extended to support
static data, and Cmm generator is modified to allow generating
static_link fields after SRT analysis.
This also fixes `-fcatch-bottoms`, which introduces error calls in case
expressions in CorePrep, which runs *after* CoreTidy (which is where we
decide on CafInfos) and turns previously non-CAFFY things into CAFFY.
Fixes #17648
Fixes #9718
Evaluation
==========
NoFib
-----
Boot with: `make boot mode=fast`
Run: `make mode=fast EXTRA_RUNTEST_OPTS="-cachegrind" NoFibRuns=1`
--------------------------------------------------------------------------------
        Program           Size    Allocs    Instrs     Reads    Writes
--------------------------------------------------------------------------------
             CS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            CSD          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             FS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              S          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             VS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            VSD          -0.0%      0.0%     -0.0%     -0.0%     -0.5%
            VSM          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           anna          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
           ansi          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           atom          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         awards          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         banner          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
     bernouilli          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   binary-trees          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          boyer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         boyer2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           bspt          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      cacheprof          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       calendar          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       cichelli          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        circsim          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       clausify          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
  comp_lab_zift          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       compress          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      compress2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
    constraints          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   cryptarithm1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   cryptarithm2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            cse          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   digits-of-e1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   digits-of-e2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         dom-lt          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          eliza          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          event          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
    exact-reals          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         exp3_8          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         expert          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
 fannkuch-redux          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          fasta          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            fem          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            fft          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           fft2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       fibheaps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           fish          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          fluid          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
         fulsom          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         gamteb          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            gcd          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
    gen_regexps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         genfft          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             gg          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           grep          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         hidden          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            hpg          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
            ida          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          infer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        integer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      integrate          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   k-nucleotide          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          kahan          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        knights          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         lambda          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
     last-piece          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           lcss          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           life          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           lift          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         linear          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
      listcompr          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       listcopy          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       maillist          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         mandel          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        mandel2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           mate          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        minimax          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        mkhprog          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
     multiplier          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         n-body          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       nucleic2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           para          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      paraffins          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         parser          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
        parstof          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
            pic          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       pidigits          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          power          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         pretty          -0.0%      0.0%     -0.3%     -0.4%     -0.4%
         primes          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      primetest          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         prolog          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         puzzle          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         queens          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        reptile          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
reverse-complem          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        rewrite          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           rfib          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            rsa          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            scc          -0.0%      0.0%     -0.3%     -0.5%     -0.4%
          sched          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            scs          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         simple          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
          solid          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        sorting          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
  spectral-norm          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         sphere          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         symalg          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            tak          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      transform          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       treejoin          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      typecheck          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        veritas          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           wang          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      wave4main          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   wheel-sieve1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   wheel-sieve2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           x2n1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
--------------------------------------------------------------------------------
            Min          -0.1%      0.0%     -0.3%     -0.5%     -0.5%
            Max          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
 Geometric Mean          -0.0%     -0.0%     -0.0%     -0.0%     -0.0%
--------------------------------------------------------------------------------
        Program           Size    Allocs    Instrs     Reads    Writes
--------------------------------------------------------------------------------
        circsim          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
    constraints          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       fibheaps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       gc_bench          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           hash          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           lcss          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          power          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
     spellcheck          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
--------------------------------------------------------------------------------
            Min          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
            Max          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
 Geometric Mean          -0.0%     +0.0%     -0.0%     -0.0%     -0.0%
Manual inspection of programs in testsuite/tests/programs
---------------------------------------------------------
I built these programs with a bunch of dump flags and `-O` and compared
STG, Cmm, and Asm dumps and file sizes.
(Below, the numbers in parentheses show the number of modules in each program.)
These programs have identical compiler outputs (same .hi and .o sizes, and
the same STG, Cmm and Asm dumps):
- Queens (1), andre_monad (1), cholewo-eval (2), cvh_unboxing (3),
  andy_cherry (7), fun_insts (1), hs-boot (4), fast2haskell (2),
  jl_defaults (1), jq_readsPrec (1), jules_xref (1), jtod_circint (4),
  jules_xref2 (1), lennart_range (1), lex (1), life_space_leak (1),
  bargon-mangler-bug (7), record_upd (1), rittri (1), sanders_array (1),
  strict_anns (1), thurston-module-arith (2), okeefe_neural (1),
  joao-circular (6), 10queens (1)
Programs with different compiler outputs:
- jl_defaults (1): GHC HEAD marks a lot of top-level `[Int]` closures as
  CAFFY for no apparent reason. With this patch we no longer make them
  CAFFY and generate fewer SRT entries. For some reason Main.o is
  slightly larger with this patch (1.3%) and the executable sizes are
  the same. (I'd expect both to be smaller)
- launchbury (1): Same as jl_defaults: top-level `[Int]` closures marked
  as CAFFY for no reason. Similarly `Main.o` is 1.4% larger but the
  executable sizes are the same.
- galois_raytrace (13): Differences are in the Parse module. There are a
  lot, but some of the changes are caused by the fact that for some
  reason (I think a bug) GHC HEAD marks the dictionary for `Functor
  Identity` as CAFFY. Parse.o is 0.4% larger, the executable size is the
  same.
- north_array: We now generate fewer SRT entries because some of the array
  primops used in this program, like `NewArrayOp`, get eliminated during
  Stg-to-Cmm and turn some CAFFY things into non-CAFFY. Main.o gets about
  2.5% larger (9224 bytes, up from 9000 bytes), executable sizes are the same.
- seward-space-leak: Difference in this program is better shown by this
  smaller example:
      module Lib where
      data CDS
        = Case [CDS] [(Int, CDS)]
        | Call CDS CDS
      instance Eq CDS where
        Case sels1 rets1 == Case sels2 rets2 =
            sels1 == sels2 && rets1 == rets2
        Call a1 b1 == Call a2 b2 =
            a1 == a2 && b1 == b2
        _ == _ =
            False
   In this program GHC HEAD builds a new SRT for the recursive group of
   `(==)`, `(/=)` and the dictionary closure. Then `/=` points to `==`
   in its SRT field, and `==` uses the SRT object as its SRT. With this
   patch we use the closure for `/=` as the SRT and add `==` there. Then
   `/=` gets an empty SRT field and `==` points to `/=` in its SRT
   field.
   This change looks fine to me.
   Main.o gets 0.07% larger, executable sizes are identical.
head.hackage
------------
head.hackage's CI script builds 428 packages from Hackage using this
patch with no failures.
Compiler performance
--------------------
The compiler perf tests report that the compiler allocates slightly more
(worst case observed so far is 4%). However most programs in the test
suite are small, single-file programs. To benchmark compiler performance
on something more realistic I built Cabal (the library, 236 modules)
with different optimisation levels. For the "max residency" row I ran
GHC with `+RTS -s -A100k -i0 -h` for more accurate numbers. Other rows
were generated with just `-s`. (This is because `-i0` makes the GC run
much more frequently, and as a result "bytes copied" gets inflated by
more than 25x in some cases.)
* -O0
|                 | GHC HEAD       | This MR        | Diff   |
| --------------- | -------------- | -------------- | ------ |
| Bytes allocated | 54,413,350,872 | 54,701,099,464 | +0.52% |
| Bytes copied    |  4,926,037,184 |  4,990,638,760 | +1.31% |
| Max residency   |    421,225,624 |    424,324,264 | +0.73% |
* -O1
|                 | GHC HEAD        | This MR         | Diff   |
| --------------- | --------------- | --------------- | ------ |
| Bytes allocated | 245,849,209,992 | 246,562,088,672 | +0.28% |
| Bytes copied    |  26,943,452,560 |  27,089,972,296 | +0.54% |
| Max residency   |     982,643,440 |     991,663,432 | +0.91% |
* -O2
|                 | GHC HEAD        | This MR         | Diff   |
| --------------- | --------------- | --------------- | ------ |
| Bytes allocated | 291,044,511,408 | 291,863,910,912 | +0.28% |
| Bytes copied    |  37,044,237,616 |  36,121,690,472 | -2.49% |
| Max residency   |   1,071,600,328 |   1,086,396,256 | +1.38% |
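The Diff column is the relative change against GHC HEAD. As a minimal sketch of that arithmetic (the helper name `pctDiff` is made up for this illustration and is not part of GHC or its test suite):

```haskell
-- Relative change of a metric, in percent, going from a baseline value
-- to a new value. Hypothetical helper, for illustration only.
pctDiff :: Integer -> Integer -> Double
pctDiff base new = 100 * fromIntegral (new - base) / fromIntegral base

-- "Bytes copied" at -O2, using the two numbers from the table above.
copiedO2 :: Double
copiedO2 = pctDiff 37044237616 36121690472
```

Evaluating `copiedO2` gives a value close to -2.49, matching the only negative entry in the tables.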
Extra compiler allocations
--------------------------
Runtime allocations of programs are as reported above (NoFib section).
The compiler now allocates more than before. The main source of extra
allocation in this patch compared to the base commit is the new SRT
algorithm (GHC.Cmm.Info.Build). Below is some of the extra work we do with
this patch; the numbers were generated by a profiled stage 2 compiler
building a pathological case (the test 'ManyConstructors') with '-O2':
- We now sort the final STG for a module, which means traversing the
  entire program, computing the free variable set for each top-level
  binding, doing SCC analysis, and re-ordering the program. In
  ManyConstructors this step allocates 97,889,952 bytes.
- We now do SRT analysis on static data, which in a program like
  ManyConstructors means analysing 10,000 bindings that we would
  previously just skip. This step allocates 70,898,352 bytes.
- We now maintain an SRT map for the entire module as we compile Cmm
  groups:
      data ModuleSRTInfo = ModuleSRTInfo
        { ...
        , moduleSRTMap :: SRTMap
        }
   (SRTMap is just a strict Map from the 'containers' library)
   This map gets an entry for most bindings in a module (exceptions are
   THUNKs and CAFFY static functions). For ManyConstructors this map
   gets 50015 entries.
- Once we're done with code generation we generate a NameSet from SRTMap
  for the non-CAFFY names in the current module. This set gets the same
  number of entries as the SRTMap.
- Finally we update CafInfos in ModDetails for the non-CAFFY Ids, using
  the NameSet generated in the previous step. This usually does the
  least amount of allocation among the work listed here.
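The first item in the list (free variables per top-level binding, then SCC analysis, then re-ordering) can be sketched with `Data.Graph` from the 'containers' library. This is only an illustration under simplifying assumptions: `Binding`, `depSort`, `groupNames` and `example` are hypothetical names, and the real pass works on STG bindings rather than strings.

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)

-- A toy top-level binding: its name and the names occurring free in its
-- right-hand side. Hypothetical stand-in, not GHC's STG representation.
data Binding = Binding
  { bndrName :: String
  , freeVars :: [String]
  }

-- Group bindings into strongly connected components. `stronglyConnComp`
-- returns the groups in reverse topological order, i.e. dependencies first.
depSort :: [Binding] -> [[Binding]]
depSort bs =
  map flattenSCC (stronglyConnComp [(b, bndrName b, freeVars b) | b <- bs])

-- Names per group, in output order.
groupNames :: [Binding] -> [[String]]
groupNames = map (map bndrName) . depSort

-- "c" and "d" are mutually recursive; "e" depends on everything else.
example :: [Binding]
example =
  [ Binding "e" ["a", "c"]
  , Binding "a" ["b"]
  , Binding "b" []
  , Binding "c" ["d"]
  , Binding "d" ["c"]
  ]
```

In `groupNames example`, "b" comes before "a", "c" and "d" share one group, and "e" comes last. The relative order of unrelated groups is left to the library.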
The only place where we do less work with this patch is the CAF analysis
in the tidying pass (CoreTidy). However that doesn't save us much, as the
pass still needs to traverse the whole program and update IdInfos for
other reasons. The only thing we no longer do there is the `hasCafRefs`
pass over the RHS of bindings, which is a stateless pass that returns a
boolean value, so it doesn't allocate much.
(Metric changes below are all increased allocations)
Metric changes
--------------
Metric Increase:
    ManyAlternatives
    ManyConstructors
    T13035
    T14683
    T1969
    T9961 | 
| | |  | 
| | |  | 
| | 
| | This is part two of fixing #17334.
There are two parts to this commit:
- A bugfix for computing loop levels
- A bugfix of basic block invariants in the NCG.
-----------------------------------------------------------
In the first bug we ended up with a CFG of the sort: [A -> B -> C]
This was represented via maps as fromList [(A,B),(B,C)] and later
transformed into an adjacency array. However the transformation did
not include block C in the array (since we only looked at the keys of
the map).
This was still fine until we tried to look up successors for C and tried
to read outside of the array bounds when accessing C.
In order to prevent this in the future I refactored the code to include
all nodes as keys in the map representation, and made this an invariant
which is checked in a few places.
Overall I expect this to make the code more robust as now any failed
lookup will represent an error, versus failed lookups sometimes being
expected and sometimes not.
In terms of performance this makes some things cheaper (getting a list
of all nodes) and others more expensive (adding a new edge). Overall
this adds up to no notable performance difference.
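A minimal sketch of the "all nodes are keys" invariant, assuming a toy CFG stored as a map from labels to successor sets. The names `Label`, `CFG`, `addEdge`, `hasAllNodes`, `successors` and `chain` are illustrative only and are not the NCG's actual API.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Toy CFG: every block label maps to the set of its successors.
type Label = String
type CFG = M.Map Label (S.Set Label)

-- Add an edge and make sure the target is a key too, even if it has no
-- successors of its own. This is what makes adding an edge more expensive.
addEdge :: Label -> Label -> CFG -> CFG
addEdge from to =
  M.insertWith S.union to S.empty . M.insertWith S.union from (S.singleton to)

-- The invariant: every node mentioned as a successor is itself a key.
hasAllNodes :: CFG -> Bool
hasAllNodes cfg = all (`M.member` cfg) (concatMap S.toList (M.elems cfg))

-- Successor lookup. Under the invariant, Nothing always indicates a bug.
successors :: CFG -> Label -> Maybe [Label]
successors cfg l = S.toList <$> M.lookup l cfg

-- The CFG [A -> B -> C] from the text, built through addEdge.
chain :: CFG
chain = addEdge "B" "C" (addEdge "A" "B" M.empty)
```

With this representation `M.keys chain` lists all three blocks, whereas a map built directly from `[(A,B),(B,C)]` fails `hasAllNodes` because C never appears as a key.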
-----------------------------------------------------------
Part 2: When the NCG generated a new basic block, it did
not always insert a NEWBLOCK meta instruction in the stream which
caused a quite subtle bug.
    During instruction selection a statement `s`
    in a block B with control of the sort: B -> C
    will sometimes result in control
    flow of the sort:
            ┌ < ┐
            v   ^
      B ->  B1  ┴ -> C
    as is the case for some atomic operations.
    Now to keep the CFG in sync when introducing B1 we clearly
    want to insert it between B and C. However there is
    a catch when we have to deal with self loops.
    We might start with code and a CFG of these forms:
    loop:
        stmt1               ┌ < ┐
        ....                v   ^
        stmtX              loop ┘
        stmtY
        ....
        goto loop:
    Now we introduce B1:
                            ┌ ─ ─ ─ ─ ─┐
        loop:               │   ┌ <  ┐ │
        instrs              v   │    │ ^
        ....               loop ┴ B1 ┴ ┘
        instrsFromX
        stmtY
        goto loop:
    This is simple, all outgoing edges from loop now simply
    start from B1 instead and the code generator knows which
    new edges it introduced for the self loop of B1.
    Disaster strikes if the statement Y follows the same pattern.
    If we apply the same rule that all outgoing edges change then
    we end up with:
        loop ─> B1 ─> B2 ┬─┐
          │      │    └─<┤ │
          │      └───<───┘ │
          └───────<────────┘
    This is problematic. The edge B1->B1 is modified as expected.
    However the modification is wrong!
    The assembly in this case looked like this:
    _loop:
        <instrs>
    _B1:
        ...
        cmpxchgq ...
        jne _B1
        <instrs>
        <end _B1>
    _B2:
        ...
        cmpxchgq ...
        jne _B2
        <instrs>
        jmp loop
    There is no edge _B2 -> _B1 here. It's still a self loop onto _B1.
    The problem here is that really B1 should be two basic blocks.
    Otherwise we have control flow in the *middle* of a basic block.
    A contradiction!
    So to account for this we add yet another basic block marker:
    _B:
        <instrs>
    _B1:
        ...
        cmpxchgq ...
        jne _B1
        jmp _B1'
    _B1':
        <instrs>
        <end _B1>
    _B2:
        ...
    Now when inserting B2 we will only look at the outgoing edges of B1' and
    everything will work out nicely.
    You might also wonder why we don't insert jumps at the end of _B1'. There is
    no way another block ends up jumping to the labels _B1 or _B2 since they are
    essentially invisible to other blocks. View them as control flow labels local
    to the basic block if you'd like.
    Not doing this ultimately caused (part 2 of) #17334. | 
| | |  | 
| | 
| | Add StgToCmm module hierarchy. Platform modules that are used in several
other places (NCG, LLVM codegen, Cmm transformations) are put into
GHC.Platform. | 
| | 
| | Unfortunately this will require more work; register allocation is
quite broken.
This reverts commit acd795583625401c5554f8e04ec7efca18814011. | 
| | 
| | This adds support for constructing vector types from Float#, Double#, etc.
and performing arithmetic operations on them.
Cleaned-Up-By: Ben Gamari <ben@well-typed.com> | 
| | 
| | ghc-pkg needs to be aware of platforms so it can figure out which
subdir within the user package db to use. This is admittedly
roundabout, but maybe Cabal could use the same notion of a platform as
GHC to good effect too. | 
| | 
| | * simplifies registers to have GPR, Float and Double, by removing the SSE2 and X87 constructors
* makes -msse2 assumed/default for x86 platforms, fixing a long-standing nondeterminism in rounding
behavior in 32-bit Haskell code
* removes the 80-bit floating point representation from the supported float sizes
* there's still one tiny bit of x87 support needed,
for handling float and double return values in FFI calls wrt the C ABI on x86_32,
but this one piece does not leak into the rest of the NCG.
* Lots of code that has not been touched in a long time got deleted as a
consequence of all of this
All in all, this change paves the way towards a lot of future
improvements in how GHC handles floating point computations, along with
making the native code gen more accessible to a larger pool of contributors. | 
| | |  | 
| | 
| | The graph allocator now dynamically resizes the number of stack
slots when running into the limit.
This fixes #8657.
Also loop membership of basic blocks is now available
in the register allocator for cost heuristics. | 
| | 
| | under -mbmi2
This works similarly to existing implementation for popCount.
Trac ticket: #16086. | 
| | 
| | This reverts commit 76c8fd674435a652c75a96c85abbf26f1f221876. | 
| | |  | 
| | 
| | Summary:
This patch implements a new code layout algorithm.
It has been tested for x86 and is disabled on other platforms.
Performance varies slightly by CPU/machine but in general seems to be better
by around 2%.
Nofib shows only small differences of about +/- 0.5% overall, depending on
flags and machine. Performance in other benchmarks improved significantly.
The other benchmarks include at least those of: aeson, vector, megaparsec, attoparsec,
containers, text and xeno.
While the magnitude of the gains differed, three different CPUs were tested, with
all getting faster although to differing degrees. I tested: Sandy Bridge (Xeon), Haswell,
Skylake
* Library benchmark results summarized:
  * containers: ~1.5% faster
  * aeson: ~2% faster
  * megaparsec: ~2-5% faster
  * xml library benchmarks: 0.2%-1.1% faster
  * vector-benchmarks: 1-4% faster
  * text: 5.5% faster
On average GHC compile times go down, as GHC compiled with the new layout
is faster than the overhead introduced by using the new layout algorithm.
Things this patch does:
* Move code responsible for block layout into its own module.
* Move the NcgImpl Class into the NCGMonad module.
* Extract a control flow graph from the input cmm.
* Update this cfg to keep it in sync with changes during
  asm codegen. This has been tested on x64 but should work on x86.
  Other platforms still use the old codelayout.
* Assign weights to the edges in the CFG based on type and limited static
  analysis which are then used for block layout.
* Once we have the final code layout eliminate some redundant jumps.
  In particular turn a sequence of:
      jne .foo
      jmp .bar
    foo:
  into
      je bar
    foo:
      ..
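The jump elimination described in the last item can be sketched as a peephole pass over a toy instruction list. The `Instr` and `Cond` types and the function `shortcutJumps` are hypothetical and far simpler than the NCG's real instruction type; the sketch only handles the exact three-instruction pattern shown above.

```haskell
-- Toy conditions and instructions; not GHC's x86 Instr type.
data Cond = Eq | Ne
  deriving (Eq, Show)

data Instr
  = Jcc Cond String -- conditional jump to a label
  | Jmp String      -- unconditional jump to a label
  | Label String    -- start of a block
  | Other String    -- any other instruction
  deriving (Eq, Show)

-- Negate a condition.
invert :: Cond -> Cond
invert Eq = Ne
invert Ne = Eq

-- Rewrite "jcc L1; jmp L2; L1:" into "j(not cc) L2; L1:". This is valid
-- because falling through to L1 is the same as jumping to it.
shortcutJumps :: [Instr] -> [Instr]
shortcutJumps (Jcc c l1 : Jmp l2 : Label l : rest)
  | l1 == l = Jcc (invert c) l2 : shortcutJumps (Label l : rest)
shortcutJumps (i : rest) = i : shortcutJumps rest
shortcutJumps [] = []
```

For example, `shortcutJumps [Jcc Ne "foo", Jmp "bar", Label "foo"]` yields `[Jcc Eq "bar", Label "foo"]`. The rewrite only fires once the final layout has placed the conditional target directly after the jump pair, which is why it has to run after block layout.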
Test Plan: ci
Reviewers: bgamari, jmct, jrtc27, simonmar, simonpj, RyanGlScott
Reviewed By: RyanGlScott
Subscribers: RyanGlScott, trommler, jmct, carter, thomie, rwbarton
GHC Trac Issues: #15124
Differential Revision: https://phabricator.haskell.org/D4726 | 
| | 
| | This is the first step of implementing:
https://github.com/ghc-proposals/ghc-proposals/pull/74
The main highlights/changes:
    primops.txt.pp gets two new sections for two new primitive types for
    signed and unsigned 8-bit integers (Int8# and Word8# respectively) along
    with basic arithmetic and comparison operations. PrimRep/RuntimeRep get
    two new constructors for them. All of the primops translate into the
    existing MachOps.
    For CmmCalls the codegen will now zero-extend the values at the call
    site (so that they can be moved to the right register) and then truncate
    them back to their original width.
    x86 native codegen needed some updates, since it wasn't able to deal
    with the new widths, but all the changes are quite localized. LLVM
    backend seems to just work.
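The zero-extend and truncate round trip can be illustrated at the source level with boxed types. This is only an analogy for what the code generator does with registers; `zeroExtend8` and `truncate8` are hypothetical helpers, and the choice of a 64-bit word as the "register" is an assumption.

```haskell
import Data.Int (Int8)
import Data.Word (Word64, Word8)

-- Zero-extend an 8-bit signed value to a register-sized word: the upper
-- 56 bits become zero, so -1 turns into 255 rather than 2^64 - 1.
zeroExtend8 :: Int8 -> Word64
zeroExtend8 x = fromIntegral (fromIntegral x :: Word8)

-- Truncate a register-sized word back to 8 bits and reinterpret the
-- result as a signed value.
truncate8 :: Word64 -> Int8
truncate8 w = fromIntegral (fromIntegral w :: Word8)
```

Truncating after zero-extending gives back the original value for every `Int8`, so the extension is safe regardless of the sign of the argument.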
This is the second attempt at merging this, after the first attempt in
D4475 had to be backed out due to regressions on i386.
Bumps binary submodule.
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: ./validate (on both x86-{32,64})
Reviewers: bgamari, hvr, goldfire, simonmar
Subscribers: rwbarton, carter
Differential Revision: https://phabricator.haskell.org/D5258 | 
| | 
| | This unfortunately broke i386 support since it introduced references to
byte-sized registers that don't exist on that architecture.
Reverts binary submodule
This reverts commit 5d5307f943d7581d7013ffe20af22233273fba06. | 
| | 
| | This is the first step of implementing:
https://github.com/ghc-proposals/ghc-proposals/pull/74
The main highlights/changes:
- `primops.txt.pp` gets two new sections for two new primitive types
  for signed and unsigned 8-bit integers (`Int8#` and `Word8#`
  respectively) along with basic arithmetic and comparison
  operations. `PrimRep`/`RuntimeRep` get two new constructors for
  them. All of the primops translate into the existing `MachOp`s.
- For `CmmCall`s the codegen will now zero-extend the values at the call
  site (so that they can be moved to the right register) and then
  truncate them back to their original width.
- x86 native codegen needed some updates, since it wasn't able to deal
  with the new widths, but all the changes are quite localized. LLVM
  backend seems to just work.
Bumps binary submodule.
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: ./validate with new tests
Reviewers: hvr, goldfire, bgamari, simonmar
Subscribers: Abhiroop, dfeuer, rwbarton, thomie, carter
Differential Revision: https://phabricator.haskell.org/D4475 | 
| | 
| | Summary:
On Windows one is not allowed to drop the stack by more than a page size.
The reason for this is that the OS only allocates stack up to what
the TEB specifies. After that a guard page is placed and the rest of the
virtual address space is unmapped.
The intention is that doing stack allocations will cause you to hit the
guard, which will then map the next page in and move the guard.  This is
done to prevent what in the Linux world is known as stack clash
vulnerabilities https://access.redhat.com/security/cve/cve-2017-1000364.
There are modules in GHC for which the liveness analysis thinks the
reserved 8KB of spill slots isn't enough.  One is DynFlags and the
other is Cabal.
Though I think the Cabal one is likely a bug:
```
  4d6544:       81 ec 00 46 00 00       sub    $0x4600,%esp
  4d654a:       8d 85 94 fe ff ff       lea    -0x16c(%ebp),%eax
  4d6550:       3b 83 1c 03 00 00       cmp    0x31c(%ebx),%eax
  4d6556:       0f 82 de 8d 02 00       jb     4ff33a <_cLpg_info+0x7a>
  4d655c:       c7 45 fc 14 3d 50 00    movl   $0x503d14,-0x4(%ebp)
  4d6563:       8b 75 0c                mov    0xc(%ebp),%esi
  4d6566:       83 c5 fc                add    $0xfffffffc,%ebp
  4d6569:       66 f7 c6 03 00          test   $0x3,%si
  4d656e:       0f 85 a6 d7 02 00       jne    503d1a <_cLpb_info+0x6>
  4d6574:       81 c4 00 46 00 00       add    $0x4600,%esp
```
It allocates nearly 18KB of spill slots for a simple 4 line function
and doesn't even use it.  Note that this doesn't happen on x64 or
when making a validate build.  Only when making a build without a
validate and build.mk.
This and the allocation in DynFlags mean the stack allocation will jump
over the guard page into unmapped memory and GHC, or a program it compiled,
segfaults.
The page size on x86 Windows is 4KB, which means we hit it very easily for
these two modules. This explains the total DOA of 32-bit GHC for the past
3 releases and the "random" segfaults on Windows.
```
0:000> bp 00503d29
0:000> gn
Breakpoint 0 hit
WARNING: Stack overflow detected. The unwound frames are extracted from outside
         normal stack bounds.
eax=03b6b9c9 ebx=00dc90f0 ecx=03cac48c edx=03cac43d esi=03b6b9c9 edi=03abef40
eip=00503d29 esp=013e96fc ebp=03cf8f70 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
setup+0x103d29:
00503d29 89442440        mov     dword ptr [esp+40h],eax ss:002b:013e973c=????????
WARNING: Stack overflow detected. The unwound frames are extracted from outside
         normal stack bounds.
WARNING: Stack overflow detected. The unwound frames are extracted from outside
         normal stack bounds.
0:000> !teb
TEB at 00384000
    ExceptionList:        013effcc
    StackBase:            013f0000
    StackLimit:           013eb000
```
This doesn't fix the liveness analysis but does fix the allocations, by
emitting a function call to `__chkstk_ms` when doing allocations larger
than a page. This makes sure the stack is probed every page so the kernel
maps in the next page.
`__chkstk_ms` is provided by `libgcc`, which is covered by the
GCC Runtime Library Exception, so it's safe to link against it, even for
proprietary code. (Technically we already do since we link compiled C code in.)
For allocations smaller than a page we drop the stack and probe the new address.
This avoids the function call and still makes sure we hit the guard if needed.
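The two cases described above can be sketched as a small decision function. This is only an illustration: `Instr`, `allocStack` and `pageSize` are hypothetical names, not GHC's real NCG types, and the sketch assumes that the helper only probes and that the caller still adjusts `%esp` afterwards. It also treats an allocation of exactly one page as "large", which the text above does not specify.

```haskell
-- Hypothetical, simplified instruction type; not GHC's real Instr.
data Instr
  = SubEsp Int      -- sub $n, %esp
  | ProbeEsp        -- touch (%esp) so a guard page hit is triggered
  | CallChkstk Int  -- call __chkstk_ms for an n-byte allocation
  deriving (Eq, Show)

-- Page size on x86 Windows, as stated in the text above.
pageSize :: Int
pageSize = 4096

-- Choose how to allocate n bytes of stack.
allocStack :: Int -> [Instr]
allocStack n
  | n <= 0       = []
  -- Small allocation: drop the stack, then probe the new address.
  | n < pageSize = [SubEsp n, ProbeEsp]
  -- Large allocation: let the helper probe every page first (assumption:
  -- the helper does not move %esp itself), then drop the stack.
  | otherwise    = [CallChkstk n, SubEsp n]
```

With this sketch, the 0x4600-byte allocation from the Cabal disassembly takes the `CallChkstk` path, while a 64-byte frame gets the inline probe.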
PS: In case anyone is wondering why we didn't notice this before, it's because we
only test on x86_64 and on Windows 10.  On x86_64 the page size is 8KB, and the
kernel is also a bit more lenient on Windows 10 in that it seems to catch the
segfault and resize the stack if it was unmapped:
```
0:000> t
eax=03b6b9c9 ebx=00dc90f0 ecx=03cac48c edx=03cac43d esi=03b6b9c9 edi=03abef40
eip=00503d2d esp=013e96fc ebp=03cf8f70 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
setup+0x103d2d:
00503d2d 8b461b          mov     eax,dword ptr [esi+1Bh] ds:002b:03b6b9e4=03cac431
0:000> !teb
TEB at 00384000
    ExceptionList:        013effcc
    StackBase:            013f0000
    StackLimit:           013e9000
```
Likely Windows 10 has a guard page larger than previous versions.
This fixes the stack allocations, and as soon as I get the time I will look at
the liveness analysis. I find it highly unlikely that a simple Cabal function
requires ~2200 spill slots.
Test Plan: ./validate
Reviewers: simonmar, bgamari
Reviewed By: bgamari
Subscribers: AndreasK, rwbarton, thomie, carter
GHC Trac Issues: #15154
Differential Revision: https://phabricator.haskell.org/D4917 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Jump tables always point to blocks when we first generate them.  However,
there are rare situations where we can shortcut one of these blocks to a
static address during the asm shortcutting pass.
While we already updated the data section accordingly, this patch
extends the update to the references stored in JMP_TBL.
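As a rough illustration of the idea (not the actual patch), rewriting the table entries could look like the sketch below. `Target`, `shortcutTable` and the use of plain `Int` block ids are assumptions made for the example, and it resolves only one level of shortcut.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical stand-ins for GHC's BlockId and jump-table entries.
type BlockId = Int

data Target
  = ToBlock BlockId   -- entry still points at a basic block
  | ToStatic String   -- entry was shortcut to a static label
  deriving (Eq, Show)

-- Apply a shortcut map to every entry of a jump table.
-- Entries that are absent (Nothing) or not in the map are left alone.
shortcutTable :: M.Map BlockId Target -> [Maybe Target] -> [Maybe Target]
shortcutTable shortcuts = map (fmap rewrite)
  where
    rewrite (ToBlock b) = M.findWithDefault (ToBlock b) b shortcuts
    rewrite t           = t
```

The point of the sketch is only that the table stored in the instruction and the emitted data section are rewritten with the same mapping, so they cannot drift apart.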
Test Plan: ci
Reviewers: bgamari
Reviewed By: bgamari
Subscribers: thomie, carter
GHC Trac Issues: #15104
Differential Revision: https://phabricator.haskell.org/D4595 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
This change makes it possible to generate a static 32-bit relative label
offset on x86_64. Currently we can only generate word-sized label
offsets.
This will be used in D4634 to shrink info tables.  See D4632 for more
details.
Test Plan: See D4632
Reviewers: bgamari, niteria, michalt, erikd, jrtc27, osa1
Subscribers: thomie, carter
Differential Revision: https://phabricator.haskell.org/D4633 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Without updating the JMP_TBL information the block list in
JMP_TBL contained blocks which were eliminated in some circumstances.
The actual assembly generation doesn't look at these fields so this
didn't cause any bugs yet. However as long as we carry this information
around we should make an effort to keep it correct.
Especially since it's useful for debugging purposes and can be used
for passes near the end of the codegen pipeline.
In particular, it's used by jumpDestsOfInstr, which without these changes
returns the wrong destinations.
Test Plan: ci
Reviewers: bgamari
Subscribers: thomie, carter
Differential Revision: https://phabricator.haskell.org/D4566 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This adds support for the bit deposit and extraction operations provided
by the BMI and BMI2 instruction set extensions on modern amd64 machines.
Implement x86 code generator for pdep and pext.  Properly initialise
bmiVersion field.
pdep and pext test cases
Fix pattern match for pdep and pext instructions
Fix build of pdep and pext code for 32-bit architectures
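For readers unfamiliar with the two operations, here is a bit-by-bit reference model in plain Haskell. It is only meant to show the semantics of parallel bit deposit and extract on 64-bit words; the names `pdepRef` and `pextRef` are made up for this sketch and it is not the code generator's implementation.

```haskell
import Data.Bits (setBit, testBit)
import Data.Word (Word64)

-- Deposit: scatter the low bits of src into the positions where mask is set.
pdepRef :: Word64 -> Word64 -> Word64
pdepRef src mask = go 0 0 0
  where
    -- i walks over mask bits, k counts how many source bits were consumed.
    go i k acc
      | i >= 64        = acc
      | testBit mask i =
          go (i + 1) (k + 1) (if testBit src k then setBit acc i else acc)
      | otherwise      = go (i + 1) k acc

-- Extract: gather the bits of src selected by mask into the low bits.
pextRef :: Word64 -> Word64 -> Word64
pextRef src mask = go 0 0 0
  where
    -- i walks over mask bits, k is the next free low bit of the result.
    go i k acc
      | i >= 64        = acc
      | testBit mask i =
          go (i + 1) (k + 1) (if testBit src i then setBit acc k else acc)
      | otherwise      = go (i + 1) k acc
```

For example, with mask 26 (bits 1, 3 and 4 set), `pdepRef 5 26` is 18 and `pextRef 18 26` is 5, so the two operations undo each other on the masked bits.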
Test Plan: Validate
Reviewers: austin, simonmar, bgamari, angerman
Reviewed By: bgamari
Subscribers: trommler, carter, angerman, thomie, rwbarton, newhoggy
GHC Trac Issues: #14206
Differential Revision: https://phabricator.haskell.org/D4236 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| | blockLbl was originally changed in 8b007abbeb3045900a11529d907a835080129176 to
use mkTempAsmLabel to fix an inconsistency resulting in #14221. However, this
breaks the C code generator, which doesn't support AsmTempLabels (#14454).
Instead let's try going the other direction: use a new CLabel variety,
LocalBlockLabel. Then we can teach the C code generator to deal with
these as well. | 
| | 
| 
| 
| 
| 
| | This broke the 32-bit build.
This reverts commit f5dc8ccc29429d0a1d011f62b6b430f6ae50290c. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This adds support for the bit deposit and extraction operations provided
by the BMI and BMI2 instruction set extensions on modern amd64 machines.
Test Plan: Validate
Reviewers: austin, simonmar, bgamari, hvr, goldfire, erikd
Reviewed By: bgamari
Subscribers: goldfire, erikd, trommler, newhoggy, rwbarton, thomie
GHC Trac Issues: #14206
Differential Revision: https://phabricator.haskell.org/D4063 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This switches the compiler/ component to get compiled with
-XNoImplicitPrelude, and an `import GhcPrelude` is inserted in all
modules.
This is motivated by the upcoming "Prelude" re-export of
`Semigroup((<>))`, which would cause lots of name clashes in every
module which also imports `Outputable`.
Reviewers: austin, goldfire, bgamari, alanz, simonmar
Reviewed By: bgamari
Subscribers: goldfire, rwbarton, thomie, mpickering, bgamari
Differential Revision: https://phabricator.haskell.org/D3989 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This copies the subset of Hoopl's functionality needed by GHC to
`cmm/Hoopl` and removes the dependency on the Hoopl package.
The main motivation for this change is the confusing/noisy interface
between GHC and Hoopl:
- Hoopl has `Label` which is GHC's `BlockId` but different than
  GHC's `CLabel`
- Hoopl has `Unique` which is different than GHC's `Unique`
- Hoopl has `Unique{Map,Set}` which are different than GHC's
  `Uniq{FM,Set}`
- GHC has its own specialized copy of `Dataflow`, so `cmm/Hoopl` is
  needed just to filter the exposed functions (filter out some of the
  Hoopl's and add the GHC ones)
With this change, we'll be able to simplify this significantly.
It'll also be much easier to do invasive changes (Hoopl is a public
package on Hackage with users that depend on the current behavior)
This should introduce no changes in functionality - it merely
copies the relevant code.
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: ./validate
Reviewers: austin, bgamari, simonmar
Reviewed By: bgamari, simonmar
Subscribers: simonpj, kavon, rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3616 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| | Reviewers: austin, dfeuer
Subscribers: dfeuer, rwbarton, thomie
GHC Trac Issues: #13629
Differential Revision: https://phabricator.haskell.org/D3508 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | While this apparently didn't matter on Linux, the OS X toolchain seems
to treat local and external symbols differently during linking. Namely,
the linker assumes that an external symbol marks the beginning of a new,
unused procedure, and consequently drops it.
Fixes regression introduced in D2741.
Test Plan: `debug` testcase on OS X
Reviewers: austin, simonmar, rwbarton
Reviewed By: rwbarton
Subscribers: rwbarton, thomie
Differential Revision: https://phabricator.haskell.org/D3135 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | As discussed in D1532, Trac #11337, and Trac #11338, the stack
unwinding information produced by GHC is currently quite approximate.
Essentially we assume that register values do not change at all within a
basic block. While this is somewhat true in normal Haskell code, blocks
containing foreign calls often break this assumption. This results in
unreliable call stacks, especially in the code containing foreign calls.
This is worse than it sounds as unreliable unwinding information can at
times result in segmentation faults.
This patch set attempts to improve this situation by tracking unwinding
information with finer granularity. By dispensing with the assumption of
one unwinding table per block, we allow the compiler to accurately
represent the areas surrounding foreign calls.
Towards this end we generalize the representation of unwind information
in the backend in three ways:
 * Multiple CmmUnwind nodes can occur per block
 * CmmUnwind nodes can now carry unwind information for multiple
   registers (while not strictly necessary, this makes emitting
   unwinding information a bit more convenient in the compiler)
 * The NCG backend is given an opportunity to modify the unwinding
   records since it may need to make adjustments due to, for instance,
   native calling convention requirements for foreign calls (see
   #11353).
This sets the stage for resolving #11337 and #11338.
Test Plan: Validate
Reviewers: scpmw, simonmar, austin, erikd
Subscribers: qnikst, thomie
Differential Revision: https://phabricator.haskell.org/D2741 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This continues removal of `BlockId` module in favor of Hoopl's `Label`.
Most of the changes here are mechanical, apart from the orphan
`Outputable` instances for `LabelMap` and `LabelSet`.  For now I've
moved them to `cmm/Hoopl`, since it's already trying to manage all
imports from Hoopl (to avoid any collisions).
Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>
Test Plan: validate
Reviewers: bgamari, austin, simonmar
Reviewed By: simonmar
Subscribers: thomie
Differential Revision: https://phabricator.haskell.org/D2800 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This reverses some of the work done in Trac #1405, and assumes GHC is
smart enough to do its own unboxing of booleans now.
I would like to do some more performance measurements, but the code
changes can be reviewed already.
Test Plan:
With a perf build:
./inplace/bin/ghc-stage2 nofib/spectral/simple/Main.hs -fforce-recomp
+RTS -t --machine-readable
before:
```
  [("bytes allocated", "1300744864")
  ,("num_GCs", "302")
  ,("average_bytes_used", "8811118")
  ,("max_bytes_used", "24477464")
  ,("num_byte_usage_samples", "9")
  ,("peak_megabytes_allocated", "64")
  ,("init_cpu_seconds", "0.001")
  ,("init_wall_seconds", "0.001")
  ,("mutator_cpu_seconds", "2.833")
  ,("mutator_wall_seconds", "4.283")
  ,("GC_cpu_seconds", "0.960")
  ,("GC_wall_seconds", "0.961")
  ]
```
after:
```
  [("bytes allocated", "1301088064")
  ,("num_GCs", "310")
  ,("average_bytes_used", "8820253")
  ,("max_bytes_used", "24539904")
  ,("num_byte_usage_samples", "9")
  ,("peak_megabytes_allocated", "64")
  ,("init_cpu_seconds", "0.001")
  ,("init_wall_seconds", "0.001")
  ,("mutator_cpu_seconds", "2.876")
  ,("mutator_wall_seconds", "4.474")
  ,("GC_cpu_seconds", "0.965")
  ,("GC_wall_seconds", "0.979")
  ]
```
CPU time seems to be up a bit, but I'm not sure. Unfortunately CPU time
measurements are rather noisy.
Reviewers: austin, bgamari, rwbarton
Subscribers: nomeata
Differential Revision: https://phabricator.haskell.org/D1143
GHC Trac Issues: #1405 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This commit renames the Size module in the native code generator to
Format, as proposed by a todo, as well as adjusting parameter names in
other modules that use it.
Test Plan: validate
Reviewers: austin, simonmar, bgamari
Reviewed By: simonmar, bgamari
Subscribers: bgamari, simonmar, thomie
Projects: #ghc
Differential Revision: https://phabricator.haskell.org/D865 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This generates DWARF, albeit indirectly using the assembler. This is
the easiest (and, apparently, quite standard) method of generating the
.debug_line DWARF section.
Notes:
* Note we have to make sure that .file directives appear correctly
  before the respective .loc. Right now we ppr them manually, which makes
  them absent from dumps. Fixing this would require .file to become a
  native instruction.
* We have to pass a lot of things around the native code generator. I
  know Ian did quite a bit of refactoring already, but having one common
  monad could *really* simplify things here...
* To support SplitObjs, we need to emit/reset all DWARF data at every
  split. We use the occasion to move split marker generation to
  cmmNativeGenStream as well, so debug data extraction doesn't have to
  choke on it.
(From Phabricator D396) | 
| | 
| 
| 
| 
| 
| 
| 
| | This reverts commit f0fcc41d755876a1b02d1c7c79f57515059f6417.
New changes: now works on 32-bit platforms too.  I added some basic
support for 64-bit subtraction and comparison operations to the x86
NCG. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Don't export `getUs` and `getUniqueUs`. `UniqSM` has a `MonadUnique` instance:
    instance MonadUnique UniqSM where
        getUniqueSupplyM = getUs
        getUniqueM  = getUniqueUs
        getUniquesM = getUniquesUs
Commandline-fu used:
    git grep -l 'getUs\>' |
        grep -v compiler/basicTypes/UniqSupply.lhs |
        xargs sed -i 's/getUs/getUniqueSupplyM/g'
    git grep -l 'getUniqueUs\>' |
        grep -v compiler/basicTypes/UniqSupply.lhs |
        xargs sed -i 's/getUniqueUs/getUniqueM/g'
Follow up on b522d3a3f970a043397a0d6556ca555648e7a9c3
Reviewed By: austin, hvr
Differential Revision: https://phabricator.haskell.org/D220 | 
| | 
| 
| 
| 
| 
| | ...some files more or less recently touched by me
[ci skip] | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
These MachOps are used by addIntC# and subIntC#, which in turn are
used in integer-gmp when adding or subtracting small Integers. The
following benchmark shows a ~6% speedup after this commit on x86_64
(building GHC with BuildFlavour=perf).
    {-# LANGUAGE MagicHash #-}
    import GHC.Exts
    import Criterion.Main
    count :: Int -> Integer
    count (I# n#) = go n# 0
      where go :: Int# -> Integer -> Integer
            go 0# acc = acc
            go n# acc = go (n# -# 1#) $! acc + 1
    main = defaultMain [bgroup "count"
                          [bench "100" $ whnf count 100]]
Differential Revision: https://phabricator.haskell.org/D140 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This is a pre-requisite for implementing count-{leading,trailing}-zero
prim-ops (re #9340) and may be useful to NCG to help turn some code into
branch-less code sequences.
Test Plan: Compiles and validates in combination with clz/ctz primop impl
Reviewers: ezyang, rwbarton, simonmar, austin
Subscribers: simonmar, relrod, ezyang, carter
Differential Revision: https://phabricator.haskell.org/D141 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This is a pre-requisite for implementing count-{leading,trailing}-zero
prim-ops (re #9340)
Reviewers: ezyang, rwbarton, simonmar, austin
Subscribers: simonmar, relrod, ezyang, carter
Differential Revision: https://phabricator.haskell.org/D141 | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Test Plan:
 - ran validate
 - ran T9013 test with all ways
 - ran CarryOverflow test with all ways, for good measure
Reviewers: austin, simonmar
Reviewed By: simonmar
Differential Revision: https://phabricator.haskell.org/D137 | 
| | |  | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| | Previously, LOCK was a separate instruction, and this led the register
allocator to separate it from the instruction it was supposed to be a
prefix of, leading to illegal assembly such as
    lock mov
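A toy model may help show why making the prefix wrap its instruction avoids this. The `Instr` type, `render` and `withReloads` below are hypothetical and much simpler than the real NCG; `withReloads` merely stands in for the register allocator inserting code between instructions.

```haskell
-- Hypothetical, simplified instruction type for illustration only.
data Instr
  = Mov String String
  | Xadd String String
  | Lock Instr          -- the prefix carries the instruction it applies to
  deriving (Eq, Show)

-- Pretty-print one instruction; a locked instruction stays on one line.
render :: Instr -> String
render (Mov s d)  = "mov " ++ s ++ ", " ++ d
render (Xadd s d) = "xadd " ++ s ++ ", " ++ d
render (Lock i)   = "lock " ++ render i

-- A toy pass that inserts a reload before every instruction, standing in
-- for the register allocator adding spill/reload code.
withReloads :: [Instr] -> [Instr]
withReloads = concatMap (\i -> [Mov "4(%esp)" "%ecx", i])
```

Because `Lock` wraps the `Xadd`, a pass that works on list elements can only insert code before or after the pair, never between the prefix and the instruction.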
Fix contributed by PÁLI Gábor János. | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | This is the second attempt to add this functionality. The first
attempt was reverted in 950fcae46a82569e7cd1fba1637a23b419e00ecd, due
to register allocator failure on x86. Given how the register
allocator currently works, we don't have enough registers on x86 to
support cmpxchg using complicated addressing modes. Instead we fall
back to a simpler addressing mode on x86.
Adds the following primops:
 * atomicReadIntArray#
 * atomicWriteIntArray#
 * fetchSubIntArray#
 * fetchOrIntArray#
 * fetchXorIntArray#
 * fetchAndIntArray#
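As a usage illustration, the sketch below exercises a few of the primops listed above on a single-element byte array. The function name `atomicDemo` is made up for the example, and it is single-threaded, so it only shows how the primops are called rather than demonstrating any atomicity guarantee. The index argument counts `Int`-sized elements, while `newByteArray#` takes a size in bytes.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts
import GHC.IO (IO (..))

-- Allocate 8 bytes, then write, subtract, or and read back element 0.
-- The values returned by the fetch primops are ignored here.
atomicDemo :: IO Int
atomicDemo = IO $ \s0 ->
  case newByteArray# 8# s0 of
    (# s1, mba #) ->
      case atomicWriteIntArray# mba 0# 10# s1 of          -- element := 10
        s2 -> case fetchSubIntArray# mba 0# 3# s2 of      -- element := 10 - 3
          (# s3, _ #) -> case fetchOrIntArray# mba 0# 8# s3 of  -- element := 7 .|. 8
            (# s4, _ #) -> case atomicReadIntArray# mba 0# s4 of
              (# s5, v #) -> (# s5, I# v #)
```

Running `atomicDemo` in GHCi should yield 15, since 10 - 3 is 7 and 7 with bit 3 set is 15.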
Makes these pre-existing out-of-line primops inline:
 * fetchAddIntArray#
 * casIntArray# | 
| | 
| 
| 
| 
| 
| 
| 
| | This commit caused the register allocator to fail on i386.
This reverts commit d8abf85f8ca176854e9d5d0b12371c4bc402aac3 and
04dd7cb3423f1940242fdfe2ea2e3b8abd68a177 (the second being a fix to
the first). | 
| | 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| | Summary:
Add more primops for atomic ops on byte arrays
Adds the following primops:
 * atomicReadIntArray#
 * atomicWriteIntArray#
 * fetchSubIntArray#
 * fetchOrIntArray#
 * fetchXorIntArray#
 * fetchAndIntArray#
Makes these pre-existing out-of-line primops inline:
 * fetchAddIntArray#
 * casIntArray# |