diff options
Diffstat (limited to 'docs/users_guide/profiling.rst')
-rw-r--r-- | docs/users_guide/profiling.rst | 1390 |
1 files changed, 1390 insertions, 0 deletions
diff --git a/docs/users_guide/profiling.rst b/docs/users_guide/profiling.rst new file mode 100644 index 0000000000..34525d12b8 --- /dev/null +++ b/docs/users_guide/profiling.rst @@ -0,0 +1,1390 @@ +.. _profiling: + +Profiling +========= + +.. index:: + single: profiling + single: cost-centre profiling + single: -p; RTS option + +GHC comes with a time and space profiling system, so that you can answer +questions like "why is my program so slow?", or "why is my program using +so much memory?". + +Profiling a program is a three-step process: + +1. Re-compile your program for profiling with the ``-prof`` option, and + probably one of the options for adding automatic annotations: + ``-fprof-auto`` is the most common [1]_. ``-fprof-auto`` + + If you are using external packages with ``cabal``, you may need to + reinstall these packages with profiling support; typically this is + done with ``cabal install -p package --reinstall``. + +2. Having compiled the program for profiling, you now need to run it to + generate the profile. For example, a simple time profile can be + generated by running the program with ``+RTS -p``, which generates a file + named ``⟨prog⟩.prof`` where ⟨prog⟩ is the name of your program (without the + ``.exe`` extension, if you are on Windows). + + There are many different kinds of profile that can be generated, + selected by different RTS options. We will be describing the various + kinds of profile throughout the rest of this chapter. Some profiles + require further processing using additional tools after running the + program. + +3. Examine the generated profiling information, use the information to + optimise your program, and repeat as necessary. + +.. _cost-centres: + +Cost centres and cost-centre stacks +----------------------------------- + +GHC's profiling system assigns costs to cost centres. A cost is simply +the time or space (memory) required to evaluate an expression. Cost +centres are program annotations around expressions; all costs incurred +by the annotated expression are assigned to the enclosing cost centre. +Furthermore, GHC will remember the stack of enclosing cost centres for +any given expression at run-time and generate a call-tree of cost +attributions. + +Let's take a look at an example: + +:: + + main = print (fib 30) + fib n = if n < 2 then 1 else fib (n-1) + fib (n-2) + +Compile and run this program as follows: + +:: + + $ ghc -prof -fprof-auto -rtsopts Main.hs + $ ./Main +RTS -p + 121393 + $ + +When a GHC-compiled program is run with the ``-p`` RTS option, it +generates a file called ``prog.prof``. In this case, the file will +contain something like this: + +:: + + Wed Oct 12 16:14 2011 Time and Allocation Profiling Report (Final) + + Main +RTS -p -RTS + + total time = 0.68 secs (34 ticks @ 20 ms) + total alloc = 204,677,844 bytes (excludes profiling overheads) + + COST CENTRE MODULE %time %alloc + + fib Main 100.0 100.0 + + + individual inherited + COST CENTRE MODULE no. entries %time %alloc %time %alloc + + MAIN MAIN 102 0 0.0 0.0 100.0 100.0 + CAF GHC.IO.Handle.FD 128 0 0.0 0.0 0.0 0.0 + CAF GHC.IO.Encoding.Iconv 120 0 0.0 0.0 0.0 0.0 + CAF GHC.Conc.Signal 110 0 0.0 0.0 0.0 0.0 + CAF Main 108 0 0.0 0.0 100.0 100.0 + main Main 204 1 0.0 0.0 100.0 100.0 + fib Main 205 2692537 100.0 100.0 100.0 100.0 + +The first part of the file gives the program name and options, and the +total time and total memory allocation measured during the run of the +program (note that the total memory allocation figure isn't the same as +the amount of *live* memory needed by the program at any one time; the +latter can be determined using heap profiling, which we will describe +later in :ref:`prof-heap`). + +The second part of the file is a break-down by cost centre of the most +costly functions in the program. In this case, there was only one +significant function in the program, namely ``fib``, and it was +responsible for 100% of both the time and allocation costs of the +program. + +The third and final section of the file gives a profile break-down by +cost-centre stack. This is roughly a call-tree profile of the program. +In the example above, it is clear that the costly call to ``fib`` came +from ``main``. + +The time and allocation incurred by a given part of the program is +displayed in two ways: “individual”, which are the costs incurred by the +code covered by this cost centre stack alone, and “inherited”, which +includes the costs incurred by all the children of this node. + +The usefulness of cost-centre stacks is better demonstrated by modifying +the example slightly: + +:: + + main = print (f 30 + g 30) + where + f n = fib n + g n = fib (n `div` 2) + + fib n = if n < 2 then 1 else fib (n-1) + fib (n-2) + +Compile and run this program as before, and take a look at the new +profiling results: + +:: + + COST CENTRE MODULE no. entries %time %alloc %time %alloc + + MAIN MAIN 102 0 0.0 0.0 100.0 100.0 + CAF GHC.IO.Handle.FD 128 0 0.0 0.0 0.0 0.0 + CAF GHC.IO.Encoding.Iconv 120 0 0.0 0.0 0.0 0.0 + CAF GHC.Conc.Signal 110 0 0.0 0.0 0.0 0.0 + CAF Main 108 0 0.0 0.0 100.0 100.0 + main Main 204 1 0.0 0.0 100.0 100.0 + main.g Main 207 1 0.0 0.0 0.0 0.1 + fib Main 208 1973 0.0 0.1 0.0 0.1 + main.f Main 205 1 0.0 0.0 100.0 99.9 + fib Main 206 2692537 100.0 99.9 100.0 99.9 + +Now although we had two calls to ``fib`` in the program, it is +immediately clear that it was the call from ``f`` which took all the +time. The functions ``f`` and ``g`` which are defined in the ``where`` +clause in ``main`` are given their own cost centres, ``main.f`` and +``main.g`` respectively. + +The actual meaning of the various columns in the output is: + + The number of times this particular point in the call tree was + entered. + + The percentage of the total run time of the program spent at this + point in the call tree. + + The percentage of the total memory allocations (excluding profiling + overheads) of the program made by this call. + + The percentage of the total run time of the program spent below this + point in the call tree. + + The percentage of the total memory allocations (excluding profiling + overheads) of the program made by this call and all of its + sub-calls. + +In addition you can use the ``-P`` RTS option ``-P`` to get the +following additional information: + +``ticks`` + The raw number of time “ticks” which were attributed to this + cost-centre; from this, we get the ``%time`` figure mentioned above. + +``bytes`` + Number of bytes allocated in the heap while in this cost-centre; + again, this is the raw number from which we get the ``%alloc`` + figure mentioned above. + +What about recursive functions, and mutually recursive groups of +functions? Where are the costs attributed? Well, although GHC does keep +information about which groups of functions called each other +recursively, this information isn't displayed in the basic time and +allocation profile, instead the call-graph is flattened into a tree as +follows: a call to a function that occurs elsewhere on the current stack +does not push another entry on the stack, instead the costs for this +call are aggregated into the caller [2]_. + +.. _scc-pragma: + +Inserting cost centres by hand +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Cost centres are just program annotations. When you say ``-fprof-auto`` +to the compiler, it automatically inserts a cost centre annotation +around every binding not marked INLINE in your program, but you are +entirely free to add cost centre annotations yourself. + +The syntax of a cost centre annotation is + +:: + + {-# SCC "name" #-} <expression> + +where ``"name"`` is an arbitrary string, that will become the name of +your cost centre as it appears in the profiling output, and +``<expression>`` is any Haskell expression. An ``SCC`` annotation +extends as far to the right as possible when parsing. (SCC stands for +"Set Cost Centre"). The double quotes can be omitted if ``name`` is a +Haskell identifier, for example: + +:: + + {-# SCC my_function #-} <expression> + +Here is an example of a program with a couple of SCCs: + +:: + + main :: IO () + main = do let xs = [1..1000000] + let ys = [1..2000000] + print $ {-# SCC last_xs #-} last xs + print $ {-# SCC last_init_xs #-} last $ init xs + print $ {-# SCC last_ys #-} last ys + print $ {-# SCC last_init_ys #-} last $ init ys + +which gives this profile when run: + +:: + + COST CENTRE MODULE no. entries %time %alloc %time %alloc + + MAIN MAIN 102 0 0.0 0.0 100.0 100.0 + CAF GHC.IO.Handle.FD 130 0 0.0 0.0 0.0 0.0 + CAF GHC.IO.Encoding.Iconv 122 0 0.0 0.0 0.0 0.0 + CAF GHC.Conc.Signal 111 0 0.0 0.0 0.0 0.0 + CAF Main 108 0 0.0 0.0 100.0 100.0 + main Main 204 1 0.0 0.0 100.0 100.0 + last_init_ys Main 210 1 25.0 27.4 25.0 27.4 + main.ys Main 209 1 25.0 39.2 25.0 39.2 + last_ys Main 208 1 12.5 0.0 12.5 0.0 + last_init_xs Main 207 1 12.5 13.7 12.5 13.7 + main.xs Main 206 1 18.8 19.6 18.8 19.6 + last_xs Main 205 1 6.2 0.0 6.2 0.0 + +.. _prof-rules: + +Rules for attributing costs +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +While running a program with profiling turned on, GHC maintains a +cost-centre stack behind the scenes, and attributes any costs (memory +allocation and time) to whatever the current cost-centre stack is at the +time the cost is incurred. + +The mechanism is simple: whenever the program evaluates an expression +with an SCC annotation, ``{-# SCC c -#} E``, the cost centre ``c`` is +pushed on the current stack, and the entry count for this stack is +incremented by one. The stack also sometimes has to be saved and +restored; in particular when the program creates a thunk (a lazy +suspension), the current cost-centre stack is stored in the thunk, and +restored when the thunk is evaluated. In this way, the cost-centre stack +is independent of the actual evaluation order used by GHC at runtime. + +At a function call, GHC takes the stack stored in the function being +called (which for a top-level function will be empty), and *appends* it +to the current stack, ignoring any prefix that is identical to a prefix +of the current stack. + +We mentioned earlier that lazy computations, i.e. thunks, capture the +current stack when they are created, and restore this stack when they +are evaluated. What about top-level thunks? They are "created" when the +program is compiled, so what stack should we give them? The technical +name for a top-level thunk is a CAF ("Constant Applicative Form"). GHC +assigns every CAF in a module a stack consisting of the single cost +centre ``M.CAF``, where ``M`` is the name of the module. It is also +possible to give each CAF a different stack, using the option +``-fprof-cafs``. This is especially useful when +compiling with ``-ffull-laziness`` (as is default with ``-O`` and +higher), as constants in function bodies will be lifted to the top-level +and become CAFs. You will probably need to consult the Core +(``-ddump-simpl``) in order to determine what these CAFs correspond to. + +.. index:: + single: -fprof-cafs + +.. _prof-compiler-options: + +Compiler options for profiling +------------------------------ + +.. index:: + single: profiling; options + single: options; for profiling + +``-prof`` + .. index:: + single: -prof + + To make use of the profiling system *all* modules must be compiled + and linked with the ``-prof`` option. Any ``SCC`` annotations you've + put in your source will spring to life. + + Without a ``-prof`` option, your ``SCC``\ s are ignored; so you can + compile ``SCC``-laden code without changing it. + +There are a few other profiling-related compilation options. Use them +*in addition to* ``-prof``. These do not have to be used consistently +for all modules in a program. + +``-fprof-auto`` + .. index:: + single: -fprof-auto + + *All* bindings not marked INLINE, whether exported or not, top level + or nested, will be given automatic ``SCC`` annotations. Functions + marked INLINE must be given a cost centre manually. + +``-fprof-auto-top`` + .. index:: + single: -fprof-auto-top + single: cost centres; automatically inserting + + GHC will automatically add ``SCC`` annotations for all top-level + bindings not marked INLINE. If you want a cost centre on an INLINE + function, you have to add it manually. + +``-fprof-auto-exported`` + .. index:: + single: -fprof-auto-top + + .. index:: + single: cost centres; automatically inserting + + GHC will automatically add ``SCC`` annotations for all exported + functions not marked INLINE. If you want a cost centre on an INLINE + function, you have to add it manually. + +``-fprof-auto-calls`` + .. index:: + single: -fprof-auto-calls + + Adds an automatic ``SCC`` annotation to all *call sites*. This is + particularly useful when using profiling for the purposes of + generating stack traces; see the function ``traceStack`` in the + module ``Debug.Trace``, or the ``-xc`` RTS flag + (:ref:`rts-options-debugging`) for more details. + +``-fprof-cafs`` + .. index:: + single: -fprof-cafs + + The costs of all CAFs in a module are usually attributed to one + “big” CAF cost-centre. With this option, all CAFs get their own + cost-centre. An “if all else fails” option… + +``-fno-prof-auto`` + .. index:: + single: -no-fprof-auto + + Disables any previous ``-fprof-auto``, ``-fprof-auto-top``, or + ``-fprof-auto-exported`` options. + +``-fno-prof-cafs`` + .. index:: + single: -fno-prof-cafs + + Disables any previous ``-fprof-cafs`` option. + +``-fno-prof-count-entries`` + .. index:: + single: -fno-prof-count-entries + + Tells GHC not to collect information about how often functions are + entered at runtime (the "entries" column of the time profile), for + this module. This tends to make the profiled code run faster, and + hence closer to the speed of the unprofiled code, because GHC is + able to optimise more aggressively if it doesn't have to maintain + correct entry counts. This option can be useful if you aren't + interested in the entry counts (for example, if you only intend to + do heap profiling). + +.. _prof-time-options: + +Time and allocation profiling +----------------------------- + +To generate a time and allocation profile, give one of the following RTS +options to the compiled program when you run it (RTS options should be +enclosed between ``+RTS ... -RTS`` as usual): + +``-p``, ``-P``, ``-pa`` + .. index:: + single: -p; RTS option + single: -P; RTS option + single: -pa; RTS option + single: time profile + + The ``-p`` option produces a standard *time profile* report. It is + written into the file ``program.prof``. + + The ``-P`` option produces a more detailed report containing the + actual time and allocation data as well. (Not used much.) + + The ``-pa`` option produces the most detailed report containing all + cost centres in addition to the actual time and allocation data. + +``-Vsecs`` + .. index:: + single: -V; RTS option + + Sets the interval that the RTS clock ticks at, which is also the + sampling interval of the time and allocation profile. The default is + 0.02 seconds. + +``-xc`` + .. index:: + single: -xc; RTS option + + This option causes the runtime to print out the current cost-centre + stack whenever an exception is raised. This can be particularly + useful for debugging the location of exceptions, such as the + notorious ``Prelude.head: empty list`` error. See + :ref:`rts-options-debugging`. + +.. _prof-heap: + +Profiling memory usage +---------------------- + +In addition to profiling the time and allocation behaviour of your +program, you can also generate a graph of its memory usage over time. +This is useful for detecting the causes of space leaks, when your +program holds on to more memory at run-time that it needs to. Space +leaks lead to slower execution due to heavy garbage collector activity, +and may even cause the program to run out of memory altogether. + +To generate a heap profile from your program: + +1. Compile the program for profiling (:ref:`prof-compiler-options`). + +2. Run it with one of the heap profiling options described below (eg. + ``-h`` for a basic producer profile). This generates the file + ``prog.hp``. + +3. Run ``hp2ps`` to produce a Postscript file, ``prog.ps``. The + ``hp2ps`` utility is described in detail in :ref:`hp2ps`. + +4. Display the heap profile using a postscript viewer such as Ghostview, + or print it out on a Postscript-capable printer. + +For example, here is a heap profile produced for the ``sphere`` program +from GHC's ``nofib`` benchmark suite, + +.. image:: images/prof_scc.* + +You might also want to take a look at +`hp2any <http://www.haskell.org/haskellwiki/Hp2any>`__, a more advanced +suite of tools (not distributed with GHC) for displaying heap profiles. + +.. _rts-options-heap-prof: + +RTS options for heap profiling +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are several different kinds of heap profile that can be generated. +All the different profile types yield a graph of live heap against time, +but they differ in how the live heap is broken down into bands. The +following RTS options select which break-down to use: + +``-hc`` + .. index:: + single: -hc; RTS option + + (can be shortened to ``-h``). Breaks down the graph by the + cost-centre stack which produced the data. + +``-hm`` + .. index:: + single: -hm; RTS option + + Break down the live heap by the module containing the code which + produced the data. + +``-hd`` + .. index:: + single: -hd; RTS option + + Breaks down the graph by closure description. For actual data, the + description is just the constructor name, for other closures it is a + compiler-generated string identifying the closure. + +``-hy`` + .. index:: + single: -hy; RTS option + + Breaks down the graph by type. For closures which have function type + or unknown/polymorphic type, the string will represent an + approximation to the actual type. + +``-hr`` + .. index:: + single: -hr; RTS option + + Break down the graph by retainer set. Retainer profiling is + described in more detail below (:ref:`retainer-prof`). + +``-hb`` + .. index:: + single: -hb; RTS option + + Break down the graph by biography. Biographical profiling is + described in more detail below (:ref:`biography-prof`). + +In addition, the profile can be restricted to heap data which satisfies +certain criteria - for example, you might want to display a profile by +type but only for data produced by a certain module, or a profile by +retainer for a certain type of data. Restrictions are specified as +follows: + +``-hc ⟨name⟩`` + .. index:: + single: -hc; RTS option + + Restrict the profile to closures produced by cost-centre stacks with + one of the specified cost centres at the top. + +``-hC ⟨name⟩`` + .. index:: + single: -hC; RTS option + + Restrict the profile to closures produced by cost-centre stacks with + one of the specified cost centres anywhere in the stack. + +``-hm ⟨module⟩`` + .. index:: + single: -hm; RTS option + + Restrict the profile to closures produced by the specified modules. + +``-hd ⟨desc⟩`` + .. index:: + single: -hd; RTS option + + Restrict the profile to closures with the specified description + strings. + +``-hy ⟨type⟩`` + .. index:: + single: -hy; RTS option + + Restrict the profile to closures with the specified types. + +``-hr ⟨cc⟩`` + .. index:: + single: -hr; RTS option + + Restrict the profile to closures with retainer sets containing + cost-centre stacks with one of the specified cost centres at the + top. + +``-hb ⟨bio⟩`` + .. index:: + single: -hb; RTS option + + Restrict the profile to closures with one of the specified + biographies, where ⟨bio⟩ is one of ``lag``, ``drag``, ``void``, or + ``use``. + +For example, the following options will generate a retainer profile +restricted to ``Branch`` and ``Leaf`` constructors: + +:: + + prog +RTS -hr -hdBranch,Leaf + +There can only be one "break-down" option (eg. ``-hr`` in the example +above), but there is no limit on the number of further restrictions that +may be applied. All the options may be combined, with one exception: GHC +doesn't currently support mixing the ``-hr`` and ``-hb`` options. + +There are three more options which relate to heap profiling: + +``-isecs`` + .. index:: + single: -i + + Set the profiling (sampling) interval to ⟨secs⟩ seconds (the default + is 0.1 second). Fractions are allowed: for example ``-i0.2`` will + get 5 samples per second. This only affects heap profiling; time + profiles are always sampled with the frequency of the RTS clock. See + :ref:`prof-time-options` for changing that. + +``-xt`` + .. index:: + single: -xt; RTS option + + Include the memory occupied by threads in a heap profile. Each + thread takes up a small area for its thread state in addition to the + space allocated for its stack (stacks normally start small and then + grow as necessary). + + This includes the main thread, so using ``-xt`` is a good way to see + how much stack space the program is using. + + Memory occupied by threads and their stacks is labelled as “TSO” and + “STACK” respectively when displaying the profile by closure + description or type description. + +``-Lnum`` + .. index:: + single: -L; RTS option + + Sets the maximum length of a cost-centre stack name in a heap + profile. Defaults to 25. + +.. _retainer-prof: + +Retainer Profiling +~~~~~~~~~~~~~~~~~~ + +Retainer profiling is designed to help answer questions like “why is +this data being retained?”. We start by defining what we mean by a +retainer: + + A retainer is either the system stack, an unevaluated closure + (thunk), or an explicitly mutable object. + +In particular, constructors are *not* retainers. + +An object B retains object A if (i) B is a retainer object and (ii) +object A can be reached by recursively following pointers starting from +object B, but not meeting any other retainer objects on the way. Each +live object is retained by one or more retainer objects, collectively +called its retainer set, or its retainer set, or its retainers. + +When retainer profiling is requested by giving the program the ``-hr`` +option, a graph is generated which is broken down by retainer set. A +retainer set is displayed as a set of cost-centre stacks; because this +is usually too large to fit on the profile graph, each retainer set is +numbered and shown abbreviated on the graph along with its number, and +the full list of retainer sets is dumped into the file ``prog.prof``. + +Retainer profiling requires multiple passes over the live heap in order +to discover the full retainer set for each object, which can be quite +slow. So we set a limit on the maximum size of a retainer set, where all +retainer sets larger than the maximum retainer set size are replaced by +the special set ``MANY``. The maximum set size defaults to 8 and can be +altered with the ``-R`` RTS option: + +``-R ⟨size⟩`` + Restrict the number of elements in a retainer set to ⟨size⟩ (default + 8). + +Hints for using retainer profiling +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The definition of retainers is designed to reflect a common cause of +space leaks: a large structure is retained by an unevaluated +computation, and will be released once the computation is forced. A good +example is looking up a value in a finite map, where unless the lookup +is forced in a timely manner the unevaluated lookup will cause the whole +mapping to be retained. These kind of space leaks can often be +eliminated by forcing the relevant computations to be performed eagerly, +using ``seq`` or strictness annotations on data constructor fields. + +Often a particular data structure is being retained by a chain of +unevaluated closures, only the nearest of which will be reported by +retainer profiling - for example A retains B, B retains C, and C retains +a large structure. There might be a large number of Bs but only a single +A, so A is really the one we're interested in eliminating. However, +retainer profiling will in this case report B as the retainer of the +large structure. To move further up the chain of retainers, we can ask +for another retainer profile but this time restrict the profile to B +objects, so we get a profile of the retainers of B: + +:: + + prog +RTS -hr -hcB + +This trick isn't foolproof, because there might be other B closures in +the heap which aren't the retainers we are interested in, but we've +found this to be a useful technique in most cases. + +.. _biography-prof: + +Biographical Profiling +~~~~~~~~~~~~~~~~~~~~~~ + +A typical heap object may be in one of the following four states at each +point in its lifetime: + +- The lag stage, which is the time between creation and the first use + of the object, + +- the use stage, which lasts from the first use until the last use of + the object, and + +- The drag stage, which lasts from the final use until the last + reference to the object is dropped. + +- An object which is never used is said to be in the void state for its + whole lifetime. + +A biographical heap profile displays the portion of the live heap in +each of the four states listed above. Usually the most interesting +states are the void and drag states: live heap in these states is more +likely to be wasted space than heap in the lag or use states. + +It is also possible to break down the heap in one or more of these +states by a different criteria, by restricting a profile by biography. +For example, to show the portion of the heap in the drag or void state +by producer: + +:: + + prog +RTS -hc -hbdrag,void + +Once you know the producer or the type of the heap in the drag or void +states, the next step is usually to find the retainer(s): + +:: + + prog +RTS -hr -hccc... + +.. note:: + This two stage process is required because GHC cannot currently + profile using both biographical and retainer information simultaneously. + +.. _mem-residency: + +Actual memory residency +~~~~~~~~~~~~~~~~~~~~~~~ + +How does the heap residency reported by the heap profiler relate to the +actual memory residency of your program when you run it? You might see a +large discrepancy between the residency reported by the heap profiler, +and the residency reported by tools on your system (eg. ``ps`` or +``top`` on Unix, or the Task Manager on Windows). There are several +reasons for this: + +- There is an overhead of profiling itself, which is subtracted from + the residency figures by the profiler. This overhead goes away when + compiling without profiling support, of course. The space overhead is + currently 2 extra words per heap object, which probably results in + about a 30% overhead. + +- Garbage collection requires more memory than the actual residency. + The factor depends on the kind of garbage collection algorithm in + use: a major GC in the standard generation copying collector will + usually require 3L bytes of memory, where L is the amount of live + data. This is because by default (see the ``+RTS -F`` option) we + allow the old generation to grow to twice its size (2L) before + collecting it, and we require additionally L bytes to copy the live + data into. When using compacting collection (see the ``+RTS -c`` + option), this is reduced to 2L, and can further be reduced by + tweaking the ``-F`` option. Also add the size of the allocation area + (currently a fixed 512Kb). + +- The stack isn't counted in the heap profile by default. See the + ``+RTS -xt`` option. + +- The program text itself, the C stack, any non-heap data (e.g. data + allocated by foreign libraries, and data allocated by the RTS), and + ``mmap()``\'d memory are not counted in the heap profile. + +.. _hp2ps: + +``hp2ps`` -- Rendering heap profiles to PostScript +-------------------------------------------------- + +.. index:: + single: hp2ps + single: heap profiles + single: postscript, from heap profiles + single: -h⟨break-down⟩ + +Usage: + +:: + + hp2ps [flags] [<file>[.hp]] + +The program ``hp2ps`` program converts a heap profile as produced +by the ``-h<break-down>`` runtime option into a PostScript graph of the +heap profile. By convention, the file to be processed by ``hp2ps`` has a +``.hp`` extension. The PostScript output is written to ``<file>@.ps``. +If ``<file>`` is omitted entirely, then the program behaves as a filter. + +``hp2ps`` is distributed in ``ghc/utils/hp2ps`` in a GHC source +distribution. It was originally developed by Dave Wakeling as part of +the HBC/LML heap profiler. + +The flags are: + +``-d`` + In order to make graphs more readable, ``hp2ps`` sorts the shaded + bands for each identifier. The default sort ordering is for the + bands with the largest area to be stacked on top of the smaller + ones. The ``-d`` option causes rougher bands (those representing + series of values with the largest standard deviations) to be stacked + on top of smoother ones. + +``-b`` + Normally, ``hp2ps`` puts the title of the graph in a small box at + the top of the page. However, if the JOB string is too long to fit + in a small box (more than 35 characters), then ``hp2ps`` will choose + to use a big box instead. The ``-b`` option forces ``hp2ps`` to use + a big box. + +``-e<float>[in|mm|pt]`` + Generate encapsulated PostScript suitable for inclusion in LaTeX + documents. Usually, the PostScript graph is drawn in landscape mode + in an area 9 inches wide by 6 inches high, and ``hp2ps`` arranges + for this area to be approximately centred on a sheet of a4 paper. + This format is convenient of studying the graph in detail, but it is + unsuitable for inclusion in LaTeX documents. The ``-e`` option + causes the graph to be drawn in portrait mode, with float specifying + the width in inches, millimetres or points (the default). The + resulting PostScript file conforms to the Encapsulated PostScript + (EPS) convention, and it can be included in a LaTeX document using + Rokicki's dvi-to-PostScript converter ``dvips``. + +``-g`` + Create output suitable for the ``gs`` PostScript previewer (or + similar). In this case the graph is printed in portrait mode without + scaling. The output is unsuitable for a laser printer. + +``-l`` + Normally a profile is limited to 20 bands with additional + identifiers being grouped into an ``OTHER`` band. The ``-l`` flag + removes this 20 band and limit, producing as many bands as + necessary. No key is produced as it won't fit!. It is useful for + creation time profiles with many bands. + +``-m<int>`` + Normally a profile is limited to 20 bands with additional + identifiers being grouped into an ``OTHER`` band. The ``-m`` flag + specifies an alternative band limit (the maximum is 20). + + ``-m0`` requests the band limit to be removed. As many bands as + necessary are produced. However no key is produced as it won't fit! + It is useful for displaying creation time profiles with many bands. + +``-p`` + Use previous parameters. By default, the PostScript graph is + automatically scaled both horizontally and vertically so that it + fills the page. However, when preparing a series of graphs for use + in a presentation, it is often useful to draw a new graph using the + same scale, shading and ordering as a previous one. The ``-p`` flag + causes the graph to be drawn using the parameters determined by a + previous run of ``hp2ps`` on ``file``. These are extracted from + ``file@.aux``. + +``-s`` + Use a small box for the title. + +``-t<float>`` + Normally trace elements which sum to a total of less than 1% of the + profile are removed from the profile. The ``-t`` option allows this + percentage to be modified (maximum 5%). + + ``-t0`` requests no trace elements to be removed from the profile, + ensuring that all the data will be displayed. + +``-c`` + Generate colour output. + +``-y`` + Ignore marks. + +``-?`` + Print out usage information. + +.. _manipulating-hp: + +Manipulating the hp file +~~~~~~~~~~~~~~~~~~~~~~~~ + +(Notes kindly offered by Jan-Willem Maessen.) + +The ``FOO.hp`` file produced when you ask for the heap profile of a +program ``FOO`` is a text file with a particularly simple structure. +Here's a representative example, with much of the actual data omitted: + +:: + + JOB "FOO -hC" + DATE "Thu Dec 26 18:17 2002" + SAMPLE_UNIT "seconds" + VALUE_UNIT "bytes" + BEGIN_SAMPLE 0.00 + END_SAMPLE 0.00 + BEGIN_SAMPLE 15.07 + ... sample data ... + END_SAMPLE 15.07 + BEGIN_SAMPLE 30.23 + ... sample data ... + END_SAMPLE 30.23 + ... etc. + BEGIN_SAMPLE 11695.47 + END_SAMPLE 11695.47 + +The first four lines (``JOB``, ``DATE``, ``SAMPLE_UNIT``, +``VALUE_UNIT``) form a header. Each block of lines starting with +``BEGIN_SAMPLE`` and ending with ``END_SAMPLE`` forms a single sample +(you can think of this as a vertical slice of your heap profile). The +hp2ps utility should accept any input with a properly-formatted header +followed by a series of *complete* samples. + +Zooming in on regions of your profile +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can look at particular regions of your profile simply by loading a +copy of the ``.hp`` file into a text editor and deleting the unwanted +samples. The resulting ``.hp`` file can be run through ``hp2ps`` and +viewed or printed. + +Viewing the heap profile of a running program +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``.hp`` file is generated incrementally as your program runs. In +principle, running ``hp2ps`` on the incomplete file should produce a +snapshot of your program's heap usage. However, the last sample in the +file may be incomplete, causing ``hp2ps`` to fail. If you are using a +machine with UNIX utilities installed, it's not too hard to work around +this problem (though the resulting command line looks rather Byzantine): + +:: + + head -`fgrep -n END_SAMPLE FOO.hp | tail -1 | cut -d : -f 1` FOO.hp \ + | hp2ps > FOO.ps + +The command ``fgrep -n END_SAMPLE FOO.hp`` finds the end of every +complete sample in ``FOO.hp``, and labels each sample with its ending +line number. We then select the line number of the last complete sample +using ``tail`` and ``cut``. This is used as a parameter to ``head``; the +result is as if we deleted the final incomplete sample from ``FOO.hp``. +This results in a properly-formatted .hp file which we feed directly to +``hp2ps``. + +Viewing a heap profile in real time +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``gv`` and ``ghostview`` programs have a "watch file" option can be +used to view an up-to-date heap profile of your program as it runs. +Simply generate an incremental heap profile as described in the previous +section. Run ``gv`` on your profile: + +:: + + gv -watch -orientation=seascape FOO.ps + +If you forget the ``-watch`` flag you can still select "Watch file" from +the "State" menu. Now each time you generate a new profile ``FOO.ps`` +the view will update automatically. + +This can all be encapsulated in a little script: + +:: + + #!/bin/sh + head -`fgrep -n END_SAMPLE FOO.hp | tail -1 | cut -d : -f 1` FOO.hp \ + | hp2ps > FOO.ps + gv -watch -orientation=seascape FOO.ps & + while [ 1 ] ; do + sleep 10 # We generate a new profile every 10 seconds. + head -`fgrep -n END_SAMPLE FOO.hp | tail -1 | cut -d : -f 1` FOO.hp \ + | hp2ps > FOO.ps + done + +Occasionally ``gv`` will choke as it tries to read an incomplete copy of +``FOO.ps`` (because ``hp2ps`` is still running as an update occurs). A +slightly more complicated script works around this problem, by using the +fact that sending a SIGHUP to gv will cause it to re-read its input +file: + +:: + + #!/bin/sh + head -`fgrep -n END_SAMPLE FOO.hp | tail -1 | cut -d : -f 1` FOO.hp \ + | hp2ps > FOO.ps + gv FOO.ps & + gvpsnum=$! + while [ 1 ] ; do + sleep 10 + head -`fgrep -n END_SAMPLE FOO.hp | tail -1 | cut -d : -f 1` FOO.hp \ + | hp2ps > FOO.ps + kill -HUP $gvpsnum + done + +.. _prof-threaded: + +Profiling Parallel and Concurrent Programs +------------------------------------------ + +Combining ``-threaded`` and ``-prof`` is perfectly fine, and indeed it +is possible to profile a program running on multiple processors with the +``+RTS -N`` option. [3]_ + +Some caveats apply, however. In the current implementation, a profiled +program is likely to scale much less well than the unprofiled program, +because the profiling implementation uses some shared data structures +which require locking in the runtime system. Furthermore, the memory +allocation statistics collected by the profiled program are stored in +shared memory but *not* locked (for speed), which means that these +figures might be inaccurate for parallel programs. + +We strongly recommend that you use ``-fno-prof-count-entries`` when +compiling a program to be profiled on multiple cores, because the entry +counts are also stored in shared memory, and continuously updating them +on multiple cores is extremely slow. + +We also recommend using +`ThreadScope <http://www.haskell.org/haskellwiki/ThreadScope>`__ for +profiling parallel programs; it offers a GUI for visualising parallel +execution, and is complementary to the time and space profiling features +provided with GHC. + +.. _hpc: + +Observing Code Coverage +----------------------- + +.. index:: + single: code coverage + single: Haskell Program Coverage + single: hpc + +Code coverage tools allow a programmer to determine what parts of their +code have been actually executed, and which parts have never actually +been invoked. GHC has an option for generating instrumented code that +records code coverage as part of the Haskell Program Coverage (HPC) +toolkit, which is included with GHC. HPC tools can be used to render the +generated code coverage information into human understandable format. + +Correctly instrumented code provides coverage information of two kinds: +source coverage and boolean-control coverage. Source coverage is the +extent to which every part of the program was used, measured at three +different levels: declarations (both top-level and local), alternatives +(among several equations or case branches) and expressions (at every +level). Boolean coverage is the extent to which each of the values True +and False is obtained in every syntactic boolean context (ie. guard, +condition, qualifier). + +HPC displays both kinds of information in two primary ways: textual +reports with summary statistics (``hpc report``) and sources with color +mark-up (``hpc markup``). For boolean coverage, there are four possible +outcomes for each guard, condition or qualifier: both True and False +values occur; only True; only False; never evaluated. In hpc-markup +output, highlighting with a yellow background indicates a part of the +program that was never evaluated; a green background indicates an +always-True expression and a red background indicates an always-False +one. + +A small example: Reciprocation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For an example we have a program, called ``Recip.hs``, which computes +exact decimal representations of reciprocals, with recurring parts +indicated in brackets. + +:: + + reciprocal :: Int -> (String, Int) + reciprocal n | n > 1 = ('0' : '.' : digits, recur) + | otherwise = error + "attempting to compute reciprocal of number <= 1" + where + (digits, recur) = divide n 1 [] + divide :: Int -> Int -> [Int] -> (String, Int) + divide n c cs | c `elem` cs = ([], position c cs) + | r == 0 = (show q, 0) + | r /= 0 = (show q ++ digits, recur) + where + (q, r) = (c*10) `quotRem` n + (digits, recur) = divide n r (c:cs) + + position :: Int -> [Int] -> Int + position n (x:xs) | n==x = 1 + | otherwise = 1 + position n xs + + showRecip :: Int -> String + showRecip n = + "1/" ++ show n ++ " = " ++ + if r==0 then d else take p d ++ "(" ++ drop p d ++ ")" + where + p = length d - r + (d, r) = reciprocal n + + main = do + number <- readLn + putStrLn (showRecip number) + main + +HPC instrumentation is enabled with the -fhpc flag: + +:: + + $ ghc -fhpc Recip.hs + +GHC creates a subdirectory ``.hpc`` in the current directory, and puts +HPC index (``.mix``) files in there, one for each module compiled. You +don't need to worry about these files: they contain information needed +by the ``hpc`` tool to generate the coverage data for compiled modules +after the program is run. + +:: + + $ ./Recip + 1/3 + = 0.(3) + +Running the program generates a file with the ``.tix`` suffix, in this +case ``Recip.tix``, which contains the coverage data for this run of the +program. The program may be run multiple times (e.g. with different test +data), and the coverage data from the separate runs is accumulated in +the ``.tix`` file. To reset the coverage data and start again, just +remove the ``.tix`` file. + +Having run the program, we can generate a textual summary of coverage: + +:: + + $ hpc report Recip + 80% expressions used (81/101) + 12% boolean coverage (1/8) + 14% guards (1/7), 3 always True, + 1 always False, + 2 unevaluated + 0% 'if' conditions (0/1), 1 always False + 100% qualifiers (0/0) + 55% alternatives used (5/9) + 100% local declarations used (9/9) + 100% top-level declarations used (5/5) + +We can also generate a marked-up version of the source. + +:: + + $ hpc markup Recip + writing Recip.hs.html + +This generates one file per Haskell module, and 4 index files, +``hpc_index.html``, ``hpc_index_alt.html``, ``hpc_index_exp.html``, +``hpc_index_fun.html``. + +Options for instrumenting code for coverage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``-fhpc`` + Enable code coverage for the current module or modules being + compiled. + + Modules compiled with this option can be freely mixed with modules + compiled without it; indeed, most libraries will typically be + compiled without ``-fhpc``. When the program is run, coverage data + will only be generated for those modules that were compiled with + ``-fhpc``, and the ``hpc`` tool will only show information about + those modules. + +The hpc toolkit +~~~~~~~~~~~~~~~ + +The hpc command has several sub-commands: + +:: + + $ hpc + Usage: hpc COMMAND ... + + Commands: + help Display help for hpc or a single command + Reporting Coverage: + report Output textual report about program coverage + markup Markup Haskell source with program coverage + Processing Coverage files: + sum Sum multiple .tix files in a single .tix file + combine Combine two .tix files in a single .tix file + map Map a function over a single .tix file + Coverage Overlays: + overlay Generate a .tix file from an overlay file + draft Generate draft overlay that provides 100% coverage + Others: + show Show .tix file in readable, verbose format + version Display version for hpc + +In general, these options act on a ``.tix`` file after an instrumented +binary has generated it. + +The hpc tool assumes you are in the top-level directory of the location +where you built your application, and the ``.tix`` file is in the same +top-level directory. You can use the flag ``--srcdir`` to use ``hpc`` +for any other directory, and use ``--srcdir`` multiple times to analyse +programs compiled from difference locations, as is typical for packages. + +We now explain in more details the major modes of hpc. + +hpc report +^^^^^^^^^^ + +``hpc report`` gives a textual report of coverage. By default, all +modules and packages are considered in generating report, unless include +or exclude are used. The report is a summary unless the ``--per-module`` +flag is used. The ``--xml-output`` option allows for tools to use hpc to +glean coverage. + +:: + + $ hpc help report + Usage: hpc report [OPTION] .. <TIX_FILE> [<MODULE> [<MODULE> ..]] + + Options: + + --per-module show module level detail + --decl-list show unused decls + --exclude=[PACKAGE:][MODULE] exclude MODULE and/or PACKAGE + --include=[PACKAGE:][MODULE] include MODULE and/or PACKAGE + --srcdir=DIR path to source directory of .hs files + multi-use of srcdir possible + --hpcdir=DIR append sub-directory that contains .mix files + default .hpc [rarely used] + --reset-hpcdirs empty the list of hpcdir's + [rarely used] + --xml-output show output in XML + +hpc markup +^^^^^^^^^^ + +``hpc markup`` marks up source files into colored html. + +:: + + $ hpc help markup + Usage: hpc markup [OPTION] .. <TIX_FILE> [<MODULE> [<MODULE> ..]] + + Options: + + --exclude=[PACKAGE:][MODULE] exclude MODULE and/or PACKAGE + --include=[PACKAGE:][MODULE] include MODULE and/or PACKAGE + --srcdir=DIR path to source directory of .hs files + multi-use of srcdir possible + --hpcdir=DIR append sub-directory that contains .mix files + default .hpc [rarely used] + --reset-hpcdirs empty the list of hpcdir's + [rarely used] + --fun-entry-count show top-level function entry counts + --highlight-covered highlight covered code, rather that code gaps + --destdir=DIR path to write output to + +hpc sum +^^^^^^^ + +``hpc sum`` adds together any number of ``.tix`` files into a single +``.tix`` file. ``hpc sum`` does not change the original ``.tix`` file; +it generates a new ``.tix`` file. + +:: + + $ hpc help sum + Usage: hpc sum [OPTION] .. <TIX_FILE> [<TIX_FILE> [<TIX_FILE> ..]] + Sum multiple .tix files in a single .tix file + + Options: + + --exclude=[PACKAGE:][MODULE] exclude MODULE and/or PACKAGE + --include=[PACKAGE:][MODULE] include MODULE and/or PACKAGE + --output=FILE output FILE + --union use the union of the module namespace (default is intersection) + +hpc combine +^^^^^^^^^^^ + +``hpc combine`` is the swiss army knife of ``hpc``. It can be used to +take the difference between ``.tix`` files, to subtract one ``.tix`` +file from another, or to add two ``.tix`` files. hpc combine does not +change the original ``.tix`` file; it generates a new ``.tix`` file. + +:: + + $ hpc help combine + Usage: hpc combine [OPTION] .. <TIX_FILE> <TIX_FILE> + Combine two .tix files in a single .tix file + + Options: + + --exclude=[PACKAGE:][MODULE] exclude MODULE and/or PACKAGE + --include=[PACKAGE:][MODULE] include MODULE and/or PACKAGE + --output=FILE output FILE + --function=FUNCTION combine .tix files with join function, default = ADD + FUNCTION = ADD | DIFF | SUB + --union use the union of the module namespace (default is intersection) + +hpc map +^^^^^^^ + +hpc map inverts or zeros a ``.tix`` file. hpc map does not change the +original ``.tix`` file; it generates a new ``.tix`` file. + +:: + + $ hpc help map + Usage: hpc map [OPTION] .. <TIX_FILE> + Map a function over a single .tix file + + Options: + + --exclude=[PACKAGE:][MODULE] exclude MODULE and/or PACKAGE + --include=[PACKAGE:][MODULE] include MODULE and/or PACKAGE + --output=FILE output FILE + --function=FUNCTION apply function to .tix files, default = ID + FUNCTION = ID | INV | ZERO + --union use the union of the module namespace (default is intersection) + +hpc overlay and hpc draft +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Overlays are an experimental feature of HPC, a textual description of +coverage. hpc draft is used to generate a draft overlay from a .tix +file, and hpc overlay generates a .tix files from an overlay. + +:: + + % hpc help overlay + Usage: hpc overlay [OPTION] .. <OVERLAY_FILE> [<OVERLAY_FILE> [...]] + + Options: + + --srcdir=DIR path to source directory of .hs files + multi-use of srcdir possible + --hpcdir=DIR append sub-directory that contains .mix files + default .hpc [rarely used] + --reset-hpcdirs empty the list of hpcdir's + [rarely used] + --output=FILE output FILE + % hpc help draft + Usage: hpc draft [OPTION] .. <TIX_FILE> + + Options: + + --exclude=[PACKAGE:][MODULE] exclude MODULE and/or PACKAGE + --include=[PACKAGE:][MODULE] include MODULE and/or PACKAGE + --srcdir=DIR path to source directory of .hs files + multi-use of srcdir possible + --hpcdir=DIR append sub-directory that contains .mix files + default .hpc [rarely used] + --reset-hpcdirs empty the list of hpcdir's + [rarely used] + --output=FILE output FILE + +Caveats and Shortcomings of Haskell Program Coverage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +HPC does not attempt to lock the ``.tix`` file, so multiple concurrently +running binaries in the same directory will exhibit a race condition. +There is no way to change the name of the ``.tix`` file generated, apart +from renaming the binary. HPC does not work with GHCi. + +.. _ticky-ticky: + +Using “ticky-ticky” profiling (for implementors) +------------------------------------------------ + +.. index:: + single: ticky-ticky profiling + +Because ticky-ticky profiling requires a certain familiarity with GHC +internals, we have moved the documentation to the GHC developers wiki. +Take a look at its +:ghc-wiki:`overview of the profiling options <Commentary/Profiling>`, +which includeds a link to the ticky-ticky profiling page. + +.. [1] + ``-fprof-auto`` was known as ``-auto-all``\ ``-auto-all`` prior to + GHC 7.4.1. + +.. [2] + Note that this policy has changed slightly in GHC 7.4.1 relative to + earlier versions, and may yet change further, feedback is welcome. + +.. [3] + This feature was added in GHC 7.4.1. |