ghc/includes/c-as-asm.lit


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509

\begin{onlystandalone}
\documentstyle[11pt,literate,a4wide]{article}
\begin{document}
\title{C as the assembly language for the Glasgow Haskell compiler}
\author{Patrick Sansom, with help from his enemies}
\date{June 1992}
\maketitle
\begin{rawlatex}
\tableofcontents
\end{rawlatex}
\end{onlystandalone}

\begin{onlypartofdoc}
\section[C-as-assembler]{Using C for our assembler}
\downsection
\end{onlypartofdoc}

% this file describes the issues;
% it then \inputs all the actual source bits that do the job


%************************************************************************
%*									*
\section[C-as-asm-intro-portable]{C as assembler: introduction to ``portable C''}
%*									*
%************************************************************************

The Glasgow Haskell compiler expresses its output in C for the usual
reasons: (1)~We hope it will be more readily portable to new machines;
(2)~We'd rather not waste our time writing a register-allocator,
instruction-selection mumbo-jumbo, etc.---in short, the dreary,
tedious bits of a code-generator.

The execution model that underlies Glasgow Haskell is the
Spineless-Tagless G-machine (STG-machine).  The constituent pieces of
the STG-machine model that must be fashioned out of good old-fashioned
C code are, roughly:
\begin{description}
\item[Dynamic heap:] The usual.

\item[Stack space:] With two stacks (A and B) growing from each end of a space,
towards each other.  To pass (some) arguments, return addresses, ``update
frames'', ...

\item[Various ``STG registers'':] Stack pointers, heap pointers,
return registers---all abstract machine-model entities...

\item[Tables of static read-only info:] Mostly so-called ``info
tables''...

\item[Code:] Lots of code fragments that do nothing but tail-call each
other.
\end{description}

OK, as mentioned, goal Numero Uno in using C as the compiler's target
is to produce code that's relatively easy to port.  To this end, we
need very-plain-vanilla bits of C to do all of the above things.
Roughly speaking, here's what comes out:

\begin{description}
\item[Dynamic heap:] @malloc(3)@ a chunk of memory and use that.

\item[Stack space:] Ditto.

\item[Various ``STG registers'':] These are just global variables (or
words in a ``global register table''), each just a 32|64-bit word (i.e.,
not a funny structure, no funny tag bits, etc.---just very ordinary
global variables).

\item[Tables of static read-only info:] Initialised arrays of 32|64-bit
words; they tend to come out looking something like (behind a wall of
CPP macros) [very straightforward]:
\begin{verbatim}
const StgWord foo_info[] = { (StgWord) foo_entry, 0, 0 }
\end{verbatim}

\item[Code:] This is the tricky part: it's hard to convince any C compiler to do
tail-calls, and certainly not all C compilers all of the time.

Fortunately, {\em all} code fragments in the STG world can be coerced
into C ``functions'' of the form:
\begin{itemize}
\item
{\em No} arguments (all the STG-world function arguments, etc., go via
the STG-world stacks and registers, not the C-world stacks/registers);
\item
They {\em always} know the next C fragment to be called (it's all
tail-recursive, after all).
\end{itemize}

So: we dictate that every one of these code fragments is an
argument-less C function of its own, which {\em returns} the address
of the next code fragment to be executed (i.e., its continuation).  Here
is an example of one such fragment (simplified from actual code):

\begin{verbatim}
stg_1848_entry (/* no args */) {
    STK_CHK(0,-5);
    PUSH_STD_UPD_FRAME(Node,StdUpdRetVecReg,0,0);
    *(SpB+5)=(StgWord)(vtbl_1930);
    Node=(StgPtr)(*(Node+SPEC_HS));
    SpB=SpB+5;
    return(*Node); /* *Node points at next code to enter */
}
\end{verbatim}

Now all we need is a so-called {\em mini-interpreter} to dispatch
these returned continuations; this is the basic idea:

\begin{verbatim}
StgFunPtr continuation = (StgFunPtr) start_cont;

while ( 1 ) { continuation = (StgFunPtr) (continuation)(); }
\end{verbatim}
\end{description}

If you are utterly baffled at this point, you probably need to read
the STG-machine paper.

All of the above is dead simple, if not particularly whizzy; in the
rest of this document, when we talk about ``portable~C'' (or, ``the
portable version''),
\index{portable C output}
we mean this stuff.

%************************************************************************
%*									*
\section[C-as-asm-intro-fast]{C as assembler: going faster, introduction to ``optimised C''}
%*									*
%************************************************************************

%************************************************************************
%*									*
\subsection[C-as-asm-portably-slow]{Why ``portable C'' is slow}
%*									*
%************************************************************************

Compiling Haskell programs via simple, utterly portable C, as outlined
above, comes at a significant cost in code size and speed.

\begin{description}
\item[Dynamic heap:] No problems associated with this.

\item[Stack space:] Or this.

[Well, not quite.  The C compiler won't take advantage of the {\em
fact} (which it doesn't know) that the heap and stacks {\em cannot
possibly overlap}.  In a perfect world, it could use this information
to select better instruction sequences.)

\item[Various ``STG registers'':] These are all in global C variables,
about which C compilers are notoriously pessimistic.

A load or store to one of these ``registers'' is almost certainly two
or more instructions---not to mention stalls for going to memory.  And
their global-you-don't-know-anything-about-me nature (in the C
compiler's view) is very optimisation-unfriendly.

And, because they're the ``registers'' of the STG execution model,
they are {\em FREQUENTLY} used!  You {\em really pay} for their
global-variableness... (estimates: 2--4x of code speed, 2x of code
size)

\item[Tables of static read-only info:] There aren't any big costs
associated with these, except that you can't micro-manage their
placement in memory, esp.~w.r.t.~entry-code fragments.
(See \sectionref{C-as-asm-native}.)


\item[Code:] Rather than really tail-jumping, we make many, many trips
through the ``mini-interpreter.''  Besides all those instructions,
there is probably plenty of pushing/popping of frames on/off the C
stack, because each dispatch by the mini-interpreter is a C function
call...

Also, we don't ``fall through'' from argument-satisfaction-checking
code into the real code for a function: we make an extra, fairly
gratuitous trip through the mini-interpreter...

We {\em estimate} that real tail-jumping should make programs go
\pl{~25%} faster.
\end{description}

%************************************************************************
%*									*
\subsection[C-as-asm-fast]{Solution: ``Optimising C,'' with our friend, Richard Stallman}
%*									*
%************************************************************************

The freely-available GNU~C compiler, GCC, (version 2.x), written under the
lead of Richard Stallman at the Free Software Foundation, is a good
C~compiler that has some non-standard extensions and machine-code
hooks that make it possible to win back practically all the slow-nesses
listed above for ``portable C''.

First, there is a non-standard extension to C that makes it possible
to put a ``global variable'' in a machine register.  For example,
something like

\begin{verbatim}
register StgPtr SpB __asm__("%i4");
\end{verbatim}

says ``global variable'' \tr{SpB} should live in machine-register \tr{%i4}.

Second, GCC offers an ``extended \tr{__asm__}'' directive, which lets
you inject raw assembly-language stuff into the C-compiler output.
Inserting \tr{jmps} in this way is is one component of the
do-real-tailjumps machinery...

The other main component of doing-real-tailjumps is shutting down the
C world's function-calling machinery, i.e., pushing/popping of stack
frames (and/or register windows, in the SPARC case).  This involves
``sedding'' the C-compiler's assembly-language output, stripping off
function prologues and epilogues, and other such really horrible
stuff...

{\em Caveat~1:} The above is machine- and compiler-specific.  {\em
Response:} We don't care.  GCC is freely and widely available and has
been ported to bazillions of machines.

{\em Caveat~2:} The above constitutes {\em serious} mangling of what
ends up in a \tr{.o} object file.  Mixing such \tr{.o} files with
others that were incompatibly compiled (e.g., w/ different
magical-global-register declarations) is {\em guaranteed} to result in
gruesome program death.

{\em Response:} We treat optimised-C \tr{.o} files quite separately.
First, this document carefully records how C files from different
``worlds'' must be compiled (next section); also, our
compilation-system driver checks that executables it links are built
from compatible \tr{.o} files (see \sectionref{driver-consistency-chking}).

The User's Guide describes how the beleaguered Haskell programmer
interacts with all this C optimisation stuff... (executive summary:
it's invisible.)

[For a discussion of how well you can do on compiling via straight
ANSI-standard C, see the paper from CMU about compiling ML...]

%************************************************************************
%*									*
\subsection[C-as-asm-worlds]{``Worlds'' that must interoperate in optimised~C}
%*									*
%************************************************************************

A Glasgow-Haskell-compiled executable is made of bits that come from
various places, or ``worlds.''  These are:
\begin{itemize}
\item
The major ``world'' is the {\em Haskell Threaded World} of
direct-jumping, machine-register-using compiled-to-optimised-C Haskell
code.

\item
The Haskell Threaded World may call into the {\em Arbitrary~C
World}; for example, to do I/O.  This is C code that is written by
someone else, compiled by someone else, over which the Glasgow Haskell
system has no control whatsoever.  However, a most pleasant property
of the Arbitrary~C World is that it doesn't care about any of the
internals of the Haskell threaded world.  In particular, it does
not know about the heap, stack etc, and hence does not modify
the magic ``registers'' @Hp@, @SpA@, etc.

\item
There may also be calls out to an {\em STG~C World}, of which the
storage manager is the major example.  (The use of the GNU
multi-precision arithmetic package to do ``Integers'' is nearly
another; except we do all the STG magic in-line, then do ``ordinary''
C calls to the library after we know it's safe to do so.) An STG~C
World is a blob of C that needs to inspect/save/restore some part of
the Haskell-Threaded-World state, notably some of its registers.  For
example, the storage manager {\em really} needs to know where the heap
pointer \tr{Hp}---in a machine register---is pointing to, before
embarking on a garbage collection... :-)

These STG~C Worlds are the tricky devils...

{\em NB:} The storage manager is a direct-jumping threaded world of
its own, with its own mapping-to-machine-registers, etc.

\item
The {\em C~Startup World}: The C runtime-system gets things going by
invoking a user-supplied function, \tr{main} (or \tr{mainIO}, if
Glasgow I/O is in force).  We must have such a function (it is
provided by the GHC RTS), and it must get us from there into the
Haskell Threaded World.

Similarly, when we finally emerge from the Haskell Threaded World, we
must exit as a well-behave C program would, i.e., set the exit status,
etc., etc.

The C~Startup World is not a big problem.
\end{itemize}

%************************************************************************
%*									*
\section[C-as-asm-issues]{Things to be {\em very careful} about with ``optimised C''}
%*									*
%************************************************************************

(Most of) These issues can be organised according to the affected
World-to-World transition...

%************************************************************************
%*									*
\subsection[C-as-asm-starting-stopping]{Starting and stopping a Haskell Threaded World}
%*									*
%************************************************************************

\begin{itemize}
\item
As part of real-tailjumping support, we completely shut down the
pushing/popping of C stack frames (by sedding off
prologues/epilogues).  This is fine---except the C compiler doesn't
know we do this and may use the stack anyway (for temporaries,
automatic variables, ...)!  The initialisation of the Haskell Threaded
World must therefore ensure that there is some/enough C-stack space
sitting around for this purpose...

\item
The ``mini-interpreter'' must be re-entrant!  The code for
@error@---and bits of the RTS---actually exploit this.

\item
(C Startup to Haskell) Beginning of mini-interpreter: The STG
registers (in real registers) must be initialised from the @smInfo@
structure.  Ending: @smInfo@ updated from the STG registers.
\end{itemize}

%************************************************************************
%*									*
\subsection[C-as-asm-haskell-to-arbitrary-C]{Haskell Threaded to/from Arbitrary~C}
%*									*
%************************************************************************

Remember, ``Arbitrary C'' cannot modify any of the STG registers, so
all that is required is to keep them safe across the call.

Hence we can just call the arbitrary~C routine.  But, {\em don't} map
any STG registers onto any call-clobbered registers; the arbitrary~C
may stomp on them.  (Just use callee-save registers.) In the SPARC
case, this means all \tr{%o} and several \tr{%g} registers are
unavailable.  GCC~2.x warns if you try to snaffle call-clobbered
registers.

%************************************************************************
%*									*
\subsection[C-as-asm-haskell-to-STG-C]{Haskell Threaded to/from special ``STG~C''}
%*									*
%************************************************************************

This is the tricky business.

[ToDo: re-make this section in its particulars (it is out-of-date);
principles still valid.]

\begin{itemize}
\item
The compiler ``knows'' about things to which it makes ``STG calls''...

It emits macro calls of the form \tr{STGCALLn(f, arg1, arg2..., arg_n)};
calls to C-function @f@, with @n@ arguments, returning (at most) one
32|64-bit value.  A use of this might look like...

\begin{verbatim}
x = STGCALL2( arby_prec_add, y, z );
\end{verbatim}

\item
In portable~C, the above would just become an ordinary C
function-call:

\begin{verbatim}
x = arby_prec_add(y,z);
\end{verbatim}

Also, in portable~C, the @xxx_SAVE@ backup global (see below) is
\tr{#defined} to be the same as \tr{xxx}!

\item
In optimised~C, the @STGCALLn@ macros turn into (highly-machine
dependent) uses of the ultra-magic @callWrapper@ function.

At least in the SPARC case, using @STGCALL1@ as an example:

\begin{verbatim}
STGCALL1( f, x ) ---> %o5 := f; callWrapper(x)
\end{verbatim}

The @callWrapper@ function is very special indeed.  It must preserve
all the callee-saves registers (SPARC e.g.: all \tr{%i} and \tr{%l}
registers).  It is {\em NOT} tail-jumped to like the
Haskell-Threaded-World routines.  So all it does is:

\begin{enumerate}
\item
Save the return address (SPARC: \tr{[%o7]+8}) into @continuation_SAVE@.

\item
Save all relevant STG register in their corresponding @_SAVE@ backup globals.
(We might have @callWrapper@ variants to save different sets.)

\item
Call the function (SPARC: \tr{call %o5}); note that the function
arguments are all in the right place already.  (NOTE: It would be hard
to think of something more machine- and compiler-assumptive!  But,
remember, we are calling code with which we are on friendly terms...)

\item
Restore the @_SAVE@ backup globals into the registers; the restore
mustn't affect the single returned 32|64-bit value (SPARC case: in \tr{%o0}).

\item
@STGJUMP@ (tail-jump) to @continuation_SAVE@.
\end{enumerate}

N.B.: @callWrapper@ only works up to a limited number of arguments
(SPARC: 5 words, \tr{%o0-%o4}), because we are using \tr{%o5} (SPARC)
for the function to call.  If we run into this limit, we should pass
the function in a global instead of \tr{%o5} (or whatever).
\end{itemize}

%************************************************************************
%*									*
\subsubsection[C-as-asm-haskell-to-GC]{...To the Storage Manager in particular...}
%*									*
%************************************************************************

The interface to the GC uses the same regime; having to save and restore
all STG and ptr regs is no big deal, because it only happens once per GC.
Other routines should only use SpA/B, Hp, HeapLimit, SuA/B (for GC).

%************************************************************************
%*									*
\section[C-as-asm-native]{Things that could be better with a native-code generator}
%*									*
%************************************************************************

Even with all the fancy GNU~C tricks and whatnot, the resulting code
speed and size isn't as good as you could do if you wrote a full-blown
native-code generator.  We have no interest in doing this---the payoff
isn't big enough---but here we list, for the record, Things That Could
Be Better:

\begin{enumerate}
\item
We could park info-tables and entry code in judicious proximity to
each other, so @ENTER_CLOSURE@ would be a
single-indirection-with-offset, rather than a double-indirection.
\end{enumerate}

%************************************************************************
%*									*
\section[C-as-asm-registers]{IMPLEMENTATION: Mapping STG registers onto machine registers}
%*									*
%************************************************************************

We have several mappings from STG~registers onto real machine registers,
for different segments of the runtime system.  Each mapping is
dependent on the target architecture as well.

\downsection
\input{StgRegs.lh}			% decls of STG registers
\upsection

%************************************************************************
%*									*
\section[C-as-asm-tailjumps]{IMPLEMENTATION: tail-jumping}
%*									*
%************************************************************************

\downsection
\input{COptJumps.lh}			% per-platform tailjumps (optimised C)

\subsection[driver-sedding]{ToDo: driver sedding}

THIS SHOULD BE THE SOURCE FOR THE PART OF THE DRIVER THAT MANGLES
OPTIMISED C ASSEMBLER FILES.

\input{../runtime/c-as-asm/StgMiniInt.lc}
\upsection

%************************************************************************
%*									*
\section[C-as-asm-wrappers]{IMPLEMENTATION: ``wrappers'' to call out from the threaded world}
%*									*
%************************************************************************

\downsection
\input{COptWraps.lh}

\input{../runtime/c-as-asm/HpOverflow.lc}
\upsection

%************************************************************************
%*									*
\section[driver-consistency-chking]{ToDo: driver consistency checking}
%*									*
%************************************************************************

THIS SHOULD BE THE SOURCE FOR THE PART OF THE DRIVER THAT CHECKS
CONSISTENCY OF EXECUTABLES.


\begin{onlystandalone}
\printindex
\end{document}
\end{onlystandalone}