docs/Beignet/Backend/gen_ir.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254

Scalar Intermediate Representation
==================================

The IR code is included in `src/ir/` of the compiler code base
The IR as designed in this compiler is the fruit of a long reflection I mostly
have with Thomas Raoux. Note I usually call it "Gen IR".

Scalar vs vector IR
-------------------

This is actually the major question: do we need a vector IR or a scalar IR? On
the LLVM side, we have both. LLVM IR can manipulate vectors and scalars (and
even generalized values but we can ignore it for now).

For that reason, the Clang front-end generates both scalar and vector code.
Typically, a `uint4` variable will output a vector of 4 integers. Arithmetic
computations will be directly done on vector variables.

One the HW side, the situation is completely different:

- We are going to use the parallel mode (align1) i.e. the struct-of-array mode
  for the EU. This is a SIMD scalar mode.

- The only source of vectors we are going to have is on the sends instructions
  (and marginally for some other instructions like the div_rem math instruction)

One may therefore argue that we need vector instructions to handle the sends.
Send will indeed require both vector destinations and sources. This may be a
strong argument *for* vectors in the IR. However, the situation is not that
good.

Indeed, if we look carefully at the send instructions we see that they will
require vectors that are *not* vectors in LLVM IR. This code for example:

<code>
__global uint4 *src;<br/>
uint4 x = src[get\_global\_id(0)];<br/>
</code>

will be translated into an untyped write in the Gen ISA. Unfortunately, the
address and the values to write are in the *same* vector. However, LLVM IR will
output a store like:

`store(%addr, %value)`

which basically uses one scalar (the address) and one value (the vector to
write). Therefore even if we handle vectors in the IR, that will not directly
solve the problem we have at the end for the send instructions.

We therefore decided to go the other direction:

- We have a purely scalar IR

- To replace vectors, we simply use multiple sources and destinations

- Real vectors required by send instructions are handled at the very bottom of
the stack in the register allocation passes.

This leads to a very simple intermediate representation which is mostly a pure
scalar RISC machine.

Very limited IR
---------------

The other major question, in particular when you look similar stacks like NVidia
SPIR, is:

do we need to encode in the IR register modifiers (abs, negate...) and immediate
registers (like in add.f x y 1.0)?

Contrary to other IRs (SPIR and even LLVM that both supports immediates), we also
chose to have a very simply IR, much simpler than the final ISA, and to merge
back what we need at the instruction selection pass. Since we need instruction
selection, let us keep the IR simple.

Also, there are a lot of major issues that can not be covered in the IR and
require to be specifically handled at the very end of the code:

- send vectors (see previous section)

- send headers (value and register allocation) which are also part of the vector
problem

- SIMD8 mode in SIMD16 code. Some send messages do not support SIMD16 encoding
and require SIMD8. Typically examples are typed writes i.e. scatters to textures.
Also, this cannot be encoded in some way in a regular scalar IR.

For these reasons, most of the problems directly related to Gen naturally find
their solutions in either the instruction selection or the register allocator.

This leads to the following strategy:

- Keep the IR very simple and limited

- Use all the analysis tools you need in the IR before the final code generation
to build any information you need. This is pure "book-keeping".

- Use any previous analysis and finish the job at the very end

This classical approach leads to limit the complexity in the IR while forcing us
to write the proper tools in the final stages.

Why not using LLVM IR directly?
-------------------------------

We hesitated a long time between writing a dedicated IR (as we did) and just
using LLVM IR. Indeed, LLVM comes with a large set of tools that are parts of
"LLVM backends". LLVM provides a lot of tools to perform the instruction
selection (`SelectionDAG`) and the register allocation. Two things however
prevent us from choosing this path:

- We only have a limited experience with LLVM and no experience at all with the
LLVM backends

- LLVM register allocators do not handle at all the peculiarities of Gen:

  * flexible register file. Gen registers are more like memory than registers
    and can be freely allocated and aliased. LLVM register allocators only
    support partial aliasing like x86 machines do (rax -> eax -> ax)

  * no proper tools to handle vectors in the register allocator as we need for
    sends

Since we will need to do some significant work anyway, this leads us to choose a
more hard-coded path with a in-house IR. Note that will not prevent us from
implementing later a LLVM backend "by the book" as Nvidia does today with SPIR
(using a LLVM backend to do the LLVM IR -> SPIR conversion)


SSA or no SSA
-------------

Since we have a purely scalar IR, implementing a SSA transformation on the IR
may be convenient. However, most the literature about compiler back-ends use
non-SSA representation of the code. Since the primary goal is to write a
compiler _back-end_ (instruction selection, register allocation and instruction
scheduling), we keep the code in non-SSA letting the higher level optimizations
to LLVM.

Types, registers, instructions, functions and units
---------------------------------------------------

The IR is organized as follows:

- Types (defined in `src/ir/type.*pp`). These are scalar types only. Since the
  code is completely lowered down, there is no more reference to structures,
  pointers or vectors. Everything is scalar values and when "vectors" or
  "structures" would be needed, we use instead multiple scalar sources or
  destinations.

- Registers (defined in `src/ir/register.*pp`). They are untyped (since Gen IR
  are untyped) and we have 65,535 of them per function

- Instructions (defined in `src/ir/instruction.*pp`). They are typed (to
  distinguish integer and FP adds for example) and possibly support multiple
  destinations and sources. We also provide a convenient framework to introspect
  the instruction in a simple (and memory efficient) way

- Functions (defined in `src/ir/function.*pp`). They are basically the counter
  part of LLVM functions or OpenCL kernels. Note that function arguments are a
  problem. We actually use the SPIR ABI. Everything smaller than the machine word
  size (i.e. 32 bits for Gen) is passed by value with a register. Everything
  else which is bigger than is passed by pointer with a ByVal attribute.
  Note that requires some special treatment in the IR (see below) to make the
  code faster by replacing function argument loads by  "pushed constants". We
  also defined one "register file" per function i.e. the registers are defined
  relatively to the function that uses them. Each function is made of basic
  blocks i.e. sequence of instructions that are executed linearly.

- Units (defined in `src/ir/unit.*pp`). Units are just a collection of
  functions and constants (not supported yet).

Function arguments and pushed constants
---------------------------------------

Gen can push values into the register file i.e. some registers are preset when
the kernel starts to run. As detailed previously, the SPIR ABI is convenient
since every argument is either one register or one pointer to load from or to
store to.

However, when a pointer is used for an argument, loads are issued which may be
avoided by using constant pushes.

Once again OCL makes the task a bit harder than expected. Indeed, the C
semantic once again applies to function arguments as well.

Look at these three examples:

### Case 1. Direct loads -> constant push can be used

<code>
struct foo { int x; int y; }; </br>
\_\_kernel void case1(\_\_global int *dst, struct foo bar) </br>
{<br/>
&nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x + bar.y;<br/>
}
</code>

We use a _direct_ _load_ for `bar` with `bar.x` and `bar.y`. Values can be
pushed into registers and we can replace the loads by register reads.

### Case 2. Indirect loads -> we need to load the values from memory

<code>
struct foo { int x[16]; }; </br>
\_\_kernel void case1(\_\_global int *dst, struct foo bar) </br>
{<br/>
&nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x[get\_local\_id(0)];<br/>
}
</code>

We use an indirect load with `bar.x[get\_local\_id(0)]`. Here we need to issue a
load from memory (well, actually, we could do a gather from registers, but it is
not supported yet).

### Case 3. Writes to arguments -> we need to spill the values to memory first

<code>
struct foo { int x[16]; }; </br>
\_\_kernel void case1(\_\_global int *dst, struct foo bar) </br>
{<br/>
bar.x[0] = get\_global\_id(1);<br/>
&nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x[get\_local\_id(0)];<br/>
}
</code>

Here the values are written before being read. This causes some troubles since
we are running in SIMD mode. Indeed, we only have in memory *one* instance of
the function arguments. Here, *many* SIMD lanes and actually *many* hardware
threads are running at the same time. This means that we can not write the data
to memory. We need to allocate a private area for each SIMD lane.

In that case, we need to spill back the function arguments into memory. We spill
once per SIMD lane. Then, we read from this private area rather than the
function arguments directly.

This analysis is partially done today in `src/ir/lowering.*pp`. We identify all
the cases but only the case with constant pushing is fully implemented.
Actually, the two last cases are easy to implement but this requires one or two
days of work.

Value and liveness analysis tools
---------------------------------

You may also notice that we provide a complete framework for value analysis
(i.e. to figure when a value or instruction destination is used and where the
instruction sources come from). The code is in `src/ir/value.*pp`. Well, today,
this code will burn a crazy amount of memory (use of std::set all over the
place) but it at least provides the analysis required by many other passes.
Compacting the data structures and using O(n) algorithms instead of the O(ln(n))
are in the TODO list for sure :-)

Finally, we also provide a liveness analysis tool which simply figures out which
registers are alive at the end of each block (classically "live out" sets).