4 files changed, 44 insertions, 48 deletions
diff --git a/docs/Beignet.mdwn b/docs/Beignet.mdwn
index 1a56a6f1..dd8f349a 100644
--- a/docs/Beignet.mdwn
+++ b/docs/Beignet.mdwn
@@ -47,18 +47,17 @@ There are some severe OpenCL related regression in current clang 3.4/3.5 version
 **Note about LLVM 3.5**
 
 * If you want to try Clang/LLVM 3.5, you need to build the clang/llvm with cxx11 enabled:
---enable-cxx11.
+--enable-cxx11. The recommended revision is r211037. As LLVM 3.5 hasn't been released
+and is still in active development, revisions newer than the recommended one
+may be incompatible with beignet.
 
 **Note about OpenCV support**
 
-* We only fully tested the OpenCV 2.4 branch with beignet. And the pass rate is about 99%
-  for beignet 0.8.0. The preferred LLVM/Clang version is 3.3. One OpenCV patch is needed
-  to work with LLVM/clang, the patch is already submitted to the OpenCV upstream 2.4 repo
-  and is waiting for review: [pull request](https://github.com/Itseez/opencv/pull/2318).
-  Before it is merged, you need to apply that patch manually to OpenCV 2.4 branch.
-* As some OpenCL kerne (in OpenCV 2.4 OCL test suite) runs more than 10 seconds, it may
-  be reset by the kernel as the kernel has a GPU hangcheck mechanism. You can disable the
-  hangcheck by invoke the following command on Ubuntu system:
+* We fully tested the OpenCV 2.4 branch with beignet, and the pass rate is about 99%
+  for beignet 0.9. The preferred LLVM/Clang version is 3.3.
+* As some OpenCL kernels run for more than 10 seconds, the GPU may be reset by the Linux
+  kernel, which has a GPU hangcheck mechanism. You can disable the hangcheck by invoking
+  the following command on an Ubuntu system:
 
   `# echo -n 0 > /sys/module/i915/parameters/enable_hangcheck`
 
@@ -142,13 +141,14 @@ The code was tested on IVB GT2 with ubuntu and fedora core distribution. The rec
 kernel version is equal or newer than 3.11. Currently Only IVB is supported right now.
 Actually, the code was run on IVB GT2/GT1, and both system are well supported now.
 
-Math Function precision
------------------------
+Known Issues
+------------
 
-Currently Gen does not provide native support of high precision math functions
-required by OpenCL. We provide a software version to achieve high precision,
-which you can turn on through `export OCL_STRICT_CONFORMANCE=1`.
-But be careful, this would make your CL kernel run a little longer.
+* The "extern" keyword is not supported on the OpenCL kernel side.
+* Currently Gen does not provide native support of high precision math functions
+  required by OpenCL. We provide a software version to achieve high precision,
+  which you can turn on through `export OCL_STRICT_CONFORMANCE=1`.
+  But be careful, this would make your CL kernel run a little longer.
 
 TODO
 ----
@@ -158,13 +158,13 @@ all the piglit OpenCL test cases now. And the pass rate for the OpenCV test suit
 is also good. There are still some remains work items listed as below, most of them
 are extension support and performance related.
 
-- Performance tuning. Till now, the focus of beignet project is to implement all
-  the mandatory functions/features specified by the OpenCL spec. There are plenty
-  of things need to do for performance tuning. For example, the extreme slow software
-  based sin/cos/... math functions due to the native math instruction lack of necessary
-  precision. And all the code is inlined which will increase the icache miss rate
+- Performance tuning. Some major optimizations still need to be done:
+  peephole optimization, converting to structured BBs to leverage Gen's structured
+  instructions, and optimizing the extremely slow software based sin/cos/... math
+  functions, which exist because the native math instructions lack the necessary precision.
+  Also, all the code is inlined, which will increase the icache miss rate
   significantly. And many other things which are specified partially in
-  [[here|Beignet/Backend/TODO]]. We will focus on performance tuning after the version 0.8.
+  [[here|Beignet/Backend/TODO]].
 
 - Complete cl\_khr\_gl\_sharing support. We lack of some APIs implementation such
   as clCreateFromGLBuffer,clCreateFromGLRenderbuffer,clGetGLObjectInfo... Currently,
diff --git a/docs/Beignet/Backend/TODO.mdwn b/docs/Beignet/Backend/TODO.mdwn
index 7728d6ad..7651c852 100644
--- a/docs/Beignet/Backend/TODO.mdwn
+++ b/docs/Beignet/Backend/TODO.mdwn
@@ -28,17 +28,17 @@ many things must be implemented:
   instructions at the end of each basic block . They can be easily optimized.
 
 - From LLVM 3.3, we use SPIR IR. We need to use the compiler defined type to
-  represent sampler_t/image2d_t/image1d_t/....
+  represent sampler\_t/image2d\_t/image1d\_t/....
 
 - Considering to use libclc in our project and avoid to use the PCH which is not
   compatible for different clang versions. And may contribute what we have done in
-  the ocl_stdlib.h to libclc if possible.
+  the ocl\_stdlib.h to libclc if possible.
 
 - Optimize math functions. If the native math instructions don't compy with the
   OCL spec, we use pure software style to implement those math instructions which
   is extremely slow, for example. The cos and sin for HD4000 platform are very slow.
   For some applications which may not need such a high accurate results. We may
-  provide a mechanism to use native_xxx functions instead of the extremely slow
+  provide a mechanism to use native\_xxx functions instead of the extremely slow
   version.
 
 Gen IR
@@ -46,21 +46,16 @@ Gen IR
 
 The code is defined in `src/ir`. Main things to do are:
 
+- Convert unstructured BBs to structured format, and leverage Gen's structured
+  instructions such as if/else/endif to encode those BBs. Then we can save many
+  instructions which are used to maintain software pcips and predications.
+
 - Implement those llvm.memset/llvm.memcpy more efficiently. Currently, we lower
   them as normal memcpy at llvm module level and not considering the intrinsics
   all have a constant data length.
 
 - Finishing the handling of function arguments (see the [[IR
-  description|gen_ir]] for more details)
-
-- Adding support for linking IR units together. OpenCL indeed allows to create
-  programs from several sources
-
-- Uniform analysys. This is a major performance improvement. A "uniform" value
-  is basically a value where regardless the control flow, all the activated
-  lanes will be identical. Trivial examples are immediate values, function
-  arguments. Also, operations on uniform will produce uniform values and so
-  on...
+  description|gen\_ir]] for more details)
 
 - Merging of independent uniform loads (and samples). This is a major
   performance improvement once the uniform analysis is done. Basically, several
@@ -78,19 +73,20 @@ Backend
 
 The code is defined in `src/backend`. Main things to do are:
 
-- Optimize register spilling (see the [[compiler backend description|compiler_backend]] for more details)
+- Optimize register spilling (see the [[compiler backend description|compiler\_backend]] for more details)
 
 - Implementing proper instruction selection. A "simple" tree matching algorithm
   should provide good results for Gen
 
-- Improving the instruction scheduling pass. The current scheduling code has some bugs,
-  we disable it by default currently. We need to fix them in the future.
+- Improving the instruction scheduling pass. We need to implement proper scheduling
+  before register allocation to lower register pressure.
+
+- Reduce the macro instructions in gen\_context. The macro instructions added in
+  gen\_context never get a chance to go through post register allocation scheduling.
 
-- Some instructions are introduced in the last code generation stage. We need to
-  introduce a pass after that to eliminate dead instruction or duplicate MOVs and
-  some instructions with zero operands.
+- Leverage the structured if/endif for branch processing.
 
-- leverage the structured if/endif for branching processing ?
+- Peephole optimization. There are many remaining opportunities for peephole optimization.
 
 General plumbing
 ----------------
@@ -110,5 +106,5 @@ All of those code should be improved and cleaned up are tracked with "XXX"
 comments in the code.
 
 Parts of the code leaks memory when exceptions are used. There are some pointers
-to track and replace with std::unique_ptr. Note that we also add a custom memory
+to track and replace with std::unique\_ptr. Note that we also add a custom memory
 debugger that nicely complements (i.e. it is fast) Valgrind.
diff --git a/docs/Beignet/Backend/compiler_backend.mdwn b/docs/Beignet/Backend/compiler_backend.mdwn
index 3c489b2f..c291fe48 100644
--- a/docs/Beignet/Backend/compiler_backend.mdwn
+++ b/docs/Beignet/Backend/compiler_backend.mdwn
@@ -5,7 +5,7 @@ Well, the complete code base is somehow a compiler backend for LLVM. Here, we
 really speak about the final code generation passes that you may find in
 `src/backend`.
 
-As explained in [[the scalar IR presentation|gen_ir]], we bet on a very
+As explained in [[the scalar IR presentation|gen\_ir]], we bet on a very
 simple scalar IR to make it easy to parse and modify. The idea is to fix the
 unrelated problem (very Gen specific) where we can i.e. when the code is
 generated.
diff --git a/docs/Beignet/Backend/gen_ir.mdwn b/docs/Beignet/Backend/gen_ir.mdwn
index 424e5967..635cbb4f 100644
--- a/docs/Beignet/Backend/gen_ir.mdwn
+++ b/docs/Beignet/Backend/gen_ir.mdwn
@@ -22,7 +22,7 @@ One the HW side, the situation is completely different:
   for the EU. This is a SIMD scalar mode.
 
 - The only source of vectors we are going to have is on the sends instructions
-  (and marginally for some other instructions like the div_rem math instruction)
+  (and marginally for some other instructions like the div\_rem math instruction)
 
 One may therefore argue that we need vector instructions to handle the sends.
 Send will indeed require both vector destinations and sources. This may be a
@@ -33,7 +33,7 @@ Indeed, if we look carefully at the send instructions we see that they will
 require vectors that are *not* vectors in LLVM IR. This code for example:
 
 <code>
-__global uint4 *src;<br/>
+\_\_global uint4 \*src;<br/>
 uint4 x = src[get\_global\_id(0)];<br/>
 </code>
 
@@ -190,7 +190,7 @@ Look at these three examples:
 
 <code>
 struct foo { int x; int y; }; </br>
-\_\_kernel void case1(\_\_global int *dst, struct foo bar) </br>
+\_\_kernel void case1(\_\_global int \*dst, struct foo bar) </br>
 {<br/>
 &nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x + bar.y;<br/>
 }
@@ -203,7 +203,7 @@ pushed into registers and we can replace the loads by register reads.
 
 <code>
 struct foo { int x[16]; }; </br>
-\_\_kernel void case1(\_\_global int *dst, struct foo bar) </br>
+\_\_kernel void case1(\_\_global int \*dst, struct foo bar) </br>
 {<br/>
 &nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x[get\_local\_id(0)];<br/>
 }
@@ -217,7 +217,7 @@ not supported yet).
 
 <code>
 struct foo { int x[16]; }; </br>
-\_\_kernel void case1(\_\_global int *dst, struct foo bar) </br>
+\_\_kernel void case1(\_\_global int \*dst, struct foo bar) </br>
 {<br/>
 bar.x[0] = get\_global\_id(1);<br/>
 &nbsp;&nbsp;dst[get\_global\_id(0)] = bar.x[get\_local\_id(0)];<br/>