author     Alexander Monakov <amonakov@ispras.ru>  2015-11-24 16:16:15 +0300
committer  Alexander Monakov <amonakov@ispras.ru>  2015-12-09 19:31:35 +0300
commit     abc4adc5765e7024549f313579e5400604ddc52d (patch)
tree       785981d9cb1efde32ca27eea2a23da9a32b4cd98
parent     c4f277e0a29c2dd322295b4a3b20299054347e61 (diff)
download   gcc-abc4adc5765e7024549f313579e5400604ddc52d.tar.gz
nvptx backend: new "uniform SIMT" codegen variant
This patch introduces a code generation variant for NVPTX that I'm using for
SIMD work in OpenMP offloading.  Let me try to explain the idea behind it.

In place of SIMD vectorization, NVPTX uses SIMT (single instruction/multiple
threads) execution: groups of 32 threads execute the same instruction, with
some threads possibly masked off when under a divergent branch.  So we map
OpenMP threads to such thread groups ("warps"), and hardware threads are then
mapped to OpenMP SIMD lanes.

We need to reach the heads of SIMD regions with all hardware threads active,
because there is no way to "resurrect" them once masked off: they need to
follow the same control flow and reach the SIMD region entry with the same
local state (registers, and stack too for OpenACC).

The approach in OpenACC is, outside of "vector" loops, to 1) make threads 1-31
"slaves" that just follow branches without any computation -- which requires
extra jumps and broadcasting of branch predicates -- and 2) broadcast register
state and stack state from master to slaves when entering "vector" regions.

I'm taking a different approach.  I want to execute all insns in all warp
members, while ensuring that the effect (on global and local state) is the
same as if a single thread executed each instruction.  Most instructions
automatically satisfy that: if threads have the same state, then executing an
arithmetic instruction, a normal memory load/store, etc. keeps local state the
same in all threads.

The two exceptional insn categories are atomics and calls.  For calls, we can
demand recursively that they uphold this execution model, until we reach
runtime-provided "syscalls": malloc/free/vprintf.  Those we can handle like
atomics.

To handle atomics, we 1) execute the atomic conditionally in one warp member
only, so its side effect happens once, and 2) copy the register that was set
from that warp member to the others, so local state is kept synchronized:

    atom.op dest, ...

becomes

    /* pred = (current_lane == 0); */
    @pred atom.op dest, ...
    shuffle.idx dest, dest, /*srclane=*/0

So the overhead is one shuffle insn following each atomic, plus predicate
setup in the prologue.

The above handles execution outside of SIMD regions nicely, but we also need
to run code inside of SIMD regions, where this synching effect must be turned
off.  It turns out we can keep atomics decorated almost as before:

    @pred atom.op dest, ...
    shuffle.idx dest, dest, master_lane

and compute 'pred' and 'master_lane' accordingly: outside of SIMD regions we
need (master_lane == 0 && pred == (current_lane == 0)), and inside we need
(master_lane == current_lane && pred == true), so that the shuffle is a no-op
and the predicate is true for all lanes.  Then pred = (current_lane ==
master_lane) works in both cases, and we just need to set up master_lane
accordingly: master_lane = current_lane & mask, where mask is all-zeros
outside of SIMD regions and all-ones inside.  To store these per-warp masks,
I've introduced another shared memory array, __nvptx_uni.

	* config/nvptx/nvptx.c (need_unisimt_decl): New variable.  Set it...
	(nvptx_init_unisimt_predicate): ...here (new function) and use it...
	(nvptx_file_end): ...here to emit declaration of __nvptx_uni array.
	(nvptx_declare_function_name): Call nvptx_init_unisimt_predicate.
	(nvptx_get_unisimt_master): New helper function.
	(nvptx_get_unisimt_predicate): Ditto.
	(nvptx_call_insn_is_syscall_p): Ditto.
	(nvptx_unisimt_handle_set): Ditto.
	(nvptx_reorg_uniform_simt): New.  Transform code for -muniform-simt.
	(nvptx_get_axis_predicate): New helper function, factored out from...
	(nvptx_single): ...here.
	(nvptx_reorg): Call nvptx_reorg_uniform_simt.
	* config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_unisimt__ when -muniform-simt option is active.
	(struct machine_function): Add unisimt_master, unisimt_predicate
	rtx fields.
	* config/nvptx/nvptx.md (divergent): New attribute.
	(atomic_compare_and_swap<mode>_1): Mark as divergent.
	(atomic_exchange<mode>): Ditto.
	(atomic_fetch_add<mode>): Ditto.
	(atomic_fetch_addsf): Ditto.
	(atomic_fetch_<logic><mode>): Ditto.
	* config/nvptx/nvptx.opt (muniform-simt): New option.
	* doc/invoke.texi (-muniform-simt): Document.
-rw-r--r--  gcc/ChangeLog.gomp-nvptx   27
-rw-r--r--  gcc/config/nvptx/nvptx.c  138
-rw-r--r--  gcc/config/nvptx/nvptx.h    4
-rw-r--r--  gcc/config/nvptx/nvptx.md  18
-rw-r--r--  gcc/config/nvptx/nvptx.opt  4
-rw-r--r--  gcc/doc/invoke.texi        14
6 files changed, 192 insertions, 13 deletions
diff --git a/gcc/ChangeLog.gomp-nvptx b/gcc/ChangeLog.gomp-nvptx
index 260a2b6e78c..43f712e2de7 100644
--- a/gcc/ChangeLog.gomp-nvptx
+++ b/gcc/ChangeLog.gomp-nvptx
@@ -1,5 +1,32 @@
2015-12-09 Alexander Monakov <amonakov@ispras.ru>
+ * config/nvptx/nvptx.c (need_unisimt_decl): New variable. Set it...
+ (nvptx_init_unisimt_predicate): ...here (new function) and use it...
+ (nvptx_file_end): ...here to emit declaration of __nvptx_uni array.
+ (nvptx_declare_function_name): Call nvptx_init_unisimt_predicate.
+ (nvptx_get_unisimt_master): New helper function.
+ (nvptx_get_unisimt_predicate): Ditto.
+ (nvptx_call_insn_is_syscall_p): Ditto.
+ (nvptx_unisimt_handle_set): Ditto.
+ (nvptx_reorg_uniform_simt): New. Transform code for -muniform-simt.
+ (nvptx_get_axis_predicate): New helper function, factored out from...
+ (nvptx_single): ...here.
+ (nvptx_reorg): Call nvptx_reorg_uniform_simt.
+ * config/nvptx/nvptx.h (TARGET_CPU_CPP_BUILTINS): Define
+ __nvptx_unisimt__ when -muniform-simt option is active.
+ (struct machine_function): Add unisimt_master, unisimt_predicate
+ rtx fields.
+ * config/nvptx/nvptx.md (divergent): New attribute.
+ (atomic_compare_and_swap<mode>_1): Mark as divergent.
+ (atomic_exchange<mode>): Ditto.
+ (atomic_fetch_add<mode>): Ditto.
+ (atomic_fetch_addsf): Ditto.
+ (atomic_fetch_<logic><mode>): Ditto.
+ * config/nvptx/nvptx.opt (muniform-simt): New option.
+ * doc/invoke.texi (-muniform-simt): Document.
+
+2015-12-09 Alexander Monakov <amonakov@ispras.ru>
+
* config/nvptx/nvptx.c (nvptx_output_call_insn): Handle COND_EXEC
patterns. Emit instruction predicate.
(nvptx_print_operand): Unbreak handling of instruction predicates.
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index c43543d7802..f9e12701d5c 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -144,6 +144,9 @@ static GTY(()) tree global_lock_var;
/* True if any function references __nvptx_stacks. */
static bool need_softstack_decl;
+/* True if any function references __nvptx_uni. */
+static bool need_unisimt_decl;
+
/* Allocate a new, cleared machine_function structure. */
static struct machine_function *
@@ -729,6 +732,33 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
fprintf (file, "\t}\n");
}
+/* Emit code to initialize predicate and master lane index registers for
+ -muniform-simt code generation variant. */
+
+static void
+nvptx_init_unisimt_predicate (FILE *file)
+{
+ int bits = BITS_PER_WORD;
+ int master = REGNO (cfun->machine->unisimt_master);
+ int pred = REGNO (cfun->machine->unisimt_predicate);
+ fprintf (file, "\t{\n");
+ fprintf (file, "\t\t.reg.u32 %%ustmp0;\n");
+ fprintf (file, "\t\t.reg.u%d %%ustmp1;\n", bits);
+ fprintf (file, "\t\t.reg.u%d %%ustmp2;\n", bits);
+ fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.y;\n");
+ fprintf (file, "\t\tmul%s.u32 %%ustmp1, %%ustmp0, 4;\n",
+ bits == 64 ? ".wide" : "");
+ fprintf (file, "\t\tmov.u%d %%ustmp2, __nvptx_uni;\n", bits);
+ fprintf (file, "\t\tadd.u%d %%ustmp2, %%ustmp2, %%ustmp1;\n", bits);
+ fprintf (file, "\t\tld.shared.u32 %%r%d, [%%ustmp2];\n", master);
+ fprintf (file, "\t\tmov.u32 %%ustmp0, %%tid.x;\n");
+ /* rNN = tid.x & __nvptx_uni[tid.y]; */
+ fprintf (file, "\t\tand.b32 %%r%d, %%r%d, %%ustmp0;\n", master, master);
+ fprintf (file, "\t\tsetp.eq.u32 %%r%d, %%r%d, %%ustmp0;\n", pred, master);
+ fprintf (file, "\t}\n");
+ need_unisimt_decl = true;
+}
+
/* Emit kernel NAME for function ORIG outlined for an OpenMP 'target' region:
extern void gomp_nvptx_main (void (*fn)(void*), void *fnarg);
@@ -915,6 +945,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
if (cfun->machine->axis_predicate[1])
nvptx_init_axis_predicate (file,
REGNO (cfun->machine->axis_predicate[1]), "x");
+ if (cfun->machine->unisimt_predicate)
+ nvptx_init_unisimt_predicate (file);
}
/* Output a return instruction. Also copy the return value to its outgoing
@@ -2377,6 +2409,86 @@ nvptx_reorg_subreg (void)
}
}
+/* Return a SImode "master lane index" register for uniform-simt, allocating on
+ first use. */
+
+static rtx
+nvptx_get_unisimt_master ()
+{
+ rtx &master = cfun->machine->unisimt_master;
+ return master ? master : master = gen_reg_rtx (SImode);
+}
+
+/* Return a BImode "predicate" register for uniform-simt, similar to above. */
+
+static rtx
+nvptx_get_unisimt_predicate ()
+{
+ rtx &pred = cfun->machine->unisimt_predicate;
+ return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
+/* Return true if given call insn references one of the functions provided by
+ the CUDA runtime: malloc, free, vprintf. */
+
+static bool
+nvptx_call_insn_is_syscall_p (rtx_insn *insn)
+{
+ rtx pat = PATTERN (insn);
+ gcc_checking_assert (GET_CODE (pat) == PARALLEL);
+ pat = XVECEXP (pat, 0, 0);
+ if (GET_CODE (pat) == SET)
+ pat = SET_SRC (pat);
+ gcc_checking_assert (GET_CODE (pat) == CALL
+ && GET_CODE (XEXP (pat, 0)) == MEM);
+ rtx addr = XEXP (XEXP (pat, 0), 0);
+ if (GET_CODE (addr) != SYMBOL_REF)
+ return false;
+ const char *name = XSTR (addr, 0);
+ return (!strcmp (name, "vprintf")
+ || !strcmp (name, "__nvptx_real_malloc")
+ || !strcmp (name, "__nvptx_real_free"));
+}
+
+/* If SET subexpression of INSN sets a register, emit a shuffle instruction to
+ propagate its value from lane MASTER to current lane. */
+
+static void
+nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
+{
+ rtx reg;
+ if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
+ emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
+}
+
+/* Adjust code for uniform-simt code generation variant by making atomics and
+ "syscalls" conditionally executed, and inserting shuffle-based propagation
+ for registers being set. */
+
+static void
+nvptx_reorg_uniform_simt ()
+{
+ rtx_insn *insn, *next;
+
+ for (insn = get_insns (); insn; insn = next)
+ {
+ next = NEXT_INSN (insn);
+ if (!(CALL_P (insn) && nvptx_call_insn_is_syscall_p (insn))
+ && !(NONJUMP_INSN_P (insn)
+ && GET_CODE (PATTERN (insn)) == PARALLEL
+ && get_attr_divergent (insn)))
+ continue;
+ rtx pat = PATTERN (insn);
+ rtx master = nvptx_get_unisimt_master ();
+ for (int i = 0; i < XVECLEN (pat, 0); i++)
+ nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+ rtx pred = nvptx_get_unisimt_predicate ();
+ pred = gen_rtx_NE (BImode, pred, const0_rtx);
+ pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
+ validate_change (insn, &PATTERN (insn), pat, false);
+ }
+}
+
/* Loop structure of the function. The entire function is described as
a NULL loop. */
@@ -3480,6 +3592,15 @@ nvptx_wsync (bool after)
return gen_nvptx_barsync (GEN_INT (after));
}
+/* Return a BImode "axis predicate" register, allocating on first use. */
+
+static rtx
+nvptx_get_axis_predicate (int axis)
+{
+ rtx &pred = cfun->machine->axis_predicate[axis];
+ return pred ? pred : pred = gen_reg_rtx (BImode);
+}
+
/* Single neutering according to MASK. FROM is the incoming block and
TO is the outgoing block. These may be the same block. Insert at
start of FROM:
@@ -3564,14 +3685,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
if (GOMP_DIM_MASK (mode) & skip_mask)
{
rtx_code_label *label = gen_label_rtx ();
- rtx pred = cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER];
-
- if (!pred)
- {
- pred = gen_reg_rtx (BImode);
- cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
- }
-
+ rtx pred = nvptx_get_axis_predicate (mode - GOMP_DIM_WORKER);
rtx br;
if (mode == GOMP_DIM_VECTOR)
br = gen_br_true (pred, label);
@@ -3898,6 +4012,9 @@ nvptx_reorg (void)
/* Replace subregs. */
nvptx_reorg_subreg ();
+ if (TARGET_UNIFORM_SIMT)
+ nvptx_reorg_uniform_simt ();
+
regstat_free_n_sets_and_refs ();
df_finish_pass (true);
@@ -4074,6 +4191,11 @@ nvptx_file_end (void)
fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n",
BITS_PER_WORD);
}
+ if (need_unisimt_decl)
+ {
+ fprintf (asm_out_file, "// BEGIN GLOBAL VAR DECL: __nvptx_uni\n");
+ fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n");
+ }
}
/* Expander for the shuffle builtins. */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index aced242002e..c631d997a1b 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -33,6 +33,8 @@
builtin_define ("__nvptx__"); \
if (TARGET_SOFT_STACK) \
builtin_define ("__nvptx_softstack__"); \
+ if (TARGET_UNIFORM_SIMT) \
+ builtin_define ("__nvptx_unisimt__"); \
} while (0)
/* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -227,6 +229,8 @@ struct GTY(()) machine_function
HOST_WIDE_INT outgoing_stdarg_size;
int ret_reg_mode; /* machine_mode not defined yet. */
rtx axis_predicate[2];
+ rtx unisimt_master;
+ rtx unisimt_predicate;
};
#endif
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 0fc853add34..ae1909d9570 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -63,6 +63,9 @@
(define_attr "subregs_ok" "false,true"
(const_string "false"))
+(define_attr "divergent" "false,true"
+ (const_string "false"))
+
(define_predicate "nvptx_register_operand"
(match_code "reg,subreg")
{
@@ -1288,7 +1291,8 @@
(set (match_dup 1)
(unspec_volatile:SDIM [(const_int 0)] UNSPECV_CAS))]
""
- "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;")
+ "%.\\tatom%A1.cas.b%T0\\t%0, %1, %2, %3;"
+ [(set_attr "divergent" "true")])
(define_insn "atomic_exchange<mode>"
[(set (match_operand:SDIM 0 "nvptx_register_operand" "=R") ;; output
@@ -1299,7 +1303,8 @@
(set (match_dup 1)
(match_operand:SDIM 2 "nvptx_nonmemory_operand" "Ri"))] ;; input
""
- "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;")
+ "%.\\tatom%A1.exch.b%T0\\t%0, %1, %2;"
+ [(set_attr "divergent" "true")])
(define_insn "atomic_fetch_add<mode>"
[(set (match_operand:SDIM 1 "memory_operand" "+m")
@@ -1311,7 +1316,8 @@
(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
(match_dup 1))]
""
- "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+ "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+ [(set_attr "divergent" "true")])
(define_insn "atomic_fetch_addsf"
[(set (match_operand:SF 1 "memory_operand" "+m")
@@ -1323,7 +1329,8 @@
(set (match_operand:SF 0 "nvptx_register_operand" "=R")
(match_dup 1))]
""
- "%.\\tatom%A1.add%t0\\t%0, %1, %2;")
+ "%.\\tatom%A1.add%t0\\t%0, %1, %2;"
+ [(set_attr "divergent" "true")])
(define_code_iterator any_logic [and ior xor])
(define_code_attr logic [(and "and") (ior "or") (xor "xor")])
@@ -1339,7 +1346,8 @@
(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
(match_dup 1))]
"0"
- "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
+ "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;"
+ [(set_attr "divergent" "true")])
(define_insn "nvptx_barsync"
[(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 79ad3a21314..f8508d80910 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -36,3 +36,7 @@ Optimize partition neutering
msoft-stack
Target Report Mask(SOFT_STACK)
Use custom stacks instead of local memory for automatic storage.
+
+muniform-simt
+Target Report Mask(UNIFORM_SIMT)
+Generate code that executes all threads in a warp as if one was active.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8e0ee9099e5..c971d83038f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -19074,6 +19074,20 @@ in shared memory array @code{char *__nvptx_stacks[]} at position @code{tid.y}
as the stack pointer. This is for placing automatic variables into storage
that can be accessed from other threads, or modified with atomic instructions.
+@item -muniform-simt
+@opindex muniform-simt
+Switch to a code generation variant that allows executing all threads in each
+warp, while maintaining memory state and side effects as if only one thread
+in each warp was active outside of OpenMP SIMD regions.  All atomic operations
+and calls to runtime (malloc, free, vprintf) are conditionally executed (iff
+current lane index equals the master lane index), and the register being
+assigned is copied via a shuffle instruction from the master lane.  Outside of
+SIMD regions lane 0 is the master; inside, each thread sees itself as the
+master.  Shared memory array @code{int __nvptx_uni[]} stores all-zeros or
+all-ones bitmasks for each warp, indicating current mode (0 outside of SIMD
+regions).  Each thread can bitwise-and the bitmask at position @code{tid.y}
+with current lane index to compute the master lane index.
+
@end table
@node PDP-11 Options