X86-64: Use non-temporal store in memcpy on large data

The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machine. non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold to use non temporal store which is 6 times of shared cache size. When size is above the threshold, non temporal store will be used, but avoid non-temporal store if there is overlap between destination and source since destination may be in cache when source is loaded. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold and there is no overlap between destination and source.
author: H.J. Lu <hjl.tools@gmail.com> 2016-04-12 08:10:31 -0700
committer: H.J. Lu <hjl.tools@gmail.com> 2016-04-12 08:10:47 -0700
commit: a057f5f8cd1becc5ae8b51220283095bc808d72a (patch)
tree: 55776aef6bb9282910da90fe522d54f2c6aa0d37 /sysdeps/x86_64/cacheinfo.c
parent: b39d84adff832bddc3e2fc4a1878a7fba6bbb2a1 (diff)
download: glibc-a057f5f8cd1becc5ae8b51220283095bc808d72a.tar.gz
1 files changed, 8 insertions, 0 deletions
diff --git a/sysdeps/x86_64/cacheinfo.c b/sysdeps/x86_64/cacheinfo.c
index 96463df064..143b3333a8 100644
--- a/sysdeps/x86_64/cacheinfo.c
+++ b/sysdeps/x86_64/cacheinfo.c
@@ -464,6 +464,9 @@ long int __x86_raw_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
 /* Similar to __x86_shared_cache_size, but not rounded.  */
 long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 
+/* Threshold to use non temporal store.  */
+long int __x86_shared_non_temporal_threshold attribute_hidden;
+
 #ifndef DISABLE_PREFETCHW
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
@@ -662,4 +665,9 @@ init_cacheinfo (void)
       __x86_shared_cache_size_half = shared / 2;
       __x86_shared_cache_size = shared;
     }
+
+  /* The large memcpy micro benchmark in glibc shows that 6 times of
+     shared cache size is the approximate value above which non-temporal
+     store becomes faster.  */
+  __x86_shared_non_temporal_threshold = __x86_shared_cache_size * 6;
 }
author	H.J. Lu <hjl.tools@gmail.com>	2016-04-12 08:10:31 -0700
committer	H.J. Lu <hjl.tools@gmail.com>	2016-04-12 08:10:47 -0700
commit	a057f5f8cd1becc5ae8b51220283095bc808d72a (patch)
tree	55776aef6bb9282910da90fe522d54f2c6aa0d37 /sysdeps/x86_64/cacheinfo.c
parent	b39d84adff832bddc3e2fc4a1878a7fba6bbb2a1 (diff)
download	glibc-a057f5f8cd1becc5ae8b51220283095bc808d72a.tar.gz