| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
No bug.
This commit adds a new implementation for EVEX memchr that is not safe
for RTM because it uses vzeroupper. The benefit is that by using
ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop which is
faster than the RTM safe version which cannot use vpcmpeq because
there is no EVEX encoding for the instruction. All parts of the
implementation aside from the 4x loop are the same for the two
versions and the optimization is only relevant for large sizes.
Tigerlake:
size , algn , Pos , Cur T , New T , Win , Dif
512 , 6 , 192 , 9.2 , 9.04 , no-RTM , 0.16
512 , 7 , 224 , 9.19 , 8.98 , no-RTM , 0.21
2048 , 0 , 256 , 10.74 , 10.54 , no-RTM , 0.2
2048 , 0 , 512 , 14.81 , 14.87 , RTM , 0.06
2048 , 0 , 1024 , 22.97 , 22.57 , no-RTM , 0.4
2048 , 0 , 2048 , 37.49 , 34.51 , no-RTM , 2.98 <--
Icelake:
size , algn , Pos , Cur T , New T , Win , Dif
512 , 6 , 192 , 7.6 , 7.3 , no-RTM , 0.3
512 , 7 , 224 , 7.63 , 7.27 , no-RTM , 0.36
2048 , 0 , 256 , 8.48 , 8.38 , no-RTM , 0.1
2048 , 0 , 512 , 11.57 , 11.42 , no-RTM , 0.15
2048 , 0 , 1024 , 17.92 , 17.38 , no-RTM , 0.54
2048 , 0 , 2048 , 30.37 , 27.34 , no-RTM , 3.03 <--
test-memchr, test-wmemchr, and test-rawmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 104c7b1967c3e78435c6f7eab5e225a7eddf9c6e)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The glibc memcpy benchmark on Intel Core i7-1065G7 (Ice Lake) showed
that REP MOVSB became faster after 2112 bytes:
Vector Move REP MOVSB
length=2112, align1=0, align2=0: 24.20 24.40
length=2112, align1=1, align2=0: 26.07 23.13
length=2112, align1=0, align2=1: 27.18 28.13
length=2112, align1=1, align2=1: 26.23 25.16
length=2176, align1=0, align2=0: 23.18 22.52
length=2176, align1=2, align2=0: 25.45 22.52
length=2176, align1=0, align2=2: 27.14 27.82
length=2176, align1=2, align2=2: 22.73 25.56
length=2240, align1=0, align2=0: 24.62 24.25
length=2240, align1=3, align2=0: 29.77 27.15
length=2240, align1=0, align2=3: 35.55 29.93
length=2240, align1=3, align2=3: 34.49 25.15
length=2304, align1=0, align2=0: 34.75 26.64
length=2304, align1=4, align2=0: 32.09 22.63
length=2304, align1=0, align2=4: 28.43 31.24
Use REP MOVSB for data size > 2112 bytes in memcpy on processors with
fast short REP MOVSB (FSRM).
* sysdeps/x86/dl-cacheinfo.h (dl_init_cacheinfo): Set
rep_movsb_threshold to 2112 on processors with fast short REP
MOVSB (FSRM).
(cherry picked from commit cf2c57526ba4b57e6863ad4db8a868e2678adce8)
|
|
|
|
|
|
|
|
|
|
|
| |
No bug. This commit optimizes strchr-evex.S. The optimizations are
mostly small things such as save an ALU in the alignment process,
saving a few instructions in the loop return. The one significant
change is saving 2 instructions in the 4x loop. test-strchr,
test-strchrnul, test-wcschr, and test-wcschrnul are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 7f3e7c262cab4e2401e4331a6ef29c428de02044)
|
|
|
|
|
|
|
|
|
|
|
| |
No bug. This commit optimizes strchr-avx2.S. The optimizations are all
small things such as save an ALU in the alignment process, saving a
few instructions in the loop return, saving some bytes in the main
loop, and increasing the ILP in the return cases. test-strchr,
test-strchrnul, test-wcschr, and test-wcschrnul are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit ccabe7971f508709d034b63b8672f6f751a3d356)
|
|
|
|
|
|
|
|
|
| |
No bug. This commit adds optimized cased for less_vec memset case that
uses the avx512vl/avx512bw mask store avoiding the excessive
branches. test-memset and test-wmemset are passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit f53790272ce7bdc5ecd14b45f65d0464d2a61a3a)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since strchr-avx2.S updated by
commit 1f745ecc2109890886b161d4791e1406fdfc29b8
Author: noah <goldstein.w.n@gmail.com>
Date: Wed Feb 3 00:38:59 2021 -0500
x86-64: Refactor and improve performance of strchr-avx2.S
uses sarx:
c4 e2 72 f7 c0 sarx %ecx,%eax,%eax
for strchr-avx2 family functions, require BMI2 in ifunc-impl-list.c and
ifunc-avx2.h.
(cherry picked from commit 83c5b368226c34a2f0a5287df40fc290b2b34359)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
No Bug. This commit updates the large memcpy case (no overlap). The
update is to perform memcpy on either 2 or 4 contiguous pages at
once. This 1) helps to alleviate the affects of false memory aliasing
when destination and source have a close 4k alignment and 2) In most
cases and for most DRAM units is a modestly more efficient access
pattern. These changes are a clear performance improvement for
VEC_SIZE =16/32, though more ambiguous for VEC_SIZE=64. test-memcpy,
test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all
pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 1a8605b6cd257e8a74e29b5b71c057211f5fb847)
|
|
|
|
|
|
|
|
|
|
| |
No bug. Just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. Both test-strchrnul and test-strchr passed.
(cherry picked from commit 1f745ecc2109890886b161d4791e1406fdfc29b8)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the process of optimizing memcpy for AMD machines, we have found the
vector move operations are outperforming enhanced REP MOVSB for data
transfers above the L2 cache size on Zen3 architectures.
To handle this use case, we are adding an upper bound parameter on
enhanced REP MOVSB:'__x86_rep_movsb_stop_threshold'.
As per large-bench results, we are configuring this parameter to the
L2 cache size for AMD machines and applicable from Zen3 architecture
supporting the ERMS feature.
For architectures other than AMD, it is the computed value of
non-temporal threshold parameter.
Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
(cherry picked from commit 6e02b3e9327b7dbb063958d2b124b64fcb4bbe3f)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new IBM z16 is added to platform string array.
The macro _DL_PLATFORMS_COUNT is incremented.
_dl_hwcaps_subdir is extended by "z16" if HWCAP_S390_VXRS_PDE2
is set. HWCAP_S390_NNPA is not tested in _dl_hwcaps_subdirs_active
as those instructions may be replaced or removed in future.
tst-glibc-hwcaps.c is extended in order to test z16 via new marker5.
A fatal glibc error is dumped if glibc was build with architecture
level set for z16, but run on an older machine. (See dl-hwcap-check.h)
(cherry picked from commit 2376944b9e5c0364b9fb473e4d8dabca31b57167)
|
|
|
|
| |
(cherry picked from commit 7a5c440102d4ec7fafd9bbd98eca9bd90ecaaafd)
|
|
|
|
|
|
|
|
|
| |
When swapcontext.c is compiled without -g, the following error occurs:
Error: CFI instruction used without previous .cfi_startproc
Fix by converting swapcontext routine to assembler.
(cherry picked from commit 738ee53f0ce5e39b9b7a6777f5d3057afbaac498)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change fixes two warnings from _dl_lookup_address.
The first warning comes from dropping the volatile keyword from
desc in the call to _dl_read_access_allowed. We now have a full
atomic barrier between loading desc[0] and the access check, so
desc no longer needs to be declared as volatile.
The second warning comes from the implicit declaration of
_dl_fix_reloc_arg. This is fixed by including dl-runtime.h and
declaring _dl_fix_reloc_arg in dl-runtime.h.
(cherry picked from commit 6c9c2307657529e52c5fa7037618835f2a50b916)
|
|
|
|
|
|
|
|
|
|
| |
_STACK_GROWS_DOWN is defined to 0 when the stack grows up. The
code in unwind.c used `#ifdef _STACK_GROWS_DOWN' to selct the
stack grows down define for FRAME_LEFT. As a result, the
_STACK_GROWS_DOWN define was always selected and cleanups were
incorrectly sequenced when the stack grows up.
(cherry picked from commit 2bbc694df279020a6620096d31c1e05c93966f9b)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current getcontext return trampoline is overly complex and it
unnecessarily clobbers several registers. By saving the context
pointer (r26) in the context, __getcontext_ret can restore any
registers not restored by setcontext. This allows getcontext to
save and restore the entire register context present when getcontext
is entered. We use the unused oR0 context slot for the return
from __getcontext_ret.
While this is not directly useful in C, it can be exploited in
assembly code. Registers r20, r23, r24 and r25 are not clobbered
in the call path to getcontext. This allows a small simplification
of swapcontext.
It also allows saving and restoring the 6-bit SAR register in the
LSB of the oSAR context slot. The getcontext flag value can be
stored in the MSB of the oSAR slot.
(cherry picked from commit 9e7e5fda38471e00d1190479ea91d7b08ae3e304)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change fixes the failure of stdlib/tst-setcontext2 and
stdlib/tst-setcontext7 on hppa. The implementation of swapcontext
in C is broken. C saves the return pointer (rp) and any non
call-clobbered registers (in this case r3, r4 and r5) on the
stack. However, the setcontext call in swapcontext pops the
stack and subsequent calls clobber the saved registers. When
the context in oucp is restored, both tests fault.
Here we rewrite swapcontext in assembly code to avoid using
the stack for register values that need to be used after
restoration. The getcontext and setcontext routines are
revised to save and restore register ret1 for normal returns.
We copy the oucp pointer to ret1. This allows access to
the old context after calling getcontext and setcontext.
(cherry picked from commit 71b108d7eb33b2bf3e61d5e92d2a47f74c1f7d96)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The test elf/tst-audit2 fails on hppa with a segmentation fault in the
long branch stub used to call malloc from calloc. This occurs because
the test is not a PIC executable and calloc is called from the dynamic
linker before the dp register is initialized in _dl_start_user.
The fix is to move the dp register initialization into
elf_machine_runtime_setup. Since the address of $global$ can't be
loaded directly, we continue to use the DT_PLTGOT value from the
the main_map to initialize dp. Since l_main_map is not available
in v2.34 and earlier, we use a new function, elf_machine_main_map,
to find the main map.
(cherry picked from commit 3be79b72d556e3ac37075ad6b99eb5eac18e1402)
|
| |
|
|
|
|
|
|
|
|
|
| |
Previously TEST_NAME was passing a function pointer. This didn't fail
because of the -Wno-error flag (to allow for overflow sizes passed
to strncmp/wcsncmp)
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b98d0bbf747f39770e0caba7e984ce9f8f900330)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7835d611af0854e69a0c71e3806f8fe379282d6f)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c6272098323153db373f2986c67786ea8c85f1cf)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Verify that wcsncmp (L("abc"), L("abd"), SIZE_MAX) == 0. The new test
fails without
commit ddf0992cf57a93200e0c782e2a94d0733a5a0b87
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Sun Jan 9 16:02:21 2022 -0600
x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
and
commit 7e08db3359c86c94918feb33a1182cd0ff3bb10b
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Sun Jan 9 16:02:28 2022 -0600
x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
This is for BZ #28755.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit aa5a720056d37cf24924c138a3dbe6dace98e97c)
|
|
|
|
|
|
| |
x86_cpu_TBM should be x86_cpu_index_80000001_ecx + 21.
(cherry picked from commit ba230b6387fc0ccba60d2ff6759f7e326ba7bf3e)
|
|
|
|
|
|
|
|
| |
It is not possible to use interface ioctls with netlink sockets
on all Linux kernels.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3d981795cd00cc9b73c3ee5087c308361acd62e5)
|
|
|
|
|
|
|
|
|
| |
5bf07e1b3a74 ("Linux: Simplify __opensock and fix race condition [BZ #28353]")
made __opensock try NETLINK then UNIX then INET. On the Hurd, only INET
knows about network interfaces, so better actually specify that in
if_index.
(cherry picked from commit 1d3decee997ba2fc24af81803299b2f4f3c47063)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
AF_NETLINK support is not quite optional on modern Linux systems
anymore, so it is likely that the first attempt will always succeed.
Consequently, there is no need to cache the result. Keep AF_UNIX
and the Internet address families as a fallback, for the rare case
that AF_NETLINK is missing. The other address families previously
probed are totally obsolete be now, so remove them.
Use this simplified version as the generic implementation, disabling
Netlink support as needed.
(cherry picked from commit 5bf07e1b3a74232bfb8332275110be1a5da50f83)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 6f573a27b6c8b4236445810a44660612323f5a73
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Jun 23 01:19:34 2021 -0400
x86-64: Add wcslen optimize for sse4.1
added wcsnlen-sse4.1 to the wcslen ifunc implementation list. Since the
random value in the the RSI register is larger than the wide-character
string length in the existing wcslen test, it didn't trigger the wcslen
test failure. Add a test to force 0 into the RSI register before calling
wcslen.
(cherry picked from commit a6e7c3745d73ff876b4ba6991fb00768a938aef5)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following commit
commit 6f573a27b6c8b4236445810a44660612323f5a73
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Jun 23 01:19:34 2021 -0400
x86-64: Add wcslen optimize for sse4.1
Added wcsnlen-sse4.1 to the wcslen ifunc implementation list and did
not add wcslen-sse4.1 to wcslen ifunc implementation list. This commit
fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc
implementation list and adding wcslen-sse4.1 to the ifunc
implementation list.
Testing:
test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as
well as all other tests in wcsmbs and string.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 0679442defedf7e52a94264975880ab8674736b2)
|
|
|
|
|
|
|
| |
HLE is disabled on blacklisted CPUs. Use CHECK_FEATURE_PRESENT, instead
of CHECK_FEATURE_ACTIVE, to check HLE.
(cherry picked from commit 501246c5e2dfcc278f0ebbdb72345cdd239521c7)
|
|
|
|
|
|
|
|
|
|
|
| |
Disable TSX and enable RTM_ALWAYS_ABORT for Intel CPUs listed in:
https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html
This fixes BZ #27398.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 1e000d3d33211d5a954300e2a69b90f93f18a1a1)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
From
https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html
* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
which is set to indicate to updated software that the loaded microcode is
forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
be set by default after microcode update.
* Workloads that were benefited from Intel TSX might experience a change
in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
Elision (HLE) and RTM bits to indicate to software that Intel TSX is
disabled.
1. Add RTM_ALWAYS_ABORT to CPUID features.
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set. This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcde update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.
This fixes BZ #28033.
(cherry picked from commit ea8e465a6b8d0f26c72bcbe453a854de3abf68ec)
|
|
|
|
|
|
|
|
|
| |
IBT and SHSTK usable bits are copied from CPUID feature bits and later
cleared if kernel doesn't support CET. Copy IBT and SHSTK usable only
if CET is enabled so that they aren't set on CET capable processors
with non-CET enabled glibc.
(cherry picked from commit ea26ff03227d7cacef5de6036df57734373449b4)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since __strlen_evex and __strnlen_evex added by
commit 1fd8c163a83d96ace1ff78fa6bac7aee084f6f77
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Mar 5 06:24:52 2021 -0800
x86-64: Add ifunc-avx2.h functions with 256-bit EVEX
use sarx:
c4 e2 6a f7 c0 sarx %edx,%eax,%eax
require BMI2 for __strlen_evex and __strnlen_evex in ifunc-impl-list.c.
ifunc-avx2.h already requires BMI2 for EVEX implementation.
(cherry picked from commit 55bf411b451c13f0fb7ff3d3bf9a820020b45df1)
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds tests for a bug in the wide char variant of the
functions where the implementation may assume that maxlen for wcsnlen
or n for wmemchr/strncat will not overflow when multiplied by
sizeof(wchar_t).
These tests show the following implementations failing on x86_64:
wcsnlen-sse4_1
wcsnlen-avx2
wmemchr-sse2
wmemchr-avx2
strncat would fail as well if it where on a system that prefered
either of the wcsnlen implementations that failed as it relies on
wcsnlen.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit da5a6fba0febbfc90896ce1b2eb75c6d8a88a72d)
|
|
|
|
|
|
|
|
|
|
|
| |
No bug. This commit optimizes strlen-evex.S. The
optimizations are mostly small things but they add up to roughly
10-30% performance improvement for strlen. The results for strnlen are
bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
test-wcsnlen are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 4ba65586847751372520a36757c17f114588794e)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit fixes the bug mentioned in the previous commit.
The previous implementations of wmemchr in these files relied
on maxlen * sizeof(wchar_t) which was not guranteed by the standard.
The new overflow tests added in the previous commit now
pass (As well as all the other tests).
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a775a7a3eb1e85b54af0b4ee5ff4dcf66772a1fb)
|
|
|
|
|
|
|
|
|
|
| |
No bug. This comment adds the ifunc / build infrastructure
necessary for wcslen to prefer the sse4.1 implementation
in strlen-vec.S. test-wcslen.c is passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6f573a27b6c8b4236445810a44660612323f5a73)
|
|
|
|
|
|
|
|
|
|
| |
Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1
version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S
and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants.
This also removes the unused symbols, __GI___strlen_sse2 and
__GI___wcsnlen_sse4_1.
(cherry picked from commit a0db678071c60b6c47c468d231dd0b3694ba7a98)
|
|
|
|
|
|
|
|
| |
An unknown vector operation occurred in commit 2a76821c308. Fixed it
by using "ymm{k1}{z}" but not "ymm {k1} {z}".
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6ea916adfa0ab9af6e7dc6adcf6f977dfe017835)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
No bug. This commit optimizes memchr-evex.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
saving some ALU in the alignment process, and most importantly
increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2a76821c3081d2c0231ecd2618f52662cb48fccd)
|
|
|
|
|
|
|
|
|
|
|
| |
No bug. This commit optimizes strlen-avx2.S. The optimizations are
mostly small things but they add up to roughly 10-30% performance
improvement for strlen. The results for strnlen are bit more
ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit aaa23c35071537e2dcf5807e956802ed215210aa)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit fixes the bug mentioned in the previous commit.
The previous implementations of wmemchr in these files relied
on n * sizeof(wchar_t) which was not guranteed by the standard.
The new overflow tests added in the previous commit now
pass (As well as all the other tests).
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 645a158978f9520e74074e8c14047503be4db0f0)
|
|
|
|
|
|
|
|
|
|
|
|
| |
No bug. This commit optimizes memchr-avx2.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
asaving a few instructions the in loop return loop. test-memchr,
test-rawmemchr, and test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit acfd088a1963ba51cd83c78f95c0ab25ead79e04)
|
| |
|
|
|
|
|
|
|
| |
Fix some indentations of ifdef in file strlen-evex.S which are off by 1
and confusing to read.
(cherry picked from commit 595c22ecd8e87a27fd19270ed30fdbae9ad25426)
|
|
|
|
|
|
|
|
| |
Update ifunc-memmove.h to select the function optimized with AVX512
instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
(cherry picked from commit e4fda4631017e49d4ee5a2755db34289b6860fa4)
|
|
|
|
|
|
|
|
|
| |
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort
with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
(cherry picked from commit 4e2d8f352774b56078c34648b14a2412c38384f4)
|
|
|
|
|
|
|
|
|
|
| |
At function exit, AVX optimized string/memory functions have VZEROUPPER
which triggers RTM abort. When such functions are called inside a
transactionally executing RTM region, RTM abort causes severe performance
degradation. Add tests to verify that string/memory functions won't
cause RTM abort in RTM region.
(cherry picked from commit 4bd660be40967cd69072f69ebc2ad32bfcc1f206)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
optimized string/memory functions with
xtest
jz 1f
vzeroall
ret
1:
vzeroupper
ret
at function exit on processors with usable RTM, but without 256-bit EVEX
instructions to avoid VZEROUPPER inside a transactionally executing RTM
region.
(cherry picked from commit 7ebba91361badf7531d4e75050627a88d424872f)
|