summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
...
* x86: Add EVEX optimized memchr family not safe for RTMNoah Goldstein2022-05-0210-41/+217
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No bug. This commit adds a new implementation for EVEX memchr that is not safe for RTM because it uses vzeroupper. The benefit is that by using ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop which is faster than the RTM safe version which cannot use vpcmpeq because there is no EVEX encoding for the instruction. All parts of the implementation aside from the 4x loop are the same for the two versions and the optimization is only relevant for large sizes. Tigerlake: size , algn , Pos , Cur T , New T , Win , Dif 512 , 6 , 192 , 9.2 , 9.04 , no-RTM , 0.16 512 , 7 , 224 , 9.19 , 8.98 , no-RTM , 0.21 2048 , 0 , 256 , 10.74 , 10.54 , no-RTM , 0.2 2048 , 0 , 512 , 14.81 , 14.87 , RTM , 0.06 2048 , 0 , 1024 , 22.97 , 22.57 , no-RTM , 0.4 2048 , 0 , 2048 , 37.49 , 34.51 , no-RTM , 2.98 <-- Icelake: size , algn , Pos , Cur T , New T , Win , Dif 512 , 6 , 192 , 7.6 , 7.3 , no-RTM , 0.3 512 , 7 , 224 , 7.63 , 7.27 , no-RTM , 0.36 2048 , 0 , 256 , 8.48 , 8.38 , no-RTM , 0.1 2048 , 0 , 512 , 11.57 , 11.42 , no-RTM , 0.15 2048 , 0 , 1024 , 17.92 , 17.38 , no-RTM , 0.54 2048 , 0 , 2048 , 30.37 , 27.34 , no-RTM , 3.03 <-- test-memchr, test-wmemchr, and test-rawmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 104c7b1967c3e78435c6f7eab5e225a7eddf9c6e)
* x86: Set rep_movsb_threshold to 2112 on processors with FSRMH.J. Lu2022-05-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The glibc memcpy benchmark on Intel Core i7-1065G7 (Ice Lake) showed that REP MOVSB became faster after 2112 bytes: Vector Move REP MOVSB length=2112, align1=0, align2=0: 24.20 24.40 length=2112, align1=1, align2=0: 26.07 23.13 length=2112, align1=0, align2=1: 27.18 28.13 length=2112, align1=1, align2=1: 26.23 25.16 length=2176, align1=0, align2=0: 23.18 22.52 length=2176, align1=2, align2=0: 25.45 22.52 length=2176, align1=0, align2=2: 27.14 27.82 length=2176, align1=2, align2=2: 22.73 25.56 length=2240, align1=0, align2=0: 24.62 24.25 length=2240, align1=3, align2=0: 29.77 27.15 length=2240, align1=0, align2=3: 35.55 29.93 length=2240, align1=3, align2=3: 34.49 25.15 length=2304, align1=0, align2=0: 34.75 26.64 length=2304, align1=4, align2=0: 32.09 22.63 length=2304, align1=0, align2=4: 28.43 31.24 Use REP MOVSB for data size > 2112 bytes in memcpy on processors with fast short REP MOVSB (FSRM). * sysdeps/x86/dl-cacheinfo.h (dl_init_cacheinfo): Set rep_movsb_threshold to 2112 on processors with fast short REP MOVSB (FSRM). (cherry picked from commit cf2c57526ba4b57e6863ad4db8a868e2678adce8)
* x86: Optimize strchr-evex.SNoah Goldstein2022-05-021-174/+218
| | | | | | | | | | | No bug. This commit optimizes strchr-evex.S. The optimizations are mostly small things such as save an ALU in the alignment process, saving a few instructions in the loop return. The one significant change is saving 2 instructions in the 4x loop. test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit 7f3e7c262cab4e2401e4331a6ef29c428de02044)
* x86: Optimize strchr-avx2.SNoah Goldstein2022-05-021-117/+169
| | | | | | | | | | | No bug. This commit optimizes strchr-avx2.S. The optimizations are all small things such as save an ALU in the alignment process, saving a few instructions in the loop return, saving some bytes in the main loop, and increasing the ILP in the return cases. test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit ccabe7971f508709d034b63b8672f6f751a3d356)
* x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.SNoah Goldstein2022-05-025-27/+74
| | | | | | | | | No bug. This commit adds optimized cased for less_vec memset case that uses the avx512vl/avx512bw mask store avoiding the excessive branches. test-memset and test-wmemset are passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit f53790272ce7bdc5ecd14b45f65d0464d2a61a3a)
* x86-64: Require BMI2 for strchr-avx2.SH.J. Lu2022-05-022-5/+11
| | | | | | | | | | | | | | | | | | | Since strchr-avx2.S updated by commit 1f745ecc2109890886b161d4791e1406fdfc29b8 Author: noah <goldstein.w.n@gmail.com> Date: Wed Feb 3 00:38:59 2021 -0500 x86-64: Refactor and improve performance of strchr-avx2.S uses sarx: c4 e2 72 f7 c0 sarx %ecx,%eax,%eax for strchr-avx2 family functions, require BMI2 in ifunc-impl-list.c and ifunc-avx2.h. (cherry picked from commit 83c5b368226c34a2f0a5287df40fc290b2b34359)
* x86: Update large memcpy case in memmove-vec-unaligned-erms.Snoah2022-05-021-73/+265
| | | | | | | | | | | | | | | No Bug. This commit updates the large memcpy case (no overlap). The update is to perform memcpy on either 2 or 4 contiguous pages at once. This 1) helps to alleviate the affects of false memory aliasing when destination and source have a close 4k alignment and 2) In most cases and for most DRAM units is a modestly more efficient access pattern. These changes are a clear performance improvement for VEC_SIZE =16/32, though more ambiguous for VEC_SIZE=64. test-memcpy, test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all pass. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit 1a8605b6cd257e8a74e29b5b71c057211f5fb847)
* x86-64: Refactor and improve performance of strchr-avx2.Snoah2022-05-022-115/+112
| | | | | | | | | | No bug. Just seemed the performance could be improved a bit. Observed and expected behavior are unchanged. Optimized body of main loop. Updated page cross logic and optimized accordingly. Made a few minor instruction selection modifications. No regressions in test suite. Both test-strchrnul and test-strchr passed. (cherry picked from commit 1f745ecc2109890886b161d4791e1406fdfc29b8)
* x86: Adding an upper bound for Enhanced REP MOVSB.Sajan Karumanchi2022-05-024-3/+25
| | | | | | | | | | | | | | | | In the process of optimizing memcpy for AMD machines, we have found the vector move operations are outperforming enhanced REP MOVSB for data transfers above the L2 cache size on Zen3 architectures. To handle this use case, we are adding an upper bound parameter on enhanced REP MOVSB:'__x86_rep_movsb_stop_threshold'. As per large-bench results, we are configuring this parameter to the L2 cache size for AMD machines and applicable from Zen3 architecture supporting the ERMS feature. For architectures other than AMD, it is the computed value of non-temporal threshold parameter. Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com> (cherry picked from commit 6e02b3e9327b7dbb063958d2b124b64fcb4bbe3f)
* S390: Add new s390 platform z16.Stefan Liebler2022-04-147-9/+53
| | | | | | | | | | | | | | | | The new IBM z16 is added to platform string array. The macro _DL_PLATFORMS_COUNT is incremented. _dl_hwcaps_subdir is extended by "z16" if HWCAP_S390_VXRS_PDE2 is set. HWCAP_S390_NNPA is not tested in _dl_hwcaps_subdirs_active as those instructions may be replaced or removed in future. tst-glibc-hwcaps.c is extended in order to test z16 via new marker5. A fatal glibc error is dumped if glibc was build with architecture level set for z16, but run on an older machine. (See dl-hwcap-check.h) (cherry picked from commit 2376944b9e5c0364b9fb473e4d8dabca31b57167)
* hppa: Use END instead of PSEUDO_END in swapcontext.SJohn David Anglin2022-03-151-1/+1
| | | | (cherry picked from commit 7a5c440102d4ec7fafd9bbd98eca9bd90ecaaafd)
* hppa: Implement swapcontext in assembler (bug 28960)John David Anglin2022-03-152-83/+72
| | | | | | | | | When swapcontext.c is compiled without -g, the following error occurs: Error: CFI instruction used without previous .cfi_startproc Fix by converting swapcontext routine to assembler. (cherry picked from commit 738ee53f0ce5e39b9b7a6777f5d3057afbaac498)
* hppa: Fix warnings from _dl_lookup_addressJohn David Anglin2022-03-084-9/+14
| | | | | | | | | | | | | | | This change fixes two warnings from _dl_lookup_address. The first warning comes from dropping the volatile keyword from desc in the call to _dl_read_access_allowed. We now have a full atomic barrier between loading desc[0] and the access check, so desc no longer needs to be declared as volatile. The second warning comes from the implicit declaration of _dl_fix_reloc_arg. This is fixed by including dl-runtime.h and declaring _dl_fix_reloc_arg in dl-runtime.h. (cherry picked from commit 6c9c2307657529e52c5fa7037618835f2a50b916)
* nptl: Fix cleanups for stack grows up [BZ# 28899]John David Anglin2022-03-081-1/+1
| | | | | | | | | | _STACK_GROWS_DOWN is defined to 0 when the stack grows up. The code in unwind.c used `#ifdef _STACK_GROWS_DOWN' to selct the stack grows down define for FRAME_LEFT. As a result, the _STACK_GROWS_DOWN define was always selected and cleanups were incorrectly sequenced when the stack grows up. (cherry picked from commit 2bbc694df279020a6620096d31c1e05c93966f9b)
* hppa: Revise gettext trampoline designJohn David Anglin2022-03-083-31/+35
| | | | | | | | | | | | | | | | | | | | | The current getcontext return trampoline is overly complex and it unnecessarily clobbers several registers. By saving the context pointer (r26) in the context, __getcontext_ret can restore any registers not restored by setcontext. This allows getcontext to save and restore the entire register context present when getcontext is entered. We use the unused oR0 context slot for the return from __getcontext_ret. While this is not directly useful in C, it can be exploited in assembly code. Registers r20, r23, r24 and r25 are not clobbered in the call path to getcontext. This allows a small simplification of swapcontext. It also allows saving and restoring the 6-bit SAR register in the LSB of the oSAR context slot. The getcontext flag value can be stored in the MSB of the oSAR slot. (cherry picked from commit 9e7e5fda38471e00d1190479ea91d7b08ae3e304)
* hppa: Fix swapcontextJohn David Anglin2022-03-083-7/+58
| | | | | | | | | | | | | | | | | | | This change fixes the failure of stdlib/tst-setcontext2 and stdlib/tst-setcontext7 on hppa. The implementation of swapcontext in C is broken. C saves the return pointer (rp) and any non call-clobbered registers (in this case r3, r4 and r5) on the stack. However, the setcontext call in swapcontext pops the stack and subsequent calls clobber the saved registers. When the context in oucp is restored, both tests fault. Here we rewrite swapcontext in assembly code to avoid using the stack for register values that need to be used after restoration. The getcontext and setcontext routines are revised to save and restore register ret1 for normal returns. We copy the oucp pointer to ret1. This allows access to the old context after calling getcontext and setcontext. (cherry picked from commit 71b108d7eb33b2bf3e61d5e92d2a47f74c1f7d96)
* Fix elf/tst-audit2 on hppaJohn David Anglin2022-03-081-14/+29
| | | | | | | | | | | | | | | | The test elf/tst-audit2 fails on hppa with a segmentation fault in the long branch stub used to call malloc from calloc. This occurs because the test is not a PIC executable and calloc is called from the dynamic linker before the dp register is initialized in _dl_start_user. The fix is to move the dp register initialization into elf_machine_runtime_setup. Since the address of $global$ can't be loaded directly, we continue to use the DT_PLTGOT value from the the main_map to initialize dp. Since l_main_map is not available in v2.34 and earlier, we use a new function, elf_machine_main_map, to find the main map. (cherry picked from commit 3be79b72d556e3ac37075ad6b99eb5eac18e1402)
* NEWS: Add a bug fix entry for BZ #28896H.J. Lu2022-02-181-0/+2
|
* x86: Fix TEST_NAME to make it a string in tst-strncmp-rtm.cNoah Goldstein2022-02-181-2/+2
| | | | | | | | | Previously TEST_NAME was passing a function pointer. This didn't fail because of the -Wno-error flag (to allow for overflow sizes passed to strncmp/wcsncmp) Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit b98d0bbf747f39770e0caba7e984ce9f8f900330)
* x86: Test wcscmp RTM in the wcsncmp overflow case [BZ #28896]Noah Goldstein2022-02-183-10/+48
| | | | | | | | | | | | | In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would call strcmp-avx2 and wcscmp-avx2 respectively. This would have not checks around vzeroupper and would trigger spurious aborts. This commit fixes that. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on AVX2 machines with and without RTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 7835d611af0854e69a0c71e3806f8fe379282d6f)
* x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #28896]Noah Goldstein2022-02-187-5/+22
| | | | | | | | | | | | | | In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would call strcmp-avx2 and wcscmp-avx2 respectively. This would have not checks around vzeroupper and would trigger spurious aborts. This commit fixes that. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on AVX2 machines with and without RTM. Co-authored-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit c6272098323153db373f2986c67786ea8c85f1cf)
* string: Add a testcase for wcsncmp with SIZE_MAX [BZ #28755]H.J. Lu2022-02-171-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | Verify that wcsncmp (L("abc"), L("abd"), SIZE_MAX) == 0. The new test fails without commit ddf0992cf57a93200e0c782e2a94d0733a5a0b87 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Sun Jan 9 16:02:21 2022 -0600 x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] and commit 7e08db3359c86c94918feb33a1182cd0ff3bb10b Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Sun Jan 9 16:02:28 2022 -0600 x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755] This is for BZ #28755. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com> (cherry picked from commit aa5a720056d37cf24924c138a3dbe6dace98e97c)
* <bits/platform/x86.h>: Correct x86_cpu_TBMH.J. Lu2022-02-161-1/+1
| | | | | | x86_cpu_TBM should be x86_cpu_index_80000001_ecx + 21. (cherry picked from commit ba230b6387fc0ccba60d2ff6759f7e326ba7bf3e)
* socket: Do not use AF_NETLINK in __opensockFlorian Weimer2022-02-031-8/+1
| | | | | | | | It is not possible to use interface ioctls with netlink sockets on all Linux kernels. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit 3d981795cd00cc9b73c3ee5087c308361acd62e5)
* hurd if_index: Explicitly use AF_INET for if index discoverySamuel Thibault2022-02-031-3/+3
| | | | | | | | | 5bf07e1b3a74 ("Linux: Simplify __opensock and fix race condition [BZ #28353]") made __opensock try NETLINK then UNIX then INET. On the Hurd, only INET knows about network interfaces, so better actually specify that in if_index. (cherry picked from commit 1d3decee997ba2fc24af81803299b2f4f3c47063)
* Linux: Simplify __opensock and fix race condition [BZ #28353]Florian Weimer2022-02-034-159/+23
| | | | | | | | | | | | | | AF_NETLINK support is not quite optional on modern Linux systems anymore, so it is likely that the first attempt will always succeed. Consequently, there is no need to cache the result. Keep AF_UNIX and the Internet address families as a fallback, for the rare case that AF_NETLINK is missing. The other address families previously probed are totally obsolete be now, so remove them. Use this simplified version as the generic implementation, disabling Netlink support as needed. (cherry picked from commit 5bf07e1b3a74232bfb8332275110be1a5da50f83)
* x86-64: Test strlen and wcslen with 0 in the RSI register [BZ #28064]H.J. Lu2022-02-013-0/+108
| | | | | | | | | | | | | | | | commit 6f573a27b6c8b4236445810a44660612323f5a73 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Wed Jun 23 01:19:34 2021 -0400 x86-64: Add wcslen optimize for sse4.1 added wcsnlen-sse4.1 to the wcslen ifunc implementation list. Since the random value in the the RSI register is larger than the wide-character string length in the existing wcslen test, it didn't trigger the wcslen test failure. Add a test to force 0 into the RSI register before calling wcslen. (cherry picked from commit a6e7c3745d73ff876b4ba6991fb00768a938aef5)
* x86: Remove wcsnlen-sse4_1 from wcslen ifunc-impl-list [BZ #28064]Noah Goldstein2022-02-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | The following commit commit 6f573a27b6c8b4236445810a44660612323f5a73 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Wed Jun 23 01:19:34 2021 -0400 x86-64: Add wcslen optimize for sse4.1 Added wcsnlen-sse4.1 to the wcslen ifunc implementation list and did not add wcslen-sse4.1 to wcslen ifunc implementation list. This commit fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc implementation list and adding wcslen-sse4.1 to the ifunc implementation list. Testing: test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as well as all other tests in wcsmbs and string. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 0679442defedf7e52a94264975880ab8674736b2)
* x86: Use CHECK_FEATURE_PRESENT to check HLE [BZ #27398]H.J. Lu2022-02-011-1/+1
| | | | | | | HLE is disabled on blacklisted CPUs. Use CHECK_FEATURE_PRESENT, instead of CHECK_FEATURE_ACTIVE, to check HLE. (cherry picked from commit 501246c5e2dfcc278f0ebbdb72345cdd239521c7)
* x86: Black list more Intel CPUs for TSX [BZ #27398]H.J. Lu2022-02-011-3/+31
| | | | | | | | | | | Disable TSX and enable RTM_ALWAYS_ABORT for Intel CPUs listed in: https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html This fixes BZ #27398. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit 1e000d3d33211d5a954300e2a69b90f93f18a1a1)
* x86: Check RTM_ALWAYS_ABORT for RTM [BZ #28033]H.J. Lu2022-02-016-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html * Intel TSX will be disabled by default. * The processor will force abort all Restricted Transactional Memory (RTM) transactions by default. * A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated, which is set to indicate to updated software that the loaded microcode is forcing RTM abort. * On processors that enumerate support for RTM, the CPUID enumeration bits for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to be set by default after microcode update. * Workloads that were benefited from Intel TSX might experience a change in performance. * System software may use a new bit in Model-Specific Register (MSR) 0x10F TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock Elision (HLE) and RTM bits to indicate to software that Intel TSX is disabled. 1. Add RTM_ALWAYS_ABORT to CPUID features. 2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set. This skips the string/tst-memchr-rtm etc. testcases on the affected processors, which always fail after a microcde update. 3. Check RTM feature, instead of usability, against /proc/cpuinfo. This fixes BZ #28033. (cherry picked from commit ea8e465a6b8d0f26c72bcbe453a854de3abf68ec)
* x86: Copy IBT and SHSTK usable only if CET is enabledH.J. Lu2022-02-011-2/+5
| | | | | | | | | IBT and SHSTK usable bits are copied from CPUID feature bits and later cleared if kernel doesn't support CET. Copy IBT and SHSTK usable only if CET is enabled so that they aren't set on CET capable processors with non-CET enabled glibc. (cherry picked from commit ea26ff03227d7cacef5de6036df57734373449b4)
* x86-64: Require BMI2 for __strlen_evex and __strnlen_evexH.J. Lu2022-01-271-2/+4
| | | | | | | | | | | | | | | | | | | Since __strlen_evex and __strnlen_evex added by commit 1fd8c163a83d96ace1ff78fa6bac7aee084f6f77 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 5 06:24:52 2021 -0800 x86-64: Add ifunc-avx2.h functions with 256-bit EVEX use sarx: c4 e2 6a f7 c0 sarx %edx,%eax,%eax require BMI2 for __strlen_evex and __strnlen_evex in ifunc-impl-list.c. ifunc-avx2.h already requires BMI2 for EVEX implementation. (cherry picked from commit 55bf411b451c13f0fb7ff3d3bf9a820020b45df1)
* NEWS: Add a bug fix entry for BZ #27974H.J. Lu2022-01-271-0/+1
|
* String: Add overflow tests for strnlen, memchr, and strncat [BZ #27974]Noah Goldstein2022-01-273-3/+130
| | | | | | | | | | | | | | | | | | | | | | | This commit adds tests for a bug in the wide char variant of the functions where the implementation may assume that maxlen for wcsnlen or n for wmemchr/strncat will not overflow when multiplied by sizeof(wchar_t). These tests show the following implementations failing on x86_64: wcsnlen-sse4_1 wcsnlen-avx2 wmemchr-sse2 wmemchr-avx2 strncat would fail as well if it where on a system that prefered either of the wcsnlen implementations that failed as it relies on wcsnlen. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit da5a6fba0febbfc90896ce1b2eb75c6d8a88a72d)
* x86: Optimize strlen-evex.SNoah Goldstein2022-01-271-264/+317
| | | | | | | | | | | No bug. This commit optimizes strlen-evex.S. The optimizations are mostly small things but they add up to roughly 10-30% performance improvement for strlen. The results for strnlen are bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit 4ba65586847751372520a36757c17f114588794e)
* x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]Noah Goldstein2022-01-272-38/+107
| | | | | | | | | | | | | | This commit fixes the bug mentioned in the previous commit. The previous implementations of wmemchr in these files relied on maxlen * sizeof(wchar_t) which was not guranteed by the standard. The new overflow tests added in the previous commit now pass (As well as all the other tests). Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit a775a7a3eb1e85b54af0b4ee5ff4dcf66772a1fb)
* x86-64: Add wcslen optimize for sse4.1Noah Goldstein2022-01-276-36/+63
| | | | | | | | | | No bug. This comment adds the ifunc / build infrastructure necessary for wcslen to prefer the sse4.1 implementation in strlen-vec.S. test-wcslen.c is passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 6f573a27b6c8b4236445810a44660612323f5a73)
* x86-64: Move strlen.S to multiarch/strlen-vec.SH.J. Lu2022-01-274-242/+262
| | | | | | | | | | Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1 version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants. This also removes the unused symbols, __GI___strlen_sse2 and __GI___wcsnlen_sse4_1. (cherry picked from commit a0db678071c60b6c47c468d231dd0b3694ba7a98)
* x86-64: Fix an unknown vector operation in memchr-evex.SAlice Xu2022-01-271-1/+1
| | | | | | | | An unknown vector operation occurred in commit 2a76821c308. Fixed it by using "ymm{k1}{z}" but not "ymm {k1} {z}". Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 6ea916adfa0ab9af6e7dc6adcf6f977dfe017835)
* x86: Optimize memchr-evex.SNoah Goldstein2022-01-271-225/+322
| | | | | | | | | | | | | No bug. This commit optimizes memchr-evex.S. The optimizations include replacing some branches with cmovcc, avoiding some branches entirely in the less_4x_vec case, making the page cross logic less strict, saving some ALU in the alignment process, and most importantly increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and test-wmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 2a76821c3081d2c0231ecd2618f52662cb48fccd)
* x86: Optimize strlen-avx2.SNoah Goldstein2022-01-272-214/+334
| | | | | | | | | | | No bug. This commit optimizes strlen-avx2.S. The optimizations are mostly small things but they add up to roughly 10-30% performance improvement for strlen. The results for strnlen are bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit aaa23c35071537e2dcf5807e956802ed215210aa)
* x86: Fix overflow bug with wmemchr-sse2 and wmemchr-avx2 [BZ #27974]Noah Goldstein2022-01-272-37/+98
| | | | | | | | | | | | | | This commit fixes the bug mentioned in the previous commit. The previous implementations of wmemchr in these files relied on n * sizeof(wchar_t) which was not guranteed by the standard. The new overflow tests added in the previous commit now pass (As well as all the other tests). Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 645a158978f9520e74074e8c14047503be4db0f0)
* x86: Optimize memchr-avx2.SNoah Goldstein2022-01-271-178/+247
| | | | | | | | | | | | No bug. This commit optimizes memchr-avx2.S. The optimizations include replacing some branches with cmovcc, avoiding some branches entirely in the less_4x_vec case, making the page cross logic less strict, asaving a few instructions the in loop return loop. test-memchr, test-rawmemchr, and test-wmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit acfd088a1963ba51cd83c78f95c0ab25ead79e04)
* NEWS: Add a bug fix entry for BZ #27457H.J. Lu2022-01-271-0/+1
|
* x86-64: Fix ifdef indentation in strlen-evex.SSunil K Pandey2022-01-271-8/+8
| | | | | | | Fix some indentations of ifdef in file strlen-evex.S which are off by 1 and confusing to read. (cherry picked from commit 595c22ecd8e87a27fd19270ed30fdbae9ad25426)
* x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functionsH.J. Lu2022-01-273-19/+42
| | | | | | | | Update ifunc-memmove.h to select the function optimized with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable AVX512VL since VZEROUPPER isn't needed at function exit. (cherry picked from commit e4fda4631017e49d4ee5a2755db34289b6860fa4)
* x86-64: Use ZMM16-ZMM31 in AVX512 memset family functionsH.J. Lu2022-01-274-24/+31
| | | | | | | | | Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit. (cherry picked from commit 4e2d8f352774b56078c34648b14a2412c38384f4)
* x86: Add string/memory function tests in RTM regionH.J. Lu2022-01-2712-0/+618
| | | | | | | | | | At function exit, AVX optimized string/memory functions have VZEROUPPER which triggers RTM abort. When such functions are called inside a transactionally executing RTM region, RTM abort causes severe performance degradation. Add tests to verify that string/memory functions won't cause RTM abort in RTM region. (cherry picked from commit 4bd660be40967cd69072f69ebc2ad32bfcc1f206)
* x86-64: Add AVX optimized string/memory functions for RTMH.J. Lu2022-01-2752-244/+668
| | | | | | | | | | | | | | | | | | | Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX optimized string/memory functions with xtest jz 1f vzeroall ret 1: vzeroupper ret at function exit on processors with usable RTM, but without 256-bit EVEX instructions to avoid VZEROUPPER inside a transactionally executing RTM region. (cherry picked from commit 7ebba91361badf7531d4e75050627a88d424872f)