path: root/sysdeps/x86_64/Makefile
Commit log, newest first. Format: * commit message (Author, Date, files changed, lines -/+)
* nptl: move tst-x86-64-tls-1 to nptl-only tests (Samuel Thibault, 2023-05-01, 1 file, -2/+0)
  It is essentially nptl-only.
* Remove --enable-tunables configure option (Adhemerval Zanella Netto, 2023-03-29, 1 file, -2/+0)
  Make tunables always supported. The configure option was added in glibc 2.25
  and some features require it (such as the hwcap mask, huge page support, and
  lock elision tuning). Removing it also simplifies the build permutations.
  Changes from v1:
  * Remove the glibc.rtld.dynamic_sort changes; they are orthogonal and need
    more discussion.
  * Clean up more code.
  Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* x86: Add evex optimized functions for the wchar_t strcpy family (Noah Goldstein, 2022-11-08, 1 file, -0/+5)
  Implemented:
    wcscat-evex  (+ 905 bytes)
    wcscpy-evex  (+ 674 bytes)
    wcpcpy-evex  (+ 709 bytes)
    wcsncpy-evex (+1358 bytes)
    wcpncpy-evex (+1467 bytes)
    wcsncat-evex (+1213 bytes)
  Performance changes: times are from N = 10 runs of the benchmark suite and
  are reported as the geometric mean of all ratios of
  New Implementation / Best Old Implementation, where the best old
  implementation is the one with the highest ISA level.
    wcscat-evex  -> 0.991
    wcscpy-evex  -> 0.587
    wcpcpy-evex  -> 0.695
    wcsncpy-evex -> 0.719
    wcpncpy-evex -> 0.694
    wcsncat-evex -> 0.979
  Code size changes: this change increases the size of libc.so by ~6.3 KB. For
  reference, the patch optimizing the normal strcpy family decreases libc.so
  by ~5.7 KB.
  Full check passes on x86-64 and the build succeeds for all ISA levels with
  and without multiarch.
* x86_64: Remove platform directory library loading test (Javier Pello, 2022-10-06, 1 file, -16/+0)
  This was to test loading of shared libraries from platform subdirectories,
  but this functionality is going away in the following commits.
  Signed-off-by: Javier Pello <devel@otheo.eu>
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* arc4random: simplify design for better safety (Jason A. Donenfeld, 2022-07-27, 1 file, -7/+0)
  Rather than buffering 16 MiB of entropy in userspace (by way of chacha20),
  simply call getrandom() every time. This approach is doubtlessly slower, for
  now, but trying to prematurely optimize arc4random appears to be leading
  toward all sorts of nasty properties and gotchas. Instead, this patch takes
  a much more conservative approach: the interface is added as a basic loop
  wrapper around getrandom(), and later the kernel and libc can work together
  on optimizing that.

  This prevents numerous issues in which userspace is unaware of when it
  really must throw away its buffer, since we avoid buffering altogether.
  Future improvements may include userspace learning more from the kernel
  about when to do that, which might make these sorts of chacha20-based
  optimizations more possible. The current heuristic of 16 MiB is meaningless
  garbage that doesn't correspond to anything the kernel might know about. So
  for now, let's just do something conservative that we know is correct and
  won't lead to cryptographic issues for users of this function.

  This patch might be considered along the lines of "optimization is the root
  of all evil": the much more complex implementation it replaces moves too
  fast without considering security implications, whereas the incremental
  approach done here is a much safer way of going about things. Once this
  lands, we can take our time optimizing this properly using new interplay
  between the kernel and userspace.

  getrandom(0) is used, since that's the mode that ensures the bytes returned
  are cryptographically secure. On systems without it, we fall back to using
  /dev/urandom. This is unfortunate because it means opening a file
  descriptor, but there's not much of a choice. Secondly, as part of the
  fallback, in order to get more or less the same properties as getrandom(0),
  we poll on /dev/random, and if the poll succeeds at least once, we assume
  the RNG is initialized. This is a rough approximation, as the ancient
  "non-blocking pool" initialized after the "blocking pool", not before, and
  it may not port back to all ancient kernels, though it does to all kernels
  supported by glibc (≥3.2), so generally it's the best approximation we can
  do.

  The motivation for including arc4random in the first place is source-level
  compatibility with existing code. That means this patch doesn't attempt to
  litigate the interface itself; it does, however, choose a conservative
  approach for implementing it.
  Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
  Cc: Florian Weimer <fweimer@redhat.com>
  Cc: Cristian Rodríguez <crrodriguez@opensuse.org>
  Cc: Paul Eggert <eggert@cs.ucla.edu>
  Cc: Mark Harris <mark.hsj@gmail.com>
  Cc: Eric Biggers <ebiggers@kernel.org>
  Cc: linux-crypto@vger.kernel.org
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
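  The loop-plus-fallback design described above is simple enough to sketch in
  C. This is an illustration only, not glibc's actual code; fill_random is a
  hypothetical name:

    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <stddef.h>
    #include <sys/random.h>
    #include <unistd.h>

    /* Fill BUF with LEN cryptographically secure bytes, looping over
       getrandom() so short reads and EINTR are handled.  On kernels
       without getrandom, poll /dev/random once to approximate "RNG is
       initialized", then read from /dev/urandom.  */
    static int
    fill_random (void *buf, size_t len)
    {
      unsigned char *p = buf;
      while (len > 0)
        {
          ssize_t n = getrandom (p, len, 0);
          if (n < 0)
            {
              if (errno == EINTR)
                continue;
              if (errno == ENOSYS)
                break;            /* Old kernel: use the fallback below.  */
              return -1;
            }
          p += n;
          len -= n;
        }
      if (len == 0)
        return 0;

      /* Fallback path, approximating getrandom(0) semantics.  */
      int rfd = open ("/dev/random", O_RDONLY);
      if (rfd < 0)
        return -1;
      struct pollfd pfd = { .fd = rfd, .events = POLLIN };
      int rc = poll (&pfd, 1, -1);
      close (rfd);
      if (rc < 0)
        return -1;

      int fd = open ("/dev/urandom", O_RDONLY);
      if (fd < 0)
        return -1;
      while (len > 0)
        {
          ssize_t n = read (fd, p, len);
          if (n <= 0)
            {
              if (n < 0 && errno == EINTR)
                continue;
              close (fd);
              return -1;
            }
          p += n;
          len -= n;
        }
      close (fd);
      return 0;
    }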
* x86: Add AVX2 optimized chacha20 (Adhemerval Zanella Netto, 2022-07-22, 1 file, -0/+1)
  It adds a vectorized ChaCha20 implementation based on libgcrypt
  cipher/chacha20-amd64-avx2.S. It is used only if AVX2 is supported and
  enabled by the architecture. As in the generic implementation, the last
  step that XORs with the input is omitted. The final state register clearing
  is also omitted.
  On a Ryzen 9 5900X it shows the following improvements (using formatted
  bench-arc4random data):

    SSE                                        MB/s
    -----------------------------------------------
    arc4random [single-thread]               704.25
    arc4random_buf(16) [single-thread]      1018.17
    arc4random_buf(32) [single-thread]      1315.27
    arc4random_buf(48) [single-thread]      1449.36
    arc4random_buf(64) [single-thread]      1511.16
    arc4random_buf(80) [single-thread]      1539.48
    arc4random_buf(96) [single-thread]      1571.06
    arc4random_buf(112) [single-thread]     1596.16
    arc4random_buf(128) [single-thread]     1613.48
    -----------------------------------------------

    AVX2                                       MB/s
    -----------------------------------------------
    arc4random [single-thread]               922.61
    arc4random_buf(16) [single-thread]      1478.70
    arc4random_buf(32) [single-thread]      2241.80
    arc4random_buf(48) [single-thread]      2681.28
    arc4random_buf(64) [single-thread]      2913.43
    arc4random_buf(80) [single-thread]      3009.73
    arc4random_buf(96) [single-thread]      3141.16
    arc4random_buf(112) [single-thread]     3254.46
    arc4random_buf(128) [single-thread]     3305.02
    -----------------------------------------------

  Checked on x86_64-linux-gnu.
* x86: Add SSE2 optimized chacha20 (Adhemerval Zanella Netto, 2022-07-22, 1 file, -0/+6)
  It adds a vectorized ChaCha20 implementation based on libgcrypt
  cipher/chacha20-amd64-ssse3.S. It replaces ROTATE_SHUF_2 (which uses
  pshufb) with ROTATE2, thus making the implementation plain SSE2 (see the
  sketch after this entry). As in the generic implementation, the last step
  that XORs with the input is omitted. The final state register clearing is
  also omitted.
  On a Ryzen 9 5900X it shows the following improvements (using formatted
  bench-arc4random data):

    GENERIC                                    MB/s
    -----------------------------------------------
    arc4random [single-thread]               443.11
    arc4random_buf(16) [single-thread]       552.27
    arc4random_buf(32) [single-thread]       626.86
    arc4random_buf(48) [single-thread]       649.81
    arc4random_buf(64) [single-thread]       663.95
    arc4random_buf(80) [single-thread]       674.78
    arc4random_buf(96) [single-thread]       675.17
    arc4random_buf(112) [single-thread]      680.69
    arc4random_buf(128) [single-thread]      683.20
    -----------------------------------------------

    SSE                                        MB/s
    -----------------------------------------------
    arc4random [single-thread]               704.25
    arc4random_buf(16) [single-thread]      1018.17
    arc4random_buf(32) [single-thread]      1315.27
    arc4random_buf(48) [single-thread]      1449.36
    arc4random_buf(64) [single-thread]      1511.16
    arc4random_buf(80) [single-thread]      1539.48
    arc4random_buf(96) [single-thread]      1571.06
    arc4random_buf(112) [single-thread]     1596.16
    arc4random_buf(128) [single-thread]     1613.48
    -----------------------------------------------

  Checked on x86_64-linux-gnu.
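  The pshufb-to-shifts substitution mentioned above can be illustrated with
  intrinsics. A hedged sketch of the idea, not the actual assembly; function
  and macro names here are made up:

    #include <emmintrin.h>   /* SSE2 */
    #include <tmmintrin.h>   /* SSSE3, for the pshufb variant */

    /* ChaCha20 rotates each 32-bit lane left by a constant.  SSSE3 can do
       the 16-bit rotate with one byte shuffle (pshufb): rotl by 16 just
       swaps the 16-bit halves of every dword.  */
    static inline __m128i
    rotl16_ssse3 (__m128i v)
    {
      const __m128i shuf = _mm_set_epi8 (13, 12, 15, 14, 9, 8, 11, 10,
                                         5, 4, 7, 6, 1, 0, 3, 2);
      return _mm_shuffle_epi8 (v, shuf);
    }

    /* Plain SSE2 composes the same rotate from two shifts and an OR.
       Using this form everywhere removes the only SSSE3 instruction.  */
    #define ROTL_SSE2(v, n) \
      (_mm_or_si128 (_mm_slli_epi32 ((v), (n)), \
                     _mm_srli_epi32 ((v), 32 - (n))))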
* x86: Add support to build wcscpy with explicit ISA level (Noah Goldstein, 2022-07-16, 1 file, -0/+1)
  1. Add ISA level build guards to different implementations.
     - wcscpy-ssse3.S is used at ISA levels 2/3/4.
     - wcscpy-generic.c is only used at ISA level 1 and will only be built
       if compiled with ISA level == 1. Otherwise there is no reason to
       include it, as we will always use wcscpy-ssse3.S.
  2. Refactor the ifunc selector and ifunc implementation list to use the
     ISA level aware wrapper macros that allow functions below the compiled
     ISA level (with a guaranteed replacement) to be skipped.
  Tested with and without multiarch on x86_64 for ISA levels
  {generic, x86-64-v2, x86-64-v3, x86-64-v4}, and m32 with and without
  multiarch.
* x86: Add support to build strcmp/strlen/strchr with explicit ISA level (Noah Goldstein, 2022-07-16, 1 file, -0/+6)
  1. Add default ISA level selection in non-multiarch/rtld implementations.
  2. Add ISA level build guards to different implementations.
     - E.g. strcmp-avx2.S, which is ISA level 3, will only be built if the
       compiled ISA level is <= 3. Otherwise there is no reason to include
       it, as we will always use one of the ISA level 4 implementations
       (strcmp-evex.S).
  3. Refactor the ifunc selector and ifunc implementation list to use the
     ISA level aware wrapper macros that allow functions below the compiled
     ISA level (with a guaranteed replacement) to be skipped. A sketch of
     this pattern follows this entry.
  Tested with and without multiarch on x86_64 for ISA levels
  {generic, x86-64-v2, x86-64-v3, x86-64-v4}, and m32 with and without
  multiarch.
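  As a rough, self-contained illustration of the guard-plus-selector idea
  (MINIMUM_ISA_LEVEL and all function names below are simplified stand-ins,
  not glibc's exact spellings):

    #include <string.h>

    /* Assumed build-time ISA baseline, 1..4 (cf. x86-64-v2/v3/v4).  */
    #define MINIMUM_ISA_LEVEL 1

    /* Stand-ins for the real optimized implementations.  */
    static int strcmp_evex (const char *a, const char *b)
    { return strcmp (a, b); }
    #if MINIMUM_ISA_LEVEL <= 3
    /* Only built when the baseline is <= 3: at x86-64-v4 the evex version
       is always preferred, so this variant would be dead code.  */
    static int strcmp_avx2 (const char *a, const char *b)
    { return strcmp (a, b); }
    #endif
    static int strcmp_generic (const char *a, const char *b)
    { return strcmp (a, b); }

    /* The selector skips variants below the compiled baseline; each one
       it drops has a guaranteed replacement at a higher level.  */
    static int (*select_strcmp (int runtime_level)) (const char *,
                                                     const char *)
    {
      if (runtime_level >= 4)
        return strcmp_evex;
    #if MINIMUM_ISA_LEVEL <= 3
      if (runtime_level >= 3)
        return strcmp_avx2;
    #endif
      return strcmp_generic;
    }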
* x86: Add support for building str{c|p}{brk|spn} with explicit ISA level (Noah Goldstein, 2022-07-05, 1 file, -1/+8)
  The changes for these functions differ from the others because the best
  implementation (sse4_2) requires the generic implementation as a fallback
  to be built as well. The changes are:
  1. Add non-multiarch functions for str{c|p}{brk|spn}.c to statically
     select the best implementation based on the configured ISA build level.
  2. Add stubs for str{c|p}{brk|spn}-generic and varshift.c in the
     sysdeps/x86_64 directory so that the sse4 implementation has all of its
     dependencies for the non-multiarch / rtld build when ISA level >= 2.
  3. Add a new multiarch/rtld-strcspn.c that just includes the non-multiarch
     strcspn.c, which will in turn select the best implementation based on
     the compiled ISA level.
  4. Refactor the ifunc selector and ifunc implementation list to use the
     ISA level aware wrapper macros that allow functions below the compiled
     ISA level (with a guaranteed replacement) to be skipped.
  Tested with and without multiarch on x86_64 for ISA levels
  {generic, x86-64-v2, x86-64-v3, x86-64-v4}, and m32 with and without
  multiarch.
* build: Properly generate .d dependency files [BZ #28922] (H.J. Lu, 2022-02-25, 1 file, -0/+2)
  1. Also generate .d dependency files for $(tests-container) and
     $(tests-printers).
  2. elf: Add tst-auditmod17.os to extra-test-objs.
  3. iconv: Add tst-gconv-init-failure-mod.os to extra-test-objs.
  4. malloc: Rename extra-tests-objs to extra-test-objs.
  5. linux: Add tst-sysconf-iov_max-uapi.o to extra-test-objs.
  6. x86_64: Add tst-x86_64mod-1.o, tst-platformmod-2.o, test-libmvec.o,
     test-libmvec-avx.o, test-libmvec-avx2.o and test-libmvec-avx512f.o to
     extra-test-objs.
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* x86-64: Remove compiler -mavx512f check (H.J. Lu, 2021-08-24, 1 file, -2/+0)
  The minimum GCC requirement is GCC 6.2, which supports -mavx512f, so
  remove the compiler -mavx512f check. Tested with GCC 6.4.1 on
  Linux/x86-64.
* Add a generic malloc test for MALLOC_ALIGNMENT (H.J. Lu, 2021-07-09, 1 file, -4/+0)
  1. Add sysdeps/generic/malloc-size.h to define size related macros for
     malloc.
  2. Move x86_64/tst-mallocalign1.c to malloc and replace ALIGN_MASK with
     MALLOC_ALIGN_MASK.
  3. Add tst-mallocalign1 to tests-exclude-mcheck for i386 and x32, since
     mcheck doesn't honor MALLOC_ALIGNMENT. A sketch of the alignment check
     itself follows this entry.
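  A hypothetical re-creation of the kind of check such a test performs
  (the real test derives MALLOC_ALIGN_MASK from glibc-internal headers;
  the value below is an assumption for illustration):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* glibc's MALLOC_ALIGNMENT is at least 2 * sizeof (size_t).  */
    #define MALLOC_ALIGNMENT  (2 * sizeof (size_t))
    #define MALLOC_ALIGN_MASK (MALLOC_ALIGNMENT - 1)

    int
    main (void)
    {
      for (size_t sz = 1; sz <= 256; sz *= 2)
        {
          void *p = malloc (sz);
          if (p == NULL)
            return 1;
          /* Every allocation must start on a MALLOC_ALIGNMENT boundary.  */
          if (((uintptr_t) p & MALLOC_ALIGN_MASK) != 0)
            {
              printf ("misaligned %zu-byte allocation: %p\n", sz, p);
              return 1;
            }
          free (p);
        }
      return 0;
    }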
* x86-64: Test strlen and wcslen with 0 in the RSI register [BZ #28064] (H.J. Lu, 2021-07-08, 1 file, -0/+7)
  commit 6f573a27b6c8b4236445810a44660612323f5a73
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Wed Jun 23 01:19:34 2021 -0400

      x86-64: Add wcslen optimize for sse4.1

  added wcsnlen-sse4.1 to the wcslen ifunc implementation list. Since the
  random value in the RSI register was larger than the wide-character string
  length in the existing wcslen test, it didn't trigger the wcslen test
  failure. Add a test to force 0 into the RSI register before calling wcslen.
* x86_64: Correct THREAD_SETMEM/THREAD_SETMEM_NC for movq [BZ #27591] (H.J. Lu, 2021-04-01, 1 file, -0/+2)
  config/i386/constraints.md in GCC has

    (define_constraint "e"
      "32-bit signed integer constant, or a symbolic reference known
       to fit that range (for immediate operands in sign-extending
       x86-64 instructions)."
      (match_operand 0 "x86_64_immediate_operand"))

  Since movq takes a signed 32-bit immediate or a register source operand,
  use the "er" constraint, instead of "nr"/"ir", for a 32-bit signed integer
  constant or register on movq.
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
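  A hedged sketch of why "er" matters, outside glibc's TLS macros (store_u64
  is a made-up name): movq sign-extends a 32-bit immediate, so the constant
  alternative must be limited to 32-bit signed values ("e"), with a register
  ("r") as the fallback; "nr"/"ir" would also accept 64-bit constants that
  movq cannot encode.

    #include <stdint.h>

    static inline void
    store_u64 (uint64_t *mem, uint64_t value)
    {
      /* With "er", a compile-time constant that fits a signed 32-bit
         immediate becomes "movq $imm32, mem"; anything else goes through
         a register: "movq %reg, mem".  */
      __asm__ volatile ("movq %1, %0" : "=m" (*mem) : "er" (value));
    }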
* x86_64: Add glibc-hwcaps support (Florian Weimer, 2020-12-04, 1 file, -0/+39)
  The subdirectories match those in the x86-64 psABI:
  https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9a6b9396884b67c7c
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86: Support usable check for all CPU features (H.J. Lu, 2020-07-13, 1 file, -3/+3)
  Support the usable check for all CPU features with the following changes:
  1. Change struct cpu_features to

       struct cpuid_features
       {
         struct cpuid_registers cpuid;
         struct cpuid_registers usable;
       };

       struct cpu_features
       {
         struct cpu_features_basic basic;
         struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
         unsigned int preferred[PREFERRED_FEATURE_INDEX_MAX];
         ...
       };

     so that there is a usable bit for each cpuid bit.
  2. After the cpuid bits have been initialized, copy the known bits to the
     usable bits. EAX/EBX from INDEX_1 and EAX from INDEX_7 aren't used for
     CPU feature detection.
  3. Clear the usable bits which require OS support.
  4. If the feature is supported by the OS, copy its cpuid bit to its usable
     bit.
  5. Replace HAS_CPU_FEATURE and CPU_FEATURES_CPU_P with CPU_FEATURE_USABLE
     and CPU_FEATURE_USABLE_P to check if a feature is usable.
  6. Add DEPR_FPU_CS_DS for INDEX_7_EBX_13.
  7. Unset the MPX feature since it has been deprecated.
  The results are:
  1. If the feature is known and doesn't require OS support, its usable bit
     is copied from the cpuid bit.
  2. Otherwise, its usable bit is copied from the cpuid bit only if the
     feature is known to be supported by the OS.
  3. CPU_FEATURE_USABLE/CPU_FEATURE_USABLE_P are used to check if the
     feature can be used.
  4. HAS_CPU_FEATURE/CPU_FEATURE_CPU_P are used to check if the CPU supports
     the feature. A sketch of the distinction follows this entry.
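  A standalone analogue of the cpuid-bit versus usable-bit split (glibc's
  real macros are the internal HAS_CPU_FEATURE / CPU_FEATURE_USABLE pair
  named above; this struct and function are illustrative assumptions):

    #include <stdbool.h>

    struct feature_bits
    {
      bool cpuid;   /* CPU advertises the feature via CPUID.  */
      bool usable;  /* ...and the OS also enables its state (e.g. the
                       kernel has set the XSAVE/XCR0 bits for AVX).  */
    };

    static bool
    feature_usable (const struct feature_bits *f)
    {
      /* Selectors must test the usable bit, not the raw cpuid bit: a CPU
         can report AVX in CPUID while the kernel has not enabled the
         XSAVE state needed to actually use it.  */
      return f->usable;
    }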
* x86: Remove the unused __x86_prefetchw (H.J. Lu, 2020-07-11, 1 file, -1/+1)
  Since

    commit c867597bff2562180a18da4b8dba89d24e8b65c4
    Author: H.J. Lu <hjl.tools@gmail.com>
    Date:   Wed Jun 8 13:57:50 2016 -0700

        X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove

  removed the only usage of __x86_prefetchw, we can remove the unused
  __x86_prefetchw.
* Rename the glibc.tune namespace to glibc.cpu (Siddhesh Poyarekar, 2018-08-02, 1 file, -2/+2)
  The glibc.tune namespace is vaguely named, since everything in it is a
  'tunable', so give it a more specific name that describes what it refers
  to. Rename the tunable namespace to 'cpu' to more accurately reflect what
  it encompasses. Also rename glibc.tune.cpu to glibc.cpu.name, since
  glibc.cpu.cpu is weird.
  * NEWS: Mention the change.
  * elf/dl-tunables.list: Rename tune namespace to cpu.
  * sysdeps/powerpc/dl-tunables.list: Likewise.
  * sysdeps/x86/dl-tunables.list: Likewise.
  * sysdeps/aarch64/dl-tunables.list: Rename tune.cpu to cpu.name.
  * elf/dl-hwcaps.c (_dl_important_hwcaps): Adjust.
  * elf/dl-hwcaps.h (GET_HWCAP_MASK): Likewise.
  * manual/README.tunables: Likewise.
  * manual/tunables.texi: Likewise.
  * sysdeps/powerpc/cpu-features.c: Likewise.
  * sysdeps/unix/sysv/linux/aarch64/cpu-features.c (init_cpu_features):
    Likewise.
  * sysdeps/x86/cpu-features.c: Likewise.
  * sysdeps/x86/cpu-features.h: Likewise.
  * sysdeps/x86/cpu-tunables.c: Likewise.
  * sysdeps/x86_64/Makefile: Likewise.
  * sysdeps/x86/dl-cet.c: Likewise.
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* x86-64: Use fxsave/xsave/xsavec in _dl_runtime_resolve [BZ #21265] (H.J. Lu, 2017-10-20, 1 file, -0/+4)
  In _dl_runtime_resolve, use fxsave/xsave/xsavec to preserve all vector,
  mask and bound registers. This simplifies _dl_runtime_resolve and supports
  different calling conventions. ld.so code size is reduced by more than
  1 KB. However, using fxsave/xsave/xsavec takes slightly more cycles than
  saving and restoring vector and bound registers individually.

  Latency for _dl_runtime_resolve to look up the function, foo, from one
  shared library plus libc.so:

                                 Before    After    Change
    Westmere (SSE)/fxsave          345      866      151%
    IvyBridge (AVX)/xsave          420      643       53%
    Haswell (AVX)/xsave            713     1252       75%
    Skylake (AVX+MPX)/xsavec       559      719       28%
    Skylake (AVX512+MPX)/xsavec    145      272       87%
    Ryzen (AVX)/xsavec             280      553       97%

  This is the worst case, where the portion of time spent saving and
  restoring registers is larger than in the majority of cases. With the
  smaller _dl_runtime_resolve code size, the overall performance impact is
  negligible. On IvyBridge, differences in build and test time of binutils
  with lazy-binding GCC and binutils are noise. On Westmere, differences in
  bootstrap and "make check" time of GCC 7 with lazy-binding GCC and
  binutils are also noise. (A sketch of where the xsave state sizes come
  from follows this entry.)

  [BZ #21265]
  * sysdeps/x86/cpu-features-offsets.sym (XSAVE_STATE_SIZE_OFFSET): New.
  * sysdeps/x86/cpu-features.c: Include <libc-pointer-arith.h>.
    (get_common_indeces): Set xsave_state_size, xsave_state_full_size and
    bit_arch_XSAVEC_Usable if needed.
    (init_cpu_features): Remove bit_arch_Use_dl_runtime_resolve_slow and
    bit_arch_Use_dl_runtime_resolve_opt.
  * sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
    Removed.
    (bit_arch_Use_dl_runtime_resolve_slow): Likewise.
    (bit_arch_Prefer_No_AVX512): Updated.
    (bit_arch_MathVec_Prefer_No_AVX512): Likewise.
    (bit_arch_XSAVEC_Usable): New.
    (STATE_SAVE_OFFSET): Likewise.
    (STATE_SAVE_MASK): Likewise.
    [__ASSEMBLER__]: Include <cpu-features-offsets.h>.
    (cpu_features): Add xsave_state_size and xsave_state_full_size.
    (index_arch_Use_dl_runtime_resolve_opt): Removed.
    (index_arch_Use_dl_runtime_resolve_slow): Likewise.
    (index_arch_XSAVEC_Usable): New.
  * sysdeps/x86/cpu-tunables.c (TUNABLE_CALLBACK (set_hwcaps)): Support
    XSAVEC_Usable. Remove Use_dl_runtime_resolve_slow.
  * sysdeps/x86_64/Makefile (tst-x86_64-1-ENV): New if tunables is
    enabled.
  * sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Replace
    _dl_runtime_resolve_sse, _dl_runtime_resolve_avx,
    _dl_runtime_resolve_avx_slow, _dl_runtime_resolve_avx_opt,
    _dl_runtime_resolve_avx512 and _dl_runtime_resolve_avx512_opt with
    _dl_runtime_resolve_fxsave, _dl_runtime_resolve_xsave and
    _dl_runtime_resolve_xsavec.
  * sysdeps/x86_64/dl-trampoline.S (DL_RUNTIME_UNALIGNED_VEC_SIZE):
    Removed.
    (DL_RUNTIME_RESOLVE_REALIGN_STACK): Check STATE_SAVE_ALIGNMENT instead
    of VEC_SIZE.
    (REGISTER_SAVE_BND0): Removed.
    (REGISTER_SAVE_BND1): Likewise.
    (REGISTER_SAVE_BND3): Likewise.
    (REGISTER_SAVE_RAX): Always defined to 0.
    (VMOV): Removed.
    (_dl_runtime_resolve_avx): Likewise.
    (_dl_runtime_resolve_avx_slow): Likewise.
    (_dl_runtime_resolve_avx_opt): Likewise.
    (_dl_runtime_resolve_avx512): Likewise.
    (_dl_runtime_resolve_avx512_opt): Likewise.
    (_dl_runtime_resolve_sse): Likewise.
    (_dl_runtime_resolve_sse_vex): Likewise.
    (USE_FXSAVE): New.
    (_dl_runtime_resolve_fxsave): Likewise.
    (USE_XSAVE): Likewise.
    (_dl_runtime_resolve_xsave): Likewise.
    (USE_XSAVEC): Likewise.
    (_dl_runtime_resolve_xsavec): Likewise.
  * sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx512): Removed.
    (_dl_runtime_resolve_avx512_opt): Likewise.
    (_dl_runtime_resolve_avx): Likewise.
    (_dl_runtime_resolve_avx_opt): Likewise.
    (_dl_runtime_resolve_sse): Likewise.
    (_dl_runtime_resolve_sse_vex): Likewise.
    (_dl_runtime_resolve_fxsave): New.
    (_dl_runtime_resolve_xsave): Likewise.
    (_dl_runtime_resolve_xsavec): Likewise.
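  The xsave_state_size/xsave_state_full_size values set above come from
  CPUID leaf 0xD. A minimal sketch of how a program could query them
  (illustrative, not glibc's code):

    #include <cpuid.h>
    #include <stdio.h>

    int
    main (void)
    {
      unsigned int eax, ebx, ecx, edx;

      /* Leaf 0xD, sub-leaf 0: EBX is the XSAVE area size for the state
         components currently enabled in XCR0; ECX is the maximum size
         for everything the CPU supports.  */
      if (!__get_cpuid_count (0xd, 0, &eax, &ebx, &ecx, &edx))
        return 1;
      printf ("xsave area (enabled features): %u bytes\n", ebx);

      /* Sub-leaf 1: EAX bit 1 reports XSAVEC, the compacted-format save
         that _dl_runtime_resolve_xsavec relies on.  */
      if (__get_cpuid_count (0xd, 1, &eax, &ebx, &ecx, &edx))
        printf ("xsavec supported: %s\n",
                (eax & (1u << 1)) ? "yes" : "no");
      return 0;
    }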
* x86-64: Don't set GLRO(dl_platform) to NULL [BZ #22299] (H.J. Lu, 2017-10-19, 1 file, -0/+20)
  Since ld.so expands $PLATFORM with GLRO(dl_platform), don't set
  GLRO(dl_platform) to NULL.
  [BZ #22299]
  * sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
    GLRO(dl_platform) to NULL.
  * sysdeps/x86_64/Makefile (tests): Add tst-platform-1.
    (modules-names): Add tst-platformmod-1 and x86_64/tst-platformmod-2.
    (CFLAGS-tst-platform-1.c): New.
    (CFLAGS-tst-platformmod-1.c): Likewise.
    (CFLAGS-tst-platformmod-2.c): Likewise.
    (LDFLAGS-tst-platformmod-2.so): Likewise.
    ($(objpfx)tst-platform-1): Likewise.
    ($(objpfx)tst-platform-1.out): Likewise.
    (tst-platform-1-ENV): Likewise.
    ($(objpfx)x86_64/tst-platformmod-2.os): Likewise.
  * sysdeps/x86_64/tst-platform-1.c: New file.
  * sysdeps/x86_64/tst-platformmod-1.c: Likewise.
  * sysdeps/x86_64/tst-platformmod-2.c: Likewise.
* x86: Add x86_64 to x86-64 HWCAP [BZ #22093] (H.J. Lu, 2017-09-11, 1 file, -0/+17)
  Before glibc 2.26, ld.so set dl_platform to "x86_64" and searched the
  "x86_64" subdirectory when loading a shared library. ld.so in glibc 2.26
  was changed to set dl_platform to "haswell" or "xeon_phi", based on
  supported ISAs. This led to shared library loading failure for shared
  libraries placed under the "x86_64" subdirectory. This patch adds
  "x86_64" to the x86-64 dl_hwcap so that ld.so will always search the
  "x86_64" subdirectory when loading a shared library.
  NB: We can't set x86-64 dl_platform to "x86-64", since ld.so would then
  skip the "haswell" and "xeon_phi" subdirectories on "haswell" and
  "xeon_phi" machines.
  Tested on i686 and x86-64.
  [BZ #22093]
  * sysdeps/x86/cpu-features.c (init_cpu_features): Initialize
    GLRO(dl_hwcap) to HWCAP_X86_64 for x86-64.
  * sysdeps/x86/dl-hwcap.h (HWCAP_COUNT): Updated.
    (HWCAP_IMPORTANT): Likewise.
    (HWCAP_X86_64): New enum.
    (HWCAP_X86_AVX512_1): Updated.
  * sysdeps/x86/dl-procinfo.c (_dl_x86_hwcap_flags): Add "x86_64".
  * sysdeps/x86_64/Makefile (tests): Add tst-x86_64-1.
    (modules-names): Add x86_64/tst-x86_64mod-1.
    (LDFLAGS-tst-x86_64mod-1.so): New.
    ($(objpfx)tst-x86_64-1): Likewise.
    ($(objpfx)x86_64/tst-x86_64mod-1.os): Likewise.
    (tst-x86_64-1-clean): Likewise.
  * sysdeps/x86_64/tst-x86_64-1.c: New file.
  * sysdeps/x86_64/tst-x86_64mod-1.c: Likewise.
* Check linker support for INSERT in linker script (H.J. Lu, 2017-08-04, 1 file, -0/+2)
  Since gold doesn't support INSERT in linker scripts:
    https://sourceware.org/bugzilla/show_bug.cgi?id=21676
  tst-split-dynreloc fails to link with gold. Check if the linker supports
  INSERT in a linker script before using it.
  * config.make.in (have-insert): New.
  * configure.ac (libc_cv_insert): New. Set to yes if the linker supports
    INSERT in a linker script.
    (AC_SUBST(libc_cv_insert)): New.
  * configure: Regenerated.
  * sysdeps/x86_64/Makefile (tests): Add tst-split-dynreloc only if
    $(have-insert) == yes.
* x86-64: Align the stack in __tls_get_addr [BZ #21609] (H.J. Lu, 2017-07-06, 1 file, -2/+2)
  This change forces realignment of the stack pointer in __tls_get_addr, so
  that binaries compiled by GCCs older than GCC 4.9:
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
  continue to work even if vector instructions are used in glibc which
  require the ABI stack realignment. __tls_get_addr_slow is added to handle
  the slow paths in the default implementation of __tls_get_addr in
  elf/dl-tls.c. The new __tls_get_addr calls __tls_get_addr_slow after
  realigning the stack. Internal calls within ld.so go directly to the
  default implementation of __tls_get_addr because they do not need stack
  realignment. (A C-level sketch of stack realignment follows this entry.)
  [BZ #21609]
  * sysdeps/x86_64/Makefile (sysdep-dl-routines): Add tls_get_addr.
    (gen-as-const-headers): Add rtld-offsets.sym.
  * sysdeps/x86_64/dl-tls.c: New file.
  * sysdeps/x86_64/rtld-offsets.sym: Likewise.
  * sysdeps/x86_64/tls_get_addr.S: Likewise.
  * sysdeps/x86_64/dl-tls.h: Add multiple inclusion guards.
  * sysdeps/x86_64/tlsdesc.sym (TI_MODULE_OFFSET): New.
    (TI_OFFSET_OFFSET): Likewise.
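  At the C level, GCC can be asked for the same effect with an attribute.
  glibc's actual fix is in assembly (tls_get_addr.S); this is a hedged
  sketch of the underlying idea, with a made-up function name:

    /* A function that may be entered with only 8-byte stack alignment
       (from code built by pre-4.9 GCC) but whose body may use SSE/AVX
       spills can ask GCC to realign the stack frame on entry.  */
    __attribute__ ((force_align_arg_pointer))
    void *
    my_tls_lookup (void *ti)
    {
      /* Vector code here may safely assume ABI stack alignment.  */
      return ti;
    }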
* x86-64: Verify that _dl_runtime_resolve preserves vector registers (H.J. Lu, 2017-02-09, 1 file, -4/+16)
  On x86-64, _dl_runtime_resolve must preserve the first 8 vector registers.
  Add 3 _dl_runtime_resolve tests to verify that SSE, AVX and AVX512
  registers are preserved.
  * sysdeps/x86_64/Makefile (tests): Add tst-sse, tst-avx and tst-avx512.
    (test-extras): Add tst-avx-aux and tst-avx512-aux.
    (extra-test-objs): Add tst-avx-aux.o and tst-avx512-aux.o.
    (modules-names): Add tst-ssemod, tst-avxmod and tst-avx512mod.
    ($(objpfx)tst-sse): New rule.
    ($(objpfx)tst-avx): Likewise.
    ($(objpfx)tst-avx512): Likewise.
    (CFLAGS-tst-avx-aux.c): New.
    (CFLAGS-tst-avxmod.c): Likewise.
    (CFLAGS-tst-avx512-aux.c): Likewise.
    (CFLAGS-tst-avx512mod.c): Likewise.
  * sysdeps/x86_64/tst-avx-aux.c: New file.
  * sysdeps/x86_64/tst-avx.c: Likewise.
  * sysdeps/x86_64/tst-avx512-aux.c: Likewise.
  * sysdeps/x86_64/tst-avx512.c: Likewise.
  * sysdeps/x86_64/tst-avx512mod.c: Likewise.
  * sysdeps/x86_64/tst-avxmod.c: Likewise.
  * sysdeps/x86_64/tst-sse.c: Likewise.
  * sysdeps/x86_64/tst-ssemod.c: Likewise.
* x86_64: tst-quad1pie, tst-quad2pie: compile with -fPIE [BZ #7065] (Nick Alcock, 2016-12-21, 1 file, -0/+3)
  With stack protection enabled, these files have external symbol references
  for the first time, so the fact that they are not compiled with -fPIE and
  are then linked into a -pie binary starts to hurt.
* Register extra test objects (Andreas Schwab, 2016-04-13, 1 file, -0/+4)
  This makes sure that the extra test objects are compiled with the correct
  MODULE_NAME and that their dependencies are tracked.
* tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269] (Florian Weimer, 2016-03-07, 1 file, -4/+4)
  This ensures that GCC will not use unsupported instructions before the
  run-time check to ensure support.
* Add a comment in sysdeps/x86_64/Makefile (H.J. Lu, 2016-03-04, 1 file, -0/+3)
  Mention recursive calls when ENTRY is used in _mcount.S.
  * sysdeps/x86_64/Makefile (sysdep_noprof): Add a comment.
* Copy x86_64 _mcount.op from _mcount.o (H.J. Lu, 2016-03-03, 1 file, -0/+1)
  There is no need to compile x86_64 _mcount.S with -pg; we can just copy
  the normal static object.
  * gmon/Makefile (noprof): Add $(sysdep_noprof).
  * sysdeps/x86_64/Makefile (sysdep_noprof): Add _mcount.
* Remove configure tests for -mno-vzeroupper support. (Joseph Myers, 2015-10-09, 1 file, -4/+1)
  GCC added support for -mno-vzeroupper in version 4.6. Thus the configure
  tests for this support are obsolete, and this patch removes them. Tested
  for x86_64 and x86 (testsuite, and that installed stripped shared
  libraries are unchanged by this patch).
  * sysdeps/i386/configure.ac (libc_cv_cc_novzeroupper): Remove configure
    test.
  * sysdeps/i386/configure: Regenerated.
  * sysdeps/x86_64/configure.ac (libc_cv_cc_novzeroupper): Remove
    configure test.
  * sysdeps/x86_64/configure: Regenerated.
  * sysdeps/x86_64/Makefile [$(config-cflags-novzeroupper) = yes]: Make
    code unconditional.
* Remove configure tests for AVX support. (Joseph Myers, 2015-10-08, 1 file, -6/+1)
  GCC added support for -mavx and -msse2avx in version 4.4. Thus the
  configure tests for this support are obsolete, and this patch removes
  them. Tested for x86_64 and x86 (testsuite, and that installed stripped
  shared libraries are unchanged by this patch).
  * sysdeps/i386/configure.ac (libc_cv_cc_avx): Remove configure test.
    (libc_cv_cc_sse2avx): Likewise.
  * sysdeps/i386/configure: Regenerated.
  * sysdeps/i386/i686/multiarch/Makefile
    [$(subdir)$(config-cflags-avx) = mathyes]: Change conditional to
    [$(subdir) = math].
  * sysdeps/i386/i686/multiarch/s_fma-fma.c [HAVE_AVX_SUPPORT]: Make code
    unconditional.
  * sysdeps/i386/i686/multiarch/s_fma.c [HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/i386/i686/multiarch/s_fmaf-fma.c [HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/i386/i686/multiarch/s_fmaf.c [HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/configure.ac (libc_cv_cc_avx): Remove configure test.
    (libc_cv_cc_sse2avx): Likewise.
  * sysdeps/x86_64/configure: Regenerated.
  * sysdeps/x86_64/Makefile [$(config-cflags-avx) = yes]: Make code
    unconditional.
  * sysdeps/x86_64/dl-trampoline.h (_dl_runtime_profile)
    [HAVE_AVX_SUPPORT || HAVE_AVX512_ASM_SUPPORT]: Make code unconditional.
    (_dl_runtime_profile) [!(HAVE_AVX_SUPPORT ||
    HAVE_AVX512_ASM_SUPPORT)]: Remove conditional code.
  * sysdeps/x86_64/fpu/multiarch/Makefile
    [$(config-cflags-sse2avx) = yes]: Make code unconditional.
  * sysdeps/x86_64/fpu/multiarch/e_atan2.c
    [HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/fpu/multiarch/e_exp.c
    [HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/fpu/multiarch/e_log.c
    [HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/fpu/multiarch/s_atan.c
    [HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/fpu/multiarch/s_fma.c [HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/fpu/multiarch/s_fmaf.c [HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/fpu/multiarch/s_sin.c
    [HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/fpu/multiarch/s_tan.c
    [HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
  * sysdeps/x86_64/multiarch/strcmp.S [HAVE_AVX_SUPPORT]: Likewise.
  * config.h.in (HAVE_AVX_SUPPORT): Remove #undef.
    (HAVE_SSE2AVX_SUPPORT): Likewise.
* Don't disable SSE in x86-64 ld.so (H.J. Lu, 2015-08-26, 1 file, -0/+4)
  Since x86-64 ld.so preserves vector registers now, we can use SSE in
  x86-64 ld.so. We should run tst-ld-sse-use.sh only on i386.
  * sysdeps/x86/Makefile [$(subdir) == elf] (CFLAGS-.os, tests-special,
    $(objpfx)tst-ld-sse-use.out): Moved to ...
  * sysdeps/i386/Makefile [$(subdir) == elf] (CFLAGS-.os, tests-special,
    $(objpfx)tst-ld-sse-use.out): Here. Update comments.
  * sysdeps/x86_64/Makefile [$(subdir) == elf] (CFLAGS-.os): Add -mno-mmx
    for $(all-rtld-routines).
  * sysdeps/x86/tst-ld-sse-use.sh: Moved to ...
  * sysdeps/i386/tst-ld-sse-use.sh: Here. Replace x86-64 with i386.
* Save and restore vector registers in x86-64 ld.so (H.J. Lu, 2015-08-25, 1 file, -0/+5)
  This patch adds SSE, AVX and AVX512 versions of _dl_runtime_resolve and
  _dl_runtime_profile, which save and restore the first 8 vector registers
  used for parameter passing. elf_machine_runtime_setup selects the proper
  _dl_runtime_resolve or _dl_runtime_profile based on _dl_x86_cpu_features.
  It avoids a race condition caused by FOREIGN_CALL macros, which are only
  used for x86-64. The performance impact of saving and restoring 8 vector
  registers is negligible on Nehalem, Sandy Bridge, Ivy Bridge and Haswell
  when ld.so is optimized with SSE2. (For readers unfamiliar with the ifunc
  mechanism exercised by the new ifuncmain8 test, a sketch follows this
  entry.)
  [BZ #15128]
  * sysdeps/x86_64/Makefile [$(subdir) == elf] (tests): Add ifuncmain8.
    (modules-names): Add ifuncmod8.
    ($(objpfx)ifuncmain8): New rule.
  * sysdeps/x86_64/dl-machine.h: Include <dl-procinfo.h> and <cpuid.h>.
    (elf_machine_runtime_setup): Use _dl_runtime_resolve_sse,
    _dl_runtime_resolve_avx, or _dl_runtime_resolve_avx512,
    _dl_runtime_profile_sse, _dl_runtime_profile_avx, or
    _dl_runtime_profile_avx512, based on HAS_ARCH_FEATURE.
  * sysdeps/x86_64/dl-trampoline.S: Rewrite.
  * sysdeps/x86_64/dl-trampoline.h: Likewise.
  * sysdeps/x86_64/ifuncmain8.c: New file.
  * sysdeps/x86_64/ifuncmod8.c: Likewise.
  * sysdeps/x86_64/nptl/tcb-offsets.sym (RTLD_SAVESPACE_SSE): Removed.
  * sysdeps/x86_64/nptl/tls.h (__128bits): Removed.
    (tcbhead_t): Change rtld_must_xmm_save to __glibc_unused1. Change
    rtld_savespace_sse to __glibc_unused2.
    (RTLD_CHECK_FOREIGN_CALL): Removed.
    (RTLD_ENABLE_FOREIGN_CALL): Likewise.
    (RTLD_PREPARE_FOREIGN_CALL): Likewise.
    (RTLD_FINALIZE_FOREIGN_CALL): Likewise.
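  A small, self-contained ifunc sketch (not the actual test, which lives in
  sysdeps/x86_64/ifuncmain8.c; all names below are made up). The resolver
  runs at relocation time, via the very _dl_runtime_resolve trampoline this
  patch rewrites, and its choice is then patched into the GOT:

    #include <stdio.h>

    static int
    impl_generic (void)
    {
      return 1;
    }

    static int
    impl_avx2 (void)
    {
      return 2;
    }

    /* The resolver must not call into not-yet-relocated code; the cpu
       builtins are safe for that.  */
    static int (*resolve_magic (void)) (void)
    {
      __builtin_cpu_init ();
      return __builtin_cpu_supports ("avx2") ? impl_avx2 : impl_generic;
    }

    int magic (void) __attribute__ ((ifunc ("resolve_magic")));

    int
    main (void)
    {
      printf ("selected implementation: %d\n", magic ());
      return 0;
    }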
* Fix dynamic linker issue with bind-now (Petar Jovanovic, 2015-08-19, 1 file, -0/+5)
  Fix the bind-now case when DT_REL and DT_JMPREL sections are separate and
  there is a gap between them.
  [BZ #14341]
  * elf/dynamic-link.h (elf_machine_lazy_rel): Properly handle the case
    when there is a gap between DT_REL and DT_JMPREL sections.
  * sysdeps/x86_64/Makefile (tests): Add tst-split-dynreloc.
    (LDFLAGS-tst-split-dynreloc): New.
    (tst-split-dynreloc-ENV): Likewise.
  * sysdeps/x86_64/tst-split-dynreloc.c: New file.
  * sysdeps/x86_64/tst-split-dynreloc.lds: Likewise.
* Clean up sysdep-dl-routines variable. (Roland McGrath, 2015-02-06, 1 file, -2/+0)
* Remove HP_TIMING_DIFF_INIT and dl_hp_timing_overhead (Richard Henderson, 2014-07-03, 1 file, -2/+0)
  Without HP_TIMING_ACCUM, dl_hp_timing_overhead is write-only. If we
  remove it, there's no point in HP_TIMING_DIFF_INIT either.
* Save and restore AVX-512 zmm registers in x86-64 ld.so (Igor Zamyatin, 2014-03-13, 1 file, -2/+13)
  The AVX-512 ISA adds 512-bit zmm registers. This patch updates
  _dl_runtime_profile to pass zmm registers to the run-time audit. It also
  changes _dl_x86_64_save_sse and _dl_x86_64_restore_sse, which are called
  only when RTLD_PREPARE_FOREIGN_CALL is used, to support zmm registers.
  Its performance impact is minimal.
  * config.h.in (HAVE_AVX512_SUPPORT): New #undef.
    (HAVE_AVX512_ASM_SUPPORT): Likewise.
  * sysdeps/x86_64/bits/link.h (La_x86_64_zmm): New.
    (La_x86_64_vector): Add zmm.
  * sysdeps/x86_64/Makefile (tests): Add tst-audit10.
    (modules-names): Add tst-auditmod10a and tst-auditmod10b.
    ($(objpfx)tst-audit10): New target.
    ($(objpfx)tst-audit10.out): Likewise.
    (tst-audit10-ENV): New.
    (AVX512-CFLAGS): Likewise.
    (CFLAGS-tst-audit10.c): Likewise.
    (CFLAGS-tst-auditmod10a.c): Likewise.
    (CFLAGS-tst-auditmod10b.c): Likewise.
  * sysdeps/x86_64/configure.ac: Set config-cflags-avx512,
    HAVE_AVX512_SUPPORT and HAVE_AVX512_ASM_SUPPORT.
  * sysdeps/x86_64/configure: Regenerated.
  * sysdeps/x86_64/dl-trampoline.S (_dl_runtime_profile): Add AVX-512 zmm
    register support.
    (_dl_x86_64_save_sse): Likewise.
    (_dl_x86_64_restore_sse): Likewise.
  * sysdeps/x86_64/dl-trampoline.h: Updated to support different size
    vector registers.
  * sysdeps/x86_64/link-defines.sym (YMM_SIZE): New.
    (ZMM_SIZE): Likewise.
  * sysdeps/x86_64/tst-audit10.c: New file.
  * sysdeps/x86_64/tst-auditmod10a.c: Likewise.
  * sysdeps/x86_64/tst-auditmod10b.c: Likewise.
* Move x86_64-specific audit tests to sysdeps/x86_64/. (Joseph Myers, 2013-04-25, 1 file, -0/+44)
* Also run tst-xmmymm.sh on i386 ld.so (H.J. Lu, 2012-11-07, 1 file, -5/+0)
* Set "fail on error" mode directly in testsuite shell scriptsDmitry V. Levin2012-09-251-1/+1
|
* Add tst-mallocalign1 (H.J. Lu, 2012-05-17, 1 file, -0/+4)
* Handle R_X86_64_RELATIVE64 and R_X86_64_64 for x32 (H.J. Lu, 2012-05-10, 1 file, -0/+13)
* Add optimized strncasecmp versions for x86-64. (Ulrich Drepper, 2010-08-14, 1 file, -1/+1)
* Implement optimized strcasecmp for x86-64. (Ulrich Drepper, 2010-07-30, 1 file, -1/+2)
* Refine testing for xmm/ymm register use in x86-64 ld.so. (Ulrich Drepper, 2009-07-27, 1 file, -0/+1)
  The test now takes the call graph into account: only code called during
  runtime relocation is affected by the limitation. We now determine the
  affected object files as closely as possible from the outside. This
  allowed removing the specializations for some of the string functions,
  as they are only used in other code paths.
* Make sure no code in ld.so uses xmm/ymm registers on x86-64. (Ulrich Drepper, 2009-07-26, 1 file, -0/+4)
  This patch introduces a test to make sure no function modifies the
  xmm/ymm registers, with the exception of the auditing functions. The test
  is probably too pessimistic: all code linked into ld.so is checked.
  Perhaps at some point the call graph starting from _dl_fixup and
  _dl_profile_fixup can be checked instead, and we can start using faster
  SSE-using functions in parts of ld.so.
* Add AVX support to ld.so auditing for x86-64. (H.J. Lu, 2009-07-10, 1 file, -0/+1)
* Introduce TLS descriptors for i386 and x86_64. (Ulrich Drepper, 2008-05-13, 1 file, -0/+10)
  * include/inline-hashtab.h: New file, copied from 2005's libiberty, with
    a fix for a memory leak imported afterwards by Glauber de Oliveira
    Costa.
  * elf/tlsdeschtab.h: New file.
  * elf/dl-reloc.c (_dl_try_allocate_static_tls): Extract from...
    (_dl_allocate_static_tls): ... here. Rearrange failure path.
    (CHECK_STATIC_TLS): Move to...
  * elf/dynamic-link.h: ... this file.
    (TRY_STATIC_TLS): New macro.
  * elf/dl-conflict.c (CHECK_STATIC_TLS, TRY_STATIC_TLS): Override.
  * elf/elf.h (R_386_TLS_GOTDESC, R_386_TLS_DESC_CALL, R_386_TLS_DESC):
    Define.
    (R_X86_64_PC64, R_X86_GOTOFF64, R_X86_64_GOTPC32): Merge from binutils.
    (R_X86_64_GOTPC32_TLSDESC, R_X86_64_TLSDESC_CALL, R_X86_64_TLSDESC):
    Define.
    (R_386_NUM, R_X86_64_NUM): Adjust.
  * sysdeps/i386/Makefile (sysdep-dl-routines, sysdep_routines,
    sysdep-rtld-routines): Add tlsdesc and dl-tlsdesc for elf subdir.
    (gen-as-const-headers): Add tlsdesc.sym to csu subdir.
  * sysdeps/i386/dl-lookupcfg.h: New file. Introduce _dl_unmap to release
    tlsdesc_table.
  * sysdeps/i386/dl-machine.h: Include dl-tlsdesc.h.
    (elf_machine_type_class): Mark R_386_TLS_DESC as PLT class.
    (elf_machine_rel): Handle R_386_TLS_DESC.
    (elf_machine_rela): Likewise.
    (elf_machine_lazy_rel): Likewise.
    (elf_machine_lazy_rela): Likewise.
  * sysdeps/i386/dl-tls.h (struct dl_tls_index): Name it.
  * sysdeps/i386/dl-tlsdesc.S: New file.
  * sysdeps/i386/dl-tlsdesc.h: New file.
  * sysdeps/i386/tlsdesc.c: New file.
  * sysdeps/i386/tlsdesc.sym: New file.
  * sysdeps/i386/bits/linkmap.h (struct link_map_machine): Add
    tlsdesc_table.
  * sysdeps/x86_64/Makefile (sysdep-dl-routines, sysdep_routines,
    sysdep-rtld-routines): Add tlsdesc and dl-tlsdesc for elf subdir.
    (gen-as-const-headers): Add tlsdesc.sym to csu subdir.
  * sysdeps/x86_64/dl-lookupcfg.h: New file. Introduce _dl_unmap to
    release tlsdesc_table.
  * sysdeps/x86_64/dl-machine.h: Include dl-tlsdesc.h.
    (elf_machine_runtime_setup): Set up lazy TLSDESC GOT entry.
    (elf_machine_type_class): Mark R_X86_64_TLSDESC as PLT class.
    (elf_machine_rel): Handle R_X86_64_TLSDESC.
    (elf_machine_rela): Likewise.
    (elf_machine_lazy_rel): Likewise.
  * sysdeps/x86_64/dl-tls.h (struct dl_tls_index): Name it.
    (__tls_get_addr): Do not declare for non-shared compiles.
  * sysdeps/x86_64/dl-tlsdesc.S: New file.
  * sysdeps/x86_64/dl-tlsdesc.h: New file.
  * sysdeps/x86_64/tlsdesc.c: New file.
  * sysdeps/x86_64/tlsdesc.sym: New file.
  * sysdeps/x86_64/bits/linkmap.h (struct link_map_machine): Add
    tlsdesc_table for both 32- and 64-bit structs.
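  TLS descriptors change only how the compiler and dynamic linker implement
  TLS accesses; nothing changes at the source level. A minimal sketch (the
  file and library names are made up; -mtls-dialect=gnu2 is the GCC flag
  that selects the TLSDESC relocations this commit teaches ld.so to
  process):

    /* tlsdesc-demo.c: an ordinary TLS variable.  Built with
         gcc -O2 -fPIC -shared -mtls-dialect=gnu2 tlsdesc-demo.c \
             -o libdemo.so
       the access below is emitted through TLSDESC relocations instead of
       the classic __tls_get_addr call sequence.  */
    static __thread int counter;

    int
    bump (void)
    {
      return ++counter;
    }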
* Move cache information handling to sysdeps/x86_64/cacheinfo.c (Ulrich Drepper, 2007-05-21, 1 file, -0/+4)
  * sysdeps/unix/sysv/linux/x86_64/sysconf.c: Move cache information
    handling to ...
  * sysdeps/x86_64/cacheinfo.c: ... here. New file.
  * sysdeps/x86_64/Makefile [subdir=string] (sysdep_routines): Add
    cacheinfo.
  * sysdeps/x86_64/memcpy.S: Complete rewrite.
  * sysdeps/x86_64/mempcpy.S: Adjust appropriately.
    Patch by Evandro Menezes <evandro.menezes@amd.com>.
  * sysdeps/unix/sysv/linux/i386/epoll_pwait.S: New file.