Don't pass -Wno-sign-compare on Windows.
Add a #define HAVE_WINDOWS_H if _WIN32 is defined.
Don't assume sys/uio.h is available on Windows.
PiperOrigin-RevId: 524416809
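
A sketch of the guard described above; where the define lives (e.g. a config header) is an assumption, not spelled out in the message:
```
// Let the build know <windows.h> is available when targeting Windows.
#if defined(_WIN32)
#define HAVE_WINDOWS_H 1
#endif
```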

PiperOrigin-RevId: 524135175

PiperOrigin-RevId: 524031046

PiperOrigin-RevId: 523460180

PiperOrigin-RevId: 523287305

PiperOrigin-RevId: 518358512

snappy-internal.h uses std::pair, which is defined in the <utility> header. Typically, this works because existing C++ standard library implementations provide <utility> via other transitive includes; however, these transitive includes are not guaranteed to exist, and don't exist in certain contexts (e.g. compiling against LLVM's libc++ with Clang modules).
PiperOrigin-RevId: 517213822
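
The fix this describes is a one-line explicit include (its exact placement in snappy-internal.h is assumed):
```
#include <utility>  // for std::pair; do not rely on transitive includes
```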

and three cycles.
PiperOrigin-RevId: 517141646

PiperOrigin-RevId: 515161676

buffers are never written. This allows us to defer any writes to the output buffer for an arbitrary amount of time, as long as the writes all occur in the proper order. When a MemCopy64 would normally have occurred, we save away the source address and length. Once we reach the location of the next write to the output buffer, we first perform the deferred copy. This gives the source address and length computations time to finish before the deferred copy executes.
This change gives 1.84% on CLX and 0.97% on Milan.
PiperOrigin-RevId: 504012310
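
A minimal sketch of the deferral pattern described above; the names and structure are illustrative, not snappy's actual internals:
```
#include <cstddef>
#include <cstring>

// Record a pending copy instead of performing it immediately.
struct DeferredCopy {
  char* dst = nullptr;
  const char* src = nullptr;
  size_t len = 0;
};

// Flush the pending copy right before the next write to the output buffer,
// giving the source-address and length computations time to complete.
inline void FlushDeferred(DeferredCopy* d) {
  if (d->len != 0) {
    std::memcpy(d->dst, d->src, d->len);
    d->len = 0;
  }
}
```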

PiperOrigin-RevId: 501489679

The existing code uses a series of 8-bit loads with shifts and ors to emulate an (unaligned) load of a larger type. These are then expected to become single loads in the compiler, producing optimal assembly. Whilst this is true, it happens very late in the compiler, meaning that throughout most of the pipeline it is treated (and cost-modelled) as multiple loads, shifts and ors. This can make the compiler make poor decisions (such as not unrolling loops that should be), or break up the pattern before it is turned into a single load. For example, the loops in CompressFragment do not get unrolled as expected due to a higher cost than the unroll threshold in clang.
Instead this patch uses a more conventional method of loading unaligned data: a direct memcpy, which the compiler can deal with much more straightforwardly, modelling it as a single unaligned load. The old code is left as-is for big-endian systems.
This helps improve the performance of the BM_ZFlat benchmarks by up to 10-15% on an Arm Neoverse N1.
Change-Id: I986f845ebd0a0806d052d2be3e4dbcbee91713d7
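
The memcpy idiom in question, as a sketch (little-endian; the function name is illustrative):
```
#include <cstdint>
#include <cstring>

// The compiler recognizes this memcpy as a single unaligned load and models
// it as such throughout the optimization pipeline.
inline uint32_t UnalignedLoad32(const void* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```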

Calls to memcpy seem to be quite expensive. The table below compares time per op before and after this change (columns: benchmark [input and compression ratio], old, new, delta, significance):
```
BM_ZFlat/0 [html (22.24 %) ] 114µs ± 6% 110µs ± 6% -3.97% (p=0.000 n=118+115)
BM_ZFlat/1 [urls (47.84 %) ] 1.63ms ± 5% 1.58ms ± 5% -3.39% (p=0.000 n=117+115)
BM_ZFlat/2 [jpg (99.95 %) ] 7.84µs ± 6% 7.70µs ± 6% -1.66% (p=0.000 n=119+117)
BM_ZFlat/3 [jpg_200 (73.00 %)] 265ns ± 6% 255ns ± 6% -3.48% (p=0.000 n=101+98)
BM_ZFlat/4 [pdf (83.31 %) ] 11.8µs ± 6% 11.6µs ± 6% -2.14% (p=0.000 n=118+116)
BM_ZFlat/5 [html4 (22.52 %) ] 525µs ± 6% 513µs ± 6% -2.36% (p=0.000 n=117+116)
BM_ZFlat/6 [txt1 (57.87 %) ] 494µs ± 5% 480µs ± 6% -2.84% (p=0.000 n=118+116)
BM_ZFlat/7 [txt2 (62.02 %) ] 444µs ± 4% 428µs ± 7% -3.51% (p=0.000 n=119+117)
BM_ZFlat/8 [txt3 (55.17 %) ] 1.34ms ± 5% 1.30ms ± 5% -2.40% (p=0.000 n=120+116)
BM_ZFlat/9 [txt4 (66.41 %) ] 1.84ms ± 5% 1.78ms ± 5% -3.55% (p=0.000 n=110+111)
BM_ZFlat/10 [pb (19.61 %) ] 101µs ± 5% 97µs ± 5% -4.67% (p=0.000 n=118+118)
BM_ZFlat/11 [gaviota (37.73 %)] 368µs ± 5% 360µs ± 6% -2.13% (p=0.000 n=91+90)
BM_ZFlat/12 [cp (48.25 %) ] 38.9µs ± 6% 36.8µs ± 6% -5.36% (p=0.000 n=88+87)
BM_ZFlat/13 [c (42.52 %) ] 13.4µs ± 6% 13.1µs ± 8% -2.38% (p=0.000 n=115+116)
BM_ZFlat/14 [lsp (48.94 %) ] 4.05µs ± 4% 3.94µs ± 4% -2.58% (p=0.000 n=91+85)
BM_ZFlat/15 [xls (41.10 %) ] 1.42ms ± 5% 1.39ms ± 7% -2.49% (p=0.000 n=116+117)
BM_ZFlat/16 [xls_200 (78.00 %)] 313ns ± 6% 307ns ± 5% -1.89% (p=0.000 n=89+84)
BM_ZFlat/17 [bin (18.12 %) ] 518µs ± 5% 506µs ± 5% -2.42% (p=0.000 n=118+116)
BM_ZFlat/18 [bin_200 (7.50 %) ] 86.8ns ± 6% 85.3ns ± 6% -1.76% (p=0.000 n=118+114)
BM_ZFlat/19 [sum (48.99 %) ] 67.9µs ± 4% 61.1µs ± 6% -9.96% (p=0.000 n=114+117)
BM_ZFlat/20 [man (59.45 %) ] 5.64µs ± 6% 5.47µs ± 7% -3.06% (p=0.000 n=117+115)
BM_ZFlatAll [21 kTestDataFiles] 9.23ms ± 4% 9.01ms ± 5% -2.44% (p=0.000 n=80+83)
BM_ZFlatIncreasingTableSize [7 tables ] 30.4µs ± 5% 29.3µs ± 7% -3.45% (p=0.000 n=96+96)
```
PiperOrigin-RevId: 490184133

PiperOrigin-RevId: 489554313

As far as we know, the lack of "cc" in the clobbers hasn't caused problems yet, but it could. This change is to improve correctness, and is also almost certainly performance neutral.
PiperOrigin-RevId: 487133620
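
The general rule, for illustration: GCC-style inline asm that modifies the condition flags should name "cc" in the clobber list. A minimal x86-64 sketch, not snappy's actual asm:
```
#include <cstdint>

inline uint64_t AddModifiesFlags(uint64_t a, uint64_t b) {
  // addq sets the condition flags, so "cc" must appear in the clobbers.
  asm("addq %1, %0" : "+r"(a) : "r"(b) : "cc");
  return a;
}
```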

This change replaces the hashing function used during compression with one that is roughly as good but faster. This speeds up compression by two to a few percent on the Intel-, AMD-, and Arm-based machines we tested. The amount of compression is roughly unchanged.
PiperOrigin-RevId: 485960303
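
For context, the general shape of such a hash is a single multiply and shift over a few input bytes. A generic sketch; the multiplier shown is snappy's classic one, not the constant introduced by this change:
```
#include <cstdint>

// Multiplicative hash over 4 input bytes; `shift` selects the table size.
inline uint32_t HashBytes(uint32_t bytes, int shift) {
  constexpr uint32_t kMul = 0x1e35a7bd;  // illustrative multiplier
  return (bytes * kMul) >> shift;
}
```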

capable x86 platforms. This gives an average speedup of 6.87% on Milan and 1.90% on Skylake.
PiperOrigin-RevId: 480370725

PiperOrigin-RevId: 479818960

`std::string::data()` is const-only until C++17.
PiperOrigin-RevId: 479708109
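
The portable workaround, as a sketch:
```
#include <string>

char* MutableData(std::string& s) {
  // std::string::data() only gained a non-const overload in C++17;
  // &s[0] is the portable pre-C++17 way to get a mutable pointer.
  return s.empty() ? nullptr : &s[0];
}
```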

PiperOrigin-RevId: 478984028

This reads from an `iovec` array rather than from a `char` array as in `snappy::Compress`.
PiperOrigin-RevId: 476930623
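
A usage sketch, assuming the iovec-based entry point has the natural signature (the name `CompressFromIOVec` and its parameters are an assumption here; check snappy.h for the exact declaration):
```
#include <sys/uio.h>
#include <string>
#include "snappy.h"

// Hypothetical usage: compress two non-contiguous buffers in one call.
std::string CompressTwoBuffers(const char* a, size_t a_len,
                               const char* b, size_t b_len) {
  struct iovec iov[2];
  iov[0].iov_base = const_cast<char*>(a);
  iov[0].iov_len = a_len;
  iov[1].iov_base = const_cast<char*>(b);
  iov[1].iov_len = b_len;
  std::string compressed;
  snappy::CompressFromIOVec(iov, 2, &compressed);  // assumed API
  return compressed;
}
```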

PiperOrigin-RevId: 463090354

As `i + offset` is promoted to a "negative" size_t, UBSan would complain when adding the resulting offset to `dst`:
```
/tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43: runtime error: addition of unsigned offset to 0x6120003c5ec1 overflowed to 0x6120003c5ec0
#0 0x7f9ebd21769c in snappy::(anonymous namespace)::Copy64BytesWithPatternExtension(char*, unsigned long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43
#1 0x7f9ebd21769c in std::__1::pair<unsigned char const*, long> snappy::DecompressBranchless<char*>(unsigned char const*, unsigned char const*, long, char*, long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:1160:15
```
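
The underlying pitfall, sketched: unsigned index arithmetic can wrap, and applying the wrapped value to a pointer trips UBSan. Doing the sum in a signed type avoids this (illustrative, not the exact fix):
```
#include <cstddef>

char* Advance(char* dst, size_t i, std::ptrdiff_t offset) {
  // Bad:  dst + (i + offset)  -- if offset < 0, i + offset wraps as size_t.
  // Good: sum the indices in a signed type, then index the pointer once.
  return dst + (static_cast<std::ptrdiff_t>(i) + offset);
}
```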

contract of `MemCopy64()`, and clarify that it applies to `size`, not to 64.
PiperOrigin-RevId: 453920284

By default MemCpy() / MemMove() always copies 64 bytes in DecompressBranchless(). Profiling shows that the vast majority of the time we need to copy many fewer bytes (typically <= 16 bytes). It is safe to copy fewer than 64 bytes as long as we copy at least `len` bytes.
This change improves throughput by ~12% on ARM, ~35% on AMD Milan, and ~7% on Intel Cascade Lake.
PiperOrigin-RevId: 453917840
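
A sketch of the idea, assuming (as the message implies) that writing up to 64 bytes past the start is always safe here:
```
#include <cstddef>
#include <cstring>

// Copy at least `len` bytes (len <= 64). The common case (len <= 16) does a
// single 16-byte copy; only the rare long case pays for the full 64 bytes.
inline void MemCopy64(char* dst, const void* src, size_t len) {
  std::memcpy(dst, src, 16);
  if (len > 16) {
    std::memcpy(dst + 16, static_cast<const char*>(src) + 16, 48);
  }
}
```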

PiperOrigin-RevId: 444863689

Not everything defining __GNUC__ supports flag outputs from asm statements; in particular, some Clang versions on macOS do not. The correct test per the GCC documentation is __GCC_ASM_FLAG_OUTPUTS__, so use that instead.
PiperOrigin-RevId: 423749308
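
An illustrative guard and use of a flag output (x86-64; snappy's call sites differ):
```
#include <cstdint>

#if defined(__GCC_ASM_FLAG_OUTPUTS__)
// Flag-output constraints let asm return a condition flag directly.
inline bool SubBorrows(uint64_t a, uint64_t b, uint64_t* out) {
  bool borrow;
  asm("subq %2, %0" : "+r"(a), "=@ccc"(borrow) : "r"(b));
  *out = a;
  return borrow;
}
#endif
```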

* Align CONTRIBUTING.md with the google/new-project template.
* Explain the support story for the CMake config.
PiperOrigin-RevId: 421311695

to avoid UB of passing an uninitialized argument by value.
PiperOrigin-RevId: 406052814

PiperOrigin-RevId: 394247182

PiperOrigin-RevId: 394061345

The final ip advance value doesn't have to wait for the result of offset to load *tag. It can be computed along with the offset, so the codegen will use one csinc in parallel with ldrb. This will improve the throughput.
With this change a ~4.2% uplift is observed in UFlat/10 and ~3.7% in UFlatMedley.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I20ab211235bbf578c6c978f2bbd9160a49e920da

PiperOrigin-RevId: 393681630

Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I3fade568ff92b4303387705f843d0051d5e88349

After the SHUFFLE code blocks were refactored, the "tmmintrin.h" include went missing, and the BMI2 code path fails to build due to type conflicts.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I7800cd7e050f4d349e5a227206b14b9c566e547f
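
The fix is to restore the include ahead of the code that uses the SSE types (exact placement assumed):
```
#include <tmmintrin.h>  // SSSE3 intrinsics (e.g. _mm_shuffle_epi8) and the __m128i type
```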

The #if predicate evaluates to false if the macro is undefined, or defined to 0. #ifdef (and its synonym #if defined) evaluates to false only if the macro is undefined.
The new setup allows differentiating between setting a macro to 0 (to express that the capability definitely does not exist / should not be used) and leaving a macro undefined (to express not knowing whether a capability exists / not caring if a capability is used).
PiperOrigin-RevId: 391094241
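
Illustrative use of the tri-state convention, using one of snappy's capability macros as the example:
```
// A macro under this convention has three meaningful states:
//   defined to 1 -> capability present and usable
//   defined to 0 -> capability definitely absent / must not be used
//   undefined    -> unknown; #if treats it as 0
// #ifdef, by contrast, cannot distinguish "defined to 0" from "defined to 1".
#if SNAPPY_HAVE_SSSE3   // false when undefined *or* explicitly set to 0
#include <tmmintrin.h>
#endif
```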

PiperOrigin-RevId: 391082698

PiperOrigin-RevId: 390767998

Clang doesn't realize the load comes with free zero-extension, and emits an extra 'and xn, xm, 0xff' to calculate the offset. With this change, the extra op is removed, and a consistent 1.7% performance uplift is observed.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Ica4617852c4b93eadc6c5c551dc3961ffbadb8f0

PiperOrigin-RevId: 390715690

Inspired by kExtractMasksCombined, this patch uses shifts to replace the table lookup. On Arm the codegen is 2 shift ops (lsl+lsr). Compared to the previous ldr, which has a 4-cycle latency, the lsl+lsr pair needs only 2 cycles.
A slight (~0.3%) uplift is observed on N1, and ~3% on A72.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I5b53632d22d9e5cf1a49d0c5cdd16265a15de23b
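
The general pattern, sketched (the exact expression in the patch may differ): derive the byte mask from a shift rather than loading it from a table.
```
#include <cstdint>

// Keep the low n bytes of v (n in [0, 4]). The 64-bit intermediate keeps the
// shift well-defined at n == 4; compilers lower this to a shift pair instead
// of a table load.
inline uint32_t ExtractLowBytes(uint32_t v, int n) {
  const uint64_t mask = (uint64_t{1} << (8 * n)) - 1;
  return v & static_cast<uint32_t>(mask);
}
```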

improvement for ARM, probably because ARM has more relaxed address computation than x86 (https://www.godbolt.org/z/bfM1ezx41). I don't think this is a compiler bug, or something the compiler can do anything about.
PiperOrigin-RevId: 387569896

PiperOrigin-RevId: 387356237

code: clang for ARM and gcc for x86 (https://gcc.godbolt.org/z/oxeGG7aEx).
PiperOrigin-RevId: 383467656

generation (csinc). For codegen see https://gcc.godbolt.org/z/a8z9j95Pv
PiperOrigin-RevId: 382688740

The SSSE3 intrinsics we use have their direct analogues in NEON, so making this optimization portable requires a very thin translation layer.
PiperOrigin-RevId: 381280165
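
The shape of such a layer, sketched for the byte-shuffle intrinsic (the type and function names are illustrative):
```
#if defined(__aarch64__)
#include <arm_neon.h>
using V128 = uint8x16_t;
// NEON analogue of SSSE3's byte shuffle.
inline V128 V128_Shuffle(V128 input, V128 shuffle_mask) {
  return vqtbl1q_u8(input, shuffle_mask);
}
#else
#include <tmmintrin.h>
using V128 = __m128i;
inline V128 V128_Shuffle(V128 input, V128 shuffle_mask) {
  return _mm_shuffle_epi8(input, shuffle_mask);
}
#endif
```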

Xcode (drives macOS image) : 12.2 => 12.5
Clang : 10 => 12
GCC : 10 => 11
PiperOrigin-RevId: 375610083

context, because the other 5 bits in the byte are used for len-4 and the tag.
PiperOrigin-RevId: 374926553
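
For reference, the tag-byte layout this refers to, per snappy's format description (a copy with a 1-byte offset packs the tag type, length, and the high offset bits into one byte); the decoder sketch below is illustrative:
```
#include <cstddef>
#include <cstdint>

// Decode a COPY_1_BYTE_OFFSET element: bits [0,2) hold the tag type,
// bits [2,5) hold length-4, bits [5,8) hold offset bits [8,11); the low
// 8 offset bits follow in the next byte.
inline void DecodeCopy1(uint8_t tag, uint8_t next_byte,
                        size_t* length, size_t* offset) {
  *length = ((tag >> 2) & 0x7) + 4;
  *offset = (static_cast<size_t>(tag & 0xe0) << 3) | next_byte;
}
```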