Commit message  Author  Date  Files  Lines
* Fixes for Windows bazel build. [HEAD, main]  Richard O'Grady  2023-04-14  3 files, -4/+15
    Don't pass -Wno-sign-compare on Windows.
    Add a #define HAVE_WINDOWS_H if _WIN32 is defined.
    Don't assume sys/uio.h is available on Windows.

    PiperOrigin-RevId: 524416809
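A minimal sketch of the guards this commit describes, assuming they live in one config header (the actual changes are spread across the Bazel build files):

```
// Sketch only, not the verbatim snappy change.
#if defined(_WIN32)
#define HAVE_WINDOWS_H 1   // HAVE_WINDOWS_H follows from _WIN32.
#endif

#if defined(HAVE_WINDOWS_H) && HAVE_WINDOWS_H
#include <windows.h>
#else
#include <sys/uio.h>       // sys/uio.h is no longer assumed on Windows.
#endif
```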
* Add initial bazel build support for snappy.  Richard O'Grady  2023-04-13  4 files, -0/+229
    PiperOrigin-RevId: 524135175
* Upgrade googletest to v1.13.0 release.  Richard O'Grady  2023-04-13  1 file, -0/+0
* Disable the -Wimplicit-int-float-conversion warning in googletest.  Richard O'Grady  2023-04-13  1 file, -0/+7
    PiperOrigin-RevId: 524031046
* Upgrade benchmark library to v1.7.1 release.  Richard O'Grady  2023-04-11  1 file, -0/+0
* Disable the -Wsign-compare warning.  Richard O'Grady  2023-04-11  1 file, -0/+5
    PiperOrigin-RevId: 523460180
* Define missing SNAPPY_PREFETCH macros.  Richard O'Grady  2023-04-11  3 files, -0/+15
    PiperOrigin-RevId: 523287305
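A plausible shape for these macros (an assumption; the actual definitions are in the commit): wrap the GCC/Clang prefetch builtin where it exists and degrade to a no-op elsewhere.

```
// Sketch, not the verbatim snappy definition: prefetch for read into
// all cache levels where the builtin exists, otherwise do nothing.
#if defined(__GNUC__)
#define SNAPPY_PREFETCH(ptr) __builtin_prefetch((ptr), 0, 3)
#else
#define SNAPPY_PREFETCH(ptr) (void)(ptr)
#endif
```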
* Add prefetch to zippy compression.  Ilya Tokar  2023-03-29  2 files, -6/+2
    PiperOrigin-RevId: 518358512
* Explicitly #include <utility> in snappy-internal.h  Snappy Team  2023-03-29  1 file, -0/+2
    snappy-internal.h uses std::pair, which is defined in the <utility>
    header. Typically, this works because existing C++ standard library
    implementations provide <utility> via other transitive includes;
    however, these transitive includes are not guaranteed to exist, and
    don't exist in certain contexts (e.g. compiling against LLVM's libc++
    with Clang modules).

    PiperOrigin-RevId: 517213822
* Optimize check for uncommon decompression for ARM, saving two instructions and three cycles.  Snappy Team  2023-03-29  1 file, -5/+10
    PiperOrigin-RevId: 517141646
* Tag open source release 1.1.10. [tag: 1.1.10]  Victor Costan  2023-03-08  2 files, -1/+7
    PiperOrigin-RevId: 515161676
* The output buffer in DecompressBranchless is never read from and the source buffers are never written.  Snappy Team  2023-03-07  1 file, -9/+41
    This allows us to defer any writes to the output buffer for an
    arbitrary amount of time, as long as the writes all occur in the
    proper order. When a MemCopy64 would normally have occurred, we save
    away the source address and length. Once we reach the location of the
    next write to the output buffer, we first perform the deferred copy.
    This gives the source address and length calculation time to finish
    before the deferred copy.

    This change gives 1.84% on CLX and 0.97% on Milan.

    PiperOrigin-RevId: 504012310
* Merge pull request #150 from davemgreen:betterunalignedloads  Victor Costan  2023-01-12  1 file, -12/+45
    PiperOrigin-RevId: 501489679
| * Change LittleEndian loads/stores to use memcpy.  David Green  2022-01-19  1 file, -12/+36
    The existing code uses a series of 8-bit loads with shifts and ors to
    emulate an (unaligned) load of a larger type. These are then expected
    to become single loads in the compiler, producing optimal assembly.
    Whilst this is true, it happens very late in the compiler, meaning
    that throughout most of the pipeline it is treated (and cost-modelled)
    as multiple loads, shifts, and ors. This can make the compiler make
    poor decisions (such as not unrolling loops that should be), or break
    up the pattern before it is turned into a single load. For example,
    the loops in CompressFragment do not get unrolled as expected due to
    a higher cost than the unroll threshold in clang.

    Instead, this patch uses a more conventional method of loading
    unaligned data: a direct memcpy, which the compiler can deal with
    much more straightforwardly, modelling it as a single unaligned load.
    The old code is left as-is for big-endian systems.

    This helps improve the performance of the BM_ZFlat benchmarks by up
    to 10-15% on an Arm Neoverse N1.

    Change-Id: I986f845ebd0a0806d052d2be3e4dbcbee91713d7
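The pattern the patch describes, as a standalone sketch (function names here are illustrative, not the exact snappy helpers):

```
#include <cstdint>
#include <cstring>

// New style: one memcpy, modelled by the compiler as a single unaligned
// load throughout the optimization pipeline.
inline uint32_t LoadU32LE(const void* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;  // On little-endian targets this is the value as stored.
}

// Old style (kept for big-endian systems): four byte loads plus
// shifts/ors, only fused into a single load very late in the compiler.
inline uint32_t LoadU32Bytewise(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}
```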
* | Allow some buffer overwrite on literal emitting.  Snappy Team  2023-01-12  1 file, -1/+13
    Calls to memcpy seem to be quite expensive:

    ```
    BM_ZFlat/0  [html (22.24 %)   ]  114µs ± 6%   110µs ± 6%   -3.97%  (p=0.000 n=118+115)
    BM_ZFlat/1  [urls (47.84 %)   ]  1.63ms ± 5%  1.58ms ± 5%  -3.39%  (p=0.000 n=117+115)
    BM_ZFlat/2  [jpg (99.95 %)    ]  7.84µs ± 6%  7.70µs ± 6%  -1.66%  (p=0.000 n=119+117)
    BM_ZFlat/3  [jpg_200 (73.00 %)]  265ns ± 6%   255ns ± 6%   -3.48%  (p=0.000 n=101+98)
    BM_ZFlat/4  [pdf (83.31 %)    ]  11.8µs ± 6%  11.6µs ± 6%  -2.14%  (p=0.000 n=118+116)
    BM_ZFlat/5  [html4 (22.52 %)  ]  525µs ± 6%   513µs ± 6%   -2.36%  (p=0.000 n=117+116)
    BM_ZFlat/6  [txt1 (57.87 %)   ]  494µs ± 5%   480µs ± 6%   -2.84%  (p=0.000 n=118+116)
    BM_ZFlat/7  [txt2 (62.02 %)   ]  444µs ± 4%   428µs ± 7%   -3.51%  (p=0.000 n=119+117)
    BM_ZFlat/8  [txt3 (55.17 %)   ]  1.34ms ± 5%  1.30ms ± 5%  -2.40%  (p=0.000 n=120+116)
    BM_ZFlat/9  [txt4 (66.41 %)   ]  1.84ms ± 5%  1.78ms ± 5%  -3.55%  (p=0.000 n=110+111)
    BM_ZFlat/10 [pb (19.61 %)     ]  101µs ± 5%   97µs ± 5%    -4.67%  (p=0.000 n=118+118)
    BM_ZFlat/11 [gaviota (37.73 %)]  368µs ± 5%   360µs ± 6%   -2.13%  (p=0.000 n=91+90)
    BM_ZFlat/12 [cp (48.25 %)     ]  38.9µs ± 6%  36.8µs ± 6%  -5.36%  (p=0.000 n=88+87)
    BM_ZFlat/13 [c (42.52 %)      ]  13.4µs ± 6%  13.1µs ± 8%  -2.38%  (p=0.000 n=115+116)
    BM_ZFlat/14 [lsp (48.94 %)    ]  4.05µs ± 4%  3.94µs ± 4%  -2.58%  (p=0.000 n=91+85)
    BM_ZFlat/15 [xls (41.10 %)    ]  1.42ms ± 5%  1.39ms ± 7%  -2.49%  (p=0.000 n=116+117)
    BM_ZFlat/16 [xls_200 (78.00 %)]  313ns ± 6%   307ns ± 5%   -1.89%  (p=0.000 n=89+84)
    BM_ZFlat/17 [bin (18.12 %)    ]  518µs ± 5%   506µs ± 5%   -2.42%  (p=0.000 n=118+116)
    BM_ZFlat/18 [bin_200 (7.50 %) ]  86.8ns ± 6%  85.3ns ± 6%  -1.76%  (p=0.000 n=118+114)
    BM_ZFlat/19 [sum (48.99 %)    ]  67.9µs ± 4%  61.1µs ± 6%  -9.96%  (p=0.000 n=114+117)
    BM_ZFlat/20 [man (59.45 %)    ]  5.64µs ± 6%  5.47µs ± 7%  -3.06%  (p=0.000 n=117+115)
    BM_ZFlatAll [21 kTestDataFiles]  9.23ms ± 4%  9.01ms ± 5%  -2.44%  (p=0.000 n=80+83)
    BM_ZFlatIncreasingTableSize [7 tables]  30.4µs ± 5%  29.3µs ± 7%  -3.45%  (p=0.000 n=96+96)
    ```

    PiperOrigin-RevId: 490184133
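A sketch of the over-copy idea implied by the title, under the assumption that the output buffer always has writable slack past a short literal (the function name and the 16-byte bound are hypothetical, not taken from the patch):

```
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch: instead of memcpy(dst, src, len) for a short
// literal, copy a fixed 16 bytes and advance by len. The bytes past len
// are garbage but land in writable slack, avoiding a variable-length
// memcpy on the hot path.
inline char* EmitShortLiteral(char* dst, const char* src, std::size_t len) {
  assert(len <= 16);          // Caller guarantees a short literal...
  std::memcpy(dst, src, 16);  // ...and >= 16 writable bytes at dst.
  return dst + len;           // Only the first len bytes are meaningful.
}
```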
* | Add prefetch to zippy decompression.  Ilya Tokar  2023-01-12  1 file, -0/+8
    PiperOrigin-RevId: 489554313
* | Add "cc" clobbers to inline asm that modifies flags.Snappy Team2023-01-122-3/+6
| | | | | | | | | | | | | | | | As far as we know, the lack of "cc" in the clobbers hasn't caused problems yet, but it could. This change is to improve correctness, and is also almost certainly performance neutral. PiperOrigin-RevId: 487133620
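An illustrative x86 example of the rule being applied (not the actual snappy asm): any instruction that writes EFLAGS needs "cc" in the clobber list.

```
// 'add' modifies EFLAGS, so the asm must declare the "cc" clobber;
// otherwise the compiler may assume flags are preserved across it.
inline unsigned long AddInAsm(unsigned long a, unsigned long b) {
  asm("add %1, %0" : "+r"(a) : "r"(b) : "cc");  // x86, AT&T syntax.
  return a;
}
```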
* | Improve the speed of hashing in zippy compression.  Snappy Team  2023-01-12  3 files, -20/+79
    This change replaces the hashing function used during compression
    with one that is roughly as good but faster. This speeds up
    compression by two to a few percent on the Intel-, AMD-, and
    Arm-based machines we tested. The amount of compression is roughly
    unchanged.

    PiperOrigin-RevId: 485960303
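The general shape of such a match-finding hash, as a sketch (the constant below is the one older snappy versions used; the commit's replacement function and constants are not reproduced here):

```
#include <cstdint>

// Multiplicative hash of 4 input bytes down to a table index; 'shift'
// selects the effective table size. Constant is illustrative only.
inline uint32_t HashBytes(uint32_t bytes, int shift) {
  constexpr uint32_t kMagic = 0x1e35a7bd;
  return (bytes * kMagic) >> shift;
}
```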
* | Modify MemCopy64 to use AVX 32-byte copies instead of SSE2 16-byte copies on capable x86 platforms.  Snappy Team  2023-01-12  1 file, -7/+16
    This gives an average speedup of 6.87% on Milan and 1.90% on Skylake.

    PiperOrigin-RevId: 480370725
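A sketch of the AVX variant described (guard macro and fallback assumed; the real MemCopy64 handles more cases):

```
#include <cstring>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Copy a fixed 64 bytes: two 32-byte unaligned AVX moves on capable x86,
// otherwise a plain memcpy (which SSE2 builds expand to 16-byte moves).
inline void MemCopy64(char* dst, const void* src) {
#if defined(__AVX__)
  const __m256i* s = reinterpret_cast<const __m256i*>(src);
  __m256i* d = reinterpret_cast<__m256i*>(dst);
  _mm256_storeu_si256(d, _mm256_loadu_si256(s));
  _mm256_storeu_si256(d + 1, _mm256_loadu_si256(s + 1));
#else
  std::memcpy(dst, src, 64);
#endif
}
```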
* | Fix the remaining occurrence of non-const `std::string::data()`.  Marcin Kowalczyk  2022-10-08  1 file, -1/+1
    PiperOrigin-RevId: 479818960
* | Fix compilation errors under C++11.  Matt Callanan  2022-10-08  3 files, -3/+3
    `std::string::data()` is const-only until C++17.

    PiperOrigin-RevId: 479708109
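The portable pattern under C++11/14, sketched:

```
#include <cstddef>
#include <string>

// Before C++17, std::string::data() returns const char* only, so
// writable access has to go through operator[]:
void FillBuffer(std::string* out, std::size_t n) {
  out->resize(n);
  char* dst = &(*out)[0];  // OK since C++11; out->data() is const-only here.
  dst[0] = 'x';
}
```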
* | Fix warnings due to use of `__attribute__((always_inline))` without `inline`.  Marcin Kowalczyk  2022-10-05  1 file, -2/+2
    PiperOrigin-RevId: 478984028
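The warning and its fix, sketched: GCC and Clang expect always_inline functions to also be declared inline.

```
// Warns (always_inline without inline):
//   __attribute__((always_inline)) int Twice(int x) { return 2 * x; }
// Fixed:
inline __attribute__((always_inline)) int Twice(int x) { return 2 * x; }
```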
* | Add `snappy::CompressFromIOVec`.  Matt Callanan  2022-09-29  4 files, -23/+218
    This reads from an `iovec` array rather than from a `char` array as
    in `snappy::Compress`.

    PiperOrigin-RevId: 476930623
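A usage sketch, assuming the new function mirrors `snappy::Compress` but takes an `iovec` array plus an element count (the exact signature is not shown in this log):

```
#include <sys/uio.h>
#include <cstddef>
#include <string>
#include "snappy.h"

// Compress two discontiguous buffers in one call (signature assumed).
std::string CompressTwo(char* a, std::size_t alen, char* b, std::size_t blen) {
  struct iovec iov[2];
  iov[0].iov_base = a;
  iov[0].iov_len = alen;
  iov[1].iov_base = b;
  iov[1].iov_len = blen;
  std::string compressed;
  snappy::CompressFromIOVec(iov, 2, &compressed);
  return compressed;
}
```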
* | Merge pull request #148 from pitrou:ubsan-ptr-add-overflow  Victor Costan  2022-07-27  1 file, -1/+2
    PiperOrigin-RevId: 463090354
| * | Fix UBSan error (ptr + offset overflow)  Antoine Pitrou  2021-11-30  1 file, -1/+1
    As `i + offset` is promoted to a "negative" size_t, UBSan would
    complain when adding the resulting offset to `dst`:

    ```
    /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43: runtime error: addition of unsigned offset to 0x6120003c5ec1 overflowed to 0x6120003c5ec0
        #0 0x7f9ebd21769c in snappy::(anonymous namespace)::Copy64BytesWithPatternExtension(char*, unsigned long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:343:43
        #1 0x7f9ebd21769c in std::__1::pair<unsigned char const*, long> snappy::DecompressBranchless<char*>(unsigned char const*, unsigned char const*, long, char*, long) /tmp/RtmptDX1SS/file584e37df4e/snappy_ep-prefix/src/snappy_ep/snappy.cc:1160:15
    ```
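The class of fix, sketched (the actual one-line patch is in the PR): keep the intermediate arithmetic signed so no wrapped unsigned value is ever added to the pointer.

```
#include <cstddef>

// UB pattern: 'offset' is conceptually negative but held in a size_t,
// so i + offset wraps and dst + (i + offset) overflows the pointer.
// Sketch of a fix: do the signed arithmetic first, then index once.
char* AddressOf(char* dst, std::size_t i, std::ptrdiff_t offset) {
  std::ptrdiff_t delta = static_cast<std::ptrdiff_t>(i) + offset;
  return dst + delta;  // Single, in-range pointer adjustment.
}
```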
* | | Move the comment about the non-overlap requirement from the implementation to the contract of `MemCopy64()`, and clarify that it applies to `size`, not to 64.  Marcin Kowalczyk  2022-07-27  1 file, -2/+2
    PiperOrigin-RevId: 453920284
* | | Optimize zippy MemCpy / MemMove during decompression.  Snappy Team  2022-07-27  1 file, -16/+29
    By default MemCpy() / MemMove() always copies 64 bytes in
    DecompressBranchless(). Profiling shows that the vast majority of the
    time we need to copy many fewer bytes (typically <= 16 bytes). It is
    safe to copy fewer bytes as long as we exceed len.

    This change improves throughput by ~12% on ARM, ~35% on AMD Milan,
    and ~7% on Intel Cascade Lake.

    PiperOrigin-RevId: 453917840
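A sketch of the fast path this describes (the slack-space guarantee is an assumption carried over from the surrounding decompressor; names are illustrative):

```
#include <cstddef>
#include <cstring>

// Most copies are <= 16 bytes; copy a fixed 16 first and fall back to
// the full 64-byte copy only for long lengths. Writing past 'len' is
// safe here because the output buffer is guaranteed to have slack.
inline void MemCopyUpTo64(char* dst, const char* src, std::size_t len) {
  if (len <= 16) {
    std::memcpy(dst, src, 16);  // Fixed size: compiles to two 8-byte moves.
  } else {
    std::memcpy(dst, src, 64);
  }
}
```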
* | | Optimize Zippy compression for ARM by 5-10% by choosing csel instructions.  Snappy Team  2022-05-09  1 file, -6/+6
    PiperOrigin-RevId: 444863689
* | | Fix compilation for older GCC and Clang versions.  Snappy Team  2022-02-20  1 file, -1/+1
    Not everything defining __GNUC__ supports flag outputs from asm
    statements; in particular, some Clang versions on macOS do not. The
    correct test per the GCC documentation is __GCC_ASM_FLAG_OUTPUTS__,
    so use that instead.

    PiperOrigin-RevId: 423749308
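The corrected feature test, sketched with a small flag-output example (illustrative x86 asm, not the snappy code):

```
// Gate flag-output asm on __GCC_ASM_FLAG_OUTPUTS__, not on __GNUC__.
inline bool AddOverflows(unsigned a, unsigned b) {
#if defined(__GCC_ASM_FLAG_OUTPUTS__)
  bool carry;
  // "=@ccc" binds the x86 carry flag directly to 'carry'.
  asm("addl %2, %1" : "=@ccc"(carry), "+r"(a) : "r"(b));
  return carry;
#else
  return a + b < a;  // Portable fallback without flag outputs.
#endif
}
```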
* | Update contributing guidelines. [master]  Victor Costan  2022-01-12  2 files, -24/+35
    * Align CONTRIBUTING.md with the google/new-project template.
    * Explain the support story for the CMake config.

    PiperOrigin-RevId: 421311695
* Pass the first argument of ExtractLowBytes by reference to avoid the UB of passing an uninitialized argument by value.  Snappy Team  2021-11-14  1 file, -1/+1
    PiperOrigin-RevId: 406052814
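A sketch of the signature change (the body below is illustrative; the real helper's masking logic differs):

```
#include <cstdint>

// Was: ExtractLowBytes(uint32_t v, int n). Passing an uninitialized
// object by value is UB at the call site; taking it by const reference
// defers the read until the value is actually needed.
inline uint32_t ExtractLowBytes(const uint32_t& v, int n) {
  // n is in [0, 4]; the 64-bit shift keeps n == 4 from overshifting.
  return v & static_cast<uint32_t>((uint64_t{1} << (8 * n)) - 1);
}
```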
* Switch CI to GitHub Actions.  Victor Costan  2021-09-01  4 files, -148/+136
    PiperOrigin-RevId: 394247182
* Merge pull request #140 from JunHe77:adv  Victor Costan  2021-08-31  1 file, -4/+8
    PiperOrigin-RevId: 394061345
| * decompress: refine data dependency  Jun He  2021-08-30  1 file, -4/+8
    The final ip advance value doesn't have to wait for the result of
    offset to load *tag. It can be computed along with the offset, so the
    codegen will use one csinc in parallel with ldrb. This improves
    throughput.

    With this change, an uplift of ~4.2% is observed in UFlat/10 and
    ~3.7% in UFlatMedley.

    Signed-off-by: Jun He <jun.he@arm.com>
    Change-Id: I20ab211235bbf578c6c978f2bbd9160a49e920da
* Merge pull request #133 from JunHe77:simd  Victor Costan  2021-08-30  3 files, -2/+31
    PiperOrigin-RevId: 393681630
| * Add config and header file for NEON support  Jun He  2021-08-12  2 files, -0/+12
    Signed-off-by: Jun He <jun.he@arm.com>
    Change-Id: I3fade568ff92b4303387705f843d0051d5e88349
| * Fix SSE3 and BMI2 compile errors  Jun He  2021-08-12  2 files, -23/+33
    After the SHUFFLE code blocks were refactored, "tmmintrin.h" went
    missing, and the BMI2 code fails to build due to type conflicts.

    Signed-off-by: Jun He <jun.he@arm.com>
    Change-Id: I7800cd7e050f4d349e5a227206b14b9c566e547f
* | Migrate feature detection macro checks from #ifdef to #if.  Victor Costan  2021-08-16  7 files, -43/+44
    The #if predicate evaluates to false if the macro is undefined, or
    defined to 0. #ifdef (and its synonym #if defined) evaluates to false
    only if the macro is undefined.

    The new setup allows differentiating between setting a macro to 0 (to
    express that the capability definitely does not exist / should not be
    used) and leaving a macro undefined (to express not knowing whether a
    capability exists / not caring if a capability is used).

    PiperOrigin-RevId: 391094241
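The distinction, demonstrated (macro name illustrative):

```
#define HAVE_FEATURE 0  // Explicitly disabled.

#if HAVE_FEATURE
#error "never reached: #if sees the 0"
#endif

#ifdef HAVE_FEATURE
// Reached: #ifdef only checks that the macro is defined, even if to 0.
#endif
```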
* | Add baseline CPU level to Travis CI.  Victor Costan  2021-08-16  1 file, -4/+13
    PiperOrigin-RevId: 391082698
* | Merge pull request #135 from JunHe77:remove_extra  Victor Costan  2021-08-14  1 file, -0/+9
    PiperOrigin-RevId: 390767998
| * | decompress: add hint to remove extra AND  Jun He  2021-08-12  1 file, -0/+9
    Clang doesn't realize the load comes with a free zero-extension, and
    emits an extra 'and xn, xm, 0xff' to compute the offset. With this
    change the extra op is removed, and a consistent 1.7% performance
    uplift is observed.

    Signed-off-by: Jun He <jun.he@arm.com>
    Change-Id: Ica4617852c4b93eadc6c5c551dc3961ffbadb8f0
* | Merge pull request #136 from JunHe77:ext_arm  Victor Costan  2021-08-13  1 file, -0/+4
    PiperOrigin-RevId: 390715690
| * decompression: optimize ExtractOffset for Arm  Jun He  2021-08-06  1 file, -0/+3
    Inspired by kExtractMasksCombined, this patch uses shifts to replace
    the table lookup. On Arm the codegen is 2 shift ops (lsl+lsr).
    Compared to the previous ldr, which has 4 cycles of latency, the
    lsl+lsr pair needs only 2 cycles. A slight (~0.3%) uplift is observed
    on N1, and ~3% on A72.

    Signed-off-by: Jun He <jun.he@arm.com>
    Change-Id: I5b53632d22d9e5cf1a49d0c5cdd16265a15de23b
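The shift-based replacement, sketched (the helper name and the range assumption on n are mine, not the patch's):

```
#include <cstdint>

// Keep the low n bytes of a 64-bit load with lsl+lsr instead of loading
// a mask from a table. Assumes n is in [1, 8] so the shift stays < 64.
inline uint64_t KeepLowBytes(uint64_t v, int n) {
  const int shift = 8 * (8 - n);
  return (v << shift) >> shift;
}
```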
* Move the extract masks variable out in zippy.  Snappy Team  2021-08-02  1 file, -9/+18
    I see a consistent 1.5-2% improvement for ARM, probably because ARM
    has more relaxed address computation than x86:
    https://www.godbolt.org/z/bfM1ezx41. I don't think this is a compiler
    bug, or that the compiler can do anything about it.

    PiperOrigin-RevId: 387569896
* Remove inline assembly, as the bug in clang was fixed.  Snappy Team  2021-08-02  1 file, -16/+0
    PiperOrigin-RevId: 387356237
* Optimize memset to pure SIMD, because compilers generate consistently bad code: clang for ARM and gcc for x86 (https://gcc.godbolt.org/z/oxeGG7aEx).  Snappy Team  2021-08-02  2 files, -1/+15
    PiperOrigin-RevId: 383467656
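The x86 side of the idea, sketched with standard SSE2 intrinsics (the commit covers both Arm clang and x86 gcc; the function name is illustrative):

```
#include <emmintrin.h>

// Fill 16 bytes with one SIMD store instead of trusting the compiler's
// memset expansion, which the linked Godbolt shows is consistently poor.
inline void Fill16(char* dst, char value) {
  _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_set1_epi8(value));
}
```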
* Optimize tag extraction for ARM with conditional increment instruction generation (csinc). For codegen see https://gcc.godbolt.org/z/a8z9j95Pv.  Snappy Team  2021-07-05  1 file, -2/+25
    PiperOrigin-RevId: 382688740
* Enable vector byte shuffle optimizations on ARM NEON  atdt  2021-07-05  2 files, -59/+99
    The SSSE3 intrinsics we use have their direct analogues in NEON, so
    making this optimization portable requires a very thin translation
    layer.

    PiperOrigin-RevId: 381280165
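The heart of that translation layer, sketched for AArch64 (the actual layer in the commit wraps more intrinsics than this):

```
#if defined(__aarch64__)
#include <arm_neon.h>

// vqtbl1q_u8 is the NEON analogue of SSSE3's _mm_shuffle_epi8 for
// in-range indices: each output byte is table[indices[i]].
inline uint8x16_t ShuffleBytes(uint8x16_t table, uint8x16_t indices) {
  return vqtbl1q_u8(table, indices);
}
#endif
```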
* Update Travis CI config.  Victor Costan  2021-05-25  1 file, -9/+9
    Xcode (drives macOS image): 12.2 => 12.5
    Clang: 10 => 12
    GCC: 10 => 11

    PiperOrigin-RevId: 375610083
* Clarify, in a comment, that offset/256 fits in 3 bits. It has to in this context, because the other 5 bits in the byte are used for len-4 and the tag.  Snappy Team  2021-05-25  1 file, -1/+1
    PiperOrigin-RevId: 374926553