summaryrefslogtreecommitdiff
path: root/lib/compression/lzxpress_huffman.c
Commit message (Collapse)AuthorAgeFilesLines
* lib:compression: Fix code spellingAndreas Schneider2023-04-031-3/+3
| | | | | | | Best reviewed with: `git show --word-diff`. Signed-off-by: Andreas Schneider <asn@samba.org> Reviewed-by: Andrew Bartlett <abartlet@samba.org>
* lib/compression: Fix documentation of lzxpress_huffman_compress()Andrew Bartlett2023-03-311-2/+2
| | | | | | | | The "inconvenience function" takes one type, and converts it to another but the documentation was not updated. Signed-off-by: Andrew Bartlett <abartlet@samba.org> Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
* lib/compression: Add helper function lzxpress_huffman_max_compressed_size()Andrew Bartlett2023-03-311-6/+17
| | | | | | | This allows the calculation of the worst case to be shared with callers. Signed-off-by: Andrew Bartlett <abartlet@samba.org> Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
* compression/huffman: debug function bails upon disaster (CID 1517261)Douglas Bagnall2022-12-191-0/+5
| | | | | | | | | | | | | | | | We shouldn't get a node with a zero code, and there's probably nothing to do but stop. CID 1517261 (#1-2 of 2): Bad bit shift operation (BAD_SHIFT)11. negative_shift: In expression j >> offset - k, shifting by a negative amount has undefined behavior. The shift amount, offset - k, is -3. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Jeremy Allison <jra@samba.org> Autobuild-User(master): Jeremy Allison <jra@samba.org> Autobuild-Date(master): Mon Dec 19 23:29:04 UTC 2022 on sn-devel-184
* compression/huffman: double check distance in matches (CID 1517278)Douglas Bagnall2022-12-191-0/+3
| | | | | | | | | | | | | | Because we just wrote the intermediate representation to have no zero distances, we can be sure it doesn't, but Coverity doesn't know. If distance is zero, `bitlen_nonzero_16(distance)` would be bad. CID 1517278 (#1 of 1): Bad bit shift operation (BAD_SHIFT)41. large_shift: In expression 1 << code_dist, left shifting by more than 31 bits has undefined behavior. The shift amount, code_dist, is 65535. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Jeremy Allison <jra@samba.org>
* compression: fix sign extension of long matches (CID 1517275)Douglas Bagnall2022-12-191-1/+1
| | | | | | | | | | | | | | | | | | | Very long matches would be written instead as very very long matches. We can't in fact hit this because we have a MAX_MATCH_LENGTH defined as 64M, but if we could, it might make certain 2GB+ strings impossible to compress. CID 1517275 (#1 of 1): Unintended sign extension (SIGN_EXTENSION)sign_extension: Suspicious implicit sign extension: intermediate[i + 2UL] with type uint16_t (16 bits, unsigned) is promoted in intermediate[i + 2UL] << 16 to type int (32 bits, signed), then sign-extended to type unsigned long (64 bits, unsigned). If intermediate[i + 2UL] << 16 is greater than 0x7FFFFFFF, the upper bits of the result will all be 1. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Jeremy Allison <jra@samba.org>
* compression/huffman: check again for invalid codes (CID 1517302)Douglas Bagnall2022-12-191-0/+3
| | | | | | | | | | | | | We know that code is non-zero, because it comes from the combination of the intermediate representation and the symbol tables that were generated at the same time. But Coverity doesn't know that, and it thinks we could be doing undefined things in the subsequent shift. CID 1517302: Integer handling issues (BAD_SHIFT) In expression "1 << code_bit_len", shifting by a negative amount has Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Jeremy Allison <jra@samba.org>
* compression/huffman: tighten bit_len checks (fix SUSE -O3 build)Douglas Bagnall2022-12-191-2/+3
| | | | | | | | | | | | | | | | | | | | | | The struct write_context bit_len attribute is always between 0 and 31, but if the next patches are applied without this, SUSE GCC -O3 will worry thusly: ../../lib/compression/lzxpress_huffman.c: In function ‘lzxpress_huffman_compress’: ../../lib/compression/lzxpress_huffman.c:953:5: error: assuming signed overflow does not occur when simplifying conditional to constant [-Werror=strict-overflow] if (wc->bit_len > 16) { ^ cc1: all warnings being treated as errors Inspection tell us that the invariant holds. Nevertheless, we can safely use an unsigned type and insist that over- or under- flow is bad. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Jeremy Allison <jra@samba.org>
* compression/huffman: avoid semi-defined behaviour in decompressDouglas Bagnall2022-12-191-5/+7
| | | | | | | | | | | | | | | | | | | | | | | We had output[output_pos - distance]; where output_pos and distance are size_t and distance can be greater than output_pos (because it refers to a place in the previous block). The underflow is defined, leading to a big number, and when sizeof(size_t) == sizeof(*uint8_t) the subsequent overflow works as expected. But if size_t is smaller than a pointer, bad things will happen. This was found by OSSFuzz with 'UBSAN_OPTIONS=print_stacktrace=1:silence_unsigned_overflow=1'. Credit to OSSFuzz. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Volker Lendecke <vl@samba.org> Reviewed-by: Jeremy Allison <jra@samba.org>
* lib/compression: debug routines for lzxpress-huffmanDouglas Bagnall2022-12-011-1/+249
| | | | | | | | | | | | | If you need to see a Huffman tree (and sometimes you do), set DEBUG_HUFFMAN_TREE to true at the top of lzxpress_huffman.c, and run: make bin/test_lzx_huffman && bin/test_lzx_huffman Actually, that will show you hundreds of trees, and you'll be glad of that if you are ever trying to understand this. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
* lib/compression/lzhuff: add debug flag to skip LZ77Douglas Bagnall2022-12-011-1/+10
| | | | | | | | Encoding without LZ77 matches is valid, and it is useful for isolating bugs. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
* lib/compression: LZ77 + Huffman compressionDouglas Bagnall2022-12-011-0/+1107
| | | | | | | | | | | | | | | | | | | | This compresses files as described in MS-XCA 2.2, and as decompressed by the decompressor in the previous commit. As with the decompressor, there are two public functions -- one that uses a talloc context, and one that uses pre-allocated memory. The compressor requires a tightly bound amount of auxillary memory (>220kB) in a few different buffers, which is all gathered together in the public struct lzxhuff_compressor_mem. An instantiated but not initialised copy of this struct is required by the non-talloc function; it can be used over and over again. Our compression speed is about the same as the decompression speed (between 20 and 500 MB/s on this laptop, depending on the data), and our compression ratio is very similar to that of Windows. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
* lib/compression: add LZ77 + Huffman decompressionDouglas Bagnall2022-12-011-0/+656
This format is described in [MS-XCA] 2.1 and 2.2, with exegesis in many posts on the cifs-protocol list[1]. The two public functions are: ssize_t lzxpress_huffman_decompress(const uint8_t *input, size_t input_size, uint8_t *output, size_t output_size); uint8_t *lzxpress_huffman_decompress_talloc(TALLOC_CTX *mem_ctx, const uint8_t *input_bytes, size_t input_size, size_t output_size); In both cases the caller needs to know the *exact* decompressed size, which is essential for decompression. The _talloc version allocates the buffer for you, and uses the talloc context to allocate a 128k working buffer. THe non-talloc function will allocate the working buffer on the stack. This compression format gives better compression for messages of several kilobytes than the "plain" LXZPRESS compression, but is probably a bit slower to decompress and is certainly worse for very short messages, having a fixed 256 byte overhead for the first Huffman table. Experiments show decompression rates between 20 and 500 MB per second, depending on the compression ratio and data size, on an i5-1135G7 with no compiler optimisations. This compression format is used in AD claims and in SMB, but that doesn't happen with this commit. I will not try to describe LZ77 or Huffman encoding here. Don't expect an answer in MS-XCA either; instead read the code and/or Wikipedia. [1] Much of that starts here: https://lists.samba.org/archive/cifs-protocol/2022-October/ but there's more earlier, particularly in June/July 2020, when Aurélien Aptel was working on an implementation that ended up in Wireshark. Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz> Pair-programmed-with: Joseph Sutton <josephsutton@catalyst.net.nz> Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>