summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorsnappy.mirrorbot@gmail.com <snappy.mirrorbot@gmail.com@03e5f5b5-db94-4691-08a0-1a8bf15f6143>2011-05-16 08:59:18 +0000
committersnappy.mirrorbot@gmail.com <snappy.mirrorbot@gmail.com@03e5f5b5-db94-4691-08a0-1a8bf15f6143>2011-05-16 08:59:18 +0000
commit184f228f1c37b4e6b18051b6b601d23f1ca364ef (patch)
tree325b3e3e227e9ac961ac8d7602413fadb2cd2c06
parentf9c515d1c3f8ea5ca437291bf2ea17470eaf5316 (diff)
downloadsnappy-184f228f1c37b4e6b18051b6b601d23f1ca364ef.tar.gz
Fix public issue #32: Add compressed format documentation for Snappy.
This text is new, but an earlier version from Zeev Tarantov was used as reference. R=csilvers DELTA=112 (111 added, 0 deleted, 1 changed) Revision created by MOE tool push_codebase. MOE_MIGRATION=1867 git-svn-id: http://snappy.googlecode.com/svn/trunk@36 03e5f5b5-db94-4691-08a0-1a8bf15f6143
-rw-r--r--Makefile.am2
-rw-r--r--format_description.txt110
2 files changed, 111 insertions, 1 deletions
diff --git a/Makefile.am b/Makefile.am
index 2da357e..2fa36fd 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -17,7 +17,7 @@ TESTS = snappy_unittest
noinst_PROGRAMS = $(TESTS)
EXTRA_DIST = autogen.sh testdata/alice29.txt testdata/asyoulik.txt testdata/baddata1.snappy testdata/baddata2.snappy testdata/baddata3.snappy testdata/cp.html testdata/fields.c testdata/geo.protodata testdata/grammar.lsp testdata/house.jpg testdata/html testdata/html_x_4 testdata/kennedy.xls testdata/kppkn.gtb testdata/lcet10.txt testdata/mapreduce-osdi-1.pdf testdata/plrabn12.txt testdata/ptt5 testdata/sum testdata/urls.10K testdata/xargs.1
-dist_doc_DATA = ChangeLog COPYING INSTALL NEWS README
+dist_doc_DATA = ChangeLog COPYING INSTALL NEWS README format_description.txt
libtool: $(LIBTOOL_DEPS)
$(SHELL) ./config.status --recheck
diff --git a/format_description.txt b/format_description.txt
new file mode 100644
index 0000000..f5b990e
--- /dev/null
+++ b/format_description.txt
@@ -0,0 +1,110 @@
+Snappy compressed format description
+Last revised: 2011-05-13
+
+
+This is not a formal specification, but should suffice to explain most
+relevant parts of how the Snappy format works. It is originally based on
+text by Zeev Tarantov.
+
+Snappy is a LZ77-type compressor with a fixed, byte-oriented encoding.
+There is no entropy encoder backend nor framing layer -- the latter is
+assumed to be handled by other parts of the system.
+
+This document only describes the format, not how the Snappy compressor nor
+decompressor actually works. The correctness of the decompressor should not
+depend on implementation details of the compressor, and vice versa.
+
+
+1. Preamble
+
+The stream starts with the uncompressed length (up to a maximum of 2^32 - 1),
+stored as a little-endian varint. Varints consist of a series of bytes,
+where the lower 7 bits are data and the upper bit is set iff there are
+more bytes to be read. In other words, an uncompressed length of 64 would
+be stored as 0x40, and an uncompressed length of 2097151 (0x1FFFFF)
+would be stored as 0xFF 0xFF 0x7F.
+
+
+2. The compressed stream itself
+
+There are two types of elements in a Snappy stream: Literals and
+copies (backreferences). There is no restriction on the order of elements,
+except that the stream naturally cannot start with a copy. (Having
+two literals in a row is never optimal from a compression point of
+view, but nevertheless fully permitted.) Each element starts with a tag byte,
+and the lower two bits of this tag byte signal what type of element will
+follow:
+
+ 00: Literal
+ 01: Copy with 1-byte offset
+ 10: Copy with 2-byte offset
+ 11: Copy with 3-byte offset
+
+The interpretation of the upper six bits are element-dependent.
+
+
+2.1. Literals (00)
+
+Literals are uncompressed data stored directly in the byte stream.
+The literal length is stored differently depending on the length
+of the literal:
+
+ - For literals up to and including 60 bytes in length, the upper
+ six bits of the tag byte contain (len-1). The literal follows
+ immediately thereafter in the bytestream.
+ - For longer literals, the length is stored after the tag byte,
+ little-endian. The upper six bits of the tag byte describe how
+ many bytes are used for the length; 60, 61, 62 or 63 for
+ 1-4 bytes, respectively. The literal itself follows after the
+ length.
+
+
+2.2. Copies
+
+Copies are references back into previous decompressed data, telling
+the decompressor to reuse data it has previously decoded.
+They encode two values: The _offset_, saying how many bytes back
+from the current position to read, and the _length_, how many bytes
+to copy. Offsets of zero can be encoded, but are not legal;
+similarly, it is possible to encode backreferences that would
+go past the end of the block (offset > current decompressed position),
+which is also nonsensical and thus not allowed.
+
+As in most LZ77-based compressors, the length can be larger than the offset,
+yielding a form of run-length encoding (RLE). For instance,
+"xababab" could be encoded as
+
+ <literal: "xab"> <copy: offset=2 length=4>
+
+Note that since the current Snappy compressor works in 32 kB
+blocks and does not do matching across blocks, it will never produce
+a bitstream with offsets larger than about 32768. However, the
+decompressor should not rely on this, as it may change in the future.
+
+There are several different kinds of copy elements, depending on
+the amount of bytes to be copied (length), and how far back the
+data to be copied is (offset).
+
+
+2.2. Copy with 1-byte offset (01)
+
+These elements can encode lengths between [4..11] bytes and offsets
+between [0..2047] bytes. (len-4) occupies three bits and is stored
+in bits [2..4] of the tag byte. The offset occupies 11 bits, of which the
+upper three are stored in the upper three bits ([5..7]) of the tag byte,
+and the lower eight are stored in a byte following the tag byte.
+
+
+2.3. Copy with 2-byte offset (10)
+
+These elements can encode lengths between [1..64] and offsets from
+[0..65535]. (len-1) occupies six bits and is stored in the upper
+six bits ([2..7]) of the tag byte. The offset is stored as a
+little-endian 16-bit integer in the two bytes following the tag byte.
+
+
+2.4. Copy with 4-byte offsets (11)
+
+These are like the copies with 2-byte offsets (see previous subsection),
+except that the offset is stored as a 32-bit integer instead of a
+16-bit integer (and thus will occupy four bytes).