diff options
Diffstat (limited to 'zlib/doc/rfc1952.txt')
-rw-r--r-- | zlib/doc/rfc1952.txt | 675 |
1 files changed, 675 insertions, 0 deletions
diff --git a/zlib/doc/rfc1952.txt b/zlib/doc/rfc1952.txt new file mode 100644 index 00000000000..a8e51b4567f --- /dev/null +++ b/zlib/doc/rfc1952.txt @@ -0,0 +1,675 @@ + + + + + + +Network Working Group P. Deutsch +Request for Comments: 1952 Aladdin Enterprises +Category: Informational May 1996 + + + GZIP file format specification version 4.3 + +Status of This Memo + + This memo provides information for the Internet community. This memo + does not specify an Internet standard of any kind. Distribution of + this memo is unlimited. + +IESG Note: + + The IESG takes no position on the validity of any Intellectual + Property Rights statements contained in this document. + +Notices + + Copyright (c) 1996 L. Peter Deutsch + + Permission is granted to copy and distribute this document for any + purpose and without charge, including translations into other + languages and incorporation into compilations, provided that the + copyright notice and this notice are preserved, and that any + substantive changes or deletions from the original are clearly + marked. + + A pointer to the latest version of this and related documentation in + HTML format can be found at the URL + <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>. + +Abstract + + This specification defines a lossless compressed data format that is + compatible with the widely used GZIP utility. The format includes a + cyclic redundancy check value for detecting data corruption. The + format presently uses the DEFLATE method of compression but can be + easily extended to use other compression methods. The format can be + implemented readily in a manner not covered by patents. + + + + + + + + + + +Deutsch Informational [Page 1] + +RFC 1952 GZIP File Format Specification May 1996 + + +Table of Contents + + 1. Introduction ................................................... 2 + 1.1. Purpose ................................................... 2 + 1.2. Intended audience ......................................... 3 + 1.3. Scope ..................................................... 3 + 1.4. Compliance ................................................ 3 + 1.5. Definitions of terms and conventions used ................. 3 + 1.6. Changes from previous versions ............................ 3 + 2. Detailed specification ......................................... 4 + 2.1. Overall conventions ....................................... 4 + 2.2. File format ............................................... 5 + 2.3. Member format ............................................. 5 + 2.3.1. Member header and trailer ........................... 6 + 2.3.1.1. Extra field ................................... 8 + 2.3.1.2. Compliance .................................... 9 + 3. References .................................................. 9 + 4. Security Considerations .................................... 10 + 5. Acknowledgements ........................................... 10 + 6. Author's Address ........................................... 10 + 7. Appendix: Jean-Loup Gailly's gzip utility .................. 11 + 8. Appendix: Sample CRC Code .................................. 11 + +1. Introduction + + 1.1. Purpose + + The purpose of this specification is to define a lossless + compressed data format that: + + * Is independent of CPU type, operating system, file system, + and character set, and hence can be used for interchange; + * Can compress or decompress a data stream (as opposed to a + randomly accessible file) to produce another data stream, + using only an a priori bounded amount of intermediate + storage, and hence can be used in data communications or + similar structures such as Unix filters; + * Compresses data with efficiency comparable to the best + currently available general-purpose compression methods, + and in particular considerably better than the "compress" + program; + * Can be implemented readily in a manner not covered by + patents, and hence can be practiced freely; + * Is compatible with the file format produced by the current + widely used gzip utility, in that conforming decompressors + will be able to read data produced by the existing gzip + compressor. + + + + +Deutsch Informational [Page 2] + +RFC 1952 GZIP File Format Specification May 1996 + + + The data format defined by this specification does not attempt to: + + * Provide random access to compressed data; + * Compress specialized data (e.g., raster graphics) as well as + the best currently available specialized algorithms. + + 1.2. Intended audience + + This specification is intended for use by implementors of software + to compress data into gzip format and/or decompress data from gzip + format. + + The text of the specification assumes a basic background in + programming at the level of bits and other primitive data + representations. + + 1.3. Scope + + The specification specifies a compression method and a file format + (the latter assuming only that a file can store a sequence of + arbitrary bytes). It does not specify any particular interface to + a file system or anything about character sets or encodings + (except for file names and comments, which are optional). + + 1.4. Compliance + + Unless otherwise indicated below, a compliant decompressor must be + able to accept and decompress any file that conforms to all the + specifications presented here; a compliant compressor must produce + files that conform to all the specifications presented here. The + material in the appendices is not part of the specification per se + and is not relevant to compliance. + + 1.5. Definitions of terms and conventions used + + byte: 8 bits stored or transmitted as a unit (same as an octet). + (For this specification, a byte is exactly 8 bits, even on + machines which store a character on a number of bits different + from 8.) See below for the numbering of bits within a byte. + + 1.6. Changes from previous versions + + There have been no technical changes to the gzip format since + version 4.1 of this specification. In version 4.2, some + terminology was changed, and the sample CRC code was rewritten for + clarity and to eliminate the requirement for the caller to do pre- + and post-conditioning. Version 4.3 is a conversion of the + specification to RFC style. + + + +Deutsch Informational [Page 3] + +RFC 1952 GZIP File Format Specification May 1996 + + +2. Detailed specification + + 2.1. Overall conventions + + In the diagrams below, a box like this: + + +---+ + | | <-- the vertical bars might be missing + +---+ + + represents one byte; a box like this: + + +==============+ + | | + +==============+ + + represents a variable number of bytes. + + Bytes stored within a computer do not have a "bit order", since + they are always treated as a unit. However, a byte considered as + an integer between 0 and 255 does have a most- and least- + significant bit, and since we write numbers with the most- + significant digit on the left, we also write bytes with the most- + significant bit on the left. In the diagrams below, we number the + bits of a byte so that bit 0 is the least-significant bit, i.e., + the bits are numbered: + + +--------+ + |76543210| + +--------+ + + This document does not address the issue of the order in which + bits of a byte are transmitted on a bit-sequential medium, since + the data format described here is byte- rather than bit-oriented. + + Within a computer, a number may occupy multiple bytes. All + multi-byte numbers in the format described here are stored with + the least-significant byte first (at the lower memory address). + For example, the decimal number 520 is stored as: + + 0 1 + +--------+--------+ + |00001000|00000010| + +--------+--------+ + ^ ^ + | | + | + more significant byte = 2 x 256 + + less significant byte = 8 + + + +Deutsch Informational [Page 4] + +RFC 1952 GZIP File Format Specification May 1996 + + + 2.2. File format + + A gzip file consists of a series of "members" (compressed data + sets). The format of each member is specified in the following + section. The members simply appear one after another in the file, + with no additional information before, between, or after them. + + 2.3. Member format + + Each member has the following structure: + + +---+---+---+---+---+---+---+---+---+---+ + |ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->) + +---+---+---+---+---+---+---+---+---+---+ + + (if FLG.FEXTRA set) + + +---+---+=================================+ + | XLEN |...XLEN bytes of "extra field"...| (more-->) + +---+---+=================================+ + + (if FLG.FNAME set) + + +=========================================+ + |...original file name, zero-terminated...| (more-->) + +=========================================+ + + (if FLG.FCOMMENT set) + + +===================================+ + |...file comment, zero-terminated...| (more-->) + +===================================+ + + (if FLG.FHCRC set) + + +---+---+ + | CRC16 | + +---+---+ + + +=======================+ + |...compressed blocks...| (more-->) + +=======================+ + + 0 1 2 3 4 5 6 7 + +---+---+---+---+---+---+---+---+ + | CRC32 | ISIZE | + +---+---+---+---+---+---+---+---+ + + + + +Deutsch Informational [Page 5] + +RFC 1952 GZIP File Format Specification May 1996 + + + 2.3.1. Member header and trailer + + ID1 (IDentification 1) + ID2 (IDentification 2) + These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 + (0x8b, \213), to identify the file as being in gzip format. + + CM (Compression Method) + This identifies the compression method used in the file. CM + = 0-7 are reserved. CM = 8 denotes the "deflate" + compression method, which is the one customarily used by + gzip and which is documented elsewhere. + + FLG (FLaGs) + This flag byte is divided into individual bits as follows: + + bit 0 FTEXT + bit 1 FHCRC + bit 2 FEXTRA + bit 3 FNAME + bit 4 FCOMMENT + bit 5 reserved + bit 6 reserved + bit 7 reserved + + If FTEXT is set, the file is probably ASCII text. This is + an optional indication, which the compressor may set by + checking a small amount of the input data to see whether any + non-ASCII characters are present. In case of doubt, FTEXT + is cleared, indicating binary data. For systems which have + different file formats for ascii text and binary data, the + decompressor can use FTEXT to choose the appropriate format. + We deliberately do not specify the algorithm used to set + this bit, since a compressor always has the option of + leaving it cleared and a decompressor always has the option + of ignoring it and letting some other program handle issues + of data conversion. + + If FHCRC is set, a CRC16 for the gzip header is present, + immediately before the compressed data. The CRC16 consists + of the two least significant bytes of the CRC32 for all + bytes of the gzip header up to and not including the CRC16. + [The FHCRC bit was never set by versions of gzip up to + 1.2.4, even though it was documented with a different + meaning in gzip 1.2.4.] + + If FEXTRA is set, optional extra fields are present, as + described in a following section. + + + +Deutsch Informational [Page 6] + +RFC 1952 GZIP File Format Specification May 1996 + + + If FNAME is set, an original file name is present, + terminated by a zero byte. The name must consist of ISO + 8859-1 (LATIN-1) characters; on operating systems using + EBCDIC or any other character set for file names, the name + must be translated to the ISO LATIN-1 character set. This + is the original name of the file being compressed, with any + directory components removed, and, if the file being + compressed is on a file system with case insensitive names, + forced to lower case. There is no original file name if the + data was compressed from a source other than a named file; + for example, if the source was stdin on a Unix system, there + is no file name. + + If FCOMMENT is set, a zero-terminated file comment is + present. This comment is not interpreted; it is only + intended for human consumption. The comment must consist of + ISO 8859-1 (LATIN-1) characters. Line breaks should be + denoted by a single line feed character (10 decimal). + + Reserved FLG bits must be zero. + + MTIME (Modification TIME) + This gives the most recent modification time of the original + file being compressed. The time is in Unix format, i.e., + seconds since 00:00:00 GMT, Jan. 1, 1970. (Note that this + may cause problems for MS-DOS and other systems that use + local rather than Universal time.) If the compressed data + did not come from a file, MTIME is set to the time at which + compression started. MTIME = 0 means no time stamp is + available. + + XFL (eXtra FLags) + These flags are available for use by specific compression + methods. The "deflate" method (CM = 8) sets these flags as + follows: + + XFL = 2 - compressor used maximum compression, + slowest algorithm + XFL = 4 - compressor used fastest algorithm + + OS (Operating System) + This identifies the type of file system on which compression + took place. This may be useful in determining end-of-line + convention for text files. The currently defined values are + as follows: + + + + + + +Deutsch Informational [Page 7] + +RFC 1952 GZIP File Format Specification May 1996 + + + 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32) + 1 - Amiga + 2 - VMS (or OpenVMS) + 3 - Unix + 4 - VM/CMS + 5 - Atari TOS + 6 - HPFS filesystem (OS/2, NT) + 7 - Macintosh + 8 - Z-System + 9 - CP/M + 10 - TOPS-20 + 11 - NTFS filesystem (NT) + 12 - QDOS + 13 - Acorn RISCOS + 255 - unknown + + XLEN (eXtra LENgth) + If FLG.FEXTRA is set, this gives the length of the optional + extra field. See below for details. + + CRC32 (CRC-32) + This contains a Cyclic Redundancy Check value of the + uncompressed data computed according to CRC-32 algorithm + used in the ISO 3309 standard and in section 8.1.1.6.2 of + ITU-T recommendation V.42. (See http://www.iso.ch for + ordering ISO documents. See gopher://info.itu.ch for an + online version of ITU-T V.42.) + + ISIZE (Input SIZE) + This contains the size of the original (uncompressed) input + data modulo 2^32. + + 2.3.1.1. Extra field + + If the FLG.FEXTRA bit is set, an "extra field" is present in + the header, with total length XLEN bytes. It consists of a + series of subfields, each of the form: + + +---+---+---+---+==================================+ + |SI1|SI2| LEN |... LEN bytes of subfield data ...| + +---+---+---+---+==================================+ + + SI1 and SI2 provide a subfield ID, typically two ASCII letters + with some mnemonic value. Jean-Loup Gailly + <gzip@prep.ai.mit.edu> is maintaining a registry of subfield + IDs; please send him any subfield ID you wish to use. Subfield + IDs with SI2 = 0 are reserved for future use. The following + IDs are currently defined: + + + +Deutsch Informational [Page 8] + +RFC 1952 GZIP File Format Specification May 1996 + + + SI1 SI2 Data + ---------- ---------- ---- + 0x41 ('A') 0x70 ('P') Apollo file type information + + LEN gives the length of the subfield data, excluding the 4 + initial bytes. + + 2.3.1.2. Compliance + + A compliant compressor must produce files with correct ID1, + ID2, CM, CRC32, and ISIZE, but may set all the other fields in + the fixed-length part of the header to default values (255 for + OS, 0 for all others). The compressor must set all reserved + bits to zero. + + A compliant decompressor must check ID1, ID2, and CM, and + provide an error indication if any of these have incorrect + values. It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC + at least so it can skip over the optional fields if they are + present. It need not examine any other part of the header or + trailer; in particular, a decompressor may ignore FTEXT and OS + and always produce binary output, and still be compliant. A + compliant decompressor must give an error indication if any + reserved bit is non-zero, since such a bit could indicate the + presence of a new field that would cause subsequent data to be + interpreted incorrectly. + +3. References + + [1] "Information Processing - 8-bit single-byte coded graphic + character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987). + The ISO 8859-1 (Latin-1) character set is a superset of 7-bit + ASCII. Files defining this character set are available as + iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/ + + [2] ISO 3309 + + [3] ITU-T recommendation V.42 + + [4] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification", + available in ftp://ftp.uu.net/pub/archiving/zip/doc/ + + [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in + ftp://prep.ai.mit.edu/pub/gnu/ + + [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table + Look-Up", Communications of the ACM, 31(8), pp.1008-1013. + + + + +Deutsch Informational [Page 9] + +RFC 1952 GZIP File Format Specification May 1996 + + + [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal, + pp.118-133. + + [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt, + describing the CRC concept. + +4. Security Considerations + + Any data compression method involves the reduction of redundancy in + the data. Consequently, any corruption of the data is likely to have + severe effects and be difficult to correct. Uncompressed text, on + the other hand, will probably still be readable despite the presence + of some corrupted bytes. + + It is recommended that systems using this data format provide some + means of validating the integrity of the compressed data, such as by + setting and checking the CRC-32 check value. + +5. Acknowledgements + + Trademarks cited in this document are the property of their + respective owners. + + Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler, + the related software described in this specification. Glenn + Randers-Pehrson converted this document to RFC and HTML format. + +6. Author's Address + + L. Peter Deutsch + Aladdin Enterprises + 203 Santa Margarita Ave. + Menlo Park, CA 94025 + + Phone: (415) 322-0103 (AM only) + FAX: (415) 322-1734 + EMail: <ghost@aladdin.com> + + Questions about the technical content of this specification can be + sent by email to: + + Jean-Loup Gailly <gzip@prep.ai.mit.edu> and + Mark Adler <madler@alumni.caltech.edu> + + Editorial comments on this specification can be sent by email to: + + L. Peter Deutsch <ghost@aladdin.com> and + Glenn Randers-Pehrson <randeg@alumni.rpi.edu> + + + +Deutsch Informational [Page 10] + +RFC 1952 GZIP File Format Specification May 1996 + + +7. Appendix: Jean-Loup Gailly's gzip utility + + The most widely used implementation of gzip compression, and the + original documentation on which this specification is based, were + created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>. Since this + implementation is a de facto standard, we mention some more of its + features here. Again, the material in this section is not part of + the specification per se, and implementations need not follow it to + be compliant. + + When compressing or decompressing a file, gzip preserves the + protection, ownership, and modification time attributes on the local + file system, since there is no provision for representing protection + attributes in the gzip file format itself. Since the file format + includes a modification time, the gzip decompressor provides a + command line switch that assigns the modification time from the file, + rather than the local modification time of the compressed input, to + the decompressed output. + +8. Appendix: Sample CRC Code + + The following sample code represents a practical implementation of + the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42 + for a formal specification.) + + The sample code is in the ANSI C programming language. Non C users + may find it easier to read with these hints: + + & Bitwise AND operator. + ^ Bitwise exclusive-OR operator. + >> Bitwise right shift operator. When applied to an + unsigned quantity, as here, right shift inserts zero + bit(s) at the left. + ! Logical NOT operator. + ++ "n++" increments the variable n. + 0xNNN 0x introduces a hexadecimal (base 16) constant. + Suffix L indicates a long value (at least 32 bits). + + /* Table of CRCs of all 8-bit messages. */ + unsigned long crc_table[256]; + + /* Flag: has the table been computed? Initially false. */ + int crc_table_computed = 0; + + /* Make the table for a fast CRC. */ + void make_crc_table(void) + { + unsigned long c; + + + +Deutsch Informational [Page 11] + +RFC 1952 GZIP File Format Specification May 1996 + + + int n, k; + for (n = 0; n < 256; n++) { + c = (unsigned long) n; + for (k = 0; k < 8; k++) { + if (c & 1) { + c = 0xedb88320L ^ (c >> 1); + } else { + c = c >> 1; + } + } + crc_table[n] = c; + } + crc_table_computed = 1; + } + + /* + Update a running crc with the bytes buf[0..len-1] and return + the updated crc. The crc should be initialized to zero. Pre- and + post-conditioning (one's complement) is performed within this + function so it shouldn't be done by the caller. Usage example: + + unsigned long crc = 0L; + + while (read_buffer(buffer, length) != EOF) { + crc = update_crc(crc, buffer, length); + } + if (crc != original_crc) error(); + */ + unsigned long update_crc(unsigned long crc, + unsigned char *buf, int len) + { + unsigned long c = crc ^ 0xffffffffL; + int n; + + if (!crc_table_computed) + make_crc_table(); + for (n = 0; n < len; n++) { + c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8); + } + return c ^ 0xffffffffL; + } + + /* Return the CRC of the bytes buf[0..len-1]. */ + unsigned long crc(unsigned char *buf, int len) + { + return update_crc(0L, buf, len); + } + + + + +Deutsch Informational [Page 12] + |