A Fast Method of Identifying Plain Text Files
=============================================


Introduction
------------

Given a file coming from an unknown source, it is generally impossible
to determine automatically, with 100% accuracy, whether that file is a
plain text file, without performing a heavy-duty semantic analysis of
the file contents. It is, however, possible to obtain a fairly high
degree of accuracy by employing various simple heuristics.

Previous versions of the zip tools used a crude detection scheme,
originally employed by PKWare in its PKZip programs: if more than 80%
(4/5) of the bytes are within the range [7..127], the file is labeled
as plain text; otherwise it is labeled as binary. A prominent
limitation of this scheme is its restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often mis-identified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low. Another weakness of this scheme is a reduced precision, due to
the false positives that may occur when binary files containing a
large amount of textual characters are mis-identified as plain text.

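For reference, the old scheme amounts to something like the following
C sketch; the function name and buffer-based interface are
hypothetical, not taken from the actual PKZip or zip sources:

    #include <stddef.h>

    /* Old PKZip-style heuristic (hypothetical sketch): label the
     * buffer as text when more than 80% (4/5) of its bytes fall
     * within the range [7..127]. */
    static int old_is_text(const unsigned char *buf, size_t len)
    {
        size_t in_range = 0, i;

        for (i = 0; i < len; i++)
            if (buf[i] >= 7 && buf[i] <= 127)
                in_range++;
        return len > 0 && in_range * 5 > len * 4;  /* > 4/5 of bytes */
    }
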
In this article we propose a new detection scheme, with much improved
accuracy and precision, and a near-100% recall. This scheme is
designed to work on ASCII and ASCII-derived alphabets, and it handles
single-byte alphabets (ISO-8859, OEM, KOI-8, etc.) and variable-sized
alphabets (DBCS, UTF-8, etc.). However, it cannot handle fixed-sized,
multi-byte alphabets (UCS-2, UCS-4), nor UTF-16. The principle used by
this scheme can easily be adapted to non-ASCII alphabets like EBCDIC.


The Algorithm
-------------

The algorithm works by dividing the set of bytes [0..255] into three
categories:
- The white list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC)
- The black list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 25, 28 to 31.

If a file contains at least one byte that belongs to the white list,
and no byte that belongs to the black list, then the file is
categorized as plain text; otherwise, it is categorized as binary.
Bytes from the gray list are merely tolerated: they neither qualify
nor disqualify a file.

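As an illustration, here is a minimal sketch of this rule in C; the
function name and interface are hypothetical, not taken from any
actual zip tool:

    #include <stddef.h>

    /* Return 1 if buf[0..len-1] is categorized as plain text, 0 if
     * binary.  An empty buffer contains no white-listed byte, so it
     * falls into the binary category, as noted in the Rationale. */
    static int is_plain_text(const unsigned char *buf, size_t len)
    {
        int has_white = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            unsigned char c = buf[i];

            if (c == 9 || c == 10 || c == 13 || c >= 32)
                has_white = 1;          /* white list: textual byte */
            else if (c == 7 || c == 8 || c == 11 || c == 12 ||
                     c == 26 || c == 27)
                ;                       /* gray list: ignored */
            else
                return 0;               /* black list: binary */
        }
        return has_white;
    }
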
Rationale
---------

The idea behind this algorithm rests on two observations.

The first observation is that, although the full range of 7-bit codes
(0..127) is properly specified by the ASCII standard, most control
characters in the range 0..31 are not used in practice. The only
widely-used, almost universally-portable control codes are 9 (TAB),
10 (LF), and 13 (CR). A few more control codes are recognized by a
smaller range of platforms and text viewers/editors: 7 (BEL), 8 (BS),
11 (VT), 12 (FF), 26 (SUB), and 27 (ESC); but these codes are rarely
(if ever) used alone, without being accompanied by some printable
text. Even the newer, portable text formats, such as XML, avoid using
control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL). Although the older detection
schemes look for non-ASCII codes from the range [128..255] as a sign
of binary data, precision rarely suffers if this upper range is
labeled as textual, because genuinely binary files tend to contain
both control characters and codes from the upper range. On the other
hand, the upper range must be labeled as textual, because it is used
by virtually all ASCII extensions. In particular, this range is
heavily used to encode non-Latin scripts.

Given these two observations, the plain text detection algorithm
becomes straightforward. There must be at least some printable
material, or some portable whitespace such as TAB, CR or LF;
otherwise the file is not labeled as plain text. (The boundary case,
when the file is empty, automatically falls into this category.)
However, there must be no non-portable control characters; otherwise
it is very likely that the intended reader of that file is a machine,
rather than a human.

Since there is no counting involved, other than simply observing the
presence or the absence of certain byte values, the algorithm produces
uniform results on any given text file, no matter what alphabet
encoding is used for that text. (In contrast, if counting were
involved, it would be possible to obtain different results on a text
encoded, say, in ISO-8859-2 versus UTF-8.) There is a category of
plain text files that are "polluted" with one or a few black-listed
codes, either by mistake or by peculiar design considerations. In such
cases, a scheme that tolerates a small percentage of black-listed
codes would provide increased recall (i.e. more true positives). This,
however, comes at the cost of reduced precision, since false positives
are also more likely to appear in binary files that contain large
chunks of textual data. "Polluted" plain text may, in fact, be
regarded as binary, on which text conversions should not be performed.
Under this premise, it is safe to say that the detection method
provides a near-100% recall.

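For comparison, a hypothetical tolerant variant, with a
caller-supplied threshold, might look like the sketch below. It is
not the scheme adopted here; it only illustrates the recall/precision
trade-off discussed above:

    #include <stddef.h>

    /* Hypothetical variant: tolerate black-listed bytes up to the
     * given percentage of the file (e.g. tol = 1 allows up to 1%).
     * With tol = 0 it reduces to the strict scheme above. */
    static int is_mostly_text(const unsigned char *buf, size_t len,
                              unsigned tol)
    {
        size_t black = 0, white = 0, i;

        for (i = 0; i < len; i++) {
            unsigned char c = buf[i];

            if (c == 9 || c == 10 || c == 13 || c >= 32)
                white++;                /* white list */
            else if (!(c == 7 || c == 8 || c == 11 || c == 12 ||
                       c == 26 || c == 27))
                black++;                /* black list */
        }
        return white > 0 && black * 100 <= len * tol;
    }
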
Experiments have been run on a large set of files of various
categories, including plain old texts, system logs, source code,
formatted office documents, compiled object code, etcetera. The
results confirm the optimistic assumptions about the high accuracy,
precision and recall offered by this algorithm.


--
Cosmin Truta
Last updated: 2005-Feb-27