Fix [ 3402314 ] non-ASCII whitespace and punctuation around inline markup.

Revision of the rules for allowed characters around inline markup start-string and end-string: Keep the carefully crafted ASCII-character set but add Unicode categories to the sets of allowed characters. This keeps the number of "false positives" requiring escaping low while making the rules simpler and international.

This is a feature change.

git-svn-id: http://svn.code.sf.net/p/docutils/code/trunk@7243 929543f6-e4f2-0310-98a6-ba3bd3dd1d04
author: milde <milde@929543f6-e4f2-0310-98a6-ba3bd3dd1d04> 2011-12-05 19:35:32 +0000
committer: milde <milde@929543f6-e4f2-0310-98a6-ba3bd3dd1d04> 2011-12-05 19:35:32 +0000
commit: d0ffb83b6243635a5f3ab1fd0bf36c325be3c9d4 (patch)
tree: 71215c8b9aca35319260cac291cce4d043c18d0d /docutils/docs
parent: 619c77f891903f894e5093431244d00aef37355d (diff)
download: docutils-d0ffb83b6243635a5f3ab1fd0bf36c325be3c9d4.tar.gz
2 files changed, 84 insertions, 238 deletions
diff --git a/docutils/docs/dev/todo.txt b/docutils/docs/dev/todo.txt
index 6611ea854..bb9a6637e 100644
--- a/docutils/docs/dev/todo.txt
+++ b/docutils/docs/dev/todo.txt
@@ -825,10 +825,6 @@ Misc
 
   See <http://thread.gmane.org/gmane.text.docutils.user/2499>.
 
-* Change the specification so that more punctuation is allowed
-  before/after inline markup start/end string
-  (http://article.gmane.org/gmane.text.docutils.cvs/3824).
-
 * Complain about bad URI characters
   (http://article.gmane.org/gmane.text.docutils.user/2046) and
   disallow internal whitespace
@@ -1129,150 +1125,32 @@ Misc
 Inline markup recognition rules
 -------------------------------
 
-Allow unicode whitespace and punctuation around `inline markup`_. See bug
-http://sourceforge.net/tracker/?func=detail&aid=3402314&group_id=38414&atid=422030
-and the older discussion
-<http://thread.gmane.org/gmane.text.docutils.user/2765>.
-
-The rules are currently *complicated* (rules, exceptions,
-explicite character lists, exceptions of exceptions) and *incomplete*: Many
-non-ASCII characters are missing in the inline markup start-string and
-end-string recognition rules. Use cases like »German ›angular‹ quotes« are
-not recognized.
+The `inline markup`_ recognition rules were devised intentionally to allow
+90% of non-markup uses of "*", "`", "_", and "|" *without* resorting to
+backslashes.  For 9% of the remaining 10%, use inline literals or literal
+blocks. Only those who understand the escaping and inline markup rules
+should attempt the remaining 1%.  ;-)
 
 .. _inline markup: ../ref/rst/restructuredtext.html#inline-markup
 
-Proposal
-````````
-
-Define character classes based on `Unicode categories`_, possibly with some
-exceptions (for backwards compatibility or based on use cases) and use them
-in the inline markup start-string and end-string recognition rules.
-
-The following sub-section is intended to replace the 5 inline markup rules in
-the reStructuredText Markup Specification's section on `inline markup`_.
-The composition of the character classes is open for discussion_.
-
-The actual change needs to be done in `parsers.rst.states.Inliner`.
-
-Inline markup syntax rules
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The inline markup start-string and end-string recognition rules distinguish
-the following character classes based on `Unicode categories`_:
-
-_`Whitespace`:
-   :Zs: Separator, Space
-   :Zl: Separator, Line
-
-   :Zp: Separator, Paragraph
-
-   Exception: Non-breaking spaces count as Delimiters_, they may
-   immediately follow a start-string or precede an end-string.
-
-   :[ ]:  U+00A0, NO-BREAK SPACE
-   :[ ]:  U+202F, NARROW NO-BREAK SPACE
-
-_`Open`:
-   :Ps: Punctuation, Open
-   :Pi: Punctuation, Initial quote
-   :Pf: Punctuation, Final quote [#PiPf]_
-   :<:  U+003C, LESS-THAN SIGN [#ltgt]_
-
-_`Close`:
-   :Pe: Punctuation, Close
-   :Pf: Punctuation, Final quote
-   :Pi: Punctuation, Initial quote [#PiPf]_
-   :>:  U+003E, GREATER-THAN SIGN
-
-_`Delimiters`:
-   :Pd: Punctuation, Dash
-   :Po: Punctuation, Other [#Po]_
-   :[ ]:  U+00A0, NO-BREAK SPACE
-   :[ ]:  U+202F, NARROW NO-BREAK SPACE
-
-If any of the following conditions are not met, the start-string or
-end-string will not be recognized or processed:
-
-1. Inline markup start-strings must start a text block or be immediately
-   preceded by a character of the classes Whitespace_, Open_, or
-   Delimiters_.
-
-2. Inline markup start-strings must not be followed by Whitespace_.
-
-3. Inline markup end-strings must not be preceded by Whitespace_.
-
-4. Inline markup end-strings must end a text block or be immediately
-   followed by a character of the classes Whitespace_, Close_, or
-   Delimiters_.
-
-5. If an inline markup start-string is immediately preceded by a
-   single or double quote or a character from Open_, it must not be
-   immediately followed by a corresponding single or double quote or
-   character from Close.
-
-6. An inline markup end-string must be separated by at least one
-   character from the start-string.
-
-7. An unescaped backslash preceding a start-string or end-string will
-   disable markup recognition, except for the end-string of `inline
-   literals`_.  See `Escaping Mechanism`_ above for details.
-
-
-Discussion
-``````````
-
-The current markup recognition rules deviate from the above proposal in some
-cases "to allow 90% of non-markup uses of "*", "`", "_", and "|" without
-resorting to backslashes".
-
-The above proposal aims to catch 85% of non-markup uses with simpler
-rules and enable additional markup uses (e.g. »German ›angular‹ quotes«)
-without escaping. It breaks backwards compatibility in some cases.
-However, if this is "the right thing", it should be done **now**, as long
-as the project is still "beta".
-
-Character classifications in need of discussion:
-
-.. [#PiPf] Pi (Punctuation, Initial quote) characters are "usually
-   closing, sometimes opening". Pf (Punctuation, Final quote) characters
-   are "usually closing, sometimes opening". I.e., both Pi and Pf may
-   behave like Ps or Pe depending on usage. The current implementation
-   sorts them into Open_ and Close_.
-   Adding Pf to Close_ and Pi to Open_ solves e.g. the problem with
-   »German ›angular‹ quotes«.
-
-.. [#ltgt] ``<`` and ``>`` belong to the Unicode category Ms (Symbols, Math).
-   The current implementation sorts them into Open_ and Close_ because of
-   their use as angular brackets in ASCII markup.
-
-.. [#Po] The ``Po`` characters ``.,;!?`` are usually followed by
-   whitespace. The backslash ``\`` is rarely used in front of marked-up
-   text. The current implementation sorts these characters into Close_.
-
-   The Po characters ``¡¿`` open a sentence. The current
-   implementation sorts them into Open_.
+Changes need to be done in `parsers.rst.states.Inliner`.
 
 Alternatives
-````````````
 
-a) The proposal_ above:
+a) Use `Unicode categories`_ for all chars (ASCII or not)
 
-   +1  truly international (considering characters of all writing systems
-       recorded in Unicode)
-   +2  simpler specification of the rules
-   -1  more complicated implementation
+   +1  comprehensible, standards based,
+   -1  many "false positives" need escaping,
+   -1  not backwards compatible.
 
-b) Backwards compatibility
+b) full backwards compatibility
 
-   :Pi: into Open_
-   :Pf: into Close_
+   :Pi: only before start-string
+   :Pf: only behind end-string
    :Po: "conservative" sorting of other punctuation:
 
-        :``.,;!?\``: Close_
-        :````¡¿``:   Open_
-
-        Are there more?
+        :``.,;!?\\``: Close
+        :``¡¿``:   Open
 
    +1  backwards compatible,
    +1  logical extension of the existing rules,
@@ -1280,41 +1158,9 @@ b) Backwards compatibility
    -1  rules even more complicated,
    -1  not clear how to sort "other" punctuation that is currently not
        recognized,
-   -2  use cases like »German ›angular‹ quotes« not recognized.
+   -2  international quoting convention like 
+       »German ›angular‹ quotes« not recognized.
 
-c) Simple rule: merge Open_, Close, and Delimiters_
-
-   Whitespace_, Open_, Close_, and Delimiters_ may all precede or follow
-   inline markup.
-
-   +3  very comprehensible,
-   -1  false positives need escaping,
-   -2  not backwards compatible.
-
-Implementation
-``````````````
-
-Some ideas for implementing the above rules:
-
-David's regexp to match whitespace but keep NO-BREAK spaces as "invisible
-escape"::
-
-  u'(?![\xa0\u202f])\\s', re.UNICODE
-
-For punctuation, check `Unicode categories`_ with
-``unicodedata.category(ch)``
-(http://bytes.com/topic/python/answers/854011-identifying-unicode-punctuation-characters-python-regex)
-and generate a pattern string, e.g. ::
-
-  chars_open = u''.join(unichr(x) for x in range(74868)
-       if unicodedata.category(unichr(x)) in ('Ps', 'Pi', 'Pf')
-
-Do this in the setup script and use the resulting string literal?
-(Avoids re-calculation with every parsing run.)
-
-.. _inline markup: ../ref/rst/restructuredtext.html#inline-markup
-.. _inline literals: ../ref/rst/restructuredtext.html#inline-literals
-.. _escaping mechanism: ../ref/rst/restructuredtext.html#escaping-mechanism
 .. _Unicode categories:
    http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
 
diff --git a/docutils/docs/ref/rst/restructuredtext.txt b/docutils/docs/ref/rst/restructuredtext.txt
index 71fab2ba7..59679e548 100644
--- a/docutils/docs/ref/rst/restructuredtext.txt
+++ b/docutils/docs/ref/rst/restructuredtext.txt
@@ -2368,39 +2368,23 @@ Three constructs use different start-strings and end-strings:
 `Standalone hyperlinks`_ are recognized implicitly, and use no extra
 markup.
 
-The inline markup start-string and end-string recognition rules are as
-follows.  If any of the conditions are not met, the start-string or
-end-string will not be recognized or processed.
+Inline markup recognition rules
+-------------------------------
 
-1. Inline markup start-strings must start a text block or be
-   immediately preceded by whitespace, one of the ASCII
-   characters ``' " ( [ { <``, or the Unicode characters:
-
-       .. class:: borderless
-
-       ===  ==========================================================
-       ‘    (U+2018, left single-quote)
-       “    (U+201C, left double-quote)
-       ’    (U+2019, right single-quote, or apostrophe)
-       «    (U+00AB, left guillemet, or double angle quotation mark)
-       ¡    (U+00A1, inverted exclamation mark)
-       ¿    (U+00BF, inverted question mark)
-       ===  ==========================================================
-
-   The ASCII characters ``- / :`` and the Unicode characters
-
-       .. class:: borderless
+Inline markup start-strings and end-strings are only recognized if all of
+the following conditions are met:
 
-       ===  ==========================================================
-       ‐    (U+2010, hyphen)
-       ‑    (U+2011, non-breaking hyphen)
-       ‒    (U+2012, figure dash)
-       –    (U+2013, en dash)
-       —    (U+2014, em dash)
-       [ ]  (U+00A0, non-breaking space [between the brackets])
-       ===  ==========================================================
+1. Inline markup start-strings must start a text block or be
+   immediately preceded by
 
-   are _`delimiters`. They may precede or follow inline markup.
+   * whitespace,
+   * one of the ASCII characters ``- : / ' " < ( [ {`` or
+   * a non-ASCII punctuation character with `Unicode category`_
+     `Pd` (Dash),
+     `Po` (Other),
+     `Ps` (Open),
+     `Pi` (Initial quote), or
+     `Pf` (Final quote) [#PiPf]_.
 
 2. Inline markup start-strings must be immediately followed by
    non-whitespace.
@@ -2409,26 +2393,22 @@ end-string will not be recognized or processed.
    non-whitespace.
 
 4. Inline markup end-strings must end a text block or be immediately
-   followed by whitespace, the ASCII characters
-   ``' " ) ] } > . , ; ! ? \``, the Unicode characters:
-
-       .. class:: borderless
-
-       ===  ==========================================================
-       ’    (U+2019, right single-quote, or apostrophe)
-       ”    (U+201D, right double-quote)
-       »    (U+00BB, right guillemet, or double angle quotation mark)
-       ===  ==========================================================
-
-   or the `delimiters`_ listed in (1) above.
-
-5. If an inline markup start-string is immediately preceded by a
-   single or double quote, "(", "[", "{", or "<", it must not be
-   immediately followed by the corresponding single or double quote,
-   ")", "]", "}", or ">".
-
-   .. this also holds for the opening/closing Unicode character pairs
-      (since at least 05. Sep 2008).
+   followed by
+
+   * whitespace,
+   * one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >`` or
+   * a non-ASCII punctuation character with `Unicode category`_
+     `Pd` (Dash),
+     `Po` (Other),
+     `Pe` (Close),
+     `Pf` (Final quote), or
+     `Pi` (Initial quote) [#PiPf]_.
+
+5. If an inline markup start-string is immediately preceded by one of the
+   ASCII characters ``' " < ( [ {``, or a character with Unicode character
+   category `Ps`, `Pi`, or `Pf`, it must not be followed by the
+   corresponding [#corresponding-quotes]_ closing character from
+   ``' " ) ] } >`` or the categories `Pe`, `Pf`, or `Pi`.
 
 6. An inline markup end-string must be separated by at least one
    character from the start-string.
@@ -2437,32 +2417,52 @@ end-string will not be recognized or processed.
    disable markup recognition, except for the end-string of `inline
    literals`_.  See `Escaping Mechanism`_ above for details.
 
-For example, none of the following are recognized as containing inline
-markup start-strings:
+.. [#PiPf] `Pi` (Punctuation, Initial quote) characters are "usually
+   opening, sometimes closing". `Pf` (Punctuation, Final quote)
+   characters are "usually closing, sometimes opening".
+
+.. [#corresponding-quotes] For quotes, corresponding characters can be
+   any of the `quotation marks in international usage`_
+
+.. _Unicode category:
+   http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
+
+.. _quotation marks in international usage:
+   http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage
+
+The inline markup recognition rules were devised to allow 90% of non-markup
+uses of "*", "`", "_", and "|" without escaping. For example, none of the
+following terms are recognized as containing inline markup strings:
 
-- asterisks: * "*" '*' (*) (* [*] {*} 1*x BOM32_*
-- double asterisks: **  a**b O(N**2) etc.
-- backquotes: ` `` etc.
-- underscores: _ __ __init__ __init__() etc.
-- vertical bars: | || etc.
+- 2*x a**b O(N**2) e**(x*y) f(x)*f(y) a|b file*.* (breaks 1)
+- 2 * x  a ** b  (* BOM32_* ` `` _ __ | (breaks 2)
+- "*" '|' (*) [*] {*} <*>
+  ‘*’ ‚*‘ ‘*‚ ’*’ ‚*’
+  “*” „*“ “*„ ”*” „*”
+  »*« ›*‹ «*» »*» ›*› (breaks 5)
+- || (breaks 6)
+- __init__ __init__()
 
-It may be desirable to use inline literals for some of these anyhow,
+No escaping is required inside the following inline markup examples:
+
+- *2 * x  *a **b *.txt* (breaks 3)
+- *2*x a**b O(N**2) e**(x*y) f(x)*f(y) a*(1+2)* (breaks 4)
+
+It may be desirable to use `inline literals`_ for some of these anyhow,
 especially if they represent code snippets.  It's a judgment call.
 
 These cases *do* require either literal-quoting or escaping to avoid
-misinterpretation::
+misinterpretation:
 
-    *4, class_, *args, **kwargs, `TeX-quoted', *ML, *.txt
+    \*4, class\_, \*args, \**kwargs, \`TeX-quoted', \*ML, \*.txt
 
-The inline markup recognition rules were devised intentionally to
-allow 90% of non-markup uses of "*", "`", "_", and "|" *without*
-resorting to backslashes.  For 9 of the remaining 10%, use inline
-literals or literal blocks::
+In most use cases, `inline literals`_ or `literal blocks`_ are the best
+choice (by default, this also selects a monospaced font)::
 
-    "``\*``" -> "\*" (possibly in another font or quoted)
+    *4, class_, *args, **kwargs, `TeX-quoted', *ML, *.txt
 
-Only those who understand the escaping and inline markup rules should
-attempt the remaining 1%.  ;-)
+Recognition order
+-----------------
 
 Inline markup delimiter characters are used for multiple constructs,
 so to avoid ambiguity there must be a specific recognition order for
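
The character tests in rules 1 and 4 of the revised specification can be sketched in a few lines of Python. This is only an illustration of the rules as worded in the diff above, not the code in `parsers.rst.states.Inliner`. The function and constant names are hypothetical, and `str.isspace()` is assumed to be an acceptable stand-in for "whitespace".

```python
import unicodedata

# ASCII characters copied from rules 1 and 4 of the revised specification.
ASCII_BEFORE_START = set("-:/'\"<([{")
ASCII_AFTER_END = set("-.,:;!?\\/'\")]}>")

# Unicode general categories accepted for non-ASCII punctuation.
CATEGORIES_BEFORE_START = {"Pd", "Po", "Ps", "Pi", "Pf"}
CATEGORIES_AFTER_END = {"Pd", "Po", "Pe", "Pf", "Pi"}


def may_precede_start_string(ch):
    """Rule 1 (sketch): may `ch` immediately precede a start-string?"""
    if ch.isspace():  # assumption: str.isspace() models "whitespace"
        return True
    if ord(ch) < 128:  # ASCII keeps its hand-picked character set
        return ch in ASCII_BEFORE_START
    return unicodedata.category(ch) in CATEGORIES_BEFORE_START


def may_follow_end_string(ch):
    """Rule 4 (sketch): may `ch` immediately follow an end-string?"""
    if ch.isspace():  # assumption: str.isspace() models "whitespace"
        return True
    if ord(ch) < 128:
        return ch in ASCII_AFTER_END
    return unicodedata.category(ch) in CATEGORIES_AFTER_END


if __name__ == "__main__":
    # Characters from the quoting examples discussed in the commit.
    for ch in "»›«‹¡(x":
        print(repr(ch), unicodedata.category(ch),
              may_precede_start_string(ch), may_follow_end_string(ch))
```

Treating ASCII and non-ASCII characters separately mirrors the commit message: the hand-picked ASCII set is kept to limit false positives, while everything else is decided by Unicode category. Rules 2, 3, and 5 to 7 are not covered by this sketch.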