delta/cpython-git.git/Modules/unicodedata.c, branch fix-namedexpr-comment

bpo-37752: Delete redundant Py_CHARMASK in normalizestring() (GH-15095)

2019-09-10T16:04:08+00:00

bpo-38043: Use `bool` for boolean flags on is_normalized_quickcheck. (GH-15711)

2019-09-09T09:16:31+00:00

closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)

2019-09-04T02:45:44+00:00

The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.
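
For reference, the equivalence described above can be sketched in Python.
The helper name `is_normalized_slow` is hypothetical and used only for
illustration, and `unicodedata.is_normalized` assumes Python 3.8 or later:

```python
import unicodedata

def is_normalized_slow(form, s):
    """Reference definition: normalize the whole string, then compare."""
    return s == unicodedata.normalize(form, s)

# The fast path is expected to agree with the reference definition.
samples = ["abc", "\uf900", "e\u0301", "\u00e9", "\u0338"]
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for s in samples:
        assert unicodedata.is_normalized(form, s) == is_normalized_slow(form, s)
```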

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
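
As a rough illustration only (not the C code in this patch), the UAX #15
quick check for NFD might look like the sketch below.  It assumes that
normalizing a single code point can stand in for the NFD_QC property
table, and the name `nfd_quick_check` is hypothetical.  NFD has no MAYBE
value, so the result is a plain boolean:

```python
import unicodedata

def nfd_quick_check(s):
    """Return False for a definite NO, True for YES (NFD only)."""
    last_ccc = 0
    for ch in s:
        ccc = unicodedata.combining(ch)  # canonical combining class
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return False
        # Stand-in for NFD_QC=No (an assumption of this sketch): a single
        # code point that changes under NFD has a canonical decomposition.
        if unicodedata.normalize("NFD", ch) != ch:
            return False
        last_ccc = ccc
    return True
```

The loop can stop at the first NO, which is why a string such as
"\uf900" repeated many times can be rejected without normalizing it.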

In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

  $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
      -- 'unicodedata.normalize("NFD", s)'
  500 loops, best of 5: 561 usec per loop

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
      -- 'unicodedata.normalize("NFD", s)'
  500 loops, best of 5: 512 usec per loop

bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464)

2019-05-31T02:13:39+00:00

Automatically replace
tp_print -> tp_vectorcall_offset
tp_compare -> tp_as_async
tp_reserved -> tp_as_async

bpo-36642: make unicodedata const (GH-12855)

2019-04-16T23:40:34+00:00

closes bpo-32285: Add unicodedata.is_normalized. (GH-4806)

2018-11-04T23:58:24+00:00

bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)

2018-06-15T12:03:14+00:00

The Hangul composition bounds checks are wrong for the second character
(the range should be [0x1161, 0x1176), not [0x1161, 0x1176]) and for the
third character (it should be (0x11A7, 0x11C3), not [0x11A7, 0x11C3]).
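
A minimal Python sketch of the range checks, using the Hangul constants
from the Unicode standard (SBase, LBase, VBase, TBase and the counts).
The function name `compose_hangul` is hypothetical; this is not the C
code from the patch:

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_hangul(l, v, t=None):
    """Compose jamo code points into a syllable, or return None."""
    if not (L_BASE <= l < L_BASE + L_COUNT):
        return None
    # Second character: half-open range [0x1161, 0x1176).
    if not (V_BASE <= v < V_BASE + V_COUNT):
        return None
    syllable = S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT
    if t is not None:
        # Third character: open range (0x11A7, 0x11C3); TBase itself
        # means "no trailing consonant" and is not a valid jamo here.
        if not (T_BASE < t < T_BASE + T_COUNT):
            return None
        syllable += t - T_BASE
    return syllable
```

With these bounds, u1176, u11a7 and u11c3 are all rejected, which are
the three code points named in the commit subject.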

update to Unicode 11.0.0 (closes bpo-33778) (GH-7439)

2018-06-07T03:14:28+00:00

Also, standardize indentation of generated tables.

Fix miscellaneous typos (#4275)

2017-11-05T13:37:50+00:00

bpo-30736: upgrade to Unicode 10.0 (#2344)

2017-06-23T05:31:08+00:00

Straightforward. While we're at it, though, strip trailing whitespace from generated tables.