Reduce size of backend scanner's tables.

Previously, the core scanner's yy_transition[] array had 37045 elements. Since that number is larger than INT16_MAX, Flex generated the array to contain 32-bit integers. By reimplementing some of the bulkier scanner rules, this patch reduces the array to 20495 elements. The much smaller total length, combined with the consequent use of 16-bit integers for the array elements, reduces the binary size by over 200kB. This was accomplished in two ways:

1. Consolidate handling of quote continuations into a new start condition, rather than duplicating that logic for five different string types.

2. Treat Unicode strings and identifiers followed by a UESCAPE sequence as three separate tokens, rather than one. The logic to de-escape Unicode strings is moved to the filter code in parser.c, which already had the ability to provide special processing for token sequences. While we could have implemented the conversion in the grammar, that approach was rejected for performance and maintainability reasons.

Performance in microbenchmarks of raw parsing seems equal or slightly faster in most cases, and it's reasonable to expect that in real-world usage (with more competition for the CPU cache) there will be a larger win. The exception is UESCAPE sequences; lexing those is about 10% slower, primarily because the scanner now has to be called three times rather than one. This seems acceptable since that feature is very rarely used.

The psql and ecpg lexers are likewise modified, primarily because we want to keep them all in sync. Since those lexers don't use the space-hogging -CF option, the space savings are much smaller, but it's still good for perhaps 10kB apiece.

While at it, merge the ecpg lexer's handling of C-style comments used in SQL and in C. Those have different rules regarding nested comments, but since we already have the ability to keep track of the previous start condition, we can use that to handle both cases within a single start condition. This matches the core scanner more closely.

John Naylor

Discussion: https://postgr.es/m/CACPNZCvaoa3EgVWm5yZhcSTX6RAtaLgniCPcBVOCwm8h3xpWkw@mail.gmail.com
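As a rough sanity check on the size figures in the message above, the sketch below estimates the table size before and after the patch. It assumes that each yy_transition[] entry holds two integers of the chosen width, as in the struct yy_trans_info that Flex emits for -CF tables; the helper name table_bytes is hypothetical and not part of the patch.

```c
#include <stddef.h>

/*
 * Estimated size in bytes of a yy_transition[]-style table.
 * ASSUMPTION: each entry is a pair of integers of int_width bytes
 * (a "verify" field and a "next" field), with no padding.
 */
static size_t
table_bytes(size_t n_elements, size_t int_width)
{
	return n_elements * 2 * int_width;
}

/* Estimated saving when going from the old table to the new one. */
static size_t
estimated_saving(void)
{
	size_t		before = table_bytes(37045, 4);	/* 32-bit entries */
	size_t		after = table_bytes(20495, 2);	/* 16-bit entries */

	return before - after;
}
```

Under that assumption the estimate comes to 214380 bytes, which is consistent with the "over 200kB" reported in the message.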
author: Tom Lane <tgl@sss.pgh.pa.us> 2020-01-13 15:04:31 -0500
committer: Tom Lane <tgl@sss.pgh.pa.us> 2020-01-13 15:04:31 -0500
commit: 7f380c59f800f7e0fb49f45a6ff7787256851a59 (patch)
tree: 76743b1ec372574af81c2d1340180ef809b9a542 /src/include/mb
parent: 259bbe177808986e5d226ea7ce5a1ebb74657791 (diff)
download: postgresql-7f380c59f800f7e0fb49f45a6ff7787256851a59.tar.gz
1 files changed, 22 insertions, 0 deletions
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 07ebc6365b..7fb5fa4111 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -509,6 +509,28 @@ typedef uint32 (*utf_local_conversion_func) (uint32 code);
 
 
 /*
+ * Some handy functions for Unicode-specific tests.
+ */
+static inline bool
+is_utf16_surrogate_first(pg_wchar c)
+{
+	return (c >= 0xD800 && c <= 0xDBFF);
+}
+
+static inline bool
+is_utf16_surrogate_second(pg_wchar c)
+{
+	return (c >= 0xDC00 && c <= 0xDFFF);
+}
+
+static inline pg_wchar
+surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
+{
+	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
+}
+
+
+/*
  * These functions are considered part of libpq's exported API and
  * are also declared in libpq-fe.h.
  */
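To illustrate how the three helpers added in this hunk might be combined, here is a minimal standalone sketch. The pg_wchar typedef is a stand-in for the one in pg_wchar.h, and decode_utf16_pair is a hypothetical wrapper written for this example; it is not part of the patch.

```c
#include <stdbool.h>

typedef unsigned int pg_wchar;	/* stand-in for the typedef in pg_wchar.h */

/* The three helpers, copied from the hunk above. */
static inline bool
is_utf16_surrogate_first(pg_wchar c)
{
	return (c >= 0xD800 && c <= 0xDBFF);
}

static inline bool
is_utf16_surrogate_second(pg_wchar c)
{
	return (c >= 0xDC00 && c <= 0xDFFF);
}

static inline pg_wchar
surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
{
	return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
}

/*
 * Hypothetical wrapper: combine a high/low surrogate pair into one code
 * point.  Returns false, leaving *result untouched, if the inputs are not
 * a valid high surrogate followed by a valid low surrogate.
 */
static bool
decode_utf16_pair(pg_wchar first, pg_wchar second, pg_wchar *result)
{
	if (!is_utf16_surrogate_first(first) ||
		!is_utf16_surrogate_second(second))
		return false;
	*result = surrogate_pair_to_codepoint(first, second);
	return true;
}
```

For example, the pair 0xD83D, 0xDE00 combines to the code point 0x1F600, while a pair whose first element is not a high surrogate is rejected.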