summaryrefslogtreecommitdiff
path: root/lib
diff options
context:
space:
mode:
authorJarkko Hietaniemi <jhi@iki.fi>2001-11-06 03:05:34 +0000
committerJarkko Hietaniemi <jhi@iki.fi>2001-11-06 03:05:34 +0000
commita72c75842468bcd2a7cf17032844c4040a5a31e2 (patch)
treef1d67259d9b154926eb495b329d3239f96b9be7c /lib
parent545666dba9cc33d16d0b8341e36facdb43c44913 (diff)
downloadperl-a72c75842468bcd2a7cf17032844c4040a5a31e2.tar.gz
Implement the encoding pragma for regex literals.
p4raw-id: //depot/perl@12864
Diffstat (limited to 'lib')
-rw-r--r--lib/encoding.pm23
-rw-r--r--lib/encoding.t16
2 files changed, 36 insertions, 3 deletions
diff --git a/lib/encoding.pm b/lib/encoding.pm
index 6f5970f2ca..94ee3231fb 100644
--- a/lib/encoding.pm
+++ b/lib/encoding.pm
@@ -57,14 +57,33 @@ encoding pragma you can change this default.
The pragma is a per script, not a per block lexical. Only the last
C<use encoding> matters, and it affects B<the whole script>.
+Notice that only literals (string or regular expression) having only
+legacy code points are affected: if you mix data like this
+
+ \xDF\x{100}
+
+the data is assumed to be in (Latin 1 and) Unicode, not in your native
+encoding. In other words, this will match in "greek":
+
+ "\xDF" =~ /\x{3af}/
+
+but this will not
+
+ "\xDF\x{100}" =~ /\x{3af}\x{100}/
+
+since the C<\xDF> on the left will B<not> be upgraded to C<\x{3af}>
+because of the C<\x{100}> on the left. You should not be mixing your
+legacy data and Unicode in the same string.
+
If no encoding is specified, the environment variable L<PERL_ENCODING>
is consulted. If that fails, "latin1" (ISO 8859-1) is assumed. If no
encoding can be found, C<Unknown encoding '...'> error will be thrown.
=head1 KNOWN PROBLEMS
-Literals in regular expressions are not affected by this pragma.
-They very probably should.
+For native multibyte encodings (either fixed or variable length)
+the current implementation of the regular expressions may introduce
+recoding errors for longer regular expression literals than 127 bytes.
=head1 SEE ALSO
diff --git a/lib/encoding.t b/lib/encoding.t
index 03634410f4..8b14c8881a 100644
--- a/lib/encoding.t
+++ b/lib/encoding.t
@@ -1,4 +1,4 @@
-print "1..15\n";
+print "1..19\n";
use encoding "latin1"; # ignored (overwritten by the next line)
use encoding "greek"; # iso 8859-7 (no "latin" alias, surprise...)
@@ -68,3 +68,17 @@ print "ok 14\n";
print "not " unless ord(substr($c, 2, 1)) == 0x3af;
print "ok 15\n";
+# regex literals
+
+print "not " unless "\xDF" =~ /\x{3AF}/;
+print "ok 16\n";
+
+print "not " unless "\x{3AF}" =~ /\xDF/;
+print "ok 17\n";
+
+print "not " unless "\xDF" =~ /\xDF/;
+print "ok 18\n";
+
+print "not " unless "\x{3AF}" =~ /\x{3AF}/;
+print "ok 19\n";
+