author | Karl Williamson <khw@cpan.org> | 2015-07-11 12:19:59 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2015-07-13 12:17:41 -0600 |
commit | ce4793f183b29c423cb9d2d993fb4399c8d46baa | |
tree | 7ac909ae37eac06ffd4ddc8f839c8511a57d7e58 | |
parent | 87518e92cecac2acea7073cceea51ca610774fb0 | |
Forbid variable names with ASCII non-graphic chars
See http://nntp.perl.org/group/perl.perl5.porters/229168
Also, the documentation has been updated beyond this change to clarify
related matters, based on some experimentation.
Previously, spaces couldn't be in variable names; now ASCII control
characters can't be either. The remaining permissible ASCII characters
in a variable name now must be all graphic ones.
-rw-r--r-- | pod/perldata.pod | 92 |
-rw-r--r-- | pod/perldelta.pod | 23 |
-rw-r--r-- | pod/perlvar.pod | 38 |
-rw-r--r-- | t/lib/warnings/toke | 42 |
-rw-r--r-- | t/uni/variables.t | 81 |
-rw-r--r-- | toke.c | 22 |
6 files changed, 114 insertions, 184 deletions
diff --git a/pod/perldata.pod b/pod/perldata.pod
index b695598a54..a285eb7d43 100644
--- a/pod/perldata.pod
+++ b/pod/perldata.pod
@@ -37,8 +37,8 @@ collide with one of your normal variables. Strings
 that match parenthesized parts of a regular expression are saved under
 names containing only digits after the C<$> (see L<perlop> and L<perlre>).
 In addition, several special variables that provide windows into
-the inner working of Perl have names containing punctuation characters
-and control characters. These are documented in L<perlvar>.
+the inner working of Perl have names containing punctuation characters.
+These are documented in L<perlvar>.
 X<variable, built-in>

 Scalar values are always named with '$', even when referring to a
@@ -99,11 +99,11 @@ that returns a reference to the appropriate type. For a description
 of this, see L<perlref>.

 Names that start with a digit may contain only more digits. Names
-that do not start with a letter, underscore, digit or a caret (i.e.
-a control character) are limited to one character, e.g., C<$%> or
+that do not start with a letter, underscore, digit or a caret are
+limited to one character, e.g., C<$%> or
 C<$$>. (Most of these one character names have a predefined
 significance to Perl. For instance, C<$$> is the current process
-id.)
+id. And all such names are reserved for Perl's possible use.)

 =head2 Identifier parsing
 X<identifiers>
@@ -129,7 +129,7 @@ match C<\w> (this prevents some problematic cases); and Perl
 additionally accepts identfier names beginning with an underscore.

 If not under C<use utf8>, the source is treated as ASCII + 128 extra
-controls, and identifiers should match
+generic characters, and identifiers should match

     / (?aa) (?!\d) \w+ /x

@@ -184,54 +184,66 @@ Put together, a grammar to match a basic identifier becomes
 Meanwhile, special identifiers don't follow the above rules; For the most
 part, all of the identifiers in this category have a special meaning given
 by Perl. Because they have special parsing rules, these generally can't be
-fully-qualified. They come in four forms:
+fully-qualified. They come in six forms (but don't use forms 5 and 6):

 =over

-=item *
+=item 1.

 A sigil, followed solely by digits matching C<\p{POSIX_Digit}>, like
 C<$0>, C<$1>, or C<$10000>.

-=item *
+=item 2.

-A sigil, followed by a caret and any one of the characters
-C<[][A-Z^_?\]>, like C<$^V> or C<$^]>, or a sigil followed by a literal non-space,
-non-C<NUL> control character matching the C<\p{POSIX_Cntrl}> property.
-Due to a historical oddity, if not running under C<use utf8>, the 128
-characters in the C<[0x80-0xff]> range are considered to be controls,
-and may also be used in length-one variables. However, the use of
-non-graphical characters is deprecated as of v5.22, and support for them
-will be removed in a future version of perl. ASCII space characters and
-C<NUL> already aren't allowed, so this means that a single-character
-variable name with that name being any other C0 control C<[0x01-0x1F]>,
-or C<DEL> will generate a deprecated warning. Already, under C<"use
-utf8">, non-ASCII characters must match C<Perl_XIDS>. As of v5.22, when
-not under C<"use utf8"> C1 controls C<[0x80-0x9F]>, NO BREAK SPACE, and
-SOFT HYPHEN (C<SHY>)) generate a deprecated warning.
-
-=item *
+A sigil followed by a single character matching the C<\p{POSIX_Punct}>
+property, like C<$!> or C<%+>, except the character C<"{"> doesn't work.

-Similar to the above, a sigil, followed by bareword text in braces,
-where the first character is either a caret followed by any one of
-the characters C<[][A-Z^_?\]>, like C<${^GLOBAL_PHASE}>, or a non-C<NUL>,
-non-space literal
-control like C<${\7LOBAL_PHASE}>. Like the above, when not under
-C<"use utf8">, the characters in C<[0x80-0xFF]> are considered controls, but as
-of v5.22, the use of any that are non-graphical are deprecated, and as
-of v5.20 the use of any ASCII-range literal control is deprecated.
-Support for these will be removed in a future version of perl.
+=item 3.

-=item *
+A sigil, followed by a caret and any one of the characters
+C<[][A-Z^_?\]>, like C<$^V> or C<$^]>.

-A sigil followed by a single character matching the C<\p{POSIX_Punct}>
-property, like C<$!> or C<%+>, except the character C<"{"> doesn't work.
+=item 4.
+
+Similar to the above, a sigil, followed by bareword text in braces,
+where the first character is a caret. The next character is any one of
+the characters C<[][A-Z^_?\]>, followed by ASCII word characters. An
+example is C<${^GLOBAL_PHASE}>.
+
+=item 5.
+
+A sigil, followed by any single character in the range C<[\x80-\xFF]>
+when not under C<S<"use utf8">>. (Under C<S<"use utf8">>, the normal
+identifier rules given earlier in this section apply.) Use of
+non-graphic characters (the C1 controls, the NO-BREAK SPACE, and the
+SOFT HYPHEN) is deprecated and will be forbidden in a future Perl
+version. The use of the other characters is unwise, as these are all
+reserved to have special meaning to Perl, and none of them currently
+do have special meaning, though this could change without notice.
+
+Note that an implication of this form is that there are identifiers only
+legal under C<S<"use utf8">>, and vice-versa, for example the identifier
+C<$E<233>tat> is legal under C<S<"use utf8">>, but is otherwise
+considered to be the single character variable C<$E<233>> followed by
+the bareword C<"tat">, the combination of which is a syntax error.
+
+=item 6.
+
+This is a combination of the previous two forms. It is valid only when
+not under S<C<"use utf8">> (normal identifier rules apply when under
+S<C<"use utf8">>). The form is a sigil, followed by text in braces,
+where the first character is any one of the characters in the range
+C<[\x80-\xFF]> followed by ASCII word characters up to the trailing
+brace.
+
+The same caveats as the previous form apply: The non-graphic characters
+are deprecated, it is unwise to use this form at all, and utf8ness makes
+a big difference.

 =back

-Note that as of Perl 5.20, literal control characters in variable names
-are deprecated; and as of Perl 5.22, any other non-graphic characters
-are also deprecated.
+Prior to Perl v5.24, non-graphical ASCII control characters were also
+allowed in some situations; this had been deprecated since v5.20.

 =head2 Context
 X<context> X<scalar context> X<list context>
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index b3114a9844..b6ec5df21f 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -69,13 +69,22 @@ L</Selected Bug Fixes> section.

 =head1 Incompatible Changes

-XXX For a release on a stable branch, this section aspires to be:
-
-    There are no changes intentionally incompatible with 5.XXX.XXX
-    If any exist, they are bugs, and we request that you submit a
-    report. See L</Reporting Bugs> below.
-
-[ List each incompatible change as a =head2 entry ]
+=head2 ASCII characters in variable names must now be all visible
+
+It was legal until now on ASCII platforms for variable names to contain
+non-graphical ASCII control characters (ordinals 0 through 31, and 127,
+which are the C0 controls and C<DELETE>). This usage has been
+deprecated since v5.20, and as of now causes a syntax error. The
+variables these names referred to are special, reserved by Perl for
+whatever use it may choose, now, or in the future. Each such variable
+has an alternative way of spelling it. Instead of the single
+non-graphic control character, a two character sequence beginning with a
+caret is used, like C<$^]> and C<${^GLOBAL_PHASE}>. Details are at
+L<perlvar>. It remains legal, though unwise and deprecated (raising a
+deprecation warning), to use certain non-graphic non-ASCII characters in
+variables names when not under S<C<use utf8>>. No code should do this,
+as all such variables are reserved by Perl, and Perl doesn't currently
+define any of them (but could at any time, without notice).

 =head2 The C<autoderef> feature has been removed

diff --git a/pod/perlvar.pod b/pod/perlvar.pod
index cc69c3c47a..f825754b8e 100644
--- a/pod/perlvar.pod
+++ b/pod/perlvar.pod
@@ -12,32 +12,30 @@ arbitrarily long (up to an internal limit of 251 characters) and may
 contain letters, digits, underscores, or the special sequence C<::> or
 C<'>. In this case, the part before the last C<::> or C<'> is taken to
 be a I<package qualifier>; see L<perlmod>.
-
-Perl variable names may also be a sequence of digits or a single
-punctuation or control character (with the literal control character
-form deprecated). These names are all reserved for
+A Unicode letter that is not ASCII is not considered to be a letter
+unless S<C<"use utf8">> is in effect, and somewhat more complicated
+rules apply; see L<perldata/Identifier parsing> for details.
+
+Perl variable names may also be a sequence of digits, a single
+punctuation character, or the two-character sequence: C<^> (caret or
+CIRCUMFLEX ACCENT) followed by any one of the characters C<[][A-Z^_?\]>.
+These names are all reserved for
 special uses by Perl; for example, the all-digits names are used
 to hold data captured by backreferences after a regular expression
-match. Perl has a special syntax for the single-control-character
-names: It understands C<^X> (caret C<X>) to mean the control-C<X>
-character. For example, the notation C<$^W> (dollar-sign caret
-C<W>) is the scalar variable whose name is the single character
-control-C<W>. This is better than typing a literal control-C<W>
-into your program.
-
-Since Perl v5.6.0, Perl variable names may be alphanumeric strings that
-begin with a caret (or a control character, but this form is
-deprecated).
-These variables must be written in the form C<${^Foo}>; the braces
-are not optional. C<${^Foo}> denotes the scalar variable whose
-name is a control-C<F> followed by two C<o>'s. These variables are
+match.
+
+Since Perl v5.6.0, Perl variable names may also be alphanumeric strings
+preceded by a caret. These must all be written in the form C<${^Foo}>;
+the braces are not optional. C<${^Foo}> denotes the scalar variable
+whose name is considered to be a control-C<F> followed by two C<o>'s.
+These variables are
 reserved for future special uses by Perl, except for the ones that
-begin with C<^_> (control-underscore or caret-underscore). No
-control-character name that begins with C<^_> will acquire a special
+begin with C<^_> (caret-underscore). No
+name that begins with C<^_> will acquire a special
 meaning in any future version of Perl; such names may therefore be
 used safely in programs. C<$^_> itself, however, I<is> reserved.

-Perl identifiers that begin with digits, control characters, or
+Perl identifiers that begin with digits or
 punctuation characters are exempt from the effects of the C<package>
 declaration and are always forced to be in package C<main>; they are
 also exempt from C<strict 'vars'> errors. A few other names are also
diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index ad0e74b7d1..493c8a222c 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -150,38 +150,6 @@ EXPECT
 Use of bare << to mean <<"" is deprecated at - line 2.
 ########
 # toke.c
-BEGIN {
-    if (ord('A') == 193) {
-        print "SKIPPED\n# Literal control characters in variable names forbidden on EBCDIC";
-        exit 0;
-    }
-}
-eval "\$\cT";
-eval "\${\7LOBAL_PHASE}";
-eval "\${\cT}";
-eval "\${\n\cT}";
-eval "\${\cT\n}";
-my $ret = eval "\${\n\cT\n}";
-print "ok\n" if $ret == $^T;
-
-no warnings 'deprecated' ;
-eval "\$\cT";
-eval "\${\7LOBAL_PHASE}";
-eval "\${\cT}";
-eval "\${\n\cT}";
-eval "\${\cT\n}";
-eval "\${\n\cT\n}";
-
-EXPECT
-Use of literal control characters in variable names is deprecated at (eval 1) line 1.
-Use of literal control characters in variable names is deprecated at (eval 2) line 1.
-Use of literal control characters in variable names is deprecated at (eval 3) line 1.
-Use of literal control characters in variable names is deprecated at (eval 4) line 2.
-Use of literal control characters in variable names is deprecated at (eval 5) line 1.
-Use of literal control characters in variable names is deprecated at (eval 6) line 2.
-ok
-########
-# toke.c
 $a =~ m/$foo/eq;
 $a =~ s/$foo/fool/seq;

@@ -1497,20 +1465,10 @@ I
 ########
 # toke.c
 #[perl #119123] disallow literal control character variables
-BEGIN {
-    if (ord('A') == 193) {
-        print "SKIPPED\n# Literal control characters in variable names forbidden on EBCDIC";
-        exit 0;
-    }
-}
-eval "\$\cQ = 25";
-eval "\${ \cX } = 24";
 *{
     Foo
 }; # shouldn't warn on {\n, even though \n is a control character
 EXPECT
-Use of literal control characters in variable names is deprecated at (eval 1) line 1.
-Use of literal control characters in variable names is deprecated at (eval 2) line 1.
 ########
 # toke.c
 # [perl #120288] -X at start of line gave spurious warning, where X is not
diff --git a/t/uni/variables.t b/t/uni/variables.t
index 24e755a70b..33f057a645 100644
--- a/t/uni/variables.t
+++ b/t/uni/variables.t
@@ -15,7 +15,7 @@ use utf8;
 use open qw( :utf8 :std );
 no warnings qw(misc reserved);

-plan (tests => 66900);
+plan (tests => 66894);

 # ${single:colon} should not be treated as a simple variable, but as a
 # block with a label inside.
@@ -96,15 +96,8 @@ for ( 0x0 .. 0xff ) {
             $syntax_error = 1;
         }
         elsif ($chr =~ /[[:cntrl:]]/a) {
-            if ($chr eq "\N{NULL}") {
-                $name = sprintf "\\x%02x, NUL", $ord;
-                $syntax_error = 1;
-            }
-            else {
-                $name = sprintf "\\x%02x, an ASCII control", $ord;
-                $syntax_error = $::IS_EBCDIC;
-                $deprecated = ! $syntax_error;
-            }
+            $name = sprintf "\\x%02x, an ASCII control", $ord;
+            $syntax_error = 1;
         }
         elsif ($chr =~ /\pC/) {
             if ($chr eq "\N{SHY}") {
@@ -142,18 +135,14 @@ for ( 0x0 .. 0xff ) {
            " ... and the same under 'use utf8'");
         $tests++;
     }
-    elsif ($ord < 32 || $chr =~ /[[:punct:][:digit:]]/a) {
+    elsif ($chr =~ /[[:punct:][:digit:]]/a) {

         # Unlike other variables, we dare not try setting the length-1
-        # variables that are \cX (for all valid X) nor ASCII ones that are
-        # punctuation nor digits. This is because many of these variables
-        # have meaning to the system, and setting them could have side
-        # effects or not work as expected (And using fresh_perl() doesn't
-        # always help.) For example, setting $^D (to use a visible
-        # representation of code point 0x04) turns on tracing, and setting
-        # $^E sets an error number, but what gets printed is instead a
-        # string associated with that number. For all these we just
-        # verify that they don't generate a syntax error.
+        # variables that are ASCII punctuation and digits. This is
+        # because many of these variables have meaning to the system, and
+        # setting them could have side effects or not work as expected
+        # (And using fresh_perl() doesn't always help.) For all these we
+        # just verify that they don't generate a syntax error.
         local $@;
         evalbytes "\$$chr;";
         is $@, '', "$name as a length-1 variable doesn't generate a syntax error";
@@ -361,21 +350,25 @@ EOP

     {
         no strict;
-        # Silence the deprecation warning for literal controls
-        no warnings 'deprecated';

-        for my $var ( '$', "\7LOBAL_PHASE", "^GLOBAL_PHASE", "^V" ) {
-            SKIP: {
-                skip("Literal control characters in variable names forbidden on EBCDIC", 3)
-                    if ($::IS_EBCDIC && ord substr($var, 0, 1) < 32);
+        for my $var ( '$', "^GLOBAL_PHASE", "^V" ) {
             eval "\${ $var}";
             is($@, '', "\${ $var} works" );
             eval "\${$var }";
             is($@, '', "\${$var } works" );
             eval "\${ $var }";
             is($@, '', "\${ $var } works" );
-            }
         }
+        my $var = "\7LOBAL_PHASE";
+        eval "\${ $var}";
+        like($@, qr/Unrecognized character \\x07/,
+             "\${ $var} generates 'Unrecognized character' error" );
+        eval "\${$var }";
+        like($@, qr/Unrecognized character \\x07/,
+             "\${$var } generates 'Unrecognized character' error" );
+        eval "\${ $var }";
+        like($@, qr/Unrecognized character \\x07/,
+             "\${ $var } generates 'Unrecognized character' error" );
     }
 }

@@ -397,40 +390,8 @@ EOP
         );
     }

-    SKIP: {
-        skip("Literal control characters in variable names forbidden on EBCDIC", 2)
-            if $::IS_EBCDIC;
-        no warnings 'deprecated';
     my $ret = eval "\${\cT\n}";
-        is($@, "", 'No errors from using ${\n\cT\n}');
-        is($ret, $^T, " ... and we got the right value");
-    }
-}
-
-SKIP: {
-    skip("Literal control characters in variable names forbidden on EBCDIC", 5)
-        if $::IS_EBCDIC;
-
-    # Originally from t/base/lex.t, moved here since we can't
-    # turn deprecation warnings off in that file.
-    no strict;
-    no warnings 'deprecated';
-
-    my $CX = "\cX";
-    $ {$CX} = 17;
-
-    # Does the syntax where we use the literal control character still work?
-    is(
-        eval "\$ {\cX}",
-        17,
-        "Literal control character variables work"
-    );
-
-    eval "\$\cQ = 24"; # Literal control character
-    is($@, "", " ... and they can be assigned to without error");
-    is(${"\cQ"}, 24, " ... and the assignment works");
-    is($^Q, 24, " ... even if we access the variable through the caret name");
-    is(\${"\cQ"}, \$^Q, '\${\cQ} == \$^Q');
+    like($@, qr/\QUnrecognized character/, '${\n\cT\n} gives an error message');
 }

 {
diff --git a/toke.c b/toke.c
--- a/toke.c
+++ b/toke.c
@@ -8671,9 +8671,8 @@ S_scan_ident(pTHX_ char *s, char *dest, STRLEN destlen, I32 ck_uni)
     /* Is the byte 'd' a legal single character identifier name? 'u' is true
      * iff Unicode semantics are to be used. The legal ones are any of:
      * a) all ASCII characters except:
-     * 1) space-type ones, like \t and SPACE;
-       2) NUL;
-     * 3) '{'
+     * 1) control and space-type ones, like NUL, SOH, \t, and SPACE;
+     * 2) '{'
      * The final case currently doesn't get this far in the program, so we
      * don't test for it. If that were to change, it would be ok to allow it.
      * c) When not under Unicode rules, any upper Latin1 character
@@ -8691,11 +8690,10 @@ S_scan_ident(pTHX_ char *s, char *dest, STRLEN destlen, I32 ck_uni)
                                : (isGRAPH_L1(*s) \
                                   && LIKELY((U8) *(s) != LATIN1_TO_NATIVE(0xAD)))))
 #else
-# define VALID_LEN_ONE_IDENT(s, is_utf8) (! isSPACE_A(*(s)) \
-                                         && LIKELY(*(s) != '\0') \
-                                         && (! is_utf8 \
-                                             || isASCII_utf8((U8*) (s)) \
-                                             || isIDFIRST_utf8((U8*) (s))))
+# define VALID_LEN_ONE_IDENT(s, is_utf8) \
+    (isGRAPH_A(*(s)) || ((is_utf8) \
+                         ? isIDFIRST_utf8((U8*) (s)) \
+                         : ! isASCII_utf8((U8*) (s))))
 #endif
     if ((s <= PL_bufend - (is_utf8)
                           ? UTF8SKIP(s)
@@ -8711,13 +8709,7 @@ S_scan_ident(pTHX_ char *s, char *dest, STRLEN destlen, I32 ck_uni)
                 : (! isGRAPH_L1( (U8) *s)
                    || UNLIKELY((U8) *(s) == LATIN1_TO_NATIVE(0xAD))))
         {
-            /* Split messages for back compat */
-            if (isCNTRL_A( (U8) *s)) {
-                deprecate("literal control characters in variable names");
-            }
-            else {
-                deprecate("literal non-graphic characters in variable names");
-            }
+            deprecate("literal non-graphic characters in variable names");
         }
         if (is_utf8) {
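
The new non-EBCDIC VALID_LEN_ONE_IDENT can be read as a two-part test: the character is accepted if it is graphic ASCII, and otherwise only if it is non-ASCII (with an extra identifier-start check under "use utf8"). The following is a minimal standalone sketch of that logic for the case where "use utf8" is not in effect. It is not the perl source: the names is_graph_ascii and valid_len_one_ident_bytes are hypothetical, and is_graph_ascii is only an assumed stand-in for isGRAPH_A on an ASCII platform.

```c
#include <assert.h>   /* only needed for the sanity checks described below */
#include <stdbool.h>

/* Assumed stand-in for perl's isGRAPH_A() on an ASCII platform: the visible
 * ASCII characters, '!' (0x21) through '~' (0x7E).  SPACE, the C0 controls
 * and DEL all fall outside this range. */
static bool is_graph_ascii(unsigned char c)
{
    return c >= 0x21 && c <= 0x7E;
}

/* Hypothetical model of the macro when is_utf8 is false: a byte passes if it
 * is graphic ASCII, or if it is not ASCII at all (0x80 through 0xFF).  The
 * '{' exclusion is not modelled, since the comment in toke.c says that case
 * never reaches the macro. */
static bool valid_len_one_ident_bytes(unsigned char c)
{
    return is_graph_ascii(c) || c >= 0x80;
}
```

Under these assumptions, `valid_len_one_ident_bytes('!')` and `valid_len_one_ident_bytes(0xE9)` hold, while `valid_len_one_ident_bytes(0x14)` (a literal control-T), `valid_len_one_ident_bytes(0x07)` and `valid_len_one_ident_bytes(' ')` do not. The sketch says nothing about the separate deprecation warning that the last hunk keeps for non-graphic upper Latin1 characters.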