author Karl Williamson <khw@cpan.org> 2015-07-11 12:19:59 -0600
committer Karl Williamson <khw@cpan.org> 2015-07-13 12:17:41 -0600
commit ce4793f183b29c423cb9d2d993fb4399c8d46baa
tree 7ac909ae37eac06ffd4ddc8f839c8511a57d7e58
parent 87518e92cecac2acea7073cceea51ca610774fb0
download perl-ce4793f183b29c423cb9d2d993fb4399c8d46baa.tar.gz
Forbid variable names with ASCII non-graphic chars

See http://nntp.perl.org/group/perl.perl5.porters/229168

Also, the documentation has been updated beyond this change to clarify
related matters, based on some experimentation.

Previously, spaces couldn't be in variable names; now ASCII control
characters can't be either. The remaining permissible ASCII characters
in a variable name now must be all graphic ones.
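
The rule in the last paragraph of the message can be illustrated with a small stand-alone sketch. It is not the code from toke.c: the helper names is_graph_ascii and valid_len_one_ident_bytes are hypothetical, and the sketch assumes an ASCII platform with Unicode rules off (no "use utf8"). It mirrors the non-UTF-8 branch of the new VALID_LEN_ONE_IDENT macro in the toke.c hunk, which accepts a byte if it is an ASCII graphic character or is not ASCII at all.

```c
#include <stdbool.h>

/* Hypothetical helper, not taken from toke.c: true for the ASCII
 * graphic characters, code points 0x21 ('!') through 0x7E ('~').
 * SPACE (0x20), the C0 controls (0x00-0x1F) and DEL (0x7F) are excluded. */
static bool is_graph_ascii(unsigned char c)
{
    return c >= 0x21 && c <= 0x7E;
}

/* Sketch of the post-commit test for a single-byte variable name when
 * Unicode rules are off: accept an ASCII graphic, or any byte in
 * 0x80-0xFF. As in the macro, '{' is not special-cased here; the
 * comment in toke.c says that case is handled before this point. */
static bool valid_len_one_ident_bytes(unsigned char c)
{
    return is_graph_ascii(c) || c >= 0x80;
}
```

Under these assumptions, valid_len_one_ident_bytes('!') is true, while the byte 0x14 (control-T, formerly a deprecated spelling of $^T) and SPACE are both rejected. Some of the accepted bytes in 0x80-0xFF still draw the deprecation warning kept later in the toke.c hunk.
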
-rw-r--r-- pod/perldata.pod 92
-rw-r--r-- pod/perldelta.pod 23
-rw-r--r-- pod/perlvar.pod 38
-rw-r--r-- t/lib/warnings/toke 42
-rw-r--r-- t/uni/variables.t 81
-rw-r--r-- toke.c 22
6 files changed, 114 insertions, 184 deletions
diff --git a/pod/perldata.pod b/pod/perldata.pod
index b695598a54..a285eb7d43 100644
--- a/pod/perldata.pod
+++ b/pod/perldata.pod
@@ -37,8 +37,8 @@ collide with one of your normal variables. Strings that match
parenthesized parts of a regular expression are saved under names
containing only digits after the C<$> (see L<perlop> and L<perlre>).
In addition, several special variables that provide windows into
-the inner working of Perl have names containing punctuation characters
-and control characters. These are documented in L<perlvar>.
+the inner working of Perl have names containing punctuation characters.
+These are documented in L<perlvar>.
X<variable, built-in>
Scalar values are always named with '$', even when referring to a
@@ -99,11 +99,11 @@ that returns a reference to the appropriate type. For a description
of this, see L<perlref>.
Names that start with a digit may contain only more digits. Names
-that do not start with a letter, underscore, digit or a caret (i.e.
-a control character) are limited to one character, e.g., C<$%> or
+that do not start with a letter, underscore, digit or a caret are
+limited to one character, e.g., C<$%> or
C<$$>. (Most of these one character names have a predefined
significance to Perl. For instance, C<$$> is the current process
-id.)
+id. And all such names are reserved for Perl's possible use.)
=head2 Identifier parsing
X<identifiers>
@@ -129,7 +129,7 @@ match C<\w> (this prevents some problematic cases); and Perl
additionally accepts identfier names beginning with an underscore.
If not under C<use utf8>, the source is treated as ASCII + 128 extra
-controls, and identifiers should match
+generic characters, and identifiers should match
/ (?aa) (?!\d) \w+ /x
@@ -184,54 +184,66 @@ Put together, a grammar to match a basic identifier becomes
Meanwhile, special identifiers don't follow the above rules; For the most
part, all of the identifiers in this category have a special meaning given
by Perl. Because they have special parsing rules, these generally can't be
-fully-qualified. They come in four forms:
+fully-qualified. They come in six forms (but don't use forms 5 and 6):
=over
-=item *
+=item 1.
A sigil, followed solely by digits matching C<\p{POSIX_Digit}>, like
C<$0>, C<$1>, or C<$10000>.
-=item *
+=item 2.
-A sigil, followed by a caret and any one of the characters
-C<[][A-Z^_?\]>, like C<$^V> or C<$^]>, or a sigil followed by a literal non-space,
-non-C<NUL> control character matching the C<\p{POSIX_Cntrl}> property.
-Due to a historical oddity, if not running under C<use utf8>, the 128
-characters in the C<[0x80-0xff]> range are considered to be controls,
-and may also be used in length-one variables. However, the use of
-non-graphical characters is deprecated as of v5.22, and support for them
-will be removed in a future version of perl. ASCII space characters and
-C<NUL> already aren't allowed, so this means that a single-character
-variable name with that name being any other C0 control C<[0x01-0x1F]>,
-or C<DEL> will generate a deprecated warning. Already, under C<"use
-utf8">, non-ASCII characters must match C<Perl_XIDS>. As of v5.22, when
-not under C<"use utf8"> C1 controls C<[0x80-0x9F]>, NO BREAK SPACE, and
-SOFT HYPHEN (C<SHY>)) generate a deprecated warning.
-
-=item *
+A sigil followed by a single character matching the C<\p{POSIX_Punct}>
+property, like C<$!> or C<%+>, except the character C<"{"> doesn't work.
-Similar to the above, a sigil, followed by bareword text in braces,
-where the first character is either a caret followed by any one of
-the characters C<[][A-Z^_?\]>, like C<${^GLOBAL_PHASE}>, or a non-C<NUL>,
-non-space literal
-control like C<${\7LOBAL_PHASE}>. Like the above, when not under
-C<"use utf8">, the characters in C<[0x80-0xFF]> are considered controls, but as
-of v5.22, the use of any that are non-graphical are deprecated, and as
-of v5.20 the use of any ASCII-range literal control is deprecated.
-Support for these will be removed in a future version of perl.
+=item 3.
-=item *
+A sigil, followed by a caret and any one of the characters
+C<[][A-Z^_?\]>, like C<$^V> or C<$^]>.
-A sigil followed by a single character matching the C<\p{POSIX_Punct}>
-property, like C<$!> or C<%+>, except the character C<"{"> doesn't work.
+=item 4.
+
+Similar to the above, a sigil, followed by bareword text in braces,
+where the first character is a caret. The next character is any one of
+the characters C<[][A-Z^_?\]>, followed by ASCII word characters. An
+example is C<${^GLOBAL_PHASE}>.
+
+=item 5.
+
+A sigil, followed by any single character in the range C<[\x80-\xFF]>
+when not under C<S<"use utf8">>. (Under C<S<"use utf8">>, the normal
+identifier rules given earlier in this section apply.) Use of
+non-graphic characters (the C1 controls, the NO-BREAK SPACE, and the
+SOFT HYPHEN) is deprecated and will be forbidden in a future Perl
+version. The use of the other characters is unwise, as these are all
+reserved to have special meaning to Perl, and none of them currently
+do have special meaning, though this could change without notice.
+
+Note that an implication of this form is that there are identifiers only
+legal under C<S<"use utf8">>, and vice-versa, for example the identifier
+C<$E<233>tat> is legal under C<S<"use utf8">>, but is otherwise
+considered to be the single character variable C<$E<233>> followed by
+the bareword C<"tat">, the combination of which is a syntax error.
+
+=item 6.
+
+This is a combination of the previous two forms. It is valid only when
+not under S<C<"use utf8">> (normal identifier rules apply when under
+S<C<"use utf8">>). The form is a sigil, followed by text in braces,
+where the first character is any one of the characters in the range
+C<[\x80-\xFF]> followed by ASCII word characters up to the trailing
+brace.
+
+The same caveats as the previous form apply: The non-graphic characters
+are deprecated, it is unwise to use this form at all, and utf8ness makes
+a big difference.
=back
-Note that as of Perl 5.20, literal control characters in variable names
-are deprecated; and as of Perl 5.22, any other non-graphic characters
-are also deprecated.
+Prior to Perl v5.24, non-graphical ASCII control characters were also
+allowed in some situations; this had been deprecated since v5.20.
=head2 Context
X<context> X<scalar context> X<list context>
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index b3114a9844..b6ec5df21f 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -69,13 +69,22 @@ L</Selected Bug Fixes> section.
=head1 Incompatible Changes
-XXX For a release on a stable branch, this section aspires to be:
-
- There are no changes intentionally incompatible with 5.XXX.XXX
- If any exist, they are bugs, and we request that you submit a
- report. See L</Reporting Bugs> below.
-
-[ List each incompatible change as a =head2 entry ]
+=head2 ASCII characters in variable names must now be all visible
+
+It was legal until now on ASCII platforms for variable names to contain
+non-graphical ASCII control characters (ordinals 0 through 31, and 127,
+which are the C0 controls and C<DELETE>). This usage has been
+deprecated since v5.20, and as of now causes a syntax error. The
+variables these names referred to are special, reserved by Perl for
+whatever use it may choose, now, or in the future. Each such variable
+has an alternative way of spelling it. Instead of the single
+non-graphic control character, a two character sequence beginning with a
+caret is used, like C<$^]> and C<${^GLOBAL_PHASE}>. Details are at
+L<perlvar>. It remains legal, though unwise and deprecated (raising a
+deprecation warning), to use certain non-graphic non-ASCII characters in
+variables names when not under S<C<use utf8>>. No code should do this,
+as all such variables are reserved by Perl, and Perl doesn't currently
+define any of them (but could at any time, without notice).
=head2 The C<autoderef> feature has been removed
diff --git a/pod/perlvar.pod b/pod/perlvar.pod
index cc69c3c47a..f825754b8e 100644
--- a/pod/perlvar.pod
+++ b/pod/perlvar.pod
@@ -12,32 +12,30 @@ arbitrarily long (up to an internal limit of 251 characters) and
may contain letters, digits, underscores, or the special sequence
C<::> or C<'>. In this case, the part before the last C<::> or
C<'> is taken to be a I<package qualifier>; see L<perlmod>.
-
-Perl variable names may also be a sequence of digits or a single
-punctuation or control character (with the literal control character
-form deprecated). These names are all reserved for
+A Unicode letter that is not ASCII is not considered to be a letter
+unless S<C<"use utf8">> is in effect, and somewhat more complicated
+rules apply; see L<perldata/Identifier parsing> for details.
+
+Perl variable names may also be a sequence of digits, a single
+punctuation character, or the two-character sequence: C<^> (caret or
+CIRCUMFLEX ACCENT) followed by any one of the characters C<[][A-Z^_?\]>.
+These names are all reserved for
special uses by Perl; for example, the all-digits names are used
to hold data captured by backreferences after a regular expression
-match. Perl has a special syntax for the single-control-character
-names: It understands C<^X> (caret C<X>) to mean the control-C<X>
-character. For example, the notation C<$^W> (dollar-sign caret
-C<W>) is the scalar variable whose name is the single character
-control-C<W>. This is better than typing a literal control-C<W>
-into your program.
-
-Since Perl v5.6.0, Perl variable names may be alphanumeric strings that
-begin with a caret (or a control character, but this form is
-deprecated).
-These variables must be written in the form C<${^Foo}>; the braces
-are not optional. C<${^Foo}> denotes the scalar variable whose
-name is a control-C<F> followed by two C<o>'s. These variables are
+match.
+
+Since Perl v5.6.0, Perl variable names may also be alphanumeric strings
+preceded by a caret. These must all be written in the form C<${^Foo}>;
+the braces are not optional. C<${^Foo}> denotes the scalar variable
+whose name is considered to be a control-C<F> followed by two C<o>'s.
+These variables are
reserved for future special uses by Perl, except for the ones that
-begin with C<^_> (control-underscore or caret-underscore). No
-control-character name that begins with C<^_> will acquire a special
+begin with C<^_> (caret-underscore). No
+name that begins with C<^_> will acquire a special
meaning in any future version of Perl; such names may therefore be
used safely in programs. C<$^_> itself, however, I<is> reserved.
-Perl identifiers that begin with digits, control characters, or
+Perl identifiers that begin with digits or
punctuation characters are exempt from the effects of the C<package>
declaration and are always forced to be in package C<main>; they are
also exempt from C<strict 'vars'> errors. A few other names are also
diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index ad0e74b7d1..493c8a222c 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -150,38 +150,6 @@ EXPECT
Use of bare << to mean <<"" is deprecated at - line 2.
########
# toke.c
-BEGIN {
- if (ord('A') == 193) {
- print "SKIPPED\n# Literal control characters in variable names forbidden on EBCDIC";
- exit 0;
- }
-}
-eval "\$\cT";
-eval "\${\7LOBAL_PHASE}";
-eval "\${\cT}";
-eval "\${\n\cT}";
-eval "\${\cT\n}";
-my $ret = eval "\${\n\cT\n}";
-print "ok\n" if $ret == $^T;
-
-no warnings 'deprecated' ;
-eval "\$\cT";
-eval "\${\7LOBAL_PHASE}";
-eval "\${\cT}";
-eval "\${\n\cT}";
-eval "\${\cT\n}";
-eval "\${\n\cT\n}";
-
-EXPECT
-Use of literal control characters in variable names is deprecated at (eval 1) line 1.
-Use of literal control characters in variable names is deprecated at (eval 2) line 1.
-Use of literal control characters in variable names is deprecated at (eval 3) line 1.
-Use of literal control characters in variable names is deprecated at (eval 4) line 2.
-Use of literal control characters in variable names is deprecated at (eval 5) line 1.
-Use of literal control characters in variable names is deprecated at (eval 6) line 2.
-ok
-########
-# toke.c
$a =~ m/$foo/eq;
$a =~ s/$foo/fool/seq;
@@ -1497,20 +1465,10 @@ I
########
# toke.c
#[perl #119123] disallow literal control character variables
-BEGIN {
- if (ord('A') == 193) {
- print "SKIPPED\n# Literal control characters in variable names forbidden on EBCDIC";
- exit 0;
- }
-}
-eval "\$\cQ = 25";
-eval "\${ \cX } = 24";
*{
Foo
}; # shouldn't warn on {\n, even though \n is a control character
EXPECT
-Use of literal control characters in variable names is deprecated at (eval 1) line 1.
-Use of literal control characters in variable names is deprecated at (eval 2) line 1.
########
# toke.c
# [perl #120288] -X at start of line gave spurious warning, where X is not
diff --git a/t/uni/variables.t b/t/uni/variables.t
index 24e755a70b..33f057a645 100644
--- a/t/uni/variables.t
+++ b/t/uni/variables.t
@@ -15,7 +15,7 @@ use utf8;
use open qw( :utf8 :std );
no warnings qw(misc reserved);
-plan (tests => 66900);
+plan (tests => 66894);
# ${single:colon} should not be treated as a simple variable, but as a
# block with a label inside.
@@ -96,15 +96,8 @@ for ( 0x0 .. 0xff ) {
$syntax_error = 1;
}
elsif ($chr =~ /[[:cntrl:]]/a) {
- if ($chr eq "\N{NULL}") {
- $name = sprintf "\\x%02x, NUL", $ord;
- $syntax_error = 1;
- }
- else {
- $name = sprintf "\\x%02x, an ASCII control", $ord;
- $syntax_error = $::IS_EBCDIC;
- $deprecated = ! $syntax_error;
- }
+ $name = sprintf "\\x%02x, an ASCII control", $ord;
+ $syntax_error = 1;
}
elsif ($chr =~ /\pC/) {
if ($chr eq "\N{SHY}") {
@@ -142,18 +135,14 @@ for ( 0x0 .. 0xff ) {
" ... and the same under 'use utf8'");
$tests++;
}
- elsif ($ord < 32 || $chr =~ /[[:punct:][:digit:]]/a) {
+ elsif ($chr =~ /[[:punct:][:digit:]]/a) {
# Unlike other variables, we dare not try setting the length-1
- # variables that are \cX (for all valid X) nor ASCII ones that are
- # punctuation nor digits. This is because many of these variables
- # have meaning to the system, and setting them could have side
- # effects or not work as expected (And using fresh_perl() doesn't
- # always help.) For example, setting $^D (to use a visible
- # representation of code point 0x04) turns on tracing, and setting
- # $^E sets an error number, but what gets printed is instead a
- # string associated with that number. For all these we just
- # verify that they don't generate a syntax error.
+ # variables that are ASCII punctuation and digits. This is
+ # because many of these variables have meaning to the system, and
+ # setting them could have side effects or not work as expected
+ # (And using fresh_perl() doesn't always help.) For all these we
+ # just verify that they don't generate a syntax error.
local $@;
evalbytes "\$$chr;";
is $@, '', "$name as a length-1 variable doesn't generate a syntax error";
@@ -361,21 +350,25 @@ EOP
{
no strict;
- # Silence the deprecation warning for literal controls
- no warnings 'deprecated';
- for my $var ( '$', "\7LOBAL_PHASE", "^GLOBAL_PHASE", "^V" ) {
- SKIP: {
- skip("Literal control characters in variable names forbidden on EBCDIC", 3)
- if ($::IS_EBCDIC && ord substr($var, 0, 1) < 32);
+ for my $var ( '$', "^GLOBAL_PHASE", "^V" ) {
eval "\${ $var}";
is($@, '', "\${ $var} works" );
eval "\${$var }";
is($@, '', "\${$var } works" );
eval "\${ $var }";
is($@, '', "\${ $var } works" );
- }
}
+ my $var = "\7LOBAL_PHASE";
+ eval "\${ $var}";
+ like($@, qr/Unrecognized character \\x07/,
+ "\${ $var} generates 'Unrecognized character' error" );
+ eval "\${$var }";
+ like($@, qr/Unrecognized character \\x07/,
+ "\${$var } generates 'Unrecognized character' error" );
+ eval "\${ $var }";
+ like($@, qr/Unrecognized character \\x07/,
+ "\${ $var } generates 'Unrecognized character' error" );
}
}
@@ -397,40 +390,8 @@ EOP
);
}
- SKIP: {
- skip("Literal control characters in variable names forbidden on EBCDIC", 2)
- if $::IS_EBCDIC;
- no warnings 'deprecated';
my $ret = eval "\${\cT\n}";
- is($@, "", 'No errors from using ${\n\cT\n}');
- is($ret, $^T, " ... and we got the right value");
- }
-}
-
-SKIP: {
- skip("Literal control characters in variable names forbidden on EBCDIC", 5)
- if $::IS_EBCDIC;
-
- # Originally from t/base/lex.t, moved here since we can't
- # turn deprecation warnings off in that file.
- no strict;
- no warnings 'deprecated';
-
- my $CX = "\cX";
- $ {$CX} = 17;
-
- # Does the syntax where we use the literal control character still work?
- is(
- eval "\$ {\cX}",
- 17,
- "Literal control character variables work"
- );
-
- eval "\$\cQ = 24"; # Literal control character
- is($@, "", " ... and they can be assigned to without error");
- is(${"\cQ"}, 24, " ... and the assignment works");
- is($^Q, 24, " ... even if we access the variable through the caret name");
- is(\${"\cQ"}, \$^Q, '\${\cQ} == \$^Q');
+ like($@, qr/\QUnrecognized character/, '${\n\cT\n} gives an error message');
}
{
diff --git a/toke.c b/toke.c
index 48b853dc2c..396fa76c64 100644
--- a/toke.c
+++ b/toke.c
@@ -8671,9 +8671,8 @@ S_scan_ident(pTHX_ char *s, char *dest, STRLEN destlen, I32 ck_uni)
/* Is the byte 'd' a legal single character identifier name? 'u' is true
* iff Unicode semantics are to be used. The legal ones are any of:
* a) all ASCII characters except:
- * 1) space-type ones, like \t and SPACE;
- 2) NUL;
- * 3) '{'
+ * 1) control and space-type ones, like NUL, SOH, \t, and SPACE;
+ * 2) '{'
* The final case currently doesn't get this far in the program, so we
* don't test for it. If that were to change, it would be ok to allow it.
* c) When not under Unicode rules, any upper Latin1 character
@@ -8691,11 +8690,10 @@ S_scan_ident(pTHX_ char *s, char *dest, STRLEN destlen, I32 ck_uni)
: (isGRAPH_L1(*s) \
&& LIKELY((U8) *(s) != LATIN1_TO_NATIVE(0xAD)))))
#else
-# define VALID_LEN_ONE_IDENT(s, is_utf8) (! isSPACE_A(*(s)) \
- && LIKELY(*(s) != '\0') \
- && (! is_utf8 \
- || isASCII_utf8((U8*) (s)) \
- || isIDFIRST_utf8((U8*) (s))))
+# define VALID_LEN_ONE_IDENT(s, is_utf8) \
+ (isGRAPH_A(*(s)) || ((is_utf8) \
+ ? isIDFIRST_utf8((U8*) (s)) \
+ : ! isASCII_utf8((U8*) (s))))
#endif
if ((s <= PL_bufend - (is_utf8)
? UTF8SKIP(s)
@@ -8711,13 +8709,7 @@ S_scan_ident(pTHX_ char *s, char *dest, STRLEN destlen, I32 ck_uni)
: (! isGRAPH_L1( (U8) *s)
|| UNLIKELY((U8) *(s) == LATIN1_TO_NATIVE(0xAD))))
{
- /* Split messages for back compat */
- if (isCNTRL_A( (U8) *s)) {
- deprecate("literal control characters in variable names");
- }
- else {
- deprecate("literal non-graphic characters in variable names");
- }
+ deprecate("literal non-graphic characters in variable names");
}
if (is_utf8) {