From dc4b9d21919146d4bb89a79889693513b617c46e Mon Sep 17 00:00:00 2001
From: neil
Date: Thu, 27 Sep 2001 11:10:40 +0000
Subject: * doc/cppinternals.texi: Update.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@45839 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/doc/cppinternals.texi | 186 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 133 insertions(+), 53 deletions(-)

diff --git a/gcc/doc/cppinternals.texi b/gcc/doc/cppinternals.texi
index d03e143025d..bcd0fd3117e 100644
--- a/gcc/doc/cppinternals.texi
+++ b/gcc/doc/cppinternals.texi
@@ -41,8 +41,8 @@ into another language, under the above conditions for modified versions.
 @titlepage
 @c @finalout
 @title Cpplib Internals
-@subtitle Last revised Jan 2001
-@subtitle for GCC version 3.0
+@subtitle Last revised September 2001
+@subtitle for GCC version 3.1
 @author Neil Booth
 @page
 @vskip 0pt plus 1filll
@@ -69,14 +69,14 @@ into another language, under the above conditions for modified versions.
 @node Top, Conventions,, (DIR)
 @chapter Cpplib---the core of the GNU C Preprocessor

-The GNU C preprocessor in GCC 3.0 has been completely rewritten.  It is
+The GNU C preprocessor in GCC 3.x has been completely rewritten.  It is
 now implemented as a library, cpplib, so it can be easily shared between
 a stand-alone preprocessor, and a preprocessor integrated with the C,
 C++ and Objective-C front ends.  It is also available for use by other
 programs, though this is not recommended as its exposed interface has
 not yet reached a point of reasonable stability.

-This library has been written to be re-entrant, so that it can be used
+The library has been written to be re-entrant, so that it can be used
 to preprocess many files simultaneously if necessary.
 It has also been written with the preprocessing token as the fundamental
 unit; the preprocessor in previous versions of GCC would operate on text
 strings
@@ -86,8 +86,6 @@
 This brief manual documents some of the internals of cpplib, and a few
 tricky issues encountered.  It also describes certain behaviour we would
 like to preserve, such as the format and spacing of its output.

-Identifiers, macro expansion, hash nodes, lexing.
-
 @menu
 * Conventions::    Conventions used in the code.
 * Lexer::          The combined C, C++ and Objective-C Lexer.
@@ -123,18 +121,106 @@ behaviour.
 @node Lexer, Whitespace, Conventions, Top
 @unnumbered The Lexer
 @cindex lexer
-@cindex tokens
-
-The lexer is contained in the file @file{cpplex.c}.  We want to have a
-lexer that is single-pass, for efficiency reasons.  We would also like
-the lexer to only step forwards through the input files, and not step
-back.  This will make future changes to support different character
-sets, in particular state or shift-dependent ones, much easier.
-This file also contains all information needed to spell a token, i.e.@: to
-output it either in a diagnostic or to a preprocessed output file.  This
-information is not exported, but made available to clients through such
-functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
+@section Overview
+The lexer is contained in the file @file{cpplex.c}.  It is a hand-coded
+lexer, and not implemented as a state machine.  It can understand C, C++
+and Objective-C source code, and has been extended to allow reasonably
+successful preprocessing of assembly language.  The lexer does not make
+an initial pass to strip out trigraphs and escaped newlines, but handles
+them as they are encountered in a single pass of the input file.  It
+returns preprocessing tokens individually, not a line at a time.
+
+It is mostly transparent to users of the library, since the library's
+interface for obtaining the next token, @code{cpp_get_token}, takes care
+of lexing new tokens, handling directives, and expanding macros as
+necessary.  However, the lexer does expose some functionality so that
+clients of the library can easily spell a given token, such as
+@code{cpp_spell_token} and @code{cpp_token_len}.  These functions are
+useful when generating diagnostics, and for emitting the preprocessed
+output.
+
+@section Lexing a token
+Lexing of an individual token is handled by @code{_cpp_lex_direct} and
+its subroutines.  In its current form the code is quite complicated,
+with read ahead characters and suchlike, since it strives to not step
+back in the character stream in preparation for handling non-ASCII file
+encodings.  The current plan is to convert any such files to UTF-8
+before processing them.  This complexity is therefore unnecessary and
+will be removed, so I'll not discuss it further here.
+
+The job of @code{_cpp_lex_direct} is simply to lex a token.  It is not
+responsible for issues like directive handling, returning lookahead
+tokens directly, multiple-include optimisation, or conditional block
+skipping.  It necessarily has a minor r@^ole to play in memory
+management of lexed lines.  I discuss these issues in a separate section
+(@pxref{Lexing a line}).
+
+The lexer places the token it lexes into storage pointed to by the
+variable @var{cur_token}, and then increments it.  This variable is
+important for correct diagnostic positioning.  Unless a specific line
+and column are passed to the diagnostic routines, they will examine the
+@var{line} and @var{col} values of the token just before the location
+that @var{cur_token} points to, and use that location to report the
+diagnostic.
+
+The lexer does not consider whitespace to be a token in its own right.
+If whitespace (other than a new line) precedes a token, it sets the
+@code{PREV_WHITE} bit in the token's flags.
Each token has its
+@var{line} and @var{col} variables set to the line and column of the
+first character of the token.  This line number is the line number in
+the translation unit, and can be converted to a source (file, line) pair
+using the line map code.
+
+The first token on a logical, i.e.@: unescaped, line has the flag
+@code{BOL} set for beginning-of-line.  This flag is intended for
+internal use, both to distinguish a @samp{#} that begins a directive
+from one that doesn't, and to generate a callback to clients that want
+to be notified about the start of every non-directive line with tokens
+on it.  Clients cannot reliably determine this for themselves: the first
+token might be a macro, and the tokens of a macro expansion do not have
+the @code{BOL} flag set.  The macro expansion may even be empty, and the
+next token on the line certainly won't have the @code{BOL} flag set.
+
+New lines are treated specially; exactly how the lexer handles them is
+context-dependent.  The C standard mandates that directives are
+terminated by the first unescaped newline character, even if it appears
+in the middle of a macro expansion.  Therefore, if the state variable
+@var{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
+which is normally used to indicate end-of-file, to indicate
+end-of-directive.  In a directive a @code{CPP_EOF} token never means
+end-of-file.  Conveniently, if the caller was @code{collect_args}, it
+already handles @code{CPP_EOF} as if it were end-of-file, and reports an
+error about an unterminated macro argument list.
+
+The C standard also specifies that a new line in the middle of the
+arguments to a macro is treated as whitespace.  This white space is
+important in case the macro argument is stringified.  The state variable
+@code{parsing_args} is non-zero when the preprocessor is collecting the
+arguments to a macro call.
It is set to 1 when looking for the opening
+parenthesis to a function-like macro, and 2 when collecting the actual
+arguments up to the closing parenthesis, since these two cases need to
+be distinguished sometimes.  One such time is here: the lexer sets the
+@code{PREV_WHITE} flag of a token if it meets a new line when
+@code{parsing_args} is set to 2.  It doesn't set it if it meets a new
+line when @code{parsing_args} is 1, since then code like
+
+@smallexample
+#define foo() bar
+foo
+baz
+@end smallexample
+
+@noindent would be output with an erroneous space before @samp{baz}:
+
+@smallexample
+foo
+ baz
+@end smallexample
+
+This is a good example of the subtlety of getting token spacing correct
+in the preprocessor; there are plenty of tests in the testsuite for
+corner cases like this.

 The most painful aspect of lexing ISO-standard C and C++ is handling
 trigraphs and backslash-escaped newlines.  Trigraphs are processed before
@@ -148,62 +234,56 @@ within the characters of an identifier, and even between the @samp{*}
 and @samp{/} that terminates a comment.  Moreover, you cannot be sure
 there is just one---there might be an arbitrarily long sequence of them.

-So the routine @samp{parse_identifier}, that lexes an identifier, cannot
-assume that it can scan forwards until the first non-identifier
+So, for example, the routine that lexes a number, @code{parse_number},
+cannot assume that it can scan forwards until the first non-number
 character and be done with it, because this could be the @samp{\}
 introducing an escaped newline, or the @samp{?} introducing the trigraph
-sequence that represents the @samp{\} of an escaped newline.  Similarly
-for the routine that handles numbers, @samp{parse_number}.  If these
-routines stumble upon a @samp{?} or @samp{\}, they call
-@samp{skip_escaped_newlines} to skip over any potential escaped newlines
-before checking whether they can finish.
+sequence that represents the @samp{\} of an escaped newline.
If it
+encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
+to skip over any potential escaped newlines before checking whether the
+number has been finished.

-Similarly code in the main body of @samp{_cpp_lex_token} cannot simply
+Similarly code in the main body of @code{_cpp_lex_direct} cannot simply
 check for a @samp{=} after a @samp{+} character to determine whether it
 has a @samp{+=} token; it needs to be prepared for an escaped newline of
-some sort.  These cases use the function @samp{get_effective_char},
-which returns the first character after any intervening newlines.
+some sort.  Such cases use the function @code{get_effective_char}, which
+returns the first character after any intervening escaped newlines.

-The lexer needs to keep track of the correct column position,
-including counting tabs as specified by the @option{-ftabstop=} option.
-This should be done even within comments; C-style comments can appear in
-the middle of a line, and we want to report diagnostics in the correct
+The lexer needs to keep track of the correct column position, including
+counting tabs as specified by the @option{-ftabstop=} option.  This
+should be done even within C-style comments; they can appear in the
+middle of a line, and we want to report diagnostics in the correct
 position for text appearing after the end of the comment.

-Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
+Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
 may be invalid and require a diagnostic.  However, if they appear in a
 macro expansion we don't want to complain with each use of the macro.
 It is therefore best to catch them during the lexing stage, in
-@samp{parse_identifier}.  In both cases, whether a diagnostic is needed
-or not is dependent upon lexer state.  For example, we don't want to
-issue a diagnostic for re-poisoning a poisoned identifier, or for using
-@samp{__VA_ARGS__} in the expansion of a variable-argument macro.
-Therefore @samp{parse_identifier} makes use of flags to determine
+@code{parse_identifier}.  In both cases, whether a diagnostic is needed
+or not is dependent upon the lexer's state.  For example, we don't want
+to issue a diagnostic for re-poisoning a poisoned identifier, or for
+using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
+Therefore @code{parse_identifier} makes use of state flags to determine
 whether a diagnostic is appropriate.  Since we change state on a
 per-token basis, and don't lex whole lines at a time, this is not a
 problem.

 Another place where state flags are used to change behaviour is whilst
-parsing header names.  Normally, a @samp{<} would be lexed as a single
-token.  After a @code{#include} directive, though, it should be lexed
-as a single token as far as the nearest @samp{>} character.  Note that
-we don't allow the terminators of header names to be escaped; the first
+lexing header names.  Normally, a @samp{<} would be lexed as a single
+token.  After a @code{#include} directive, though, it should be lexed as
+a single token as far as the nearest @samp{>} character.  Note that we
+don't allow the terminators of header names to be escaped; the first
 @samp{"} or @samp{>} terminates the header name.

 Interpretation of some character sequences depends upon whether we are
 lexing C, C++ or Objective-C, and on the revision of the standard in
-force.  For example, @samp{::} is a single token in C++, but two
-separate @samp{:} tokens, and almost certainly a syntax error, in C@.
-Such cases are handled in the main function @samp{_cpp_lex_token}, based
-upon the flags set in the @samp{cpp_options} structure.
-
-Note we have almost, but not quite, achieved the goal of not stepping
-backwards in the input stream.  Currently @samp{skip_escaped_newlines}
-does step back, though with care it should be possible to adjust it so
-that this does not happen.
For example, one tricky issue is if we meet
-a trigraph, but the command line option @option{-trigraphs} is not in
-force but @option{-Wtrigraphs} is, we need to warn about it but then
-buffer it and continue to treat it as 3 separate characters.
+force.  For example, @samp{::} is a single token in C++, but in C it is
+two separate @samp{:} tokens and almost certainly a syntax error.  Such
+cases are handled by @code{_cpp_lex_direct} based upon command-line
+flags stored in the @code{cpp_options} structure.
+
+@anchor{Lexing a line}
+@section Lexing a line

 @node Whitespace, Hash Nodes, Lexer, Top
 @unnumbered Whitespace