author     neil <neil@138bc75d-0d04-0410-961f-82ee72b054a4>  2001-09-27 11:10:40 +0000
committer  neil <neil@138bc75d-0d04-0410-961f-82ee72b054a4>  2001-09-27 11:10:40 +0000
commit     dc4b9d21919146d4bb89a79889693513b617c46e (patch)
tree       0a1a38965fbe8c58fe43ca9216a358d5a2fc6322 /gcc/doc/cppinternals.texi
parent     251edd9c087fbd754dfdae6c51691e3a8d200ce2 (diff)
download   gcc-dc4b9d21919146d4bb89a79889693513b617c46e.tar.gz
* doc/cppinternals.texi: Update.
git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@45839 138bc75d-0d04-0410-961f-82ee72b054a4
Diffstat (limited to 'gcc/doc/cppinternals.texi')
-rw-r--r--  gcc/doc/cppinternals.texi  186
1 file changed, 133 insertions(+), 53 deletions(-)
diff --git a/gcc/doc/cppinternals.texi b/gcc/doc/cppinternals.texi
index d03e143025d..bcd0fd3117e 100644
--- a/gcc/doc/cppinternals.texi
+++ b/gcc/doc/cppinternals.texi
@@ -41,8 +41,8 @@ into another language, under the above conditions for modified versions.
@titlepage
@c @finalout
@title Cpplib Internals
-@subtitle Last revised Jan 2001
-@subtitle for GCC version 3.0
+@subtitle Last revised September 2001
+@subtitle for GCC version 3.1
@author Neil Booth
@page
@vskip 0pt plus 1filll
@@ -69,14 +69,14 @@ into another language, under the above conditions for modified versions.
@node Top, Conventions,, (DIR)
@chapter Cpplib---the core of the GNU C Preprocessor
-The GNU C preprocessor in GCC 3.0 has been completely rewritten. It is
+The GNU C preprocessor in GCC 3.x has been completely rewritten. It is
now implemented as a library, cpplib, so it can be easily shared between
a stand-alone preprocessor, and a preprocessor integrated with the C,
C++ and Objective-C front ends. It is also available for use by other
programs, though this is not recommended as its exposed interface has
not yet reached a point of reasonable stability.
-This library has been written to be re-entrant, so that it can be used
+The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary. It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
@@ -86,8 +86,6 @@ This brief manual documents some of the internals of cpplib, and a few
tricky issues encountered. It also describes certain behaviour we would
like to preserve, such as the format and spacing of its output.
-Identifiers, macro expansion, hash nodes, lexing.
-
@menu
* Conventions:: Conventions used in the code.
* Lexer:: The combined C, C++ and Objective-C Lexer.
@@ -123,18 +121,106 @@ behaviour.
@node Lexer, Whitespace, Conventions, Top
@unnumbered The Lexer
@cindex lexer
-@cindex tokens
-
-The lexer is contained in the file @file{cpplex.c}. We want to have a
-lexer that is single-pass, for efficiency reasons. We would also like
-the lexer to only step forwards through the input files, and not step
-back. This will make future changes to support different character
-sets, in particular state or shift-dependent ones, much easier.
-This file also contains all information needed to spell a token, i.e.@: to
-output it either in a diagnostic or to a preprocessed output file. This
-information is not exported, but made available to clients through such
-functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
+@section Overview
+The lexer is contained in the file @file{cpplex.c}. It is a hand-coded
+lexer, and not implemented as a state machine. It can understand C, C++
+and Objective-C source code, and has been extended to allow reasonably
+successful preprocessing of assembly language. The lexer does not make
+an initial pass to strip out trigraphs and escaped newlines, but handles
+them as they are encountered in a single pass of the input file. It
+returns preprocessing tokens individually, not a line at a time.
+
+It is mostly transparent to users of the library, since the library's
+interface for obtaining the next token, @code{cpp_get_token}, takes care
+of lexing new tokens, handling directives, and expanding macros as
+necessary. However, the lexer does expose some functionality so that
+clients of the library can easily spell a given token, such as
+@code{cpp_spell_token} and @code{cpp_token_len}. These functions are
+useful when generating diagnostics, and for emitting the preprocessed
+output.
+
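+For instance, a client might spell a token into a scratch buffer along
+the following lines.  This is only a sketch: it assumes that
+@code{cpp_token_len} returns an upper bound on the length of the
+spelling, and that @code{cpp_spell_token} returns a pointer just past
+the last character it writes.
+
+@smallexample
+/* Print TOKEN, for example from a diagnostic routine.  */
+static void
+print_token (pfile, token)
+     cpp_reader *pfile;
+     const cpp_token *token;
+@{
+  unsigned int len = cpp_token_len (token);
+  unsigned char *buf = (unsigned char *) xmalloc (len + 1);
+  unsigned char *end = cpp_spell_token (pfile, token, buf);
+
+  *end = '\0';
+  fputs ((const char *) buf, stderr);
+  free (buf);
+@}
+@end smallexample
+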
+@section Lexing a token
+Lexing of an individual token is handled by @code{_cpp_lex_direct} and
+its subroutines. In its current form the code is quite complicated,
+with read-ahead characters and suchlike, since it strives not to step
+back in the character stream in preparation for handling non-ASCII file
+encodings. The current plan is to convert any such files to UTF-8
+before processing them. This complexity is therefore unnecessary and
+will be removed, so I'll not discuss it further here.
+
+The job of @code{_cpp_lex_direct} is simply to lex a token. It is not
+responsible for issues like directive handling, returning lookahead
+tokens directly, multiple-include optimisation, or conditional block
+skipping. It necessarily has a minor r@^ole to play in memory
+management of lexed lines. I discuss these issues in a separate section
+(@pxref{Lexing a line}).
+
+The lexer places the token it lexes into the storage pointed to by the
+variable @var{cur_token}, and then increments @var{cur_token} to point
+at storage for the next token.  This variable is
+important for correct diagnostic positioning. Unless a specific line
+and column are passed to the diagnostic routines, they will examine the
+@var{line} and @var{col} values of the token just before the location
+that @var{cur_token} points to, and use that location to report the
+diagnostic.
+
+The lexer does not consider whitespace to be a token in its own right.
+If whitespace (other than a new line) precedes a token, it sets the
+@code{PREV_WHITE} bit in the token's flags. Each token has its
+@var{line} and @var{col} variables set to the line and column of the
+first character of the token. This line number is the line number in
+the translation unit, and can be converted to a source (file, line) pair
+using the line map code.
+
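+For example, a client emitting preprocessed output might restore the
+original spacing with something like the sketch below;
+@var{out_stream}, and @code{print_token} from the previous sketch, are
+hypothetical.
+
+@smallexample
+/* Reproduce horizontal whitespace lost during tokenization.  */
+if (token->flags & PREV_WHITE)
+  putc (' ', out_stream);
+print_token (pfile, token);
+@end smallexample
+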
+The first token on a logical, i.e.@: unescaped, line has the flag
+@code{BOL} set for beginning-of-line. This flag is intended for
+internal use, both to distinguish a @samp{#} that begins a directive
+from one that doesn't, and to generate a callback to clients that want
+to be notified about the start of every non-directive line with tokens
+on it. Clients cannot reliably determine this for themselves: the first
+token might be a macro, and the tokens of a macro expansion do not have
+the @code{BOL} flag set. The macro expansion may even be empty, and the
+next token on the line certainly won't have the @code{BOL} flag set.
+
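+Internally the flag makes the directive check simple.  A sketch; the
+handler call is illustrative rather than the exact code:
+
+@smallexample
+/* A '#' introduces a directive only at the start of a logical line.  */
+if (token->type == CPP_HASH && (token->flags & BOL))
+  handle_directive (pfile);    /* Hypothetical handler.  */
+@end smallexample
+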
+New lines are treated specially; exactly how the lexer handles them is
+context-dependent. The C standard mandates that directives are
+terminated by the first unescaped newline character, even if it appears
+in the middle of a macro expansion. Therefore, if the state variable
+@var{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
+which is normally used to indicate end-of-file, to indicate
+end-of-directive. In a directive a @code{CPP_EOF} token never means
+end-of-file.  Conveniently, if the caller is @code{collect_args}, it
+already handles @code{CPP_EOF} as if it were end-of-file, and reports an
+error about an unterminated macro argument list.
+
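+In outline, with the test spelled after the @var{in_directive} variable
+described above:
+
+@smallexample
+/* Within a directive, an unescaped newline terminates it; return
+   CPP_EOF, which here means end-of-directive, not end-of-file.  */
+if (c == '\n' && pfile->state.in_directive)
+  result->type = CPP_EOF;
+@end smallexample
+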
+The C standard also specifies that a new line in the middle of the
+arguments to a macro is treated as whitespace. This white space is
+important in case the macro argument is stringified. The state variable
+@code{parsing_args} is non-zero when the preprocessor is collecting the
+arguments to a macro call. It is set to 1 when looking for the opening
+parenthesis to a function-like macro, and 2 when collecting the actual
+arguments up to the closing parenthesis, since these two cases need to
+be distinguished sometimes. One such time is here: the lexer sets the
+@code{PREV_WHITE} flag of a token if it meets a new line when
+@code{parsing_args} is set to 2.  The flag is not set upon meeting a
+new line when @code{parsing_args} is 1, since then code like
+
+@smallexample
+#define foo() bar
+foo
+baz
+@end smallexample
+
+@noindent would be output with an erroneous space before @samp{baz}:
+
+@smallexample
+foo
+ baz
+@end smallexample
+
+This is a good example of the subtlety of getting token spacing correct
+in the preprocessor; there are plenty of tests in the testsuite for
+corner cases like this.
The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
@@ -148,62 +234,56 @@ within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminates a comment. Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.
-So the routine @samp{parse_identifier}, that lexes an identifier, cannot
-assume that it can scan forwards until the first non-identifier
+So, for example, the routine that lexes a number, @code{parse_number},
+cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
-sequence that represents the @samp{\} of an escaped newline. Similarly
-for the routine that handles numbers, @samp{parse_number}. If these
-routines stumble upon a @samp{?} or @samp{\}, they call
-@samp{skip_escaped_newlines} to skip over any potential escaped newlines
-before checking whether they can finish.
+sequence that represents the @samp{\} of an escaped newline. If it
+encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
+to skip over any potential escaped newlines before checking whether the
+number has been finished.
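+
+The control flow can be illustrated with a self-contained toy; this is
+not cpplib's code, and real numbers, trigraphs and buffer handling are
+all more involved:
+
+@smallexample
+#include <ctype.h>
+
+/* Return P advanced past any backslash-newline splices.  */
+static const char *
+skip_escaped_newlines (p)
+     const char *p;
+@{
+  while (p[0] == '\\' && p[1] == '\n')
+    p += 2;
+  return p;
+@}
+
+/* Scan a toy number starting at P; return just past its end.  */
+static const char *
+parse_number (p)
+     const char *p;
+@{
+  for (;;)
+    @{
+      const char *q;
+
+      while (isdigit ((unsigned char) *p))
+        p++;
+      if (*p != '\\')
+        return p;                /* Genuinely finished.  */
+      q = skip_escaped_newlines (p);
+      if (q == p)
+        return p;                /* A real backslash; not ours.  */
+      p = q;                     /* A splice; keep scanning.  */
+    @}
+@}
+@end smallexample
+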
-Similarly code in the main body of @samp{_cpp_lex_token} cannot simply
+Similarly, code in the main body of @code{_cpp_lex_direct} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
-some sort. These cases use the function @samp{get_effective_char},
-which returns the first character after any intervening newlines.
+some sort. Such cases use the function @code{get_effective_char}, which
+returns the first character after any intervening escaped newlines.
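+
+A sketch of the @samp{+=} case; the signature of
+@code{get_effective_char} and the stepping of the input buffer are
+simplified here:
+
+@smallexample
+case '+':
+  c = get_effective_char (pfile);
+  if (c == '=')
+    result->type = CPP_PLUS_EQ;  /* '+=' despite any splices.  */
+  else
+    @{
+      result->type = CPP_PLUS;
+      /* C must be reconsidered as the start of the next token.  */
+    @}
+  break;
+@end smallexample
+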
-The lexer needs to keep track of the correct column position,
-including counting tabs as specified by the @option{-ftabstop=} option.
-This should be done even within comments; C-style comments can appear in
-the middle of a line, and we want to report diagnostics in the correct
+The lexer needs to keep track of the correct column position, including
+counting tabs as specified by the @option{-ftabstop=} option. This
+should be done even within C-style comments; they can appear in the
+middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
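+
+The tab calculation itself is simple.  A self-contained sketch,
+assuming 0-based columns:
+
+@smallexample
+/* Advance a 0-based column COL past character C, with tabs
+   rounding up to the next multiple of TABSTOP (cf. -ftabstop=).  */
+static unsigned int
+advance_column (col, c, tabstop)
+     unsigned int col;
+     int c;
+     unsigned int tabstop;
+@{
+  if (c == '\t')
+    return (col / tabstop + 1) * tabstop;
+  return col + 1;
+@}
+@end smallexample
+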
-Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
+Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
-@samp{parse_identifier}. In both cases, whether a diagnostic is needed
-or not is dependent upon lexer state. For example, we don't want to
-issue a diagnostic for re-poisoning a poisoned identifier, or for using
-@samp{__VA_ARGS__} in the expansion of a variable-argument macro.
-Therefore @samp{parse_identifier} makes use of flags to determine
+@code{parse_identifier}. In both cases, whether a diagnostic is needed
+or not is dependent upon the lexer's state. For example, we don't want
+to issue a diagnostic for re-poisoning a poisoned identifier, or for
+using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
+Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate. Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.
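+
+A sketch of such a check; the state field named here is an assumption,
+and the real tests and messages live in @code{parse_identifier}:
+
+@smallexample
+/* Complain about a poisoned identifier, unless the lexer's state
+   says its use is legitimate (e.g. re-poisoning it).  */
+if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
+  cpp_error (pfile, "attempt to use poisoned \"%s\"", NODE_NAME (node));
+@end smallexample
+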
Another place where state flags are used to change behaviour is whilst
-parsing header names. Normally, a @samp{<} would be lexed as a single
-token. After a @code{#include} directive, though, it should be lexed
-as a single token as far as the nearest @samp{>} character. Note that
-we don't allow the terminators of header names to be escaped; the first
+lexing header names. Normally, a @samp{<} would be lexed as a single
+token. After a @code{#include} directive, though, it should be lexed as
+a single token as far as the nearest @samp{>} character. Note that we
+don't allow the terminators of header names to be escaped; the first
@samp{"} or @samp{>} terminates the header name.
Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
-force. For example, @samp{::} is a single token in C++, but two
-separate @samp{:} tokens, and almost certainly a syntax error, in C@.
-Such cases are handled in the main function @samp{_cpp_lex_token}, based
-upon the flags set in the @samp{cpp_options} structure.
-
-Note we have almost, but not quite, achieved the goal of not stepping
-backwards in the input stream. Currently @samp{skip_escaped_newlines}
-does step back, though with care it should be possible to adjust it so
-that this does not happen. For example, one tricky issue is if we meet
-a trigraph, but the command line option @option{-trigraphs} is not in
-force but @option{-Wtrigraphs} is, we need to warn about it but then
-buffer it and continue to treat it as 3 separate characters.
+force. For example, @samp{::} is a single token in C++, but in C it is
+two separate @samp{:} tokens and almost certainly a syntax error. Such
+cases are handled by @code{_cpp_lex_direct} based upon command-line
+flags stored in the @code{cpp_options} structure.
+
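+A sketch of the @samp{::} case; @code{CPP_OPTION} is the accessor used
+throughout cpplib, and the surrounding buffer handling is elided:
+
+@smallexample
+case ':':
+  c = get_effective_char (pfile);
+  if (c == ':' && CPP_OPTION (pfile, cplusplus))
+    result->type = CPP_SCOPE;    /* '::' is one token in C++.  */
+  else
+    result->type = CPP_COLON;    /* In C, just ':'.  */
+  break;
+@end smallexample
+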
+@anchor{Lexing a line}
+@section Lexing a line
@node Whitespace, Hash Nodes, Lexer, Top
@unnumbered Whitespace