author | DJ Delorie <dj@redhat.com> | 2001-06-01 12:51:18 -0400 |
---|---|---|
committer | DJ Delorie <dj@gcc.gnu.org> | 2001-06-01 12:51:18 -0400 |
commit | 95e30ecc774ffb60a9222e042c04cd452fcbdcc0 | |
tree | f81399b3b264b1da44db9b622747ea3964777bce /gcc/doc/cppinternals.texi | |
parent | e5410b32112c69d305c5f377225d4e71c83f31c8 | |
c-tree.texi, [...]: Move to doc subdirectory.
* c-tree.texi, contrib.texi, cpp.texi, cppinternals.texi,
extend.texi, fdl.texi, gcov.texi, invoke.texi, md.texi, objc.texi,
rtl.texi, tm.texi, texinfo.tex: Move to doc subdirectory.
* install.texi: Move to doc/install-old.texi.
* gcc.texi: Move to doc, refer to install-old.texi.
* Makefile.in: Reflect move of docs to doc/.
* f/Make-lang.in: Ditto.
* java/Make-lang.in: Ditto.
* doc/.cvsignore: New.
From-SVN: r42779
Diffstat (limited to 'gcc/doc/cppinternals.texi')
-rw-r--r-- | gcc/doc/cppinternals.texi | 430 |
1 files changed, 430 insertions, 0 deletions
diff --git a/gcc/doc/cppinternals.texi b/gcc/doc/cppinternals.texi new file mode 100644 index 00000000000..2a038cb259e --- /dev/null +++ b/gcc/doc/cppinternals.texi @@ -0,0 +1,430 @@ +\input texinfo +@setfilename cppinternals.info +@settitle The GNU C Preprocessor Internals + +@ifinfo +@dircategory Programming +@direntry +* Cpplib: (cppinternals). Cpplib internals. +@end direntry +@end ifinfo + +@c @smallbook +@c @cropmarks +@c @finalout +@setchapternewpage odd +@ifinfo +This file documents the internals of the GNU C Preprocessor. + +Copyright 2000, 2001 Free Software Foundation, Inc. + +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. + +@ignore +Permission is granted to process this file through Tex and print the +results, provided the printed document carries copying permission +notice identical to this one except for the removal of this paragraph +(this paragraph not being relevant to the printed manual). + +@end ignore +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided also that +the entire resulting derived work is distributed under the terms of a +permission notice identical to this one. + +Permission is granted to copy and distribute translations of this manual +into another language, under the above conditions for modified versions. +@end ifinfo + +@titlepage +@c @finalout +@title Cpplib Internals +@subtitle Last revised Jan 2001 +@subtitle for GCC version 3.0 +@author Neil Booth +@page +@vskip 0pt plus 1filll +@c man begin COPYRIGHT +Copyright @copyright{} 2000, 2001 +Free Software Foundation, Inc. + +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. 
+ +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided also that +the entire resulting derived work is distributed under the terms of a +permission notice identical to this one. + +Permission is granted to copy and distribute translations of this manual +into another language, under the above conditions for modified versions. +@c man end +@end titlepage +@page + +@node Top, Conventions,, (DIR) +@chapter Cpplib - the core of the GNU C Preprocessor + +The GNU C preprocessor in GCC 3.0 has been completely rewritten. It is +now implemented as a library, cpplib, so it can be easily shared between +a stand-alone preprocessor, and a preprocessor integrated with the C, +C++ and Objective C front ends. It is also available for use by other +programs, though this is not recommended as its exposed interface has +not yet reached a point of reasonable stability. + +This library has been written to be re-entrant, so that it can be used +to preprocess many files simultaneously if necessary. It has also been +written with the preprocessing token as the fundamental unit; the +preprocessor in previous versions of GCC would operate on text strings +as the fundamental unit. + +This brief manual documents some of the internals of cpplib, and a few +tricky issues encountered. It also describes certain behaviour we would +like to preserve, such as the format and spacing of its output. + +Identifiers, macro expansion, hash nodes, lexing. + +@menu +* Conventions:: Conventions used in the code. +* Lexer:: The combined C, C++ and Objective C Lexer. +* Whitespace:: Input and output newlines and whitespace. +* Hash Nodes:: All identifiers are hashed. +* Macro Expansion:: Macro expansion algorithm. +* Files:: File handling. +* Index:: Index. 
+@end menu + +@node Conventions, Lexer, Top, Top +@unnumbered Conventions +@cindex interface +@cindex header files + +cpplib has two interfaces - one is exposed internally only, and the +other is for both internal and external use. + +The convention is that functions and types that are exposed to multiple +files internally are prefixed with @samp{_cpp_}, and are to be found in +the file @samp{cpphash.h}. Functions and types exposed to external +clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}. For +historical reasons this is no longer quite true, but we should strive to +stick to it. + +We are striving to reduce the information exposed in cpplib.h to the +bare minimum necessary, and then to keep it there. This makes clear +exactly what external clients are entitled to assume, and allows us to +change internals in the future without worrying whether library clients +are perhaps relying on some kind of undocumented implementation-specific +behaviour. + +@node Lexer, Whitespace, Conventions, Top +@unnumbered The Lexer +@cindex lexer +@cindex tokens + +The lexer is contained in the file @samp{cpplex.c}. We want to have a +lexer that is single-pass, for efficiency reasons. We would also like +the lexer to only step forwards through the input files, and not step +back. This will make future changes to support different character +sets, in particular state or shift-dependent ones, much easier. + +This file also contains all information needed to spell a token, i.e. to +output it either in a diagnostic or to a preprocessed output file. This +information is not exported, but made available to clients through such +functions as @samp{cpp_spell_token} and @samp{cpp_token_len}. + +The most painful aspect of lexing ISO-standard C and C++ is handling +trigraphs and backlash-escaped newlines. 
Trigraphs are processed before +any interpretation of the meaning of a character is made, and unfortunately +there is a trigraph representation for a backslash, so it is possible for +the trigraph @samp{??/} to introduce an escaped newline. + +Escaped newlines are tedious because theoretically they can occur +anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token, +within the characters of an identifier, and even between the @samp{*} +and @samp{/} that terminates a comment. Moreover, you cannot be sure +there is just one - there might be an arbitrarily long sequence of them. + +So the routine @samp{parse_identifier}, that lexes an identifier, cannot +assume that it can scan forwards until the first non-identifier +character and be done with it, because this could be the @samp{\} +introducing an escaped newline, or the @samp{?} introducing the trigraph +sequence that represents the @samp{\} of an escaped newline. Similarly +for the routine that handles numbers, @samp{parse_number}. If these +routines stumble upon a @samp{?} or @samp{\}, they call +@samp{skip_escaped_newlines} to skip over any potential escaped newlines +before checking whether they can finish. + +Similarly code in the main body of @samp{_cpp_lex_token} cannot simply +check for a @samp{=} after a @samp{+} character to determine whether it +has a @samp{+=} token; it needs to be prepared for an escaped newline of +some sort. These cases use the function @samp{get_effective_char}, +which returns the first character after any intervening newlines. + +The lexer needs to keep track of the correct column position, +including counting tabs as specified by the @samp{-ftabstop=} option. +This should be done even within comments; C-style comments can appear in +the middle of a line, and we want to report diagnostics in the correct +position for text appearing after the end of the comment. 
+ +Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers, +may be invalid and require a diagnostic. However, if they appear in a +macro expansion we don't want to complain with each use of the macro. +It is therefore best to catch them during the lexing stage, in +@samp{parse_identifier}. In both cases, whether a diagnostic is needed +or not is dependent upon lexer state. For example, we don't want to +issue a diagnostic for re-poisoning a poisoned identifier, or for using +@samp{__VA_ARGS__} in the expansion of a variable-argument macro. +Therefore @samp{parse_identifier} makes use of flags to determine +whether a diagnostic is appropriate. Since we change state on a +per-token basis, and don't lex whole lines at a time, this is not a +problem. + +Another place where state flags are used to change behaviour is whilst +parsing header names. Normally, a @samp{<} would be lexed as a single +token. After a @code{#include} directive, though, it should be lexed +as a single token as far as the nearest @samp{>} character. Note that +we don't allow the terminators of header names to be escaped; the first +@samp{"} or @samp{>} terminates the header name. + +Interpretation of some character sequences depends upon whether we are +lexing C, C++ or Objective C, and on the revision of the standard in +force. For example, @samp{::} is a single token in C++, but two +separate @samp{:} tokens, and almost certainly a syntax error, in C. +Such cases are handled in the main function @samp{_cpp_lex_token}, based +upon the flags set in the @samp{cpp_options} structure. + +Note we have almost, but not quite, achieved the goal of not stepping +backwards in the input stream. Currently @samp{skip_escaped_newlines} +does step back, though with care it should be possible to adjust it so +that this does not happen. 
For example, one tricky issue is if we meet +a trigraph, but the command line option @samp{-trigraphs} is not in +force but @samp{-Wtrigraphs} is, we need to warn about it but then +buffer it and continue to treat it as 3 separate characters. + +@node Whitespace, Hash Nodes, Lexer, Top +@unnumbered Whitespace +@cindex whitespace +@cindex newlines +@cindex escaped newlines +@cindex paste avoidance +@cindex line numbers + +The lexer has been written to treat each of @samp{\r}, @samp{\n}, +@samp{\r\n} and @samp{\n\r} as a single new line indicator. This allows +it to transparently preprocess MS-DOS, Macintosh and Unix files without +their needing to pass through a special filter beforehand. + +We also decided to treat a backslash, either @samp{\} or the trigraph +@samp{??/}, separated from one of the above newline indicators by +non-comment whitespace only, as intending to escape the newline. It +tends to be a typing mistake, and cannot reasonably be mistaken for +anything else in any of the C-family grammars. Since handling it this +way is not strictly conforming to the ISO standard, the library issues a +warning wherever it encounters it. + +Handling newlines like this is made simpler by doing it in one place +only. The function @samp{handle_newline} takes care of all newline +characters, and @samp{skip_escaped_newlines} takes care of arbitrarily +long sequences of escaped newlines, deferring to @samp{handle_newline} +to handle the newlines themselves. + +Another whitespace issue only concerns the stand-alone preprocessor: we +want to guarantee that re-reading the preprocessed output results in an +identical token stream. Without taking special measures, this might not +be the case because of macro substitution. 
We could simply insert a +space between adjacent tokens, but ideally we would like to keep this to +a minimum, both for aesthetic reasons and because it causes problems for +people who still try to abuse the preprocessor for things like Fortran +source and Makefiles. + +The token structure contains a flags byte, and two flags are of interest +here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}. @samp{PREV_WHITE} +indicates that the token was preceded by whitespace; if this is the case +we need not worry about it incorrectly pasting with its predecessor. +The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and +indicates that paste avoidance by insertion of a space to the left of +the token may be necessary. Recursively, the first token of a macro +substitution, the first token after a macro substitution, the first +token of a substituted argument, and the first token after a substituted +argument are all flagged @samp{AVOID_LPASTE} by the macro expander. + +If a token flagged in this way does not have a @samp{PREV_WHITE} flag, +and the routine @var{cpp_avoid_paste} determines that it might be +misinterpreted by the lexer if a space is not inserted between it and +the immediately preceding token, then stand-alone CPP's output routines +will insert a space between them. To avoid excessive spacing, +@var{cpp_avoid_paste} tries hard to only request a space if one is +likely to be necessary, but for reasons of efficiency it is slightly +conservative and might recommend a space where one is not strictly +needed. + +Finally, the preprocessor takes great care to ensure it keeps track of +both the position of a token in the source file, for diagnostic +purposes, and where it should appear in the output file, because using +CPP for other languages like assembler requires this. The two positions +may differ for the following reasons: + +@itemize @bullet +@item +Escaped newlines are deleted, so lines spliced in this way are joined to +form a single logical line. 
+ +@item +A macro expansion replaces the tokens that form its invocation, but any +newlines appearing in the macro's arguments are interpreted as a single +space, with the result that the macro's replacement appears in full on +the same line that the macro name appeared in the source file. This is +particularly important for stringification of arguments - newlines +embedded in the arguments must appear in the string as spaces. +@end itemize + +The source file location is maintained in the @var{lineno} member of the +@var{cpp_buffer} structure, and the column number inferred from the +current position in the buffer relative to the @var{line_base} buffer +variable, which is updated with every newline whether escaped or not. + +TODO: Finish this. + +@node Hash Nodes, Macro Expansion, Whitespace, Top +@unnumbered Hash Nodes +@cindex hash table +@cindex identifiers +@cindex macros +@cindex assertions +@cindex named operators + +When cpplib encounters an "identifier", it generates a hash code for it +and stores it in the hash table. By "identifier" we mean tokens with +type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as +well as keywords, directive names, macro names and so on. For example, +all of "pragma", "int", "foo" and "__GNUC__" are identifiers and hashed +when lexed. + +Each node in the hash table contain various information about the +identifier it represents. For example, its length and type. At any one +time, each identifier falls into exactly one of three categories: + +@itemize @bullet +@item Macros + +These have been declared to be macros, either on the command line or +with @code{#define}. A few, such as @samp{__TIME__} are builtins +entered in the hash table during initialisation. The hash node for a +normal macro points to a structure with more information about the +macro, such as whether it is function-like, how many arguments it takes, +and its expansion. 
Builtin macros are flagged as special, and instead +contain an enum indicating which of the various builtin macros it is. + +@item Assertions + +Assertions are in a separate namespace to macros. To enforce this, cpp +actually prepends a @code{#} character before hashing and entering it in +the hash table. An assertion's node points to a chain of answers to +that assertion. + +@item Void + +Everything else falls into this category - an identifier that is not +currently a macro, or a macro that has since been undefined with +@code{#undef}. + +When preprocessing C++, this category also includes the named operators, +such as @samp{xor}. In expressions these behave like the operators they +represent, but in contexts where the spelling of a token matters they +are spelt differently. This spelling distinction is relevant when they +are operands of the stringizing and pasting macro operators @code{#} and +@code{##}. Named operator hash nodes are flagged, both to catch the +spelling distinction and to prevent them from being defined as macros. +@end itemize + +The same identifiers share the same hash node. Since each identifier +token, after lexing, contains a pointer to its hash node, this is used +to provide rapid lookup of various information. For example, when +parsing a @code{#define} statement, CPP flags each argument's identifier +hash node with the index of that argument. This makes duplicated +argument checking an O(1) operation for each argument. Similarly, for +each identifier in the macro's expansion, lookup to see if it is an +argument, and which argument it is, is also an O(1) operation. Further, +each directive name, such as @samp{endif}, has an associated directive +enum stored in its hash node, so that directive lookup is also O(1). 
+ +@node Macro Expansion, Files, Hash Nodes, Top +@unnumbered Macro Expansion Algorithm + +@node Files, Index, Macro Expansion, Top +@unnumbered File Handling +@cindex files + +Fairly obviously, the file handling code of cpplib resides in the file +@samp{cppfiles.c}. It takes care of the details of file searching, +opening, reading and caching, for both the main source file and all the +headers it recursively includes. + +The basic strategy is to minimize the number of system calls. On many +systems, the basic @code{open ()} and @code{fstat ()} system calls can +be quite expensive. For every @code{#include}-d file, we need to try +all the directories in the search path until we find a match. Some +projects, such as glibc, pass twenty or thirty include paths on the +command line, so this can rapidly become time consuming. + +For a header file we have not encountered before we have little choice +but to do this. However, it is often the case that the same headers are +repeatedly included, and in these cases we try to avoid repeating the +filesystem queries whilst searching for the correct file. + +For each file we try to open, we store the constructed path in a splay +tree. This path first undergoes simplification by the function +@code{_cpp_simplify_pathname}. For example, +@samp{/usr/include/bits/../foo.h} is simplified to +@samp{/usr/include/foo.h} before we enter it in the splay tree and try +to @code{open ()} the file. CPP will then find subsequent uses of +@samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and +save system calls. + +Further, it is likely the file contents have also been cached, saving a +@code{read ()} system call. We don't bother caching the contents of +header files that are re-inclusion protected, and whose re-inclusion +macro is defined when we leave the header file for the first time. If +the host supports it, we try to map suitably large files into memory, +rather than reading them in directly. 
+ +The include paths are internally stored on a null-terminated +singly-linked list, starting with the @code{"header.h"} directory search +chain, which then links into the @code{<header.h>} directory chain. + +Files included with the @code{<foo.h>} syntax start the lookup directly +in the second half of this chain. However, files included with the +@code{"foo.h"} syntax start at the beginning of the chain, but with one +extra directory prepended. This is the directory of the current file; +the one containing the @code{#include} directive. Prepending this +directory on a per-file basis is handled by the function +@code{search_from}. + +Note that a header included with a directory component, such as +@code{#include "mydir/foo.h"} and opened as +@samp{/usr/local/include/mydir/foo.h}, will have the complete path minus +the basename @samp{foo.h} as the current directory. + +Enough information is stored in the splay tree that CPP can immediately +tell whether it can skip the header file because of the multiple include +optimisation, whether the file didn't exist or couldn't be opened for +some reason, or whether the header was flagged not to be re-used, as it +is with the obsolete @code{#import} directive. + +For the benefit of MS-DOS filesystems with an 8.3 filename limitation, +CPP offers the ability to treat various include file names as aliases +for the real header files with shorter names. The map from one to the +other is found in a special file called @samp{header.gcc}, stored in the +command line (or system) include directories to which the mapping +applies. This may be higher up the directory tree than the full path to +the file minus the base name. + +@node Index,, Files, Top +@unnumbered Index +@printindex cp + +@contents +@bye |
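
The Whitespace node of the manual added above says the lexer treats each of `\r`, `\n`, `\r\n` and `\n\r` as a single newline indicator. The following is a minimal sketch of that rule only, not cpplib's `handle_newline`. The names `skip_newline` and `count_newlines` are hypothetical and were chosen for illustration. The sketch also assumes greedy pairing, so `"\n\r\n"` counts as the pair `\n\r` followed by a lone `\n`; the manual does not say how cpplib resolves that case.

```c
#include <stddef.h>

/* Hypothetical helper (not part of cpplib): if *p starts a newline
   indicator, return a pointer just past it; otherwise return p.  */
static const char *
skip_newline (const char *p)
{
  if (*p != '\n' && *p != '\r')
    return p;

  /* A different newline character immediately following is paired
     with the first one, covering "\r\n" and "\n\r".  */
  if ((p[1] == '\n' || p[1] == '\r') && p[1] != p[0])
    return p + 2;

  return p + 1;
}

/* Count newline indicators in a NUL-terminated buffer, so Unix,
   MS-DOS and classic Macintosh line endings all give the same count.  */
static size_t
count_newlines (const char *buf)
{
  size_t n = 0;

  while (*buf != '\0')
    {
      const char *next = skip_newline (buf);
      if (next != buf)
        {
          n++;
          buf = next;
        }
      else
        buf++;
    }
  return n;
}
```

Called on the same two-line text saved with Unix (`"a\nb\n"`), MS-DOS (`"a\r\nb\r\n"`) or Macintosh (`"a\rb\r"`) line endings, `count_newlines` reports two lines in every case, which is the property that lets such files be read without a separate filter pass.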