summaryrefslogtreecommitdiff
path: root/include_server/macro_eval.py
blob: 49dce53eb4feef4677ad8001fc999281d145e7a1 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
#! /usr/bin/python2.4

# Copyright 2007 Google Inc.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301,
# USA.
 
"""Evaluation of macros acccording to an overapproximation semantics.

This module generally follows CPP semantics for the evalution of macros. But we
treat every define as a possible one, because we don't know whether it is
actually executed when a file is really preprocessed. Our semantics is thus
multi-valued: an expression (that is, a string) is evaluated to the set of
poosible values it can have according to arbitrary choices of #defines in effect
among those encountered in the program text.

An example explains the general idea. If we have:

#define A x
#define A y
#define B 1
#define B 2

Then the expression

  A.B

evaluates to one of the following:

  x.1
  y.1
  x.2
  y.2

The set {"x.1", "y.1", "x.2", "y.2"} is the value of EvalExpr("A.B",
symbol_table), where symbol_table is the dictionary in which we have
stored the four #define's.

Currently, we will be satisfied with

  A A

evaluating to

  x x
  x y
  y x
  y y

although a sharper semantic modelling would yield only:

  x x
  y y



How to Read This Code
---------------------
An understanding of the C preprocessor is necessary. See "The GNU C Preprocessor
Internals" (http://gcc.gnu.org/onlinedocs/cppinternals). Especially the section
"Macro Expansion Algorithm" is informative.

Whitespace Insertion and Other Deficiencies
-------------------------------------------
CPP inserts whitespaces and sometimes doesn't according to very complicated
rules. We do not insert whitespaces.

Also, we retokenize each intermediate expansion.

For actual arguments of macros, we also we do not do the right thing for
parentheses inside single quotes, which is to ignore them.

There are probably several more deviations from CPP semantics.

These deviations should not matter for most common included computes.

What If the Include Processor is Wrong
--------------------------------------
Assume that we have

  #include very_complicated_call(anfhis,fifj)

If the include processor produces spurious expansions like:

 "whatda.c"
 2 + 2
 "5 * 3"

then file whada.c if found in search directory becomes part of the include
closure. So does the file "5 * 3". But 2 + 2 does not have the shape of an a
filepath in an include: the filepath must be in quotes or angle brackets.

These spurious files are not harmful to preprocessing on the server.

If the include server omits calculating the expansion

 "right_file.h"

then the compilation on the server will fail. The client, according to the logic
of dcc_build_somewhere, will then perform a local compilation.

Symbol Table
------------

The symbol table is a dictionary, whose entries are of the form 

  symbol: definition_list

Each definition in definition_list is either
 - a string, denoting the expansion of an object-like macro, or
 - a pair ([param_1,...,param_n], rhs), which denotes a function-like macro,
   whose formal parameters are param_1,.., param_n and whose expansion is rhs,
   before the substitution of formal parameters for actual parameters.
"""

__author__ = "Nils Klarlund"

import re

import basics
import parse_file
import statistics

Debug = basics.Debug
DEBUG_TRACE = basics.DEBUG_TRACE
DEBUG_TRACE1 = basics.DEBUG_TRACE1
DEBUG_TRACE2 = basics.DEBUG_TRACE2
NotCoveredError = basics.NotCoveredError

# REGULAR EXPRESSIONS

SINGLE_POUND_RE = re.compile(r"\B#\s*(\S*)") # \B = here: not at end of word
DOUBLE_POUND_RE = re.compile(r"##")
SYMBOL_RE = re.compile(r"\b\w+\b") # \b = word boundary \w = word constituent


# HELPER FUNCTIONS

def _BigUnion(list_of_sets):
  """Return the set that is the union of the sets or lists in list_of_sets."""
  result = []
  for s in list_of_sets:
    result.extend(list(s))
  return set(result)


def _PrependToSet(expr, expr_set):
  """Return the set consiting of expr + element with element in expr_set."""
  return set([ expr + expr_ for expr_ in expr_set ])


def _SubstituteSymbolInString(x, y, str):
  """Return the string that results from substituting x for y in str."""
  Debug(DEBUG_TRACE2,
        """_SubstituteSymbolInString: x: "%s", y: "%s", str:"%s" """,
        x, y, str)
  result = re.sub(r"\b%s\b" % re.escape(x), y.replace('\\', '\\\\'), str)
  Debug(DEBUG_TRACE2, """_SubstituteSymbolInString (result): "%s" """, result)
  return result

def _ParseArgs(string, pos):
  """Split stuff according to commas at outer level in parenthesized string.

  If string[pos:] does not start with '(', return (None, pos). If string[pos:]
  starts with '(' and there is a balanced parenthesis structure ending
  at pos_end, then return (args, pos_end), where args is
  string[pos:pos_end] hacked into segments between commas at outer
  level.  So "(a,m(c, n(d)), c)...." results in (["a", "m(c, n(d))", "
  c"], 17) being returned.
  """
  # TODO(klarlund): we ignore ignoring parentheses inside single quotes. Such
  # occurrences are deemed unlikely at this moment. Fix that so that parentheses
  # inside single quotes are ignored.
  open_parens = 0
  if not pos < len(string) or string[pos] != '(':
    return (None, pos)
  # Prepare a list of comma and extremal positions.  The '(' at the left is the
  # first extremal position.
  commas = [pos]  
  pos_end = None
  inside_quotes = False
  for i in range(pos, len(string)):
    if inside_quotes:
      if string[i] == '"' and string[i-1] != r'\\':
        inside_quotes = False
      continue
    if string[i]==',' and open_parens==1:
      commas.append(i)
    elif string[i]=='(':
      open_parens += 1
    elif string[i]==')':
      open_parens -= 1
      if open_parens == 0:
        pos_end = i
        break
    elif string[i] == '"' and string[i-1] != r'\\':
      inside_quotes = True
  if not pos_end:
    return (None, pos)
  commas.append(pos_end)  # the other extremal position
  args_list = []
  for i in range(len(commas) - 1):
    args_list.append(string[commas[i] + 1 : commas[i + 1]])
  return (args_list, pos_end + 1)


def _MassageAccordingToPoundSigns(string):
  """Perform 'stringification (#) and concatenation (##)."""
  return SINGLE_POUND_RE.sub(r'"\1"', DOUBLE_POUND_RE.sub("", string))


# EVALUATION

def _EvalExprHelper(expr, symbol_table, disabled):
  if __debug__:
    Debug(DEBUG_TRACE2, "EvalExprHelper: expr: %s", expr)

  """Evaluate according to an overapproximation macro substitution semantics.

  Arguments:
    expr:          a string
    symbol_table:  { symbol: [rhs,
                              ...,
                              ([param_1,...,param_n], rhs),
                              ...],
                     ... }, where rhs and param_i are strings
    disabled:       set of disabled symbols (see "The GNU C Preprocessor
                    Internals")

  Returns: [ expr_1, expr_2, ...], which is a non-empy list of
    strings, namely the expansions of expr.
  """  

  def _ReEvalRecursivelyForExpansion(expansion, after):
    """Reevalute the expansion that is the result of finding a match for a
    macro.

    Arguments:
      symbol (outer scope): the name of the matched macro that resulted in
        expansion; it is the same as match.group()
      match (outer scope): the match data for the symbol
      expansion: the expansion we are substituting for the match
      after: the string after the expansion
    Modifies:
      value_set: the set of all possible expansions of expr
      
    The value set is updated according to recursive evaluations of the string
    that results from inserting expansion between expr[:match.start()] and
    expr[match.end():] (for symbol-like macro) or expr[args_end:] (for
    function-like macro).

    The idea is to form a set of strings from a cross product of two string sets
    descriping all possibly expansions before and after the match.

    There are two recursions involved. First, we evaluate after to find all
    possible values of what follows the match. This recursion does not involve a
    larger disabled set. Each resulting string is named after_eval_expr. Second,
    we evaluate expansion concatenated with each after_eval_expr value.  In
    these evaluations, symbol is added to the disabled set.
    """
    if __debug__:
      Debug(DEBUG_TRACE2,
            ("_ReEvalRecursivelyForExpansion: expr: %s\n" +
             "    before:     %s\n    expansion:        %s\n    after: %s")
            % (expr, expr[:match.start()], expansion, after))
    value_set.update(
      _PrependToSet(expr[:match.start()],
                   _BigUnion([ _EvalExprHelper(expansion + after_expansion,
                                              symbol_table,
                                              disabled | set([symbol]))
                              for after_expansion in
                              _EvalExprHelper(after,
                                              symbol_table,
                                              disabled) ])))

  def _EvalMacro(definition, disabled):
    """Evaluate symbol according to definition.

    Here definition is either object-like or function-like.
    """
    # Consider that this symbol goes unevaluated.
    value_set.update(
      _PrependToSet(expr[:match.end()],
                   _EvalExprHelper(expr[match.end():],
                                   symbol_table,
                                   disabled)))
    if isinstance(definition, str):
      # The expansion is the definition.
      _ReEvalRecursivelyForExpansion(definition, expr[match.end():])
    elif isinstance(definition, tuple):
      # We have an invocation a function-like macro. Find the possible
      # values of the function symbol, according to object-like
      # expansions, before substituting.
      (lhs, rhs) = definition  # lhs = formal parameters, rhs =
                               # expansion before substitution
      # Verify that the number of formal parameters match the
      # number of actual parameters; otherwise skip.
      if not args_list or len(lhs) != len(args_list):
        return
      # Expand arguments recursively.
      args_expand = [ _EvalExprHelper(arg, symbol_table, disabled)
                        for arg in args_list ]
      # Do the substitutions. Again, we'll need to piece together
      # strings from a cross product. In this the fragments come from
      # the expansions of the arguments.
      expansions = [rhs]
      for i in range(len(args_expand)):
        expansions = [ _SubstituteSymbolInString(lhs[i], arg, expansion)
                       for expansion in expansions
                       for arg in args_expand[i] ]
      for expansion in expansions:
        real_expansion = _MassageAccordingToPoundSigns(expansion)
        _ReEvalRecursivelyForExpansion(real_expansion, expr[args_end:])
    else:
      assert False, "Definition '%s' is unexpected." % definition
      
  # Look for a symbol.
  match = SYMBOL_RE.search(expr)
  if not match:
    # No symbol found.
    return set([expr])
  else:
    # Let's break down the string into segments according to the
    # symbol found. This is non-standard: the real CPP only tokenizes
    # once.
    symbol = match.group()
    (args_list, args_end) = _ParseArgs(expr, match.end())
    Debug(DEBUG_TRACE2,
          "EvalExprHelper (inside): expr: %s\n" +
          "    symbol:     %s\n    args_list:       %s\n" +
          "    before: %s\n",
          expr, symbol, args_list, expr[:match.start()])
    if symbol not in symbol_table:
      # Process rest of string recursively.
      return _PrependToSet(expr[:match.end()],
                          _EvalExprHelper(expr[match.end():],
                                          symbol_table,
                                          disabled))
    else:
      # Now consider the set of meanings of this symbol.  But first
      # note that the string remaining unexpanded is always a
      # possibility, because we are doing a "forall" analysis.
      value_set = set([expr])
      # Now carry out substitution on expr[match.start():match.end()],
      # the whole stretch of expr that consists of symbol and possibly
      # args with parentheses.
      if symbol not in disabled:
        defs = symbol_table[symbol]
        for definition in defs:
          _EvalMacro(definition, disabled)
      return value_set


def EvalExpression(expr, symbol_table):
  """Calculate sets of possibly values of expr given symbol_table.

  Arguments:
    expr:          any string to be macro expanded
    symbol_table:  { symbol: {rhs, ..}, ,...,
                     symbol:{((param_1,...,param_n), rhs), ... }
  Returns:
    [ expr_1, expr_2, ...], a list of strings: the possible expansions of expr.
  """
  if __debug__:
    Debug(DEBUG_TRACE, "EvalExpression: expr: %s", expr)
  r = set(_EvalExprHelper(expr, symbol_table, set([])))
  if __debug__:
    Debug(DEBUG_TRACE, "EvalExpression: return: %s", r)
  return r


def ResolveExpr(includepath_map_index,
                resolve,
                expr,
                currdir_idx,
                searchdir_idx,
                quote_dirs,
                angle_dirs,
                symbol_table):
  """Evaluate and resolve possible values for expr using symbol table.

  Determine all possible values of expr. Those that are of the form "filepath"
  or <filepath> are resolved against (file_dir_idx, quote_dirs) or angle_dirs,
  respectively. The set of resolvants is returned along with a list of all
  symbols that occurs in possible evaluations of expr.

  Arguments:
    includepath_map_index: the Index function of an includepath map
    resolve: a Resolve method of a BuildStatCache object
    expr: any string to be macro expanded
    currdir_idx: a directory index
    searchdir_idx: a directory index (used for resolving quote-includes)
    quote_dirs: a directory index list
    angle_dirs: a directory index list
    symbol_table: as described in module macro_expr
  Returns:
    a pair(files, symbols), where files is a list of (filepath pair,
    realpath index), namely those files that are successful resolutions of
    possible double-quoted and angle-quoted values of expr, and symbols is 
    the set of all identifiers occurring in expression or its possible 
    expansions
  Raises:
    NotCoveredError
  """
  if __debug__:
        Debug(DEBUG_TRACE, "ResolveExpr: %s, %s, %s",
              expr, searchdir_idx, angle_dirs)
  resolved_files = []
  symbols = []
  statistics.resolve_expr_counter += 1
  for val in EvalExpression(expr, symbol_table):
    match_result = parse_file.INCLUDE_STRING_RE.match(val)
    if match_result:
      if match_result.group('quote'):
	resolved = resolve(includepath_map_index(match_result.group('quote')),
			   currdir_idx, searchdir_idx, quote_dirs)
	resolved_files.append(resolved)
      elif match_result.group('angle'):
	resolved = resolve(includepath_map_index(match_result.group('angle')),
			   currdir_idx, None, angle_dirs)
	resolved_files.append(resolved)
    else:
      symbols.extend(SYMBOL_RE.findall(val))
  if __debug__:
      Debug(DEBUG_TRACE, "ResolveExpr: return: %s", resolved_files)
  return (resolved_files, set(symbols))