From eaf9c744335788c0eff89ea7fd84f7dd64565d21 Mon Sep 17 00:00:00 2001 From: Jordan Rose Date: Tue, 5 Feb 2013 17:31:34 +0000 Subject: [analyzer] Add a doc describing the internals of RegionStore. This is a text file with Markdown-ish formatting because we haven't decided where analyzer internal documents should go, but it's probably better to have this in source control than sitting on my local drive forever. git-svn-id: https://llvm.org/svn/llvm-project/cfe/trunk@174398 91177308-0d34-0410-b5e6-96231b3b80d8 --- docs/analyzer/RegionStore.txt | 171 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 171 insertions(+) create mode 100644 docs/analyzer/RegionStore.txt (limited to 'docs/analyzer') diff --git a/docs/analyzer/RegionStore.txt b/docs/analyzer/RegionStore.txt new file mode 100644 index 0000000000..5d37cf7bed --- /dev/null +++ b/docs/analyzer/RegionStore.txt @@ -0,0 +1,171 @@ +The analyzer "Store" represents the contents of memory regions. It is an opaque +functional data structure stored in each ProgramState; the only class that can +modify the store is its associated StoreManager. + +Currently (Feb. 2013), the only StoreManager implementation being used is +RegionStoreManager. This store records bindings to memory regions using a "base +region + offset" key. (This allows `*p` and `p[0]` to map to the same location, +among other benefits.) + +Regions are grouped into "clusters", which roughly correspond to "regions with +the same base region". This allows certain operations to be more efficient, +such as invalidation. + +Regions that do not have a known offset use a special "symbolic" offset. These +keys store both the original region, and the "concrete offset region" -- the +last region whose offset is entirely concrete. (For example, in the expression +`foo.bar[1][i].baz`, the concrete offset region is the array `foo.bar[1]`, +since that has a known offset from the start of the top-level `foo` struct.) + + +Binding Invalidation +==================== + +Supporting both concrete and symbolic offsets makes things a bit tricky. Here's +an example: + + foo[0] = 0; + foo[1] = 1; + foo[i] = i; + +After the third assignment, nothing can be said about the value of `foo[0]`, +because `foo[i]` may have overwritten it! Thus, *binding to a region with a +symbolic offset invalidates the entire concrete offset region.* We know +`foo[i]` is somewhere within `foo`, so we don't have to invalidate anything +else, but we do have to be conservative about all other bindings within `foo`. + +Continuing the example: + + foo[i] = i; + foo[0] = 0; + +After this latest assignment, nothing can be said about the value of `foo[i]`, +because `foo[0]` may have overwritten it! *Binding to a region R with a +concrete offset invalidates any symbolic offset bindings whose concrete offset +region is a super-region **or** sub-region of R.* All we know about `foo[i]` is +that it is somewhere within `foo`, so changing *anything* within `foo` might +change `foo[i]`, and changing *all* of `foo` (or its base region) will +*definitely* change `foo[i]`. + +This logic could be improved by using the current constraints on `i`, at the +cost of speed. The latter case could also be improved by matching region kinds, +i.e. changing `foo[0].a` is unlikely to affect `foo[i].b`, no matter what `i` +is. + +For more detail, read through RegionStoreManager::removeSubRegionBindings in +RegionStore.cpp. + + +ObjCIvarRegions +=============== + +Objective-C instance variables require a bit of special handling. Like struct +fields, they are not base regions, and when their parent object region is +invalidated, all the instance variables must be invalidated as well. However, +they have no concrete compile-time offsets (in the modern, "non-fragile" +runtime), and so cannot easily be represented as an offset from the start of +the object in the analyzer. Moreover, this means that invalidating a single +instance variable should *not* invalidate the rest of the object, since unlike +struct fields or array elements there is no way to perform pointer arithmetic +to access another instance variable. + +Consequently, although the base region of an ObjCIvarRegion is the entire +object, RegionStore offsets are computed from the start of the instance +variable. Thus it is not valid to assume that all bindings with non-symbolic +offsets start from the base region! + + +Region Invalidation +=================== + +Unlike binding invalidation, region invalidation occurs when the entire +contents of a region may have changed---say, because it has been passed to a +function the analyzer can model, like memcpy, or because its address has +escaped, usually as an argument to an opaque function call. In these cases we +need to throw away not just all bindings within the region itself, but within +its entire cluster, since neighboring regions may be accessed via pointer +arithmetic. + +Region invalidation typically does even more than this, however. Because it +usually represents the complete escape of a region from the analyzer's model, +its *contents* must also be transitively invalidated. (For example, if a region +'p' of type 'int **' is invalidated, the contents of '*p' and '**p' may have +changed as well.) The algorithm that traverses this transitive closure of +accessible regions is known as ClusterAnalysis, and is also used for finding +all live bindings in the store (in order to throw away the dead ones). The name +"ClusterAnalysis" predates the cluster-based organization of bindings, but +refers to the same concept: during invalidation and liveness analysis, all +bindings within a cluster must be treated in the same way for a conservative +model of program behavior. + + +Default Bindings +================ + +Most bindings in RegionStore are simple scalar values -- integers and pointers. +These are known as "Direct" bindings. However, RegionStore supports a second +type of binding called a "Default" binding. These are used to provide values to +all the elements of an aggregate type (struct or array) without having to +explicitly specify a binding for each individual element. + +When there is no Direct binding for a particular region, the store manager +looks at each super-region in turn to see if there is a Default binding. If so, +this value is used as the value of the original region. The search ends when +the base region is reached, at which point the RegionStore will pick an +appropriate default value for the region (usually a symbolic value, but +sometimes zero, for static data, or "uninitialized", for stack variables). + + int manyInts[10]; + manyInts[1] = 42; // Creates a Direct binding for manyInts[1]. + print(manyInts[1]); // Retrieves the Direct binding for manyInts[1]; + print(manyInts[0]); // There is no Direct binding for manyInts[1]. + // Is there a Default binding for the entire array? + // There is not, but it is a stack variable, so we use + // "uninitialized" as the default value (and emit a + // diagnostic!). + +NOTE: The fact that bindings are stored as a base region plus an offset limits +the Default Binding strategy, because in C aggregates can contain other +aggregates. In the current implementation of RegionStore, there is no way to +distinguish a Default binding for an entire aggregate from a Default binding +for the sub-aggregate at offset 0. + + +Lazy Bindings (LazyCompoundVal) +=============================== + +RegionStore implements an optimization for copying aggregates (structs and +arrays) called "lazy bindings", implemented using a special SVal called +LazyCompoundVal. When the store is asked for the "binding" for an entire +aggregate (i.e. for an lvalue-to-rvalue conversion), it returns a +LazyCompoundVal instead. When this value is then stored into a variable, it is +bound as a Default value. This makes copying arrays and structs much cheaper +than if they had required memberwise access. + +Under the hood, a LazyCompoundVal is implemented as a uniqued pair of (region, +store), representing "the value of the region during this 'snapshot' of the +store". This has important implications for any sort of liveness or +reachability analysis, which must take the bindings in the old store into +account. + +Retrieving a value from a lazy binding happens in the same way as any other +Default binding: since there is no direct binding, the store manager falls back +to super-regions to look for an appropriate default binding. LazyCompoundVal +differs from a normal default binding, however, in that it contains several +different values, instead of one value that will appear several times. Because +of this, the store manager has to reconstruct the subregion chain on top of the +LazyCompoundVal region, and look up *that* region in the previous store. + +Here's a concrete example: + + CGPoint p; + p.x = 42; // A Direct binding is made to the FieldRegion 'p.x'. + CGPoint p2 = p; // A LazyCompoundVal is created for 'p', along with a + // snapshot of the current store state. This value is then + // used as a Default binding for the VarRegion 'p2'. + return p2.x; // The binding for FieldRegion 'p2.x' is requested. + // There is no Direct binding, so we look for a Default + // binding to 'p2' and find the LCV. + // Because it's an LCV, we look at our requested region + // and see that it's the '.x' field. We ask for the value + // of 'p.x' within the snapshot, and get back 42. -- cgit v1.2.1