Diffstat (limited to 'chromium/docs/website/site/Home/chromium-security/articles/gwp-asan/index.md')
-rw-r--r-- | chromium/docs/website/site/Home/chromium-security/articles/gwp-asan/index.md | 431 |
1 files changed, 431 insertions, 0 deletions
diff --git a/chromium/docs/website/site/Home/chromium-security/articles/gwp-asan/index.md b/chromium/docs/website/site/Home/chromium-security/articles/gwp-asan/index.md new file mode 100644 index 00000000000..d82347bf30b --- /dev/null +++ b/chromium/docs/website/site/Home/chromium-security/articles/gwp-asan/index.md @@ -0,0 +1,431 @@ +--- +breadcrumbs: +- - /Home + - Chromium +- - /Home/chromium-security + - Chromium Security +- - /Home/chromium-security/articles + - Articles +page_name: gwp-asan +title: 'GWP-ASan: Sampling heap memory error detection in-the-wild' +--- + +By Vlad Tsyrklevich, Dynamic Tools Teams — November 2019 + +Memory safety errors, like use-after-frees and out-of-bounds reads/writes, are a +leading source of vulnerabilities in C/C++ applications. Despite investments in +preventing and detecting these errors in Chrome, over 60% of high severity +vulnerabilities in Chrome are memory safety errors. Some memory safety errors +don’t lead to security vulnerabilities but simply cause crashes and instability. + +Chrome uses state-of-the-art techniques to prevent these errors, including: + + [Coverage-guided](https://llvm.org/docs/LibFuzzer.html) + [fuzzing](https://en.wikipedia.org/wiki/American_fuzzy_lop_(fuzzer)) with + [AddressSanitizer](https://clang.llvm.org/docs/AddressSanitizer.html) (ASan) + + Unit and integration testing with ASan + + Defensive programming, like custom libraries to perform safe math or provide + bounds checked containers + + Mandatory code review + +Chrome also makes use of sandboxing and exploit mitigations to complicate +exploitation of memory errors that go undetected by the methods above. + +AddressSanitizer is a compiler instrumentation that finds memory errors +occurring on the heap, stack, or in globals. ASan is highly effective and one of +the lowest overhead instrumentations available that detects the errors that it +does; however, it still incurs an average 2-3x performance and memory overhead. 
+This makes it suitable for use with unit tests or fuzzing, but not deployment to +end users. Chrome used to deploy [SyzyASAN instrumented +binaries](https://blog.chromium.org/2013/05/testing-chromium-syzyasan-lightweight.html) +to detect memory errors. SyzyASAN had a similar overhead so it was only deployed +to a small subset of users on the canary channel. It was discontinued after the +Windows toolchain switched to LLVM. + +GWP-ASan, also known by its recursive backronym, GWP-ASan Will Provide +Allocation Sanity, is a sampling allocation tool designed to detect heap memory +errors occurring in production with negligible overhead. Because of its +negligible overhead we can deploy GWP-ASan to the entire Chrome user base to +find memory errors happening in the real world that are not caught by fuzzing or +testing with ASan. Unlike ASan, GWP-ASan can not find memory errors on the stack +or in globals. + +GWP-ASan is currently enabled for all Windows and macOS users for allocations +made using malloc() and PartitionAlloc. It is only enabled for a small fraction +of allocations and processes to reduce performance and memory overhead to a +negligible amount. At the time of writing it has found [over sixty +bugs](https://bugs.chromium.org/p/chromium/issues/list?q=Hotlist%3DGWP-ASan&can=1) +(many are still restricted view). About 90% of the issues GWP-ASan has found are +use-after-frees. The remaining are out-of-bounds reads and writes. + +Design + +Overview + +GWP-ASan is conceptually similar to +[ElectricFence](https://en.wikipedia.org/wiki/Electric_Fence) or +[PageHeap](https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/gflags-and-pageheap). +GWP-ASan installs an allocator instrumentation that samples allocations to a +debug allocator that places allocations on their own page, buttressed on both +sides by guard pages. 
New allocations are randomly either left- or right-aligned +within the page so that accessing the allocation below or above its bounds +causes a crash. When the allocation is freed, the page is unmapped so that a +use-after-free also immediately crashes. The allocator limits itself to a fixed +amount of memory to control memory overhead and samples allocation to the debug +allocator to reduce its high performance overhead. + +Use-after-frees and out-of-bounds accesses are often hard to debug because they +corrupt unrelated memory which can lead to crashes in unrelated code. GWP-ASan +simplifies debugging by causing a crash immediately at the site of the invalid +memory access. Furthermore, when a crash occurs a special crash handler hook +reports additional information, like allocation and deallocation stack traces, +to aid debugging. This metadata is similar to what AddressSanitizer provides and +has been shown to be very useful in identifying and fixing memory errors. + +GWP-ASan is a heap-only instrumentation so it does not find memory errors on the +stack or in globals that AddressSanitizer would; however, it can find some +memory errors that ASan would not. ASan works by instrumenting memory accesses +during compilation and makes use of ‘interceptors’ to detect misuse of common +library functions. Because GWP-ASan uses native memory management to detect +memory errors it doesn’t require interceptors to detect invalid memory use in +system libraries. This means it can identify API misuse for uncommon APIs that +don’t have interceptors, or even detect memory errors that occur due to bugs in +system libraries—something ASan can’t do without recompiling those potentially +proprietary libraries. + +GWP-ASan is only as effective as the number of allocation call sites it +instruments. For an internal Chrome allocator like PartitionAlloc it is possible +to intercept all uses; however, for malloc/free we may only be able to +instrument a subset of allocations. 
For example, on Windows we instrument malloc +and free by overriding the symbols for modules we build linked against //base, +so some DLLs shipped with Chrome—let alone Windows system code—may not be +instrumented. On macOS however the system allocator allows adding global hooks +meaning we can +([and](https://bugs.chromium.org/p/chromium/issues/list?q=Hotlist%3DGWP-ASan%20Component%3DInternals%3EPlatformIntegration&can=1) +[do](https://support.apple.com/en-us/HT210634)) detect memory errors from +allocations originating in code we don’t control, like Apple system libraries. + +Allocator + +The GWP-ASan allocator reserves a fixed range of memory at initialization that +it uses to service allocations to limit memory overhead. The memory range +consists of pages intended to be used to return allocations, called slots, +buttressed by guard pages as shown below. + +[<img alt="image" +src="/Home/chromium-security/articles/gwp-asan/gwp-asan-diagram1.png">](/Home/chromium-security/articles/gwp-asan/gwp-asan-diagram1.png) + +Allocations are randomly left- or right-aligned to help detect both underflows +and overflows. Like a traditional allocator, the GWP-ASan allocator always +suitably aligns allocations for any object of that size. This means that +right-aligned allocations are not always directly adjacent to the following +guard page, so small out-of-bounds accesses may go undetected. + +[<img alt="image" +src="/Home/chromium-security/articles/gwp-asan/gwp-asan-diagram2.png">](/Home/chromium-security/articles/gwp-asan/gwp-asan-diagram2.png) + +An array of allocation metadata is also maintained on the side to store stack +traces and other metadata for individual slots. + +The allocator has three primary tunable parameters: MaxSimultaneousAllocations, +MaxMetadata, and ReservedSlots. MaxSimultaneousAllocations controls the maximum +number of allocations that can be simultaneously allocated. 
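The slot layout and alignment rules described above can be illustrated with a small sketch. It is not Chrome's implementation: the 4096-byte page size, the assumption that the region begins with a guard page, and the helper names `slot_base`, `place_allocation`, and `slack_before_guard` are all hypothetical and chosen only for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def slot_base(region_base: int, slot_index: int) -> int:
    """Address of a slot, assuming the layout guard, slot 0, guard, slot 1, ..."""
    return region_base + (2 * slot_index + 1) * PAGE_SIZE


def place_allocation(slot: int, size: int, alignment: int, right_align: bool) -> int:
    """Return the address handed out for an allocation inside a slot."""
    if not 0 < size <= PAGE_SIZE:
        raise ValueError("this sketch only services allocations up to one page")
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    if not right_align:
        return slot  # starts directly after the preceding guard page
    # Highest suitably aligned address at which the allocation still fits.
    unaligned = slot + PAGE_SIZE - size
    return unaligned - (unaligned % alignment)


def slack_before_guard(slot: int, address: int, size: int) -> int:
    """Bytes between the end of the allocation and the following guard page."""
    return slot + PAGE_SIZE - (address + size)


slot = slot_base(0x10000, 0)
addr = place_allocation(slot, size=24, alignment=16, right_align=True)
print(hex(addr), slack_before_guard(slot, addr, 24))
```

In this example a 24-byte allocation with 16-byte alignment ends 8 bytes short of the guard page, so an overflow of up to 8 bytes would stay inside the slot and go undetected, which is the limitation the text describes.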
+ +Once every usable slot has been allocated and deallocated, they are reused to +service new allocations. When a use-after-free occurs the use may not occur +immediately after deallocation. If the slot has been reallocated then the +use-after-free will not behave as expected. If the slot is still allocated then +the use won’t crash, but if it is deallocated then it will cause a crash but the +metadata for the slot will have the wrong allocation/deallocation stack traces. + +Like ASan, GWP-ASan also makes use of a quarantine to help improve +use-after-free detection. ReservedSlots is always greater than or equal to +MaxSimultaneousAllocations and controls the number of slots we allocated virtual +memory for. If ReservedSlots > MaxSimultaneousAllocations, then not all slots +can be simultaneously allocated. If slots are allocated in a round-robin fashion +then a slot will not be re-used until at least (ReservedSlots - +MaxSimultaneousAllocations) allocations have taken place, forming a rudimentary +quarantine. This delays the amount of time until a slot is re-used, improving +use-after-free detection at the expense of using more memory. The allocator +consumes more virtual memory for the additional quarantine slots and more +physical memory storing allocation metadata about those quarantine slots. Each +slot’s metadata consumes about 400 bytes, primarily to store compressed +allocation/deallocation stack traces, compared to 4 kilobytes for every +allocation. As a result, setting ReservedSlots to be slightly greater than +MaxSimultaneousAllocations doesn’t significantly increase the amount of memory +used. + +The rudimentary quarantine described above is sufficient to delay slot re-use to +accurately detect use-after-frees occurring shortly after deallocation; however, +use-after-frees that occur long after deallocation are likely to access slots +that have already been reallocated. 
This can lead to long-lived use-after-frees +causing reports with numerous different stack traces for unrelated allocations +and deallocations, making it difficult to identify the real +allocation/deallocation call sites. This could be improved by making +ReservedSlots orders of magnitude larger than MaxSimultaneousAllocations; +however, the amount of additional allocation metadata that this would require +allocating would significantly increase GWP-ASan’s memory profile. + +To address this, GWP-ASan makes use of a third MaxMetadata parameter to limit +the number of slots for which we store metadata. We tune the allocator such that +ReservedSlots >= MaxMetadata >= MaxSimultaneousAllocations. GWP-ASan keeps +metadata for all currently allocated slots as well as some previously +deallocated slots. Because we discard metadata for some deallocated slots, we +can not always report allocation metadata if those slots are accessed because of +a use-after-free. By setting ReservedSlots to be an order of magnitude or more +greater than MaxMetadata and MaxSimultaneousAllocations, we make the quarantine +so large that many allocations have to occur before a slot is reused. This +ensures that even long-lived use-after-frees are not likely to be reallocated +before they’re accessed. If no metadata for the slot is available, then a useful +report can’t be sent; however, we eliminate many false reports. Short-lived +use-after-frees are still likely to be accessed before the metadata for the slot +is eliminated. Using random eviction to purge old metadata entries allows +metadata for old allocations to sometimes survive long enough to be reported for +long-lived use-after-frees. + +[<img alt="image" +src="/Home/chromium-security/articles/gwp-asan/diag4.png">](/Home/chromium-security/articles/gwp-asan/diag4.png) + +The debug allocator currently only services allocations less than or equal to a +single page in size. 
This is not a fundamental limitation in the design--it’s +possible to service larger allocations by increasing the size of a slot to be +multiple pages. It simply hasn’t been addressed yet because allocations larger +than a page are relatively rare. + +Unactionable crash reports can occur when a pointer is corrupted and the +overwritten value happens to accidentally point to a guard page or deallocated +slot in the GWP-ASan region. When such a wild pointer is accessed, it causes a +GWP-ASan report to be sent but it’s not actionable because the crash is caused +by an unrelated bug that corrupted the pointer value to point to an unrelated +allocation. In practice, such unactionable reports tend to occur on 32-bit +devices because the address space is smaller and the probability of a wild +pointer access touching the GWP-ASan region is much higher. GWP-ASan was +disabled for 32-bit desktop builds in order to eliminate these unactionable +reports. The allocator also explicitly maps the GWP-ASan memory region in high +memory locations to avoid the operating system choosing to place GWP-ASan region +in the bottom 32-bits of memory on 64-bit devices. + +Allocator Hooks + +GWP-ASan instruments an allocator’s allocation and deallocation routines. The +allocation instrumentation performs sampling to only route a fraction of +allocation requests to the debug allocator. The deallocation instrumentation +determines if the given allocation was allocated by the debug allocator and +routes the request to the debug allocator if so. Determining if an allocation +was returned by GWP-ASan is as simple as checking that the address is in +GWP-ASan’s fixed memory region and matching the address to the slot’s allocation +metadata. + +Production allocators are normally highly optimized so adding additional +instrumentation to the allocation/deallocation hot paths can easily introduce +significant performance regressions. 
While the debug allocator’s overhead can be +reduced to an arbitrary amount by adjusting the sampling probability, the +overhead of the instrumentation itself introduces a constant overhead. Some +allocation-heavy microbenchmarks regressed up to 5% when introducing allocator +instrumentation no matter how low the sampling probability was made. + +The instrumentation regression stems from the allocator hot-paths being very +performance sensitive and that instrumenting those hot-paths in Chrome requires +introducing a costly indirect call. GWP-ASan uses process sampling, only +enabling instrumentation for a fraction of processes, to reduce the +instrumentation overhead. This allows reducing the instrumentation overhead +arbitrarily and using more memory per-enabled process. + +Crash Handler + +Chrome is migrating to using crashpad for crash handling. Unlike its predecessor +breakpad, crashpad works almost entirely out-of-process. GWP-ASan registers a +hook in the crashpad process to inspect crashing processes in order to determine +if the crashes are related to GWP-ASan. On initialization, GWP-ASan saves the +address of the internal allocator object in a crashpad annotation so that the +crash handler can access it in the event of a crash. If the crashpad hook finds +this annotation, it reads the GWP-ASan allocator information to determine if the +crash occurred due to an access to a GWP-ASan allocation. If so, it attaches +[metadata](https://chromium.googlesource.com/chromium/src/+/refs/tags/79.0.3924.1/components/gwp_asan/crash_handler/crash.proto) +for the associated allocation to the crash report. + +# Tuning + +Chrome uses a multi-process model with different types of processes with varying +lifetimes and allocator demands. For example, there is a single browser process +for the entire lifetime of a given browser window while many renderer processes +can be launched and destroyed in a single tab. 
A browser process could be active +for weeks and make tens of billions of allocations while other processes may +live for milliseconds and make thousands of allocations. Accommodating both +types of processes is tricky because there is a tension between GWP-ASan +regularly sampling allocations and exhausting its fixed supply of memory. + +GWP-ASan exhausts its memory when all MaxSimultaneousAllocations slots are taken +and new allocations can’t be serviced. This can occur when all of the +allocations are long-lived, e.g. freed long after allocation or never freed at +all. If GWP-ASan runs out of allocations early in a process’ lifetime then the +majority of the process’ allocations go unsampled. + +In order to better understand allocation behavior we analyze heap traces for +different runs of Chromium. The following trace comes from opening a browser, +playing a YouTube video for ten seconds, and then closing the browser. The +following visualization shows allocation lifetimes for malloc() allocations in +the GPU process. + +<img alt="image" +src="https://lh3.googleusercontent.com/zNvAZs5kvLi3pWg95qzTx44-YEnV_cPhxUz5Zis7N3PHz3O8mTUl8AmyHRq4mBTyHlKLoHt8W4Ho-I4Ir8-mgShjxJBbBt4m0GMjUIOBPpf-paaeHpQcrjwLapXkHlvyK23uYzU-"> + +Every vertical bar represents two thousand allocations subdivided into different +allocation lifetimes. The horizontal axis is the process lifetime. This process +makes approximately 250,000 allocations. Most allocations are freed within 25 +milliseconds, and only 4% of allocations are never freed during the process’ +lifetime. + +The following graph is for allocations made using PartitionAlloc in the YouTube +renderer process: + +<img alt="image" +src="https://lh3.googleusercontent.com/DMRz54twMEpT7jM1N5YRptzCIbhaXXU3aAIfZ4cFbxEa47OcXLg6SosZJ4SN-TNEVkK8aVAv_jGzgdOvs18H8Bwatn0GdjLoswBywWCl83ON4fzTG8jpMoAJm0uJ-9firc5-7NtY"> + +This process makes about 1.1 million allocations and about 7% go unfreed. 
In +both examples, unfreed allocations cluster at the beginning of the process’ +lifetime. Because of the difference in number of total and long-lived +allocations, the renderer process may exhaust GWP-ASan allocations early with +the same parameters that would sample the GPU process without exhaustion. + +Long lifetime allocations can also lead to temporary allocator exhaustion, for +example if the allocations are not freed until right before process destruction. +Modeling simulated runs with different GWP-ASan configurations over different +heap traces best illustrates what allocator behavior can occur in practice. The +following is a simulated run for the renderer trace above with sampling +probability 1/1000 and 16 simultaneous allocations: + +<img alt="image" +src="https://lh3.googleusercontent.com/443plq4yGGNrGszyiQZqIye4zoN-pb09q0qAUzz5qOtatSNKuFzI8sz29Ehlsr_EQHRlfgk0hWRY1gpgoeLXB6gSrSgKqJ9crm3XztePIEYnKWY7w1-lYkWh6Z_85W_snmZYLGqF" +height=365 width=730.311111111111> + +The bars represent allocation lifetimes, with the vertical axis being time. In +the simulation above GWP-ASan runs out of allocations for most of the process +lifetime with occasional bursts of sampling as long-lived allocations are freed +and re-used until they are replaced by new long-lived allocations. + +To avoid allocator exhaustion, the allocator must use more memory per process or +reduce the sampling probability. The following is a simulation run with sampling +probability 1/8000 and 64 simultaneous allocations: + +<img alt="image" +src="https://lh6.googleusercontent.com/64as-cfN6-NoQqkrvxQ3bzziNscyMgOB56gqaqpaayVAcMHRxdTH_aCc5cg42k02T4xW8JXgojEWl9RZZNIGwCNK1X19lvWflOSc_Mgfg-tVEssi_BuJs35Xdl9dBXzvAOmYjVRS" +height=360 width=719.7338262476894> + +In this simulation GWP-ASan is able to evenly sample the entire process’ +lifetime despite the presence of long-lived and unfreed allocations. Some runs +may still be unlucky and run out of allocations early, but it’s far less common. 
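The trade-off between sampling probability and slot count can be explored with a toy simulation along the lines described above. This is a minimal sketch and not the tooling used to produce the graphs: the function name `simulate`, the trace format, and the use of the allocation index as a stand-in for time are assumptions made for illustration.

```python
import random


def simulate(lifetimes, sample_probability, max_simultaneous, seed=0):
    """Replay a heap trace through a sampling allocator with a fixed slot count.

    lifetimes[i] is the number of subsequent allocations after which
    allocation i is freed, or None if it is never freed. Returns a tuple
    (sampled, missed), where missed counts allocations that were selected
    for sampling but found every slot occupied.
    """
    rng = random.Random(seed)
    live = []  # free times of the allocations currently occupying slots
    sampled = missed = 0
    for now, lifetime in enumerate(lifetimes):
        # Release slots whose allocations have been freed by this point.
        live = [free_time for free_time in live if free_time > now]
        if rng.random() >= sample_probability:
            continue  # not selected; served by the regular allocator
        if len(live) >= max_simultaneous:
            missed += 1  # allocator exhausted
            continue
        sampled += 1
        live.append(float("inf") if lifetime is None else now + lifetime)
    return sampled, missed


# Unfreed allocations exhaust 16 slots immediately when everything is sampled.
print(simulate([None] * 100, sample_probability=1.0, max_simultaneous=16))
```

Running the same trace with a lower probability or more slots shows how exhaustion is pushed later into the process lifetime, at the cost of fewer sampled allocations or more memory.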
+ +In practice, because of process sampling we can allocate more memory per enabled +process. GWP-ASan’s production settings only sample a small fraction of +processes, so it’s safe to allocate more memory for every enabled process. + +Instead of uniformly reducing the sampling probability for all processes, +GWP-ASan picks a sampling probability from a range of probabilities at +initialization. The sampling probability may sometimes be more frequent (and +lead to early allocator exhaustion), or less frequent (and lead to fewer +detected errors), than optimal. However, it allows accommodating different +allocation behavior in different processes. + +# Results + +The Chrome project makes extensive use of ASan in unit tests and during fuzzing +with [ClusterFuzz](https://google.github.io/clusterfuzz/) to detect memory +errors early. As a result, the bugs GWP-ASan finds tend to be where our current +fuzzing and test infrastructure don't sufficiently test the underlying error +conditions. Unit and integration tests typically tend to only test expected +success and failure conditions. Fuzzers test a wider variety of inputs, but +coverage isn’t universal. Furthermore, fuzzing is well suited for testing +specific narrowly-scoped components like parsers and other input processors, but +not all memory safety errors in Chrome fit that description. + +Some of the types of bugs that GWP-ASan has been successful in finding include: + + Race conditions. These may manifest as races between two threads freeing an + allocation and using it, or an event firing at an inopportune time such that + an allocation used by the parent event loop is freed by the event. + ClusterFuzz may not be able to exercise the correct conditions to trigger + the race or may not reproduce the racy crash reliably enough to satisfy a + heuristic to avoid reporting false positives. + + Chrome- or OS-specific configuration bugs. 
Some bugs may only manifest in + configurations that are not exercised by Chrome’s testing and fuzzing + infrastructure. + + Bugs in UI code. Unit tests and fuzzers tend not to exercise UI code. UI + code is also susceptible to lifetime and bounds-related errors though they + are more likely to be stability issues instead of security issues. + +One example issue is [this](https://crbug.com/977341) bug in Skia. The +underlying memory error is a racy use-after-free where two threads +near-simultaneously free and access an allocation. This bug had been causing +crashes on macOS for a while, but it was difficult to spot the issue because the +crashes occurred in different places depending on which underlying allocation +was corrupted. With GWP-ASan it was immediately clear where the error occurred, +but both threads freeing and accessing the allocation were doing so after +locking the same mutex so it should have been impossible. With the use and +deallocation stack traces proving that this was occurring despite the mutex, it +was easy to track the bug down to the Skia mutex class. The macOS implementation +did not account for spurious wake-ups and could violate mutual exclusion. +Without the information provided by GWP-ASan, it would be difficult to debug +such an issue. + +As GWP-ASan was progressively rolled out to wider audiences, it detected rarer +and rarer bugs. Frequently occurring bugs may be detected within hours of a new +canary release while some rarely-occurring bugs have only been detected in the +stable population once so far. It’s possible to find these rare errors because +GWP-ASan is deployed widely and designed to minimize unactionable reports, but +there are likely to be rare errors that we don’t catch because increasing +sampling to detect them would require unacceptable memory and performance +overhead. 
The [ARM Memory Tagging +Extension](https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety) +and similar hardware-assisted memory tagging schemes would allow implementing a +[similar error +detector](https://github.com/google/sanitizers/blob/master/hwaddress-sanitizer/login_summer19_03_serebryany.pdf) +with much lower memory and performance overhead and a much higher probability of +detecting errors. Such memory tagging schemes also allow detecting stack bounds +and use-after-return errors and may even be useful as exploit mitigations. + +# Future Improvements + +GWP-ASan has a high memory overhead per allocation. Every allocation is stored +on its own page but Chrome’s median allocation size is only 32 bytes. It’s +possible to place multiple allocations on a single physical page and maintain +the ability to detect use-after-frees using a special virtual memory +configuration. The approach reduces GWP-ASan’s memory overhead at the cost of +reducing out-of-bounds error detection. + +Placing multiple allocations on the same virtual memory page would reduce +use-after-free detection because the page could not be unmapped until all of the +allocations on the page were deallocated. If a single allocation on that page +were to never be freed then use-after-free detection would be completely lost. + +It is possible to use the operating system’s shared memory facilities to work +around this constraint, because shared memory can be mapped multiple times in +the same process. This allows multiple virtual memory pages to point to the same +backing physical page. Multiple allocations can be placed on the same backing +physical page but every allocation can be given its own unique slot/virtual +page. This way, once an allocation is freed, the slot can be unmapped to detect +use-after-frees without interfering with the other allocations. 
Only a fraction +of allocations will be able to be left- or right-aligned within the page so +out-of-bounds error detection would suffer with this scheme; however, in +practice use-after-frees are much more common. + +[<img alt="image" +src="/Home/chromium-security/articles/gwp-asan/gwp-asan-diagram3.png">](/Home/chromium-security/articles/gwp-asan/gwp-asan-diagram3.png) + +This approach allows significantly increasing memory density and therefore the +number of simultaneous allocations. It’s conceivable that the memory overhead of +allocation metadata like stack traces would come to dominate GWP-ASan’s memory +usage instead of the wasted page overhead. + +Increasing the number of simultaneous allocations helps prevent allocator +exhaustion. Mobile platforms especially tend to be much more memory constrained +so deploying GWP-ASan in those environments may necessitate use of this +approach. + +Thanks to Matthew Denton, Adrian Taylor, Chris Palmer, Kostya Serebryany, Matt +Morehouse, and Mitch Phillips for their feedback.
\ No newline at end of file