summaryrefslogtreecommitdiff
path: root/chromium/docs/website/site/developers/design-documents/software-updates-courgette
diff options
context:
space:
mode:
authorAllan Sandfeld Jensen <allan.jensen@qt.io>2022-02-04 17:20:24 +0100
committerAllan Sandfeld Jensen <allan.jensen@qt.io>2022-02-12 08:15:25 +0000
commit8fa0776f1f79e91fc9c0b9c1ba11a0a29c05196b (patch)
tree788d8d7549712682703a0310ca4a0f0860d4802b /chromium/docs/website/site/developers/design-documents/software-updates-courgette
parent606d85f2a5386472314d39923da28c70c60dc8e7 (diff)
downloadqtwebengine-chromium-8fa0776f1f79e91fc9c0b9c1ba11a0a29c05196b.tar.gz
BASELINE: Update Chromium to 98.0.4758.90
Change-Id: Ib7c41539bf8a8e0376bd639f27d68294de90f3c8 Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>
Diffstat (limited to 'chromium/docs/website/site/developers/design-documents/software-updates-courgette')
-rw-r--r--chromium/docs/website/site/developers/design-documents/software-updates-courgette/index.md211
1 files changed, 211 insertions, 0 deletions
diff --git a/chromium/docs/website/site/developers/design-documents/software-updates-courgette/index.md b/chromium/docs/website/site/developers/design-documents/software-updates-courgette/index.md
new file mode 100644
index 00000000000..61297c4a0ce
--- /dev/null
+++ b/chromium/docs/website/site/developers/design-documents/software-updates-courgette/index.md
@@ -0,0 +1,211 @@
+---
+breadcrumbs:
+- - /developers
+ - For Developers
+- - /developers/design-documents
+ - Design Documents
+page_name: software-updates-courgette
+title: 'Software Updates: Courgette'
+---
+
+[TOC]
+
+#### How Courgette works
+
+As I described in *[Smaller is faster (and safer
+too)](http://blog.chromium.org/2009/07/smaller-is-faster-and-safer-too.html)*,
+we wrote a new differential compression algorithm for making Google Chrome
+updates significantly smaller.
+
+We want smaller updates because it *narrows the window of vulnerability*. If the
+update is a tenth of the size, we can push ten times as many per unit of
+bandwidth. We have enough users that this means more users will be protected
+earlier. A secondary benefit is that a smaller update will work better for users
+who don't have great connectivity.
+
+Rather than push put a whole new 10MB update, we send out a diff that takes the
+previous version of Google Chrome and generates the new version. We tried
+several binary diff algorithms and have been using
+[bsdiff](http://www.daemonology.net/bsdiff/) up until now. We are big fans of
+bsdiff - it is small and worked better than anything else we tried.
+
+But bsdiff was still producing diffs that were bigger than we felt were
+necessary. So we wrote a new diff algorithm that knows more about the kind of
+data we are pushing - large files containing compiled executables. Here are the
+sizes in bytes for the recent 190.1-&gt;190.4 update on the developer channel:
+
+<table>
+<tr>
+<td>Full update</td>
+<td>10,385,920</td>
+</tr>
+<tr>
+<td>bsdiff update</td>
+<td> 704,512</td>
+</tr>
+<tr>
+<td>Courgette update</td>
+<td> 78,848</td>
+</tr>
+</table>
+
+The small size in combination with Google Chrome's silent update means we can
+update as often as necessary to keep users safe.
+
+#### Compiled code
+
+The problem with compiled applications is that even a small source code change
+causes a disproportional number of byte level changes. When you add a few lines
+of code, for example, a range check to prevent a buffer overrun, all the
+subsequent code gets moved to make room for the new instructions. The compiled
+code is full of internal references where some instruction or datum contains the
+address (or offset) of another instruction or datum. It only takes a few source
+changes before almost all of these internal pointers have a different value, and
+there are a lot of them - roughly half a million in a program the size of
+chrome.dll.
+
+The source code does not have this problem because all the entities in the
+source are *symbolic*. Functions don't get committed to a specific address until
+very late in the compilation process, during assembly or linking. If we could
+step backwards a little and make the internal pointers symbolic again, could we
+get smaller updates?
+
+Courgette uses a primitive disassembler to find the internal pointers. The
+disassembler splits the program into three parts: a list of the internal
+pointer's target addresses, all the other bytes, and an 'instruction' sequence
+that determines how the plain bytes and the pointers need to be interleaved and
+adjusted to get back the original input. We call this an 'assembly language'
+because we can run an 'assembler' to process the instructions and emit a
+sequence of bytes to recover the original file.
+
+The non-pointer part is about 80% of the size of the original program, and
+because it does not have any pointers mixed in, it tends to be well behaved,
+having a diff size that is in line with the changes in the source code. Simply
+converting the program into the assembly language form makes the diff produced
+by bsdiff about 30% smaller.
+
+We bring the pointers under control by introducing 'labels' for the addresses.
+The addresses are stored in an array and the list of pointers is replaced by a
+list of array indexes. The array is a primitive 'symbol table', where the names
+of the symbols, or 'labels' are the integer indexes into the array. What we get
+from the symbol table is a degree of freedom in how we express the program. We
+can move the addresses around in the array provided we make the corresponding
+changes to the list of indexes.
+
+How do we use this to generate a better diff? With bsdiff we would compute the
+new file, 'update' from the 'original' like this:
+
+server:
+diff = bsdiff(original, update)
+transmit diff
+
+client:
+
+receive diff
+
+update = bspatch(original, diff)
+
+(The server would pre-compute diff so that it could be transmitted immediately)
+
+Courgette transforms the program into the primitive assembly language and does
+the diffing at the assembly level:
+
+server:
+
+asm_old = disassemble(original)
+
+asm_new = disassemble(update)
+
+asm_new_adjusted = adjust(asm_new, asm_old)
+
+asm_diff = bsdiff(asm_old, asm_new_adjusted)
+
+transmit asm_diff
+
+client:
+
+receive asm_diff
+
+asm_old = disassemble(original)
+
+asm_new_adjusted = bspatch(asm_old, asm_diff)
+
+update = assemble(asm_new_adjusted)
+
+The special sauce is the adjust step. Courgette moves the addresses within the
+asm_new symbol table to minimize the size of asm_diff. Addresses in the two
+symbol tables are matched on their statistical properties which ensures the
+index lists have many long common substrings. The matching does not use any
+heuristics based on the surrounding code or debugging information to align the
+addresses.
+
+#### More than one executable, less than an executable
+
+For the above to work, 'assemble' and 'disassemble' have to be strict inverses,
+and 'original' and 'update' have to be single well-formed executable files. It
+is much more useful if 'original' and 'update' can contain several executables
+as well as a lot of non-compiled files like JavaScript and PNG images. For
+Google Chrome, the 'original' and 'update' are an archive file containing all
+the files needed to install and run the browser.
+
+We can think of a differential update as a prediction followed by a correction,
+a kind of guessing game. In its simplest form (just bsdiff / bspatch), the
+client has only a dumb guess, 'original', so the server sends a binary diff to
+correct 'original' to the desired answer, 'update'. Now what if the server could
+pass a hint that could be used to generate a better guess, but we are not sure
+the guess will be useful? We could insure against losing information by using
+the original and the guess together as the basis for the diff:
+
+server:
+
+hint = make_hint(original, update)
+
+guess = make_guess(original, hint)
+
+diff = bsdiff(concat(original, guess), update)
+
+transmit hint, diff
+
+client
+
+receive hint, diff
+
+guess = make_guess(original, hint)
+
+update = bspatch(concat(original, guess), diff)
+
+This system has some interesting properties. If the guess is the empty string,
+then we have the same diff as with plain bsdiff. If the guess is perfect, the
+diff will be tiny, simply a directive to copy the guess.
+
+Between the extremes, the guess could be a perfect subset of 'update'. Then
+bsdiff will construct a diff that mostly takes material from the perfect
+prediction and the original to construct the update. This is how Courgette deals
+with inputs like tar files containing both executable files and other files. The
+hint is the location of the embedded executables together with the asm_diff for
+each one.
+
+Once we have this prediction / correction scheme in place we can use it to
+reduce the amount of work that the client needs to do. Executables often have
+large regions that do not contain internal pointers, like the resource section
+which usually contains string tables and various visual elements like icons and
+bitmaps. The disassembler generates an assembly language program which pretty
+much says 'here is a big chunk of constant data', where the data is identical to
+the original file. bsdiff then generates a diff for the constant data. We can
+get substantially the same effect by omitting the pointer-free regions from the
+disassembly and letting the final diff do the work.
+
+#### Source Code
+
+Everyone loves source, so you can find it here:
+<https://chromium.googlesource.com/chromium/src/courgette/+/HEAD>
+
+#### Summary
+
+Courgette transforms the input into an alternate form where binary diffing is
+more effective, does the differential compression in the transformed space, and
+inverts the transform to get the patched output in the original format. With
+careful choice of the alternate format we can get substantially smaller updates.
+
+We are writing a more detailed paper on Courgette and will post an update when
+it is ready.