summaryrefslogtreecommitdiff
path: root/docs
diff options
context:
space:
mode:
authorRussell Belfer <rb@github.com>2013-06-10 10:10:39 -0700
committerRussell Belfer <rb@github.com>2013-06-10 10:10:39 -0700
commit114f5a6c41ea03393e00ae41126a6ddb0ef39a15 (patch)
treef579e849a72749123a54483180726396244177b2 /docs
parent7000f3fa7bad25ec07355d6afb640ea272201dff (diff)
downloadlibgit2-114f5a6c41ea03393e00ae41126a6ddb0ef39a15.tar.gz
Reorganize diff and add basic diff driver
This is a significant reorganization of the diff code to break it into a set of more clearly distinct files and to document the new organization. Hopefully this will make the diff code easier to understand and to extend. This adds a new `git_diff_driver` object that looks of diff driver information from the attributes and the config so that things like function content in diff headers can be provided. The full driver spec is not implemented in the commit - this is focused on the reorganization of the code and putting the driver hooks in place. This also removes a few #includes from src/repository.h that were overbroad, but as a result required extra #includes in a variety of places since including src/repository.h no longer results in pulling in the whole world.
Diffstat (limited to 'docs')
-rw-r--r--docs/diff-internals.md89
1 files changed, 89 insertions, 0 deletions
diff --git a/docs/diff-internals.md b/docs/diff-internals.md
new file mode 100644
index 000000000..1983b7939
--- /dev/null
+++ b/docs/diff-internals.md
@@ -0,0 +1,89 @@
+Diff is broken into four phases:
+
+1. Building a list of things that have changed. These changes are called
+ deltas (git_diff_delta objects) and are grouped into a git_diff_list.
+2. Applying file similarity measurement for rename and copy detection (and
+ to potentially split files that have changed radically). This step is
+ optional.
+3. Computing the textual diff for each delta. Not all deltas have a
+ meaningful textual diff. For those that do, the textual diff can
+ either be generated on the fly and passed to output callbacks or can be
+ turned into a git_diff_patch object.
+4. Formatting the diff and/or patch into standard text formats (such as
+ patches, raw lists, etc).
+
+In the source code, step 1 is implemented in `src/diff.c`, step 2 in
+`src/diff_tform.c`, step 3 in `src/diff_patch.c`, and step 4 in
+`src/diff_print.c`. Additionally, when it comes to accessing file
+content, everything goes through diff drivers that are implemented in
+`src/diff_driver.c`.
+
+External Objects
+----------------
+
+* `git_diff_options` repesents user choices about how a diff should be
+ performed and is passed to most diff generating functions.
+* `git_diff_file` represents an item on one side of a possible delta
+* `git_diff_delta` represents a pair of items that have changed in some
+ way - it contains two `git_diff_file` plus a status and other stuff.
+* `git_diff_list` is a list of deltas along with information about how
+ those particular deltas were found.
+* `git_diff_patch` represents the actual diff between a pair of items. In
+ some cases, a delta may not have a corresponding patch, if the objects
+ are binary, for example. The content of a patch will be a set of hunks
+ and lines.
+* A `hunk` is range of lines described by a `git_diff_range` (i.e. "lines
+ 10-20 in the old file became lines 12-23 in the new"). It will have a
+ header that compactly represents that information, and it will have a
+ number of lines of context surrounding added and deleted lines.
+* A `line` is simple a line of data along with a `git_diff_line_t` value
+ that tells how the data should be interpretted (e.g. context or added).
+
+Internal Objects
+----------------
+
+* `git_diff_file_content` is an internal structure that represents the
+ data on one side of an item to be diffed; it is an augmented
+ `git_diff_file` with more flags and the actual file data.
+** it is created from a repository plus a) a git_diff_file, b) a git_blob,
+ or c) raw data and size
+** there are three main operations on git_diff_file_content:
+*** _initialization_ sets up the data structure and does what it can up to,
+ but not including loading and looking at the actual data
+*** _loading_ loads the data, preprocesses it (i.e. applies filters) and
+ potentially analyzes it (to decide if binary)
+*** _free_ releases loaded data and frees any allocated memory
+
+* The internal structure of a `git_diff_patch` stores the actual diff
+ between a pair of `git_diff_file_content` items
+** it may be "unset" if the items are not diffable
+** "empty" if the items are the same
+** otherwise it will consist of a set of hunks each of which covers some
+ number of lines of context, additions and deletions
+** a patch is created from two git_diff_file_content items
+** a patch is fully instantiated in three phases:
+*** initial creation and initialization
+*** loading of data and preliminary data examination
+*** diffing of data and optional storage of diffs
+** (TBD) if a patch is asked to store the diffs and the size of the diff
+ is significantly smaller than the raw data of the two sides, then the
+ patch may be flattened using a pool of string data
+
+* `git_diff_output` is an internal structure that represents an output
+ target for a `git_diff_patch`
+** It consists of file, hunk, and line callbacks, plus a payload
+** There is a standard flattened output that can be used for plain text output
+** Typically we use a `git_xdiff_output` which drives the callbacks via the
+ xdiff code taken from core Git.
+
+* `git_diff_driver` is an internal structure that encapsulates the logic
+ for a given type of file
+** a driver is looked up based on the name and mode of a file.
+** the driver can then be used to:
+*** determine if a file is binary (by attributes, by git_diff_options
+ settings, or by examining the content)
+*** give you a function pointer that is used to evaluate function context
+ for hunk headers
+** At some point, the logic for getting a filtered version of file content
+ or calculating the OID of a file may be moved into the driver.
+