merge: cache negative cache results for similarity metrics

When computing renames, we cache the hash signatures for each of the potentially conflicting entries so that we do not need to repeatedly read the file and can at least halfway efficiently determine whether two files are similar enough to be deemed a rename. In order to make the hash signatures meaningful, we require at least four lines of data to be present, resulting in at least four different hashes that can be compared. Files that are deemed too small are not cached at all and will thus be repeatedly re-hashed, which is usually not a huge issue. The issue with above heuristic is in case a file does _not_ have at least four lines, where a line is anything separated by a consecutive run of "\n" or "\0" characters. For example "a\nb" is two lines, but "a\0\0b" is also just two lines. Taken to the extreme, a file that has megabytes of consecutive space- or NUL-only may also be deemed as too small and thus not get cached. As a result, we will repeatedly load its blob, calculate its hash signature just to finally throw it away as we notice it's not of any value. When you've got a comparitively big file that you compare against a big set of potentially renamed files, then the cost simply expodes. The issue can be trivially fixed by introducing negative cache entries. Whenever we determine that a given blob does not have a meaningful representation via a hash signature, we store this negative cache marker and will from then on not hash it again, but also ignore it as a potential rename target. This should help the "normal" case already where you have a lot of small files as rename candidates, but in the above scenario it's savings are extraordinarily high. To verify we do not hit the issue anymore with described solution, this commit adds a test that uses the exact same setup described above with one 50 megabyte blob of '\0' characters and 1000 other files that get renamed. Without the negative cache: $ time ./libgit2_clar -smerge::trees::renames::cache_recomputation >/dev/null real 11m48.377s user 11m11.576s sys 0m35.187s And with the negative cache: $ time ./libgit2_clar -smerge::trees::renames::cache_recomputation >/dev/null real 0m1.972s user 0m1.851s sys 0m0.118s So this represents a ~350-fold performance improvement, but it obviously depends on how many files you have and how big the blob is. The test number were chosen in a way that one will immediately notice as soon as the bug resurfaces.
author: Patrick Steinhardt <ps@pks.im> 2020-04-01 15:16:18 +0200
committer: Patrick Steinhardt <ps@pks.im> 2020-04-01 15:16:18 +0200
commit: 4dfcc50fe88e06a49a4ccccb048d47fe6be01292 (patch)
tree: 19feab609090df0487cb0049b5eadcd5861586ae /src/merge.c
parent: ca782c913b7052712c7a6163c102a01733202ebf (diff)
download: libgit2-4dfcc50fe88e06a49a4ccccb048d47fe6be01292.tar.gz
1 files changed, 20 insertions, 7 deletions
diff --git a/src/merge.c b/src/merge.c
index 05a776e45..afe69e564 100644
--- a/src/merge.c
+++ b/src/merge.c
@@ -68,6 +68,16 @@ struct merge_diff_df_data {
 	git_merge_diff *prev_conflict;
 };
 
+/*
+ * This acts as a negative cache entry marker. In case we've tried to calculate
+ * similarity metrics for a given blob already but `git_hashsig` determined
+ * that it's too small in order to have a meaningful hash signature, we will
+ * insert the address of this marker instead of `NULL`. Like this, we can
+ * easily check whether we have checked a gien entry already and skip doing the
+ * calculation again and again.
+ */
+static int cache_invalid_marker;
+
 /* Merge base computation */
 
 int merge_bases_many(git_commit_list **out, git_revwalk **walk_out, git_repository *repo, size_t length, const git_oid input_array[])
@@ -1027,6 +1037,9 @@ static int index_entry_similarity_calc(
 	git_object_size_t blobsize;
 	int error;
 
+	if (*out || *out == &cache_invalid_marker)
+		return 0;
+
 	*out = NULL;
 
 	if ((error = git_blob_lookup(&blob, repo, &entry->id)) < 0)
@@ -1047,6 +1060,8 @@ static int index_entry_similarity_calc(
 	error = opts->metric->buffer_signature(out, &diff_file,
 		git_blob_rawcontent(blob), (size_t)blobsize,
 		opts->metric->payload);
+	if (error == GIT_EBUFS)
+		*out = &cache_invalid_marker;
 
 	git_blob_free(blob);
 
@@ -1069,18 +1084,16 @@ static int index_entry_similarity_inexact(
 		return 0;
 
 	/* update signature cache if needed */
-	if (!cache[a_idx] && (error = index_entry_similarity_calc(&cache[a_idx], repo, a, opts)) < 0)
-		return error;
-	if (!cache[b_idx] && (error = index_entry_similarity_calc(&cache[b_idx], repo, b, opts)) < 0)
+	if ((error = index_entry_similarity_calc(&cache[a_idx], repo, a, opts)) < 0 ||
+	    (error = index_entry_similarity_calc(&cache[b_idx], repo, b, opts)) < 0)
 		return error;
 
 	/* some metrics may not wish to process this file (too big / too small) */
-	if (!cache[a_idx] || !cache[b_idx])
+	if (cache[a_idx] == &cache_invalid_marker || cache[b_idx] == &cache_invalid_marker)
 		return 0;
 
 	/* compare signatures */
-	if (opts->metric->similarity(
-		&score, cache[a_idx], cache[b_idx], opts->metric->payload) < 0)
+	if (opts->metric->similarity(&score, cache[a_idx], cache[b_idx], opts->metric->payload) < 0)
 		return -1;
 
 	/* clip score */
@@ -1550,7 +1563,7 @@ int git_merge_diff_list__find_renames(
 done:
 	if (cache != NULL) {
 		for (i = 0; i < cache_size; ++i) {
-			if (cache[i] != NULL)
+			if (cache[i] != NULL && cache[i] != &cache_invalid_marker)
 				opts->metric->free_signature(cache[i], opts->metric->payload);
 		}
author	Patrick Steinhardt <ps@pks.im>	2020-04-01 15:16:18 +0200
committer	Patrick Steinhardt <ps@pks.im>	2020-04-01 15:16:18 +0200
commit	4dfcc50fe88e06a49a4ccccb048d47fe6be01292 (patch)
tree	19feab609090df0487cb0049b5eadcd5861586ae /src/merge.c
parent	ca782c913b7052712c7a6163c102a01733202ebf (diff)
download	libgit2-4dfcc50fe88e06a49a4ccccb048d47fe6be01292.tar.gz