delta/git.git, branch ks/combine-diff

tests: add checking that combine-diff emits only correct paths

2014-02-24T22:44:57+00:00

where "correct paths" stands for paths that are different to all
parents.

Up until now, we were testing combined diff only on one file, or on
several files which were all different (t4038-diff-combined.sh).

As recent thinko in "simplify intersect_paths() further" showed, and
also, since we are going to rework code for finding paths different to
all parents, lets write at least basic tests.

Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano

combine-diff: simplify intersect_paths() further

2014-02-24T22:44:57+00:00

Linus once said:

    I actually wish more people understood the really core low-level
    kind of coding. Not big, complex stuff like the lockless name
    lookup, but simply good use of pointers-to-pointers etc. For
    example, I've seen too many people who delete a singly-linked
    list entry by keeping track of the "prev" entry, and then to
    delete the entry, doing something like

	if (prev)
	    prev->next = entry->next;
	else
	    list_head = entry->next;

    and whenever I see code like that, I just go "This person
    doesn't understand pointers". And it's sadly quite common.

    People who understand pointers just use a "pointer to the entry
    pointer", and initialize that with the address of the
    list_head. And then as they traverse the list, they can remove
    the entry without using any conditionals, by just doing a "*pp =
    entry->next".

Applying that simplification lets us lose 7 lines from this function
even while adding 2 lines of comment.

I was tempted to squash this into the original commit, but because
the benchmarking described in the commit log is without this
simplification, I decided to keep it a separate follow-up patch.

Signed-off-by: Junio C Hamano

combine-diff: combine_diff_path.len is not needed anymore

2014-02-24T22:44:57+00:00

The field was used in order to speed-up name comparison and also to
mark removed paths by setting it to 0.

Because the updated code does significantly less strcmp and also
just removes paths from the list and free right after we know a path
will not be needed, it is not needed anymore.

Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano

combine-diff: optimize combine_diff_path sets intersection

2014-02-24T22:44:57+00:00

When generating combined diff, for each commit, we intersect diff
paths from diff(parent_0,commit) to diff(parent_i,commit) comparing
all paths pairs, i.e. doing it the quadratic way. That is correct,
but could be optimized.

Paths come from trees in sorted (= tree) order, and so does diff_tree()
emits resulting paths in that order too. Now if we look at diffcore
transformations, all of them, except diffcore_order, preserve resulting
path ordering:

    - skip_stat_unmatch, grep, pickaxe, filter
                            -- just skip elements -> order stays preserved

    - break                 -- just breaks diff for a path, adding path
                               dup after the path -> order stays preserved

    - detect rename/copy    -- resulting paths are emitted sorted
                               (verified empirically)

So only diffcore_order changes diff paths ordering.

But diffcore_order meaning affects only presentation - i.e. only how to
show the diff, so we could do all the internal computations without
paths reordering, and order only resultant paths set. This is faster,
since, if we know two paths sets are all ordered, their intersection
could be done in linear time.

This patch does just that.

Timings for `git log --raw --no-abbrev --no-renames` without `-c` ("git log")
and with `-c` ("git log -c") before and after the patch are as follows:

                linux.git v3.10..v3.11

            log     log -c

    before  1.9s    20.4s
    after   1.9s    16.6s

                navy.git    (private repo)

            log     log -c

    before  0.83s   15.6s
    after   0.83s    2.1s

P.S.

I think linux.git case is sped up not so much as the second one, since
in navy.git, there are more exotic (subtree, etc) merges.

P.P.S.

My tracing showed that the rest of the time (16.6s vs 1.9s) is usually
spent in computing huge diffs from commit to second parent. Will try to
deal with it, if I'll have time.

P.P.P.S.

For combine_diff_path, ->len is not needed anymore - will remove it in
the next noisy cleanup path, to maintain good signal/noise ratio here.

Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano

diff test: add tests for combine-diff with orderfile

2014-02-24T22:44:57+00:00

In the next patch combine-diff will have special code-path for taking
orderfile into account. Prepare for making changes by introducing
coverage tests for that case.

Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano

diffcore-order: export generic ordering interface

2014-02-24T22:44:57+00:00

diffcore_order() interface only accepts a queue of `struct
diff_filepair`.

In the next patches, we'll want to order `struct combine_diff_path`
by path, so let's first rework diffcore-order to also provide
generic low-level interface for ordering arbitrary objects, provided
they have path accessors.

The new interface is:

    - `struct obj_order`    for describing objects to ordering routine, and
    - order_objects()       for actually doing the ordering work.

Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano

tree-walk: finally switch over tree descriptors to contain a pre-parsed entry

2014-02-24T22:43:29+00:00

This continues 4651ece8 (Switch over tree descriptors to contain a
pre-parsed entry) and moves the only rest computational part

    mode = canon_mode(mode)

from tree_entry_extract() to tree entry decode phase - to
decode_tree_entry().

The reason to do it, is that canon_mode() is at least 2 conditional
jumps for regular files, and that could be noticeable should canon_mode()
be invoked several times.

That does not matter for current Git codebase, where typical tree
traversal is

    while (t->size) {
        sha1 = tree_entry_extract(t, &path, &mode);
        ...
        update_tree_entry(t);
    }

i.e. we do t -> sha1,path.mode "extraction" only once per entry. In such
cases, it does not matter performance-wise, where that mode
canonicalization is done - either once in tree_entry_extract(), or once
in decode_tree_entry() called by update_tree_entry() - it is
approximately the same.

But for future code, which could need to work with several tree_desc's
in parallel, it could be handy to operate on tree_desc descriptors, and
do "extracts" only when needed, or at all, access only relevant part of
it through structure fields directly.

And for such situations, having canon_mode() be done once in decode
phase is better - we won't need to pay the performance price of 2 extra
conditional jumps on every t->mode access.

So let's move mode canonicalization to decode_tree_entry(). That was the
final bit. Now after tree entry is decoded, it is fully ready and could
be accessed either directly via field, or through tree_entry_extract()
which this time got really "totally trivial".

Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano

revision: convert to using diff_tree_sha1()

2014-02-05T18:51:16+00:00

Since diff_tree_sha1() can now accept empty trees via NULL sha1, we
could just call it without manually reading trees into tree_desc and
duplicating code.

Besides, that

	if (!tree)
		return 0;

looked suspect - we were saying an invalid tree != empty tree, but maybe it is
better to just say the tree is invalid here, which is what diff_tree_sha1()
does for such case.

Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano

line-log: convert to using diff_tree_sha1()

2014-02-05T18:50:36+00:00

Since diff_tree_sha1() can now accept empty trees via NULL sha1, we
could just call it without manually reading trees into tree_desc and
duplicating code.

Cc: Thomas Rast 
Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano

tree-diff: convert diff_root_tree_sha1() to just call diff_tree_sha1 with old=NULL

2014-02-05T18:49:07+00:00

Now since diff_tree_sha1 understands NULL for both old and new, we could
indicate an empty tree for root commit by providing just NULL for old
sha1.

Signed-off-by: Kirill Smelkov 
Signed-off-by: Junio C Hamano