Optimizing Git’s Merge Machinery, #4
Editor’s note: This is the fourth post in a series by a Palantir Software Engineer on scaling Git’s merge and rename detection machinery. Click to read the first, second and third blog posts.

The first post in this series included some background information on how the merge machinery works and why I have worked on optimizing and rewriting it. That background will be critical to understanding this post. So, if you’re unaware that Git stores hashes for files in addition to having hashes for commits, or that merging involves three-way content merges, or that rebasing and cherry-picking involve a sequence of special merges, then you may want to go back and read that post first. If you know all that already, let’s keep going.
In this post, I’ll consider another optimization, one that I summarized in my first post simply as: “don’t repeatedly detect the same renames done by the upstream branch when rebasing (or cherry-picking) a series of commits.” However, in reality, it’s a bit more complicated.
Abstraction layers, again
In the previous post, we discussed abstraction layers. In particular, since each merge involved doing two separate sets of rename detection (from the merge base to each of the two commits being merged), and since rename detection is designed to only work on two endpoints (for merges, that’d be the merge base and the side of history we are detecting renames for), rename detection was inherently unaware of what was happening to files on the other side of history. However, in order to implement a “skip irrelevant renames” optimization, we needed both sets of rename detection operations to know information about all three endpoints (the merge base and the commit at the tip of each branch being merged). To implement that optimization, we had to breach the rename detection abstraction layer and pass in some kind of information from an additional side of history. This idea of breaching abstraction layers is relevant to this optimization as well.
As discussed in more detail in the first post in this series, rebases and cherry-picks of a series of commits involve a series of three-way merges. Each of those involves two sets of rename detection: from the “merge base” to side 1 of history (which happens to be the upstream side), and from the “merge base” to side 2 (which happens to be the commit we are trying to transplant). So, for transplanting N commits in a rebase or cherry-pick operation, there would be 2*N calls to the rename detection machinery, and N of those calls are for detecting renames done on the upstream side of history. Does that mean we are detecting the same renames N times?
Repeatedly re-detecting the same renames?
Let’s dive a little deeper to see how the repeated detection of the same renames on the upstream side of history comes up. Let’s revisit what it looks like to rebase or cherry-pick a series of commits. Consider the following setup:
      A---B---C topic
     /
D---E---F---G main
After rebasing or cherry-picking topic onto main, this will appear as:
A'---B'---C' topic
/
D---E---F---G main
The way the commits A’, B’, and C’ are created is through a series of merges, where rebase or cherry-pick sequentially uses each of the three A-B-C commits in a special merge operation. Let’s label the three commits in the merge operation as MERGE_BASE, MERGE_SIDE1, and MERGE_SIDE2. For this picture, the three commits for each of the three merges would be:
To create A’:
- MERGE_BASE: E
- MERGE_SIDE1: G
- MERGE_SIDE2: A
To create B’:
- MERGE_BASE: A
- MERGE_SIDE1: A’
- MERGE_SIDE2: B
To create C’:
- MERGE_BASE: B
- MERGE_SIDE1: B’
- MERGE_SIDE2: C
Now, if the upstream side of history had several renames between E and G, then the three-way merge creating A’ will detect those renames and reflect them in its result. This means that in the second merge, all those same renames will be re-detected when comparing A and A’…and get incorporated into B’. Then, in the third merge, all those same renames will be re-detected again when comparing B and B’ and then get incorporated into C’. Thus, we are repeatedly re-detecting the renames from the upstream side of history with each commit that we are transplanting in our rebase or cherry-pick.
Repeatedly re-detecting renames is bad enough, but it’s even worse in this instance: typically, the upstream side of history will have far more commits and changes than the topic branch, so rename detection for the upstream side compares two commits that are relatively far apart in history. Having commits that are far apart means more changes, including many more files to pair up with rename detection. More files to pair up means the upstream side of history is the side that is much more computationally expensive to detect renames for…and yet it’s also the side of history where we repeatedly re-detect renames. But there’s a silver lining: this suggests a simple solution.
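To make that repetition concrete, here is a highly simplified sketch in C of the loop of merges described above. This is not Git’s actual code: detect_renames() and three_way_merge() are stand-ins for the real machinery, but the shape of the loop is the point, since the base-to-side-1 rename detection keeps re-discovering the same upstream renames on every pick.

    /* Simplified sketch, not Git's real code. */
    struct commit;  /* opaque for the purposes of this sketch */
    void detect_renames(struct commit *base, struct commit *side);
    struct commit *three_way_merge(struct commit *base,
                                   struct commit *side1,
                                   struct commit *side2);

    /* Transplant n commits (A, B, C, ...) onto the upstream tip. */
    void transplant(struct commit *base,          /* parent of to_pick[0], i.e. E       */
                    struct commit *upstream_tip,  /* tip of the upstream branch, i.e. G */
                    struct commit **to_pick, int n)
    {
        struct commit *side1 = upstream_tip;

        for (int i = 0; i < n; i++) {
            struct commit *side2 = to_pick[i];

            /* Upstream side: E->G on the first pick, then A->A', then B->B';
             * the same upstream renames get re-detected on every iteration. */
            detect_renames(base, side1);
            /* The commit being transplanted. */
            detect_renames(base, side2);

            struct commit *result = three_way_merge(base, side1, side2);

            base  = side2;   /* the commit we just picked becomes the next merge base */
            side1 = result;  /* the replacement we created becomes the next side 1    */
        }
    }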
Caching renames!
Realizing that we are detecting the renames done on the upstream side of history N times for cherry-picking or rebasing a sequence of N commits suggests a really simple solution: remember the renames after the first rename detection (in an in-memory cache) and just use that cached knowledge for subsequent commits.
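Sketched in C, the cache can be as simple as a map from old path to new path for the upstream side, filled in after the first pick’s rename detection and consulted by later picks instead of repeating that work. The struct and helper names below are illustrative rather than the actual fields in Git’s merge machinery; the map type is Git’s strmap (a string-to-pointer hash map).

    #include "strmap.h"   /* Git's string -> pointer hash map */

    /* Illustrative sketch, not the real merge-ort data structure. */
    struct rename_cache {
        struct strmap cached_pairs;   /* upstream old path -> new path */
    };

    /* After the first pick detects the upstream renames, remember them. */
    static void cache_upstream_rename(struct rename_cache *cache,
                                      const char *old_path,
                                      const char *new_path)
    {
        strmap_put(&cache->cached_pairs, old_path, xstrdup(new_path));
    }

    /* On later picks, look up the cached answer instead of re-running rename
     * detection for the upstream side; NULL means "no cached rename of this path". */
    static const char *cached_upstream_rename(struct rename_cache *cache,
                                              const char *old_path)
    {
        return strmap_get(&cache->cached_pairs, old_path);
    }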
Sounds trivial. Why didn’t we do this before?
The idea is quite simple. I think it was overlooked for a combination of reasons: first, the sequence of merges done for cherry-picks or rebases is slightly unusual, which makes it easy to accidentally assume the wrong “merge bases” and presume that the optimization is completely unsafe (as evidenced by the first response to my initial posting of the series). Second, it violates the abstraction layers that exist (and since the abstraction layers mirror how we think of performing these operations, we have to view the operation from a different angle). And perhaps most importantly, while the idea is simple, verifying that it does not cause any weird bugs or behavioral differences (beyond running faster) is actually quite complicated. As it turns out, there is a minor behavioral difference (noted below); it’s just somewhere between innocuous and beneficial.
That point about verifying there are no weird bugs caused by caching renames is pretty important. Despite the fact that the idea is the easiest in this blog series to describe (just look how short the “Caching renames!” section is), the work needed to show it is a valid and safe optimization is more complex than for any other optimization I’ve described thus far in the series. For instance, here are some of the questions that had to be answered in order to ensure that this optimization was safe (note that I’ll use “transplanting commits” as a shorthand for either rebasing or cherry-picking commits):
- Are all the renames from transplanting a given commit also renames when transplanting a subsequent commit? Is there a particular set of files in a repository that will result in this not being the case? (If so, this suggests that some of the cached renames might be invalid.)
- Are the renames in a given commit enough to represent all the renames in a subsequent commit? Is there a special repository setup where this will not be the case? (If so, this suggests that we need more than just the cached renames.)
- Even if the renames in a subsequent transplanted commit are a subset of the renames from former transplanted commits, what if we didn’t detect all the renames in the previous commit(s)? The previous blogpost had an optimization for skipping irrelevant renames — what if those skipped renames were irrelevant for transplanting the previous commit, but are relevant for transplanting the next one?
- The above two questions suggest there are three types of source files typically fed into rename detection: those that will be the source filename for a rename, those that will correspond to deletes, and those which might be part of a rename or a delete but we didn’t spend the time to detect which of the two it was. Given that there are more than two types (rename, delete, or unknown), do we also need to cache deletes as well as renames?
- Is there other metadata that we need to cache, such as directory rename detection metadata?
- Speaking of directory renames, how are those handled? When one side of history adds a file to an old directory, and the other side renames the directory, do we just cache the directory rename or do we need to also cache the rename of the particular file in question? On which side of history?
- What about a more complex directory rename case: one side of history renames a file into an old directory (thus original/file → old-dir/file) and the other side renames the old directory (thus old-dir/ → new-dir/). What do we cache for that case? original/file → old-dir/file? original/file → new-dir/file? old-dir/file → new-dir/file? Does it depend on if the upstream side of history was the one renaming the old directory, or renaming the file into the old directory?
- What if the transplant operation (rebase or cherry-pick) has conflicts so we have to stop and give the user a chance to edit all the files in the working tree and include their changes in the commit. Does that cause us to need more renames for the next commit we will transplant? Are the renames from the previous transplant going to still be valid after the user is done editing?
- What if the rebase is interactive and the user stops and edits? Do we have all the valid renames, or are there more we need to detect? Are the renames we have even still valid?
- What if the user has a non-linear history they are transplanting? Or what if the user is doing a more involved interactive rebase using the ‘merge’ or ‘reset’ directives (i.e. rebasing merges or rebasing multiple branches)? In general, how do we know that sequential calls into the merge machinery are of such a nature that this optimization is valid to use?
- Related to the last question, are there other cases besides transplanting a linear series of commits where the caching of renames is a valid optimization?
- Can this optimization cause us to get different types of conflicts, or different merge results, or conflicts that we wouldn’t get without this optimization, or to not get conflicts that we would get with this optimization?
Answering these questions can get pretty hairy. If you’re curious about the details, you can read the extensive documentation I wrote up to explain them. That documentation is essentially a proof of what is and is not possible with this optimization, along with the limitations imposed to avoid surprises, and an explanation of one rare pair of behavioral differences that cannot be avoided. Here, I’ll just cover the potential behavioral differences and the limitations imposed.
Behavioral Differences
As noted above, given the limitations imposed on when this optimization is triggered, there is just one really rare case that can result in behavioral differences. The new behavior is still quite reasonable; in fact, I consider it to provide improved behavior. Unfortunately, trying to explain this rare case is somewhat gnarly…
Preconditions
Consider a case that satisfies all the following bullet points:
- The upstream side of history both modifies a file and renames it (A → B, for example)
- The modifications on the upstream side provide a net addition to the number of lines found in B relative to A.
- The non-upstream side of history dramatically reduces the size of A such that the number of lines remaining is strictly less than the net number of lines added to B on the upstream side.
- Despite the significant content changes made by both sides, they all merge cleanly; there are no conflicts. The result will be a file B with cleanly merged changes from both sides. Let’s refer to that resulting merged file as B-merged.
Example testcase
If it’s difficult to imagine an example that satisfies the above conditions, you can see a concrete one by looking at the “cherry-pick both a commit and its immediate revert” testcase that I added to the Git testsuite. (Note that the behavior tested in that patch is the behavior from before the optimization; the test was tweaked in the commit that turned the remembering-renames optimization on.) A quick summary of what is involved in that testcase:
- A is a file that originally has 20 lines
- upstream adds 10 new lines to the beginning of A, and then renames it to B (so that B has 30 lines total)
- the non-upstream side removes all but the first three lines of A (so it goes from 20 lines down to 3; note that 3 < 10, the number of new lines added upstream)
The above example will merge cleanly. B-merged will be a file named B which is 13 lines long, consisting of the 10 new lines that the upstream side added, followed by the first 3 lines of the original file A that the non-upstream side left intact.
Additional conditions and the possible observed differences
Now, for such a case, if the non-upstream side of history has another commit that also modifies A, then there are two possibilities:
First, if these further changes to A can also merge cleanly with B-merged:
- With the optimization, the transplant operation can continue on uninterrupted.
- Without the optimization, the transplant operation will halt with a modify/delete conflict for A, with both the newest version of A and the B-merged version of B present in the working copy and no signal to the user that A and B are likely related. The user will have to resolve the conflict manually and then continue the rebase or cherry-pick.
Second, if these further changes to A do not merge cleanly with B-merged:
- With the optimization, a content conflict will be reported, with the three-way merge of changes (including conflict markers) found in B for the user to sort out. The fact that B was a rename of A earlier in the series suggests this will likely make intuitive sense to the user, though they can split the files apart if they like since the different files will be available at different stages in the index for B.
- Without the optimization, a modify/delete conflict will be reported for A. Both the newest version of A and the B-merged version of B will be present in the working copy, but no signal will be provided to the user that A and B might be related. The user will just be responsible for attempting to reconcile the conflict and hopefully discover both A and B’s relation (if relevant) on their own.
Extended example testcase
When could this all happen? Well, let’s continue with our example testcase. Suppose the series of commits being transplanted consisted of just two commits: first the one that shrunk A from 20 lines down to 3, followed by an immediate revert of that commit that restores A from 3 lines back to the original 20. You would think that if you can pick a commit cleanly, then you should be able to pick its immediate revert cleanly right afterwards. With this optimization you can; without it, this is the rare case where the merge algorithm cannot, because it would not consider the shrunk A and B-merged to be related by a rename.
Limitations
As noted above, there are some limitations imposed about when the optimization is applied (either to ensure the optimization remains safe, or in one case for implementation simplicity). These limitations are:
- The caching renames optimization only applies for a linear sequence of commits being transplanted (i.e. cherry-picked or rebased)
- The cache is in-memory, and is not stored on disk.
- The cache is discarded at the end of the transplant operation, signaled by the working tree and index being updated. (Note: rebase and cherry-pick need to be reworked to only update the working tree and index at the end of the transplant operation, instead of uselessly updating after each and every commit that is transplanted.)
- The cache is discarded if both sides of history rename a file the same way (i.e. A is renamed to B on both sides of history).
Note that the second item, the cache being in-memory only, has several ramifications (see the sketch after this list):
- the optimization is not useful if only one commit is being transplanted
- the optimization is turned off once there are any conflicts since stopping for user input will cause the cache to be discarded
- the optimization is turned off when directory renames are present unless the user has set the merge.directoryRenames config setting to “true” (the default is “conflict”)
- the optimization is turned off for interactive rebases once users halt a rebase (via “edit” or “break” directives), since the halting causes the cache to be discarded.
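The last three items share a common cause: stopping for the user means writing the current state to the working tree and index and exiting the process, and since the cache exists only in that process’s memory, there is nothing left to reuse when the operation resumes. A tiny hypothetical sketch (the function and parameter names are made up, not Git’s actual API):

    #include <stdlib.h>

    struct merge_result;                                       /* hypothetical */
    void write_result_to_worktree_and_index(struct merge_result *result);

    /* Hypothetical sketch: when a pick cannot complete, e.g. because of
     * conflicts or an interactive "edit"/"break", the current state is
     * written to the working tree and index and the command exits.  The
     * rename cache lives only in this process's memory and is never written
     * to disk, so it is gone by the time "git rebase --continue" runs, and
     * the remaining picks must redo upstream rename detection. */
    static void stop_and_wait_for_user(struct merge_result *result, int status)
    {
        write_result_to_worktree_and_index(result);
        exit(status);   /* the in-memory rename cache dies with the process */
    }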
Advantages
Although there are certainly a number of things limiting when this optimization applies, it benefits from applying in a wide variety of circumstances when the previous optimizations in this blog series did not. Note the following restrictions on those other optimizations:
- Optimizing Git’s Merge Machinery, Part I: only applies when the file is renamed with NO content changes and the other commits on the same side of history do not modify the renamed file before or after the rename either
- Optimizing Git’s Merge Machinery, Part II: only applies when the file is moved into a different directory without renaming the file
- Optimizing Git’s Merge Machinery, Part III: only applies when the side of history that did not rename the file does not make any modifications to the file
In contrast, this optimization applies even if none of those conditions are met. So long as only one side renames a file and the merge is clean and automatic, the cached renames optimization is applicable for transplanting all the remaining commits in the linear history being cherry-picked or rebased.
Results
For the particular testcase I’ve been focusing on in this series, a testcase involving rebasing 35 patches onto a branch that renamed ~26K files, this speeds up the overall runtime by about 13%:
               Before                After
mega-renames:  11.435 s ± 0.158 s    10.127 s ± 0.073 s
However, this particular testcase is one that is nearly optimal for the previous three optimizations: its upstream side contains a commit that simply renamed a top-level directory (drivers/ → pilots/), resulting in what looks to Git like 26,000 renames (one for each file under the drivers/ directory). When I first implemented this optimization, I actually saw a speedup factor of 2 on the same testcase, but that was because I hadn’t yet found multiple additional tweaks to the optimizations from the previous blog posts that made them even more effective. Since this optimization tends to pick up the slack when the other optimizations cannot kick in, making the other optimizations more effective made this one look less effective. But a ~13% speedup is still a pretty nice gain.
It can also be interesting to look at the performance we would see if we threw out the optimizations from Parts II and III of this blog series, to gauge how much of a gain this optimization provides on its own. In that case, we see a speedup of roughly 8.75x:
               Before Basename Series    After Just This Series
mega-renames:  1799.937 s ± 0.493 s      205.709 s ± 0.457 s
So, this optimization can help significantly in the cases where the other optimizations are not effective.
This optimization will appear in Git 2.33 in August 2021. For more details about this optimization, please see the upstream thread.
With the four optimizations covered thus far, the time spent on rename detection has dropped to approximately 9% of overall runtime for the mega-renames testcase above. The remaining two posts in this series will focus on optimizing other parts of the merge machinery.