We migrated from Mercurial to Git in 2014 to improve local performance, and began consolidating the repositories that hosted our backend code. But as this monorepo grew, we ran into Git performance problems that grew linearly with the number of files we added. Inconveniently, the problem was most severe on macOS—the platform most of our engineers work on. Fortunately, between upstream improvements in Git itself and a small wrapper of custom code, we’ve been able to speed up Git operations without fragmenting our unified and growing repository.
Towards a monorepo
Originally, our code was distributed across several dozen Mercurial repositories. But around 2014, we tested and found that we’d have better local performance with Git. More importantly, Git had become an industry standard tool that most new engineers had already used. So, over the course of a few Hack Weeks, a small group of engineers migrated many of our repositories from Mercurial to Git, and planned to migrate the rest.
The backend business logic for Dropbox at the time lived mainly in a monolithic Python web application, with infrastructural components built independently. Over time, we extracted targeted components from the monolith into separate services, but a large number of our engineers still contribute to the monolith. For example, Magic Pocket, our custom block storage system, was built separately from day one, while our metadata store, Edgestore, started off as a simple client-side Python library that eventually spun off into a sophisticated service.
In practice, this meant the monolith was the largest and often the only consumer of a smaller service or some code in a separate repository. Consequently, developers would write integration tests for these services in the monolith’s repository. These tests often failed when the service’s code changed, and since our continuous integration (CI) process simply ran all tests at HEAD for each repository (with no notion of pinning), it was hard to diagnose the root cause of problems.
The high frequency of changes meant there would be multiple failures a week. Debugging test failures, tracking down changes, and fixing the build before the daily release was a painful process. It created a lot of work for the release engineers and made the release process slow and inconsistent. Because these test failures were so unpredictable, engineers stopped trusting CI results and stopped inspecting failures, which led to more problems. To solve this, we needed a single commit identifier that reproducibly determined the state of the code we were testing. We needed either a polyglot tool (or one tool per language) to pin the version of each dependency directly, a repository-based pinning mechanism like git subtree, or to merge our repositories into one.
For a while, we had a “super repo” that received a check-in every time one of the server-related repositories changed (via a Git pre-receive hook). This provided a global ordering on changes and test results, and helped narrow down which repository caused a breakage. Eventually we realized it would be simplest to merge all the relevant repositories. The combined repository was not that large (~50,000 files), and we estimated that Git performance would be acceptable for at least a few years. This merge, combined with various other initiatives to improve testing infrastructure and quality, helped us keep master much more stable with much less work, and smoothed our release processes. Since then, we’ve seen even more benefits, like simple code sharing, easy large-scale refactoring, good interoperability with monorepo-centric build tools like Bazel, and simple automatic bisects and reverts.
This worked for us for a few years, but as expected, Git performance started degrading. Specifically, it seemed to degrade linearly with the number of files being added to the repository. Common operations like git status were getting slower with time. So in late 2017, we started investigating the various options to speed up Git for our users.
To start, checking in large files severely degrades the performance of many Git operations. Fortunately, we didn’t have workflows that depended on large files, so we set up a pre-receive hook to limit the size of new files pushed to our repositories and prevent regressions.
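As a rough illustration (not Dropbox’s actual hook), a pre-receive hook along these lines can reject pushes that introduce oversized blobs. The 5 MB limit below is a made-up value; Git supplies the updated refs on stdin, and the hook walks the newly pushed objects:

```shell
#!/bin/sh
# Illustrative pre-receive hook: reject pushes containing blobs larger
# than MAX_BYTES. The limit is hypothetical; tune it for your repository.
MAX_BYTES=$((5 * 1024 * 1024))

# Succeeds (exit 0) if a blob of $1 bytes is within the limit.
blob_within_limit() {
    [ "$1" -le "$MAX_BYTES" ]
}

check_push() {
    # Git feeds one "<old-sha> <new-sha> <refname>" line per updated ref.
    while read -r _old new _ref; do
        # List objects reachable from the new tip but not from any
        # existing ref, then batch-check each object's type and size.
        git rev-list --objects "$new" --not --all |
        awk '{print $1}' |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname)' |
        while read -r type size sha; do
            if [ "$type" = blob ] && ! blob_within_limit "$size"; then
                echo "error: blob $sha is $size bytes (limit is $MAX_BYTES)" >&2
                exit 1
            fi
        done || exit 1
    done
}

# When installed as hooks/pre-receive on the server, Git supplies the
# ref updates on stdin; uncomment to activate:
# check_push
```

Rejecting large blobs at push time is much cheaper than removing them later, since a blob that reaches history stays in every clone until the history is rewritten.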
macOS is the supported development platform for Dropbox engineers, but they’re free to use other platforms as they see fit. For example, an engineer working on the Dropbox desktop client might work on Windows, and some server engineers prefer Linux. In practice, most of our server engineers use macOS, and that’s where we focused our efforts.
To tackle local Git performance, we needed to control the version of Git that developers on macOS used. We also needed to measure performance. We created a small fork of Git that measured the timing of operations like git status and git pull, had it automatically provisioned and installed on developer machines, and munged $PATH so that developers used our Git over the versions installed by the system or Homebrew.
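The measurement side of this can be sketched as a small shim placed ahead of the real binary on $PATH. Everything below (the function name, the log location, the real-binary path) is a hypothetical sketch, not Dropbox’s actual wrapper, which was a fork of Git itself:

```shell
# Hypothetical sketch: time each Git invocation, delegate to the real
# binary, and append "<subcommand> <seconds>" to a log. In a real
# deployment this body would be a standalone script installed in a
# directory that precedes /usr/bin and Homebrew on $PATH.
timed_git() {
    real_git=${REAL_GIT:-git}                      # assumed real binary
    log=${GIT_TIMING_LOG:-"$HOME/.git-timings.log"} # assumed log path

    start=$(date +%s)
    "$real_git" "$@"
    status=$?
    end=$(date +%s)

    # Record the subcommand (e.g. "status", "pull") and wall-clock time.
    printf '%s %ds\n' "${1:-<none>}" "$((end - start))" >> "$log"
    return $status
}
```

Aggregating these logs centrally is what makes before/after comparisons of p50 and p90 durations possible.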
At the time, git status took over two seconds on average. In our case, many Git operations were slow, and slowed down linearly with repository size, because they ran the lstat syscall on every file in the repository to check whether it was up to date. Since most developers modify only a small subset of files, most of those lstat calls are wasted work. Interestingly, git status was 5-10x faster on Linux than on macOS.
There has been substantial work in the open source community over the last few years to speed up Git for large repositories. For example, Git now has a file system monitor (fsmonitor) to detect changed files, which integrates with Watchman, a daemon that watches and buffers filesystem changes. Fsmonitor is a Git hook that acts as a thin wrapper around Watchman. This is a useful abstraction for internal Git tests, and would be helpful if there were ever a need to use an alternate file watcher or migrate away from Watchman.
Git uses an index file that contains an entry for each file in the repository and determines which files are staged for commit. The index also supports extensions for features like fsmonitor, and this file can cache results from fsmonitor. A Git operation that needs local filesystem state, like add, status, or diff, checks with fsmonitor for changed files and updates the index if there are any changes.
The index is sorted by file path by default, so a common operation like adding a file to the index (via git add) requires a full index rewrite to insert the new path in the right place. This is slow for repositories with large indices.
Git also introduced a --split-index mode that changes the index format so that deltas can quickly be appended to a split index file and eventually consolidated into a few shared index files, making index writes significantly faster. Finally, there is an untracked cache mode that caches directory mtimes so Git can skip traversing unmodified directories.
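For reference, all of these features are opt-in via configuration. A setup along the following lines (assuming Git 2.16 or newer and a running Watchman daemon, and using the sample Watchman hook that ships with Git) enables them for a single repository:

```shell
# Run inside the repository.

# Hook up the sample Watchman integration that Git installs into
# .git/hooks on clone/init:
cp .git/hooks/fsmonitor-watchman.sample .git/hooks/query-watchman
git config core.fsmonitor .git/hooks/query-watchman

# Append index deltas instead of rewriting the whole index on every change:
git config core.splitIndex true
git update-index --split-index

# Cache directory mtimes so unmodified directories are skipped entirely:
git config core.untrackedCache true
git update-index --untracked-cache
```

Running the `git update-index` commands once converts the existing index in place; subsequent operations pick the new formats up automatically.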
We deployed fsmonitor and Watchman to developer MacBooks and released some instructions on how to turn it on. Unfortunately, there were both performance bugs, where Git seemed to ignore fsmonitor data in some cases, and correctness bugs, where operations like git status sometimes returned wrong results with fsmonitor enabled. We dug into these and fixed some bugs, but ultimately shifted our focus when we lost team members with Git expertise and had other priorities.
In the second half of 2019, our repo had grown past 250,000 files, and we decided to refocus on this problem. An upstream bugfix looked promising, and we fixed one large performance issue ourselves. This time, we were confident enough to enable these improvements for all our users without requiring any action on their part.
One of our core principles was shipping developer tools that would be configured correctly for everyone by default. We settled on shipping a wrapper on top of Git that would automatically tweak configs and enable fsmonitor for developers (if it wasn’t already turned on), but only for a whitelisted set of repositories. This is similar in principle to Microsoft’s Scalar, but without most of its features, and without developers needing to learn an additional tool or run extra commands.
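The config-fixing half of such a wrapper can be sketched as follows. The whitelist pattern and hook path here are assumptions for illustration, not Dropbox’s actual values:

```shell
#!/bin/sh
# Illustrative sketch: before delegating to git, make sure a whitelisted
# repository has fsmonitor turned on.

# Hypothetical whitelist: match only the monorepo's checkout path.
repo_is_whitelisted() {
    case "$1" in
        */server|*/server/*) return 0 ;;
        *) return 1 ;;
    esac
}

ensure_fsmonitor() {
    top=$(git rev-parse --show-toplevel 2>/dev/null) || return 0  # not in a repo
    repo_is_whitelisted "$top" || return 0
    # Only write config when it's unset, so explicit user overrides survive.
    git config --get core.fsmonitor >/dev/null ||
        git config core.fsmonitor .git/hooks/query-watchman
}

# A real wrapper would run the check and then delegate, e.g.:
#   ensure_fsmonitor
#   exec /usr/local/bin/git "$@"
```

Because the check runs on every invocation, a developer who clones the repository fresh, or deletes their config, gets the right settings back on their next git command with no manual steps.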
We saw a significant reduction in p50 and p90 durations for common operations. These operations are still nowhere near as fast as running Git in a smaller repository, but they’re a big improvement over the status quo, and acceptable for most purposes. Crucially, these times no longer grow linearly with the number of files in the repository. And all of this was enabled by fewer than 200 lines of custom plumbing code around Git, with no additional services (or large virtual filesystems) to maintain.
There’s exciting work being done in open source to improve version control performance for large repositories, for both Git and Mercurial. With some prep work, like disallowing large files and setting up the right flags for clients, Git can perform reasonably well with almost zero ongoing maintenance burden. There are many trade-offs when deciding between a monorepo and multiple repositories, but version control scalability should not be a dealbreaker for a long time.
Dozens of engineers contributed to the work described above over many years, from the Mercurial-to-Git migration, to the repository merges, and finally the work to speed up Git. The current and former Dropboxers involved include Tim Abbott, Jon Goldberg, Nipunn Koorapati, Jason Michalski, Greg Price, Mike Solomon, and Alex Vandiver. Many thanks as well to the maintainers of Git, and to the Microsoft engineers who reviewed our patches and contributed several major Git features like the file system monitor. And kudos to Facebook for building and open sourcing Watchman, and to Charles Strahan, who wrote the Go Watchman library we use.