Accelerating Iteration Velocity on Dropbox’s Desktop Client, Part 2

In our previous blog post on investing in the Desktop Client platform at Dropbox, we discussed the challenges of trying to innovate and iterate on a product while maintaining high platform quality and low overhead. In 2016, by investing in foundational improvements, Dropbox quadrupled the cadence at which we shipped the Desktop Client, releasing a new major version every 2 weeks rather than every 8 weeks. These efforts tended to illustrate one or both of the following themes:

Reduce KTLO work: “Keeping The Lights On,” or KTLO, includes manual tasks such as setting configuration parameters, running scripts, and figuring out the right person to send a bug to. We strove to automate or eliminate these tasks.
Automatically affirm quality: Use testing and monitoring to establish confidence in code correctness and performance, so that releases can happen without anxiety.

The previous post described the improvements that we made to Continuous Integration, i.e., the tools that allow engineers to commit code frequently to the same codebase without breaking each other’s tests. This article will talk about the new (or significantly improved) processes and technologies that we implemented for the rest of the code’s life cycle: building and testing new binaries, getting them out the door, and making sure they work in the wild.

2. Automated official builds

Enhancing Continuous Integration reduced a lot of developer agony. With a more reliable CI system, developers could commit code more quickly, know that their tests passed across the wide range of platforms that Dropbox supports, and be safeguarded against future regressions. However, in early 2016, an on-call engineer still spent 8+ hours a week executing scripts, changing configurations, and writing emails to make and release new official Desktop Client builds. This was costly, and not much fun for the engineer. Meanwhile, thanks to the improved CI, we no longer required an engineer to track down test failures and decide which commits would make it into a release, because almost every commit had passing tests and was releasable. Build-making was thus a prime candidate for automation!

We began with the highest impact area: “internal alpha” builds, meaning binaries intended for internal “dogfooding” by Dropboxers. Internal alpha was the highest impact because we wanted to make builds for this group of users most frequently, as it was (and continues to be) the first destination for new code changes and the first source of realistic feedback for developers. In addition, we were fine with shipping internal alpha builds that had passing unit tests but no manual quality assurance; if there were unexpected issues, Dropboxers could contact the responsible engineers directly. And if the new internal alpha build was completely unusable, IT could help us do complete reinstalls of Dropbox at the company, though the development of DropboxMacUpdate and its equivalent on Windows had drastically reduced this risk.

Automating builds was not ground-breaking new technology, but it required pulling together a bunch of disparate parts of our ecosystem. For example, we added an authenticated API endpoint to configure the official build numbers served via autoupdate, which had previously only been accessible via a web interface. We hooked into Changes, our build-job orchestration framework, to check the status of tests and builds. We called the API for Phabricator, our task tracking and code review tool, to make and check tasks that humans could open or close to indicate blockers to release. We also wrote templated emails and Slack messages to communicate status.

We were able to completely automate internal alpha builds by late March 2016, and in the process increase the release cadence to once a day! There were a few main takeaways from this project:

Automate the most annoying points of manual work first. Then work outwards, automating tasks based on their impact-to-effort ratios. For example, refreshing our dashboard to figure out when all the jobs were done for a specific build was time-consuming and boring. It was relatively easy to write an endpoint that polls this status and then emails a person with the result. In contrast, actually setting the official build number configuration was relatively unobtrusive for an individual, but sensitive from a security perspective, so it was worthwhile to be slower and more careful developing automation for it.
Write really good unit tests for your tools, too. When we began writing code to automate the build making process, we had to make sure it actually worked. We could run the code to kick off a new official version of the Desktop Client 10 times a day to test it end-to-end – but the flurry of builds would confuse engineers, add load to our build infrastructure, and potentially be a bad experience for our internal users as they repeatedly autoupdated. Plus, we would end up with the same familiar bad cycle from product code: every change had to be followed up with a manual test. We made improvements to CI in part 1 precisely to avoid manual testing! Therefore, it was worth the investment to write a robust unit test suite for the build automation tools.
Don’t tie things together if they don’t need to be tied together. This should be an obvious software engineering principle: separation of concerns, do one thing well, etc. But when automating a manual process, it’s easy to have a computer do exactly what a human would have done. In our case, an engineer would arrive in the morning, kick off a build, test it, and deploy. So, our first iteration of build automation was a monolithic script that did all those things in exactly that order, and failed if any part didn’t succeed. As mentioned in Part 1, capacity was a bottleneck at the time, so retrying the whole script added load, dragged down the overall CI, and impeded engineers’ ability to commit code. To help with this, we began running the “build” aspect of the process early in the morning when load was low. However, we still wanted to wait until work hours in SF to deploy a build. Otherwise, if there was a significant problem, either a desktop engineer would have to be disturbed in the off-hours, or Dropboxers using an internal build in other parts of the world wouldn’t be able to use Dropbox. Breaking up the script into its component parts (making the build early in the morning but delaying deployment until later in the day) solved both problems effectively.
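As a rough illustration of the first takeaway, the status-polling step might look something like the sketch below. Everything here is hypothetical: `wait_for_build`, `fetch_statuses`, and `notify` are placeholder names, not the real Changes or email APIs, and the status strings are assumed. The status source and the notifier are passed in as plain callables, which also makes the tool easy to unit test without kicking off real builds.

```python
import time
from typing import Callable, Dict

# Hypothetical status values; the real job-orchestration system may differ.
TERMINAL = {"passed", "failed"}


def wait_for_build(
    fetch_statuses: Callable[[], Dict[str, str]],
    notify: Callable[[str], None],
    poll_seconds: float = 60.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every job is finished, then send one summary message.

    Returns True only if every job passed.
    """
    for _ in range(max_polls):
        statuses = fetch_statuses()
        if statuses and all(s in TERMINAL for s in statuses.values()):
            failed = sorted(name for name, s in statuses.items() if s == "failed")
            if failed:
                notify("Build failed: " + ", ".join(failed))
                return False
            notify("Build passed: all %d jobs green" % len(statuses))
            return True
        sleep(poll_seconds)  # some jobs are still running; check again later
    notify("Build timed out waiting for jobs")
    return False
```

Because the sleep function is injectable, a test can replace it with a no-op and feed in canned status dictionaries, so the polling logic is exercised in milliseconds.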

3. Automated integration testing

After builds were automated, we had fixed two of the big manual parts of getting new versions out the door: keeping tests green so that we could ship on any commit, and kicking off and deploying builds. However, in the old system, an on-call Desktop Client engineer still ran a basic test set of install, update, and sync by hand. Note that this person was a Software Engineer, not a Quality Assurance Engineer who specialized in testing. We had made the decision to ship with only unit tests for internal users, but since there were sometimes large regressions, we weren’t willing to do the same for external “beta” users.

Enter the Desktop Client Test Infrastructure team. Their mission was to automate end-to-end tests. Their challenge was that larger-scope tests have more opportunities for system flakiness.

Here are a few examples of those challenges:

Our integration tests run against the production server, meaning that issues in a different part of Dropbox can cause failed client tests.
The tests run against a wide variety of platforms. For example, implementing a hook into the user interface (e.g., to click a button) is different across every platform.
They also had to account for the third party programs that Dropbox interacts with to test things like the Dropbox Badge.
The integration tests were “realistic”, but only used a small set of test users to run. This introduced an insidious bug, because a typical user wasn’t logging into hundreds of different instances of the Desktop Client per day. The flow by which a user logs into a specific Desktop Client instance, or “host”, is called “linking”. On the server side, there was logic that loaded and traversed a list of every single previous host each time a new host was linked to a user. For the test users, this caused HTTP 500 errors from the server that were not seen by regular users!
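To see why only the test users ran into the linking problem, here is an illustrative sketch (not the server’s actual code) that counts how many host records would be visited if every new link scans all earlier hosts. The function name and any numbers plugged into it are assumptions chosen purely for illustration.

```python
def hosts_traversed(previous_hosts: int, new_links: int) -> int:
    """Total host records visited if each link scans every earlier host.

    Assumes, as described in the text, that linking one new host walks the
    full list of hosts previously linked to the same user.
    """
    total = 0
    for i in range(new_links):
        # The i-th new link sees the original hosts plus the i just added.
        total += previous_hosts + i
    return total
```

Under these assumptions the total work grows quadratically with the number of links: five links from a fresh account visit 10 records in total, while 300 links visit 44,850. A typical user never gets far along that curve, but a test account that links hundreds of hosts every day does.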

To try and control, and then quash, flakiness, the Test Infra team strategically and incrementally stabilized the framework. First, they ran a basic sync test continuously, fixing issues as they came up, until there were 1000 consecutive passing test runs on each platform. Then, they gradually expanded the scope of tests, making sure that each new coverage area was stable before moving on.
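The “1000 consecutive passing runs” bar can be expressed as a small loop. The sketch below is only an illustration of that stopping rule; `runs_until_stable` and its parameters are hypothetical names, and `run_test` stands in for whatever actually executes the sync test on a given platform.

```python
from typing import Callable, Optional


def runs_until_stable(
    run_test: Callable[[], bool],
    required_streak: int = 1000,
    max_runs: int = 100_000,
) -> Optional[int]:
    """Run a test repeatedly until it passes `required_streak` times in a row.

    Returns the total number of runs it took, or None if the streak was
    never reached within `max_runs`.
    """
    streak = 0
    for total in range(1, max_runs + 1):
        if run_test():
            streak += 1
            if streak >= required_streak:
                return total
        else:
            streak = 0  # any failure resets the count of consecutive passes
    return None
```

The important property is the reset: a single flaky failure sends the counter back to zero, so the bar can only be cleared once the underlying issue is actually fixed.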

Once the first set of tests was written, the other big challenges were in process and culture:

Evangelizing the integration test framework. When writing the basic suite of integration tests, the Test Infrastructure team put a lot of effort into making a generalized framework that other teams could use to test their new features. However, product engineers, who primarily want to write and ship product code, had to be convinced that it was a worthwhile investment to write end-to-end tests. Therefore, the Test Infra team conducted onboarding sessions and provided support. Eventually, integration tests caught enough bugs to prove their worth and they became the cultural norm.
Generating trust and ownership. Developers at Dropbox already wrote unit tests, understood them, and trusted the unit test framework enough to assume a failure was their responsibility to fix. In contrast, a product engineer who wrote a new integration test that failed was more likely to assume the framework was at fault, and assume that the Test Infrastructure team should investigate first, as they were responsible for the framework. This generated a lot more work for the Test Infra team.
Debugging failures. The integration test framework is relatively “thick”, with more code and Dropbox-specific customization than the test tool Pytest on which it is built. This means that engineers who want to debug a failure have more to learn. So, the Test Infra team put together a debugging playbook to spread the knowledge, and put a lot of effort into trying to make failures easier to understand and fix.
Triaging potential flaky failures. First, the quarantine logic discussed in Part 1 that came with our unit test suite needed to be ported over to integration tests — and as previously mentioned, the large surface area of integration tests makes flakiness even more probable. Second, routing and fixing flaky or failing tests once they occurred required additional cultural change and process. (See the “Triage Routing” section below for more about how this reflects other considerations of distributed development.)

4. Monitoring and metrics during rollout

Once a new version of Dropbox is launched, we want to make sure that it’s working well on real user computers.

Since at least 2008, Dropbox has used Community Forums to post new Desktop Client builds for enthusiasts to try and give feedback. Our beta testers are thoughtful and have surfaced many issues, but as of early 2016, investigating a specific reported issue was relatively difficult. For one, issues on the forums took the form of prose posts and tended to describe symptoms of the problem through product behavior. From only one report, the underlying issue was often difficult to isolate, as it could have happened at any time and been caused by any component. In addition, forum posts were designed for answering questions and community discussion, so they were aggregated by user-defined topics rather than by timestamps or automatically detected content. To find all related reports of an issue, an engineer had to sort through several topics, find reports of similar application behavior, fill out an internal task, and do enough investigation to assign a severity.

As we quadrupled the cadence at which we were releasing new official versions of Dropbox, we would potentially be quadrupling the overhead required to find and translate these user reports into bugs. Since, as of late 2015, ~60% of critical issues detected after feature freeze came from beta user reports, we couldn’t simply stop paying attention. Therefore, we needed a solution to these laborious but essential human-scale processes. As usual, this solution was to automate them!

The goal of beta user bug reports is to get an idea of when things are going wrong, so that we can fix them before releasing to a wider audience. However, the Desktop Client itself has a lot more information than the end user about its internal state: it can catch and report exceptions, or report events based on when various code paths are executed. As of early 2016, Dropbox already had systems in place to collect tracebacks and aggregate event analytics, but they were primarily used to launch new features. For example, as an engineer iterated on a new widget, they would add logging to detect whether users were interacting with it and to ensure that it was operating correctly. They would also search for and fix tracebacks that indicated internal exceptions related to their new code. Once the feature was stable, it was common that no one would look at those analytics again, except sometimes when there was a user-reported regression. This meant performance of the feature, or the entire Desktop Client, could slowly degrade over time and no one would notice.

We realized we could be much more systematic about analytics by permanently tracking a set of metrics over time and by creating guidelines for analytics on new features. This had three big upsides: we could catch more issues before any external user noticed, engineers wouldn’t spend all their time sifting through posts, and the information would contain more context about internal application state.

Release Gating Metrics

We wanted to track quality statistics over time. But how would we ensure that a regression in performance actually resulted in a fix? This is why we introduced a quality framework in the summer of 2016, followed by “Release Gating Metrics” in the autumn. We began by requiring each product team to implement a dashboard tracking the features they owned. The team could choose whatever metrics they believed to be most important, but the metrics had to have a baseline that indicated quality, and not simply whether users liked the feature. A few of the most important are designated “release gating metrics”. These are always consulted before a release, and if they cross a pre-assigned threshold, the new version is delayed until the underlying issues are found and fixed.

Let’s take account sign-up, for example. We could track the total number of Desktop Client-based sign-ups, but if we launched a promotion or redesigned the sign-up interface to be more attractive, the number of new accounts might spike and hide the fact that the sign-up flow itself had become glitchy or started to lag. Therefore, to capture quality rather than popularity, we might track the amount of time from application start to the appearance of the sign-up window and then define an acceptable duration. The team could create an alert in case the window started taking too long to appear or failed to launch too often.
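A gate of this kind could be checked with something like the following sketch. It is not Dropbox’s implementation: the `GateThresholds` type, the choice of a 95th-percentile latency, and the use of `None` to mark a launch where the sign-up window never appeared are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class GateThresholds:
    max_p95_seconds: float   # acceptable time from app start to sign-up window
    max_failure_rate: float  # acceptable fraction of launches with no window


def evaluate_gate(
    samples: Sequence[Optional[float]], thresholds: GateThresholds
) -> List[str]:
    """Return a list of violations; an empty list means the gate passes.

    Each sample is the seconds until the window appeared, or None if it
    never appeared.
    """
    if not samples:
        return ["no data reported for this build"]
    durations = sorted(s for s in samples if s is not None)
    violations = []
    failure_rate = (len(samples) - len(durations)) / len(samples)
    if failure_rate > thresholds.max_failure_rate:
        violations.append("failure rate %.1f%% too high" % (failure_rate * 100))
    if durations:
        # Nearest-rank 95th percentile, using integer arithmetic for the rank.
        rank = (95 * len(durations) + 99) // 100
        p95 = durations[rank - 1]
        if p95 > thresholds.max_p95_seconds:
            violations.append("p95 latency %.2fs too slow" % p95)
    return violations
```

Note that neither check depends on how many sign-ups happened in total, so a promotion that drives up volume would not mask a slow or failing flow.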

Larger experimental user pools

Robust monitoring is most useful when it captures information early, while there is still time to fix regressions ahead of the final release. However, the metrics and exception traceback volumes also need to be statistically significant to be actionable — giving five users hot-off-the-press versions of the Desktop Client would not catch many issues. This was particularly important as we began to move faster, and each version spent less time “baking” with beta users.

Expanding the experimental user pool turned out to be surprisingly simple: the Dropbox Desktop Client already had an option to “Include me on early releases” in the settings panel. Previously, we released to these users after posting a version in forums for a while, right before rolling out to the rest of the world. However, integration tests bolstered the quality of the Desktop Client sufficiently that we no longer needed this step, and we simply began including these users in the beta builds as soon as they were released. This expanded our pool of beta users by about 40-fold and diversified the configurations that the experimental code ran on, so that exception tracebacks that might only be reported in specific edge cases were more likely to show up.

Altogether, we still find issues through internal and external human reports, but the additional logging and analytics have made reports easier to debug, and the volume has remained manageable as overall development velocity increases.

5. Triage routing

At this point in the story, we now had a thorough suite of indicators of how the Desktop Client could be going wrong. Metrics could dip below satisfactory levels, integration tests might fail, a manual bug report could be filed, or tracebacks may be reported when internal exceptions occurred. How would we translate these into actual actionable bugs and fixes? Improvements 1-4 unlocked the ability for many teams to develop on desktop, including teams with goals spanning the web, mobile, and desktop platforms, but there was no longer a single manager aware of all developments. The Desktop Platform team could investigate everything, but they would never have time for larger foundational projects and probably would lack context if a product team implemented a feature that was causing bugs. With the advent of distributed development, we now also had distributed ownership.

The solution was to be really explicit about who was responsible for which component of the Desktop Client, and automate as much as possible about routing. Every part of the Desktop Client is given a “Desktop Component” tag on Phabricator, our bug tracking and code review platform. As teams shift and features are added, a central authority keeps track of ownership. If an issue can be clearly mapped to a specific component, it gets assigned the relevant tag and directly sent to the owning team. This way, the Desktop Platform team, instead of being responsible for investigating everything, is only responsible for doing any leftover first-level routing and for taking care of the things their team specifically owns.

To assist in first-level routing, bugs are auto-assigned based on the code paths in the stack traces. When manual reports come in from our internal users, we make them as routable as possible by implementing an internal bug reporter, which prompts for information and adds some metadata. We have similar templates for problems bubbled up from external users. Generally speaking, if an issue is reported by hand that was not caught by our existing metrics, we strive to add monitoring and/or integration tests to catch similar problems in the future.
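One simple way to auto-assign from a stack trace is to match frame paths against a table of owned code paths. The sketch below assumes that approach; the path prefixes, component names, and the `route_traceback` function are invented for illustration and do not reflect the real Desktop Component tags or the actual routing rules.

```python
from typing import Dict, List

# Hypothetical mapping from code path prefix to component tag.
COMPONENT_BY_PREFIX: Dict[str, str] = {
    "client/sync/": "Sync Engine",
    "client/ui/": "Desktop UI",
    "client/update/": "Autoupdate",
}
FALLBACK = "needs-manual-routing"


def route_traceback(
    frames: List[str], owners: Dict[str, str] = COMPONENT_BY_PREFIX
) -> str:
    """Pick a component tag for a traceback.

    `frames` lists file paths from outermost to innermost call, as in a
    Python traceback. The innermost frame with a known owner wins, since it
    is closest to where the exception was raised.
    """
    for path in reversed(frames):
        matches = [prefix for prefix in owners if path.startswith(prefix)]
        if matches:
            # Prefer the most specific (longest) matching prefix.
            return owners[max(matches, key=len)]
    return FALLBACK
```

Anything that falls through to the fallback tag is exactly the “leftover first-level routing” that a human would still handle.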

6. Safeguarding the stabilization period

Code flux tends to be proportional to product quality risk. In other words, the more that changes in a codebase, the more testing and verification is needed to be sure it works well. Our previous process put the onus on engineers to make only the changes that they deemed safe, with the guideline that no new features could be added after a build was made available externally. However, seemingly innocent changes that fixed a bug in one place could easily cause issues on another platform, especially without integration tests as guardrails to ensure that every commit preserved basic functionality.

To replace these nebulous guidelines, we implemented a strict and objective set of requirements for allowing changes after “code freeze,” so that the testing and monitoring after freeze would accurately represent the quality of the final release. The primary benefit of this was predictability. Code could now go out on a regular cadence without too much time spent trying to re-establish test coverage and quality verification after last-minute commits. The downside was that teams had to accept missing a deadline if they could not stabilize their feature completely before code freeze.

This was painful at first, especially as we first had to go through a transitional 4-week release cadence (instead of the 2-week one we have today). Engineers had to get used to bug bashing thoroughly ahead of code freeze and trust that if they missed one date, the next would come around in a few weeks as scheduled. The first time we implemented this safeguarding, I personally forgot to turn on a single feature flag (explained in more detail below) and had to wait a whole release cycle — and I had been working on the new release process!

The Triage Council

As for how this enforcement works: we gathered 5 senior technicians on the desktop client codebase to form the “Triage Council” and gave it a charter and explicit guidelines to accept or reject proposed changes once code freeze happened. (This is also when a release branch is created in the repository.) The Triage Council would have a lot of technical context, but be tasked only with upholding the existing rules. This had two advantages: these senior engineers weren’t at risk of burning out on playing “bad cop” or making difficult decisions (they could always just point to the charter); and other engineers would approach them with a good understanding of the requirements to make a last-minute change, or “cherry-pick”, to the release branch.

So, what can be cherry-picked?

Fixes to critical regressions of existing functionality
Disablement of a new feature
Resolution to SEVs (severity incidents), which themselves have an explicit set of criteria.

We later added a fourth category as well:

Test or tools-only changes (not application code). We wanted to be able to use improvements to our automated build and test infrastructure on all branches.

There is an explicit block on pushing code to a release branch without Triage Council approval, but there is also an escalation process. If someone whose cherry-pick was rejected by the Triage Council thinks that the rules were misapplied, or that there should be an exception for another reason, such as hitting a product deadline, they can appeal further to our VP of Infrastructure.

To keep improving our processes, each cherry-pick is followed up with a post-mortem review, which strives to identify what the root cause behind an issue is, why a solution was not found earlier, and how we can prevent similar issues from occurring again.

Code gating

One important way to make all of this possible was to boost cultural support for remote and binary code gating. There was already a robust set of tools (Stormcrow) for grouping users and hosts in order to show them A/B experiments on the web. These experiment flags can also be passed down to the Desktop Client to fork logic. We now expect that any risky change is gated by a Stormcrow flag so that it can be turned off safely without making a code change in the desktop codebase. Some changes, of course, happen when we cannot guarantee web connectivity. These are expected to be “feature-gated” within the Desktop Client codebase, meaning that changing a single variable from True to False would turn off an entire section of code. These feature gates can also be configured to turn on code for only certain types of builds, so that, for example, we could get feedback from Dropboxers on brand-new (but highly experimental) features a few months before we were ready to commit to that product strategy externally.
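To make the two kinds of gating concrete, here is a minimal sketch of how a gate check might combine a local on/off switch, a build-type restriction, and an optional remote flag. None of these names come from the Dropbox codebase or from Stormcrow: `FeatureGate`, `is_enabled`, the build-type strings, and the plain dictionary standing in for server-provided flags are all assumptions, as is the choice to treat a missing remote flag as “off”.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Optional

ALL_BUILDS: FrozenSet[str] = frozenset({"internal_alpha", "beta", "stable"})


@dataclass(frozen=True)
class FeatureGate:
    enabled: bool                         # flip to False to disable the code path
    build_types: FrozenSet[str] = ALL_BUILDS
    remote_flag: Optional[str] = None     # name of a server-side flag, if any


def is_enabled(
    gate: FeatureGate, build_type: str, remote_flags: Mapping[str, bool]
) -> bool:
    """Decide whether gated code should run in this build."""
    if not gate.enabled:
        return False
    if build_type not in gate.build_types:
        return False
    if gate.remote_flag is None:
        return True  # purely local gate; works without web connectivity
    # Assumption: if the flag has not been received, keep the feature off.
    return remote_flags.get(gate.remote_flag, False)
```

For example, a gate declared as `FeatureGate(enabled=True, build_types=frozenset({"internal_alpha"}))` would run only for Dropboxers, and switching `enabled` to `False` is the kind of single-variable change that qualifies as “disablement of a new feature” after code freeze.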

In summary

All of the process changes and infrastructure and tooling investments built on top of one another to amplify each other’s efficacy. They also tended to achieve one of the following ends:

Reduce and automate “Keeping The Lights On” work (operational work that often scales linearly with code change) by:

Improving reliability and capacity of our build and test infrastructure
Automating build-making, saving an engineer/week of time.
Defining a scalable ownership and triage system.
Enacting explicit process for last-minute changes, including an emphasis on feature flagging new or risky code to respond quickly to issues.

Automatically affirm quality, via testing and diagnostics such as:

Leveraging a Commit Queue to raise the baseline quality of committed code.
Developing an end-to-end test platform to reduce reliance on manual quality assurance.
Implementing Release Gating Metrics and more robust logging and analytics to track real-world performance.

Together, they allowed us to accelerate our release cadence, from publishing a new major version every 8 weeks to every 2 weeks, in order to shorten the feedback cycle on new features and code changes, while sustaining high quality and avoiding undue operational burden on the engineers responsible for maintaining the platform.
