Since its inception, the Dropbox Android app has had challenges sharing files and folders with special names. These problems ranged from being unable to see folders with Cyrillic characters to crashing when attempting to load content within certain paths.
The root cause of the problem was easy to identify, as we’ll explain, but the obvious fix was far too risky to attempt. Instead, by going against conventional wisdom, we devised and carried out a plan that was both low-risk and low-cost.
How did this happen?
Historically, Dropbox has used normalized paths to uniquely identify files and folders. Normalized paths are file or folder paths that have been converted to a standard form that’s operating-system agnostic. For example, we would normalize the path /Documents /CAT.jpg as /documents/cat.jpg. Getting rid of uppercase letters, trailing spaces, and other little things lets the file names and paths work on an Android phone, an Apple tablet, and any Windows desktop.
But paths must be normalized in a precise way, and this normalization logic must be replicated perfectly on all platforms. If it’s not done properly, miscommunication between a server and client can occur during file/folder operations. That results in show-stopping bugs for the user and their files.
The choices we had initially made for a simple normalization solution on Dropbox servers turned out to be tricky to implement correctly on other platforms. The Dropbox Android client, in particular, ended up with its path normalization logic implemented incorrectly.
In hindsight, two decisions led us into the situation.
Decision #1: Removal of spaces
We had chosen to strip away space characters from the ends of folder names, but we retained other whitespace characters (e.g. tab characters). This would later conflict with Android’s normalization of the same paths.
Decision #2: Python 2.5 vs Cyrillic
Our second decision had been made unintentionally. When we first developed Dropbox servers for our 2007 debut, our engineers used the most stable version of Python at the time: Python 2.5. The server therefore used Python 2.5’s unicode.lower() function in its path normalization logic.
Unfortunately, Python in those days had poor internationalization support, plus its lowercase function was flawed. It did not take the context into account—the Greek character sigma (Σ) can have a different lowercase form depending on where it is used (σ or ς).
And in some cases, Python 2.5 was outright missing the lowercase mapping for certain characters. The lowercase form of the Cyrillic character Uk (Ꙋ) is ꙋ, but Python 2.5 returns Ꙋ.
That lowercasing logic was carried over from the original Dropbox code to maintain backward compatibility. That’s why Python 2.5’s lowercasing behavior, a relic of 2006, is still running on our servers.
How Path Normalization on Android went wrong
Dropbox’s original path normalization implementation for Android did not account for these two quirks in the server’s implementation.
First of all, Android removes all leading and trailing whitespace characters from folder names. This may strip away characters that our server does not. For instance, the server considers a folder name consisting of a single tab character to be correctly normalized. When the Android client normalizes this same path, it deletes the character, since a tab is a whitespace character. This results in a folder with no name, which breaks the file path.
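To make the tab example concrete, here is a minimal sketch of the two trimming rules side by side. The class and method names are hypothetical. The use of String.trim() on the client side is an assumption for illustration, since the exact call the Android client used isn't stated here.

```java
final class TrimComparison {
    private TrimComparison() {}

    // Hypothetical stand-in for the server rule described above:
    // strip only U+0020 space characters from both ends of a folder name.
    static String stripSpacesOnly(String name) {
        int start = 0;
        int end = name.length();
        while (start < end && name.charAt(start) == ' ') {
            start++;
        }
        while (end > start && name.charAt(end - 1) == ' ') {
            end--;
        }
        return name.substring(start, end);
    }

    // Assumed client-side rule: String.trim() removes every leading and
    // trailing character with a code point of U+0020 or lower, tabs included.
    static String stripAllWhitespace(String name) {
        return name.trim();
    }
}
```

With a folder name of "\t", stripSpacesOnly leaves the tab in place, while stripAllWhitespace returns an empty string. That empty string is the nameless folder that breaks the path.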
Second, Android uses Java’s lowercase function to normalize paths. Java’s lowercase function has support for more characters than Python 2.5’s lowercase function, and it also considers context. This results in mismatches over the lowercased form of certain paths between the Android client and the server.
For example, consider a folder named with three Sigma characters: ΣΣΣ. The Android client would lowercase ΣΣΣ as σσς. This is because the lowercase of Σ is σ in the middle of a word, but it is ς at the end of the word. The server, however, would lowercase ΣΣΣ as σσσ.
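The sigma mismatch is easy to reproduce in plain Java. The sketch below compares String.toLowerCase, which is context-aware, with lowercasing one code point at a time, which is not. The per-code-point version only approximates the server behavior described above. It is not Python 2.5's actual mapping table, and the class and method names are hypothetical.

```java
import java.util.Locale;

final class LowercaseComparison {
    private LowercaseComparison() {}

    // Context-aware lowercasing: java.lang.String applies the final-sigma rule.
    static String contextAware(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Context-free lowercasing: each code point is mapped on its own, so a
    // capital sigma always becomes a medial sigma, whatever its position.
    // This is an approximation of the server's behavior, for illustration only.
    static String contextFree(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> out.appendCodePoint(Character.toLowerCase(cp)));
        return out.toString();
    }
}
```

For the input "\u03A3\u03A3\u03A3" (ΣΣΣ), contextAware returns σσς and contextFree returns σσσ, so the two sides no longer agree on which folder they are talking about.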
This disagreement over the normalized path of a folder results in user-breaking bugs, such as folders not being displayed in the app.
Path normalization is a low-level function used widely in the Dropbox app. Problems within that logic may cause bugs to appear in a wide range of places. We couldn’t accurately estimate the extent of the problem’s impact on users, but clearly it made Dropbox behave mysteriously for more than a few of them.
Finding the Best Solution
There was an obvious fix to the problem: Change the way the path is normalized on Android to match the server’s imperfect methods. But this approach would run into a serious problem of its own.
Path normalization covers a broad code surface: It spans at least 12 source code files for the Android app, with about 100 locations where normalization functions are called. It is used in many crucial file operations such as upload, move, and delete. If a new bug were to be introduced within this logic, it could result in loss of user data for users who weren’t already at risk from the existing bugs. That risk was flat-out unacceptable.
How could we de-risk the fix? We saw two sensible ways. Unfortunately, neither of these theoretically elegant solutions would work well in practice.
Option 1: Ditch paths, switch to an ID-based file system
Using paths to uniquely identify files and folders has a lot of built-in drawbacks, of which incompatible path formats across platforms are only one. A system that uses unique IDs rather than file and folder paths would make these problems go away. This is actually the direction that Dropbox is heading anyway.
The biggest downside is the amount of work required. The use of Dropbox paths as a unique identifier in the app is deeply rooted. It would take multiple engineers several months. Due to the engineering cost, it would be difficult to prioritize this option, even though Dropbox would eventually get it done. In fact, this solution was proposed back in 2016 but has yet to be prioritized at the time of writing.
Just as important from a user perspective, this big a change would also introduce risks for all users. It’s a major change not to be rushed into.
Option 2: Use feature-gating to roll out a fix
Dropbox has an in-house system called Stormcrow that lets us roll out changes to our apps on any platform in a way that lets us turn off a feature change immediately if there’s a problem, without needing to build, test, and ship another update to the app to back it out.
But Stormcrow doesn’t get initialized early enough in the app’s runtime to gate this particular fix. The Dropbox app contains many components we call managers. For example, the FavoritesManager keeps track of the user’s favorite files and folders on their device, so the app doesn’t need to contact a server every time it needs them. Many of these managers need to store Dropbox paths in a local disk cache (the “disk” these days is often solid-state memory). On app startup, these managers reload paths from disk, which requires our path normalization logic. During this time, Stormcrow is unavailable due to its dependencies on some of those very files being reloaded.
So we can’t simply use Stormcrow to turn changes in path normalization code on and off. There are ways around this problem, but they have issues of their own. The Stormcrow interactor is the keeper of all feature gate states, so the path normalization logic would need access to the interactor in order to be gated. An obvious and elegant solution would be to pass the Stormcrow interactor as an argument to the path normalization logic.
Wrong. From a simple search we could see that there were at least 100 call sites in our production code where paths are normalized, and over 600 call sites in the unit tests we run before we even think about shipping production code. We would need to change code in over 700 places—and this assumes the Stormcrow interactor code would be accessible in all the many places necessary to reach into its data structures.
Besides the volume of work required, the risk of changing a constructor used in over 700 places is, once again, much higher than we’d want to take. The solution is not impossible. It is just too costly and complicated. The whole idea of Stormcrow is to reduce engineering work and user risk in making changes, not increase it.
Option 3: A global static variable. Seriously.
Using Stormcrow was actually the best idea, but the implementation was much too complex. We needed to simplify it.
First, let’s restate the problem in a simpler way: we need access to a state (a Stormcrow gate) in multiple places without needing major refactors. If we take a step back and prioritize a working product over software engineering elegance, a trivial solution emerges that lets us go with Stormcrow: Use a global static variable.
Bear with me.
A global static variable checks both critical boxes:
- It can be accessed at app launch.
- It can be accessed everywhere.
The only drawback is that software engineers generally frown on global static variables, and for good reasons:
- Too much scope. Global variables can be accessed anywhere; they can be modified anywhere. This can make code hard to reason about.
- Difficult to test. Static variables can maintain state between unit tests, making unit tests also hard to reason about.
Although these drawbacks seem pretty terrible, they’re only applicable when the global static variable is used as a long-term solution. In this case, we only want to use a global static variable while gating our fix in case it has its own problems. As soon as we’re sure that our fix works reliably without new risks, we can remove the global static variable from the code completely.
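As a rough sketch of the idea, the flag can live in one small class and be read inside the normalization function, so existing call sites stay untouched. All names here are hypothetical, and both normalization bodies are simplified placeholders that work on a single folder name rather than our production logic.

```java
import java.util.Locale;

// Hypothetical holder for the temporary global flag.
final class NormalizationGate {
    // volatile so a value written on one thread is visible to readers on others.
    static volatile boolean useServerCompatibleLogic = false;

    private NormalizationGate() {}
}

final class GatedNormalizer {
    private GatedNormalizer() {}

    // Call sites keep calling normalize(name); only this body branches.
    static String normalize(String name) {
        return NormalizationGate.useServerCompatibleLogic
                ? normalizeServerCompatible(name)
                : normalizeLegacy(name);
    }

    // Simplified placeholder for the old client behavior.
    static String normalizeLegacy(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    // Simplified placeholder for the fixed behavior: strip only spaces,
    // then lowercase one code point at a time without looking at context.
    static String normalizeServerCompatible(String name) {
        int start = 0;
        int end = name.length();
        while (start < end && name.charAt(start) == ' ') {
            start++;
        }
        while (end > start && name.charAt(end - 1) == ' ') {
            end--;
        }
        StringBuilder out = new StringBuilder();
        name.substring(start, end).codePoints()
                .forEach(cp -> out.appendCodePoint(Character.toLowerCase(cp)));
        return out.toString();
    }
}
```

Because the branch sits inside normalize, none of the call sites or unit tests need a new argument. Removing the gate later means deleting the flag and one of the two branches.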
Checking gate state at app launch
Although the global static variable gives the path normalization logic access to the gate state, it only solves half of the problem. The gate’s state still needs to be made available at app launch.
The solution to this problem was found, again, by simplifying the problem to its most basic form: We need a value before we can compute it.
In this case, we can use the last computed value. Specifically, we can cache the state of the gate once the Stormcrow interactor is ready, then use it during the next app launch.
The full solution is as follows:
- When the app first launches, check if we have a cached value of the state of the gate on disk. If we do, read it and set the global static variable accordingly. If we don’t, default to using the old path normalization logic.
- Once the Stormcrow interactor is initialized, register a listener on it and cache the state of the gate on disk.
- Conditionally use our modified path normalization code, based on the state of the global static variable.
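The steps above can be sketched in plain Java as follows. This is an illustration under assumptions, not our production code: the class name, the method names, and the one-line text file format are all hypothetical, and the gating system is represented only by whoever calls onGateStateKnown. The sketch also assumes the flag is set once per launch, so a gate change takes effect on the next launch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class CachedGateState {
    // Global static read by the normalization code; set once per launch.
    static volatile boolean useNewNormalization = false;

    private CachedGateState() {}

    // Step 1: at launch, before the gating system is ready, restore the
    // last known gate state from disk.
    static void loadAtLaunch(Path cacheFile) {
        boolean cached = false; // no cache yet: default to the old logic
        try {
            if (Files.exists(cacheFile)) {
                String text = new String(Files.readAllBytes(cacheFile), StandardCharsets.UTF_8);
                cached = Boolean.parseBoolean(text.trim());
            }
        } catch (IOException e) {
            cached = false; // unreadable cache: fall back to the old logic
        }
        useNewNormalization = cached;
    }

    // Step 2: called from a listener once the gating system knows the real
    // state. It only persists the value for the NEXT launch; the flag used
    // by the running session is deliberately left unchanged.
    static void onGateStateKnown(boolean enabled, Path cacheFile) {
        try {
            Files.write(cacheFile, Boolean.toString(enabled).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Best effort: if the write fails, the next launch sees the
            // previous cached value, or the old logic if there was none.
        }
    }
}
```

Step 3 is then a plain read of CachedGateState.useNewNormalization inside the normalization function.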
See how that works? In this specific case with customers’ data at risk, a global static variable made our code much, much simpler and much less risky to them than redesigning file storage or adding code that sticks its fingers into Stormcrow.
Once we’d stepped back to look at the situation in a larger perspective, our short-lived global static variable would let us take the best long-term path: Roll out the obvious, simplest fix in a way that could be instantly reverted. Avoid major refactoring to Stormcrow code to handle one short-lived code change. And take the time we need to build better, ID-based file management wisely.
Global static variables are almost never the right solution. Almost. In this case, solving our Android customers’ crashes quickly and safely—for them as well as us—was worth the inevitable comments and tweets we’ll get about omg a global static.
Building and Testing
Implementing the fix was straightforward. The majority of our work was gathering information to prove the proposed gating solution would work. Research was rigorous, due to the risks. The key behavior to verify was that toggling the feature gate on and off did not cause any issues with Dropbox paths that were persisted to disk.
To verify this, we ran an audit for every data source that contained a Dropbox path. We found that all but one data source already took into account the unreliability of Dropbox paths stored on the client. The exception was the metadata database, which stores a cached version of a user’s entire file system. Our solution was to clear the metadata database each time the state of the gate changed, a solution proven effective in past projects.
Besides the audits, analytic events were added to track cases where the new and old path normalization logic disagreed. For privacy reasons the event could only report if an inconsistency had been detected, without specifying the actual path and file names involved.
The new path normalization was fully unit tested. We also organized a small bug bash, in which no major bugs were found.
After the release build with the fix inside had shipped to 100% of customers, we rolled out a test run of the new path normalization logic. The test run did not change any logic; instead whenever a path was normalized using the old code it would also normalize the path using the new logic. The two normalized paths were then compared and the result was logged. The path normalized using the new logic was then discarded.
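A test run of this kind can be sketched as a thin wrapper around the two implementations. The names below are hypothetical and the counters stand in for our analytics events. The wrapper records only whether the two results differ, never the path itself, and always returns the old result.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.UnaryOperator;

final class ShadowNormalizer {
    // Stand-ins for analytics: counts only, no path or file names.
    static final AtomicLong comparisons = new AtomicLong();
    static final AtomicLong mismatches = new AtomicLong();

    private ShadowNormalizer() {}

    static String normalize(String path,
                            UnaryOperator<String> oldLogic,
                            UnaryOperator<String> newLogic) {
        String result = oldLogic.apply(path);
        String candidate = newLogic.apply(path);
        comparisons.incrementAndGet();
        if (!result.equals(candidate)) {
            mismatches.incrementAndGet();
        }
        // The candidate is discarded; behavior is still driven by the old logic.
        return result;
    }

    // Fraction of comparisons in which the two implementations disagreed.
    static double mismatchRate() {
        long total = comparisons.get();
        return total == 0 ? 0.0 : (double) mismatches.get() / total;
    }
}
```

A call such as ShadowNormalizer.normalize(path, oldFn, newFn) can replace a direct call to the old function without changing what the caller sees.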
During the test run, we measured the rate at which any pathname inconsistencies were reported. A rate that was too high, or a rate of exactly 0.0%, would have indicated a problem with our new code: the new logic should differ from the old logic on some paths, but only on a few. It hovered around 0.1%, which seemed reasonable since we did not expect a large number of folders with Cyrillic names.
Once we were satisfied with the results of the test run, we slowly rolled out the fix across our user base via the Stormcrow feature gate, and monitored analytics.
During the rollout, crashes revolving around empty folder names fell to zero. This strongly suggested that the fix was a success. Further confirming it, we didn’t receive any new customer support tickets related to the fix. After waiting long enough to trust we had succeeded, we removed all of the gating logic. Farewell, global static variable. Our first and most obvious fix from the start of this story was now live in our customers’ apps.
What we learned
Any engineer will nod in agreement at three lessons we took away from the project. But you haven’t really learned them until you’ve gotten over yourself and applied them to a challenging project. When facing time pressure and resource constraints, it can be tough to recall:
- Keep things simple
- Challenge what you believe
- General guidelines are general
The use of a global static variable was critical for this fix. It kept everything simple. It goes against general guidelines, but that’s the point of this story: General guidelines by definition have exceptions.
Often when we learn, we create mental shortcuts. We simplify. For instance, let’s say we try a new approach X, but quickly learn that X was the wrong choice. We might generalize and think X is bad, period, when the truth is that X was bad given the situation. Even in cases where we recognize this, over time we can forget the situation while ingraining the generalization.
These generalizations can be helpful. They allow us to quickly rule out bad solutions, rather than repeat them, and help us focus on identifying good ones. But these generalizations don’t always work. Occasionally they rule out a good solution.
When faced with a difficult problem, it might be valuable to stop and look for which of your own assumptions could be challenged. The time and effort you stand to save scales with the complexity of the solution. Ask yourself: Is there something far simpler that I’ve told myself isn’t an option? Maybe it’s the best option of all.