As you may already know, Dropbox runs its infrastructure on a hybrid model: a mix of on-premises hardware and the public cloud. The on-premises side, known as Magic Pocket, is our own custom-built, multi-exabyte infrastructure. After our original migration to Magic Pocket in 2015, our primary goal was to manage large capacity growth at scale. This presented us with challenges including power efficiency and concern for our carbon footprint. We were able to check the box of powering hundreds of millions of users, but we needed to ensure we were being responsible in how we operate.
As we scaled up our team, we started looking past just KTLO (keeping the lights on) toward how to operate more efficiently. After our first year of deployment, we had a bit of sticker shock from our utility bills. We pondered the implication of these high bills not just for Dropbox, but for the environment. Other teams at Dropbox were asking the same questions. Our peers were setting ambitious sustainability goals.
The result was achieving a major milestone in our journey to carbon neutrality: all of our data center storage server power is covered by 100% renewable electricity.
We decided to focus our sustainability efforts on customer data storage because the Edelman 2021 Trust Barometer found that customers are 5.7% more likely to trust companies that embrace sustainable practices. For us there’s a simple metric: You can store all your stuff on Dropbox knowing our data loads are covered by 100% renewable electricity.
We’ve identified three strategies to get to where we are today, and to keep going to even greater sustainability. We’re not here to brag about it—we hope to inspire others to join in bringing real change to the world we share.
Maximize power usage effectiveness
Power usage effectiveness (PUE) assesses how efficiently we’re able to leverage the power we consume within our data centers. When we set a goal to achieve best in class PUE, we looked to existing industry benchmarks. In 2015, our PUE was consistent with industry averages, but we felt we could do better.
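For readers new to the metric, PUE is conventionally defined as total facility energy divided by the energy delivered to IT equipment, so a value of 1.0 would mean no overhead for cooling or power distribution. The sketch below shows that ratio; the meter readings are made-up illustration values, not Dropbox figures.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_fraction(pue_value: float) -> float:
    """Share of facility energy spent on everything except IT equipment."""
    return 1 - 1 / pue_value


# Hypothetical readings for one billing period (illustration only).
reading = pue(total_facility_kwh=1500.0, it_equipment_kwh=1000.0)
print(f"PUE: {reading:.2f}, overhead share: {overhead_fraction(reading):.0%}")
```

A lower PUE means more of each kilowatt-hour reaches the servers rather than the cooling and power-delivery systems around them.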
We set out to achieve optimum efficiency through industry best practices throughout our deployments. These strategies include the implementation of outside air economization, thermal containment solutions, and maximizing power utilization through our spaces.
- Outside air economization: In our data centers, hardware needs to be cooled 24/7 because inlet temperatures must be stable to prevent system failure caused by the overheating of components. With outside air economization, our systems bring in outside air when it is cooler than the air inside. This reduces the amount of mechanical cooling needed to maintain appropriate temperatures. A comparable example is a hot San Jose apartment: on cool days it makes more sense to open a window for natural cooling than to blast the AC, because both energy usage and cost are lower.
- Thermal containment solutions: Using our window-vs-AC example, when we first moved to Magic Pocket, we cooled entire data centers without taking into account how air could be wasted through leaks or issues with airflow. It’s as if our air conditioner were running on hot days with the window open, letting cooled air escape rather than do its job in the room. We ran a computational fluid dynamics (CFD) thermal model to determine where airflow inefficiencies could be fixed, and identified significant opportunities for improvement. We retrofitted existing sites, and made it a standard to include this form of containment in all new data center construction.
- Maximizing power utilization: Across the industry, data centers are designed to power a specific power range. Any capacity that isn’t used is wasted—at a cost both in dollars and environmental impact. At Dropbox, we aim to reach 85% capacity, a sweet spot for usage that’s efficient yet still gives us extra capacity to handle power spikes.
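The 85% target in the last item can be expressed as simple arithmetic. This is a minimal sketch assuming a single provisioned-power figure per space; the function names and the kilowatt values are hypothetical, and only the 85% target comes from the text above.

```python
TARGET_UTILIZATION = 0.85  # target stated in the article


def utilization(draw_kw: float, provisioned_kw: float) -> float:
    """Fraction of provisioned power currently being drawn."""
    return draw_kw / provisioned_kw


def headroom_kw(draw_kw: float, provisioned_kw: float,
                target: float = TARGET_UTILIZATION) -> float:
    """Additional load (kW) that can be added before reaching the target.

    A negative result means the space is already above the target.
    """
    return provisioned_kw * target - draw_kw


# Hypothetical space: 1,000 kW provisioned, 700 kW currently drawn.
print(f"utilization: {utilization(700, 1000):.0%}")
print(f"room before target: {headroom_kw(700, 1000):.0f} kW")
```

The gap between the target and 100% is the buffer left for power spikes.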
Through these practices we now operate our data centers at a PUE level which is top of class in our industry. By 2020, we were operating at 17% below the industry average. In an effort to avoid stagnation, though, we continue our collaboration and solution exploration with current and future data center providers.
Optimize overall power consumption
We don’t just want to use our power more effectively. We want to use less of it. There are several tactics that have let us reduce the energy we use.
Quickly power down decommissioned hosts
At Dropbox we have a continuous flow of servers that have reached their end of life. As we continued to collect data on our operations, we noticed there was a gap in time between decommissioning hosts and actually powering down the servers. Engineers once manually intervened in most aspects of server maintenance, including provisioning and decommissioning.
To keep from running these unused servers, we leveraged ClusterOps’ Pirlo system, originally built to provision servers, to roll out an automation service that powers down a server host immediately after it is decommissioned. This simple change has saved us an estimated 5% in power over each server’s lifespan.
Leverage a lower power state for servers in our free pool
When a server is deployed into Dropbox data centers, it’s first tagged as being available in our free pool. This means the server is sitting online but idle, waiting for one of our services to allocate it for use.
When evaluating a full rack of servers, we saw we were consuming just below 5 kilowatts of power per rack even as it sat idle. We looked for ways to reduce the power consumption of servers not yet in active use.
We’re currently in the process of introducing a new state in our data centers: HDD Standby. It spins down a server’s disk drives while the operating system is still running, reducing power while still keeping the server alive and available to be identified and allocated to a service. Spinning disk drives use a lot of power. HDD Standby will deliver an estimated 50% power savings on our storage hosts and 25% savings on Hadoop Distributed Filesystem hosts.
Although the savings are substantial, we have come across challenges. Even when hosts are sitting in free pool waiting to be allocated, we still need to monitor the health of all drives, to make sure they can be allocated as needed. Before HDD Standby the drives could respond immediately, since they were already spun up. Now when that query comes in, the server has to wake up the spun down drives, which makes each query take longer.
We can optimize this process, though. Previously we would query drives in series, which means the server would spin up each drive one at a time, check it, then move on to the next. In the last few weeks we’ve pushed code that allows drive queries to be run in parallel. This way all of the drives can be queried and spun back down at the same time. The result is a 99% decrease in time it takes to run a query—time during which most of a server’s disk drives don’t need to be spun up.
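The difference between the two approaches can be sketched with a thread pool. This is an illustration under stated assumptions, not our production code: `check_drive` is a hypothetical stand-in that simulates spin-up latency with a short sleep, and the drive names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SPIN_UP_SECONDS = 0.05  # hypothetical stand-in for real spin-up latency


def check_drive(drive):
    """Hypothetical health check: wait for the drive to wake, then report OK."""
    time.sleep(SPIN_UP_SECONDS)
    return drive, True


def check_serial(drives):
    """Wake and check one drive at a time."""
    return dict(check_drive(d) for d in drives)


def check_parallel(drives):
    """Wake and check all drives at once, one worker thread per drive."""
    with ThreadPoolExecutor(max_workers=max(1, len(drives))) as pool:
        return dict(pool.map(check_drive, drives))


def timed(fn, drives):
    """Run fn(drives) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(drives)
    return result, time.perf_counter() - start


drives = [f"sd{letter}" for letter in "abcdefgh"]  # made-up device names
_, serial_s = timed(check_serial, drives)
_, parallel_s = timed(check_parallel, drives)
print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Serial time grows with the number of drives, while parallel time stays close to the latency of a single spin-up, which is why the gain is largest on hosts with many drives.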
Rightsizing our capacity
There is a fine balance between having enough capacity to mitigate supply / demand risks, such as sudden spikes in traffic caused by regressions or rack delivery delays caused by chip shortages, and having too much idle capacity that makes our system inefficient.
Based on historical data, we’ve built a solid model to determine the right amount of capacity we should have as we go forward. We constantly measure our capacity efficiencies and adjust our supply and demand accordingly.
In previous iterations of Dropbox, capacity planning was done on an annual basis and was decided by historical data and input from service owners. As we’ve expanded the team and expertise, we’ve moved to a monthly planning model, plus we now actively monitor our systems to ensure capacity is being properly used. This enables us to build a more reliable model from more data points, which increases accountability and allows for quicker, more nimble pivots in capacity planning.
We always try to use our hardware to its fullest potential by optimizing all layers of data processing, from hardware and operating systems to TCP congestion protocols and compression algorithms. On the data center side, we’re constantly looking at how we can get the maximum use of already-purchased servers through infrastructure-wide improvements.
One of the main initiatives on that front is moving our orchestration platform to Kubernetes. This should give us several major efficiency benefits:
- Multi-tenancy. Kubernetes’ bin packing allows us to put multiple services of different shapes (in terms of CPU/memory/network dimensions) on the same server to maximize the overall resource usage.
- Oversubscription. In some cases, we can safely overcommit some resource types (e.g. CPU) without degrading the latency of their work. This can improve the efficiency of bursty batch jobs.
- Background jobs. Internal “spot instances” can be utilized by low-priority jobs (e.g. data and metadata verifiers). This can lower utilization during peak hours and increase it during idle times, optimizing how much power we need and how effectively we use it.
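To make the multi-tenancy point concrete, here is a minimal bin-packing sketch using the first-fit decreasing heuristic over two dimensions (CPU and memory). It is a toy illustration, not how the Kubernetes scheduler is implemented, and the service names, request sizes, and the 256 GB memory figure are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    cpu: int     # cores requested
    mem_gb: int  # memory requested


@dataclass
class Server:
    cpu_free: int
    mem_free_gb: int
    services: list = field(default_factory=list)

    def try_place(self, svc: Service) -> bool:
        """Place svc here if both CPU and memory fit; report success."""
        if svc.cpu <= self.cpu_free and svc.mem_gb <= self.mem_free_gb:
            self.cpu_free -= svc.cpu
            self.mem_free_gb -= svc.mem_gb
            self.services.append(svc.name)
            return True
        return False


def pack(services, server_cpu, server_mem_gb):
    """First-fit decreasing: largest requests first, first server that fits."""
    servers = []
    for svc in sorted(services, key=lambda s: (s.cpu, s.mem_gb), reverse=True):
        if svc.cpu > server_cpu or svc.mem_gb > server_mem_gb:
            raise ValueError(f"{svc.name} does not fit on an empty server")
        # any() stops at the first server that accepts the service.
        if not any(server.try_place(svc) for server in servers):
            fresh = Server(server_cpu, server_mem_gb)
            fresh.try_place(svc)
            servers.append(fresh)
    return servers


# Hypothetical services of different "shapes".
workload = [
    Service("api", 32, 64),
    Service("cache", 8, 128),
    Service("batch", 16, 32),
    Service("indexer", 24, 96),
    Service("verifier", 8, 16),
]
for i, server in enumerate(pack(workload, server_cpu=48, server_mem_gb=256)):
    print(i, server.services)
```

With one service per server this workload would occupy five machines; packing by shape fits it onto two in this example, which is the kind of gain multi-tenancy is meant to deliver.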
Internally, Dropbox handles around 10x more metadata verification queries than it does queries from users. These queries come from many sources: MySQL replication verifiers (pt-table-checksum), filesystem verifiers, block verifiers, cold storage verifiers, security verifiers, etc. All of these can benefit from being handled as low-priority job types that maintain steady progress during off-peak hours.
Densifying our fleet
We continue to evaluate and adopt new hardware (see our case study on Magic Pocket) that packs more capacity into the same space, getting more storage and more processing without adding more servers, more racks, or more rooms of them. This densification, doing more with the same amount of hardware and energy, goes a long way to making our operations more carbon efficient.
Our recently introduced Magic Pocket storage platform uses 20TB drives and offers 43% more storage capacity compared to our previous storage platform. A single storage enclosure can now store over 2PB—up nearly 4x in two generations.
On the compute side, our current platform introduced 48 cores into a single CPU socket, a 3x increase from the previous generation.
Going forward, hardware makers are developing new technology to continue densification. Hard drive vendors have a roadmap that reaches to 35TB per drive. CPU vendors have announced processors with 128 threads, and are pursuing higher core counts.
Services can be optimized to take advantage of densification by scaling up and/or utilizing multi-tenancy. But even without these additional optimizations, infrastructure compute per watt continues to improve.
Source 100% renewable electricity
We looked far and wide through our operations to ensure we’re as efficient as possible. We feel confident we’re using industry best practices across the board and are constantly auditing and revisiting our efforts. Moving forward, we want to ensure the energy we need to operate our data centers is renewable, starting with storage.
In 2021, we’re making huge investments to procure renewable energy, as just one of many companies cited in a Goldman Sachs report on making all their investments more sustainable. We’ve committed to making the direct power consumption of our storage platform 100% carbon neutral. We’re starting there because we know it’s important to our customers, but it’s important to us, too.
Results (so far)
We’ve reduced our data center carbon footprint by 15% in the last 1.5 years. But ensuring the electricity powering our data centers is covered by 100% renewable energy is only the beginning. Because we run a hybrid model, this means working with our public cloud partners to make sure we meet sustainability goals not just on our own premises, but globally through our partners.
A call to action
As more companies begin their sustainability journey, we challenge them: Don’t take the easy way out and only look for carbon offsets! We believe that all companies can look internally and find other ways to minimize their carbon footprint in the data center space.
Right now, four-fifths of U.S. energy comes from fossil fuels that contribute to climate change. Only a tenth comes from renewable sources. You need to understand the difference between carbon credits (which don’t reduce emissions) and carbon offsets (which do), but all the offsets in the world won’t get us as a planet to where we need to be.
Bill Gates’ new book, How To Avoid A Climate Disaster, is worth reading if you’re serious about making the sort of changes we did. He lists ways that even small startups can make differences that add up to a healthier climate: Get the local chamber of commerce to plant trees. Be an early adopter of greener tech, such as our densified racks. Connect with government-funded researchers who don’t have your proven ability to bring their innovations to market.
“Doing only the easy things won’t solve the problem,” Gates writes. “That means accepting more risk … Companies and their leaders need to be rewarded for making bets that could move us forward on climate change.”
Interested in helping us build a more efficient supply chain or flexible data center strategy? We’re hiring! Dropbox works Virtual First, which means remote rather than office work is the primary experience for all Dropboxers. Join us!