What’s new with Robinhood, our in-house load balancing service

// By Yi-Shu Tai • Oct 23, 2024

Robinhood is the internal Dropbox load balancing service we deployed in 2020. It is responsible for routing all our internal traffic between servers to balance service load. Before we built Robinhood, most Dropbox services suffered from uneven load distribution among backends. Hardware differences throughout our fleet and the limitations of our prior load balancing algorithms led to reliability issues due to overloaded instances. In order to solve this problem, we often had to over-provision a service’s fleet, which inevitably increased our hardware spend—a pricey and avoidable headache.

Robinhood solves these long-standing load balancing problems at Dropbox scale, across our entire global data center footprint. Last year, we introduced the latest iteration of Robinhood: By leveraging proportional–integral–derivative (PID) controllers, Robinhood can now more quickly and effectively manage load imbalances. This has not only improved the reliability of our infrastructure, but also yielded significant hardware cost savings. And with an increase in AI workloads that power our latest intelligent features, effectively managing demands on our GPU resources is more critical to the business than ever.

The challenge of load balancing at Dropbox

Our in-house service discovery system can scale to hundreds of thousands of hosts across multiple data centers around the globe. Some Dropbox services have millions of clients, and we cannot allow each client to create connections to every server instance; doing so would put too much memory pressure on servers, and the flood of TLS handshakes during server restarts could overwhelm them. Instead, the service discovery system gives each client a subset of servers to connect to.

Without any other information, the best load balancing strategy a client can use is round-robin over the list of addresses given by our service discovery system. However, the load on each server instance can be quite imbalanced with this method1. While increasing the subset size is an easy mitigation, it won’t completely eliminate the imbalance and would just give service owners another parameter to tune. And there’s another, deeper issue: even if we send the same number of requests to each server, the underlying hardware might differ from one server to the next, so the same request can consume a different amount of resources depending on the hardware class.
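
To get a feel for the imbalance, here is a toy simulation (ours, not Dropbox tooling) of clients round-robining over random subsets of servers. As footnote 1 explains, the per-server load ends up roughly binomially distributed, so some servers run noticeably hotter than others even though every client behaves identically. The fleet sizes and request counts below are made up for illustration.

import random
from collections import Counter

# Toy simulation: M clients each round-robin over a random subset of s servers
# out of N, all sending the same number of requests.
N, M, s, requests_per_client = 100, 10_000, 5, 60

load = Counter()
for _ in range(M):
    subset = random.sample(range(N), s)
    for i in range(requests_per_client):
        load[subset[i % s]] += 1  # simple round-robin over the client's subset

avg = sum(load.values()) / N
print(f"max/avg load ratio: {max(load.values()) / avg:.2f}")  # typically well above 1.0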

At its core, the issue is that clients do not have visibility into server load. Years ago, we attempted to solve the problem by having servers attach their load in the response headers. Clients could then perform load balancing themselves by picking the least-loaded endpoint in their subset of addresses. The approach required code changes to servers and clients for the special load header, which was hard to adopt globally. More importantly, while the results were promising, they weren’t good enough.

In 2019, we officially decided to build Robinhood. This new service, built on top of our existing in-house service discovery system, collects load information from servers and attaches it to the routing information. Robinhood leverages Envoy’s Endpoint Discovery Service (EDS) to deliver that load information as endpoint weights, so that clients can perform weighted round-robin. Now that the gRPC community is adopting the Envoy xDS protocol, Robinhood is compatible with both our Envoy and gRPC clients2. Another reason to build a new service was that, to our knowledge, there was no existing load balancing solution that met our needs at the time.
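
Conceptually, the routing information published for a service is shaped like an EDS ClusterLoadAssignment with per-endpoint weights. The sketch below mirrors the xDS field names, but the cluster name, addresses, and weights are made up for illustration and are not Robinhood’s actual output.

# Sketch of the kind of routing data a weighted-round-robin client consumes,
# shaped like an Envoy EDS ClusterLoadAssignment (illustrative values only).
cluster_load_assignment = {
    "cluster_name": "example-service",
    "endpoints": [{
        "locality": {"zone": "zone_1"},
        "lb_endpoints": [
            {
                "endpoint": {"address": {"socket_address": {"address": "10.0.0.1", "port_value": 8080}}},
                "load_balancing_weight": 120,  # computed by the LBS from load reports
            },
            {
                "endpoint": {"address": {"socket_address": {"address": "10.0.0.2", "port_value": 8080}}},
                "load_balancing_weight": 80,
            },
        ],
    }],
}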

After a few years in production, the results have been promising. We successfully reduced fleet size by 25% for some of our largest services, resulting in substantial hardware cost savings each year. Reliability has also improved, thanks to fewer over-utilized processes.

Below, we break down the architecture of Robinhood to illustrate how we created a far superior system for balancing service load.

The architecture of Robinhood

A deployment of Robinhood within a single data center

As shown in the illustration above, a Robinhood instance is deployed to each of our data centers and consists of three parts: the load balancing service, a proxy, and a routing database.

Load balancing service (LBS)

This is the heart of Robinhood. The LBS is responsible for collecting load information and generating routing information with endpoint weights. Since multiple LBS instances could otherwise update the routing info for a service concurrently, we use our in-house shard manager to assign the primary worker for each service. In addition, each service is independent, so we can shard the LBS by service and scale it horizontally.

Proxy

The proxy is responsible for routing a service’s load information to the corresponding LBS partition within the data center. A nice side effect of this setup is that the proxy also reduces the number of connections directly to LBS processes. Without a proxy, every LBS process would have to be connected to all nodes within our infrastructure. Instead, LBS processes are only connected to the proxy, which greatly reduces the memory pressure on the LBS. Also, because the proxy only receives connections from nodes within the same data center, it can be scaled horizontally. This pattern is used in many parts of our infrastructure to protect services from receiving too many TLS connections.

Routing database

The routing database is a ZooKeeper/etcd-based database that stores routing information for services, such as hostnames, IP addresses, the weights generated by the LBS, etc. ZooKeeper and etcd can notify all watchers in real time of any node/key changes, and they scale well for our read-heavy service discovery use case. The eventual consistency guaranteed by ZooKeeper/etcd is good enough for service discovery as well.
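
As an illustration of the watch pattern, a consumer of the routing database might subscribe to updates roughly like the sketch below. It uses the open-source kazoo ZooKeeper client with a hypothetical host and path; it is not Dropbox’s in-house client code.

from kazoo.client import KazooClient

# Illustration only: watch a routing node and get notified whenever the LBS
# writes new endpoint weights for the service.
zk = KazooClient(hosts="zk.internal:2181")  # hypothetical address
zk.start()

@zk.DataWatch("/routing/example-service")  # hypothetical path
def on_routing_update(data, stat):
    if data is not None:
        print(f"routing info updated (version {stat.version}): {len(data)} bytes")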

A closer look at the load balancing service

The goal of load balancing is to keep the utilization of every node as close as possible to the average utilization across the fleet. To achieve this, the LBS creates a PID controller for each node and uses the average utilization as the setpoint. The LBS then applies the output of the PID controller as a delta to the node’s endpoint weight and normalizes the weights across all endpoints of the service. While it takes a couple of adjustment rounds for a new node to converge on the average utilization, the PID controller works quite well for load balancing.
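
To make the mechanics concrete, here is a minimal sketch of a PID-based weight update in Python. This is an illustration, not the production LBS code; the gains, the floor on weights, and the normalization are assumptions we made for the example, and utilization is treated as a CPU percentage (0–100).

class PIDController:
    """Textbook PID controller; a sketch, not Dropbox's production implementation."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd  # illustrative gains
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measured):
        error = setpoint - measured
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def adjust_weights(weights, utilizations, controllers):
    """One adjustment round: the setpoint is the average utilization; each
    controller's output is applied as a delta to the node's endpoint weight,
    then weights are normalized so the average weight stays at 100."""
    avg = sum(utilizations.values()) / len(utilizations)
    for node, util in utilizations.items():
        # Under-utilized nodes get a positive delta (more traffic);
        # over-utilized nodes get a negative delta (less traffic).
        weights[node] = max(1.0, weights[node] + controllers[node].update(avg, util))
    scale = 100 * len(weights) / sum(weights.values())
    return {node: w * scale for node, w in weights.items()}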

The LBS is designed to handle a variety of scenarios that can affect load balancing, from node restarts to missing load reports. To maintain optimal performance, we implemented several strategies in the LBS to handle these edge cases, which are detailed below; a simplified sketch of the resulting update loop follows the list.

  • LBS startup. The LBS keeps load information and PID controller states in memory. During an LBS restart (which can occur due to a normal push, node rotation, hardware failure, etc.), the LBS does not immediately start updating weights, but rather waits a short period of time for load reports to come in. It restores the PID controllers’ weights by reading the existing endpoint weights back from the routing database.
  • Cold start node. New nodes frequently join a service’s fleet, so it’s important we prevent thundering herd issues. Since a new node typically has an initial utilization of 0, the LBS assigns it a low initial endpoint weight and lets the PID controller ramp it up to the average utilization.
  • Missing load reports. Failures are common in distributed systems. For example, load reports from some nodes might be delayed or never arrive because of network congestion or hardware failures. The LBS skips these nodes during weight updates, meaning their endpoint weights stay the same, since the LBS doesn’t know whether to increase or decrease them. However, if a large portion of load reports are missing—currently configured at 15%—the average utilization calculation can be off, leaving the LBS without an accurate setpoint for adjusting node weights. For safety, the LBS skips the weight update phase entirely in this case.
  • Utilization metric. CPU utilization is the most popular metric for load balancing at Dropbox. For services not bottlenecked by CPU, the number of in-flight requests is a good alternative. Therefore, the LBS supports load balancing based on CPU and/or in-flight requests.3
  • Limitations. The PID controller constructs a feedback loop to keep the utilization of each node close to the target value (the average utilization, in our case). This means that if there is little feedback—for example, for a very low-traffic service, or for very high-latency requests measured in minutes—the load balancing won’t be as effective. We argue that services with such high-latency requests should be asynchronous.
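
Pulling these rules together, here is a simplified sketch of how one weight-update round might handle them, reusing the PIDController from the earlier sketch. The 15% threshold comes from the description above; the data shapes, the cold-start weight, and the max-of-CPU-and-in-flight measurement (from footnote 3) are written the way we would illustrate them, not copied from the production LBS.

from dataclasses import dataclass

@dataclass
class LoadReport:
    cpu: float       # CPU utilization as a percentage (0-100)
    inflight: float  # in-flight requests, normalized to the same 0-100 scale

MISSING_REPORT_THRESHOLD = 0.15  # skip the round if >15% of reports are missing
COLD_START_WEIGHT = 10.0         # illustrative low initial weight for new nodes

def run_update_round(nodes, load_reports, weights, controllers):
    missing = [n for n in nodes if n not in load_reports]
    if len(missing) > MISSING_REPORT_THRESHOLD * len(nodes):
        return weights  # average would be unreliable: skip the update phase entirely

    # Balance on the larger of CPU and in-flight utilization (see footnote 3).
    utils = {n: max(load_reports[n].cpu, load_reports[n].inflight)
             for n in nodes if n in load_reports}
    avg = sum(utils.values()) / len(utils)

    for node, util in utils.items():
        if node not in weights:
            # Cold start: begin low and let the PID controller ramp the node up.
            weights[node] = COLD_START_WEIGHT
            continue
        weights[node] = max(1.0, weights[node] + controllers[node].update(avg, util))

    # Nodes with missing reports keep their previous weights; normalization of
    # the final weights is omitted here for brevity.
    return weights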

Routing across data centers

An LBS instance handles load balancing within a single data center. For cross-data center routing, there are different considerations. For example, we want to route requests to the closest data center to reduce round-trip time. Therefore, we've introduced a locality config for defining traffic splits between destination data centers:

{
  # client data center -> traffic split between destination data centers
  "zone_1": {
    "zone_1": 100,
  },
  "zone_2": {
    "zone_2": 50,
    "zone_1": 50,
  },
}

This example config indicates that for clients in zone_1, 100% of requests are sent to zone_1, and for clients in zone_2, requests are evenly split between zone_1 and zone_2.

The service discovery system utilizes this config to build Endpoint Discovery Service responses for clients. gRPC clients and Envoy perform two layers of weighted round-robin. The load balancer first selects the zone and then selects endpoints within that zone. Additionally, we support hot reload for changes to the locality config, allowing service owners to perform real-time failovers between data centers.
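
Here is a small sketch of that two-layer selection, using the zone_2 split from the example config above and made-up endpoint weights. Weighted random choice stands in for the deterministic weighted round-robin schedulers that Envoy and gRPC clients actually implement.

import random

# Locality split for clients in zone_2 (from the example config) and
# illustrative per-endpoint weights computed by the LBS for each zone.
locality_weights = {"zone_2": 50, "zone_1": 50}
endpoint_weights = {
    "zone_1": {"10.0.1.1:8080": 120, "10.0.1.2:8080": 80},
    "zone_2": {"10.0.2.1:8080": 100, "10.0.2.2:8080": 100},
}

def pick_endpoint():
    """Two-layer selection: pick a zone by its traffic split, then an endpoint
    within that zone by its LBS-assigned weight."""
    zones, zone_w = zip(*locality_weights.items())
    zone = random.choices(zones, weights=zone_w, k=1)[0]
    endpoints, ep_w = zip(*endpoint_weights[zone].items())
    return random.choices(endpoints, weights=ep_w, k=1)[0]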

Evaluating the performance of our load balancer

We measure load balancing performance with a max/avg ratio. For example, if the service owner chooses to load balance based on CPU, we use maxCPU/avgCPU as the indicator of performance. The reason is that service owners usually provision their service based on the maximum utilization among nodes, and the main purpose of load balancing is to reduce the size of the fleet. The PID controller load balancing strategy can achieve a max/avg ratio close to 1.

This graph shows the max/avg CPU and p95/avg CPU of one of our biggest Envoy proxy clusters. After enabling PID controller-based load balancing, the two metrics dropped close to 1. The max/avg ratio dropped from 1.26 to 1.01, showing a 20% (1.01/1.26 ~ 0.8) improvement.

The graph above shows the quantile breakdown of CPU utilization per node. After enabling PID controller-based load balancing, the max, p95, avg, and p5 almost consolidated into a single line. 

Let’s look at another good example:

Now, this graph shows the max/avg CPU and p95/avg CPU of another one of our biggest database frontend clusters. After enabling PID controller-based load balancing, the two metrics dropped close to 1. The max/avg ratio dropped from 1.4 to 1.05, showing a 25% improvement.

Finally, this graph shows the quantile breakdown of CPU utilization per node. After enabling PID controller-based load balancing, the max, p95, avg, and p5 almost consolidated into a single line once again. 

Why we built a config aggregator

Robinhood exposes several options for service owners to choose from, and can even apply changes dynamically. Service owners create and update the Robinhood config for their services from within their service directory in the codebase. We then store these settings in our config management service, which provides a convenient library for receiving any changes to Robinhood’s config in real time.

However, we cannot simply build and push Robinhood’s mega config regularly from the codebase due to several problems:

  • If a breaking change is introduced by a config push, it's risky to press the rollback button because we don’t know how many other services have also made changes since the last push.
  • The team that owns Robinhood is also responsible for each mega config push. This means that the Robinhood team would have to get involved in every breaking config push—which is a waste of engineering time, since most incidents can be resolved by the service owner.
  • Each push takes hours, because it rolls out to multiple data centers gradually in order to minimize potential risks.

To address these problems, we built another small service: the config aggregator.

Instead of storing one Robinhood mega config in the config management service, we break the config into per-service configs. Each per-service config only contains the configuration for that particular service. This way, service owners can update and deploy their changes at any time without worrying about being affected by changes in other services. In the event of a breaking change, service owners can also roll back config pushes or apply fixes during incidents without having to involve the Robinhood team.

To simplify the LBS and keep it focused on its primary task, we built another service—the config aggregator—which collects all the per-service configs and constructs the mega config for the LBS to consume. The config aggregator watches per-service configs and propagates the changes to the mega config in real time.

The config aggregator also provides a tombstone feature to prevent accidental deletion of a service's Robinhood config. When a service owner pushes a change to delete a service from the Robinhood config, the config aggregator puts a tombstone mark on the entry of the service instead of removing it immediately. The actual removal happens after several days. This feature also solves a race condition that could result from the different push cadences between the Robinhood config and other routing configs (e.g., Envoy config).
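
Below is a condensed sketch of the aggregation and tombstoning logic. The data shapes and the retention period are illustrative assumptions; the real aggregator watches the config management service rather than taking in-memory dictionaries as input.

import time

TOMBSTONE_TTL_SECONDS = 7 * 24 * 3600  # "several days"; the exact value is illustrative

def aggregate(per_service_configs, previous_mega_config):
    """Build the mega config for the LBS from per-service configs, tombstoning
    entries that disappeared instead of dropping them immediately."""
    now = time.time()
    mega = {}
    for service, config in per_service_configs.items():
        mega[service] = {"config": config, "tombstoned_at": None}
    for service, entry in previous_mega_config.items():
        if service in mega:
            continue
        # The per-service config was deleted: keep the old entry around with a
        # tombstone mark, and only drop it once the tombstone has aged out.
        tombstoned_at = entry["tombstoned_at"] or now
        if now - tombstoned_at < TOMBSTONE_TTL_SECONDS:
            mega[service] = {"config": entry["config"], "tombstoned_at": tombstoned_at}
    return mega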

One downside of our config management service is that it's not currently versioned. We periodically back up the mega config in case we need to revert the LBS config to a known good state.

A quick note on our migration strategy

It can be risky to switch load balancing strategies all at once. This is why we enable service owners to configure multiple load balancing strategies for a service in Robinhood. 

The LBS writes the weighted endpoint lists generated by different strategies into different entries in the routing database. At Dropbox, we have a percentage-based feature gate, so we implemented a mixed strategy in which clients use the weighted sum of the weights generated by two load balancing strategies as the endpoint weight. For example, endpoint A might be weighted at 100 based on PID-based load balancing and 200 based on a simple round-robin strategy. If we set the feature gate to 30% for PID-based load balancing, the weight of endpoint A becomes 100*0.3 + 200*0.7 = 170. This way, we can ensure that every client sees the same weight assignment for endpoints while gradually migrating to the new load balancing strategy.
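
The blending itself is just a weighted sum. Here is the example from this paragraph as a tiny sketch; the feature-gate percentage is passed in as a plain number, since the real gate lookup is Dropbox-internal.

def blended_weight(pid_weight, rr_weight, pid_fraction):
    """Blend the weights from two strategies by the feature-gate fraction."""
    return pid_weight * pid_fraction + rr_weight * (1 - pid_fraction)

# The example from the text: PID weight 100, round-robin weight 200, gate at 30%.
assert blended_weight(100, 200, 0.30) == 170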

What we learned

In designing and implementing Robinhood, we learned several key lessons about what works and what doesn't. By prioritizing simplicity, minimizing client changes, and planning for migration from the outset, we were able to streamline the LBS's development and deployment, and avoid costly pitfalls.

  • Configuration should be as simple as possible. Robinhood introduces many options for service owners to configure. However, in most cases what service owners really need is a sensible default. A good, simple default config—or even better, zero config—can save tons of engineering time.
  • Keep client changes simple, too. It can take several months to roll out changes to internal clients; although most deployments are pushed weekly, many happen only once a month, and some clients go years without being redeployed. We learned that the more changes we could shift to the LBS, the better. For example, we decided early on to use weighted round-robin for our client design, and we have not changed it since—which has significantly accelerated our progress. Limiting most of our changes to the LBS also reduces reliability risks, because we can roll back changes in the LBS within minutes if necessary.
  • Migration should be planned during the project design phase. A migration takes a huge amount of engineering time, and there are reliability risks to consider. It’s not fun, but it’s important work. When designing a new service, think about how to smoothly migrate existing use cases onto it as early as possible. The more you ask of service owners, the more the migration becomes a nightmare—especially for fundamental infrastructure components. The migration process for Robinhood was not well designed from the beginning, so we ended up spending much more time than expected reimplementing the process and redesigning the configuration. The amount of engineering time required for a migration should be a key metric for success.

After roughly a year in production, it’s safe to say that the latest iteration of Robinhood effectively addresses our long-standing load balancing challenges. The PID controller algorithm at its core has yielded promising results—showcasing significant performance improvements in our largest services—and we’ve gained valuable insights into the design and operation of load balancing services at Dropbox-scale.

Special thanks to Mikhail Dolinin, Sai Teja Duthuluri, Nikitha Girish, Jared Hance, Itsik Hefez, Pardhu Konakanchi, Andrew Lazarev, Miguel Michel, Ruslan Nigmatullin, Pranesh Pandurangan, Yi-Shu Tai, Jialin Xu, and all past and present members of the runtime team and services platform teams for their contributions.

Footnotes

1 Let N, M, and s denote the number of servers, the number of clients, and the subset size of addresses, respectively. The number of clients a server connects with is a sample from the binomial distribution B(M, s/N). As mentioned, clients perform simple round-robin over the set of addresses given by service discovery. Therefore, if each client sends roughly the same number of requests, the load distribution on the server side is similar to a binomial distribution.

2 We extended our existing service discovery system to support the gRPC xDS protocol (A27). As of the date of this blog, gRPC clients do not support weighted round-robin on endpoint weights from the control plane, so we implemented a custom weighted round-robin picker based on earliest deadline first scheduling.

3 We encountered an interesting case where services sometimes become stuck with degraded I/O. In such situations, the CPU remains low on those nodes, and the LBS keeps increasing their weight to bump their CPU to the average, leading to a death spiral. As a solution, we ended up using the maximum of CPU and in-flight requests as the load measurement for balancing the service.

~ ~ ~

If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit dropbox.com/jobs to see our open roles, and follow @LifeInsideDropbox on Instagram and Facebook to see what it's like to create a more enlightened way of working. 

