Handling system failures during payment processing requires real-time identification of the issues in addition to offline detection, with the goal of eventual consistency. No matter what goes wrong, our top priority is to make sure that customers receive service for which they’ve been charged, and aren’t charged for service they haven’t received. Accurate payment processing is a crucial element in being worthy of trust, a core Dropbox company value.
In most software systems, a failure might result in a page load error or a failed database transaction. In a payments system, a failure during a charge request can also leave us uncertain about where the money for that request ended up: is it in our company’s account or still in the customer’s account? These system failures are extremely rare, but when processing as many transactions a day as Dropbox does, even a small probability can lead to multiple occurrences a day. Designing payments infrastructure that can resolve issues such as these is vital to keeping our customers’ trust and providing our finance team with accurate information.
To understand how system failures can disrupt payment processing, it’s important to understand each step involved in handling a customer’s purchase. When a customer visits the Dropbox website and elects to buy one of our products, we ask the customer to enter their payment information on the purchase form. After the customer submits the form, the system collects their payment information and securely sends it, along with the amount we want to charge, to the external partner responsible for processing that type of payment information. For the purpose of this discussion, we’ll assume that the payment information in question is credit card information, not PayPal or another payment method that Dropbox accepts.
When our credit card partner receives the credit card information, they verify that the card is valid, store it for future charges (e.g. monthly recurring billing), and then attempt to charge the specified amount to the card. If the verification or charge fails, the credit card processor sends us a response containing a descriptive error code. In that case, we refresh the purchase form, tell the customer that the charge attempt failed, and ask the customer to try again.
If the charge is successful, the credit card processor responds with a success message as well as a token that we can use to reference the saved credit card for future charges. Upon receiving this success response, we store the payment result in our records. Finally, we turn on the customer’s service, a step commonly called provisioning.
As illustrated by the diagram above, we require communication with our external payment processor in order to complete the charge. This external communication involves side effects—changes in state as a result of the communication request. In particular, we care about whether money is moved from the customer’s account to the merchant’s account (Dropbox, in this case). In the presence of system failures, it can be unclear whether this occurred or not after making a request to an external system.
There are three main failure points in the charging system described above:
- The network connectivity between Dropbox and the processor is disrupted, causing communication timeouts or lost information. As a result, either the external partner does not get our charge request or we do not get their charge response.
- The external partner has an internal error or machine failure, so we never receive a charge response from the processor.
- We have an internal error or machine failure. Depending on the timing of this failure, we are either unable to send the charge request or unable to receive the charge response.
All of these failure scenarios result in one of two distinct situations:
- The charge request is never processed by the external partner, so no money is transferred. This happens when a) the network fails while the charge request is in flight to the external partner, b) the external partner has a system failure before they have a chance to process the charge request, or c) we have a system failure before we send the charge request.
- The money is transferred but not recorded in our system. This happens when a) a network failure between the external partner and us means we never receive the charge response, or b) a system failure on our side prevents us from processing the charge response.
In both cases, the core scenario is the same: a charge request was made but never marked completed in our system and is thus in an unknown state. After detecting occurrences of this scenario, the solution is to discover whether the charge request actually went through or not, then address this charge appropriately.
The solution: No charge unaccounted for
Detecting Lost Charges
In order to detect incomplete charge requests, we record each charge request in our database before sending the information to our external partner. In this charge record, we store a customer identifier, the charge amount, as well as which payment processor the request will be sent to. The charge record also has a status attribute that tracks which part of the process the charge is in. Before we perform the charge, the charge record’s status is set to created. Next, we send the charge request to our external payment partner. When we receive the charge response from our partner, we update the charge record with a new status based on the response, normally either declined or successful.
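The write-before-send ordering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `ChargeRecord`, `ChargeStatus`, and `perform_charge` are hypothetical, a plain dict stands in for the database, and `send_to_processor` is a placeholder callable rather than any real processor API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
import uuid


class ChargeStatus(Enum):
    CREATED = "created"
    SUCCESSFUL = "successful"
    DECLINED = "declined"


@dataclass
class ChargeRecord:
    # Field names are illustrative assumptions, not the actual schema.
    customer_id: str
    amount_cents: int
    processor: str
    status: ChargeStatus = ChargeStatus.CREATED
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    charge_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def perform_charge(db, send_to_processor, customer_id, amount_cents, processor):
    """Persist the record first, then call the processor, then update the status."""
    record = ChargeRecord(customer_id, amount_cents, processor)
    db[record.charge_id] = record         # written before any external side effect
    approved = send_to_processor(record)  # may raise or time out
    record.status = ChargeStatus.SUCCESSFUL if approved else ChargeStatus.DECLINED
    return record
```

If `send_to_processor` raises or the process dies during the call, the record stays in the `created` status, which is exactly the trace that later detection relies on.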
This status attribute of the charge request allows us to determine if a charge request was left in an unknown state. If the charge request has either the declined or successful status, then the charge response was correctly received and processed by our system. If the charge request has the created status, it’s necessary to look at the charge request’s creation time to figure out whether the request is in an unknown state or not. It’s possible that the charge request was only recently sent (milliseconds ago) and we could still get a charge response for it in the future. If the creation time is more than a couple of minutes in the past (the exact value depends on the timeout configurations), then we know that the charge request would’ve timed out by now, so this request must be in an unknown state. To summarize, charge requests are in an unknown state if they have the created status and are more than a couple of minutes old.
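The rule summarized above fits in a single predicate. The sketch below is a minimal illustration: `is_unknown` is a hypothetical helper, and the two-minute threshold is an assumed value, since the real one depends on the timeout configuration.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold; the actual value depends on the configured timeouts.
UNKNOWN_AFTER = timedelta(minutes=2)


def is_unknown(status, created_at, now=None):
    """Return True if a charge is still 'created' and older than the threshold."""
    now = now or datetime.now(timezone.utc)
    return status == "created" and (now - created_at) > UNKNOWN_AFTER
```

A recently created charge is deliberately excluded, because its response may still be in flight.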
A common way to solve a lost request is to simply reissue the request. However, this is not safe when the request has effects that should only happen one time. Each charge request could result in money being transferred out of a customer’s account. We never want to charge the customer multiple times for the same item so reissuing the charge request is dangerous. Even if we refund the extra charges later, the customer still sees the funds momentarily taken out of their account and this breaks the trust we want to establish with the customer. Since we don’t have an infallible detection system for the previous charge request’s state, it’s safer to abort the purchase attempt. Therefore, the system doesn’t grant a customer their Dropbox service until we have confirmation of a successful charge. The important result of this design decision is that if we discover a charge was successful but we have no record of it due to system failures, then the payment needs to be refunded since we would not have turned on the customer’s service in this case.
Determining the Charge Status
Now that there is a way to identify the transactions in an unknown state, and clear steps for handling them if the customer was charged, the next step in this solution is to discover whether the charge went through or not. The charge status can usually be discovered by communicating with the external payment processor. Most payment processors provide a convenient API to look up a charge’s status by either a merchant identifier or a transaction identifier. The merchant identifier, otherwise known as a merchant order number, is a unique identifier supplied by the merchant (Dropbox in this case) to reference this charge request. The transaction id, which we internally refer to as the external transaction id, is determined by the external partner at the time of the charge and referenced in the charge response. Thus, we only know the external transaction id for a charge request if we received and processed the charge response. In the case of system failures, as discussed in the problem description, we do not receive a charge response, so we do not have the external transaction id. That leaves the merchant order number as the only identifier available for an API lookup. Since Dropbox formulates the merchant identifier and sends it to the external payment processor, we have access to it at the time we’re making the charge request, and we store the value on the charge request record.
Using this merchant identifier, we do a lookup for matching transactions using the payment processor’s API. If a matching transaction is found, we update the charge request record with either a declined or successful status, as appropriate, based on the transaction’s status. On the other hand, if a matching transaction cannot be found, we need another way to resolve the transaction status. This case is possible if the external payment processor has a system failure after they perform the charge but before they are able to record the charge in their own system. It can also be caused by an internal error on our side: if the merchant identifier is not correctly recorded for the transaction, we are unable to use it with the processor’s lookup API.
When the lookup API cannot be used, the transaction’s status can still be found in the processor’s settlement files. Every payment processor makes settlement files available for download to each merchant that they service. These settlement files list every successful transaction that was processed on behalf of that merchant, along with other information, and are normally split into 24-hour time periods. Each settlement record includes the external transaction id and merchant identifier fields mentioned earlier, so if lookup through the processor’s API fails, a search through the settlement file for a matching record may be successful. If a match is found, then the charge record’s status is changed to successful. If a match is not found, then the charge record’s status is changed to error to acknowledge that something went wrong during the charge request and we are unable to determine what occurred.
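The two-stage resolution described above (lookup API first, settlement file second) can be sketched as follows. Everything here is an assumption made for illustration: `resolve_status` is a hypothetical function, `lookup` stands in for whatever lookup call a given processor exposes, and settlement records are modeled as dicts with a `merchant_order_number` key.

```python
def resolve_status(merchant_order_number, lookup, settlement_records):
    """Resolve an unknown charge to 'successful', 'declined', or 'error'."""
    # Stage 1: ask the processor's lookup API (placeholder callable that
    # returns "successful", "declined", or None when nothing matches).
    found = lookup(merchant_order_number)
    if found in ("successful", "declined"):
        return found

    # Stage 2: settlement files list only successful transactions, so any
    # record with a matching merchant identifier means the charge went through.
    for record in settlement_records:
        if record.get("merchant_order_number") == merchant_order_number:
            return "successful"

    # Neither source knows about the charge: record that we could not tell.
    return "error"
```

For example, `resolve_status("m2", lambda m: None, [{"merchant_order_number": "m2"}])` falls through to the settlement search and returns `"successful"`.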
Additionally, the settlement file allows us to discover any successful charges which we have no record of due to internal bugs in our system or rare database failures. For this reason, we set up a background process which parses these settlement files and verifies that we have a charge record in our system for each settlement record. For any settlement record without a charge record, a charge record is created with information from the settlement file. With this process, we can assert that all successful charges will have a corresponding record in our system. Note that this achieves eventual, rather than immediate, consistency: the settlement files can arrive up to several days after the charge was performed, and we don’t create the charge record until we have the settlement file.
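A reconciliation pass of this kind might look roughly like the sketch below. It assumes settlement records have already been parsed into dicts; the `reconcile` function, the field names, and the dict keyed by merchant order number are all hypothetical stand-ins for the real parser and database.

```python
def reconcile(settlement_records, charges_by_order_number):
    """Create a charge record for every settlement record we have no record of.

    Returns the merchant order numbers of the records that were created.
    """
    created = []
    for record in settlement_records:
        key = record["merchant_order_number"]
        if key in charges_by_order_number:
            continue  # already accounted for
        # Backfill from the settlement file; a settlement entry implies success.
        charges_by_order_number[key] = {
            "merchant_order_number": key,
            "external_transaction_id": record["external_transaction_id"],
            "amount_cents": record["amount_cents"],
            "status": "successful",
        }
        created.append(key)
    return created
```

Running the pass twice over the same file creates nothing the second time, which keeps the background job safe to retry.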
Reversing a Successful Charge
Occasionally, a charge is successfully applied for a user but we aren’t notified about it right away and thus don’t provision service for the user. In such cases, we need to return the customer’s money as soon as we are notified by the payment processor of the charge. There are two ways to reverse a payment: voiding or refunding.
The decision of which method to use is influenced by many things. First, the cost of using an external payment processor involves fees that are assessed on each transaction that we perform through their platform. Voiding a charge, which is basically cancelling it, normally does not cost a fee. Refunding a charge, however, involves performing another payment in the opposite direction for which we need to pay a fee. Therefore, voiding a charge is cheaper than refunding the charge. Second, a voided charge will not show up on a customer’s end of month bank statement at all. Conversely, refunding the charge results in both the original charge and the refund payment being present on the bank statement. This could come as quite the surprise for the customer. From the customer’s perspective, the purchase form submission either returned an error or crashed with a 500 error (if an internal system failure occurred) and yet the customer sees evidence that we charged them. Even though we returned the money, this is still a negative experience for the customer. Third, if the charge request is successful and then we refund this charge later, there is a clear period of time between the charge and the refund during which the customer has less money in their account than they should have. Generally, this is a small amount of money but for some customers this could have a serious impact on their ability to complete other transactions while they are waiting for the refund. For these reasons, voiding is superior to refunding.
Unfortunately, the ability to void a transaction depends on how long it’s been since the charge was completed. To understand why the timing matters, it is necessary to know the steps of fulfilling a charge request. When the payment processor receives the charge request, they record the request in their system, verify the payment information, and then ask the credit card company to perform the charge. The credit card company responds that they accept the charge and apply it to the card. At this point the payment processor responds to us, the merchant, to say that the charge was successful. However, the payment is only “submitted for settlement” at this stage, and the charge may not have settled on the card yet. Settlement means that the funds have been transferred and the charge can no longer be cancelled. For this reason, voids can only occur during the time window when the payment is in the “submitted for settlement” state but not yet settled. This time window generally lasts less than 24 hours. If this time window has passed, then a refund must be performed instead.
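The choice between the two reversal methods then reduces to a check on the processor-side state. The sketch below is illustrative only: `ProcessorState` and `choose_reversal` are hypothetical names, and real processors may expose more states than the two discussed here.

```python
from enum import Enum


class ProcessorState(Enum):
    SUBMITTED_FOR_SETTLEMENT = "submitted_for_settlement"
    SETTLED = "settled"


def choose_reversal(state):
    """Prefer a void while the charge is unsettled; otherwise fall back to a refund."""
    if state is ProcessorState.SUBMITTED_FOR_SETTLEMENT:
        return "void"    # no fee, and nothing appears on the customer's statement
    if state is ProcessorState.SETTLED:
        return "refund"  # funds already moved, so a reverse payment is required
    raise ValueError(f"unexpected processor state: {state!r}")
```

Checking the state reported by the processor, rather than estimating the elapsed time locally, avoids guessing at a window that only "generally" lasts less than 24 hours.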
Regardless of which method is used to reverse the transaction, once the reversal is complete, our records and the customer’s account are in the correct state. This combination of immediate mitigation and eventual consistency protects us from losing track of payments due to system failures, which allows us to confidently assert that we are aware of all payments flowing through our system. This is just one of the ways that the monetization platform team makes sure that Dropbox is worthy of trust.