Design a Payment System Like a Senior Engineer

Let's design a payment system backend that integrates a payment service provider (PSP) such as Stripe, PayPal, or Lemon Squeezy. This is one of the few systems where a bug doesn't just cause a bad user experience; it can make money disappear from real people's accounts. That's why the core challenge in this type of system is not performance, because performance doesn't matter if the design is not correct. Every design decision should come back to one question: what happens if something goes wrong in the middle of the transaction? So, let's go through a complete system design blueprint for a payment system backend like this and see what a correct implementation looks like.

When it comes to integrating payments into your system, there are three approaches you can take. Let's see what they are and which one we are designing.

The first one is a no-code checkout solution. This is where you use something like Stripe, PayPal, or Lemon Squeezy checkout links. Stripe, for example, offers hosted checkout pages and payment links that you can create, and this requires no backend integration at all. Say you sell a single subscription with no other tiers, just free and paid. For the paid plan you can redirect users to the Stripe checkout link, where they sign up, and they come back to the platform once the payment is confirmed. The upside is that you can go live in hours or even minutes: create the checkout link and add it to your front end so that users are redirected to it whenever they click upgrade. Stripe handles everything, from the UI of that page and selecting the payment method all the way to confirming the payment. The downside, of course, is that you have no custom checkout experience: users have to go to Stripe's page to complete the checkout.
This is also the least customizable option, which means you may hit its limits very fast. If you add a second tier, users now need to upgrade and downgrade between tiers. And if it's an e-commerce store, you cannot just use Stripe payment links, because the amount to process depends on what's in the basket.

The second approach is to build your own payment processor. This means connecting directly to the card networks like Visa and MasterCard and acquiring banking licenses so that you can also handle the compliance yourself. The upside is that you get the lowest possible fees, because you own the payment processor, and this makes sense if you operate at massive scale. Think of Amazon: they process payments every minute, so even a 0.1% reduction in fees is worth it for them. The other upside is that you have full control over everything. The downside is the cost of getting there: you will need years of licensing and legal work, plus a massive upfront investment and a lot of complexity to build the processor and then integrate it into your payment system.

That's why there is a third option in the middle, and it's the most popular one, because not everyone operates at Amazon's scale. Most companies just want to process payments inside their system. That option is to integrate a payment service provider, something like Stripe, PayPal, Lemon Squeezy, or any other provider you choose based on region, fees, and other factors. In this case, the PSP handles the entire payment processing chain on your behalf, including card network integrations, banking relationships, fraud detection, compliance, and global payment method support.
The upside is that you don't have to go through licensing or banking setup. You just integrate the provider, and you can easily process payments globally. The downside is mainly the transaction fees. With Stripe's standard US card pricing you pay roughly 2.9% plus 30 cents per transaction, and a bit more for international payments. Some other providers charge even more than that. And the backend still needs careful design, so it is nothing like the first option, where you just create a payment link and redirect users to it.

The majority of companies, from startups to mid-size companies to enterprises, use this middle option and integrate a PSP. That's why we will cover this case: integrating something like Stripe and building the payment system backend around it.

Now that this part is clear, let's move on to the functional and non-functional requirements, starting with the functional ones. First, customers must be able to make one-time payments and save payment methods for reuse. Second, merchants or the platform must be able to accept payments, including split payments. The third point is optional depending on the type of platform. If it's an e-commerce platform selling products, there are no subscriptions; customers just purchase whenever they come back. If it's software, you will very likely need subscriptions, so you need to support subscriptions and billing, including trials and smart retries. The other critical part is refunds, if they are part of the product. Here you need to support both full and partial refunds, and they must flow through a dedicated refund state machine.
Another important part is that we need to treat webhooks as the source of truth for the payment status. And the entire payment life cycle must be modeled as a finite state machine with enforced transitions.

Apart from this, there are very important non-functional requirements. The first is correctness, which means exactly-once processing. You cannot double charge the same purchase, because that means deducting the same amount twice from the user's account. For availability, depending on the system, you might target something like five nines of uptime. But in this type of system the more important part is consistency. Payment systems are one of the few domains where, in CAP theorem terms, we should favor consistency over availability. Say a user checks their balance or the status of a transaction: a read that shows an incorrect amount is far worse than a brief period of unavailability. The last important piece is reliability, which here means idempotency at every layer of the system: client to backend, backend to Stripe, and webhook processing.

As for the scale at which this design makes sense: roughly half a million to 1 million webhooks processed daily, and up to about 100 requests per second to Stripe. That is Stripe's default API rate limit in live mode; beyond it Stripe returns rate-limit errors and you need to retry the request later. For end-to-end API latency we should stay under about 5 seconds, and for internal service calls we should aim for less than 100 milliseconds. If you are well beyond these numbers, then it usually makes sense to go with your own payment processor.
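
To put the webhook estimate next to the rate limit, here is a small back-of-the-envelope calculation. It only converts the daily volumes quoted above into an average per-second rate; real traffic is bursty, so peaks will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rate_per_second(events_per_day: int) -> float:
    """Average events per second, assuming traffic is spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    for daily in (500_000, 1_000_000):
        rate = average_rate_per_second(daily)
        print(f"{daily:,} webhooks/day is about {rate:.1f} per second on average")
```

So even the upper estimate averages out to roughly a dozen events per second, comfortably below the 100 requests per second mentioned above.
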
Beyond those numbers you are at the scale of the very large companies. For anyone from startups to large enterprises, these are the numbers at which it makes sense to go with Stripe or some other PSP.

Let's first understand how money actually moves through the card network. There are usually six entities in every card transaction: the cardholder; the merchant, which is the business or individual selling the product or service; the PSP, which is Stripe or whichever provider you integrate; the acquiring bank, which is the merchant's bank; the card network, such as Visa or MasterCard; and the issuing bank, which is the cardholder's bank. Whenever a user makes a payment, it flows through three phases, and each has different timing and failure modes. Let's see what these are.

In the first phase, the cardholder submits the payment through Stripe Elements. These are technically iframes hosted by Stripe. Even if they look like part of your website, they are not. You load them through a JavaScript library called Stripe.js, and you wire Stripe Elements into your front end using your publishable key and the payment's client secret. With plain JavaScript you just include the script; with React you install the package, and so on.

Coming back to the first phase: when the merchant's backend receives the payment request, it creates a payment intent with the PSP. A payment intent is a stateful object that Stripe uses to track a single payment from start to finish, so you can think of it as the source of truth for the entire transaction on Stripe's side. Once the payment intent is created, Stripe requests authorization from the card network, and the card network validates the card and the funds with the issuing bank.
If this is approved, meaning the card details are correct and the funds are available, the issuing bank places a hold on the funds. The card network then sends back the authorization code, and the payment intent is confirmed on Stripe's side. Once it is confirmed, we can show the cardholder that the payment was accepted. The request can also be declined, either because the card details are incorrect or because there are not enough funds. In that case we get a decline code, the payment intent fails, and that's what we show to the customer.

Assuming the payment is confirmed, we move to phase two, which is capture. This is where the merchant tells the acquiring bank to finalize the authorized amount. Capture can be immediate or delayed. For digital goods or subscriptions, capture can happen immediately. Stripe's default is automatic capture, and if you want to change that, you need to set it explicitly in your code when creating the payment intent. For physical products, like an Amazon order, or for hotels, like a booking on booking.com, the capture is delayed until fulfillment time, so we capture on fulfillment. In both cases, the capture is confirmed at some point. The important part is that our system should have two states for this: authorized and captured. If we don't capture immediately, we first set the payment to authorized, and then set it to captured once the capture goes through.

Then there is the third phase, which is clearing and settlement. Each business day, captured transactions are batched and sent to the card network for clearing. This is where the network calculates the fees, and it's when the funds are actually transferred from the issuing bank.
Once that's confirmed, the card network transfers the funds minus the assessment fees it takes. Stripe also takes its fee and sends you a confirmation that the funds have settled.

Now that we understand the payment life cycle, let's see what the high-level architecture looks like for integrating a PSP into a payment backend. The design has two blocks: a synchronous path and an asynchronous path. If you process the payment asynchronously, and especially if you capture it later, you don't have to keep the user's UI waiting.

We start with the front-end app. This is where we integrate the Stripe Elements library, either as the plain JavaScript library or as a package installed into Angular, React, or whatever you use for the front end. The very first step is that the payment is requested from the front end to the backend, where a payment API service handles it. It receives payment requests from the checkout and validates the input. The first important point is that it checks the idempotency key sent from the client. We check the key against the database, create or retrieve the payment record, and after that return an immediate response to the client. Through the Stripe integration layer we create the payment intent for this transaction and return the client secret to the front end, which uses Stripe.js to complete the payment, including any 3D Secure challenges. So at this layer we just create the payment record, dispatch the work, and return, so the service is free to process other payments.

The next step is where the heavy lifting is done by the payment service provider. If Stripe is the integration, this is where we make the API call to the Stripe API.
The Stripe integration layer abstracts all Stripe-specific API calls behind an interface. It handles Stripe's error types, maps them to internal error codes, attaches idempotency keys to every POST request, and manages timeouts. From this point Stripe handles all the complexity of contacting the card network and the bank, but that doesn't mean our job is done, because once the payment is confirmed, Stripe sends us a webhook event.

This is where the async path comes in, and it starts with the webhook receiver, which listens for webhooks coming from Stripe. It needs to do three important things. First, verify Stripe's signature using the Stripe-Signature header. Second, store the raw event exactly as we received it from Stripe. Third, respond within seconds, since Stripe has a timeout here; all business logic should happen asynchronously after that. If we made this synchronous and tried to process everything before returning the response, the request might time out, and Stripe would treat the delivery as failed and retry it. That's why we acknowledge immediately and process later by enqueueing the event into a message queue.

The message queue is the next step. It decouples webhook receipt from processing, and it needs to guarantee at-least-once delivery, something like Kafka. For our own state changes we combine it with the transactional outbox pattern: when a payment state change is committed to the database, an outbox record is written in the same transaction, and a separate relay process then publishes it to the queue.

On the other side, we need to dequeue and process the events. This is where we have the background job workers. These workers handle four categories of async work.
The first is webhook event processing: dequeuing events from the message queue and applying the state transitions. The second is retry workers, which exist solely to retry failed calls with exponential backoff. The third is reconciliation workers, which compare our internal records against Stripe's every day. The fourth is stuck payment detection, mainly for alerting on payments that sit in intermediate states.

For that we also need to maintain our own database and our own ledger. The payment database is the system of record. It stores the current state of every payment, and importantly this is where we store the idempotency keys. We'll get into this more when we discuss the database. Apart from that we also need the ledger, which provides double-entry bookkeeping: every money movement is recorded as a balanced pair of debit and credit entries. It should be append-only, meaning that if a mistake needs a correction, we are only allowed to insert a new entry and never update or delete an existing one.

Once we have the high-level design, we need to design the database. Even though we integrated a PSP that handles the payments for us, we still need to track payments, refunds, and idempotency keys ourselves. This is what lets us ensure that every payment is processed only once; if something goes wrong in the middle of a transaction, the idempotency keys are what prevent duplicates.

Let's start with the payments table, where we track the payments we are handling. Each payment has an ID, of course, and an idempotency key, which links to the idempotency keys table.
The other things we store are the amount of each payment, as an integer, and, if we support multiple currencies, the currency, so that we know whether the amount is in USD, euros, or something else. We also store the status of each payment. If you remember from the high-level diagram, every time we start a payment we create a payment intent in Stripe, so this is where we store the Stripe payment intent ID. We also store the ID of the customer who made the payment, the merchant ID, and the payment method type, which is Apple Pay, Google Pay, a direct card payment, or something else. Any extra metadata that we need now or decide to add in the future, and that shouldn't change the structure of the payments entity, goes into a JSON column. We also keep the created-at and updated-at timestamps and a version.

For indexing, we mainly index three columns so that we can query them efficiently: the payment intent ID, the customer ID, and the merchant ID. On top of that we add a partial index on the status that excludes the terminal states.

One of the important fields here is the idempotency key. It has a unique constraint that serves as the first line of defense against duplicate payments. For this we maintain a separate table named idempotency keys, where we store all the keys and link them back to the payments table. We may also need the customer ID, the request path, the request parameters, and some more data. The most important column is the recovery point, because it tracks progress through a multi-step operation: started, then payment created, then Stripe called, then finished. At every step we update the recovery point.
If the process crashes between steps for any reason, a retry with the same key resumes from the last committed recovery point rather than restarting the whole payment. The table also has timestamps, and one of the important ones is the locked-at field, which prevents concurrent processing of the same key.

Another entity we need is payment events. This is the immutable audit log: nothing is updated here, and we can only insert new rows. It connects to the payments table through the payment ID. Every state transition generates a new event in this table, so we need an event type. The source field distinguishes whether the transition was triggered by an API call, a webhook, or the reconciliation process; the reconciliation worker from the high-level design is one of the writers here. Since we cannot update rows, each event records both the previous status of the payment and the new status after the transition.

Since we also need to handle refunds, we need a refunds entity that links to the payments table by payment ID, so that we know which payment each row refunds. To process a refund, we again contact the payment service provider. In our case Stripe generates a refund ID, which we store to keep track of the refund. Refunds also have their own idempotency keys and their own status life cycle, separate from payments. So even if the payment is completed, here we track only the status of the refund, not of the overall payment. Finally, we need ledger entries, which again link back to the payments table by payment ID.
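
The tables described so far can be sketched as a schema. This is a minimal illustration using SQLite through Python's standard library, not a production schema: the column names, the choice of amounts in minor units such as cents, and the set of states treated as terminal in the partial index are assumptions made for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE idempotency_keys (
    key            TEXT PRIMARY KEY,               -- client-generated UUID
    customer_id    TEXT NOT NULL,
    request_path   TEXT NOT NULL,
    request_params TEXT NOT NULL,                  -- JSON of the original request
    recovery_point TEXT NOT NULL DEFAULT 'started',
    response_body  TEXT,                           -- cached response once finished
    locked_at      TEXT,                           -- set while a worker holds the key
    created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE payments (
    id                       TEXT PRIMARY KEY,
    idempotency_key          TEXT NOT NULL UNIQUE REFERENCES idempotency_keys(key),
    amount                   INTEGER NOT NULL,     -- assumed: minor units (cents)
    currency                 TEXT NOT NULL,
    status                   TEXT NOT NULL,
    stripe_payment_intent_id TEXT,
    customer_id              TEXT NOT NULL,
    merchant_id              TEXT NOT NULL,
    payment_method_type      TEXT,
    metadata                 TEXT,                 -- JSON blob for extra fields
    version                  INTEGER NOT NULL DEFAULT 1,
    created_at               TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at               TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_payments_intent   ON payments(stripe_payment_intent_id);
CREATE INDEX idx_payments_customer ON payments(customer_id);
CREATE INDEX idx_payments_merchant ON payments(merchant_id);
-- Partial index: only rows that are not in an (assumed) terminal state.
CREATE INDEX idx_payments_open ON payments(status)
    WHERE status NOT IN ('settled', 'failed', 'canceled', 'refunded');
CREATE TABLE payment_events (                      -- append-only audit log
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    payment_id      TEXT NOT NULL REFERENCES payments(id),
    event_type      TEXT NOT NULL,
    source          TEXT NOT NULL,                 -- 'api', 'webhook', 'reconciliation'
    previous_status TEXT,
    new_status      TEXT NOT NULL,
    created_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE refunds (
    id               TEXT PRIMARY KEY,
    payment_id       TEXT NOT NULL REFERENCES payments(id),
    stripe_refund_id TEXT,
    idempotency_key  TEXT NOT NULL UNIQUE,
    amount           INTEGER NOT NULL,
    status           TEXT NOT NULL
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all tables and indexes on the given connection."""
    conn.executescript(SCHEMA)
```

The unique constraint on `payments.idempotency_key` is what makes a second insert with the same key fail at the database level, regardless of what the application code does.
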
The ledger entries follow double-entry bookkeeping, meaning every money movement creates a debit/credit pair grouped by a transaction ID, with a constraint that debits and credits must balance.

Now, let's see how idempotency works, because this is one of the core parts of this type of system. The idempotency key should operate at all three layers. At the API layer, the client generates a UUID as the idempotency key and sends it with the payment request. On the server, we begin a transaction and attempt to insert this key into the idempotency keys table we saw in the schema. This is where we find out whether the key already exists. If it does, we check the recovery point. If the recovery point is finished, we return the cached response. If the key exists but the recovery point is not finished, we resume from that point. And if the key is new, meaning this is a new payment, we proceed with the payment flow from the high-level design. On completion, we update the recovery point to finished, store the response, and commit the transaction.

The recovery points are started, payment created, Stripe called, and finished. Finished is the last state, so if a crash happens at any point before it, the retry resumes from the last recovery point. Say we crashed right after payment created: the retry resumes from there, continues with Stripe called, and then goes to finished.

The idempotency key is also used at the Stripe layer: the same key is forwarded to Stripe in the Idempotency-Key HTTP header when we create the payment intent.
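
The server-side flow above can be sketched as follows. This is a simplified illustration under several assumptions: it uses SQLite from the standard library, a reduced two-table schema, and a hypothetical `call_stripe(key, amount)` callable standing in for the Stripe integration layer. It leaves out the locked-at handling for concurrent requests.

```python
import json
import sqlite3
from typing import Callable

SCHEMA = """
CREATE TABLE idempotency_keys (
    key TEXT PRIMARY KEY,
    recovery_point TEXT NOT NULL,   -- started, payment_created, stripe_called, finished
    response TEXT                   -- cached JSON response once finished
);
CREATE TABLE payments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    idempotency_key TEXT NOT NULL UNIQUE,
    amount INTEGER NOT NULL,
    status TEXT NOT NULL,
    stripe_payment_intent_id TEXT
);
"""


def handle_payment(conn: sqlite3.Connection, key: str, amount: int,
                   call_stripe: Callable[[str, int], str]) -> dict:
    """Process a payment request at most once per idempotency key."""
    # Claim the key; if it already exists, this insert is a no-op.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO idempotency_keys (key, recovery_point) "
            "VALUES (?, 'started')", (key,))
    point, cached = conn.execute(
        "SELECT recovery_point, response FROM idempotency_keys WHERE key = ?",
        (key,)).fetchone()

    if point == "finished":
        return json.loads(cached)  # replay: return the stored response

    if point == "started":
        with conn:  # the payment row and the recovery point commit together
            conn.execute(
                "INSERT INTO payments (idempotency_key, amount, status) "
                "VALUES (?, ?, 'created')", (key, amount))
            conn.execute(
                "UPDATE idempotency_keys SET recovery_point = 'payment_created' "
                "WHERE key = ?", (key,))
        point = "payment_created"

    if point == "payment_created":
        # External call happens outside any database transaction. The same
        # key is passed along so the provider can deduplicate a retried call.
        intent_id = call_stripe(key, amount)
        with conn:
            conn.execute(
                "UPDATE payments SET stripe_payment_intent_id = ?, status = 'pending' "
                "WHERE idempotency_key = ?", (intent_id, key))
            conn.execute(
                "UPDATE idempotency_keys SET recovery_point = 'stripe_called' "
                "WHERE key = ?", (key,))

    intent_id, status = conn.execute(
        "SELECT stripe_payment_intent_id, status FROM payments "
        "WHERE idempotency_key = ?", (key,)).fetchone()
    response = {"payment_intent_id": intent_id, "status": status}
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET recovery_point = 'finished', response = ? "
            "WHERE key = ?", (json.dumps(response), key))
    return response
```

Calling `handle_payment` twice with the same key triggers only one `call_stripe` invocation, and a crash inside `call_stripe` leaves the key at `payment_created`, so the retry does not insert a second payment row.
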
Stripe stores the key together with the result of the first request for at least 24 hours (the exact retention window depends on the API version, so check Stripe's documentation). If a request with the same key arrives again, Stripe returns the stored result instead of processing the payment a second time.

At the third and last layer, the webhook layer, Stripe may deliver the same event multiple times. For that we can have a table called, let's say, processed webhook events, and store each event's ID in it. Before processing, the handler checks whether the event ID already exists, and if it does, the event is skipped.

One of the hardest edge cases here is a crash between creating the idempotency record and completing the Stripe call. We can solve this with the atomic phases pattern: we break the operation down into phases separated by foreign state mutations, meaning calls that change state in an external system. Each phase commits its progress to the idempotency key's recovery point, and that's why the idempotency keys table has a recovery point column. If the process crashes in the middle, say at the Stripe called step before we recorded Stripe's response, the retry starts from that point. And since Stripe itself supports idempotency keys, retrying the Stripe call is safe: it returns the original result if the call had completed, or processes it if it hadn't.

One of the requirements was exactly-once processing, so let's look at one way to achieve it: optimistic locking combined with compare-and-swap on the status. For the optimistic locking part, every update uses the version column: the update only applies where the version equals the version we expect to see.
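
A version-checked update of this kind can be sketched like this. It is a minimal example assuming SQLite from the standard library and a hypothetical `payments` table with `id`, `status`, and `version` columns; the `WHERE` clause also checks the expected current status, which is the compare-and-swap part.

```python
import sqlite3


def try_transition(conn: sqlite3.Connection, payment_id: str,
                   expected_status: str, expected_version: int,
                   new_status: str) -> bool:
    """Apply a status change only if status and version are what we last read.

    Returns True if this caller won; False if the row was missing, in a
    different state, or already modified by someone else.
    """
    with conn:
        cursor = conn.execute(
            "UPDATE payments "
            "SET status = ?, version = version + 1 "
            "WHERE id = ? AND status = ? AND version = ?",
            (new_status, payment_id, expected_status, expected_version),
        )
    # rowcount is the number of rows the UPDATE actually changed.
    return cursor.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT, version INTEGER)")
    conn.execute("INSERT INTO payments VALUES ('p1', 'pending', 1)")
    # Two workers both read version 1 and race to authorize the payment.
    print(try_transition(conn, "p1", "pending", 1, "authorized"))
    print(try_transition(conn, "p1", "pending", 1, "authorized"))
```

Only the first call changes the row; the second one matches zero rows because both the status and the version have moved on.
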
If another process updates the row first, zero rows are affected by our update, and the caller that issued the query knows it lost the race. On top of that we use compare-and-swap on the status, which combines concurrency control with state machine validation in a single operation. With it we atomically verify three things: that the payment exists, that it is in the correct source state for this transition, and that no concurrent update has occurred. If the number of rows affected is zero, the transition is rejected.

In the payment life cycle, we define the valid transitions from each state to the others. This is just an example, assuming we have these states. A payment starts as created, when it is created for the first time. Then it moves to pending, then authorized, then captured, then settled. At that point the payment is processed, but in case we also need to process a refund, we have refund pending, partially refunded, and refunded. There are other states as well, like disputed, and if the payment fails or is abandoned at any step, we have failed and canceled. For each state we need to map the valid transitions. For instance, we can't jump directly from created to captured; that shouldn't be allowed. From created, we can only go to pending, failed, or canceled. It is very important to enforce this at every layer, so our application code keeps a valid-transitions dictionary that maps each state to its allowed targets, and it is checked before any transition attempt.

Let's also cover the availability requirement. Say we promised five nines of availability, which leaves roughly 5 minutes of allowable downtime per year.
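
The downtime budget behind that figure is simple arithmetic. The helper below is only an illustration; it assumes a year of 365.25 days.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    For example, nines=5 means 99.999% availability, so 1 part in 10**5
    of the year may be spent down.
    """
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for nines in (3, 4, 5):
        minutes = downtime_minutes_per_year(nines)
        print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over 5 minutes per year, while four nines allows a little under 53 minutes.
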
To achieve this on our end, the first thing we need is more than one availability zone, with the payment API and the Stripe integration layer running in each zone. We put a load balancer in front of the availability zones to distribute requests between the payment APIs. If we host this on AWS, say on ECS or EC2, Amazon on its side gives us roughly four nines of availability, so to reach our target we need multiple availability zones to raise the availability of the overall system. And since the payment API service is stateless, any server can handle any request. So we can run multiple copies of it behind another load balancer inside each availability zone, and if one server crashes, the others keep going.

Since Stripe is an external service that we don't control, we put a circuit breaker in between. Say a Stripe request takes more than 40 seconds instead of the normal one to two seconds. In that case the API threads calling the Stripe API start piling up, and every new request from the client waits in the queue. The circuit breaker sits in between to detect that Stripe calls are failing or slow. After hitting the threshold, it trips and stops making requests to Stripe at all, so that we return an error immediately instead of waiting, which keeps the threads free.

On the async side, we first have the webhook receiver. This can even be a Lambda function, and it runs as a completely separate service from the payment API. It adds all webhook events to the message queue, so even if something crashes in the background workers, events just keep piling up in the queue and get processed once the workers come back. Nothing is lost.
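
The receiver's job (verify the signature, deduplicate by event ID, store the raw event, enqueue, and acknowledge quickly) can be sketched as below. This is a hand-rolled illustration using only the standard library. It follows Stripe's documented signature scheme, where the `Stripe-Signature` header carries `t=<timestamp>` and one or more `v1=<signature>` values and the signed payload is the timestamp, a dot, and the raw body, signed with HMAC-SHA256. In a real integration you would more likely call `stripe.Webhook.construct_event` from Stripe's Python library. The class name, the in-memory dictionary standing in for the processed events table, the in-process queue, and the 5-minute tolerance are all assumptions.

```python
import hashlib
import hmac
import json
import queue
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # assumed replay window


def verify_signature(payload: bytes, header: str, secret: str,
                     now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature style header against the raw request body."""
    timestamp, signatures = None, []
    for item in header.split(","):
        name, _, value = item.strip().partition("=")
        if name == "t":
            timestamp = value
        elif name == "v1":
            signatures.append(value)
    if timestamp is None or not timestamp.isdigit() or not signatures:
        return False
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected, sig) for sig in signatures)


class WebhookReceiver:
    """Verifies, stores, and enqueues events; business logic runs elsewhere."""

    def __init__(self, secret: str, event_queue: queue.Queue) -> None:
        self.secret = secret
        self.event_queue = event_queue
        # Stand-in for a processed-webhook-events table keyed by event ID.
        self.raw_events = {}

    def handle(self, payload: bytes, signature_header: str) -> int:
        """Return the HTTP status code to send back to the provider."""
        if not verify_signature(payload, signature_header, self.secret):
            return 400
        event_id = json.loads(payload)["id"]
        if event_id in self.raw_events:
            return 200  # duplicate delivery: acknowledge and skip
        self.raw_events[event_id] = payload  # keep the raw event as received
        self.event_queue.put(event_id)       # workers process it asynchronously
        return 200
```

The handler does no payment logic at all, which is what keeps its response time short; everything else happens in the workers that read from the queue.
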
For the database, we can use leader-follower replication: one primary database plus replicas that stay in sync. If one of the replicas goes down, we read from the others. And if the primary goes down, a replica can take over automatically. We use synchronous replication so that no data is lost during the switch.
