Realtime Distributed Deduplication: How storagebox Works

February 26, 2021

StorageBox is a Python module that allows you to put items inside a storagebox (commonly referred to as the "Item Bank") and take items outside of the "Item Bank". As is the case with any physical storage box, the number of items you put in, is exactly the number of items you can take out. Nothing is lost and/or duplicated due to multiple distributed nodes doing simultaneous read/writes.

StorageBox is ideal for projects that need to be developed really quickly without needing to worry about all potential race conditions. From a running cost prespective, It's also ideal for projects that receive a burst of requests at a specific point in time and then things start calming down.

At the current state, StorageBox works with Amazon's DynamoDB but the same algorithm can be used with other databases!

How does it work behind the scenes?

Think of a Digital Box

Let's start simple and imagine your run a movie store! You have 3 voucher codes you'd like to give to the first 3 customers that subscribe to your email newsletter. The customer subscribes, the voucher code shows up on screen!

You create a database table containing the 3 voucher codes.

Once a customer subscribes to your newsletter, the webserver

Pulls a voucher code from the table.
Checks if the customer already had received a voucher code before.
Deletes it from the original table.
Assigns it to that customer to enable Step 2 checks in the future.
Returns it to the customer.

If we draw a time diagram for this flow, this is what it would look like

Things keep going well as long as your customers are honest human beings (or maybe movie-loving robots from the future o.O) who subscribe to your newsletter at clearly well separated time intervals.

However, a number of problems can happen if

More than one customer is subscribing simultaneously.
A customer subscribes simultaneously from multiple computers using the same email address.

An example list of problems that can happen.

Many Customers subscribing at the exact same instance can cause different webserver workers to pull the exact same voucher code and assign it to all customers before the other workers have a chance to delete it.
A customer subscribing 2 times with the same email address can be lucky and have the first request read a voucher code from the database and pass checks. Afterwards, the second request arrives JUST IN TIME after the first request's voucher code has been deleted from the table but not yet assigned to that customer. This would allow that second request to pass checks as well and return another voucher code to that customer.

Here is a time diagram for Problem 1

Also, here is a time diagram for Problem 2

What StorageBox does is that it solves these 2 problems for you.

Ensuring Isolation

To solve the first problem, where multiple customers registering at the same time might have the same voucher code, we need a web worker to be sure that once it has a voucher code that it's working with, there are no other workers considering the same voucher code.

To make sure that's the case, StorageBox assumes that if a worker is allowed to delete a voucher code, then that worker has the right to consider it in future steps. It does this by using DynamoDB's optimistic locking techniques. Therefore in this case, the statement is "Delete this voucher code IF it exists".

We alter our algorithm to be

A newly introduced problem here is the extra roundtrip the worker needs to make to the database to fetch a new code if it was unable to delete the first one. StorageBox mitigates this problem by asking the database to return 30 codes simultaneously. It then starts a for loop where it tries to delete the first one. If it succeeds, then good, it breaks outside the for loop. Otherwise, it tries to delete the second one, and so on!

If it was unable to delete any one of these 30 codes, then it fetches another 30 and tries again!

This 30 limit is a default one and can be configured by altering a configuration parameter called

"STORAGE_BOX_CONCURRENT_FETCH_LIMIT".

In our case, up to 30 workers can run simultaneously without needing to make another roundtrip to fetch new codes.

However, this can also be further optimized! Another problem this solution introduces is that one of those 30 workers would have to make 29 failed deletes before the 30th one finally works. These are still 29 delete roundtrips. StorageBox solves this problem by randomly shuffling the 30 codes it fetched before starting the deletion for loop.

Our updated algorithm now looks something like this.

At this point we can be confident that no 2 workers will be considering the same code simultaneously. However, the same customer can still get 2 different voucher codes. To solve this problem, we need to update our algorithm to reliably implement deduplication.

How Deduplication Works

To reliably implement deduplication, we first need to identify what is useful as a deduplication ID (English Version: If a customer sends 2 requests, which attribute of the customer can we use to know that these 2 requests belong to the same customer, and should therefore get the same voucher code). In our case, the deduplication ID is the customer email that they used while subscribing.

StorageBox takes in this value from you and it then goes to another DynamoDB table. That table is referred to as the Deduplication Table.

StorageBox tries to add a record to the Deduplication Table with the deduplication ID being the Primary/Partition key and the voucher code being referred to as the "item_string". It also employs another DynamoDB conditional expression for that request that sums up to, "Add this record IF no other record with the same deduplication_id exists".

If the addition succeeds, then that implies that this customer has no previous deduplication entry so the code is returned immediately. If the addition fails, then StorageBox would return the voucher code back to the Item Bank. StorageBox would then send another request to get the currently assigned voucher code from the deduplication table.

After applying these final changes, our algorithm looks like this

Final Review

Multiple Customers Subscribing Simultaneously

If multiple customers are subscribing simultaneously, then each one of the workers fetching voucher codes for them would only be able to delete one unique code and move forward with the rest of the steps. Therefore, no 2 customers would get the same voucher code.

Same Customer Subscribing From Many Computers Simultaneously

If one customer is subscribing multiple times from many computers, then each one of the workers fetching voucher codes for a request would acquire one unique code and move forward with the rest of the steps.

Notice that each of these unique codes has been removed from the Item Bank at this point.

Afterwards, only one deduplication entry would be successfully added for one of those unique codes and returned directly to the customer.

All other workers would fail to create their own deduplication entries and would thus return their unique codes back to the Item Bank. Afterwards, they will fetch the existing deduplication entry to get the voucher code assigned by the first worker and return it to the customer.

BONUS: Both Problems Combined

Let's imagine you have 100 subscriptions coming in. You think that they come from 100 unique customers but it's actually coming from 20 customers, each one of them is doing 5 requests, and all 100 requests come in at the same time. Short answer is that StorageBox will handle this for you too and only 20 codes will be assigned. The details of the process will be left as an exercise to you :)

Other Considerations and Q&As

What else can I use StorageBox for?

The example given here assumed a movie store giving out voucher codes but, it can be used for many other use cases, even when the item being deduplicated has no intrinsic value.

For example, you may have a use case where you only want the first 1000 people to click a link to get an offer. However 10,000 clicks on the link come in at the same time, 6000 of them are from unique users. Using StorageBox, you can populate the Item Bank table with numbers from 1 to 1000.

Afterwards, whenever a user clicks the link, you ask StorageBox to assign one of those numbers and use the User ID as a deduplication ID. Since you only put in 1000 items into the Item Bank, only 1000 requests would get one of these numbers and work successfully. Notice that the number itself is meaningless, but being able to fetch one of those meaningless numbers implies that you're one of the 1000 users who get the offer!

Another example can be a professor who has 150 students. They want the students to book one of their 3 classes with each class hosting 50 students.

The professor can create 3 Item Bank tables and populate the first one with numbers from 0 to 49, the second one from 50 to 99, and the third one from 100 to 149.

The professor can then create one deduplication table and use StorageBox to fetch a number from anyone of the 3 tables (depending on the student's request) and add a deduplication entry for it.

The possibilities are endless!

I am building X, Is StorageBox the most optimized solution for me?

No, it depends on what X is. StorageBox tries to be as general case and abstract as possible. It might be the best solution for your app or there might be further optimizations that you can make that only work with your app.

Why DynamoDB, can I use something else?

StorageBox aims to be as lightweight as possible with serverless technology users being the main intended audience (but not the only audience off-course). Most serverless apps use DynamoDB so it made the most sense.

If you'd like to use something else, you can! StorageBox currently has 2 abstract classes, one for the ItemBank and the other for DeduplicationRepo. As long as you implement these 2 classes and make the same guarantees, you're good to go!

How Should I Configure DynamoDB?

StorageBox leaves this up to you! For simplicity's sake, you can use On-Demand throughput. If you can predict sustained load ahead of time, you can use provisioned throughput to decrease your costs.

Just bear in mind that all those ConditionalExpressions StorageBox uses count as Strongly Consistent writes from DynamoDB's throughput.

Can I Use Any Datatype as Items and/or Deduplication IDs?

No, only strings are allowed. If you'd like to use other datatypes, be sure you have a way to serialize them into strings and deserialize them back.

Is There An Optimized Way To Add My Items To The ItemBank?

Yes, the item bank repo defines a method called batch_add_items that you can use to add items to your ItemBank table.

If you're interested in knowing how it works, here it is.

Installation

StorageBox can be installed from PyPI by typing

pip install storagebox

Here is a link to the PyPI project. Shameless plug: If you're looking to learn how to implement your own serverless private PyPI repository, go ahead and read Building A Completely Serverless PyPI Repository On AWS.

Conclusion

If you're looking to get started quickly, StorageBox can remove the burden of how to solve all of the different race conditions. The tradeoff however is a larger DynamoDB bill.

Depending on your case, you might still find it cheaper than the alternative or you might find better, more specialized ways to cut costs for your specific project.

Keep Exploring :)

Comments

mohamed maged26 February 2021 at 21:21
This comment has been removed by the author.
ReplyDelete
Replies
mohamed maged26 February 2021 at 21:24
Awesome thanks peter for writing this It's insightful
ReplyDelete
Replies

Add comment

Search This Blog

Peter's Blog

Realtime Distributed Deduplication: How storagebox Works

Think of a Digital Box

Ensuring Isolation

How Deduplication Works

Final Review

Multiple Customers Subscribing Simultaneously

Same Customer Subscribing From Many Computers Simultaneously

BONUS: Both Problems Combined

Other Considerations and Q&As

What else can I use StorageBox for?

I am building X, Is StorageBox the most optimized solution for me?

Why DynamoDB, can I use something else?

How Should I Configure DynamoDB?

Can I Use Any Datatype as Items and/or Deduplication IDs?

Is There An Optimized Way To Add My Items To The ItemBank?

Installation

Conclusion

Comments

Post a Comment

Featured Posts

Using Squid, Apache and Python To Develop a Captive Portal Solution

Binding Class Attributes In Python (Part 2)

Hello World!!

Building A Completely Serverless PyPI Repository On AWS

Binding Class Attributes In Python

Bypassing VOIP ISP block in Asterisk