Realtime Distributed Deduplication: How storagebox Works
StorageBox is a Python module that lets you put items into a storage box (commonly referred to as the "Item Bank") and take items back out of it. As with any physical storage box, the number of items you put in is exactly the number of items you can take out. Nothing is lost or duplicated when multiple distributed nodes read and write simultaneously.
StorageBox is ideal for projects that need to be developed quickly without worrying about every potential race condition. From a running-cost perspective, it's also ideal for projects that receive a burst of requests at a specific point in time, after which traffic calms down.
Currently, StorageBox works with Amazon's DynamoDB, but the same algorithm can be used with other databases!
How does it work behind the scenes?
Think of a Digital Box
Let's start simple and imagine you run a movie store! You have 3 voucher codes you'd like to give to the first 3 customers who subscribe to your email newsletter. The customer subscribes, and the voucher code shows up on screen!
You create a database table containing the 3 voucher codes.
Once a customer subscribes to your newsletter, the webserver:
1. Pulls a voucher code from the table.
2. Checks whether the customer has already received a voucher code.
3. Deletes the voucher code from the original table.
4. Assigns it to that customer, so that the Step 2 check works in the future.
5. Returns it to the customer.
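The steps above can be sketched in a few lines of Python. This is only an illustration of the naive flow, not StorageBox code: the list `voucher_table`, the dict `assigned`, and the function `subscribe` are hypothetical in-memory stand-ins for the database tables and the webserver handler, and handing a repeat subscriber their existing code is an assumption.

```python
# Hypothetical in-memory stand-ins for the two database tables.
voucher_table = ["CODE-A", "CODE-B", "CODE-C"]
assigned = {}  # email address -> voucher code already handed out


def subscribe(email):
    """Naive voucher hand-out following the steps above, with no locking."""
    # Step 1: pull a voucher code from the table (None if the table is empty).
    code = voucher_table[0] if voucher_table else None
    # Step 2: check whether this customer already received a code.
    # Assumption: a repeat subscriber simply gets the same code back.
    if email in assigned:
        return assigned[email]
    if code is None:
        return None
    # Step 3: delete the code from the original table.
    voucher_table.remove(code)
    # Step 4: record the assignment so future Step 2 checks can find it.
    assigned[email] = code
    # Step 5: return the code to the customer.
    return code


print(subscribe("ann@example.com"))  # CODE-A
print(subscribe("ann@example.com"))  # CODE-A again, nothing new is consumed
print(subscribe("bob@example.com"))  # CODE-B
```

Run one request at a time, this behaves correctly. The gaps between the steps are where concurrent requests can interfere with each other.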
Things keep going well as long as your customers are honest human beings (or maybe movie-loving robots from the future o.O) who subscribe to your newsletter at well-separated intervals.
However, a number of problems can happen if:
- More than one customer subscribes simultaneously.
- A customer subscribes simultaneously from multiple computers using the same email address.
Here is what can go wrong in each case:
- Many customers subscribing at the exact same instant can cause different webserver workers to pull the exact same voucher code and assign it to all of those customers before any worker has a chance to delete it.
- A customer subscribing twice with the same email address can get lucky: the first request reads a voucher code from the database and passes the checks. The second request then arrives JUST IN TIME, after the first request's voucher code has been deleted from the table but before it has been assigned to that customer. The second request passes the checks as well and returns another voucher code to the same customer.
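The first of these races can be reproduced deterministically by splitting the naive flow into a read phase and a write phase and interleaving two requests by hand. This is a minimal sketch: `read_voucher`, `finish`, and the in-memory `voucher_table` and `assigned` are hypothetical names used only for illustration, and real workers would interleave unpredictably instead of in this fixed order.

```python
# Hypothetical in-memory stand-ins for the two database tables.
voucher_table = ["CODE-A", "CODE-B", "CODE-C"]
assigned = {}  # email address -> voucher code


def read_voucher():
    """Read phase: look at the first available code without reserving it."""
    return voucher_table[0]


def finish(email, code):
    """Write phase: delete the code (if still present), assign it, return it."""
    if code in voucher_table:
        voucher_table.remove(code)
    assigned[email] = code
    return code


# Both workers complete their read before either one reaches the delete.
code_for_alice = read_voucher()
code_for_bob = read_voucher()
finish("alice@example.com", code_for_alice)
finish("bob@example.com", code_for_bob)

print(assigned)
# {'alice@example.com': 'CODE-A', 'bob@example.com': 'CODE-A'}
```

Two customers now hold the same voucher code, while only one code has left the table. Avoiding this requires the read and the delete to behave as a single indivisible step.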
How Deduplication Works
Multiple Customers Subscribing Simultaneously
Same Customer Subscribing From Many Computers Simultaneously
BONUS: Both Problems Combined
Other Considerations and Q&As
What else can I use StorageBox for?
I am building X, Is StorageBox the most optimized solution for me?
Why DynamoDB, can I use something else?
How Should I Configure DynamoDB?
Can I Use Any Datatype as Items and/or Deduplication IDs?
Is There An Optimized Way To Add My Items To The ItemBank?
pip install storagebox