This limit of 30 is only a default and can be changed through a configuration parameter called
"STORAGE_BOX_CONCURRENT_FETCH_LIMIT".
In our case, up to 30 workers can run simultaneously without needing to make another roundtrip to fetch new codes.
However, this can be optimized further! If every worker walks the fetched list in the same order, the unluckiest of those 30 workers has to make 29 failed deletes before the 30th one finally works. That is still 29 delete roundtrips. StorageBox addresses this by randomly shuffling the 30 codes it fetched before starting the deletion loop, so workers tend to try different codes first.
Our updated algorithm now looks something like this.
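As a rough illustration of the fetch, shuffle, and conditional-delete steps, here is a minimal sketch. It is not StorageBox's actual code: `InMemoryItemBank`, `claim_code`, and `FETCH_LIMIT` are hypothetical names, and a lock-protected set stands in for the DynamoDB table and its conditional delete.

```python
import random
import threading

FETCH_LIMIT = 30  # mirrors the default fetch limit described above


class InMemoryItemBank:
    """Hypothetical stand-in for the Item Bank table (not StorageBox's API)."""

    def __init__(self, codes):
        self._codes = set(codes)
        self._lock = threading.Lock()

    def fetch(self, limit):
        # Read up to `limit` codes without removing them.
        with self._lock:
            return list(self._codes)[:limit]

    def delete_if_exists(self, code):
        # Emulates a conditional delete: it succeeds for exactly one caller.
        with self._lock:
            if code in self._codes:
                self._codes.remove(code)
                return True
            return False

    def remaining(self):
        with self._lock:
            return len(self._codes)


def claim_code(bank, limit=FETCH_LIMIT):
    """Fetch a batch, shuffle it, and claim the first code we manage to delete."""
    while True:
        batch = bank.fetch(limit)
        if not batch:
            return None  # the bank is empty
        random.shuffle(batch)  # spread workers across different codes
        for code in batch:
            if bank.delete_if_exists(code):
                return code
        # Every code in this batch was taken by other workers; fetch again.
```

A worker only pays an extra fetch roundtrip when every code in its batch has been claimed by someone else in the meantime.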
At this point we can be confident that no two workers will end up with the same code. However, the same customer can still get 2 different voucher codes. To solve this problem, we need to update our algorithm to reliably implement deduplication.
How Deduplication Works
To reliably implement deduplication, we first need to identify what can serve as a deduplication ID. In plain English: if a customer sends 2 requests, which attribute of the customer tells us that both requests belong to the same customer and should therefore get the same voucher code? In our case, the deduplication ID is the email address the customer used while subscribing.
StorageBox takes this value from you and uses it against a second DynamoDB table, referred to as the Deduplication Table.
StorageBox tries to add a record to the Deduplication Table, with the deduplication ID as the primary (partition) key and the voucher code stored as the "item_string". The request carries another DynamoDB condition expression that amounts to, "Add this record IF no other record with the same deduplication_id exists".
If the addition succeeds, this customer had no previous deduplication entry, so the code is returned immediately. If the addition fails, StorageBox returns the voucher code to the Item Bank and then sends another request to read the voucher code already assigned to this customer from the Deduplication Table.
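To make the conditional write concrete, here is a small sketch of the request parameters such a put could use. The attribute names `deduplication_id` and `item_string` come from the description above; the helper name `build_dedup_put` is hypothetical, and this is not necessarily how StorageBox builds its request.

```python
def build_dedup_put(deduplication_id, item_string):
    """Build keyword arguments for a conditional put into the Deduplication Table."""
    return {
        "Item": {
            "deduplication_id": deduplication_id,
            "item_string": item_string,
        },
        # Reject the write if a record with this partition key already exists.
        "ConditionExpression": "attribute_not_exists(deduplication_id)",
    }
```

With boto3, these could be passed along as `table.put_item(**params)`, where `table` is a DynamoDB `Table` resource. A rejected write surfaces as a `ClientError` whose error code is `ConditionalCheckFailedException`, which is the signal to return the code and read the existing entry instead.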
After applying these final changes, our algorithm looks like this
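The complete flow can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not StorageBox's implementation: `InMemoryTables` and `assign_code` are hypothetical names, and a lock-protected set and dict stand in for the two DynamoDB tables and their conditional writes. Falling back to the existing entry when the bank is empty is a choice made for this sketch.

```python
import random
import threading


class InMemoryTables:
    """Hypothetical stand-in for the Item Bank and Deduplication tables."""

    def __init__(self, codes):
        self.item_bank = set(codes)
        self.dedup = {}  # deduplication_id -> item_string
        self._lock = threading.Lock()

    def fetch(self, limit):
        with self._lock:
            return list(self.item_bank)[:limit]

    def delete_if_exists(self, code):
        # Emulates the conditional delete on the Item Bank.
        with self._lock:
            if code in self.item_bank:
                self.item_bank.remove(code)
                return True
            return False

    def put_if_absent(self, dedup_id, code):
        # Emulates "add this record IF no record with this deduplication_id exists".
        with self._lock:
            if dedup_id in self.dedup:
                return False
            self.dedup[dedup_id] = code
            return True

    def give_back(self, code):
        with self._lock:
            self.item_bank.add(code)

    def get_assigned(self, dedup_id):
        with self._lock:
            return self.dedup.get(dedup_id)


def assign_code(tables, dedup_id, limit=30):
    """Claim a code, then keep it only if this customer has no entry yet."""
    while True:
        batch = tables.fetch(limit)
        if not batch:
            # Sketch's choice: with an empty bank, fall back to any existing entry.
            return tables.get_assigned(dedup_id)
        random.shuffle(batch)
        for code in batch:
            if not tables.delete_if_exists(code):
                continue  # another worker claimed this code first
            if tables.put_if_absent(dedup_id, code):
                return code  # first request for this customer
            tables.give_back(code)  # duplicate request: return the code
            return tables.get_assigned(dedup_id)
```

Calling `assign_code(tables, "customer@example.com")` twice returns the same code both times, and the second call leaves the Item Bank exactly as it found it.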
Final Review
Multiple Customers Subscribing Simultaneously
If multiple customers are subscribing simultaneously, then each one of the workers fetching voucher codes for them would only be able to delete one unique code and move forward with the rest of the steps. Therefore, no 2 customers would get the same voucher code.
Same Customer Subscribing From Many Computers Simultaneously
If one customer is subscribing multiple times from many computers, then each one of the workers fetching voucher codes for a request would acquire one unique code and move forward with the rest of the steps.
Notice that each of these unique codes has been removed from the Item Bank at this point.
Afterwards, only one deduplication entry would be successfully added for one of those unique codes and returned directly to the customer.
All other workers would fail to create their own deduplication entries and would thus return their unique codes back to the Item Bank. Afterwards, they will fetch the existing deduplication entry to get the voucher code assigned by the first worker and return it to the customer.
BONUS: Both Problems Combined
Let's imagine you have 100 subscriptions coming in. You think they come from 100 unique customers, but they actually come from 20 customers making 5 requests each, and all 100 requests arrive at the same time. The short answer is that StorageBox will handle this for you too and only 20 codes will be assigned. The details of the process are left as an exercise for you :)
Other Considerations and Q&As
What else can I use StorageBox for?
The example given here assumed a movie store giving out voucher codes, but StorageBox can be used for many other use cases, even when the item being deduplicated has no intrinsic value.
For example, you may have a use case where you only want the first 1000 people who click a link to get an offer. However, 10,000 clicks on the link come in at the same time, and only 6000 of them are from unique users. Using StorageBox, you can populate the Item Bank table with the numbers from 1 to 1000.
Afterwards, whenever a user clicks the link, you ask StorageBox to assign one of those numbers and use the User ID as the deduplication ID. Since you only put 1000 items into the Item Bank, only 1000 users would get one of these numbers and succeed. Notice that the number itself is meaningless, but being able to fetch one of those meaningless numbers implies that you're one of the 1000 users who get the offer!
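The outcome of this scenario can be modelled with a few lines of plain Python. This sketch only models the end result sequentially, not the concurrent protocol or any DynamoDB calls, and `run_offer` is a hypothetical helper rather than part of StorageBox.

```python
def run_offer(clicks, offer_slots=1000):
    """Return a mapping of user ID -> assigned number for the users who got a slot."""
    # Items are stored as strings, since StorageBox only accepts strings.
    item_bank = {str(n) for n in range(1, offer_slots + 1)}
    dedup = {}  # user ID (deduplication ID) -> assigned number
    for user_id in clicks:
        if user_id in dedup:
            continue  # repeat click: the user keeps the number they already have
        if not item_bank:
            continue  # all slots are gone: this user misses the offer
        dedup[user_id] = item_bank.pop()
    return dedup
```

However the 10,000 clicks are distributed across the 6000 users, at most 1000 distinct users end up with a number, and no number is handed out twice.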
Another example can be a professor who has 150 students. They want the students to book one of their 3 classes with each class hosting 50 students.
The professor can create 3 Item Bank tables and populate the first one with numbers from 0 to 49, the second one from 50 to 99, and the third one from 100 to 149.
The professor can then create one deduplication table and use StorageBox to fetch a number from any one of the 3 tables (depending on the student's request) and add a deduplication entry for it.
The possibilities are endless!
I am building X. Is StorageBox the most optimized solution for me?
Not necessarily; it depends on what X is. StorageBox tries to be as general and abstract as possible. It might be the best solution for your app, or there might be further optimizations you can make that only work for your app.
Why DynamoDB, can I use something else?
StorageBox aims to be as lightweight as possible, with users of serverless technology being the main intended audience (but not the only audience, of course). Most serverless apps use DynamoDB, so it made the most sense.
If you'd like to use something else, you can! StorageBox currently has 2 abstract classes, one for the ItemBank and the other for DeduplicationRepo. As long as you implement these 2 classes and make the same guarantees, you're good to go!
How Should I Configure DynamoDB?
StorageBox leaves this up to you! For simplicity's sake, you can use On-Demand throughput. If you can predict sustained load ahead of time, you can use provisioned throughput to decrease your costs.
Just bear in mind that all the conditional writes StorageBox issues count against your table's write throughput, and DynamoDB charges write capacity for a conditional write even when its condition fails.
Can I Use Any Datatype as Items and/or Deduplication IDs?
No, only strings are allowed. If you'd like to use other datatypes, be sure you have a way to serialize them into strings and deserialize them back.
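For instance, structured items can be round-tripped through JSON. This is a minimal sketch using only the standard library; `to_item_string` and `from_item_string` are hypothetical helper names, not part of StorageBox.

```python
import json


def to_item_string(value):
    """Serialize a JSON-compatible value into a string item."""
    # sort_keys keeps the output stable, so equal dicts give equal strings.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def from_item_string(text):
    """Deserialize a string item back into the original value."""
    return json.loads(text)
```

A stable serialization matters most for deduplication IDs, since two different strings for the same logical ID would be treated as two different customers.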
Is There An Optimized Way To Add My Items To The ItemBank?
Yes, the item bank repo defines a method called batch_add_items that you can use to add items to your ItemBank table.
If you're interested in knowing how it works, here it is.
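The library's own implementation is not reproduced here. As a rough sketch of the general idea behind batched inserts, assuming they are built on DynamoDB's batch writes (which accept at most 25 put or delete requests per call), the items first have to be split into chunks. `chunked` is a hypothetical helper, not a StorageBox function.

```python
def chunked(items, size=25):
    """Yield lists of at most `size` items, preserving order."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # the final, possibly shorter, chunk
```

If you write to DynamoDB directly with boto3, the `Table.batch_writer()` context manager does this buffering for you and resends unprocessed items.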
Installation
StorageBox can be installed from PyPI by typing
Conclusion
If you're looking to get started quickly, StorageBox can take away the burden of solving all of these different race conditions yourself. The tradeoff, however, is a larger DynamoDB bill.
Depending on your case, you might still find it cheaper than the alternative or you might find better, more specialized ways to cut costs for your specific project.
Keep Exploring :)