Building A Completely Serverless PyPI Repository On AWS
So maybe you're a team focused on Python and you've been looking into the best ways to deliver your packages to other members of your team or even to production environments. You start having a thoughtful debate about the options and eventually settle on the most obvious tool for the job: pip.
But then reality hits you: in order to use pip, you need a PyPI repository to host your code, and you don't want all your team's hard work sitting on the public PyPI repository for obvious reasons. So you start looking into how to build a private PyPI repository and you stumble upon PEP 503 -- Simple Repository API. Huzzah! You've finally figured out how this thing works, but then it dawns on you: you'd have to get a machine, install web server software on it, go through all the trouble of setting it up with best practices, and pay for that machine even when you're not using it. There must be a better way to do this!
SPOILER ALERT: I think the picture at the top of this blogpost gave it away, but there is! However, go read PEP 503 first if you haven't done so yet!
Before we move on: this article discusses concepts and is not a step-by-step guide. You can search Google if you don't already know how to do a certain step, or just leave a comment below.
First things first, let's get your files in there.
Creating an S3 Bucket
As you might well know, S3, or Simple Storage Service, is a storage solution offered by AWS. S3 stores your files as objects inside containers called buckets. We need to create an S3 bucket.
This S3 bucket will be used to store all your package wheels, archive files, html files and so on. Here is how we're going to structure things.
.
├── index.html
├── package1
│   ├── index.html
│   └── package_1_versioned_wheel.whl
└── package2
    ├── index.html
    └── package_2_versioned_wheel.whl
Note: All the index.html files must be written in accordance with PEP 503.
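To make that requirement a bit more concrete, here is a minimal sketch of how a package page could be generated in line with PEP 503. Everything here (function names, file names) is purely illustrative; the parts the spec actually mandates are the name normalization and one anchor tag per downloadable file.

import re
from pathlib import Path

def normalize(name: str) -> str:
    # PEP 503 normalization: lowercase, runs of "-", "_" and "." become a single "-"
    return re.sub(r"[-_.]+", "-", name).lower()

def package_index_html(filenames) -> str:
    # A minimal PEP 503 package page: one anchor per downloadable file
    links = "\n".join(f'    <a href="{name}">{name}</a><br/>' for name in sorted(filenames))
    return f"<!DOCTYPE html>\n<html>\n  <body>\n{links}\n  </body>\n</html>\n"

# Hypothetical usage: write package1/index.html listing its wheel
package_dir = Path(normalize("package1"))
package_dir.mkdir(exist_ok=True)
(package_dir / "index.html").write_text(package_index_html(["package_1_versioned_wheel.whl"]))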
Your wheel and archive files can be generated by your favorite packaging tool, maybe setuptools or poetry?
You will also need to figure out a way to generate your HTML files and get them to S3. There are some cool offerings already out there, like s3pypi, which I personally used as a starting point. (Note: s3pypi is not compatible out of the box with everything this blogpost recommends, so read PEP 503 first, play around with s3pypi, and by the end of this article you will know what you need to change in there to get it running.)
Anyway, I will leave figuring out how to get the files into S3 up to you.
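That said, here is a rough boto3 sketch of what such an upload could look like. The bucket name and paths are placeholders, and setting the right Content-Type matters so that pip and browsers treat the index pages as HTML.

import boto3

s3 = boto3.client("s3")
BUCKET = "my-private-pypi"  # placeholder bucket name

def upload_package(package_dir, wheel_path):
    # Push the wheel under the package's prefix
    wheel_name = wheel_path.split("/")[-1]
    s3.upload_file(wheel_path, BUCKET, f"{package_dir}/{wheel_name}",
                   ExtraArgs={"ContentType": "application/octet-stream"})
    # Push the (re)generated index page; no-cache so new uploads show up quickly
    s3.upload_file(f"{package_dir}/index.html", BUCKET, f"{package_dir}/index.html",
                   ExtraArgs={"ContentType": "text/html", "CacheControl": "no-cache"})

upload_package("package1", "dist/package_1_versioned_wheel.whl")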
We also need to enable static website hosting on this bucket so that /some/path/ automatically gets translated to /some/path/index.html. This matters because pip will request a package's page without /index.html at the end of the path; the website endpoint of the S3 bucket handles that translation for us.
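If you prefer doing that from code rather than the console, a minimal boto3 sketch could look like this (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# Enable static website hosting so /some/path/ is served as /some/path/index.html
s3.put_bucket_website(
    Bucket="my-private-pypi",
    WebsiteConfiguration={"IndexDocument": {"Suffix": "index.html"}},
)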
Now you have a fully functional PyPI repository at the website endpoint, but are we done?
Not yet. The problem is that this website endpoint is public, so everyone can see your packages now, and you certainly don't want that. Moreover, the default website endpoint that you get when you enable static website hosting on S3 uses HTTP, and pip would much prefer HTTPS; otherwise it just keeps nagging over and over again. So how do we solve these two problems? One problem at a time: we'll start with the HTTPS one by creating a CloudFront distribution.
Creating a CloudFront Distribution
CloudFront is a Content Delivery Network (CDN) offered by AWS that lets you serve content from edge locations around the world. The origin server can be any normal HTTP server or an S3 bucket. We will use a CloudFront distribution to serve our S3 content over HTTPS.
CloudFront can serve S3 content in two ways:
- CloudFront can use the S3 REST API to get files from your S3 bucket.
- CloudFront can use the S3 static website endpoint that we enabled earlier like any other HTTP endpoint.
You might be wondering which one we'll be using, and the answer is both. To see why, let's look at the pros and cons of using the REST API for this purpose.
- Pro: We can use a CloudFront Origin Access Identity (OAI) so that files in the bucket can only be fetched through CloudFront via the S3 REST API. This means that no one can bypass the CDN and go straight to the origin.
- Con: When using the S3 REST API, requests to /some/path/ do not get translated automatically to /some/path/index.html because this is not a web server.
Now let's look at the pros and cons of using the S3 website endpoint for this purpose.
- Pro: Requests to /some/path/ will automatically get translated to /some/path/index.html.
- Con: The S3 website endpoint cannot be restricted to only allow requests from CloudFront, the way you can restrict the REST API with an Origin Access Identity (OAI).
So here's the plan that gets us the best of both worlds:
- We start by creating a CloudFront OAI.
- We create a CloudFront distribution.
- We add two cache behaviors to the CloudFront distribution.
- The first cache behavior serves the path pattern "*.*", which means the path must belong to a file (usually a wheel file, or a deliberate attempt to fetch an index.html file). This behavior connects to the S3 bucket via the S3 REST API and uses the OAI we created earlier.
- The second cache behavior serves everything else (this would be the default) and would serve all paths that do not have a dot [.] in them like /some/path/. This cache behavior would connect to the S3 bucket over the static website endpoint.
- We make all .html files inside the S3 bucket public; everything else allows access only to the CloudFront OAI we created earlier.
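A bucket policy along these lines could express that last point. The bucket name and OAI ID are placeholders, and you should double-check it against your own setup (for example, S3 Block Public Access settings can get in the way of the public statement):

import json
import boto3

BUCKET = "my-private-pypi"  # placeholder
OAI_ID = "E2EXAMPLEOAI"     # placeholder CloudFront OAI ID

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # index pages are public so the website endpoint can serve them
            "Sid": "PublicHtml",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*.html",
        },
        {   # everything else (wheels, archives) is only readable by the OAI
            "Sid": "OaiOnly",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity {OAI_ID}"},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))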
This means that all attempts to fetch files will go through the REST API, and all attempts to fetch paths (which eventually get translated to the same path with /index.html appended) will go through the website endpoint.
For example,
- If someone attempts to fetch a wheel file, CloudFront passes this through the REST API. If someone else attempts to fetch the wheel file by going directly to the S3 REST endpoint or the S3 website endpoint (that is assuming they know the name and region of the S3 bucket), they will get ACCESS DENIED because the file would only be accessible to the OAI's principal.
- If someone attempts to fetch an HTML file and types in .html explicitly, CloudFront passes this request through the REST API. However, the person could also fetch the file by going straight to the REST or static website endpoint, because we made the HTML files public earlier. That is assuming they know the name and region of the S3 bucket, of course.
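To tie this together, here is a heavily trimmed sketch of the pieces of a CloudFront distribution config that matter here. A real boto3 create_distribution call needs many more required fields (CallerReference, Enabled, TTLs, and so on), and all the names and IDs below are placeholders.

BUCKET = "my-private-pypi"
REGION = "eu-west-1"
OAI_ID = "E2EXAMPLEOAI"

distribution_config_fragment = {
    "Origins": {"Quantity": 2, "Items": [
        {   # Origin 1: the S3 REST endpoint, locked down with the OAI
            "Id": "s3-rest-origin",
            "DomainName": f"{BUCKET}.s3.{REGION}.amazonaws.com",
            "S3OriginConfig": {"OriginAccessIdentity": f"origin-access-identity/cloudfront/{OAI_ID}"},
        },
        {   # Origin 2: the S3 website endpoint, treated as a plain HTTP origin
            "Id": "s3-website-origin",
            "DomainName": f"{BUCKET}.s3-website.{REGION}.amazonaws.com",
            "CustomOriginConfig": {"HTTPPort": 80, "HTTPSPort": 443, "OriginProtocolPolicy": "http-only"},
        },
    ]},
    "CacheBehaviors": {"Quantity": 1, "Items": [
        {   # Anything that looks like a file (has a dot) goes to the REST origin
            "PathPattern": "*.*",
            "TargetOriginId": "s3-rest-origin",
            "ViewerProtocolPolicy": "redirect-to-https",
        },
    ]},
    "DefaultCacheBehavior": {
        # Everything else (paths like /package1/) goes to the website origin,
        # which appends index.html for us
        "TargetOriginId": "s3-website-origin",
        "ViewerProtocolPolicy": "redirect-to-https",
    },
}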
Okay, so we've forced anyone who wants to download our code to go through CloudFront, but how do we restrict access to it? We need some mechanism for authentication, and for the purpose of this demonstration we are going to implement HTTP Basic authentication, which I believe best suits our needs here, but maybe you can try your luck with another scheme. There are two ways of achieving this.
- Using CloudFront's Lambda@Edge, we can run a Lambda on each request and verify that it carries the correct HTTP Basic auth header. I didn't try this myself, but it's worth mentioning (a rough sketch follows this list).
- We can use Amazon's Web Application Firewall (WAF) to check that requests contain the correct auth headers. This is a bit hacky, but that was the solution I used because it was the simplest and it was good enough.
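I didn't go down the Lambda@Edge road myself, but a viewer-request handler for the first option might look roughly like this (the credentials are hardcoded placeholders just for the sketch):

import base64

# Placeholder credentials; a real setup would not hardcode these
EXPECTED = "Basic " + base64.b64encode(b"Aladdin:OpenSesame").decode()

def handler(event, context):
    # Lambda@Edge viewer-request handler enforcing HTTP Basic auth
    request = event["Records"][0]["cf"]["request"]
    headers = request.get("headers", {})
    auth = headers.get("authorization", [{}])[0].get("value", "")

    if auth == EXPECTED:
        return request  # let the request through to CloudFront/S3

    # Otherwise answer with a proper 401 challenge
    return {
        "status": "401",
        "statusDescription": "Unauthorized",
        "headers": {"www-authenticate": [{"key": "WWW-Authenticate", "value": 'Basic realm="PyPI"'}]},
    }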
Setting Up Web Application Firewall
We can set up a Web Application Firewall (WAF) rule that only passes a request through if it has the right "Authorization" header. Here comes the hacky part: according to RFC 7617, the contents of the HTTP Basic Authorization header follow the pattern "Basic base64(username:password)", so we can hardcode WAF to only accept these pre-defined strings.
For example, if we have an employee named "Aladdin" and he wants his password to be "OpenSesame", we would tell WAF to check for the "Authorization" header and require its contents to be "Basic QWxhZGRpbjpPcGVuU2VzYW1l", where "QWxhZGRpbjpPcGVuU2VzYW1l" is the base64 encoding of "Aladdin:OpenSesame". You can read more about how HTTP Basic Auth works here.
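If you want to reproduce that value yourself, it is a one-liner (shown in Python purely for illustration):

import base64

# "Aladdin:OpenSesame" -> "QWxhZGRpbjpPcGVuU2VzYW1l"
print(base64.b64encode(b"Aladdin:OpenSesame").decode())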
Setting up WAF this way would force all requests going to the CloudFront distribution to contain the Authorization header with the pre-defined string (or any number of strings you add). All other requests get a 403.
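A rule like this can also be expressed in code, for example with the newer wafv2 API in boto3. This is only a sketch with placeholder names, not necessarily a mirror of the exact setup described here; adapt it to however you manage WAF.

import base64
import boto3

# WAF web ACLs for CloudFront must live in us-east-1 with Scope=CLOUDFRONT
waf = boto3.client("wafv2", region_name="us-east-1")
token = base64.b64encode(b"Aladdin:OpenSesame").decode()

waf.create_web_acl(
    Name="private-pypi-basic-auth",   # placeholder name
    Scope="CLOUDFRONT",
    DefaultAction={"Block": {}},      # anything without a matching header -> 403
    VisibilityConfig={"SampledRequestsEnabled": True, "CloudWatchMetricsEnabled": True,
                      "MetricName": "privatePypiAcl"},
    Rules=[{
        "Name": "allow-known-basic-auth",
        "Priority": 0,
        "Action": {"Allow": {}},
        "VisibilityConfig": {"SampledRequestsEnabled": True, "CloudWatchMetricsEnabled": True,
                             "MetricName": "allowKnownBasicAuth"},
        "Statement": {"ByteMatchStatement": {
            "SearchString": f"Basic {token}".encode(),
            "FieldToMatch": {"SingleHeader": {"Name": "authorization"}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "EXACTLY",
        }},
    }],
)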
Since, in the previous section, we forced all requests for package and wheel files to go through CloudFront, every request to download them must now be authenticated or it gets a 403.
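On the client side, pip can supply those credentials if you embed them in the index URL. A minimal sketch, with a placeholder CloudFront domain:

import subprocess
from urllib.parse import quote

username = "Aladdin"
password = "OpenSesame"              # pip turns user:pass into the Basic auth header
domain = "d1234abcd.cloudfront.net"  # placeholder: your distribution's domain name

index_url = f"https://{quote(username)}:{quote(password)}@{domain}/"

# Equivalent to: pip install --index-url https://user:pass@<domain>/ package1
subprocess.run(["pip", "install", "--index-url", index_url, "package1"], check=True)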
Potential Shortcomings Of This Solution
Now we have a completely serverless PyPI repository whose only upfront cost is building the tooling that uploads files to S3 and updates the HTML files (which you would have had to do anyway). After that it's set-and-forget: the cost is negligible, and the only time you need to touch it is when a team member joins or leaves, to add or remove the rule that gives them access.
However, no system is without a tradeoff. Here are the potential shortcomings of this system.
- Anyone who knows the bucket name and region can read your HTML files, which lets them see your package names and their versions. They won't be able to download any code, however. While this is mostly harmless, you might not want that for some reason.
- Requests that do not contain the "Authorization" header receive a 403 response instead of the conventional 401. This is more of a nuisance than an actual shortcoming, and it arguably adds a small layer of security through obscurity that isn't really required, but hey, why not? If you strictly need 401s, look at the Lambda@Edge option instead of WAF.
- If you forget to configure your S3 files to never be cached, in both S3 and CloudFront, there can be some delay before a new push becomes usable.
- If you set up a script that pushes to S3 and also updates the HTML files with the new uploads, you may have a race condition on your hands: multiple team members can push at the same time and end up writing different versions of the same HTML file. To fix this, you can have the script push the update metadata to an SQS queue and let the queue trigger an AWS Lambda that collects the updates and makes the necessary changes to the HTML files. Alternatively, instead of pushing the update metadata to SQS yourself from the script, you can have S3 Event Notifications trigger the Lambda and refactor the Lambda code accordingly (a rough sketch of that variant follows below).
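For that second variant, a sketch of such a Lambda might look like this. It assumes the S3 event notification is configured to fire only for wheel and archive uploads, so the function never reacts to its own index.html writes.

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def handler(event, context):
    # Rebuild a package's index.html whenever a new file lands under its prefix
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # e.g. "package1/pkg-1.0-py3-none-any.whl"
        prefix = key.rsplit("/", 1)[0] + "/"

        # List everything under the package prefix, keeping only the downloadable files
        objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
        files = [o["Key"].split("/")[-1] for o in objects if not o["Key"].endswith("index.html")]

        links = "\n".join(f'    <a href="{f}">{f}</a><br/>' for f in sorted(files))
        html = f"<!DOCTYPE html>\n<html>\n  <body>\n{links}\n  </body>\n</html>\n"

        s3.put_object(Bucket=bucket, Key=f"{prefix}index.html", Body=html.encode(),
                      ContentType="text/html", CacheControl="no-cache")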
I would love to hear any insights, feedback, comments, or questions you have on this blogpost and thank you for reading this far.