# How can we provide easy Jupyter Notebook access to PUDL? #2610
Replies: 2 comments 5 replies
---
Not much to add, but I wanted to second Kaggle and Colab as the places that have always stood out for free compute resources. Also, it's Google all the way down: Kaggle was bought by Google back in 2017. As for product longevity, I'd speculate that Kaggle outlasts Colab, because free-for-users compute directly helps grow their platform and thus their core business, while Colab is probably a quite expensive branding exercise for Google. But as long as "AI" is hot, it's hard to imagine them letting go of it. Or of Kaggle.
---
The context of this problem seems like it will change significantly for the better once we move to distributing data instead of code, which we intend to do in the next few months. Once we make that switch, will the 2i2c hardware limitations still be a concern? 6 GB is not enough to run the whole ETL, but it seems like plenty to explore a SQLite database. CEMS processing runs in Dask and is therefore out-of-core capable, which should meet the needs of the few people who really want to dive into the weeds of full-granularity CEMS data. Also, who is the intended audience here? Someone who is comfortable with Python/R/SQL but uncomfortable downloading a SQLite DB and working locally. I'm sure somebody is in the intersection of that Venn diagram, but I can't see it being that big. I'm wondering if we should just keep 2i2c for now for the sake of simplicity, and defer the switching costs until hosting actually becomes expensive, if it ever does. I agree that switching to a platform like Kaggle would be preferable, but in a world of budget priorities, I'd vote for spending this time fixing our data distribution system instead. I think our users would appreciate that more.
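To make the "just download the SQLite DB and work locally" path above concrete, here is a minimal sketch using pandas. The toy `plants` table is a stand-in for real PUDL tables; with actual data you would point `sqlite3.connect()` at the downloaded `pudl.sqlite` instead.

```python
import sqlite3
import pandas as pd

# Stand-in for pudl.sqlite: one toy table of plant capacities.
# With the real DB: conn = sqlite3.connect("pudl.sqlite")
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plants (plant_id INTEGER, state TEXT, capacity_mw REAL);
    INSERT INTO plants VALUES (1, 'CO', 120.5), (2, 'CO', 80.0), (3, 'TX', 300.0);
""")

# Typical exploratory query: total capacity by state.
capacity = pd.read_sql(
    "SELECT state, SUM(capacity_mw) AS capacity_mw FROM plants GROUP BY state",
    conn,
)
print(capacity)
```

This is the whole workflow the comment is describing: one download, one `connect()` call, and ordinary pandas/SQL from there, well within a 6 GB memory budget for most tables.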
---
One type of user that we would like to serve with PUDL data is folks who are comfortable doing semi-programmatic, exploratory, interactive analysis using Python in a Jupyter Notebook, but who aren't necessarily comfortable dealing with Python environment setup, or running a local notebook server. Ideally we'd be able to drop somebody into an example notebook that has access to all of the data and appropriate software packages with nothing more than a URL.
We've had a "pilot" JupyterHub managed by 2i2c for quite a while, but it hasn't seen a lot of use by us or others, for a couple of reasons:
The original idea was that it would be very useful to have scalable computational resources (e.g. backed by a Dask cluster, maybe on GKE) that were easily accessible to ourselves and others, with always-fresh data and software ready to go.
2i2c is understandably wondering what we actually want to do longer term, and I think we are wondering whether this is the best way to provide interactive Notebook access to PUDL. So what are the other options out there these days, and what do we actually need?
## Desired setup
## Questions
## Notebook + Data Hosting Options
### 2i2c
2i2c is a non-profit interactive computation infrastructure provider run by many of the same folks behind Project Jupyter.
#### Advantages
#### Questions
### Binder
For lightweight live Jupyter Notebooks, mybinder.org has been the go-to solution for a while. Unfortunately, the Binder Project (which is run by many of the same people as 2i2c) is running out of compute credits and is scaling down. The volume of data we're trying to provide access to, and the CPU/RAM resources required to work with it effectively, are also well beyond what mybinder offers, I think.
### JupyterLite
JupyterLite runs a notebook server in-browser using Pyodide. We could create and maintain a JupyterLite deployment containing our data and a Python environment. Folks would go to the deployed URL and the data / environment would be downloaded into their browser's sandboxed filesystem.
#### Advantages
#### Questions
### Kaggle
Kaggle provides a remarkable quantity of resources for free. I created a PUDL dataset there to test it out.
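For reference, a minimal and entirely hypothetical `datapackage.json` describing a PUDL SQLite resource might look roughly like this. The field names follow the Frictionless Data Package conventions, but the exact shape of SQLite annotation support should be checked against the current spec and Kaggle's docs:

```json
{
  "name": "pudl",
  "title": "Public Utility Data Liberation (PUDL)",
  "resources": [
    {
      "name": "pudl-sqlite",
      "path": "pudl.sqlite",
      "format": "sqlite",
      "mediatype": "application/vnd.sqlite3"
    }
  ]
}
```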
#### Advantages
- `datapackage.json` and data packages now support annotating SQLite DBs.

#### Questions
### Google Colab
Google Colab provides access to Jupyter notebooks running on GCP resources. They offer a free tier and two paid tiers with increasing resources. They are vague about the resources backing the various plans, but this post suggests ~16/32/52 GB of memory for the free/Pro/Pro+ subscriptions, which cost $0/$10/$50 per month.
#### Advantages
#### Questions