Automate the data release process #2756
We currently have space limitations with the S3 bucket (100 GB), so we need to decide how often we will do data releases and how long we will retain old releases. I asked the AWS folks if there is a ceiling to how much additional storage space we can request. They didn't tell me the limit, but they said they can increase our storage to 1 TB given the size and frequency of our data releases. Our data releases are currently about 20 GB, though they will likely get bigger! We could cut down on the size by:
If we assume our releases grow to 40 GB (more output tables and new datasets!), the 1 TB allocation would let us keep 25 data releases available at once. This equates to roughly 2 data releases a month retained for a full year. This sounds reasonable to me. We'll have to make it clear that releases expire on S3 after a year. If users need to depend on old versions, they can always pull the data from Zenodo or the GCS bucket using requester pays.
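The retention arithmetic above can be checked with a short sketch. The function names are hypothetical and the numbers are the ones quoted in this thread; it assumes 1 TB is counted as 1000 GB, which is what gives the figure of 25 releases.

```python
def retained_releases(bucket_gb: int, release_gb: int) -> int:
    """Number of whole releases that fit in the bucket at the same time."""
    return bucket_gb // release_gb


def retention_months(n_releases: int, releases_per_month: float) -> float:
    """How many months of history the bucket holds at a given release cadence."""
    return n_releases / releases_per_month


# Assumption: 1 TB == 1000 GB (decimal units).
n = retained_releases(bucket_gb=1000, release_gb=40)
print(n)                                          # 25 releases
print(retention_months(n, releases_per_month=2))  # 12.5 months
print(retained_releases(100, 20))                 # 5 releases under the current 100 GB limit
```

At today's 20 GB release size and the current 100 GB limit, the same calculation gives only 5 retained releases, which is why the storage increase matters.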
I think if we get to putting out monthly (or even quarterly) data releases we will be in great shape! So it seems like we should easily be able to store at least a couple of years' worth of releases in free S3 buckets, with Zenodo as the free, public, citation-friendly cold storage for anything older. To start testing a script for pushing releases to Zenodo, we might initially use their Sandbox server, and rather than only trying to push on a tagged release, push on any nightly build that succeeds off of
We don't have much in the way of software distribution infrastructure right now. We'll keep tagging the commits associated with persistent data releases, and those tags will keep getting archived on Zenodo automatically. Do we want to keep pushing I don't think generating
Minimal changes required:
Nice-to-have:
I've pulled the nice-to-haves into #3326 so we can close this issue.
Once #1973 and the implementation of #2517 are complete, we can move to data-only releases. There is some automation we'd like to create to make this process as smooth as possible.
The plan right now is to do a semi-manual data release based on the v2023.12.01 tag and take notes on the process, so we can do an automatic release within the next 2 weeks containing all the post-rename tables. It's just a draft at the moment, but the v2023.12.01 release will be available at: 10.5281/zenodo.10275052
Resolved Questions
Open Questions
main and only if the tag starts with v20*.
conda-forge? I'm inclined to find a low-overhead way to do this, even if it's not guaranteed to work or be reproducible, just so the package can be installed without git reference gymnastics.
v2023.12.01 into v2023.12.1. However, GitHub and the outputs on GCS & S3 don't do this, and the tag we applied on GitHub is v2023.12.01, so this could result in some confusion / annoyance. Do we want to use the no-leading-zeroes version of CalVer?
For Future Consideration
nightly-YYYY-MM-DD commit to tag with vYYYY.MM.DD, and then we should be able to look up their nightly build outputs and distribute them without needing to do a build at all. Doing a release would only take a few minutes then, and could hopefully be done on a GitHub runner. The biggest single file we need to distribute right now is CEMS, at 6 GB. So if we're downloading from S3 and re-uploading one file at a time to Zenodo, that shouldn't be a problem. With the whole release being under 10 GB we could probably download it all and re-upload it, but that probably won't be true forever. Or we could just do releases on a bigger runner with more disk. They should only take as long to run as it takes to download and re-upload the data. Or maybe in the New Zenodo API there's some way to copy directly from cloud to cloud without downloading the files locally at all? A boy can dream.
fsspec to transfer the local build outputs up to both S3 and GCS. Seems like something to integrate into the pythonization of the build / deploy script.
Tasks