Update datastore to read from Zenodo's new InvenioRDM API #2939
Response from Zenodo regarding this issue:
It looks like an easy way to repro this locally is:
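A rough sketch of that kind of repro, assuming a placeholder Zenodo record ID rather than a real PUDL archive DOI:

```python
# Rough repro sketch: request the same record via the old depositions route
# (what ZenodoFetcher builds today) and the new records/files route, and
# compare the responses. The record ID below is a placeholder.
import requests

zenodo_id = "1234567"  # hypothetical record ID

old_url = f"https://zenodo.org/api/deposit/depositions/{zenodo_id}"
new_url = f"https://zenodo.org/api/records/{zenodo_id}/files"

for url in (old_url, new_url):
    resp = requests.get(url, timeout=30)
    print(url, resp.status_code)  # see which route still resolves after the migration
```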
I can start messing around with "relying on the links they're sending in the payload," though I wonder about performance.
We also have a pytest option that tells pytest not to use cached input data and instead download it directly from Zenodo, which will reproduce this error:
I re-tried the
We can get the new Zenodo record by changing the URL that `ZenodoFetcher` constructs:

```diff
diff --git a/src/pudl/workspace/datastore.py b/src/pudl/workspace/datastore.py
index 225068b37..6808fecf4 100644
--- a/src/pudl/workspace/datastore.py
+++ b/src/pudl/workspace/datastore.py
@@ -257,7 +257,7 @@ class ZenodoFetcher:
             api_root = "https://zenodo.org/api"
         else:
             raise ValueError(f"Invalid Zenodo DOI: {doi}")
-        return f"{api_root}/deposit/depositions/{zenodo_id}"
+        return f"{api_root}/records/{zenodo_id}/files"
 
     def _fetch_from_url(self: Self, url: HttpUrl) -> requests.Response:
         logger.info(f"Retrieving {url} from zenodo")
```

This gives us a JSON response listing the record's files, including links we could use to download them.
However, this still leaves us with the issue that our existing datapackage.json files still point at the old, broken URLs.

Options:
The second option has a bunch of complications:
My pitch is to do the quick patch above. My other pitch is to split this ticket into "be able to read data from Zenodo again" and "be able to write data to Zenodo again" - since the changes will largely be in different repositories anyways.
Reading Zenodo's response again, I'm concerned they're saying there is no durable URL we can use in the data packages to download an archived file given the record ID and the filename, and that the only way to reliably obtain a download path is to interact with the API, which won't work with the requirements of the data packages.
I think the immediate problem is being able to construct the correct URLs based on information that's available in the old archives, but this is just a bandaid so that the old archives aren't useless and we can keep using them while things switch over to the new system, which I think just means some hacky changes in the datastore code.

With the 2nd option above, aren't we ignoring the (now broken) paths in the datapackages in either case? In the first, we're reconstructing the URL based on the new pattern, and in the 2nd we'd be using the API to obtain the path rather than constructing it.

But in the long run we need durable static URLs that point to the individual resources (files) in the archives that we can put in the datapackage.json files.
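A minimal sketch of the first option, assuming the record ID and filename can be recovered from the old archives and that the new pattern is the records route (both assumptions to verify):

```python
# Sketch of option 1: rebuild a download URL from pieces already present in
# the old archives, assuming the new /records/{id}/files/{name} pattern.
def new_zenodo_url(record_id: str, filename: str) -> str:
    """Construct a new-style Zenodo download URL from a record ID and filename."""
    return f"https://zenodo.org/records/{record_id}/files/{filename}"


# e.g. new_zenodo_url("1234567", "epacems-2020.zip")
# -> "https://zenodo.org/records/1234567/files/epacems-2020.zip"
```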
I explained this issue to Zenodo tech support and got another response... with another way to construct the download URL.
For example:
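One plausible construction, assuming InvenioRDM's per-file content route (not something confirmed in this thread), would be `https://zenodo.org/api/records/<record_id>/files/<filename>/content`.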
Incidentally, that link only took 14 seconds to download 131 MB (though trying the previous API link now, it also downloaded in 14 seconds, so maybe there's just a 10x variation in how fast things download).
On October 13th Zenodo switched over to using InvenioRDM as their backend. This switch seems to have broken all of the URLs in all of the datapackage.json files in every archive we've ever made, rendering them no longer usable by our current software.

In the immediate term, it would be great if we can just change the software to work with both the old and new link formats, so we don't have to update all of our archives right now. But then we should update the archivers to start creating archives with the new link format, and ideally also to use the new API for the work they do. That may not be absolutely necessary if the old API has been preserved -- which they say it should have been, but... we'll see.
Tasks
Valid file download links take the form:
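Presumably something along the lines of `https://zenodo.org/records/<record_id>/files/<filename>` (an assumed pattern, not copied from a real archive).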
It appears that an API key is no longer required to download data, which is great! You can do:
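A sketch of an unauthenticated download using that pattern; the record ID and filename are placeholders, not real PUDL archive values:

```python
# Sketch: download a file from the new-style Zenodo URL with no API token.
# The record ID and filename are placeholders.
import requests

record_id = "1234567"          # hypothetical Zenodo record ID
filename = "epacems-2020.zip"  # hypothetical archived file name
url = f"https://zenodo.org/records/{record_id}/files/{filename}"

resp = requests.get(url, timeout=60)  # note: no Authorization header
resp.raise_for_status()
with open(filename, "wb") as f:
    f.write(resp.content)
```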
Downloads from these URLs appear to be relatively performant. Working from a GCP VM I got about 10 Mbit/s download speeds, which I think is similar to what we had with the direct storage links before. IIRC, we originally decided to use the direct links to the storage buckets because downloads were much faster than with the API-constructed links.