Update datastore to read from Zenodo's new InvenioRDM API #2939

Closed
1 of 2 tasks
zaneselvans opened this issue Oct 14, 2023 · 7 comments · Fixed by #2942
Labels
zenodo Issues having to do with Zenodo data archiving and retrieval.

Comments

@zaneselvans
Member

zaneselvans commented Oct 14, 2023

On October 13th Zenodo switched over to using InvenioRDM as their backend. This switch seems to have broken all of the URLs in all of the datapackage.json files in every archive we've ever made, rendering them no longer usable by our current software.

In the immediate term, it would be great if we can just change the software to work with both the old and new link formats, so we don't have to update all of our archives right now. But then we should update the archivers to start creating archives with the new link format, and ideally also to use the new API for the work they do, but that may not be absolutely necessary, if the old API has been preserved -- which they say it should have been but... we'll see.

Valid file download links take the form:

https://zenodo.org/api/records/{record_id}/files/{filename}/content

It appears that an API key is no longer required to download data which is great! You can do:

curl https://zenodo.org/api/records/8346646/files/datapackage.json/content
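The same request can be scripted. Below is a minimal Python sketch that assumes only the URL pattern above; the helper names `zenodo_content_url` and `fetch_file` are made up for illustration and are not part of PUDL.

```python
import urllib.parse
import urllib.request

ZENODO_API_ROOT = "https://zenodo.org/api"


def zenodo_content_url(record_id: int, filename: str) -> str:
    """Build the API download URL for one file in a Zenodo record."""
    # Percent-encode the filename so spaces or slashes cannot break the path.
    quoted = urllib.parse.quote(filename, safe="")
    return f"{ZENODO_API_ROOT}/records/{record_id}/files/{quoted}/content"


def fetch_file(record_id: int, filename: str, timeout: float = 30.0) -> bytes:
    """Download a file with no API key (this performs a network request)."""
    url = zenodo_content_url(record_id, filename)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Only builds the URL; call fetch_file(...) to actually download.
print(zenodo_content_url(8346646, "datapackage.json"))
```

Keeping the URL construction separate from the download makes it possible to check the generated links without touching the network.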

Downloads from these URLs appear to be relatively performant. Working from a GCP VM I got about 10 Mbit/s download speeds, which I think was similar to what we had with the direct storage links before. IIRC, we originally decided to use the direct links to the storage buckets because downloads were much faster than with the API constructed links.

@zaneselvans zaneselvans added the zenodo Issues having to do with Zenodo data archiving and retrieval. label Oct 14, 2023
@zaneselvans zaneselvans moved this from New to Backlog in Catalyst Megaproject Oct 14, 2023
@zaneselvans
Member Author

Response from Zenodo regarding this issue:

Thanks for reporting the issues, and apologies for the troubles it's causing. We'll try to bring back the ability to use the old links (it will probably be by end of the week as things look right now). That said, you should as far as at all possible rely on the links we're sending in the payload. This makes your API integration less prone to URL changes. We'll make sure we make this much more clear in the documentation.

We're still considering the API documented on developers.zenodo.org valid, though it's built as a compatibility layer and despite having had many integrations test out the API before launch we've still found a lot of small issues that we're trying to iron out. We'll eventually deprecate the API once the dust settles, but that will be with a 1 year migration period from when it is announced. You can find the new API documentation on https://inveniordm.docs.cern.ch/reference/rest_api_index/ for now. Once we announce the API deprecation, we'll have the developers.zenodo.org updated as well.

@jdangerx
Member

It looks like an easy way to repro this locally is:

  1. set the local_cache_path to None:
    diff --git a/test/unit/settings_test.py b/test/unit/settings_test.py
    index 9d4eacba6..5f8b18c85 100644
    --- a/test/unit/settings_test.py
    +++ b/test/unit/settings_test.py
    @@ -263,7 +263,7 @@ def test_partitions_with_json_normalize(pudl_etl_settings):
    
     def test_partitions_for_datasource_table(pudl_etl_settings):
         """Test whether or not we can make the datasource table."""
    -    ds = Datastore(local_cache_path=PudlPaths().data_dir)
    +    ds = Datastore(local_cache_path=None)
         datasource = pudl_etl_settings.make_datasources_table(ds)
         datasets = pudl_etl_settings.get_datasets().keys()
         if datasource.empty and datasets != 0:
  2. run this unit test:
    $ pytest test/unit/settings_test.py::test_partitions_for_datasource_table
    

I can start messing around with "relying on the links they're sending in the payload," though I wonder about performance.

@zaneselvans
Member Author

We also have a pytest option that tells pytest not to use cached input data and instead download it directly from Zenodo no matter what, which will reproduce this error:

pytest --tmp-data test/unit/settings_test.py::test_partitions_for_datasource_table

I re-tried the curl download above this morning, both locally and from a VM, and it was much much better than my test over the weekend. I got about 10 Mbit/sec download speeds, which I think is similar to what we were getting before:

curl https://zenodo.org/api/records/8346646/files/datapackage.json/content

@jdangerx
Member

jdangerx commented Oct 16, 2023

We can get the new Zenodo record by changing the ZenodoFetcher's _get_url(doi) method:

diff --git a/src/pudl/workspace/datastore.py b/src/pudl/workspace/datastore.py
index 225068b37..6808fecf4 100644
--- a/src/pudl/workspace/datastore.py
+++ b/src/pudl/workspace/datastore.py
@@ -257,7 +257,7 @@ class ZenodoFetcher:
             api_root = "https://zenodo.org/api"
         else:
             raise ValueError(f"Invalid Zenodo DOI: {doi}")
-        return f"{api_root}/deposit/depositions/{zenodo_id}"
+        return f"{api_root}/records/{zenodo_id}/files"
 
     def _fetch_from_url(self: Self, url: HttpUrl) -> requests.Response:
         logger.info(f"Retrieving {url} from zenodo")

This gives us a JSON response with an entries key, which includes records with the following important fields:

  • key: the filename
  • links.content: the new download path
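As a rough sketch of how those two fields could be used, the snippet below turns a files listing into a filename-to-URL mapping. The `content_links` helper and the example payload are hypothetical; the payload is trimmed to just the `entries`, `key` and `links.content` fields mentioned above, and a real response carries more.

```python
def content_links(files_response: dict) -> dict[str, str]:
    """Map each filename (``key``) to its download URL (``links.content``)."""
    return {
        entry["key"]: entry["links"]["content"]
        for entry in files_response.get("entries", [])
    }


# Hypothetical, trimmed-down payload for illustration only.
example_response = {
    "entries": [
        {
            "key": "datapackage.json",
            "links": {
                "content": "https://zenodo.org/api/records/8346646/files/datapackage.json/content"
            },
        }
    ]
}

links = content_links(example_response)
print(links["datapackage.json"])
```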

However, this still leaves us with the issue that our datapackage.json points at the old Zenodo API endpoints.

Options:

  • In the existing DatapackageDescriptor.get_resource_path method, we can detect the old-style links and replace them with generated ones according to the format f"https://zenodo.org/api/records/{self.doi}/files/{name}/content".
  • We could also try to use the new API to get the official new path - which would mean matching each entries.[].key to the datapackage.resources.[].name - this means we just always ignore the paths in the datapackage, which seems kind of silly, but also better conveys that Zenodo is the actual keeper of the filepaths.

The second option has a bunch of complications:

  • should we then persist the Zenodo files/entries API output in the archive itself? Seems like it defeats the purpose of having the most up-to-date filepaths for each resource.
  • should we re-request the Zenodo filepaths every time we try to access the datastore? Seems like a lot of excess calls to Zenodo.

My pitch is to do the quick patch of get_resource_path for now, so our code works at all, and then think about the bigger questions maybe as part of the archiver changes.
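A minimal sketch of what that quick patch might look like, written as a standalone function rather than the real get_resource_path. It assumes that the filename is the last path segment of the old link and that the numeric record ID is already known; `patched_resource_path` and the old-style URL used in the example are hypothetical.

```python
import re
from urllib.parse import quote, unquote, urlparse

# New-style API download links, following the pattern given earlier in this issue.
NEW_STYLE = re.compile(r"^https://zenodo\.org/api/records/\d+/files/[^/]+/content$")


def patched_resource_path(path: str, record_id: int) -> str:
    """Return a new-style download URL for a datapackage resource path."""
    if NEW_STYLE.match(path):
        return path  # already in the new format, leave it untouched
    # Assumption: the last path segment of the old link is the filename.
    filename = unquote(urlparse(path).path.rsplit("/", 1)[-1])
    return (
        f"https://zenodo.org/api/records/{record_id}"
        f"/files/{quote(filename, safe='')}/content"
    )


# Hypothetical old-style link, used only to exercise the function.
old_path = "https://zenodo.org/api/files/some-bucket-id/ferc1-2018.zip"
print(patched_resource_path(old_path, 8326634))
```

Treating anything that does not match the new pattern as an old link keeps the patch independent of the exact old URL layout, at the cost of rewriting any unexpected path as well.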

My other pitch is to split this ticket into "be able to read data from Zenodo again" and "be able to write data to Zenodo again" - since the changes will largely be in different repositories anyways.

@zaneselvans
Member Author

Reading Zenodo's response again, I'm concerned that they're saying there is no durable URL we can use in the data packages to download an archived file, given the record ID and the filename, and the only way to reliably obtain a download path is to interact with the API, which won't work with the requirements of the data packages.

@zaneselvans
Member Author

I think the immediate problem is being able to construct the correct URLs based on information that's available in the old archives. But this is just a bandaid so that the old archives aren't useless and we can keep using them while things switch over to the new system, which I think just means some hacky changes in datapackage.py.

With the 2nd option above, aren't we ignoring the (now broken) paths in the datapackages in either case? In the first, we're reconstructing the URL based on the new pattern, and in the 2nd we'd be using the API to obtain the path rather than constructing it. But in the long run we need durable static URLs that point to the individual resources (files) in the archives that we can put in the datapackage.json. Otherwise we don't have data packages. If Zenodo is going to constantly be changing the URLs associated with archived files this whole setup is not going to work.

@zaneselvans
Member Author

zaneselvans commented Oct 16, 2023

I explained this issue to Zenodo tech support and got another response... with another way to construct the download link:

What I would say is that the file URLs are not persistent links. It's not a guarantee we'll keep them stable, yet we try as much as possible (and we're not going to change this in the near future - last change was 7 years ago). And when I say to follow the links in the payload it's to be based on HATEOAS principles (which obviously doesn't solve your issue). The main reason for changing anything with the files is usually performance or distribution of the data (e.g. the current change is to more easily integrate a third-party storage interface). Ideally it would be nice if the DOI infrastructure could allow linking directly to the file as well, which could give you persistent links, but unfortunately that's not possible.

Now I fully see your issue in that some tools who will use the data need static links (also for performance/efficiency). The best link you can use now is:

https://zenodo.org/records/{record_id}/files/{filename}

For example:

curl --output ferc1-2018.zip https://zenodo.org/records/8326634/files/ferc1-2018.zip

Incidentally, that link only took 14 seconds to download 131MB (though trying the previous API link now, it also downloaded in 14 seconds, so maybe there's just a factor of 10x variation in how fast things download).
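For reference, a small sketch of building that static link from an archive's DOI. It assumes the record ID is the numeric suffix of a production Zenodo DOI of the form 10.5281/zenodo.N; `record_file_url` is a hypothetical helper, not an existing PUDL function.

```python
import re

# Assumption: production Zenodo DOIs look like "10.5281/zenodo.<record_id>".
DOI_PATTERN = re.compile(r"^10\.5281/zenodo\.(\d+)$")


def record_file_url(doi: str, filename: str) -> str:
    """Build the static (non-API) download link for a file in a record."""
    match = DOI_PATTERN.match(doi)
    if match is None:
        raise ValueError(f"Invalid Zenodo DOI: {doi}")
    record_id = match.group(1)
    return f"https://zenodo.org/records/{record_id}/files/{filename}"


print(record_file_url("10.5281/zenodo.8326634", "ferc1-2018.zip"))
```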

@jdangerx jdangerx moved this from Backlog to In progress in Catalyst Megaproject Oct 16, 2023
@jdangerx jdangerx moved this from In progress to In review in Catalyst Megaproject Oct 16, 2023
@jdangerx jdangerx changed the title Update datastore and archivers to work with InvenioRDM API Update datastore to read from Zenodo's new InvenioRDM API Oct 16, 2023
@jdangerx jdangerx moved this from In review to Done in Catalyst Megaproject Oct 17, 2023