-
-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Icebox: Migrate to new Zenodo InvenioRDM API #184
Conversation
This reverts commit eb7e52c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you've successfully made archives in prod. Which is good, and means we have probably one of the few functioning Zenodo API clients in the world.
However, tests are failing, and there are some weird data modelling issues e.g. DepositionVersion
not having a submitted
field, which don't feel ready for merge into main
.
There's also just lots of undocumented behavior that we're using, but I'm not sure is necessary, but we can't really tell without the documentation or lots of trial and error.
I think we have a few options here:
- merge this into
main
, and make a big cleanup pass later when Zenodo is stabilized - keep this un-merged, wait for Zenodo to stabilize, and then make a cleanup pass before merging; in the meantime, do all manual archive creation on this branch
- keep this un-merged, but investigate whether the legacy API integration that's currently on
main
works again - they had mentioned trying to get it back online and it appears to be at least somewhat back. - refactor this to make a "new zenodo API depositor" that shares the same public API as the legacy zenodo API depositor and lives alongside it (instead of replacing the existing zenodo depositor), so we can switch between the two as Zenodo figures out what's happening
I think that if we can't make prod archives using the legacy API integration on main
and we can make prod archives with this version, we should just merge into main
and fix things once zenodo calms down a bit.
It would maybe make sense to wait a few more days and see if we can get sandbox tests passing. It looks like the records are now being successfully published on sandbox, even though the publish request returns a 500. So maybe in a few days the sandbox will be less flaky and we can actually try to run these tests.
@@ -20,9 +20,9 @@ async def get_resources(self) -> ArchiveAwaitable: | |||
"""Download EIA-860 resources.""" | |||
link_pattern = re.compile(r"eia860(\d{4})(ER)*.zip") | |||
for link in await self.get_hyperlinks(BASE_URL, link_pattern): | |||
year = link_pattern.search(link).group(1) | |||
year = int(link_pattern.search(link).group(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of scope: we should stick mypy
in the pre-commit config so these sorts of type errors don't slip through in the future. This will probably mean fixing a ton of type errors in a separate PR before we can do that.
parser.add_argument( | ||
"--refresh-metadata", | ||
action="store_true", | ||
help="Regenerate metadata from PUDL data source rather than existing archived metadata.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the Zenodo metadata, like "creators", "communities", "DOI" etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, creators, title, keywords, basically all metadata. The outcome is the same as if we'd run initialize
except for version and link to the pre-existing depositions. We need this because there's no other way to migrate old-type depositions, but it's also nice to have in case we ever want to update metadata for an existing archive.
mt = MEDIA_TYPES[filename.suffix[1:]] | ||
# Remove /api, /draft, /content from link to get stable path | ||
if "/api" or "/draft" or "/content" in file.links.self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a good case for regex...
stable_path = re.sub(r"/(api|draft|content)", "", file.links.self)
r"\d{6,7}$", | ||
published_deposition.conceptrecid, | ||
published_deposition.doi, | ||
) | ||
if self.sandbox: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: Our sandbox/prod DOI setup is a little awkward, I wonder if we can stop having to check sandboxiness in so many places.
"sponsor", | ||
"supervisor", | ||
"work package leader", | ||
]: # Unclear to me what roles are allowed here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, that sounds frustrating. How did you figure out this list in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By manually copying over the list of options on the Zenodo site, essentially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, our original data model here matched up much better with Zenodo's API! 🪦
at the original.""" | ||
draft = await depositor.get_new_version(initial_deposition) | ||
|
||
draft = await depositor.get_new_version(initial_deposition, data_source_id="test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the initial_deposition is a DepositionVersion
which doesn't have a submitted
field - which then breaks when we check it in get_new_version
.
The submitted
field isn't anywhere in the docs for the GET /api/records/{id}
endpoint, but it does get returned from the API in both the records/{id}
endpoint as well as /records/{id}/versions/latest
, so we could just add that to DepositionVersion
.
But, submitted
is not an expected response field, according to the docs.
But, the corresponding response field in the docs, is_published
, is actually not being returned from the API.
And there's a third status
field which seems to correspond to submitted-ness that doesn't show up in the docs either.
In any case - we should do something so that get_new_version
can handle the DepositionVersion
response from publish_deposition
and the Deposition
response. I suppose we can just work off of the actual shape of the data instead of the docs.
"record": "public", | ||
"files": "public", | ||
}, | ||
"files": {"enabled": True}, # Must always be True for Zenodo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we just want this to be True
all the time because we want to enable file attachments? Or does it have to be True
because Zenodo throws a cryptic error if it's not set to True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It throws an error if True, not documented anywhere. As far as I can tell it's technically a setting that the network configurator can change to allow metadata only records but this doesn't seem to be enabled in Zenodo. This is ok since we only ever want to upload records with files. It also matches the behavior of old Zenodo API (see the expected failure for the empty_deposition
test, e.g.
) | ||
# Reserve DOI for deposition | ||
doi_response = await self.reserve_doi(response) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find anything about this reserve_doi
thing in the docs/support emails/github issues - what is this doing and how did you figure it out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the "reserve_doi"
link returned in the list of links for a file. It's not listed anywhere other than being described as an option for developers using InvenioRDM to enable. Made some informed guesses about what it would take based on this publish method that wound up being right, if I remember right.
@@ -239,7 +244,9 @@ async def run(self) -> RunSummary | Unchanged: | |||
for name in files_to_delete | |||
] | |||
|
|||
self.new_deposition = await self.depositor.get_record(self.new_deposition.id_) | |||
self.new_deposition = await self.depositor.get_draft_record( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pleased that the changes to orchestrator.py
are so light / the interface to the depositor didn't change much!
Once Zenodo stabilizes and the docs exist, we can maybe think about what the depositor interface should look like going forward... something along the lines of:
- create new deposition
- update the deposition file contents (all the draft state management might want to be hidden from the outside)
- publish the deposition
But - since we're most likely going to be tied to Zenodo for the forseeable future, and I don't expect them to be changing the API willy-nilly anytime soon, maybe that's not a super useful refactor.
We'll close this for now since we're not actively working on it for a minute and will probably have to redo a bunch of stuff anyways. But, this will be super useful documentation for when they finally do release the new API. |
See #183 for detailed issue description. Zenodo's endpoints for their API have changed, breaking our existing archivers. Let's get them migrated.
create_deposition
get_deposition
get_record
get_new_version
publish_deposition
delete_deposition
create_file
delete_file
update_file
datapackage.json
to use new pathcreators
metadata to match new format https://inveniordm.docs.cern.ch/reference/metadata/#creators-1-ndelete_deposition
.5281
prefix and new values/content
extension from datapackagerefresh_metadata
flagcensus_dp1tract
,eia176
,eia860
,eiawater
,epacems
,ferc60
and update sandbox dois for all data sourcesOut of scope:
Check that the code works for:
--sandbox
--dry-run
--initialize
Updates about remaining kinks in the transition:
--auto-publish
on the sandbox server - I get a "Attempt to decode JSON with unexpected mimetype: text/html" 404 error. Underlying this is a 504 Gateway Timeout for the sandbox. This also means the tests fail.creators
field response != thecreators
field expected format. This means you can'tGET
a deposition andPUT
its metadata back. Updating a draft record with one field (e.g. theorder
field for thefiles
dictionary) also seems to wipe all the rest of the inputted values for themetadata
. Very annoying behavior that means we should avoid updating records by any means other than linking once they're created.