Attempt to create sandbox data release in nightly builds. #3158

Merged: 2 commits, Dec 14, 2023

Conversation

zaneselvans
Member

Overview

Attempt to create a sandbox data release after the nightly builds succeed. This is just meant to test the data release publication process until we switch over to creating real releases from tagged builds.

I think the workflow needs to be added to main before it'll work in the scheduled builds though.

I added ZENODO_SANDBOX_TOKEN_PUBLISH to the organization secrets.

@zaneselvans added the zenodo and release labels on Dec 14, 2023
@zaneselvans self-assigned this on Dec 14, 2023
@zaneselvans linked an issue on Dec 14, 2023 that may be closed by this pull request
Member

@bendnorman left a comment

Mostly looks good to me. Left a comment about log handling.

@@ -122,6 +127,8 @@ if [[ $ETL_SUCCESS == 0 ]]; then
    if [ $GITHUB_ACTION_TRIGGER = "push" ] || [ $GITHUB_REF = "dev" ]; then
        copy_outputs_to_distribution_bucket
        ETL_SUCCESS=${PIPESTATUS[0]}
        zenodo_data_release 2>&1 | tee -a $LOGFILE
Member

Given our current logic, the logs from the release script will show up in the log file that is sent to Slack, but they won't show up in the log file copied to the GCS bucket. I think copy_outputs_to_gcs should be called towards the end of the script.
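
For reference, a minimal sketch of the `2>&1 | tee -a $LOGFILE` pattern from the snippet above, assuming bash. `failing_step` and the temporary log file are hypothetical stand-ins, not part of the nightly build script. It illustrates that the pipeline's exit status is `tee`'s, so the step's own status has to be read from `PIPESTATUS[0]` immediately after the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a release step: prints a message, then fails.
failing_step() {
    echo "publishing draft"
    echo "upload failed" >&2
    return 3
}

# Temporary file used as the log for this sketch (assumption).
LOGFILE="$(mktemp)"

# Merge stderr into stdout so both streams reach the log, then append with tee.
failing_step 2>&1 | tee -a "$LOGFILE"
# Read this immediately: PIPESTATUS is overwritten by the next command.
STEP_STATUS=${PIPESTATUS[0]}

echo "step exit status: $STEP_STATUS"
echo "log lines captured: $(wc -l < "$LOGFILE" | tr -d ' ')"
```

Both messages end up in the log even though the step failed, and the reported status is 3 rather than the 0 returned by `tee`.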

Member Author

There's a problem here now that we're doing some post-processing for distribution -- removing files we don't want to distribute, gzipping the SQLite DBs. What we copy to GCS is for forensic purposes, which is different from what we're currently shipping to AWS, Kaggle, Zenodo, etc.

Member

Ah, that's right. I guess it's not a huge deal if we don't include the Zenodo release logs in the GCS bucket. We could also change the script to dump the outputs to GCS after the ETL runs, then dump the logs after most of the post-processing/distribution logic has happened.

P.S. I would love to rework this system so all of the logs are collected for us.

Member Author

Do you think it'd be better to capture the additional steps in the logs, or only save the outputs we distribute to the nightly build buckets?

Member Author

I think an overhaul of the nightly build system is definitely on the docket for next year.

Member

I don't think I understand your question:

> Do you think it'd be better to capture the additional steps in the logs, or only save the outputs we distribute to the nightly build buckets?

Are you asking which logs should be written to the log file?

Member Author

Never mind, I was seeing it as either/or based on the ordering, but re-saving the logfile fixes everything.

    fi
fi

# This way we also save the logs from later steps in the script
gsutil cp $LOGFILE ${PUDL_GCS_OUTPUT}
Member Author

Okay, I added a re-copy of the logfile after everything else.
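
A small sketch of why the re-copy helps, assuming local directories in place of the GCS bucket and plain `cp` in place of `gsutil cp`. `copy_log_to_bucket`, `BUCKET`, and the log messages are made up for illustration and are not taken from the build script.

```shell
#!/usr/bin/env bash
# Local directories stand in for the build host and the GCS bucket (assumption);
# the real script uses `gsutil cp` rather than `cp`.
WORKDIR="$(mktemp -d)"
LOGFILE="$WORKDIR/build.log"
BUCKET="$WORKDIR/bucket"
mkdir -p "$BUCKET"

# Hypothetical helper that mimics copying the log to the bucket.
copy_log_to_bucket() {
    cp "$LOGFILE" "$BUCKET/"
}

echo "ETL finished" | tee -a "$LOGFILE"
copy_log_to_bucket   # early copy: contains only the ETL line

# Later steps keep appending to the local log after the early copy.
echo "post-processing and release steps finished" | tee -a "$LOGFILE"
EARLY_LINES=$(wc -l < "$BUCKET/build.log" | tr -d ' ')

copy_log_to_bucket   # final re-copy: picks up the later steps too
FINAL_LINES=$(wc -l < "$BUCKET/build.log" | tr -d ' ')

echo "bucket log lines after early copy: $EARLY_LINES"
echo "bucket log lines after re-copy: $FINAL_LINES"
```

The bucket copy stays at one line until the second copy, which overwrites it with the complete two-line log.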

Member

@bendnorman left a comment

Looks good! This should be merged into main because we are updating a workflow file.

@zaneselvans changed the base branch from dev to main on December 14, 2023 22:14
@zaneselvans force-pushed the nightly-sandbox-data-release branch from 36706d7 to 081182c on December 14, 2023 22:20
@zaneselvans marked this pull request as ready for review on December 14, 2023 22:24
@zaneselvans merged commit 66c73b4 into main on Dec 14, 2023
11 of 15 checks passed
@zaneselvans deleted the nightly-sandbox-data-release branch on December 14, 2023 22:24
Labels
release: Tasks directly related to data and software releases.
zenodo: Issues having to do with Zenodo data archiving and retrieval.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Automate the data release process
2 participants