Attempt to create sandbox data release in nightly builds. #3158

Merged: 2 commits, Dec 14, 2023
1 change: 1 addition & 0 deletions .github/workflows/build-deploy-pudl.yml
@@ -125,6 +125,7 @@ jobs:
--container-env DAGSTER_PG_HOST="104.154.182.24" \
--container-env DAGSTER_PG_DB="dagster-storage" \
--container-env FLY_ACCESS_TOKEN=${{ secrets.FLY_ACCESS_TOKEN }} \
+--container-env ZENODO_SANDBOX_TOKEN_PUBLISH=${{ secrets.ZENODO_SANDBOX_TOKEN_PUBLISH }} \
--container-env PUDL_SETTINGS_YML="/home/mambauser/src/pudl/package_data/settings/etl_full.yml" \
--container-env PUDL_GCS_OUTPUT=${{ env.GCS_OUTPUT_BUCKET }}/${{ env.COMMIT_TIME }}-${{ env.SHORT_SHA }}-${{ env.COMMIT_BRANCH }}
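The new secret is injected into the batch container as an environment variable, so the release step can read it at runtime. A minimal sketch of the kind of guard a consuming script might use, assuming only the variable name shown in the diff above (this check is not part of the PR):

```bash
# Sketch (assumed, not from this PR): fail fast if the token was not injected.
if [[ -z "${ZENODO_SANDBOX_TOKEN_PUBLISH:-}" ]]; then
    echo "ZENODO_SANDBOX_TOKEN_PUBLISH is not set; cannot publish to the Zenodo sandbox." >&2
    exit 1
fi
```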

10 changes: 10 additions & 0 deletions docker/gcp_pudl_etl.sh
@@ -78,6 +78,11 @@ function copy_outputs_to_distribution_bucket() {
aws s3 cp "$PUDL_OUTPUT/" "s3://intake.catalyst.coop/$GITHUB_REF" --recursive
}

+function zenodo_data_release() {
+    echo "Creating a new PUDL data release on Zenodo."
+    ~/devtools/zenodo/zenodo_data_release.py --publish --env sandbox --source-dir $PUDL_OUTPUT
+}


function notify_slack() {
# Notify pudl-builds slack channel of deployment status
@@ -125,9 +130,14 @@ if [[ $ETL_SUCCESS == 0 ]]; then
if [ $GITHUB_ACTION_TRIGGER = "push" ] || [ $GITHUB_REF = "dev" ]; then
copy_outputs_to_distribution_bucket
ETL_SUCCESS=${PIPESTATUS[0]}
+zenodo_data_release 2>&1 | tee -a $LOGFILE
Member:
Given our current logic, the logs from the release script will show up in the log file that is sent to Slack, but they won't show up in the log file copied to the GCS bucket. I think copy_outputs_to_gcs should be called towards the end of the script.

Member (Author):
There's a problem here now that we're doing some post-processing for distribution: removing files we don't want to distribute and gzipping the SQLite DBs. What we copy to GCS is for forensic purposes, which is different from what we're currently shipping to AWS, Kaggle, Zenodo, etc.
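(Editor's sketch of the kind of post-processing described above; the file pattern and flags are assumptions, not taken from this PR:)

```bash
# Hypothetical: drop files we don't distribute, then gzip the SQLite DBs
# before shipping to the distribution targets.
rm -f "$PUDL_OUTPUT"/*.log        # assumed example of an excluded file type
for db in "$PUDL_OUTPUT"/*.sqlite; do
    gzip "$db"                    # produces e.g. pudl.sqlite.gz
done
```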

Member:
Ah, that's right. I guess it's not a huge deal if we don't include the Zenodo release logs in the GCS bucket. We could also change the script to dump the outputs to GCS after the ETL runs, then dump the logs after most of the post-processing/distribution logic has happened.

P.S. I would love to rework this system so all of the logs are collected for us.

Member (Author):
Do you think it'd be better to capture the additional steps in the logs, or only save the outputs we distribute to the nightly build buckets?

Member (Author):
I think an overhaul of the nightly build system is definitely on the docket for next year.

Member:
I don't think I understand your question:

> Do you think it'd be better to capture the additional steps in the logs, or only save the outputs we distribute to the nightly build buckets?

Are you asking which logs should be written to the log file?

Member (Author):
Never mind; I was seeing it as either/or based on the ordering, but re-saving the logfile fixes everything.
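For context on the `tee` pattern used above: in bash, `$?` after a pipeline reports the exit status of the *last* command in it (here, `tee`), so a failing release script would otherwise go unnoticed. `PIPESTATUS[0]` recovers the first command's status. A minimal illustration:

```bash
LOGFILE=demo.log
false 2>&1 | tee -a "$LOGFILE"
echo "status via \$?: $?"                       # prints 0: tee's exit status
false 2>&1 | tee -a "$LOGFILE"
echo "status via PIPESTATUS: ${PIPESTATUS[0]}"  # prints 1: false's exit status
```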

+ETL_SUCCESS=${PIPESTATUS[0]}
fi
fi

+# This way we also save the logs from latter steps in the script
+gsutil cp $LOGFILE ${PUDL_GCS_OUTPUT}
Member (Author):
Okay, I added a re-copy of the logfile after everything.
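Putting the thread's resolution together, the resulting flow in gcp_pudl_etl.sh looks roughly like the sketch below; `run_pudl_etl` stands in for the actual ETL invocation, which is outside this diff:

```bash
run_pudl_etl 2>&1 | tee -a "$LOGFILE"
ETL_SUCCESS=${PIPESTATUS[0]}

if [[ $ETL_SUCCESS == 0 ]]; then
    copy_outputs_to_distribution_bucket
    zenodo_data_release 2>&1 | tee -a "$LOGFILE"
    ETL_SUCCESS=${PIPESTATUS[0]}
fi

# Re-copy the log so lines written after the first upload also reach GCS.
gsutil cp "$LOGFILE" "${PUDL_GCS_OUTPUT}"
```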


# Notify slack about entire pipeline's success or failure;
# PIPESTATUS[0] either refers to the failed ETL run or the last distribution
# task that was run above
12 changes: 6 additions & 6 deletions environments/conda-linux-64.lock.yml

Some generated files are not rendered by default.