Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the deploy.yaml workflow to run the model on CI #26

Conversation

jeancochrane
Copy link
Contributor

@jeancochrane jeancochrane commented Oct 25, 2023

This PR updates the deploy.yaml workflow added in #25 to run the model on CI using AWS Batch jobs. It also adds a cleanup-terraform.yaml workflow to delete any created AWS resources when PRs are closed.

All of the Batch resources required by PR runs are provisioned by the code added in this PR using Terraform; however, some resources that remain consistent across Batch jobs (e.g. IAM roles and policies) are instead managed outside of this repo and referenced here using Terraform data sources. We will manage these resources in the https://github.com/ccao-data/aws-infrastructure repo instead.

Base automatically changed from jeancochrane/22-infra-updates-build-and-push-a-docker-image-for-the-model to master October 25, 2023 20:13
@jeancochrane jeancochrane force-pushed the jeancochrane/23-infra-updates-add-a-workflow-to-run-the-model-on-prs-and-workflow_dispatch branch from 18b5e85 to 1686c4d Compare November 2, 2023 20:29
@jeancochrane jeancochrane force-pushed the jeancochrane/23-infra-updates-add-a-workflow-to-run-the-model-on-prs-and-workflow_dispatch branch from 1686c4d to 5de0712 Compare November 2, 2023 20:35
@jeancochrane jeancochrane force-pushed the jeancochrane/23-infra-updates-add-a-workflow-to-run-the-model-on-prs-and-workflow_dispatch branch from 5de0712 to 202f941 Compare November 2, 2023 20:40
@jeancochrane jeancochrane temporarily deployed to deploy November 2, 2023 20:40 — with GitHub Actions Inactive
@jeancochrane jeancochrane marked this pull request as ready for review November 3, 2023 14:45
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff @jeancochrane, thanks for putting this together. No major issues but some tweaks around the edges. Really excited to have this up this year.

@@ -0,0 +1,87 @@
name: Setup Terraform
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think factoring this out would be wise, considering it's a pattern we're very likely to re-use. How about we make a ccao-data/actions repo, since we're starting to replicate lots of actions in general.

#
# Images are built on every commit to a PR or main branch in order to ensure
# that the build continues to work properly, but Batch jobs are gated behind
# a `deploy` environment that requires manual approval from a codeowner.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# a `deploy` environment that requires manual approval from a codeowner.
# a `deploy` environment that requires manual approval from an @ccao-data/core-team member.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 2a85c14.

Comment on lines 14 to 17
# "*-assessment-year" are long-lived branches containing the most up-to-date
# models for a given assessment cycle, and hence we consider them to be
# main branches
branches: [master, '*-assessment-year']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: We actually probably don't need a long-lived *-assessment-year branch. We used this pattern in the GitLab days because we had a secondary staging branch on most repos. I thought we might replicate it for modeling just to keep the master branch stable, but now that our PR practices are much tighter I don't think we need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 2a85c14.

.github/workflows/deploy.yaml Show resolved Hide resolved
.github/workflows/deploy.yaml Show resolved Hide resolved
.github/workflows/deploy.yaml Show resolved Hide resolved
# workflow that provisions these resources
resource "aws_batch_job_definition" "main" {
name = var.batch_job_name
platform_capabilities = ["FARGATE"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: How difficult is it to switch to jobs run by EC2? Just curious since we may want the GPU capabilities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be feasible! I switched to Fargate for the MVP since I was finding EC2 to be difficult to debug, but I'd be interested in taking another crack at it now that I have a better sense of how the IAM permissions and networking need to be configured in the Fargate context. I think the key challenge will be making sure that we can get instance logging configured in such a way that failures are debuggable. I opened up #57 to track this effort.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Not especially worried about this yet but it may become important once we start training lots of models (December).

terraform/main.tf Show resolved Hide resolved
.github/workflows/deploy.yaml Show resolved Hide resolved
.github/workflows/cleanup-terraform.yaml Show resolved Hide resolved
@@ -0,0 +1,119 @@
#!/usr/bin/env bash
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this logic is shared between the steps that check for job startup and job completion, I factored it out into a shared script.

# derive a timeout in second units. There is no equivalent timeout for running
# jobs, because those timeouts can be set on the Batch level, whereas startup
# timeouts are not controllable by Batch
BATCH_JOB_POLL_STARTUP_MAX_RETRIES=60
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could make sense to eventually make these attributes configurable, particularly once we factor this logic out into a shared action or workflow, but for the sake of faster iteration I decided to hardcode them for now since bash argument parsing is so annoying.

@jeancochrane jeancochrane requested a review from dfsnow November 6, 2023 21:23
@jeancochrane
Copy link
Contributor Author

Requesting @dfsnow one more time to make sure I didn't miss any comments!

Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @jeancochrane. Nice work

@@ -0,0 +1,87 @@
name: Setup Terraform
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pattern in r-lib/actions works pretty well. Minor versions can change the code underlying the tag and only breaking changes would warrant a major version bump. I think it makes sense to basically factor out stuff into composite actions in their own sub-directories. Curious what you think

compute_resources {
type = "FARGATE"
min_vcpus = 0
max_vcpus = 64 # Max across all jobs, not within one job
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally makes sense and I like the flexibility to being able to change environment per repo and per branch. Nice thinking 👍

# workflow that provisions these resources
resource "aws_batch_job_definition" "main" {
name = var.batch_job_name
platform_capabilities = ["FARGATE"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Not especially worried about this yet but it may become important once we start training lots of models (December).

@jeancochrane jeancochrane merged commit f51c35b into master Nov 6, 2023
3 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/23-infra-updates-add-a-workflow-to-run-the-model-on-prs-and-workflow_dispatch branch November 6, 2023 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Infra updates] Add a workflow to run the model on PRs and workflow_dispatch
2 participants