Update the deploy.yaml workflow to run the model on CI #26
Conversation
Force-pushed the …low-to-run-the-model-on-prs-and-workflow_dispatch branch from 18b5e85 to 1686c4d, then to 5de0712, then to 202f941.
Awesome stuff @jeancochrane, thanks for putting this together. No major issues but some tweaks around the edges. Really excited to have this up this year.
@@ -0,0 +1,87 @@
name: Setup Terraform
I think factoring this out would be wise, considering it's a pattern we're very likely to re-use. How about we make a ccao-data/actions repo, since we're starting to replicate lots of actions in general?
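As a rough illustration of the composite-action pattern (the repo path, inputs, and steps below are hypothetical, not an actual implementation), a shared setup-terraform action might look something like this:

```yaml
# Hypothetical ccao-data/actions/setup-terraform/action.yaml (sketch only)
name: Setup Terraform
description: Install Terraform and initialize the working directory
inputs:
  terraform-version:
    description: Terraform version to install
    required: false
    default: "1.5.x"
runs:
  using: composite
  steps:
    # hashicorp/setup-terraform is a real upstream action; the wrapper
    # around it here is only a sketch
    - uses: hashicorp/setup-terraform@v2
      with:
        terraform_version: ${{ inputs.terraform-version }}
    - name: Initialize Terraform
      run: terraform init
      shell: bash
```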
.github/workflows/deploy.yaml
Outdated
#
# Images are built on every commit to a PR or main branch in order to ensure
# that the build continues to work properly, but Batch jobs are gated behind
# a `deploy` environment that requires manual approval from a codeowner.
Suggested change:
# a `deploy` environment that requires manual approval from a codeowner.
# a `deploy` environment that requires manual approval from an @ccao-data/core-team member.
Fixed in 2a85c14.
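For context on how the gate described in that comment works: pointing a job at a GitHub environment that has required reviewers configured pauses the job until someone approves it. A minimal sketch (the job and step names are made up for illustration):

```yaml
jobs:
  run-model:
    runs-on: ubuntu-latest
    # The job pauses here until a required reviewer on the `deploy`
    # environment (e.g. a member of @ccao-data/core-team) approves the run
    environment: deploy
    steps:
      - name: Submit Batch job
        run: echo "Runs only after manual approval"
```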
.github/workflows/deploy.yaml
Outdated
# "*-assessment-year" are long-lived branches containing the most up-to-date | ||
# models for a given assessment cycle, and hence we consider them to be | ||
# main branches | ||
branches: [master, '*-assessment-year'] |
suggestion: We actually probably don't need a long-lived *-assessment-year branch. We used this pattern in the GitLab days because we had a secondary staging branch on most repos. I thought we might replicate it for modeling just to keep the master branch stable, but now that our PR practices are much tighter I don't think we need it.
Removed in 2a85c14.
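With the long-lived *-assessment-year branches dropped, the push trigger reduces to something along these lines (a sketch; the other triggers shown are assumptions based on the PR description, and the actual deploy.yaml may differ):

```yaml
on:
  push:
    branches: [master]
  # pull_request and workflow_dispatch triggers assumed here; see the
  # actual deploy.yaml for the final configuration
  pull_request:
  workflow_dispatch:
```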
# workflow that provisions these resources
resource "aws_batch_job_definition" "main" {
  name                  = var.batch_job_name
  platform_capabilities = ["FARGATE"]
question: How difficult is it to switch to jobs run by EC2? Just curious since we may want the GPU capabilities.
I think it should be feasible! I switched to Fargate for the MVP since I was finding EC2 to be difficult to debug, but I'd be interested in taking another crack at it now that I have a better sense of how the IAM permissions and networking need to be configured in the Fargate context. I think the key challenge will be making sure that we can get instance logging configured in such a way that failures are debuggable. I opened up #57 to track this effort.
Great! Not especially worried about this yet but it may become important once we start training lots of models (December).
@@ -0,0 +1,119 @@
#!/usr/bin/env bash
Since this logic is shared between the steps that check for job startup and job completion, I factored it out into a shared script.
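In workflow terms, sharing the script between the two checks might look roughly like the following; the script path, step IDs, and the POLL_FOR variable are all hypothetical, shown only to illustrate the shared-script pattern:

```yaml
steps:
  - name: Wait for Batch job to start
    run: ./.github/scripts/poll-batch-job-status.sh  # hypothetical path
    env:
      BATCH_JOB_ID: ${{ steps.submit-job.outputs.job-id }}  # hypothetical step output
      POLL_FOR: startup

  - name: Wait for Batch job to finish
    run: ./.github/scripts/poll-batch-job-status.sh
    env:
      BATCH_JOB_ID: ${{ steps.submit-job.outputs.job-id }}
      POLL_FOR: completion
```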
# derive a timeout in second units. There is no equivalent timeout for running
# jobs, because those timeouts can be set on the Batch level, whereas startup
# timeouts are not controllable by Batch
BATCH_JOB_POLL_STARTUP_MAX_RETRIES=60
It could make sense to eventually make these attributes configurable, particularly once we factor this logic out into a shared action or workflow, but for the sake of faster iteration I decided to hardcode them for now since bash argument parsing is so annoying.
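If and when this gets factored into a shared composite action, one way to make the values configurable without any bash argument parsing would be to expose them as action inputs and pass them to the script as environment variables. A sketch under that assumption (names other than BATCH_JOB_POLL_STARTUP_MAX_RETRIES are hypothetical):

```yaml
inputs:
  startup-max-retries:
    description: Number of times to poll before giving up on job startup
    required: false
    default: "60"
runs:
  using: composite
  steps:
    - name: Poll Batch job status
      # github.action_path points at the action's own directory
      run: ${{ github.action_path }}/poll-batch-job-status.sh
      shell: bash
      env:
        BATCH_JOB_POLL_STARTUP_MAX_RETRIES: ${{ inputs.startup-max-retries }}
```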
Requesting @dfsnow one more time to make sure I didn't miss any comments!
Looks great @jeancochrane. Nice work
@@ -0,0 +1,87 @@
name: Setup Terraform
I think the pattern in r-lib/actions works pretty well. Minor versions can change the code underlying the tag, and only breaking changes would warrant a major version bump. I think it makes sense to basically factor stuff out into composite actions in their own sub-directories. Curious what you think.
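Under that scheme, each composite action lives in its own sub-directory of the shared repo and callers pin to a floating major version tag, r-lib/actions style. For example (the repo path and input name are hypothetical):

```yaml
steps:
  - uses: actions/checkout@v4
  # The v1 tag would be moved forward for non-breaking releases and only
  # bumped to v2 for breaking changes
  - uses: ccao-data/actions/setup-terraform@v1
    with:
      terraform-version: "1.5.x"
```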
compute_resources {
  type      = "FARGATE"
  min_vcpus = 0
  max_vcpus = 64 # Max across all jobs, not within one job
Totally makes sense, and I like the flexibility of being able to change the environment per repo and per branch. Nice thinking 👍
This PR updates the deploy.yaml workflow added in #25 to run the model on CI using AWS Batch jobs. It also adds a cleanup-terraform.yaml workflow to delete any created AWS resources when PRs are closed.

All of the Batch resources required by PR runs are provisioned by the code added in this PR using Terraform; however, some resources that remain consistent across Batch jobs (e.g. IAM roles and policies) are managed outside of this repo and referenced here using Terraform data sources. We will manage these resources in the https://github.com/ccao-data/aws-infrastructure repo instead.
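For a sense of what the cleanup workflow does: the general shape is a job triggered when a PR closes that runs terraform destroy against that PR's resources. The sketch below is illustrative only; the workspace naming is an assumption, credential setup is omitted, and the real cleanup-terraform.yaml may differ.

```yaml
name: Cleanup Terraform
on:
  pull_request:
    types: [closed]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v2
      # AWS credential configuration omitted for brevity
      - name: Destroy PR resources
        run: |
          terraform init
          # Hypothetical workspace naming keyed on the PR number
          terraform workspace select "pr-${{ github.event.pull_request.number }}"
          terraform destroy -auto-approve
```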