Handle dropped webhooks for scaling up new runners #271

Open

ZainRizvi opened this issue Sep 10, 2024 · 1 comment

@ZainRizvi
Contributor

ZainRizvi commented Sep 10, 2024

There’s an issue with our runner scale-up code: if the initial webhook for starting a job gets dropped, the system will never try to provision a fresh runner for that job.

In the Meta fleet that hasn’t been a noticeable issue, since we keep enough runners of every instance type running or on standby that a few machines are still available to service the dropped requests. The LF fleet, however, has fewer jobs requested of it right now (so fewer runners that might be just finishing up a job and ready to service a request), and it also runs with no idle fleet (to reduce costs), which exposes this behavior.

To fix this, we should start regularly checking GitHub for all queued jobs and ensure we are provisioning enough runners to handle all those requests.
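A minimal sketch of the reconciliation step this implies, assuming we can enumerate queued jobs somehow (the names `QueuedJob`, `RunnerPool`, and `computeShortfall` are illustrative, not existing code in this repo):

```typescript
// Hypothetical reconciliation sketch (not the existing scale-up code):
// given the set of queued jobs, however they are enumerated, and the runners
// currently provisioned, figure out how many extra runners each label needs.

interface QueuedJob {
  id: number;
  runnerLabel: string; // e.g. "linux.2xlarge"
}

interface RunnerPool {
  label: string;
  idleOrStartingCount: number;
}

// Returns the number of additional runners to provision per label.
export function computeShortfall(
  queuedJobs: QueuedJob[],
  pools: RunnerPool[],
): Map<string, number> {
  // Count queued jobs per runner label.
  const demand = new Map<string, number>();
  for (const job of queuedJobs) {
    demand.set(job.runnerLabel, (demand.get(job.runnerLabel) ?? 0) + 1);
  }

  // Compare demand against what is already idle or starting up.
  const shortfall = new Map<string, number>();
  for (const [label, needed] of demand) {
    const available =
      pools.find((p) => p.label === label)?.idleOrStartingCount ?? 0;
    if (needed > available) {
      shortfall.set(label, needed - available);
    }
  }
  return shortfall;
}
```

The output of `computeShortfall` would then feed whatever scale-up path the webhook handler already uses.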

Example of jobs that hit this issue: https://github.com/pytorch/pytorch/actions/runs/10779508314/job/29935842319

@zxiiro
Collaborator

zxiiro commented Sep 11, 2024

I've been exploring different ways we might be able to handle this:

Method 1) Query the GitHub API for a list of queued jobs.

This is not possible with the existing GitHub API as far as I'm aware. The closest we can get is the workflow_job API, which lets you query for a specific job_id that you must know in advance. Another is the workflow_run API, which lists the jobs associated with a specific workflow run.
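For reference, a rough sketch of those two calls using `@octokit/rest` (the run and job IDs are taken from the example linked above; the client actually used by the scale-up code may differ):

```typescript
import { Octokit } from '@octokit/rest';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function demo(): Promise<void> {
  // workflow_job endpoint: requires a job_id you already know in advance.
  const { data: job } = await octokit.rest.actions.getJobForWorkflowRun({
    owner: 'pytorch',
    repo: 'pytorch',
    job_id: 29935842319, // job id from the example run linked above
  });
  console.log(job.status); // 'queued', 'in_progress', or 'completed'

  // workflow_run endpoint: lists jobs, but only for one known run_id.
  const { data: runJobs } = await octokit.rest.actions.listJobsForWorkflowRun({
    owner: 'pytorch',
    repo: 'pytorch',
    run_id: 10779508314, // run id from the example run linked above
    filter: 'latest',
  });
  const queued = runJobs.jobs.filter((j) => j.status === 'queued');
  console.log(`${queued.length} job(s) still queued in this run`);
}

demo().catch(console.error);
```

Neither call gives us "all queued jobs across the repo" without already knowing the job or run IDs.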

Method 2) Store a list of queued jobs as they come in from webhook calls and query the GitHub API to check that those jobs move on from the queued state.

This takes inspiration from the upstream ALI code, which introduced a beta feature for job retry: we store all the jobs in some kind of database (or return them to the SQS queue) and query the GitHub API for each job to confirm it has moved on from queued to in progress. This would cost us one API call per job, which could add up very quickly at PyTorch's scale; I suspect we'd be at risk of hitting the 15000 per hour enterprise account limit for the project.
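A rough sketch of the re-check step under that idea (the `TrackedJob` record, `recheck`, and the `requeue` callback are illustrative names, not the upstream ALI implementation):

```typescript
import { Octokit } from '@octokit/rest';

// For each job we stored when its webhook arrived, ask GitHub whether it is
// still queued. This is the one-API-call-per-job cost mentioned above.
interface TrackedJob {
  owner: string;
  repo: string;
  jobId: number;
}

export async function stillQueued(
  octokit: Octokit,
  job: TrackedJob,
): Promise<boolean> {
  const { data } = await octokit.rest.actions.getJobForWorkflowRun({
    owner: job.owner,
    repo: job.repo,
    job_id: job.jobId,
  });
  return data.status === 'queued';
}

// If the job is still queued after some delay, hand it back to the queue
// (or trigger scale-up) so a runner eventually gets provisioned for it.
export async function recheck(
  octokit: Octokit,
  job: TrackedJob,
  requeue: (job: TrackedJob) => Promise<void>, // e.g. an SQS SendMessage wrapper
): Promise<void> {
  if (await stillQueued(octokit, job)) {
    await requeue(job);
  }
}
```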

Without a way to get a list of queued jobs, we cannot handle this smartly, and we cannot periodically run scale-up without a payload for each job type that's still queued. The webhook only sends one message when a job is queued and does not send another even if the job is still queued hours later.
