There's an issue with our runner scale-up code: if the initial webhook for starting a job gets dropped, the system will never try to provision a fresh runner for that job.
In the Meta fleet that hasn't been a noticeable issue, since we keep enough runners of every instance type running or in standby that a few machines are still available to service the dropped requests. However, the LF fleet has fewer jobs requested of it right now (so fewer runners that might be just finishing a job and ready to pick up a request), and it also runs with no idle fleet (to reduce costs), which is why it hits this behavior.
To fix this, we should start regularly checking GitHub for all queued jobs and ensure we are provisioning enough runners to handle all of those requests.
I've been exploring different ways we might be able to handle this:
Method 1) Query the GitHub API for a list of queued jobs.
As far as I'm aware, this is not possible with the existing GitHub API. The closest we can get is the workflow_job API, which lets you query for a specific job_id that you must know in advance. Another option is workflow_run, which lists the jobs associated with a specific workflow run. Both are sketched below.
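A rough sketch of those two closest REST endpoints (both require an ID known in advance, so neither gives us a repo-wide "all queued jobs" listing). The function names and structure here are purely illustrative, not code from the project:

```ts
const GITHUB_API = "https://api.github.com";

function ghHeaders(token: string) {
  return { Authorization: `Bearer ${token}`, Accept: "application/vnd.github+json" };
}

// workflow_job: GET /repos/{owner}/{repo}/actions/jobs/{job_id}
// Only usable if we already know the job_id (i.e. from the very webhook that may have been dropped).
async function getWorkflowJob(owner: string, repo: string, jobId: number, token: string) {
  const res = await fetch(`${GITHUB_API}/repos/${owner}/${repo}/actions/jobs/${jobId}`, {
    headers: ghHeaders(token),
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return res.json(); // includes a "status" field: "queued" | "in_progress" | "completed"
}

// workflow_run: GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs
// Lists jobs, but only for a single, already-known workflow run.
async function listJobsForRun(owner: string, repo: string, runId: number, token: string) {
  const res = await fetch(`${GITHUB_API}/repos/${owner}/${repo}/actions/runs/${runId}/jobs`, {
    headers: ghHeaders(token),
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return res.json();
}
```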
Method 2) Store a list of queued jobs as they come in from webhook calls, and query the GitHub API to check that those jobs move on from the queued state.
This takes inspiration from the upstream ALI code, which introduced a beta feature for job retries: we store all the jobs in some kind of database (or return them to the SQS queue) and query the GitHub API for each one to confirm it has moved on from queued to in progress. This costs us one API call per job, which could add up very quickly at PyTorch's scale; I suspect we'd be at risk of hitting the 15,000-requests-per-hour enterprise account limit for the project. A sketch of this polling approach follows.
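A minimal sketch of that reconciliation sweep, assuming some storage layer (a database table, or re-queued SQS messages) already holds the jobs we saw "queued" webhooks for; here it is just an in-memory list, and all names are hypothetical:

```ts
interface TrackedJob {
  owner: string;
  repo: string;
  jobId: number;
  runnerLabel: string; // which runner type the job asked for
}

// Returns the jobs that are still queued, i.e. the ones scale-up may need to
// provision fresh runners for. Note: one GitHub API call per tracked job.
async function reconcileQueuedJobs(tracked: TrackedJob[], token: string): Promise<TrackedJob[]> {
  const stillQueued: TrackedJob[] = [];
  for (const job of tracked) {
    const res = await fetch(
      `https://api.github.com/repos/${job.owner}/${job.repo}/actions/jobs/${job.jobId}`,
      { headers: { Authorization: `Bearer ${token}`, Accept: "application/vnd.github+json" } }
    );
    if (!res.ok) continue; // skip transient errors; the next sweep will retry this job
    const { status } = (await res.json()) as { status: string };
    if (status === "queued") {
      stillQueued.push(job); // no runner has picked it up yet
    }
    // "in_progress" or "completed" -> a runner got it, stop tracking
  }
  return stillQueued;
}
```

The per-job API call is exactly the cost concern above: at PyTorch's job volume a periodic sweep like this could burn through the hourly rate limit quickly.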
Without a way to get a list of queued jobs, we can't handle this smartly, and we can't periodically run scale-up without a payload for each job type that's still queued. The webhook only sends one message when a job is queued; it does not send another even if the job is still queued hours later.
Example of jobs that hit this issue: https://github.com/pytorch/pytorch/actions/runs/10779508314/job/29935842319