There's an issue with our runner scale-up code: if the initial webhook for starting a job gets dropped, the system will never try to provision a fresh runner for that job.
In the Meta fleet that hasn't been a noticeable issue, since we keep enough runners of every instance type running or in standby that a few machines are still available to service the dropped requests. However, the LF fleet has fewer jobs requested of it right now (so fewer runners that might be just finishing a job and ready to pick up a request), and it also runs with no idle fleet (to reduce costs), which is why it hits this behavior.
To fix this, we should start regularly checking GitHub for all queued jobs and ensure we are provisioning enough runners to handle all of those requests.
I've been exploring different ways we might be able to handle this:
Method 1) Query the GitHub API for a list of queued jobs.
As far as I'm aware, this is not possible with the existing GitHub API. The closest we can get is the workflow_job API, which lets you query for a specific job_id that you must know in advance. Another option is workflow_run, which lists the jobs associated with a specific workflow run. Both are sketched below.
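A rough sketch of those two closest REST endpoints (both require an ID known in advance, so neither gives us a repo-wide "all queued jobs" listing). The function names and structure here are purely illustrative, not code from the project:

```ts
const GITHUB_API = "https://api.github.com";

function ghHeaders(token: string) {
  return { Authorization: `Bearer ${token}`, Accept: "application/vnd.github+json" };
}

// workflow_job: GET /repos/{owner}/{repo}/actions/jobs/{job_id}
// Only usable if we already know the job_id (i.e. from the very webhook that may have been dropped).
async function getWorkflowJob(owner: string, repo: string, jobId: number, token: string) {
  const res = await fetch(`${GITHUB_API}/repos/${owner}/${repo}/actions/jobs/${jobId}`, {
    headers: ghHeaders(token),
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return res.json(); // includes a "status" field: "queued" | "in_progress" | "completed"
}

// workflow_run: GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs
// Lists jobs, but only for a single, already-known workflow run.
async function listJobsForRun(owner: string, repo: string, runId: number, token: string) {
  const res = await fetch(`${GITHUB_API}/repos/${owner}/${repo}/actions/runs/${runId}/jobs`, {
    headers: ghHeaders(token),
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return res.json();
}
```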
Method 2) Store a list of queued jobs as they come in from webhook calls, and query the GitHub API to check that those jobs move on from the queued state.
This takes inspiration from the upstream ALI code, which introduced a beta feature for job retries: we store all the jobs in some kind of database (or return them to the SQS queue) and query the GitHub API for each one to confirm it has moved on from queued to in progress. This costs us one API call per job, which could add up very quickly at PyTorch's scale; I suspect we'd be at risk of hitting the 15,000-requests-per-hour enterprise account limit for the project. A sketch of this polling approach follows.
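A minimal sketch of that reconciliation sweep, assuming some storage layer (a database table, or re-queued SQS messages) already holds the jobs we saw "queued" webhooks for; here it is just an in-memory list, and all names are hypothetical:

```ts
interface TrackedJob {
  owner: string;
  repo: string;
  jobId: number;
  runnerLabel: string; // which runner type the job asked for
}

// Returns the jobs that are still queued, i.e. the ones scale-up may need to
// provision fresh runners for. Note: one GitHub API call per tracked job.
async function reconcileQueuedJobs(tracked: TrackedJob[], token: string): Promise<TrackedJob[]> {
  const stillQueued: TrackedJob[] = [];
  for (const job of tracked) {
    const res = await fetch(
      `https://api.github.com/repos/${job.owner}/${job.repo}/actions/jobs/${job.jobId}`,
      { headers: { Authorization: `Bearer ${token}`, Accept: "application/vnd.github+json" } }
    );
    if (!res.ok) continue; // skip transient errors; the next sweep will retry this job
    const { status } = (await res.json()) as { status: string };
    if (status === "queued") {
      stillQueued.push(job); // no runner has picked it up yet
    }
    // "in_progress" or "completed" -> a runner got it, stop tracking
  }
  return stillQueued;
}
```

The per-job API call is exactly the cost concern above: at PyTorch's job volume a periodic sweep like this could burn through the hourly rate limit quickly.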
Without a way to get a list of queued jobs, we can't handle this smartly, and we can't periodically run scale-up without a payload for each job type that's still queued. The webhook only sends one message when a job is queued; it does not send another even if the job is still queued hours later.
Example of jobs that hit this issue: https://github.com/pytorch/pytorch/actions/runs/10779508314/job/29935842319