Add support for instance cordoning #972

Open
keithduncan opened this issue Dec 7, 2021 · 2 comments
Labels
agent lifecycle Agent boot, job lifecycle, agent shutdown feature A feature request

Comments

@keithduncan
Contributor

Is your feature request related to a problem? Please describe.

Presently, when an agent is failing builds, the only way to fix it is to stop the agent (which terminates the instance) or terminate the instance directly.

In order to perform diagnosis on instances, it would be useful to be able to "cordon" an instance while stopping the agent from accepting any more jobs.

Describe the solution you'd like

Simply not dispatching to a given agent from buildkite.com would not be sufficient. Cordoning only at the agent level would prevent a replacement instance from being booted, so pool capacity would not be maintained.

Instead, infrastructure-level cordoning would take the instance out of service in the Auto Scaling group (ASG). Unlike detaching the instance from the ASG, using autoscaling:EnterStandby would keep an ASG reference to the instance, and the desired count would be maintained such that a replacement instance is booted.
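A minimal sketch of the EnterStandby call described above, assuming a boto3 Auto Scaling client is passed in. The function name `cordon_instance` and the example IDs are hypothetical. Passing `ShouldDecrementDesiredCapacity=False` is what leaves the desired count untouched, so the group should boot a replacement.

```python
def cordon_instance(autoscaling, instance_id, asg_name):
    """Move one instance to Standby without shrinking the group.

    `autoscaling` is assumed to be a boto3 Auto Scaling client, for
    example boto3.client("autoscaling"). The function name is hypothetical.
    """
    return autoscaling.enter_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        # False keeps the desired capacity unchanged, so the group
        # launches a replacement for the instance placed in Standby.
        ShouldDecrementDesiredCapacity=False,
    )
```

The client is injected rather than created inside the function so the call can be exercised against a stub before it is pointed at a real Auto Scaling group.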

The way I would expose this infrastructure-level functionality to the buildkite.com API and UI would be to include an agent lifecycle hook called cordon. If the hook is present when the agent registers with the API, a flag would be set to indicate that the agent has a cordon hook that can be invoked.

In the Elastic CI Stack’s cordon hook I would either invoke the AWS CLI directly, or use an AWS SSM Automation to stop the agent systemd job and set the instance to standby.
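As a rough sketch of the AWS CLI variant of such a hook: the two steps could be run in order, agent first, then standby. The systemd unit name `buildkite-agent`, the function names, and the way the instance ID and group name reach the hook are all assumptions. It also assumes the agent and instance lifetimes are already decoupled, so that stopping the unit does not terminate the instance.

```python
import subprocess


def cordon_commands(instance_id, asg_name, unit="buildkite-agent"):
    """Return the commands a cordon hook might run, in order.

    The unit name is an assumption; adjust it to the stack's service.
    """
    return [
        # Stop the agent first so it accepts no further jobs.
        ["systemctl", "stop", unit],
        # Then put the instance in Standby, keeping the desired capacity
        # so the Auto Scaling group boots a replacement.
        [
            "aws", "autoscaling", "enter-standby",
            "--instance-ids", instance_id,
            "--auto-scaling-group-name", asg_name,
            "--no-should-decrement-desired-capacity",
        ],
    ]


def run_cordon(instance_id, asg_name, run=subprocess.run):
    """Run each command, stopping at the first failure."""
    for cmd in cordon_commands(instance_id, asg_name):
        run(cmd, check=True)
```

Building the command list separately from running it keeps the ordering easy to review and test without touching systemd or AWS.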

Decoupling the agent and instance lifetimes may depend on the work started in #964. The solution may also need to take instances that set disconnect-after-job into consideration.

Describe alternatives you've considered

As above, keeping the agent alive but not dispatching to it is an inferior solution.

@keithduncan keithduncan added agent lifecycle Agent boot, job lifecycle, agent shutdown feature A feature request labels Dec 7, 2021
@keithduncan
Contributor Author

> Simply not dispatching to a given agent from buildkite.com would not be sufficient.

Some more thoughts on this. I think we could do both agent and instance cordoning: keep the agent around so that it shows in the UI, but in a non-dispatchable state. The key part will be to ensure the instance and agent aren’t considered "available" by the buildkite-agent-scaler, so that the pool is sized appropriately without assuming that the instance / agent is available for work.
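To illustrate the sizing concern, here is a deliberately simplified sketch. It is not the buildkite-agent-scaler's actual algorithm; the function name, its parameters, and the idea of tracking cordoned instance IDs in a set are all hypothetical.

```python
import math


def instances_to_add(jobs, agents_per_instance, instances, cordoned):
    """Return how many instances to boot, ignoring cordoned ones.

    jobs: scheduled plus running jobs (a simplification).
    instances: IDs of every instance in the pool.
    cordoned: IDs of instances that must not be counted as available.
    """
    available = len(set(instances) - set(cordoned))
    needed = math.ceil(jobs / agents_per_instance)
    return max(0, needed - available)


pool = {"i-aaa", "i-bbb"}
print(instances_to_add(4, 2, pool, cordoned=set()))      # 0
print(instances_to_add(4, 2, pool, cordoned={"i-bbb"}))  # 1
```

With nothing cordoned, the two instances cover four jobs at two agents each. Once `i-bbb` is cordoned, one more instance is needed.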

Another factor to consider when cordoning an "agent" is multiple AGENTS_PER_INSTANCE. We wouldn’t want to pull an instance with multiple agents out of service and keep dispatching to some of the agents on it. Stopping the agent completely does seem like a more reliable way to guarantee that the instance doesn’t do any work.
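A small sketch of the multi-agent case, assuming each agent can be mapped to the instance it runs on. The mapping shape and the function name are hypothetical. Cordoning one agent expands to every agent on the same instance.

```python
def agents_on_same_instance(agent_instances, target_agent):
    """Return every agent sharing the target agent's instance.

    agent_instances: dict of agent name -> instance ID (assumed shape).
    """
    instance_id = agent_instances[target_agent]
    return sorted(
        name
        for name, agent_instance in agent_instances.items()
        if agent_instance == instance_id
    )


agents = {"a-1": "i-aaa", "a-2": "i-aaa", "b-1": "i-bbb"}
print(agents_on_same_instance(agents, "a-1"))  # ['a-1', 'a-2']
```

All of the returned agents would need to stop before the instance is treated as cordoned.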

@ptarjan

ptarjan commented Feb 23, 2022

+1 to this feature request for Robinhood. We only use one agent per instance so don't have that edge case. Ideally the feature would be part of the UI next to the "Stop Agent" button.


2 participants