Feature request: implement client-defined deployment grace period #144

Open
andy108369 opened this issue Nov 13, 2023 · 3 comments
Labels
repo/provider Akash provider-services repo issues

Comments

@andy108369
Contributor

andy108369 commented Nov 13, 2023

I think the alternative (client-defined) proposal would be ideal: clients could configure the timeout themselves in their SDL (say, deployment_grace_period or tolerate_downtime in the deployment manifest). The timeout here is how long a lease may stay down because it cannot be redeployed while its worker node is down.

This way the clients could specify the amount of time (in hours/days) they can tolerate their app being down.

It could also be useful when a provider has been running a deployment with persistent storage and the client isn't willing to lose it, and can accept downtime measured even in days just to get the data back. (I know they should back up the data and monitor the backups, but many don't, and backups can go stale, break, or get corrupted.)
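
A minimal sketch of how such a client-defined value might be interpreted, assuming the field holds a short duration string like 12h or 3d. The field name comes from the proposal above; the string format and both function names are hypothetical and are not part of any existing Akash SDL or provider API.

```python
import re
from datetime import timedelta

# Hypothetical unit suffixes for a proposed `deployment_grace_period` value.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_grace_period(value: str) -> timedelta:
    """Parse strings such as '30m', '12h' or '3d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported grace period: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def should_close_lease(downtime: timedelta, grace_period: timedelta) -> bool:
    """Close the lease only once downtime exceeds what the client tolerates."""
    return downtime > grace_period
```

With a value of 3d, a deployment that has been down for 30 minutes, or even two days, would be kept; only downtime beyond three days would allow the lease to be closed.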

andy108369 added the repo/provider and awaiting-triage labels on Nov 13, 2023
@SGC41

SGC41 commented Nov 13, 2023

I think the default for such a setting should also be pretty high...
The only good reason to kill a deployment over some downtime would be some sort of HA setup, where replacements will have taken over the work anyway.

All these fast lease closures give tenants grief,
especially in the case of persistent storage, where they could have been working on their deployment for weeks, making potentially massive changes that might exist only in the persistent storage,

which is then destroyed after just 10 or 30 minutes of issues on the provider side.

@vpavlin

vpavlin commented Apr 29, 2024

I am not 100% sure this is the same case, but I'll put it here just in case it is:)

I had a deployment which takes ~20mins to start (syncing some on-chain data) and then I needed to update the image. I screwed up the image format, so the deployment failed. I then fixed the image format, but it seemed like the scheduler/node did not pick up the fix (probably due to backoff) before the lease got automatically closed.

For this case I could definitely see myself setting this grace period to an hour if that meant I wouldn't have to start from scratch after a small mess-up :)
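
To illustrate how a retry backoff could outlast a short lease timeout, here is a rough sketch. It assumes a doubling backoff that starts at 10 seconds and is capped at 5 minutes. Those numbers, and the function names, are illustrative assumptions; the comment above only guesses that backoff was involved, and the actual scheduler behavior may differ.

```python
from itertools import count


def retry_times(base_s: int = 10, cap_s: int = 300):
    """Yield retry times in seconds since the first failure (assumed backoff)."""
    elapsed = 0
    for attempt in count():
        # Delay doubles on each attempt until it reaches the cap.
        elapsed += min(base_s * 2 ** attempt, cap_s)
        yield elapsed


def next_retry_after(fix_time_s: int) -> int:
    """Return the first retry time at or after the moment the fix was pushed."""
    for t in retry_times():
        if t >= fix_time_s:
            return t


def fix_picked_up(fix_time_s: int, lease_timeout_s: int) -> bool:
    """True if a retry happens after the fix but before the lease is closed."""
    return next_retry_after(fix_time_s) <= lease_timeout_s
```

Under these assumed parameters the retries land at 10, 30, 70, 150, 310, 610, ... seconds. A fix pushed at 400 seconds is first retried at 610 seconds, which misses a 10-minute lease timeout but fits comfortably within a one-hour grace period.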

@andy108369
Contributor Author

andy108369 commented Nov 18, 2024

FWIW: the provider-defined deployment grace period has been adjustable since provider v0.6.4:
akash-network/provider@99cb9ac

I'll update the helm charts to support this in
akash-network/helm-charts#289


4 participants