Runtime Conditions #293
Comments
Note: We will likely need these at the "service level" as well as at the "node level"
this sounds like a scheduling problem ("run cache on nodes with >X GiB available"). Am I thinking of something different?
I was thinking more about failure modes: "Run sidekiq as long as we can talk to the database." I think we want to "fail quickly" in situations like this, so that scheduling mechanisms can quickly try to address whatever problem is going on.
Maybe a better example: "Point all traffic at production as long as the backend is online; otherwise fail over to the replica." I am unsure if this is a step in the "turing complete yaml" direction again -- this is just a thought I had.
i think of all of these as scheduling issues. something needs to monitor the jobs that were started, and if they're no longer running (if the service returns a failure code) then they need to be rescheduled. what we might need is plumbing from the "job" to the outer aurae health check / service discovery.
so, thinking about edge networking and failures: what do we do if a "node goes away"? we should have some basic guarantees that a service won't end up running in 2 places just because wireguard broke, for example.
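One way to get the "never running in 2 places" guarantee after a partition is a time-bounded lease: a node may run the service only while it holds an unexpired lease, and must stop on its own once the lease lapses, because a broken network (e.g. wireguard down) means it can no longer renew. The sketch below is illustrative only; none of these names are Aurae APIs.

```rust
use std::time::{Duration, Instant};

/// A time-bounded claim on a workload. The holder may run the service
/// only while the lease is unexpired; if the network breaks, renewal
/// fails and the holder must stop once the TTL lapses.
struct Lease {
    holder: String,
    granted_at: Instant,
    ttl: Duration,
}

impl Lease {
    /// True while the lease still authorizes running the workload.
    fn is_valid(&self, now: Instant) -> bool {
        now.duration_since(self.granted_at) < self.ttl
    }
}

fn main() {
    let granted = Instant::now();
    let lease = Lease {
        holder: "node-a".to_string(), // hypothetical node name
        granted_at: granted,
        ttl: Duration::from_secs(10),
    };
    // Both the holder and the scheduler apply the same rule: nobody runs
    // the service on an expired lease, so after a partition the old
    // holder has stopped before a new grantee can start.
    println!(
        "lease for {} valid now: {}",
        lease.holder,
        lease.is_valid(granted)
    );
}
```

The key design point is that the scheduler must wait out the full TTL before granting the lease elsewhere, so the old holder's self-termination and the new grant can never overlap.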
We will need to bake in a way for pods, cells, etc. to support generic runtime conditions that must remain true for the duration of execution.
For example, we may want an in-memory cache to "run" only as long as a configurable amount of memory is available in the system.
These conditions will likely need to be extensible. We will want the ability to check status via various mechanisms: remote APIs, network connectivity, local health checks, remote health checks, and so on.
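One extensible shape this could take is a small trait that each condition implements, with the memory example from above as a first instance. This is a hypothetical sketch, not an existing Aurae API; the trait, struct, and field names are all assumptions.

```rust
/// A condition that must remain true for the lifetime of a workload.
/// Hypothetical trait; new condition kinds (remote APIs, network
/// connectivity, health checks) would each implement it.
trait RuntimeCondition {
    fn name(&self) -> &str;
    /// Returns true while the condition still holds.
    fn check(&self) -> bool;
}

/// Example: keep an in-memory cache running only while at least
/// `min_free_bytes` of memory is available.
struct FreeMemory {
    min_free_bytes: u64,
    /// The probe is injected so the check stays testable; a real
    /// implementation would read /proc/meminfo or a cgroup stat.
    probe: fn() -> u64,
}

impl RuntimeCondition for FreeMemory {
    fn name(&self) -> &str {
        "free-memory"
    }
    fn check(&self) -> bool {
        (self.probe)() >= self.min_free_bytes
    }
}

fn main() {
    let cond = FreeMemory {
        min_free_bytes: 1 << 30,  // require at least 1 GiB free
        probe: || 2u64 << 30,     // stubbed probe: reports 2 GiB free
    };
    println!("{} holds: {}", cond.name(), cond.check()); // prints "free-memory holds: true"
}
```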
What is the best way to define these conditions in Aurae? Do we want to implement a "reverse health check" style system that follows a proof-of-exhaustion style set of checks and breaks if something fails?
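The exhaustion-style evaluation described here could look like the following: walk an ordered list of checks and break at the first one that fails, surfacing its name so the scheduler can react. The `Check` struct, check names, and `first_failure` function are illustrative assumptions, not Aurae APIs.

```rust
/// One named runtime check; the probe returns true while it holds.
/// Hypothetical type for illustration only.
struct Check {
    name: &'static str,
    probe: Box<dyn Fn() -> bool>,
}

/// Proof-of-exhaustion evaluation: returns the name of the first
/// failing condition, or None if every condition still holds.
fn first_failure(checks: &[Check]) -> Option<&'static str> {
    checks.iter().find(|c| !(c.probe)()).map(|c| c.name)
}

fn main() {
    let checks = vec![
        Check { name: "database-reachable", probe: Box::new(|| true) },
        Check { name: "free-memory", probe: Box::new(|| false) }, // simulated failure
        Check { name: "remote-api", probe: Box::new(|| true) },   // never evaluated
    ];
    match first_failure(&checks) {
        Some(name) => println!("condition broke: {name}"), // stop/reschedule the workload
        None => println!("all conditions hold"),
    }
}
```

Breaking at the first failure keeps the "fail quickly" property discussed above: the workload stops as soon as any one condition lapses, rather than waiting for every check to run.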