Support max-allowed Pods for nodes #3
base: dev
Conversation
@shivramsrivastava The PR's ready.
string core_id_substr = label.substr(idx + 4, label.size() - idx - 4);
uint32_t core_id = strtoul(core_id_substr.c_str(), 0, 10);
float available_cpu_cores =
    latest_stats.cpus_stats(core_id).cpu_capacity() *
I feel a bit confused about why we need to touch the CPU stats?
Here we read the latest machine stats sample for a machine, fetch a particular core's CPU utilization and available CPU, and store it in that PU's resource descriptor. We then accumulate the combined CPU stats of all PUs (a PU is a core; a machine may have more than one PU) in the machine's resource descriptor. When deciding costs and drawing arcs, we use the accumulated available CPU from the machine's resource descriptor.
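For reference, a minimal sketch of that flow, assuming Firmament-style protobuf accessors: cpus_stats() and cpu_capacity() come from the diff above, while cpu_utilization() and the (1 - utilization) factor are assumptions, since the second operand of the multiplication is cut off in the hunk.
void AccumulateMachineCpu(const MachineStatsSample& latest_stats,
                          const std::vector<ResourceDescriptor*>& pu_rds,
                          ResourceDescriptor* machine_rd) {
  float machine_available_cpu = 0.0f;
  for (uint32_t core_id = 0; core_id < pu_rds.size(); ++core_id) {
    // Available CPU for this core: capacity scaled by the idle fraction
    // (the utilization accessor here is an assumption, not confirmed API).
    float available_cpu_cores =
        latest_stats.cpus_stats(core_id).cpu_capacity() *
        (1.0f - latest_stats.cpus_stats(core_id).cpu_utilization());
    // Store the per-core value in the PU's resource descriptor.
    pu_rds[core_id]->mutable_available_resources()->set_cpu_cores(
        available_cpu_cores);
    // Accumulate the combined CPU stats for the machine's descriptor,
    // which is what cost calculation and arc drawing later read.
    machine_available_cpu += available_cpu_cores;
  }
  machine_rd->mutable_available_resources()->set_cpu_cores(
      machine_available_cpu);
}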
}
// Running/idle task count
rd_ptr->set_num_running_tasks_below(rd_ptr->current_running_tasks_size());
rd_ptr->set_num_slots_below(FLAGS_max_tasks_per_pu);
Shouldn't we set rd.num_slots_below()?
The current Firmament code fixes the maximum number of pods that can be scheduled on a PU using the constant FLAGS_max_tasks_per_pu. Because of this, when the number of running tasks on a PU equals FLAGS_max_tasks_per_pu, the capacity of the arc from the Machine to that PU is set to zero, so we are restricted to scheduling only FLAGS_max_tasks_per_pu tasks/pods on the PU. To remove this restriction we need to set rd.num_slots_below() to the maximum number of pods that can be scheduled on that PU, and only for PUs.
The code changes could look like this (see the sketch after this list):
1) Add a new field 'max_pods' to ResourceDescriptor, which gets its value from the kubelet parameter 'max-pods' only once, when the machine is added.
2) For PU nodes only, while updating the resource descriptor, update num_slots_below to max_pods, like below.
rd.set_num_slots_below(machine_rd.max_pods());
This way we can schedule max_pods pods on that PU, not just FLAGS_max_tasks_per_pu.
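A rough sketch of the two steps, assuming ResourceDescriptor is a protobuf message with generated accessors; max_pods is the proposed new field, and the RESOURCE_PU type check and the way the kubelet's max-pods value reaches this code are illustrative assumptions rather than confirmed API.
// 1) Fill the proposed max_pods field once, when the machine is added.
void OnMachineAdded(uint32_t kubelet_max_pods, ResourceDescriptor* machine_rd) {
  machine_rd->set_max_pods(kubelet_max_pods);  // persists across scheduling rounds
}
// 2) Every scheduling round, for PU nodes only, use the persistent
//    max_pods value instead of the fixed FLAGS_max_tasks_per_pu.
void UpdatePuSlots(const ResourceDescriptor& machine_rd, ResourceDescriptor* rd) {
  if (rd->type() == ResourceDescriptor::RESOURCE_PU) {
    rd->set_num_slots_below(machine_rd.max_pods());
  }
}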
Currently max-pods is passed to num_slots_below. Need a new field?
Since num_slots_below gets changed every scheduling round, the value does not persist, so it is better to add a new field.
rd_ptr->mutable_available_resources()->set_cpu_cores(
    available_cpu_cores);
}
// Running/idle task count
353 to 324 is already done at 322 to 324, so it should be removed. When accumulator->type_ is PU, both code sections get executed, because 'other' is 'SINK' when the accumulator is 'PU'.
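To illustrate the overlap being described (the node and type names follow Firmament's flow graph conventions, but this is only a sketch, not the exact code at those line numbers):
void GatherStatsSketch(FlowGraphNode* accumulator, FlowGraphNode* other,
                       ResourceDescriptor* rd_ptr) {
  // A PU's outgoing arc leads to the SINK, so both branches fire for that arc.
  if (accumulator->type_ == FlowNodeType::PU) {
    // Running/idle task count update (the earlier section).
    rd_ptr->set_num_running_tasks_below(rd_ptr->current_running_tasks_size());
  }
  if (other->type_ == FlowNodeType::SINK) {
    // The same update again (the later section); redundant when the
    // accumulator is a PU, hence the suggestion to remove it.
    rd_ptr->set_num_running_tasks_below(rd_ptr->current_running_tasks_size());
  }
}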
353 to 354?
Yes. Sorry for the mistake.
5ba6239 to 972b86e
@shivramsrivastava PTAL
TL;DR: Schedulers should be cognizant about
Priority: Instead of discrete logic for the MAX permissible Pods on a Node, would you encourage us to have more condition-tree-based scheduling?
Schedulers could honor Pod Priority [
PredicateContext: It's a good start to have a
Suggestion: If we abstract the Rate-limit to be a type of Predicate for scheduling, we could have elegant predicate-based Poseidon scheduling. Reasons
Thanks for your feedback. This is exactly what we currently support as part of the Poseidon/Firmament scheduler: Node Filtering (Hard Constraints) -> Priority/Scoring (Soft Constraints) -> Taints -> Volume dependency. We are about to incorporate a max pods/node capability to go along with all of the above, if this is what you meant.
@deepak-vij thanks for the comment. Could I request a few reference hyperlinks to the documentation for Hard & Soft constraints?
Can you please elaborate on the Rate-Limiting predicate?
@shivramsrivastava Love the links!
Rate Limiting:
xref: