[Autotuner] Feature: add --cpu_budget and --timeout_per_trial #2395

Draft
luarss wants to merge 15 commits into base: master


luarss
Contributor

@luarss luarss commented Sep 29, 2024

Rationale

  • More intuitive for end users who might just want to specify the total number of CPUs to be blocked.
  • timeout_per_trial is different from overall timeout.

TODO

  • Tests for --cpu_budget -> verify the timeout is hit.
  • Tests for timeout per trial and timeout -> does it actually stop the entire training?
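One possible shape for the per-trial timeout test is sketched below. It is a stand-in under stated assumptions, not the autotuner's mechanism: the helper `run_trial` is hypothetical, and a plain subprocess replaces a real ORFS flow so the check stays fast.

```python
import subprocess
import sys
import time


def run_trial(cmd: list[str], timeout_per_trial: float) -> bool:
    """Hypothetical helper: True if the trial finished in time, False if cut off."""
    try:
        subprocess.run(cmd, timeout=timeout_per_trial, check=False)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return False


# Stand-ins for a slow and a fast ORFS run (assumption: any command works here).
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
fast = [sys.executable, "-c", "pass"]

start = time.monotonic()
slow_finished = run_trial(slow, timeout_per_trial=0.5)
elapsed = time.monotonic() - start
fast_finished = run_trial(fast, timeout_per_trial=30.0)
```

A test along these lines would assert that the slow trial is reported as cut off well before its natural 5 s runtime, while the fast trial completes normally. Whether hitting a per-trial timeout should also stop the whole experiment is the open question in the TODO and is not answered by this sketch.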

@vvbandeira vvbandeira marked this pull request as draft October 7, 2024 19:57
@luarss luarss self-assigned this Oct 8, 2024
@luarss luarss added the autotuner Flow autotuner label Oct 8, 2024
@luarss luarss changed the title WIP: [Autotuner] Feature: add --cpu_budget and --timeout_per_trial [Autotuner] Feature: add --cpu_budget and --timeout_per_trial Oct 10, 2024
@luarss luarss marked this pull request as ready for review October 12, 2024 13:53
@luarss luarss requested a review from vvbandeira October 12, 2024 13:54
@luarss luarss marked this pull request as draft October 16, 2024 09:30
@luarss luarss marked this pull request as ready for review October 17, 2024 00:35
@oharboe
Collaborator

oharboe commented Nov 9, 2024

The various ORFS stages have vastly different memory and CPU needs.

How does the user characterize and balance this?

@vvbandeira
Member

> The various ORFS stages have vastly different memory and CPU needs.
>
> How does the user characterize and balance this?

The intended usage for these knobs is to limit experiment runtime (and consequently $$ budget). These knobs do not limit how many resources a given ORFS run has access to.

CPU budget is intended to stop the experiment after the budget is spent.
Trial timeout will limit the runtime of each complete ORFS flow; this way, if your experiment requires 1k ORFS runs, you know you will use at most 1k * timeout_per_trial units of time.
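The arithmetic behind both knobs can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `CpuBudget` class, its field names, and the choice of CPU-seconds as the unit are assumptions made for the example.

```python
from dataclasses import dataclass


def worst_case_runtime(num_trials: int, timeout_per_trial: float) -> float:
    """Upper bound on summed trial runtime, in the same unit as the timeout."""
    return num_trials * timeout_per_trial


@dataclass
class CpuBudget:
    """Hypothetical tracker: stop launching trials once the budget is spent."""

    budget_cpu_seconds: float
    spent_cpu_seconds: float = 0.0

    def charge(self, cpus: int, wall_seconds: float) -> None:
        # A trial that held `cpus` cores for `wall_seconds` costs cpus * wall_seconds.
        self.spent_cpu_seconds += cpus * wall_seconds

    @property
    def exhausted(self) -> bool:
        return self.spent_cpu_seconds >= self.budget_cpu_seconds


# 1k runs capped at 600 s each can never sum to more than 600,000 s.
bound = worst_case_runtime(1000, 600.0)

# One CPU-hour budget; each illustrative trial holds 4 CPUs for 300 s.
budget = CpuBudget(budget_cpu_seconds=3600.0)
trials_run = 0
while not budget.exhausted:
    budget.charge(cpus=4, wall_seconds=300.0)
    trials_run += 1
```

The bound is on summed trial runtime; if trials run in parallel, wall-clock time would be lower. Neither quantity caps the resources available to a single ORFS run, in line with the comment above.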

@luarss luarss force-pushed the feat/autotuner-budget branch from 6c1fb2d to c15d6ed Compare December 24, 2024 03:46
luarss added 15 commits January 10, 2025 14:19
Signed-off-by: Jack Luar <[email protected]>
… cpubudget prompt from hrs->seconds

Signed-off-by: Jack Luar <[email protected]>
@luarss luarss force-pushed the feat/autotuner-budget branch from c15d6ed to bff5e2e Compare January 10, 2025 14:20
@luarss luarss marked this pull request as draft January 10, 2025 17:36
Labels
autotuner Flow autotuner
3 participants