
[WIP] Allow benchmark between multiple configs #703

Open
wants to merge 1 commit into base: gh/H-Huang/17/base

Conversation

@H-Huang (Member) commented Nov 26, 2024

Stack from ghstack (oldest at bottom):

This is WIP and needs to be cleaned up, but I'm creating the PR to solicit feedback.

There isn't an easy way to run multiple configurations together and view their metrics in a single place. We already save the metrics to TensorBoard, so this PR adds a data structure to retrieve those metrics and display them in a table. It is an enhancement to test_runner.py.

Example:

# In test_runner.py there are 3 different configurations whose metrics we want to compare
OverrideDefinitions(
    [
        [
            "--experimental.pipeline_parallel_degree 2",
            "--training.data_parallel_shard_degree 2",
            "--metrics.enable_tensorboard",
        ],
        [
            "--training.data_parallel_shard_degree 4",
            "--metrics.enable_tensorboard",
        ],
        [
            "--training.tensor_parallel_degree 4",
            "--metrics.enable_tensorboard",
        ],
    ],
    "example",
    "my_example",
    ngpu=4,
),

This produces the following output:

Run ID: 0, args: ['--experimental.pipeline_parallel_degree 2', '--training.data_parallel_shard_degree 2', '--metrics.enable_tensorboard']
Run ID: 1, args: ['--training.data_parallel_shard_degree 4', '--metrics.enable_tensorboard']
Run ID: 2, args: ['--training.tensor_parallel_degree 4', '--metrics.enable_tensorboard']

           | wps                          | mfu(%)                       | memory/max_active(GiB)       | memory/max_active(%)         | memory/max_reserved(%)       | loss_metrics/global_avg_loss | loss_metrics/global_max_loss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0          | 55928.27734375               | 1.6978793144226074           | 0.7439417839050293           | 0.940051257610321            | 1.0439579486846924           | 7.137195110321045            | 7.143610954284668           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1          | 97578.0234375                | 2.9622886180877686           | 1.9954571723937988           | 2.521476984024048            | 2.6999762058258057           | 7.031539440155029            | 7.066774845123291           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2          | 10559.697265625              | 0.3205729126930237           | 1.0774049758911133           | 1.3614182472229004           | 1.4314316511154175           | 7.13456916809082             | 7.13456916809082            
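
The retrieval code itself isn't shown in this description, but a minimal sketch of the idea, assuming a per-run TensorBoard log directory and using TensorBoard's EventAccumulator, could look like the following. The class and function names here are placeholders, not necessarily what the PR implements.

# Hypothetical sketch -- class/function names and log-dir layout are assumptions,
# not necessarily what this PR implements.
from typing import Dict, Sequence

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator


class MetricRetriever:
    """Reads the final value of every scalar tag from a TensorBoard log directory."""

    def __init__(self, log_dir: str):
        self.log_dir = log_dir

    def get_metrics(self) -> Dict[str, float]:
        acc = EventAccumulator(self.log_dir)
        acc.Reload()  # parse the event files on disk
        metrics: Dict[str, float] = {}
        for tag in acc.Tags()["scalars"]:
            events = acc.Scalars(tag)
            if events:
                metrics[tag] = events[-1].value  # value at the last recorded step
        return metrics


def print_metrics(runs: Sequence[Dict[str, float]]) -> None:
    """Prints one row per run, one column per metric tag."""
    if not runs:
        return
    tags = list(runs[0].keys())
    header = f"{'':<10} | " + " | ".join(f"{t:<28}" for t in tags)
    print(header)
    for run_id, metrics in enumerate(runs):
        print("-" * len(header))
        row = f"{run_id:<10} | " + " | ".join(
            f"{metrics.get(t, float('nan')):<28}" for t in tags
        )
        print(row)

Taking only the last recorded step keeps the table to one number per metric per run; averaging over the last few steps would be a reasonable alternative for noisy metrics like wps.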

H-Huang added a commit that referenced this pull request Nov 26, 2024
ghstack-source-id: 613f40b90f936d11f3ffe5f523d878ba7150f18f
Pull Request resolved: #703
@H-Huang H-Huang marked this pull request as ready for review November 26, 2024 22:09
@H-Huang H-Huang changed the title Allow benchmark between multiple configs [WIP] Allow benchmark between multiple configs Nov 26, 2024
@wconstab (Contributor) left a comment


I think the MetricRetriever and print_metrics are nice. But I'm kinda curious why you felt they were needed. You can also just fire up tensorboard and compare all the metrics that way, right? (and then you get graphs of loss too).

Actually iirc I once tried to compare titan runs in tensorboard and I found the way the logdir was set up made it a bit annoying. It might be worth addressing that pain point if that is a pain point. (I might have just been running things in a dumb way too).
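
For reference, the side-by-side comparison can also be done in the TensorBoard UI by pointing it at a parent directory (each immediate subdirectory shows up as a separate run) or, if the installed TensorBoard supports it, by naming runs explicitly with --logdir_spec. The paths below are hypothetical:

tensorboard --logdir ./outputs/tb
tensorboard --logdir_spec=pp2_dp2:./outputs/tb/0,dp4:./outputs/tb/1,tp4:./outputs/tb/2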

@@ -10,7 +10,9 @@
 import subprocess
 from collections import defaultdict
 from dataclasses import dataclass
-from typing import Sequence
+from typing import Any, Dict, Sequence
A contributor commented on this hunk:
RE the binary files above, not sure if you meant to include those in the PR
