
[BE] add integration test for the generation script #741

Closed
wants to merge 5 commits

Conversation

tianyu-l
Contributor

@tianyu-l tianyu-l commented Dec 16, 2024

tianyu-l added a commit that referenced this pull request Dec 16, 2024
ghstack-source-id: 28b322d8ce59ba501e948e999d198d32d6919c56
Pull Request resolved: #741
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Dec 16, 2024
tianyu-l added a commit that referenced this pull request Dec 16, 2024
ghstack-source-id: f037c5ad8bee21bf5e5d09a0a93ef780c967a8f2
Pull Request resolved: #741
Comment on lines 98 to 107
```python
if seed is not None:
    torch.manual_seed(seed)
    # PYTHONHASHSEED can be a decimal number in the range [0, 2**32 - 1]
    os.environ["PYTHONHASHSEED"] = str(seed % 2**32)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # env var for deterministic CuBLAS
    # https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```
Contributor

Why is the set_determinism() removed from the test?

Contributor Author

This call of set_determinism() was added before @wconstab's PR to make RNG right, which broke this line. Given how seed is explicitly used here, I'm not sure whether the code over there applies (would it still be correct?). Need @XilunWu's help on understanding more.

This PR tries to restore the behavior before the "BC-breaking change" and guard on its running.

Contributor

Why is the set_determinism change for correct RNG affecting this code? I would hope that calling `set_determinism(seed, deterministic=True)` would be equivalent to the manual setup done here. What is the issue?

Contributor Author

discussed offline; refactored to use set_determinism

tianyu-l added a commit that referenced this pull request Dec 17, 2024
ghstack-source-id: 431ed69d0babd8b16f1488b6bb8f90aea4c6f983
Pull Request resolved: #741
@tianyu-l tianyu-l requested a review from fegin December 17, 2024 00:52
@@ -147,6 +141,8 @@ def test_generate(

```python
    # sequences would require https://github.com/pytorch/torchtitan/pull/686
    apply_tp_minus_sp(model, world_mesh["tp"])

    utils.set_determinism(world_mesh, device, seed, deterministic=(seed is not None))
```
Contributor

"seed is not None" is equiavlent to the old logic, but i still think its better to separate the concepts.

it seems mandatory to use a seed for the token sampler so that each rank uses the same token. However, it should be optional to force kernels to run in the slower 'deterministic' mode. So it should be possible to run with a seed but not assume deterministic=True
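The separation argued for here can be sketched as follows (a toy model, not torchtitan's actual `utils.set_determinism`; it returns the actions a real implementation would perform, so the decision logic is visible without torch):

```python
def set_determinism(seed=None, deterministic=False):
    """Toy sketch: seeding and deterministic kernels are independent knobs."""
    actions = []
    if seed is not None:
        # seeding alone keeps every rank's token sampler in sync
        actions.append(("manual_seed", seed))             # torch.manual_seed(seed)
        actions.append(("PYTHONHASHSEED", seed % 2**32))  # os.environ[...]
    if deterministic:
        # only opt into the slower deterministic kernels when asked
        actions.append(("use_deterministic_algorithms", True))
        actions.append(("CUBLAS_WORKSPACE_CONFIG", ":4096:8"))
    return actions

# seeded but fast: ranks sample the same tokens, kernels stay non-deterministic
seeded_only = set_determinism(seed=42)
# seeded and fully deterministic, matching deterministic=(seed is not None)
fully_det = set_determinism(seed=42, deterministic=True)
```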

"--optimizer.early_step_in_backward",
],
[
# placeholder for the generation script's generate step
Contributor

is this part WIP?

Contributor Author

No, I wrote it this way because for generation, the first step is to call run_llama_train.sh and the second step is to call the generating script
https://github.com/pytorch/torchtitan/pull/741/files#diff-3b751e36d12b5fa68ae66727b4d8c6ef2cce12d4f2444b46fd882942c9f4a87fR441-R447

To make it less hacky I think we need some surgery to the file & classes.

Contributor

@fegin fegin left a comment

LGTM

@tianyu-l
Contributor Author

Encountered ghstack issues, reopening in #749

@tianyu-l tianyu-l closed this Dec 18, 2024
@tianyu-l tianyu-l deleted the gh/tianyu-l/30/head branch December 18, 2024 00:42
4 participants