FA3 consecutive failing tests after first failure #1451

Open
benjamin-kroeger opened this issue Jan 21, 2025 · 0 comments
benjamin-kroeger commented Jan 21, 2025

I was testing the most recent FA3 commit 74aed78 in a docker container following the testing instructions.

pytest -q -s test_flash_attn.py

I repeatedly ran into the following failures when running the tests:

Consecutive testing

pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0]

FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] - RuntimeError: head_size should be a multiple of 8
FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0] - torch.OutOfMemoryError: CUDA out of memory.

Independent testing

Now if I run the two tests on their own, I get a different error for the second one (a minimal direct-call repro sketch follows the output below):

pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0]

FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] - RuntimeError: head_size should be a multiple of 8

pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0]

FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0] - RuntimeError: head_size should be a multiple of 8
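For what it's worth, the head_size error can also be triggered outside of pytest with a direct call. This is a minimal repro sketch under two assumptions of mine: that the 59 in the failing test id is the head dimension, and that flash_attn_interface.flash_attn_func is the FA3 entry point from the hopper build.

```python
# Repro sketch (my assumptions, not taken from the test suite):
# - the FA3 (hopper) build exposes flash_attn_interface.flash_attn_func
# - the "59" in the failing test id is the head dimension
import torch
from flash_attn_interface import flash_attn_func

# (batch, seqlen, nheads, headdim); headdim 59 is not a multiple of 8
q = torch.randn(1, 239, 4, 59, dtype=torch.float16, device="cuda")
k = torch.randn(1, 239, 4, 59, dtype=torch.float16, device="cuda")
v = torch.randn(1, 239, 4, 59, dtype=torch.float16, device="cuda")

# Expected to raise: RuntimeError: head_size should be a multiple of 8
flash_attn_func(q, k, v, causal=False)
```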

Causing succeeding tests to fail

This goes as far as invalidating otherwise passing tests, as the following example shows.

On its own, pytest test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1] succeeds, but running a failing test ahead of it causes it to fail as well (a conftest fixture sketch for checking whether leaked GPU memory is to blame follows the output below):

pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1]

FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] - RuntimeError: head_size should be a multiple of 8
FAILED test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1] - torch.OutOfMemoryError: CUDA out of memory.
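Since the OOM only shows up when a test runs after an earlier failure, leftover GPU allocations from the failed test look like a plausible cause to me. A minimal sketch of what I mean, an autouse fixture in a conftest.py next to test_flash_attn.py (my own addition, not part of the FA3 test suite) that frees cached CUDA memory after every test:

```python
# conftest.py (hypothetical, placed next to test_flash_attn.py)
# Frees cached CUDA memory after every test so that allocations left behind
# by a failing test cannot push a later, otherwise passing test into OOM.
import gc

import pytest
import torch


@pytest.fixture(autouse=True)
def free_cuda_memory_between_tests():
    yield  # run the test
    gc.collect()              # drop Python references to tensors from the previous test
    torch.cuda.synchronize()  # make sure all queued kernels have finished
    torch.cuda.empty_cache()  # return cached blocks to the allocator
```

Running each test in its own process (e.g. with the pytest-forked plugin's --forked option) would isolate the tests even more strictly, at the cost of a slower run.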

Setup

FA3 built from source with gcc 12 inside the nvidia/pytorch:24.10-py3 Docker container.
Running on a Slurm node with one exclusively allocated H100.

I would be thankful for any hints regarding these crashes, both the head_size errors and the out-of-memory failures.
Thanks.
