I was testing the most recent FA3 commit 74aed78 in a docker container following the testing instructions.
pytest -q -s test_flash_attn.py
I ran into the following errors multiple times after running the tests:
Consecutive testing
pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0]
FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] - RuntimeError: head_size should be a multiple of 8
FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0] - torch.OutOfMemoryError: CUDA out of memory.
Independent testing
Now if I run just the tests on their own, I get a different error for the second one:
pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0]
pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0]
FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0] - RuntimeError: head_size should be a multiple of 8
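(Side note, not something the test suite does itself: between the independent runs the GPU memory can be checked to make sure nothing is still allocated from a previous run, e.g.:)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv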
Causing succeeding tests to fail
This goes as far as invalidating otherwise succeeding tests, as can be seen in this example.
pytest test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1]
Succeeds on its own, but running a failing test ahead of it causes it to fail as well.
pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1]
FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] - RuntimeError: head_size should be a multiple of 8
FAILED test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1] - torch.OutOfMemoryError: CUDA out of memory.
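A possible way to tell genuine failures apart from contamination (just an idea, not part of the FA3 test setup) would be to run every test in its own forked process, e.g. with the third-party pytest-forked plugin:
pip install pytest-forked
pytest -q -s --forked test_flash_attn.py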
Setup
FA3 built from source with gcc 12 in the nvidia/pytorch:24.10-py3 Docker container.
Running on a Slurm node with one exclusively allocated H100.
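For reference, the build and test invocation followed the FA3 README; roughly the steps below (the full image path, the clone location, and the CC/CXX compiler override are approximations, details may differ):
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.10-py3
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
CC=gcc-12 CXX=g++-12 python setup.py install
export PYTHONPATH=$PWD
pytest -q -s test_flash_attn.py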
I would be thankful for any hints regarding these failures, the head_size errors as well as the out-of-memory ones.
Thanks.