FA3 consecutive failing tests after first failure #1451

Open
benjamin-kroeger opened this issue Jan 21, 2025 · 0 comments
benjamin-kroeger commented Jan 21, 2025

I was testing the most recent FA3 commit 74aed78 in a docker container following the testing instructions.

pytest -q -s test_flash_attn.py

I repeatedly ran into the following failures when running the tests:

Consecutive testing

pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0]

FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] - RuntimeError: head_size should be a multiple of 8
FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0] - torch.OutOfMemoryError: CUDA out of memory.

Independent testing

Now if I run the two tests on their own, I get a different error for the second one (a minimal direct-call repro sketch follows the output below):

pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0]

FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] - RuntimeError: head_size should be a multiple of 8

pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0]

FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-True-dtype0] - RuntimeError: head_size should be a multiple of 8
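For what it's worth, the head_size error can also be triggered outside of pytest with a direct call. This is a minimal repro sketch under two assumptions of mine: that the 59 in the failing test id is the head dimension, and that flash_attn_interface.flash_attn_func is the FA3 entry point from the hopper build.

```python
# Repro sketch (my assumptions, not taken from the test suite):
# - the FA3 (hopper) build exposes flash_attn_interface.flash_attn_func
# - the "59" in the failing test id is the head dimension
import torch
from flash_attn_interface import flash_attn_func

# (batch, seqlen, nheads, headdim); headdim 59 is not a multiple of 8
q = torch.randn(1, 239, 4, 59, dtype=torch.float16, device="cuda")
k = torch.randn(1, 239, 4, 59, dtype=torch.float16, device="cuda")
v = torch.randn(1, 239, 4, 59, dtype=torch.float16, device="cuda")

# Expected to raise: RuntimeError: head_size should be a multiple of 8
flash_attn_func(q, k, v, causal=False)
```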

Causing succeeding tests to fail

This goes as far as invalidating otherwise passing tests, as the following example shows.

On its own, pytest test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1] succeeds, but running a failing test ahead of it causes it to fail as well (a conftest fixture sketch for checking whether leaked GPU memory is to blame follows the output below):

pytest test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1]

FAILED test_flash_attn.py::test_flash_attn_race_condition[1-239-59-False-dtype0] - RuntimeError: head_size should be a multiple of 8
FAILED test_flash_attn.py::test_flash_attn_combine[155-2048-256-dtype1] - torch.OutOfMemoryError: CUDA out of memory.
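Since the OOM only shows up when a test runs after an earlier failure, leftover GPU allocations from the failed test look like a plausible cause to me. A minimal sketch of what I mean, an autouse fixture in a conftest.py next to test_flash_attn.py (my own addition, not part of the FA3 test suite) that frees cached CUDA memory after every test:

```python
# conftest.py (hypothetical, placed next to test_flash_attn.py)
# Frees cached CUDA memory after every test so that allocations left behind
# by a failing test cannot push a later, otherwise passing test into OOM.
import gc

import pytest
import torch


@pytest.fixture(autouse=True)
def free_cuda_memory_between_tests():
    yield  # run the test
    gc.collect()              # drop Python references to tensors from the previous test
    torch.cuda.synchronize()  # make sure all queued kernels have finished
    torch.cuda.empty_cache()  # return cached blocks to the allocator
```

Running each test in its own process (e.g. with the pytest-forked plugin's --forked option) would isolate the tests even more strictly, at the cost of a slower run.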

Setup

FA3 built from source with gcc 12 inside the nvidia/pytorch:24.10-py3 Docker container.
Running on a Slurm node with one exclusively allocated H100.

I would be thankful for any hints regarding these crashes, both the head_size errors and the out-of-memory failures.
Thanks.
