misaligned memory access from transpose kernel #3701

jjsjann123 · 2025-01-14T01:24:26Z

Looks like it's caused by #3621

repro script

# CUDA devices:
#  0: NVIDIA A100 80GB PCIe
# torch version: 2.6.0a0+ecf3bae40a.nvInternal
# cuda version: 12.8
# nvfuser version: 0.2.24+git371e717
import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id2(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[1, 1024, 128], contiguity=[None, True, True], dtype=DataType.Float, is_cpu=False, stride_order=[2, 0, 1])
    T1 = fd.define_tensor(shape=[1, 32, 1024, 128], contiguity=[None, True, True, True], dtype=DataType.Float, is_cpu=False, stride_order=[3, 1, 2, 0])
    T2 = fd.ops.broadcast(T0, is_broadcast_dim=[False, True, False, False])
    S3 = fd.ops.size(T2, dim=0)
    S4 = fd.define_scalar(32, dtype=DataType.Int)
    S5 = fd.ops.size(T2, dim=2)
    S6 = fd.ops.size(T2, dim=3)
    V7 = fd.define_vector([S3, S4, S5, S6], dtype=DataType.Int)
    T8 = fd.ops.expand(T2, shape=V7)
    T9 = fd.ops.mul(T8, T1)
    #S10 = fd.define_scalar(8, dtype=DataType.Int)
    #V11 = fd.define_vector([S3, S10, S5, S6], dtype=DataType.Int)
    #T12 = fd.ops.expand(T2, shape=V11)
    fd.add_output(T9)
    #fd.add_output(T12)
    fd.add_output(T2)

with FusionDefinition() as fd:
    nvfuser_fusion_id2(fd)

inputs = [
    torch.randn(131072, dtype=torch.float32, device='cuda:0').as_strided((1, 1024, 128), (131072, 1, 1024)),
    torch.randn(4194304, dtype=torch.float32, device='cuda:0').as_strided((1, 32, 1024, 128), (4194304, 128, 4096, 1)),
]
fd.execute(inputs)

torch.cuda.synchronize()

throws

RuntimeError:  INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/executor.cpp":1421, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. CUDA error: CUDA_ERROR_MISALIGNED_ADDRESS failed with error misaligned address
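For readers checking the repro inputs, the sketch below derives a stride order from the `as_strided` strides. It assumes `stride_order` ranks each dimension by its stride, with 0 as the fastest-varying dimension; that reading reproduces both `stride_order` lists in the script above, but it is an interpretation rather than something stated in this issue. The helper name `stride_order_from_strides` is hypothetical and is not part of nvFuser.

```python
def stride_order_from_strides(strides):
    """Rank each dimension by its stride: 0 = smallest stride (fastest-varying).

    Hypothetical helper for illustration; not an nvFuser API.
    """
    # Sort dimension indices by stride; on ties, treat the later dimension as faster.
    dims_by_stride = sorted(range(len(strides)), key=lambda d: (strides[d], -d))
    order = [0] * len(strides)
    for rank, dim in enumerate(dims_by_stride):
        order[dim] = rank
    return order


# Strides copied from the two as_strided calls in the repro script.
t0_strides = (131072, 1, 1024)
t1_strides = (4194304, 128, 4096, 1)

t0_order = stride_order_from_strides(t0_strides)
t1_order = stride_order_from_strides(t1_strides)

print(t0_order)  # [2, 0, 1]
print(t1_order)  # [3, 1, 2, 0]
```

Under that assumption, T0 is fastest along its size-1024 dimension and T1 along its size-128 dimension. The two operands of the `mul` therefore disagree on the innermost dimension, which is consistent with the failing kernel being a transpose kernel.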

wujingyue · 2025-01-16T00:03:40Z

This was broken when

Fuser/csrc/scheduler/transpose.cpp

Line 869 in ef6f169

existing_cache->setMemoryType(MemoryType::Shared);

puts an existing input cache (a product of cacheInputs) in shared memory, but because of #3621 its allocation domain was no longer consistent with that of the input. (It's unclear to me why an inconsistency between the allocation domain of a global input and that of its shared-memory cache would cause a misalignment error. However, I think the inconsistency would at least slow down the generated code.)
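As general background on the error itself, and not a diagnosis of this bug: CUDA requires a memory access to be aligned to its own size, so a 16-byte vectorized access needs an address that is a multiple of 16. The sketch below shows how that holds or fails depending on the row pitch a kernel assumes. The helper name `is_vector_access_aligned`, the vector width of 4, and the pitch values are all hypothetical and are not taken from the generated kernel.

```python
def is_vector_access_aligned(base_addr_bytes, elem_offset, vec_width, itemsize=4):
    """Return True if a vec_width-element access at elem_offset is naturally aligned.

    Hypothetical helper for illustration only.
    """
    access_bytes = vec_width * itemsize  # e.g. 4 floats -> 16 bytes
    addr = base_addr_bytes + elem_offset * itemsize
    return addr % access_bytes == 0


base = 0  # assumption: the buffer itself starts on a 16-byte boundary

# 4-wide float access at the start of each row, with a row pitch of 1024 elements.
aligned_rows = [is_vector_access_aligned(base, row * 1024, 4) for row in range(4)]

# Same access pattern with a hypothetical pitch of 1025 elements.
odd_pitch_rows = [is_vector_access_aligned(base, row * 1025, 4) for row in range(4)]

print(aligned_rows)    # [True, True, True, True]
print(odd_pitch_rows)  # [True, False, False, False]
```

The vector width is only safe when it matches the layout the indexing actually uses. Whether that is what went wrong here is not established in this thread.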

The following patch

diff --git a/csrc/scheduler/transpose.cpp b/csrc/scheduler/transpose.cpp
index bef6b2b7..ecfa5615 100644
--- a/csrc/scheduler/transpose.cpp
+++ b/csrc/scheduler/transpose.cpp
@@ -860,15 +860,9 @@ void scheduleTranspose(Fusion* fusion, const TransposeParams* tparams) {
       grouped_inputs_outputs[1].begin(), grouped_inputs_outputs[1].end());
   for (auto tv : grouped_inputs_outputs[1]) {
     if (tv->isFusionInput()) {
-      auto existing_cache = ir_utils::consumerTvsOf(tv)[0];
-      if (ir_utils::consumerTvsOf(existing_cache).size() > 1) {
-        auto new_cache = tv->cacheAfter();
-        new_cache->setMemoryType(MemoryType::Shared);
-        group2_and_cached_inputs.emplace(new_cache);
-      } else {
-        existing_cache->setMemoryType(MemoryType::Shared);
-        group2_and_cached_inputs.emplace(existing_cache);
-      }
+      auto new_cache = tv->cacheAfter();
+      new_cache->setMemoryType(MemoryType::Shared);
+      group2_and_cached_inputs.emplace(new_cache);
     }
   }
   // set cached outputs of group 2 to shared memory

can be a workaround (note that cacheAfter() propagates allocation by default), but I doubt it's the right solution.

IIUC, any scheduler that caches inputs in shared memory can be broken by #3621. Transpose and matmul are of that type. Normalization would likely suffer too.

cc @naoyam

kshitij12345 mentioned this issue Jan 14, 2025

Misaligned Address while running Qwen2 #3704

Closed

wujingyue mentioned this issue Jan 14, 2025

Revert "cacheInputs propagates allocation only for matmul schedulers." #3706

Merged

wujingyue closed this as completed in #3706 Jan 14, 2025

wujingyue closed this as completed in 5e08e1d Jan 14, 2025

wujingyue self-assigned this Jan 14, 2025

naoyam pushed a commit that referenced this issue Jan 14, 2025

Revert "cacheInputs propagates allocation only for matmul schedulers." (#3706)

c3c2093

Reverts #3621 to fix #3701
