Lower stream-parallelized `LinearOp` into Host IR AG+GEMM overlap algo #3736

samnordmann · 2025-01-20T12:37:35Z

stacked on top of:

Host irs: LinearOp with pre-allocated output #3735

This PR adds a lowering path from a `LinearOp` with a stream-parallelized schedule to the Host IR AG+GEMM (allgather + GEMM) overlap algorithm. This is an intermediate step towards using the comms/compute pipelined algorithm in a transformer.

More precisely, the fusion:

  TensorView* in = makeContigTensor(4); //[S, DIDx(D), M/(S*D), K]
  TensorView* weight = makeContigTensor(2); //[N, K]
  TensorView* bias = makeContigTensor(1); //[N]
  TensorView* out = linear(in, weight, bias); //[S, D, M/(S*D), N]

  fusion->addInput(in);
  fusion->addInput(weight);
  fusion->addInput(bias);
  fusion->addOutput(out);

  auto mesh = DeviceMesh::createForNumDevices(D);
  in->setDeviceMesh(mesh);
  weight->setDeviceMesh(mesh);
  bias->setDeviceMesh(mesh);
  out->setDeviceMesh(mesh);

  in->axis(1)->parallelize(ParallelType::DIDx);
  out->axis(0)->parallelize(ParallelType::Stream);

gets lowered through MultiDeviceExecutor to the following Host IR (obtained using NVFUSER_DUMP=host_ir):

%HostIrContainer { (T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g_float[iS6{i7}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T3_g_float[iStream7{i0}, iS8{i2}, iS9{i3}, iS10{i5}, rS11{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  GetCurrentStream into Stream 0
  T4_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T4_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i4 ), zero_init=false, resets_to_zero=false)
  T3_g_float[iStream7{i0}, iS8{i2}, iS9{i3}, iS10{i5}, rS11{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g_float[iStream7{i0}, iS8{i2}, iS9{i3}, iS10{i5}, rS11{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i5 ), zero_init=false, resets_to_zero=false)
  FOR i105 in iS0{i0}:
    SetCurrentStream to Stream ( i105 % numberOfStreams )
    T5_l_float[ideviceIdx.x16{i2}, iS17{i3}, iS18{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS0{i0}, index = i105 )
    T6_l_float[iS19{i2}, iS20{i3}, iS21{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T4_g_float[iS12{i0}, iS13{i2}, iS14{i3}, iS15{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS12{i0}, index = i105 )
    Communication 46 (type=Allgather, team=(0 1 2 3 4 5 6 7), input=T5_l_float[ideviceIdx.x16{i2}, iS17{i3}, iS18{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), output=T6_l_float[iS19{i2}, iS20{i3}, iS21{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    Wait Communication 46
    T7_l_float[iS22{i2}, iS23{i3}, iS24{i5}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T3_g_float[iStream7{i0}, iS8{i2}, iS9{i3}, iS10{i5}, rS11{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iStream7{i0}, index = i105 )
    T7_l_float[iS22{i2}, iS23{i3}, iS24{i5}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = linear(T6_l_float[iS19{i2}, iS20{i3}, iS21{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})      ,
          T2_g_float[iS6{i7}] (DeviceMesh{0 1 2 3 4 5 6 7})      )
    SetCurrentStream to Stream 0
    Synchronize Stream ( i105 % numberOfStreams )
} // %HostIrContainer
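
To make the dataflow of the loop above easier to follow, here is a minimal host-side sketch in plain C++. It is not nvFuser code: `Tile`, `linearRef`, `allgatherRef`, and `pipelinedLinear` are hypothetical names introduced for illustration, the "allgather" is a plain concatenation of per-device shards, and everything runs sequentially on one thread. It therefore shows only which data each iteration touches and the round-robin stream choice (`i105 % numberOfStreams`), not real communication/compute overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the Host IR loop; not nvFuser code.
using Tile = std::vector<float>; // row-major [rows, cols]

// y[r, n] = sum_k x[r, k] * w[n, k] + b[n], i.e. linear() with weight [N, K].
Tile linearRef(const Tile& x, const Tile& w, const Tile& b,
               std::size_t rows, std::size_t K, std::size_t N) {
  Tile y(rows * N);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc = b[n];
      for (std::size_t k = 0; k < K; ++k) {
        acc += x[r * K + k] * w[n * K + k];
      }
      y[r * N + n] = acc;
    }
  }
  return y;
}

// Stand-in for the Allgather: concatenate shards along the device axis.
Tile allgatherRef(const std::vector<Tile>& shards) {
  Tile out;
  for (const Tile& s : shards) {
    out.insert(out.end(), s.begin(), s.end());
  }
  return out;
}

// in[s][d] is the [m, K] shard of tile s held by device d.
// Returns out[s] of shape [D * m, N], produced one tile per iteration.
std::vector<Tile> pipelinedLinear(
    const std::vector<std::vector<Tile>>& in, const Tile& w, const Tile& b,
    std::size_t m, std::size_t K, std::size_t N,
    std::size_t numberOfStreams, std::vector<std::size_t>* streamLog = nullptr) {
  std::vector<Tile> out(in.size());
  for (std::size_t s = 0; s < in.size(); ++s) {
    // Mirrors "SetCurrentStream to Stream ( i105 % numberOfStreams )".
    const std::size_t stream = s % numberOfStreams;
    if (streamLog != nullptr) {
      streamLog->push_back(stream);
    }
    // Mirrors the select + Allgather + Wait on tile s.
    const Tile gathered = allgatherRef(in[s]);
    // Mirrors linear() writing into the selected slice of the output.
    out[s] = linearRef(gathered, w, b, in[s].size() * m, K, N);
  }
  return out;
}
```

In the real lowering, each iteration's allgather and GEMM are issued on their own stream, so the communication for one tile can proceed while the GEMM for another is running; the sketch only checks that tile-by-tile processing yields the expected per-tile outputs.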

samnordmann · 2025-01-20T15:54:42Z

!test

samnordmann added 4 commits January 20, 2025 03:18

Host Ir: add linear op with preallocated outputs

cf3c531

slightly simplify implementation and test

9e17933

lint

d432662

Lower stream-parallelized LinearOp into Host IR AG+GEMM overlap algo

bff1abb

samnordmann changed the title ~~Overlaplower linear to hostir~~ Lower stream-parallelized LinearOp into Host IR AG+GEMM overlap algo Jan 20, 2025

lint

1fff21a
