EmbeddingFwdOp node with same functionality as F.embedding #3649

Open · wants to merge 11 commits into base: main
Conversation

@Priya2698 Priya2698 commented Dec 26, 2024

This PR adds an EmbeddingFwdOp node with the same functionality as F.embedding.

  1. I am not using take_along_axis. F.embedding accepts optional parameters such as max_norm and padding_idx, which would require further processing if implemented with take_along_axis, so I opted for a new node to guarantee performance parity.
  2. Thunder uses prims.EMBEDDING when the optional parameters padding_idx/max_norm are specified and prims.TAKE otherwise, which prevents nvFuser from consuming the embedding operator in the other cases. Hence, in Thunder, nvFuser will directly execute ltorch.embedding. This requires a separate backward API to consume ltorch.embedding_backward and cannot reuse the grad rules for prims.EMBEDDING, hence the EmbeddingFwdOp naming instead of EmbeddingOp.
  3. I plan to plumb the forward-only embedding support into Thunder first while I draft the backward node, which should be very similar. Thunder reviews may surface another way of implementing this support.

@Priya2698 Priya2698 changed the title EmbeddingOp node with same functionality as F.embedding EmbeddingFwdOp node with same functionality as F.embedding Jan 16, 2025

github-actions bot commented Jan 16, 2025

PR Reviewer Guide 🔍

(Review updated until commit af25ce0)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
⚡ Recommended focus areas for review

Potential Logic Change

The new EmbeddingFwdOp node has been added with the same functionality as F.embedding. This may introduce logic changes, especially with regard to function signatures. Reviewers should verify that the new node behaves as expected and does not introduce any regressions.

EmbeddingFwdOp::EmbeddingFwdOp(
    IrBuilderPasskey passkey,
    TensorView* output,
    TensorView* input,
    TensorView* weight,
    Val* padding_idx,
    Val* max_norm,
    Val* norm_type,
    Val* scale_grad_by_freq,
    Val* sparse)
    : Expr(passkey) {
  addOutput(output);

  addInput(input);
  addInput(weight);
  addInput(norm_type);
  addInput(scale_grad_by_freq);
  addInput(sparse);
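  // Note: inputs 0-4 are always (input, weight, norm_type, scale_grad_by_freq,
  // sparse). The optional padding_idx and max_norm inputs are appended after
  // them, and the boolean data attributes below record whether each is
  // present, so that evaluate() can recover their positions.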
  if (padding_idx != nullptr) {
    addInput(padding_idx);
    addDataAttribute(true);
  } else {
    addDataAttribute(false);
  }
  if (max_norm != nullptr) {
    addInput(max_norm);
    addDataAttribute(true);
  } else {
    addDataAttribute(false);
  }
}

NVFUSER_DEFINE_CLONE_AND_CREATE(EmbeddingFwdOp)

std::string EmbeddingFwdOp::toString(int indent_size) const {
  std::stringstream ss;
  indent(ss, indent_size) << out()->toString() << ",\n";
  indent(ss, indent_size + 1) << " = embedding(" << in()->toString() << ",\n";
  indent(ss, indent_size + 1) << "          " << weight()->toString() << ",\n";
  if (padding_idx() != nullptr) {
    indent(ss, indent_size + 1)
        << "          padding_idx = " << padding_idx()->toString() << ",\n";
  }
  if (max_norm() != nullptr) {
    indent(ss, indent_size + 1)
        << "          max_norm = " << max_norm()->toString() << ",\n";
  }
  indent(ss, indent_size + 1)
      << "          norm_type = " << norm_type()->toString() << ",\n";
  indent(ss, indent_size + 1)
      << "          scale_grad_by_freq = "
      << scale_grad_by_freq()->toInlineString() << ",\n";
  indent(ss, indent_size + 1)
      << "          sparse = " << sparse()->toInlineString() << ")\n";
  return ss.str();
}

std::string EmbeddingFwdOp::toInlineString(int indent_size) const {
  NVF_CHECK(false, "Tensor op can not be printed inline");
}

std::vector<PolymorphicValue> EmbeddingFwdOp::evaluate(
    const ExpressionEvaluator& ee,
    const std::vector<PolymorphicValue>& inputs) const {
  auto input = inputs.at(0).as<at::Tensor>();
  auto weight = inputs.at(1).as<at::Tensor>();
  auto norm_type = inputs.at(2).as<double>();
  auto scale_grad_by_freq = inputs.at(3).as<bool>();
  auto sparse = inputs.at(4).as<bool>();
  std::optional<int64_t> padding_idx = std::nullopt;
  if (has_padding_idx()) {
    padding_idx = inputs.at(5).as<int64_t>();
  }
  std::optional<double> max_norm = std::nullopt;
  if (has_max_norm()) {
    auto idx = 5 + has_padding_idx();
    max_norm = inputs.at(idx).as<double>();
  }

  namespace F = torch::nn::functional;
  return {F::embedding(
      input,
      weight,
      F::EmbeddingFuncOptions()
          .padding_idx(padding_idx)
          .max_norm(max_norm)
          .norm_type(norm_type)
          .scale_grad_by_freq(scale_grad_by_freq)
          .sparse(sparse))};
}
} // namespace nvfuser
Potential Logic Change

The embedding_fwd function has been added, which creates a new EmbeddingFwdOp node. Reviewers should verify that this function behaves as expected and does not introduce any regressions.

TensorView* embedding_fwd(
    TensorView* input,
    TensorView* weight,
    Val* padding_idx,
    Val* max_norm,
    Val* norm_type,
    Val* scale_grad_by_freq,
    Val* sparse) {
  auto input_domain = TensorDomain::noReductions(input->getLogicalDomain());
  auto weight_domain = TensorDomain::noReductions(weight->getLogicalDomain());
  NVF_CHECK(
      !input_domain.empty(),
      "Expected input to be atleast 1D, got: ",
      input_domain.size());
  NVF_CHECK(
      weight_domain.size() == 2,
      "Expected weight to be 2D, got: ",
      weight_domain.size());

  NVF_CHECK(
      !padding_idx || padding_idx->isScalar(),
      "Expected padding_idx to be a scalar int.");
  NVF_CHECK(
      !max_norm || max_norm->isScalar(),
      "Expected max_norm to be a scalar double.");
  NVF_CHECK(
      !norm_type || norm_type->isScalar(),
      "Expected norm_type to be a scalar double.");
  NVF_CHECK(
      !scale_grad_by_freq || scale_grad_by_freq->isScalar(),
      "Expected scale_grad_by_freq to be a scalar bool.");
  NVF_CHECK(
      !sparse || sparse->isScalar(), "Expected sparse to be a scalar bool.");

  auto ndims_out = input_domain.size() + 1;
  std::vector<IterDomain*> out_domain(ndims_out, nullptr);

  for (auto idx : c10::irange(ndims_out - 1)) {
    out_domain[idx] = ops::newOutputIterDomain({input_domain[idx]});
  }
  out_domain[ndims_out - 1] = ops::newOutputIterDomain({weight_domain.back()});
  TensorDomain* out_td = IrBuilder::create<TensorDomain>(
      out_domain, TensorDomain::getContiguityFilledWith(out_domain, true));
  TensorView* output = IrBuilder::create<TensorView>(out_td, weight->dtype());

  if (norm_type == nullptr) {
    norm_type = IrBuilder::create<Val>(2.0, DataType::Double);
  }
  if (scale_grad_by_freq == nullptr) {
    scale_grad_by_freq = IrBuilder::create<Val>(false, DataType::Bool);
  }
  if (sparse == nullptr) {
    sparse = IrBuilder::create<Val>(false, DataType::Bool);
  }
  IrBuilder::create<EmbeddingFwdOp>(
      output,
      input,
      weight,
      padding_idx,
      max_norm,
      norm_type,
      scale_grad_by_freq,
      sparse);

  return output;
}
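
For illustration, a minimal sketch of how the new op could be exercised from a C++ test, assuming the usual nvFuser test helpers (makeSymbolicTensor, FusionGuard) and the EmbeddingTest fixture used elsewhere in this PR; this sketch is not part of the diff:

TEST_F(EmbeddingTest, EmbeddingFwdSketch) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  // 1D int64 indices and a 2D float weight table, matching the checks above.
  TensorView* indices = makeSymbolicTensor(1, DataType::Int);
  TensorView* weight = makeSymbolicTensor(2, DataType::Float);
  fusion->addInput(indices);
  fusion->addInput(weight);

  // Optional scalars left as nullptr pick up the defaults above
  // (norm_type = 2.0, scale_grad_by_freq = false, sparse = false).
  TensorView* out = embedding_fwd(
      indices,
      weight,
      /*padding_idx=*/nullptr,
      /*max_norm=*/nullptr,
      /*norm_type=*/nullptr,
      /*scale_grad_by_freq=*/nullptr,
      /*sparse=*/nullptr);
  fusion->addOutput(out);
}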
Potential Logic Change

The embedding_fwd function has been bound to the Python frontend. Reviewers should verify that this binding behaves as expected and does not introduce any regressions.

nvf_ops.def(
    "embedding_fwd",
    [](FusionDefinition::Operators& self,
       Tensor input,
       Tensor weight,
       std::optional<Scalar> padding_idx,
       std::optional<Scalar> max_norm,
       std::optional<Scalar> norm_type,
       std::optional<Scalar> scale_grad_by_freq,
       std::optional<Scalar> sparse) -> decltype(auto) {
      FUSER_PERF_SCOPE("Operators.embedding_fwd");
      NVF_CHECK(
          self.validUse(), "Attempting to add to a completed definition!");
      FusionDefinition* fd = self.fusion_definition;
      size_t ndims = input.dims + 1;
      Tensor output = fd->defineTensor(/*dims=*/ndims);

      auto padding_idx_state = padding_idx.has_value()
          ? fd->recordingState(padding_idx.value()())
          : State(/*_index=*/0, /*_stype=*/serde::StateType::None);
      auto max_norm_state = max_norm.has_value()
          ? fd->recordingState(max_norm.value()())
          : State(/*_index=*/0, /*_stype=*/serde::StateType::None);
      auto norm_type_state = norm_type.has_value()
          ? fd->recordingState(norm_type.value()())
          : State(/*_index=*/0, /*_stype=*/serde::StateType::None);
      auto scale_grad_by_freq_state = scale_grad_by_freq.has_value()
          ? fd->recordingState(scale_grad_by_freq.value()())
          : State(/*_index=*/0, /*_stype=*/serde::StateType::None);
      auto sparse_state = sparse.has_value()
          ? fd->recordingState(sparse.value()())
          : State(/*_index=*/0, /*_stype=*/serde::StateType::None);

      fd->defineRecord(new EmbeddingFwdOpRecord(
          {fd->recordingState(input()),
           fd->recordingState(weight()),
           padding_idx_state,
           max_norm_state,
           norm_type_state,
           scale_grad_by_freq_state,
           sparse_state},
          {fd->recordingState(output())}));
      return output;
    },
    py::arg("input"),
    py::arg("weight"),
    py::arg("padding_idx").none(true) = py::none(),
    py::arg("max_norm").none(true) = py::none(),
    py::arg("norm_type").none(true) = py::none(),
    py::arg("scale_grad_by_freq").none(true) = py::none(),
    py::arg("sparse").none(true) = py::none(),
    py::return_value_policy::reference);

@Priya2698 (Collaborator Author)

!test


constexpr int64_t n = 5, s = 2;

TEST_F(EmbeddingTest, EmbeddingFwdNode) {
@protonu (Collaborator) commented Jan 17, 2025


I wonder if it's possible to add a check to verify the output of the toString method as well.
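
For illustration only, a minimal sketch of what such a check might look like, assuming the test builds the op via embedding_fwd and keeps a handle to the output tensor; the substring is chosen to mirror EmbeddingFwdOp::toString and this is not part of the diff:

// Hypothetical check inside the test body.
std::string str = out->definition()->toString();
EXPECT_TRUE(str.find("embedding(") != std::string::npos);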

  }
  std::optional<double> max_norm = std::nullopt;
  if (has_max_norm()) {
    auto idx = 5 + has_padding_idx();
Collaborator

nit: this free-floating 5 bothers me a little bit, but I'm not sure what would be better.

Collaborator Author

It may not be ideal; however, we fetch the preceding variables by fixed indices as well. The positions of the variables are constant, so it should be safe.

  if (norm_type == nullptr) {
    norm_type = IrBuilder::create<Val>(2.0, DataType::Double);
  }
  if (scale_grad_by_freq == nullptr) {
Collaborator

nit: can we use IrContainer::falseVal() here?
input->fusion()->falseVal(), or get the current fusion and call the function on it?

Collaborator

similar pattern below.
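
A minimal sketch of the suggestion, assuming IrContainer::falseVal() is reachable through the input tensor's fusion as noted above; not part of the diff:

// Sketch only: reuse the container-owned false value instead of creating a
// new Bool scalar for each defaulted flag.
if (scale_grad_by_freq == nullptr) {
  scale_grad_by_freq = input->fusion()->falseVal();
}
if (sparse == nullptr) {
  sparse = input->fusion()->falseVal();
}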
