
Flush debug print to avoid truncated output #3715

Merged
merged 1 commit into main from debug_dump_flush
Jan 16, 2025

Conversation

naoyam
Collaborator

@naoyam naoyam commented Jan 16, 2025

Quite commonly, a debug dump like NVFUSER_DUMP=fusion_ir_math gets truncated when a device-side error happens. This change should at least avoid that for some of the fusion dumps.
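
For illustration only (not nvFuser code), a standalone sketch of the failure mode: output still sitting in a stream buffer is lost when the process dies before the buffer is flushed, which is typical when the dump is redirected to a file and the stream is fully buffered.

#include <cstdlib>
#include <iostream>

int main() {
  std::cout << "flushed: survives the crash\n" << std::flush;
  std::cout << "buffered: may never appear";  // lost if nothing flushes it
  std::abort();  // stand-in for a device-side error killing the process
}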

@naoyam naoyam requested a review from wujingyue January 16, 2025 01:31
@naoyam
Collaborator Author

naoyam commented Jan 16, 2025

!build


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 1 🔵⚪⚪⚪⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Function Signature

Verify that adding std::flush to the debug() output streams does not introduce unintended side effects or performance concerns, since flushing on every dump adds I/O overhead to debug logging.

os << std::flush;

Consistency Check

Ensure that the newly added std::flush statements are consistently applied to all relevant debug output streams throughout the codebase, maintaining uniform behavior for debug logging.

  debug() << std::flush;
}

std::unordered_map<
    TensorView*,
    std::pair<std::vector<int64_t>, std::vector<int64_t>>>
Fusion::bankConflictInfo(const CompileParams& compile_params) {
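  // Collect all shared-memory TensorViews used by the fusion's math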
  std::vector<TensorView*> smem_tvs;
  for (auto v : usedMathVals()) {
    auto tv = dynamic_cast<TensorView*>(v);
    if (tv == nullptr) {
      continue;
    }
    if (tv->getMemoryType() == MemoryType::Shared) {
      smem_tvs.push_back(tv);
    }
  }
  if (smem_tvs.empty()) {
    return {};
  }
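  // Stash the list as managed data; lowering clones it into the kernel so
  // the kernel-side TVs can be mapped back to these fusion-side TVs below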
  manage("smem_tvs", smem_tvs);

  GpuLower lower(this, compile_params);
  lower.run();
  auto kernel = lower.kernel();
  auto info = getBankConflictInfo(kernel);

  // Convert TVs in kernel to TVs in fusion
  auto smem_tvs_in_kernel =
      kernel->getManaged<std::vector<TensorView*>>("smem_tvs");
  NVF_ERROR(smem_tvs_in_kernel.size() == smem_tvs.size());
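  // Map a kernel-side value (a TensorIndex into a cloned TV) back to the
  // corresponding fusion-side TensorView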
  auto getSmemTvInFusion = [&](Val* v) -> TensorView* {
    auto ti = dynamic_cast<kir::TensorIndex*>(v);
    if (ti == nullptr) {
      return nullptr;
    }
    auto tv = ti->view();
    auto it =
        std::find(smem_tvs_in_kernel.begin(), smem_tvs_in_kernel.end(), tv);
    if (it == smem_tvs_in_kernel.end()) {
      return nullptr;
    }
    auto index = std::distance(smem_tvs_in_kernel.begin(), it);
    return smem_tvs.at(index);
  };

  std::unordered_map<
      TensorView*,
      std::pair<std::vector<int64_t>, std::vector<int64_t>>>
      result;
  result.reserve(info.size());
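  // i.second holds the bank-conflict way counts for this expression's
  // input (first) and output (second) accesses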
  for (auto i : info) {
    auto expr = i.first;

    // Currently, only set and load/store ops are supported
    NVF_ERROR(expr->inputs().size() == 1);
    NVF_ERROR(expr->outputs().size() == 1);

    auto input = getSmemTvInFusion(expr->input(0));
    auto output = getSmemTvInFusion(expr->output(0));
    if (input == nullptr) {
      NVF_ERROR(i.second.first == 0);
    } else {
      NVF_ERROR(i.second.first != 0);
      result[input].first.push_back(i.second.first);
    }
    if (output == nullptr) {
      NVF_ERROR(i.second.second == 0);
    } else {
      NVF_ERROR(i.second.second != 0);
      result[output].second.push_back(i.second.second);
    }
  }
  return result;
}

void Fusion::printMath(bool from_outputs_only) {
  FUSER_PERF_SCOPE("Fusion::printMath");

  FusionGuard fg(this);
  auto exprs_for_print = exprs();
  debug() << "Inputs:" << std::endl;
  for (auto inp : inputs()) {
    debug() << "  " << inp << std::endl;
  }

  debug() << "Outputs:" << std::endl;
  for (auto out : outputs()) {
    debug() << "  " << out << std::endl;
  }

  // If we want everything in the fusion, grab all values without uses to
  // traverse from.
  if (!from_outputs_only) {
    std::vector<Val*> leaf_vals;
    for (auto val : deterministic_vals()) {
      if (val->uses().empty()) {
        leaf_vals.push_back(val);
      }
    }
    exprs_for_print = StmtSort::getExprsTo(leaf_vals);
  }

  debug() << "\n%kernel_math {\n";
  for (auto expr : exprs_for_print) {
    debug() << expr;
  }
  debug() << "} // %kernel_math \n\n";

  debug() << std::flush;
}

Collaborator

@wujingyue wujingyue left a comment


I can imagine having to add more flushes in the future. You may want to have debug() return a subclass of std::ostream and overload operator<< so we can flush all messages to debug() conditioned on a flag. For example, in glog, LOG(ERROR) always flushes and LOG(INFO) doesn't.
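
A minimal sketch of that idea (DebugStream, Severity, and debugStream are hypothetical names, not nvFuser's actual API): buffer each message and let the wrapper's destructor flush the underlying stream only when the severity calls for it. Assumes C++17 for the by-value return.

#include <iostream>
#include <sstream>

// Message-scoped stream wrapper: the whole message is buffered and
// written out in the destructor, which flushes only when requested.
class DebugStream {
 public:
  DebugStream(std::ostream& os, bool always_flush)
      : os_(os), always_flush_(always_flush) {}

  template <typename T>
  DebugStream& operator<<(const T& val) {
    buffer_ << val;
    return *this;
  }

  ~DebugStream() {
    os_ << buffer_.str();
    if (always_flush_) {
      os_ << std::flush;  // one flush per message, not per call site
    }
  }

 private:
  std::ostream& os_;
  bool always_flush_;
  std::ostringstream buffer_;
};

// glog-like policy: ERROR always flushes, INFO may stay buffered.
enum class Severity { INFO, ERROR };

DebugStream debugStream(Severity s) {
  // std::cout is used here because std::cerr is unit-buffered anyway.
  return DebugStream(std::cout, /*always_flush=*/s == Severity::ERROR);
}

int main() {
  debugStream(Severity::ERROR) << "flushed when this statement ends\n";
  debugStream(Severity::INFO) << "may remain buffered\n";
}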

@naoyam
Copy link
Collaborator Author

naoyam commented Jan 16, 2025

I can imagine having to add more flushes in the future. You may want to have debug() return a subclass of std::ostream and overload operator<< so we can flush all messages to debug() conditioned on a flag. For example, in glog, LOG(ERROR) always flushes and LOG(INFO) doesn't.

Agreed

@naoyam naoyam merged commit 0d0402f into main Jan 16, 2025
18 checks passed
@naoyam naoyam deleted the debug_dump_flush branch January 16, 2025 07:11