[NVIDIA GPU SPMD] Add runtime support to run windowed einsum in multiple streams #3

Closed
Tixxx wants to merge 2,324 commits into main from tixxx/collective_matmul_runtime

Conversation

@Tixxx Tixxx commented Jan 23, 2024

This PR contains the runtime changes needed to run windowed einsum in multiple CUDA streams.
It follows openxla#7854, which adds stream attributes to the HLO graph.
We take the stream attributes and dispatch the corresponding kernels to separate CUDA streams.
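
To make the dispatch concrete, here is a minimal host-side sketch, not the actual XLA runtime code: a hypothetical `MultiStreamDispatcher` owns a pool of CUDA streams, kernels tagged with a stream attribute are enqueued on the matching stream, and an event joins a side stream back to its consumer.

```
// Hypothetical sketch (not XLA's Thunk runtime): route work by stream id and
// synchronize streams with events.
#include <cuda_runtime.h>
#include <vector>

class MultiStreamDispatcher {
 public:
  explicit MultiStreamDispatcher(int num_streams) : streams_(num_streams) {
    for (cudaStream_t& s : streams_) cudaStreamCreate(&s);
  }
  ~MultiStreamDispatcher() {
    for (cudaStream_t& s : streams_) cudaStreamDestroy(s);
  }
  // Stream for a kernel tagged with `stream_id` (0 = the main compute stream).
  cudaStream_t StreamFor(int stream_id) const {
    return streams_[stream_id % streams_.size()];
  }
  // Make `consumer` wait for all work enqueued on `producer` so far.
  static void Join(cudaStream_t producer, cudaStream_t consumer) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, producer);
    cudaStreamWaitEvent(consumer, ev, 0);
    cudaEventDestroy(ev);
  }

 private:
  std::vector<cudaStream_t> streams_;
};
```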

jreiffers and others added 30 commits January 12, 2024 03:09
- Remove unnecessary helpers
- Unify naming (e.g. NumberOf vs Num)
- Clarify the two different tile sizes (per thread/per block)

PiperOrigin-RevId: 597792339
…IndexingMap.

This will simplify the logic in an upcoming tile analysis change.

PiperOrigin-RevId: 597794038
This allows using the correct registration API for both legacy (untyped) and
new (typed) XLA FFI custom calls.

PiperOrigin-RevId: 597818106
Reverts changelist cl/597713624

PiperOrigin-RevId: 597830537
PiperOrigin-RevId: 597848585
Imported from GitHub PR openxla#7843

This PR enables XLA to take advantage of [NCCL User Buffer Registration](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html) in NCCL 2.19:
> User Buffer Registration is a feature that allows NCCL to directly send/receive/operate data through the user buffer without extra internal copy. It can accelerate collectives and reduce the resource usage (e.g. #channel usage).

NCCL supports this feature for all reduce, all gather, and reduce scatter. To use the feature, a buffer must be allocated with `ncclMemAlloc`. Then, the buffer must be registered using `ncclCommRegister` for all communicators. If capturing collectives with CUDA graphs, `ncclCommRegister` can be skipped. Finally, when performing a collective using offsets into this buffer, all devices must have the same offset into the buffer.
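
A hedged sketch of that workflow (NCCL >= 2.19; `comm` and `size` are assumed inputs, and error handling is elided):

```
#include <nccl.h>
#include <cstddef>

void RegisterAndUse(ncclComm_t comm, size_t size) {
  void* buf = nullptr;
  void* handle = nullptr;
  ncclMemAlloc(&buf, size);                    // allocate registrable device memory
  ncclCommRegister(comm, buf, size, &handle);  // register once per communicator
  // ... run all-reduce / all-gather / reduce-scatter on (offsets into) buf,
  //     using the same offset on every device ...
  ncclCommDeregister(comm, handle);
  ncclMemFree(buf);
}
```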

This PR contains the following components to enable this feature:

1. Compiler
   - If flag `xla_gpu_enable_nccl_user_buffers` is true, a custom "colorer" function will be passed to `BufferAssignment` which will mark all buffers for all reduce, all gather, and reduce scatter to use alternate memory space `1`.
2. Runtime
   - The StreamExecutorGpuClient will create a second `BFCAllocator` for each device which uses `ncclMemAlloc`/`ncclMemFree` as its suballocator. The amount of memory reserved can be configured using `GpuAllocatorConfig.collective_memory_size`.
   - `MultiDeviceAllocator` will read the `memory_space` value for each allocation and use it to route the allocation to the correct allocator. Since `memory_space` is not passed to the deallocate function, it uses a hash map to remember which memory space each address belongs to (see the sketch after this list).
   - When running a collective, the first time a buffer is encountered for each communicator, it is registered using `ncclCommRegister`. The collective buffers will use a slice of the `ncclMemAlloc`'d CollectiveBFCAllocator memory, but `ncclCommRegister` will automatically locate the base address and size to register the entire `ncclMemAlloc` memory. Once cudagraph capture of NCCL collectives is supported, this code can be removed.
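
A minimal sketch of the memory-space routing described above, with hypothetical names (the real class is `MultiDeviceAllocator`; thread safety and error handling are omitted):

```
#include <cstddef>
#include <unordered_map>

// Minimal allocator interface, for illustration only.
struct Allocator {
  virtual void* Allocate(size_t bytes) = 0;
  virtual void Deallocate(void* ptr) = 0;
  virtual ~Allocator() = default;
};

// Routes by memory_space: 0 -> default device memory, 1 -> collective
// (ncclMemAlloc-backed) memory. Deallocate gets no memory_space argument,
// so each pointer's space is remembered in a hash map.
class SpaceRoutingAllocator {
 public:
  SpaceRoutingAllocator(Allocator* dflt, Allocator* collective)
      : spaces_{dflt, collective} {}

  void* Allocate(size_t bytes, int memory_space) {
    void* p = spaces_[memory_space]->Allocate(bytes);
    if (p != nullptr) space_of_[p] = memory_space;
    return p;
  }

  void Deallocate(void* p) {
    auto it = space_of_.find(p);
    const int space = (it == space_of_.end()) ? 0 : it->second;
    if (it != space_of_.end()) space_of_.erase(it);
    spaces_[space]->Deallocate(p);
  }

 private:
  Allocator* spaces_[2];
  std::unordered_map<void*, int> space_of_;
};
```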

NCCL 2.19 is required for `ncclCommRegister`.

PJRT Changes: openxla#7963
StreamExecutor Changes: openxla#7962

Copybara import of the project:

--
dc8aefb by Trevor Morris <[email protected]>:

NCCL User Buffer Registration - service gpu changes

--
b1c48e7 by Trevor Morris <[email protected]>:

Add new nccl symbols

--
4e7a299 by Trevor Morris <[email protected]>:

Check if buffer was allocated using ncclMemAlloc before attempting to register it.

--
bdbc6ea by Trevor Morris <[email protected]>:

Add bazel dependency and fix cpu build

--
c9acddc by Trevor Morris <[email protected]>:

Cpu build fixes

--
95f0a06 by Trevor Morris <[email protected]>:

Remove cuda bazel dep since it is automatically added by tsl_gpu_library

Merging this change closes openxla#7843

COPYBARA_INTEGRATE_REVIEW=openxla#7843 from trevor-m:memoryspaces 95f0a06
PiperOrigin-RevId: 597848721
…ithin an error margin.

The proposed margin of 2 microseconds was picked partially by experimenting with a small number of models and partially by experience. In the future a more rigorous analysis could yield a better margin.

PiperOrigin-RevId: 597854663
…required assignment offset at use

PiperOrigin-RevId: 597854854
`s3 = MergeShardingIfCompatible(s1, s2)`, where s1 and s2 are the two input shardings, and s3 is the generated output sharding. Let ta1, ta2, ta3 be their tile assignments, respectively.

Before this cl, the function builds ta3 by visiting each element in its array. For each element, it finds a compatible device from ta1 and ta2.

In this cl, we use reshape and transpose to generate the compatible tile assignments for ta1 and ta2.
* Reshape: decompose the input tile assignment along the replicated dimension
* Transpose: reassign the decomposed dimension to the new tiled dimension
For example, if the input sharding is `{devices=[1,5,6]<=[30] last_tile_dim_replicate}` and the expected output sharding has tile dimensions `[2, 5, 3]` with last_tile_dim_replicate, the list of compatible tile assignments is as follows (a concrete check appears after the list):
* input_sharding.tile_assignment.Reshape(5, 2, 3).Transpose(1, 0, 2)
* input_sharding.tile_assignment.Reshape(5, 3, 2).Transpose(2, 0, 1)
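
The following standalone snippet materializes the first of these assignments; plain arrays stand in for the iota tile assignment, so this is an illustration rather than XLA code:

```
#include <cstdio>

int main() {
  // Reshape the iota device list <=[30] to shape (5, 2, 3), row-major.
  int devices[5][2][3];
  for (int i = 0; i < 30; ++i) devices[i / 6][(i / 3) % 2][i % 3] = i;
  // Transpose(1, 0, 2): the resulting tile assignment has shape (2, 5, 3).
  for (int a = 0; a < 2; ++a)
    for (int b = 0; b < 5; ++b)
      for (int c = 0; c < 3; ++c)
        std::printf("tile[%d][%d][%d] = device %d\n", a, b, c, devices[b][a][c]);
  return 0;
}
```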

With the defined `compatible` function, the new tile assignment will be `ta3 = intersection(compatible(ta1), compatible(ta2))`.

Since ta3 is built by reshape and transpose from ta1 and ta2, s3 is in the iota format if s1 or s2 is in the iota format.

PiperOrigin-RevId: 597898884
The design of sugared HLO async ops is clunky, as it requires running multiple passes (specifically AsyncOpCanonicalizer, FlattenCallGraph and DCE) to clean up the generated HLO from duplicate async wrapped computations. Instead of deferring the construction of the async op chain, it is possible to construct it at parse time, which makes the previously mentioned passes redundant.

Make *-{done,update} infer their thread and computation attributes from their operand (still allowing these attributes for backwards compatibility) and deprecate "async_group_id" (subsequently purging the AsyncOpCanonicalizer pass). This removes the need to specify any attributes for these ops.

PiperOrigin-RevId: 597908702
…que lib

This is mostly NFC, just moving code around and cleaning up build files. Real changes will be in the follow-up changes.

Reverts ef66a6b

PiperOrigin-RevId: 597908777
… Listener support was previously removed (and had no initialization anyway).

PiperOrigin-RevId: 597917086
PiperOrigin-RevId: 597935429
`//third_party/bazel_platforms/os:emscripten` was recently added.
Migrate existing usages of `//third_party/bazel_platforms/cpu:wasm32`
that seem better suited to the emscripten constraint.

PiperOrigin-RevId: 597951755
…g_util` if the length of the vector is similar to the rank of a tensor.

PiperOrigin-RevId: 597959646
…eadlocks

This is required to guarantee that all initialization (that can allocate new data structures on device) is completed before we start executing any of the replicas/partitions, because otherwise we can get NCCL + CUDA graphs deadlocks.

PiperOrigin-RevId: 597971719
PiperOrigin-RevId: 598064412
jreiffers and others added 20 commits January 22, 2024 08:27
PiperOrigin-RevId: 600461196
…iveBFCAllocator in PJRT (3/3)

Imported from GitHub PR openxla#7963

If `collective_memory_size` is non-zero in the `GPUAllocatorConfig`, then a second BFCAllocator will be created for each device that uses `CollectiveMemoryAllocate` for its suballocator. For allocations with `memory_space == 1`, the allocation will be routed to this `CollectiveBFCAllocator` instead of the regular device memory allocator.

Main PR here: openxla#7843
Copybara import of the project:

--
78a6cce by Trevor Morris <[email protected]>:

NCCL User Buffer Registration - PJRT changes

--
852cd6f by Trevor Morris <[email protected]>:

Allow MultiDeviceAdapter to switch between different allocators based on memory_space

--
1f5e217 by Trevor Morris <[email protected]>:

Allow nullptr in Deallocate. Set default collective memory size to 0.

--
9aa3e75 by Trevor Morris <[email protected]>:

Add unit test for MultiDeviceAllocator

--
7266042 by Trevor Morris <[email protected]>:

Check status of CollectiveMemoryAlloc/Dealloc

--
8e371d4 by Trevor Morris <[email protected]>:

Fix platform allocator

--
fb8249b by Trevor Morris <[email protected]>:

Fix build error

Merging this change closes openxla#7963

COPYBARA_INTEGRATE_REVIEW=openxla#7963 from trevor-m:memoryspaces-pjrt fb8249b
PiperOrigin-RevId: 600471007
In summary:
- No dynamic dimensions are permitted in broadcast op result shapes.
- HLO Broadcast doesn't support unbounded dynamism, operand or output.
- HLO BroadcastInDim permits dynamic operands as long as result shape is static.

Dynamic output shapes must use `stablehlo.dynamic_broadcast_in_dim` for both `broadcast` and `broadcast_in_dim` uses.

PiperOrigin-RevId: 600476895
This allows compiling PTX to CUBIN using NVIDIA's libnvptxcompiler library instead of calling ptxas. Compiling XLA with libnvptxcompiler is optional and disabled by default because libnvptxcompiler is only available as a static library and linking against it significantly increases the binary size.

The feature itself is behind a (runtime) flag and is disabled by default for now (even when XLA is compiled with libnvptxcompiler support).
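
For context, this is roughly what in-process PTX compilation with libnvptxcompiler looks like; a hedged sketch where the `sm_80` target and options are illustrative and error handling is elided:

```
#include <nvPTXCompiler.h>
#include <cstring>
#include <vector>

std::vector<char> CompilePtxToCubin(const char* ptx) {
  nvPTXCompilerHandle compiler = nullptr;
  nvPTXCompilerCreate(&compiler, std::strlen(ptx), ptx);
  const char* opts[] = {"--gpu-name=sm_80", "-O3"};  // ptxas-style options
  nvPTXCompilerCompile(compiler, 2, opts);
  size_t cubin_size = 0;
  nvPTXCompilerGetCompiledProgramSize(compiler, &cubin_size);
  std::vector<char> cubin(cubin_size);
  nvPTXCompilerGetCompiledProgram(compiler, cubin.data());
  nvPTXCompilerDestroy(&compiler);
  return cubin;
}
```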

PiperOrigin-RevId: 600488239
Updates LLVM usage to match
[21830c913505](llvm/llvm-project@21830c913505)

PiperOrigin-RevId: 600494781
…ffsets in spmd_partitioner_util.

To avoid iterating over tile_assignment.array() twice in PartialReplicateReshardCompatibleSharding, we make it depend on hlo_sharding_util::IsSubTilingOrEqualSharding in this cl. Since we accelerated IsSubTilingOrEqualSharding in cl/598940220, this also accelerates PartialReplicateReshardCompatibleSharding.

PiperOrigin-RevId: 600508054
The normal loop emitter generates concatenate ops as a branch into each of the operands. The new concatenate emitter generates all operand slices sequentially, avoiding high warp divergence.
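
To illustrate the difference in scalar terms (a sketch of the two codegen shapes, not the emitters' actual output):

```
// Loop-emitter shape: every output element picks its operand via a branch,
// so on a GPU, neighboring threads near slice boundaries diverge.
void ConcatBranchy(const float* a, int na, const float* b, int nb, float* out) {
  for (int i = 0; i < na + nb; ++i)
    out[i] = (i < na) ? a[i] : b[i - na];
}

// Concatenate-emitter shape: each operand's slice gets its own sequential
// loop, so all threads working on a slice take the same path.
void ConcatSequential(const float* a, int na, const float* b, int nb, float* out) {
  for (int i = 0; i < na; ++i) out[i] = a[i];
  for (int i = 0; i < nb; ++i) out[na + i] = b[i];
}
```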

PiperOrigin-RevId: 600515857
Also includes minor readability improvements.

PiperOrigin-RevId: 600520568
…ency with xla::InvalidArgumentStrCat and others.

This also lets a common macro be used.

PiperOrigin-RevId: 600535070
…le loops while nested loop fusion is still off.

PiperOrigin-RevId: 600537737
…stead of the value in dot_handler.

PiperOrigin-RevId: 600542456
…ic. This is the semantics of StableHLO reshape with dynamic dims.

In summary:
- Reshape from `[?] -> [X,Y,...,Z]` is valid, as all result dims are static.
- Combinations of bounded and unbounded dims are not supported, since we cannot propagate the bounded dims: `[?, <=3] --> [3]` errors.

PiperOrigin-RevId: 600559707
…is called on a sharded array.

Fixes jax-ml/jax#19134

PiperOrigin-RevId: 600570354
PiperOrigin-RevId: 600575641
PiperOrigin-RevId: 600577027
…d move to Thunk

CollectiveExecuteParams is a companion of Thunk::ExecuteParams and has to be defined close to it.

Also convert class to struct for consistency with ExecuteParams.

PiperOrigin-RevId: 600581907
…efore initialization and execution

PiperOrigin-RevId: 600586067
@Tixxx Tixxx force-pushed the tixxx/collective_matmul_runtime branch from 1e13492 to 391ff8a on January 23, 2024 00:46
@Tixxx Tixxx changed the base branch from tixxx/collective_matmul_hlo to main January 23, 2024 00:46
@Tixxx Tixxx closed this Jan 23, 2024
Tixxx pushed a commit that referenced this pull request Apr 3, 2024
Currently we look for ptxas and nvlink in a few different places on the host machine, then we choose the first binary found without taking its version into account. If the chosen binary doesn't fulfill our version requirements, we will later fail even if there was a suitable ptxas or nvlink in the search path in the first place.

This change makes it take the version of each binary into account when going through the search path. Unsuitable binaries will be discarded right away and the search continues until we are out of locations to check.

This should help with host environments that have multiple CUDA toolkits installed and should make ptxas and nvlink selection more robust.

The concrete changes:

1. `FindCudaExecutable` now also takes a minimum version and a list of forbidden (think buggy) versions that are supposed to be skipped.
2. `WarnIfBadPtxAsVersion` has been removed. It was checking for ptxas < 11.1, which is way older than our minimum supported version of 11.8, and was not doing anything given the check described in item 3.
3. There was another version check for `ptxas` in `NVPTXCompiler::ChooseLinkingMethod` which was checking for `version(ptxas)` < 11.8. This has also been removed/replaced by the version check described in item 4.
4. Version checking for `ptxas` and `nvlink` has been consolidated into two methods, `FindPtxAsExecutable` and `FindNvLinkExecutable`. These methods hard-code the current minimum version (and the list of excluded versions) of each tool in one place. It's still not great, but at least less spaghetti-like.
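
A hedged sketch of the selection loop (helper names like `QueryVersion` are hypothetical; the real logic lives in the functions named above):

```
#include <optional>
#include <set>
#include <string>
#include <vector>

struct Version { int major = 0, minor = 0; };
bool operator<(Version a, Version b) {
  return a.major != b.major ? a.major < b.major : a.minor < b.minor;
}

// Assumed helper, stubbed here: would run `<path> --version` and parse it.
std::optional<Version> QueryVersion(const std::string& /*path*/) {
  return std::nullopt;
}

std::optional<std::string> FindTool(const std::vector<std::string>& candidates,
                                    Version min_version,
                                    const std::set<std::string>& excluded) {
  for (const std::string& path : candidates) {
    std::optional<Version> v = QueryVersion(path);
    if (!v || *v < min_version) continue;  // unreadable or too old
    const std::string tag =
        std::to_string(v->major) + "." + std::to_string(v->minor);
    if (excluded.count(tag)) continue;  // known-buggy release: keep looking
    return path;                        // first suitable binary wins
  }
  return std::nullopt;  // nothing suitable in the entire search path
}
```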

PiperOrigin-RevId: 618797392
Tixxx pushed a commit that referenced this pull request May 23, 2024
Tixxx pushed a commit that referenced this pull request May 23, 2024
…d phase to Initialize()

Imported from GitHub PR openxla#12228

The first time that a NormThunk is executed, it will build a cudnn execution plan. This build step can hang if an NCCL collective is running at the same time. To fix this, I've moved the build step to take place during thunk initialization. We only observe this hang when using cudnn 9.

Here's a backtrace from the hang that will be fixed:
```
Thread 585 (Thread 0x7fb9391ff640 (LWP 41364) "main.py"):
#0  0x00007fd3d17cffd9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd3d17da24f in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd070967dfe in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fd0709c928a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f1970d76102 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#5  0x00007f1970f2c999 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#6  0x00007f1970a7d4ab in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#7  0x00007f1970d0a9cb in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#8  0x00007fce60b2a98c in cudnn::backend::ExecutionPlan::finalize_internal() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#9  0x00007fce60aefbb1 in cudnn::backend::Descriptor::finalize() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#10 0x00007fce60b15bec in cudnnBackendFinalize () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#11 0x00007fd2521b8f39 in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#12 0x00007fd2521734ba in stream_executor::gpu::(anonymous namespace)::GetExecPlanFromHeuristics(cudnn_frontend::OperationGraph_v8&&, stream_executor::gpu::(anonymous namespace)::CudnnHandle const&, bool) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#13 0x00007fd25216ff9b in stream_executor::gpu::CudnnSupport::NormRunnerFromDesc(stream_executor::Stream*, stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormKind, double, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#14 0x00007fd24e36b88b in stream_executor::dnn::NormOp::RunnerFromAlgorithmDesc(stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#15 0x00007fd24e36ae37 in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}::operator()() const () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#16 0x00007fd24e36adbc in void absl::lts_20230802::base_internal::CallOnceImpl<stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}>(std::atomic<unsigned int>*, absl::lts_20230802::base_internal::SchedulingMode, stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}&&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#17 0x00007fd24e36a9bd in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#18 0x00007fd24e369d29 in xla::gpu::RunGpuNorm(xla::gpu::GpuNormConfig const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, stream_executor::DeviceMemoryBase const&, stream_executor::Stream*, xla::gpu::RunNormOptions) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#19 0x00007fd24e368be6 in xla::gpu::NormThunk::ExecuteOnStream(xla::gpu::Thunk::ExecuteParams const&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
```
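
A minimal sketch of the fix's shape, with hypothetical types standing in for the real Thunk interface: the expensive cudnn plan build moves into Initialize(), which completes before any replica starts executing, so it can no longer overlap a running NCCL collective.

```
#include <memory>
#include <mutex>

struct CudnnExecutionPlan { /* opaque cudnn handle in the real code */ };

class NormThunkSketch {
 public:
  void Initialize() {
    // Build (and cache) the execution plan while no collective is in flight.
    std::call_once(once_, [this] {
      plan_ = std::make_unique<CudnnExecutionPlan>();  // expensive build step
    });
  }
  void ExecuteOnStream() {
    // Previously the plan was built lazily here, which could hang if an NCCL
    // collective was running concurrently. Now it is guaranteed to exist.
    Run(*plan_);
  }

 private:
  static void Run(const CudnnExecutionPlan&) { /* enqueue work on the stream */ }
  std::once_flag once_;
  std::unique_ptr<CudnnExecutionPlan> plan_;
};
```
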
Copybara import of the project:

--
f535330 by Trevor Morris <[email protected]>:

Fix hang with cudnn layer norm by moving cudnn init to Initialize()

Merging this change closes openxla#12228

COPYBARA_INTEGRATE_REVIEW=openxla#12228 from trevor-m:tmorris-norm-init f535330
PiperOrigin-RevId: 633220207
Tixxx pushed a commit that referenced this pull request Dec 20, 2024
Fixes the following TSAN race:

```
WARNING: ThreadSanitizer: data race (pid=899472)
  Write of size 8 at 0x7f979e0f1cd8 by thread T69:
    #0 llvm::TargetRegistry::RegisterTargetMachine(llvm::Target&, llvm::TargetMachine* (*)(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool)) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:827:27 (xla_extension.so+0x9803668) (BuildId: 6fa88e3910a5eb04)
    #1 llvm::RegisterTargetMachine<llvm::X86TargetMachine>::RegisterTargetMachine(llvm::Target&) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:1250:5 (xla_extension.so+0x9803668)
    #2 LLVMInitializeX86Target /proc/self/cwd/external/llvm-project/llvm/lib/Target/X86/X86TargetMachine.cpp:69:43 (xla_extension.so+0x9803668)
    #3 llvm::InitializeNativeTarget() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/TargetSelect.h:123:5 (xla_extension.so+0x48d2358) (BuildId: 6fa88e3910a5eb04)
    #4 xla::cpu::JitCompiler::Create(llvm::TargetOptions, xla::cpu::JitCompiler::Options, absl::lts_20230802::AnyInvocable<void (std::function<void ()>)>)::$_0::operator()() const /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:113:5 (xla_extension.so+0x48d2358)
    #5 xla::cpu::JitCompiler::Create(llvm::TargetOptions, xla::cpu::JitCompiler::Options, absl::lts_20230802::AnyInvocable<void (std::function<void ()>)>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:112:34 (xla_extension.so+0x48d209b) (BuildId: 6fa88e3910a5eb04)
    #6 xla::cpu::CpuCompiler::CompileLegacyCpuExecutable(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1416:3 (xla_extension.so+0x2f716a0) (BuildId: 6fa88e3910a5eb04)
    #7 xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1730:3 (xla_extension.so+0x2f7ae18) (BuildId: 6fa88e3910a5eb04)
    #8 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:759:19 (xla_extension.so+0x2f12915) (BuildId: 6fa88e3910a5eb04)
    #9 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:847:3 (xla_extension.so+0x2f12915)

  Previous read of size 8 at 0x7f979e0f1cd8 by thread T66:
    #0 llvm::Target::createTargetMachine(llvm::StringRef, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) const /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:460:10 (xla_extension.so+0x94ba6db) (BuildId: 6fa88e3910a5eb04)
    #1 llvm::EngineBuilder::selectTarget(llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::SmallVectorImpl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) /proc/self/cwd/external/llvm-project/llvm/lib/ExecutionEngine/TargetSelect.cpp:88:18 (xla_extension.so+0x94ba6db)
    #2 xla::cpu::JitCompiler::InferTargetMachine(llvm::TargetOptions const&, llvm::CodeGenOptLevel, std::optional<tsl::port::CPUFeature>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:88:12 (xla_extension.so+0x48d096f) (BuildId: 6fa88e3910a5eb04)
    #3 xla::cpu::CpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1017:3 (xla_extension.so+0x2f70857) (BuildId: 6fa88e3910a5eb04)
    #4 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:754:3 (xla_extension.so+0x2f12874) (BuildId: 6fa88e3910a5eb04)
    #5 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:847:3 (xla_extension.so+0x2f12874)
    #6 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:893:10 (xla_extension.so+0x2f13ef2) (BuildId: 6fa88e3910a5eb04)
```
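
The commit body above only shows the race; the usual remedy, and presumably the shape of the fix here, is to funnel LLVM target initialization through a single-shot guard so registration cannot race with TargetMachine creation. A sketch under that assumption:

```
#include <mutex>

#include "llvm/Support/TargetSelect.h"

// Call this before any thread may create a TargetMachine.
void InitializeLlvmNativeTargetOnce() {
  static std::once_flag once;
  std::call_once(once, [] {
    llvm::InitializeNativeTarget();
    llvm::InitializeNativeTargetAsmPrinter();
  });
}
```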

PiperOrigin-RevId: 707701032
Tixxx pushed a commit that referenced this pull request Dec 20, 2024
…r RunBackend.

Both of these call into LLVM code that reads the compiler options.

Fixes the following race:

```
WARNING: ThreadSanitizer: data race (pid=869815)
  Read of size 1 at 0x7f8b24effc08 by thread T65:
    #0 llvm::cl::opt_storage<bool, false, false>::getValue() const /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1406:38 (xla_extension.so+0xa281417) (BuildId: 7f5d2098f168c4db)
    #1 llvm::cl::opt_storage<bool, false, false>::operator bool() const /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1410:38 (xla_extension.so+0xa281417)
    #2 llvm::CodeGenTargetMachineImpl::CodeGenTargetMachineImpl(llvm::Target const&, llvm::StringRef, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, llvm::Reloc::Model, llvm::CodeModel::Model, llvm::CodeGenOptLevel) /proc/self/cwd/external/llvm-project/llvm/lib/CodeGen/CodeGenTargetMachineImpl.cpp:97:7 (xla_extension.so+0xa281417)
    #3 llvm::X86TargetMachine::X86TargetMachine(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) /proc/self/cwd/external/llvm-project/llvm/lib/Target/X86/X86TargetMachine.cpp:236:7 (xla_extension.so+0x9803b80) (BuildId: 7f5d2098f168c4db)
    #4 llvm::RegisterTargetMachine<llvm::X86TargetMachine>::Allocator(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:1258:16 (xla_extension.so+0x980757a) (BuildId: 7f5d2098f168c4db)
    #5 llvm::Target::createTargetMachine(llvm::StringRef, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) const /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:462:12 (xla_extension.so+0x94ba529) (BuildId: 7f5d2098f168c4db)
    #6 llvm::EngineBuilder::selectTarget(llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::SmallVectorImpl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) /proc/self/cwd/external/llvm-project/llvm/lib/ExecutionEngine/TargetSelect.cpp:88:18 (xla_extension.so+0x94ba529)
    #7 xla::cpu::JitCompiler::InferTargetMachine(llvm::TargetOptions const&, llvm::CodeGenOptLevel, std::optional<tsl::port::CPUFeature>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:88:12 (xla_extension.so+0x48d070f) (BuildId: 7f5d2098f168c4db)
    #8 xla::cpu::CpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1017:3 (xla_extension.so+0x2f6dc47) (BuildId: 7f5d2098f168c4db)
    #9 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:749:3 (xla_extension.so+0x2f127e2) (BuildId: 7f5d2098f168c4db)
    #10 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:842:3 (xla_extension.so+0x2f127e2)
    #11 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:888:10 (xla_extension.so+0x2f13da2) (BuildId: 7f5d2098f168c4db)
    #12 xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_executable.cc:258:3 (xla_extension.so+0xdd04d77) (BuildId: 7f5d2098f168c4db)
    #13 xla::ifrt::PjRtCompiler::Compile(std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc:97:10 (xla_extension.so+0xdcfd29b) (BuildId: 7f5d2098f168c4db)
    #14 xla::PyClient::CompileIfrtProgram(xla::nb_class_ptr<xla::PyClient>, std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:443:5 (xla_extension.so+0xc62a228) (BuildId: 7f5d2098f168c4db)
    #15 xla::PyClient::Compile(xla::nb_class_ptr<xla::PyClient>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, xla::CompileOptions, std::vector<nanobind::capsule, std::allocator<nanobind::capsule>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:466:10 (xla_extension.so+0xc62b514) (BuildId: 7f5d2098f168c4db)

  Previous write of size 1 at 0x7f8b24effc08 by thread T66 (mutexes: write M0):
    #0 void llvm::cl::opt_storage<bool, false, false>::setValue<bool>(bool const&, bool) /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1401:11 (xla_extension.so+0x100bace9) (BuildId: 7f5d2098f168c4db)
    #1 void llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::setDefaultImpl<bool, void>() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h (xla_extension.so+0x100bace9)
    #2 llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::setDefault() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1474:32 (xla_extension.so+0x100bace9)
    #3 llvm::cl::Option::reset() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:460:3 (xla_extension.so+0x100cac0e) (BuildId: 7f5d2098f168c4db)
    #4 (anonymous namespace)::CommandLineParser::ResetAllOptionOccurrences() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:1478:17 (xla_extension.so+0x100cac0e)
    #5 llvm::cl::ResetAllOptionOccurrences() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:2831:17 (xla_extension.so+0x100caa72) (BuildId: 7f5d2098f168c4db)
    #6 xla::llvm_ir::LLVMCommandLineOptionsLock::LLVMCommandLineOptionsLock(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>> const&) /proc/self/cwd/external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:70:5 (xla_extension.so+0x91d69f4) (BuildId: 7f5d2098f168c4db)
    #7 xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1727:39 (xla_extension.so+0x2f781c8) (BuildId: 7f5d2098f168c4db)
    #8 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:754:19 (xla_extension.so+0x2f12883) (BuildId: 7f5d2098f168c4db)
    #9 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:842:3 (xla_extension.so+0x2f12883)
    #10 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:888:10 (xla_extension.so+0x2f13da2) (BuildId: 7f5d2098f168c4db)
    #11 xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_executable.cc:258:3 (xla_extension.so+0xdd04d77) (BuildId: 7f5d2098f168c4db)
    #12 xla::ifrt::PjRtCompiler::Compile(std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc:97:10 (xla_extension.so+0xdcfd29b) (BuildId: 7f5d2098f168c4db)
    #13 xla::PyClient::CompileIfrtProgram(xla::nb_class_ptr<xla::PyClient>, std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:443:5 (xla_extension.so+0xc62a228) (BuildId: 7f5d2098f168c4db)
    #14 xla::PyClient::Compile(xla::nb_class_ptr<xla::PyClient>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, xla::CompileOptions, std::vector<nanobind::capsule, std::allocator<nanobind::capsule>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:466:10 (xla_extension.so+0xc62b514) (BuildId: 7f5d2098f168c4db)
```
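
Given the `LLVMCommandLineOptionsLock` frames above and the truncated title, the fix is presumably to take that lock in RunHloPasses as well as RunBackend. A simplified sketch of the implied locking discipline (the real lock lives in `xla::llvm_ir::LLVMCommandLineOptionsLock`):

```
#include <shared_mutex>

// Stand-in for the global mutex guarding LLVM's cl::opt storage.
std::shared_mutex llvm_options_mutex;

void RunHloPassesLike() {
  std::shared_lock<std::shared_mutex> lock(llvm_options_mutex);
  // ... infer the target machine / read compiler options (read-only) ...
}

void RunBackendLike() {
  std::unique_lock<std::shared_mutex> lock(llvm_options_mutex);
  // ... set options (llvm::cl::ResetAllOptionOccurrences) and run codegen ...
}
```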

PiperOrigin-RevId: 707721170