[NVIDIA GPU SPMD] Add runtime support to run windowed einsum in multiple streams #3

Closed
Tixxx wants to merge 2,324 commits into main from tixxx/collective_matmul_runtime

Conversation

@Tixxx Tixxx commented Jan 23, 2024

This PR contains the runtime changes needed to run windowed einsum in multiple CUDA streams.
It follows openxla#7854, which adds stream attributes to the HLO graph.
We take the stream attributes and dispatch the corresponding kernels to separate CUDA streams.
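
To make the dispatch concrete, here is a minimal host-side sketch, not the actual XLA runtime code: a hypothetical `MultiStreamDispatcher` owns a pool of CUDA streams, kernels tagged with a stream attribute are enqueued on the matching stream, and an event joins a side stream back to its consumer.

```
// Hypothetical sketch (not XLA's Thunk runtime): route work by stream id and
// synchronize streams with events.
#include <cuda_runtime.h>
#include <vector>

class MultiStreamDispatcher {
 public:
  explicit MultiStreamDispatcher(int num_streams) : streams_(num_streams) {
    for (cudaStream_t& s : streams_) cudaStreamCreate(&s);
  }
  ~MultiStreamDispatcher() {
    for (cudaStream_t& s : streams_) cudaStreamDestroy(s);
  }
  // Stream for a kernel tagged with `stream_id` (0 = the main compute stream).
  cudaStream_t StreamFor(int stream_id) const {
    return streams_[stream_id % streams_.size()];
  }
  // Make `consumer` wait for all work enqueued on `producer` so far.
  static void Join(cudaStream_t producer, cudaStream_t consumer) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, producer);
    cudaStreamWaitEvent(consumer, ev, 0);
    cudaEventDestroy(ev);
  }

 private:
  std::vector<cudaStream_t> streams_;
};
```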

jreiffers and others added 30 commits January 12, 2024 03:09
- Remove unnecessary helpers
- Unify naming (e.g. NumberOf vs Num)
- Clarify the two different tile sizes (per thread/per block)

PiperOrigin-RevId: 597792339
…IndexingMap.

This will simplify the logic in an upcoming tile analysis change.

PiperOrigin-RevId: 597794038
This allows using the correct registration API for both legacy (untyped) and
new (typed) XLA FFI custom calls.

PiperOrigin-RevId: 597818106
Reverts changelist cl/597713624

PiperOrigin-RevId: 597830537
PiperOrigin-RevId: 597848585
Imported from GitHub PR openxla#7843

This PR enables XLA to take advantage of [NCCL User Buffer Registration](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html) in NCCL 2.19:
> User Buffer Registration is a feature that allows NCCL to directly send/receive/operate data through the user buffer without extra internal copy. It can accelerate collectives and reduce the resource usage (e.g. #channel usage).

NCCL supports this feature for all reduce, all gather, and reduce scatter. To use the feature, a buffer must be allocated with `ncclMemAlloc`. Then, the buffer must be registered using `ncclCommRegister` for all communicators. If capturing collectives with CUDA graphs, `ncclCommRegister` can be skipped. Finally, when performing a collective using offsets into this buffer, all devices must have the same offset into the buffer.
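
A hedged sketch of that workflow (NCCL >= 2.19; `comm` and `size` are assumed inputs, and error handling is elided):

```
#include <nccl.h>
#include <cstddef>

void RegisterAndUse(ncclComm_t comm, size_t size) {
  void* buf = nullptr;
  void* handle = nullptr;
  ncclMemAlloc(&buf, size);                    // allocate registrable device memory
  ncclCommRegister(comm, buf, size, &handle);  // register once per communicator
  // ... run all-reduce / all-gather / reduce-scatter on (offsets into) buf,
  //     using the same offset on every device ...
  ncclCommDeregister(comm, handle);
  ncclMemFree(buf);
}
```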

This PR contains the following components to enable this feature:

1. Compiler
   - If flag `xla_gpu_enable_nccl_user_buffers` is true, a custom "colorer" function will be passed to `BufferAssignment` which will mark all buffers for all reduce, all gather, and reduce scatter to use alternate memory space `1`.
2. Runtime
   - The StreamExecutorGpuClient will create a second `BFCAllocator` for each device which uses `ncclMemAlloc`/`ncclMemFree` as its suballocator. The amount of memory reserved can be configured using `GpuAllocatorConfig.collective_memory_size`.
   - `MultiDeviceAllocator` will read the `memory_space` value for each allocation and use it to route the allocation to the correct allocator. Since `memory_space` is not passed to the deallocate function, it uses a hash map to remember which memory space each address belongs to (see the sketch after this list).
   - When running a collective, the first time a buffer is encountered for each communicator, it is registered using `ncclCommRegister`. The collective buffers will use a slice of the `ncclMemAlloc`'d CollectiveBFCAllocator memory, but `ncclCommRegister` will automatically locate the base address and size to register the entire `ncclMemAlloc` memory. Once cudagraph capture of NCCL collectives is supported, this code can be removed.
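
A minimal sketch of the memory-space routing described above, with hypothetical names (the real class is `MultiDeviceAllocator`; thread safety and error handling are omitted):

```
#include <cstddef>
#include <unordered_map>

// Minimal allocator interface, for illustration only.
struct Allocator {
  virtual void* Allocate(size_t bytes) = 0;
  virtual void Deallocate(void* ptr) = 0;
  virtual ~Allocator() = default;
};

// Routes by memory_space: 0 -> default device memory, 1 -> collective
// (ncclMemAlloc-backed) memory. Deallocate gets no memory_space argument,
// so each pointer's space is remembered in a hash map.
class SpaceRoutingAllocator {
 public:
  SpaceRoutingAllocator(Allocator* dflt, Allocator* collective)
      : spaces_{dflt, collective} {}

  void* Allocate(size_t bytes, int memory_space) {
    void* p = spaces_[memory_space]->Allocate(bytes);
    if (p != nullptr) space_of_[p] = memory_space;
    return p;
  }

  void Deallocate(void* p) {
    auto it = space_of_.find(p);
    const int space = (it == space_of_.end()) ? 0 : it->second;
    if (it != space_of_.end()) space_of_.erase(it);
    spaces_[space]->Deallocate(p);
  }

 private:
  Allocator* spaces_[2];
  std::unordered_map<void*, int> space_of_;
};
```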

NCCL 2.19 is required for `ncclCommRegister`.

PJRT Changes: openxla#7963
StreamExecutor Changes: openxla#7962

Copybara import of the project:

--
dc8aefb by Trevor Morris <[email protected]>:

NCCL User Buffer Registration - service gpu changes

--
b1c48e7 by Trevor Morris <[email protected]>:

Add new nccl symbols

--
4e7a299 by Trevor Morris <[email protected]>:

Check if buffer was allocated using ncclMemAlloc before attempting to register it.

--
bdbc6ea by Trevor Morris <[email protected]>:

Add bazel dependency and fix cpu build

--
c9acddc by Trevor Morris <[email protected]>:

Cpu build fixes

--
95f0a06 by Trevor Morris <[email protected]>:

Remove cuda bazel dep since it is automatically added by tsl_gpu_library

Merging this change closes openxla#7843

COPYBARA_INTEGRATE_REVIEW=openxla#7843 from trevor-m:memoryspaces 95f0a06
PiperOrigin-RevId: 597848721
…ithin an error margin.

The proposed margin of 2 microseconds was picked partially by experimenting with a small number of models and partially by experience. In the future a more rigorous analysis could yield a better margin.

PiperOrigin-RevId: 597854663
…required assignment offset at use

PiperOrigin-RevId: 597854854
`s3 = MergeShardingIfCompatible(s1, s2)`, where s1 and s2 are the two input shardings, and s3 is the generated output sharding. Let ta1, ta2, ta3 be their tile assignments, respectively.

Before this cl, the function builds ta3 by visiting each element in its array. For each element, it finds a compatible device from ta1 and ta2.

In this cl, we use reshape and transpose to generate the compatible tile assignments for ta1 and ta2.
* Reshape: decompose the input tile assignment along the replicated dimension
* Transpose: reassign the decomposed dimension to the new tiled dimension
For example, if the input sharding is `{devices=[1,5,6]<=[30] last_tile_dim_replicate}` and the expected output sharding has tile dimensions `[2, 5, 3]` with last_tile_dim_replicate, the list of compatible tile assignments is as follows (a concrete check appears after the list):
* input_sharding.tile_assignment.Reshape(5, 2, 3).Transpose(1, 0, 2)
* input_sharding.tile_assignment.Reshape(5, 3, 2).Transpose(2, 0, 1)
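
The following standalone snippet materializes the first of these assignments; plain arrays stand in for the iota tile assignment, so this is an illustration rather than XLA code:

```
#include <cstdio>

int main() {
  // Reshape the iota device list <=[30] to shape (5, 2, 3), row-major.
  int devices[5][2][3];
  for (int i = 0; i < 30; ++i) devices[i / 6][(i / 3) % 2][i % 3] = i;
  // Transpose(1, 0, 2): the resulting tile assignment has shape (2, 5, 3).
  for (int a = 0; a < 2; ++a)
    for (int b = 0; b < 5; ++b)
      for (int c = 0; c < 3; ++c)
        std::printf("tile[%d][%d][%d] = device %d\n", a, b, c, devices[b][a][c]);
  return 0;
}
```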

With the defined `compatible` function, the new tile assignment will be `ta3 = intersection(compatible(ta1), compatible(ta2))`.

Since ta3 is built by reshape and transpose from ta1 and ta2, s3 is in the iota format if s1 or s2 is in the iota format.

PiperOrigin-RevId: 597898884
The design of sugared HLO async ops is clunky, as it requires running multiple passes (specifically AsyncOpCanonicalizer, FlattenCallGraph and DCE) to clean up the generated HLO from duplicate async wrapped computations. Instead of deferring the construction of the async op chain, it is possible to construct it at parse time, which makes the previously mentioned passes redundant.

Make *-{done,update} infer their thread and computation attributes from their operand (still allowing these attributes for backwards compatibility) and deprecate "async_group_id" (subsequently purging the AsyncOpCanonicalizer pass). This removes the need to specify any attributes for these ops.

PiperOrigin-RevId: 597908702
…que lib

This is mostly NFC, just moving code around and cleaning up build files. Real changes will be in the follow-up changes.

Reverts ef66a6b

PiperOrigin-RevId: 597908777
… Listener support was previously removed (and had no initialization anyway).

PiperOrigin-RevId: 597917086
PiperOrigin-RevId: 597935429
`//third_party/bazel_platforms/os:emscripten` was recently added.
Migrate existing usages of `//third_party/bazel_platforms/cpu:wasm32`
that seem better suited to the emscripten constraint.

PiperOrigin-RevId: 597951755
…g_util` if the length of the vector is similar to the rank of a tensor.

PiperOrigin-RevId: 597959646
…eadlocks

This is required to guarantee that all initialization (that can allocate new data structures on device) is completed before we start executing any of the replicas/partitions, because otherwise we can get NCCL + CUDA graphs deadlocks.

PiperOrigin-RevId: 597971719
PiperOrigin-RevId: 598064412
jreiffers and others added 20 commits January 22, 2024 08:27
PiperOrigin-RevId: 600461196
…iveBFCAllocator in PJRT (3/3)

Imported from GitHub PR openxla#7963

If `collective_memory_size` is non-zero in the `GPUAllocatorConfig`, then a second BFCAllocator will be created for each device that uses `CollectiveMemoryAllocate` for its suballocator. For allocations with `memory_space == 1`, the allocation will be routed to this `CollectiveBFCAllocator` instead of the regular device memory allocator.

Main PR here: openxla#7843
Copybara import of the project:

--
78a6cce by Trevor Morris <[email protected]>:

NCCL User Buffer Registration - PJRT changes

--
852cd6f by Trevor Morris <[email protected]>:

Allow MultiDeviceAdapter to switch between different allocators based on memory_space

--
1f5e217 by Trevor Morris <[email protected]>:

Allow nullptr in Deallocate. Set default collective memory size to 0.

--
9aa3e75 by Trevor Morris <[email protected]>:

Add unit test for MultiDeviceAllocator

--
7266042 by Trevor Morris <[email protected]>:

Check status of CollectiveMemoryAlloc/Dealloc

--
8e371d4 by Trevor Morris <[email protected]>:

Fix platform allocator

--
fb8249b by Trevor Morris <[email protected]>:

Fix build error

Merging this change closes openxla#7963

COPYBARA_INTEGRATE_REVIEW=openxla#7963 from trevor-m:memoryspaces-pjrt fb8249b
PiperOrigin-RevId: 600471007
In summary:
- No dynamic dimensions are permitted in broadcast op result shapes.
- HLO Broadcast doesn't support unbounded dynamism, operand or output.
- HLO BroadcastInDim permits dynamic operands as long as result shape is static.

Dynamic output shapes must use `stablehlo.dynamic_broadcast_in_dim` for both `broadcast` and `broadcast_in_dim` uses.

PiperOrigin-RevId: 600476895
This allows compiling PTX to CUBIN using NVIDIA's libnvptxcompiler library instead of calling ptxas. Compiling XLA with libnvptxcompiler is optional and disabled by default because libnvptxcompiler is only available as a static library and linking against it significantly increases the binary size.

The feature itself is behind a (runtime) flag and is disabled by default for now (even when XLA is compiled with libnvptxcompiler support).
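
For context, this is roughly what in-process PTX compilation with libnvptxcompiler looks like; a hedged sketch where the `sm_80` target and options are illustrative and error handling is elided:

```
#include <nvPTXCompiler.h>
#include <cstring>
#include <vector>

std::vector<char> CompilePtxToCubin(const char* ptx) {
  nvPTXCompilerHandle compiler = nullptr;
  nvPTXCompilerCreate(&compiler, std::strlen(ptx), ptx);
  const char* opts[] = {"--gpu-name=sm_80", "-O3"};  // ptxas-style options
  nvPTXCompilerCompile(compiler, 2, opts);
  size_t cubin_size = 0;
  nvPTXCompilerGetCompiledProgramSize(compiler, &cubin_size);
  std::vector<char> cubin(cubin_size);
  nvPTXCompilerGetCompiledProgram(compiler, cubin.data());
  nvPTXCompilerDestroy(&compiler);
  return cubin;
}
```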

PiperOrigin-RevId: 600488239
Updates LLVM usage to match
[21830c913505](llvm/llvm-project@21830c913505)

PiperOrigin-RevId: 600494781
…ffsets in spmd_partitioner_util.

To avoid iterating over tile_assignment.array() twice in PartialReplicateReshardCompatibleSharding, we make it depend on hlo_sharding_util::IsSubTilingOrEqualSharding in this cl. Since we accelerated IsSubTilingOrEqualSharding in cl/598940220, this also accelerates PartialReplicateReshardCompatibleSharding.

PiperOrigin-RevId: 600508054
The normal loop emitter generates concatenate ops as a branch into each of the operands. The new concatenate emitter generates all operand slices sequentially, avoiding high warp divergence.
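
To illustrate the difference in scalar terms (a sketch of the two codegen shapes, not the emitters' actual output):

```
// Loop-emitter shape: every output element picks its operand via a branch,
// so on a GPU, neighboring threads near slice boundaries diverge.
void ConcatBranchy(const float* a, int na, const float* b, int nb, float* out) {
  for (int i = 0; i < na + nb; ++i)
    out[i] = (i < na) ? a[i] : b[i - na];
}

// Concatenate-emitter shape: each operand's slice gets its own sequential
// loop, so all threads working on a slice take the same path.
void ConcatSequential(const float* a, int na, const float* b, int nb, float* out) {
  for (int i = 0; i < na; ++i) out[i] = a[i];
  for (int i = 0; i < nb; ++i) out[na + i] = b[i];
}
```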

PiperOrigin-RevId: 600515857
Also includes minor readability improvements.

PiperOrigin-RevId: 600520568
…ency with xla::InvalidArgumentStrCat and others.

This also lets a common macro be used.

PiperOrigin-RevId: 600535070
…le loops while nested loop fusion is still off.

PiperOrigin-RevId: 600537737
…stead of the value in dot_handler.

PiperOrigin-RevId: 600542456
…ic. This is the semantics of StableHLO reshape with dynamic dims.

In summary:
- Reshape from `[?] -> [X,Y,...,Z]` is valid, as all result dims are static.
- Combinations of bounded and unbounded dims are not supported, since we cannot propagate the bounded dims: `[?, <=3] --> [3]` errors.

PiperOrigin-RevId: 600559707
…is called on a sharded array.

Fixes jax-ml/jax#19134

PiperOrigin-RevId: 600570354
PiperOrigin-RevId: 600575641
PiperOrigin-RevId: 600577027
…d move to Thunk

CollectiveExecuteParams is a companion of Thunk::ExecuteParams and has to be defined close to it.

Also convert class to struct for consistency with ExecuteParams.

PiperOrigin-RevId: 600581907
…efore initialization and execution

PiperOrigin-RevId: 600586067
@Tixxx Tixxx force-pushed the tixxx/collective_matmul_runtime branch from 1e13492 to 391ff8a on January 23, 2024 00:46
@Tixxx Tixxx changed the base branch from tixxx/collective_matmul_hlo to main January 23, 2024 00:46
@Tixxx Tixxx closed this Jan 23, 2024
Tixxx pushed a commit that referenced this pull request Apr 3, 2024
Currently we look for ptxas and nvlink in a few different places on the host machine, then we choose the first binary found without taking its version into account. If the chosen binary doesn't fulfill our version requirements, we will later fail even if there was a suitable ptxas or nvlink in the search path in the first place.

This change makes it take the version of each binary into account when going through the search path. Unsuitable binaries will be discarded right away and the search continues until we are out of locations to check.

This should help with host environments that have multiple CUDA toolkits installed and should make ptxas and nvlink selection more robust.

The concrete changes:

1. `FindCudaExecutable` now also takes a minimum version and a list of forbidden (think buggy) versions that are supposed to be skipped.
2. `WarnIfBadPtxAsVersion` has been removed. It was checking for ptxas < 11.1, which is way older than our minimum supported version of 11.8, and was not doing anything given the check described in item 3.
3. There was another version check for `ptxas` in `NVPTXCompiler::ChooseLinkingMethod` which was checking for `version(ptxas)` < 11.8. This has also been removed/replaced by the version check described in item 4.
4. Version checking for `ptxas` and `nvlink` has been consolidated into two methods, `FindPtxAsExecutable` and `FindNvLinkExecutable`. These methods hard-code the current minimum version (and the list of excluded versions) of each tool in one place. It's still not great, but at least less spaghetti-like.
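
A hedged sketch of the selection loop (helper names like `QueryVersion` are hypothetical; the real logic lives in the functions named above):

```
#include <optional>
#include <set>
#include <string>
#include <vector>

struct Version { int major = 0, minor = 0; };
bool operator<(Version a, Version b) {
  return a.major != b.major ? a.major < b.major : a.minor < b.minor;
}

// Assumed helper, stubbed here: would run `<path> --version` and parse it.
std::optional<Version> QueryVersion(const std::string& /*path*/) {
  return std::nullopt;
}

std::optional<std::string> FindTool(const std::vector<std::string>& candidates,
                                    Version min_version,
                                    const std::set<std::string>& excluded) {
  for (const std::string& path : candidates) {
    std::optional<Version> v = QueryVersion(path);
    if (!v || *v < min_version) continue;  // unreadable or too old
    const std::string tag =
        std::to_string(v->major) + "." + std::to_string(v->minor);
    if (excluded.count(tag)) continue;  // known-buggy release: keep looking
    return path;                        // first suitable binary wins
  }
  return std::nullopt;  // nothing suitable in the entire search path
}
```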

PiperOrigin-RevId: 618797392
Tixxx pushed a commit that referenced this pull request May 23, 2024
Tixxx pushed a commit that referenced this pull request May 23, 2024
…d phase to Initialize()

Imported from GitHub PR openxla#12228

The first time that a NormThunk is executed, it will build a cudnn execution plan. This build step can hang if an NCCL collective is running at the same time. To fix this, I've moved the build step to take place during thunk initialization. We only observe this hang when using cudnn 9.

Here's a backtrace from the hang that will be fixed:
```
Thread 585 (Thread 0x7fb9391ff640 (LWP 41364) "main.py"):
#0  0x00007fd3d17cffd9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fd3d17da24f in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fd070967dfe in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fd0709c928a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f1970d76102 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#5  0x00007f1970f2c999 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#6  0x00007f1970a7d4ab in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#7  0x00007f1970d0a9cb in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#8  0x00007fce60b2a98c in cudnn::backend::ExecutionPlan::finalize_internal() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#9  0x00007fce60aefbb1 in cudnn::backend::Descriptor::finalize() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#10 0x00007fce60b15bec in cudnnBackendFinalize () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#11 0x00007fd2521b8f39 in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#12 0x00007fd2521734ba in stream_executor::gpu::(anonymous namespace)::GetExecPlanFromHeuristics(cudnn_frontend::OperationGraph_v8&&, stream_executor::gpu::(anonymous namespace)::CudnnHandle const&, bool) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#13 0x00007fd25216ff9b in stream_executor::gpu::CudnnSupport::NormRunnerFromDesc(stream_executor::Stream*, stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormKind, double, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#14 0x00007fd24e36b88b in stream_executor::dnn::NormOp::RunnerFromAlgorithmDesc(stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#15 0x00007fd24e36ae37 in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}::operator()() const () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#16 0x00007fd24e36adbc in void absl::lts_20230802::base_internal::CallOnceImpl<stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}>(std::atomic<unsigned int>*, absl::lts_20230802::base_internal::SchedulingMode, stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}&&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#17 0x00007fd24e36a9bd in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#18 0x00007fd24e369d29 in xla::gpu::RunGpuNorm(xla::gpu::GpuNormConfig const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, stream_executor::DeviceMemoryBase const&, stream_executor::Stream*, xla::gpu::RunNormOptions) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#19 0x00007fd24e368be6 in xla::gpu::NormThunk::ExecuteOnStream(xla::gpu::Thunk::ExecuteParams const&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
```
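
A minimal sketch of the fix's shape, with hypothetical types standing in for the real Thunk interface: the expensive cudnn plan build moves into Initialize(), which completes before any replica starts executing, so it can no longer overlap a running NCCL collective.

```
#include <memory>
#include <mutex>

struct CudnnExecutionPlan { /* opaque cudnn handle in the real code */ };

class NormThunkSketch {
 public:
  void Initialize() {
    // Build (and cache) the execution plan while no collective is in flight.
    std::call_once(once_, [this] {
      plan_ = std::make_unique<CudnnExecutionPlan>();  // expensive build step
    });
  }
  void ExecuteOnStream() {
    // Previously the plan was built lazily here, which could hang if an NCCL
    // collective was running concurrently. Now it is guaranteed to exist.
    Run(*plan_);
  }

 private:
  static void Run(const CudnnExecutionPlan&) { /* enqueue work on the stream */ }
  std::once_flag once_;
  std::unique_ptr<CudnnExecutionPlan> plan_;
};
```
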
Copybara import of the project:

--
f535330 by Trevor Morris <[email protected]>:

Fix hang with cudnn layer norm by moving cudnn init to Initialize()

Merging this change closes openxla#12228

COPYBARA_INTEGRATE_REVIEW=openxla#12228 from trevor-m:tmorris-norm-init f535330
PiperOrigin-RevId: 633220207
Tixxx pushed a commit that referenced this pull request Dec 20, 2024
Fixes the following TSAN race:

```
WARNING: ThreadSanitizer: data race (pid=899472)
  Write of size 8 at 0x7f979e0f1cd8 by thread T69:
    #0 llvm::TargetRegistry::RegisterTargetMachine(llvm::Target&, llvm::TargetMachine* (*)(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool)) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:827:27 (xla_extension.so+0x9803668) (BuildId: 6fa88e3910a5eb04)
    #1 llvm::RegisterTargetMachine<llvm::X86TargetMachine>::RegisterTargetMachine(llvm::Target&) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:1250:5 (xla_extension.so+0x9803668)
    #2 LLVMInitializeX86Target /proc/self/cwd/external/llvm-project/llvm/lib/Target/X86/X86TargetMachine.cpp:69:43 (xla_extension.so+0x9803668)
    #3 llvm::InitializeNativeTarget() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/TargetSelect.h:123:5 (xla_extension.so+0x48d2358) (BuildId: 6fa88e3910a5eb04)
    #4 xla::cpu::JitCompiler::Create(llvm::TargetOptions, xla::cpu::JitCompiler::Options, absl::lts_20230802::AnyInvocable<void (std::function<void ()>)>)::$_0::operator()() const /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:113:5 (xla_extension.so+0x48d2358)
    #5 xla::cpu::JitCompiler::Create(llvm::TargetOptions, xla::cpu::JitCompiler::Options, absl::lts_20230802::AnyInvocable<void (std::function<void ()>)>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:112:34 (xla_extension.so+0x48d209b) (BuildId: 6fa88e3910a5eb04)
    #6 xla::cpu::CpuCompiler::CompileLegacyCpuExecutable(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1416:3 (xla_extension.so+0x2f716a0) (BuildId: 6fa88e3910a5eb04)
    #7 xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1730:3 (xla_extension.so+0x2f7ae18) (BuildId: 6fa88e3910a5eb04)
    #8 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:759:19 (xla_extension.so+0x2f12915) (BuildId: 6fa88e3910a5eb04)
    #9 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:847:3 (xla_extension.so+0x2f12915)

  Previous read of size 8 at 0x7f979e0f1cd8 by thread T66:
    #0 llvm::Target::createTargetMachine(llvm::StringRef, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) const /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:460:10 (xla_extension.so+0x94ba6db) (BuildId: 6fa88e3910a5eb04)
    #1 llvm::EngineBuilder::selectTarget(llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::SmallVectorImpl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) /proc/self/cwd/external/llvm-project/llvm/lib/ExecutionEngine/TargetSelect.cpp:88:18 (xla_extension.so+0x94ba6db)
    #2 xla::cpu::JitCompiler::InferTargetMachine(llvm::TargetOptions const&, llvm::CodeGenOptLevel, std::optional<tsl::port::CPUFeature>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:88:12 (xla_extension.so+0x48d096f) (BuildId: 6fa88e3910a5eb04)
    #3 xla::cpu::CpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1017:3 (xla_extension.so+0x2f70857) (BuildId: 6fa88e3910a5eb04)
    #4 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:754:3 (xla_extension.so+0x2f12874) (BuildId: 6fa88e3910a5eb04)
    #5 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:847:3 (xla_extension.so+0x2f12874)
    #6 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:893:10 (xla_extension.so+0x2f13ef2) (BuildId: 6fa88e3910a5eb04)
```
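
The commit body above only shows the race; the usual remedy, and presumably the shape of the fix here, is to funnel LLVM target initialization through a single-shot guard so registration cannot race with TargetMachine creation. A sketch under that assumption:

```
#include <mutex>

#include "llvm/Support/TargetSelect.h"

// Call this before any thread may create a TargetMachine.
void InitializeLlvmNativeTargetOnce() {
  static std::once_flag once;
  std::call_once(once, [] {
    llvm::InitializeNativeTarget();
    llvm::InitializeNativeTargetAsmPrinter();
  });
}
```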

PiperOrigin-RevId: 707701032
Tixxx pushed a commit that referenced this pull request Dec 20, 2024
…r RunBackend.

Both of these call into LLVM code that reads the compiler options.

Fixes the following race:

```
WARNING: ThreadSanitizer: data race (pid=869815)
  Read of size 1 at 0x7f8b24effc08 by thread T65:
    #0 llvm::cl::opt_storage<bool, false, false>::getValue() const /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1406:38 (xla_extension.so+0xa281417) (BuildId: 7f5d2098f168c4db)
    #1 llvm::cl::opt_storage<bool, false, false>::operator bool() const /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1410:38 (xla_extension.so+0xa281417)
    #2 llvm::CodeGenTargetMachineImpl::CodeGenTargetMachineImpl(llvm::Target const&, llvm::StringRef, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, llvm::Reloc::Model, llvm::CodeModel::Model, llvm::CodeGenOptLevel) /proc/self/cwd/external/llvm-project/llvm/lib/CodeGen/CodeGenTargetMachineImpl.cpp:97:7 (xla_extension.so+0xa281417)
    #3 llvm::X86TargetMachine::X86TargetMachine(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) /proc/self/cwd/external/llvm-project/llvm/lib/Target/X86/X86TargetMachine.cpp:236:7 (xla_extension.so+0x9803b80) (BuildId: 7f5d2098f168c4db)
    #4 llvm::RegisterTargetMachine<llvm::X86TargetMachine>::Allocator(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:1258:16 (xla_extension.so+0x980757a) (BuildId: 7f5d2098f168c4db)
    #5 llvm::Target::createTargetMachine(llvm::StringRef, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) const /proc/self/cwd/external/llvm-project/llvm/include/llvm/MC/TargetRegistry.h:462:12 (xla_extension.so+0x94ba529) (BuildId: 7f5d2098f168c4db)
    #6 llvm::EngineBuilder::selectTarget(llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::SmallVectorImpl<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) /proc/self/cwd/external/llvm-project/llvm/lib/ExecutionEngine/TargetSelect.cpp:88:18 (xla_extension.so+0x94ba529)
    #7 xla::cpu::JitCompiler::InferTargetMachine(llvm::TargetOptions const&, llvm::CodeGenOptLevel, std::optional<tsl::port::CPUFeature>) /proc/self/cwd/external/xla/xla/backends/cpu/codegen/jit_compiler.cc:88:12 (xla_extension.so+0x48d070f) (BuildId: 7f5d2098f168c4db)
    #8 xla::cpu::CpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1017:3 (xla_extension.so+0x2f6dc47) (BuildId: 7f5d2098f168c4db)
    #9 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:749:3 (xla_extension.so+0x2f127e2) (BuildId: 7f5d2098f168c4db)
    #10 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:842:3 (xla_extension.so+0x2f127e2)
    #11 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:888:10 (xla_extension.so+0x2f13da2) (BuildId: 7f5d2098f168c4db)
    #12 xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_executable.cc:258:3 (xla_extension.so+0xdd04d77) (BuildId: 7f5d2098f168c4db)
    #13 xla::ifrt::PjRtCompiler::Compile(std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc:97:10 (xla_extension.so+0xdcfd29b) (BuildId: 7f5d2098f168c4db)
    #14 xla::PyClient::CompileIfrtProgram(xla::nb_class_ptr<xla::PyClient>, std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:443:5 (xla_extension.so+0xc62a228) (BuildId: 7f5d2098f168c4db)
    #15 xla::PyClient::Compile(xla::nb_class_ptr<xla::PyClient>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, xla::CompileOptions, std::vector<nanobind::capsule, std::allocator<nanobind::capsule>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:466:10 (xla_extension.so+0xc62b514) (BuildId: 7f5d2098f168c4db)

  Previous write of size 1 at 0x7f8b24effc08 by thread T66 (mutexes: write M0):
    #0 void llvm::cl::opt_storage<bool, false, false>::setValue<bool>(bool const&, bool) /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1401:11 (xla_extension.so+0x100bace9) (BuildId: 7f5d2098f168c4db)
    #1 void llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::setDefaultImpl<bool, void>() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h (xla_extension.so+0x100bace9)
    #2 llvm::cl::opt<bool, false, llvm::cl::parser<bool>>::setDefault() /proc/self/cwd/external/llvm-project/llvm/include/llvm/Support/CommandLine.h:1474:32 (xla_extension.so+0x100bace9)
    #3 llvm::cl::Option::reset() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:460:3 (xla_extension.so+0x100cac0e) (BuildId: 7f5d2098f168c4db)
    #4 (anonymous namespace)::CommandLineParser::ResetAllOptionOccurrences() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:1478:17 (xla_extension.so+0x100cac0e)
    #5 llvm::cl::ResetAllOptionOccurrences() /proc/self/cwd/external/llvm-project/llvm/lib/Support/CommandLine.cpp:2831:17 (xla_extension.so+0x100caa72) (BuildId: 7f5d2098f168c4db)
    #6 xla::llvm_ir::LLVMCommandLineOptionsLock::LLVMCommandLineOptionsLock(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>> const&) /proc/self/cwd/external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:70:5 (xla_extension.so+0x91d69f4) (BuildId: 7f5d2098f168c4db)
    #7 xla::cpu::CpuCompiler::RunBackend(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /proc/self/cwd/external/xla/xla/service/cpu/cpu_compiler.cc:1727:39 (xla_extension.so+0x2f781c8) (BuildId: 7f5d2098f168c4db)
    #8 xla::JitCompile(xla::XlaComputation const&, absl::lts_20230802::Span<xla::Shape const* const>, xla::ExecutableBuildOptions const&, xla::ExecutionOptions const&, xla::Compiler::CompileOptions const&, int, std::function<void (xla::HloModuleConfig&)>) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:754:19 (xla_extension.so+0x2f12883) (BuildId: 7f5d2098f168c4db)
    #9 xla::TfrtCpuClient::Compile(xla::XlaComputation const&, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:842:3 (xla_extension.so+0x2f12883)
    #10 xla::TfrtCpuClient::Compile(mlir::ModuleOp, xla::CompileOptions) /proc/self/cwd/external/xla/xla/pjrt/cpu/cpu_client.cc:888:10 (xla_extension.so+0x2f13da2) (BuildId: 7f5d2098f168c4db)
    #11 xla::ifrt::PjRtLoadedExecutable::Create(xla::ifrt::PjRtCompatibleClient*, mlir::ModuleOp, xla::CompileOptions, std::vector<tsl::RCReference<xla::ifrt::LoadedHostCallback>, std::allocator<tsl::RCReference<xla::ifrt::LoadedHostCallback>>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_executable.cc:258:3 (xla_extension.so+0xdd04d77) (BuildId: 7f5d2098f168c4db)
    #12 xla::ifrt::PjRtCompiler::Compile(std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc:97:10 (xla_extension.so+0xdcfd29b) (BuildId: 7f5d2098f168c4db)
    #13 xla::PyClient::CompileIfrtProgram(xla::nb_class_ptr<xla::PyClient>, std::unique_ptr<xla::ifrt::Program, std::default_delete<xla::ifrt::Program>>, std::unique_ptr<xla::ifrt::CompileOptions, std::default_delete<xla::ifrt::CompileOptions>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:443:5 (xla_extension.so+0xc62a228) (BuildId: 7f5d2098f168c4db)
    #14 xla::PyClient::Compile(xla::nb_class_ptr<xla::PyClient>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, xla::CompileOptions, std::vector<nanobind::capsule, std::allocator<nanobind::capsule>>) /proc/self/cwd/external/xla/xla/python/py_client.cc:466:10 (xla_extension.so+0xc62b514) (BuildId: 7f5d2098f168c4db)
```
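
Given the `LLVMCommandLineOptionsLock` frames above and the truncated title, the fix is presumably to take that lock in RunHloPasses as well as RunBackend. A simplified sketch of the implied locking discipline (the real lock lives in `xla::llvm_ir::LLVMCommandLineOptionsLock`):

```
#include <shared_mutex>

// Stand-in for the global mutex guarding LLVM's cl::opt storage.
std::shared_mutex llvm_options_mutex;

void RunHloPassesLike() {
  std::shared_lock<std::shared_mutex> lock(llvm_options_mutex);
  // ... infer the target machine / read compiler options (read-only) ...
}

void RunBackendLike() {
  std::unique_lock<std::shared_mutex> lock(llvm_options_mutex);
  // ... set options (llvm::cl::ResetAllOptionOccurrences) and run codegen ...
}
```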

PiperOrigin-RevId: 707721170