Cost Models in XLA GPU – Present and Future #10065
thomasjoerg
Intro
With the introduction of priority-based fusion (see RFC), XLA GPU uses a cost model to reason about the benefits of HLO fusions. A cost model can make better decisions than heuristics that merely prevent some known bad patterns. A cost model also separates the compiler passes that perform HLO fusion from the reasoning about which HLO fusions are desirable. This separation of concerns is a key step towards codegen optionality, i.e. making it much easier to target different 3rd party GPU libraries and compilers from XLA GPU.
In this post, we describe how XLA GPU’s cost model works, discuss known shortcomings, and present upcoming improvements. We also discuss plans to generalize the cost model for Triton-based codegen in XLA GPU. Finally, we share our thoughts on using XLA GPU’s cost model in combination with cost models provided by 3rd parties.
By sharing our plans and ideas, we hope to engage in discussions with the community. As you will notice, some ideas are still vague. Nevertheless, we feel that sharing our current thinking with the community will lead to better results.
XLA GPU’s HLO cost model
Status quo
XLA GPU uses an analytical cost model to predict the device time of HLO ops, including `fusion` ops. The code location is https://github.com/openxla/xla/tree/main/xla/service/gpu/model. The primary use of the HLO cost model is to steer priority-based fusion (see [RFC] XLA:GPU Priority-based fusion pass: #6407). The priority order is determined by the modeled reduction in kernel runtime, so the most effective fusions are carried out first. XLA GPU’s analytical cost model returns runtime predictions as `absl::Duration` for given HLO `fusion` ops. The model estimates compute time and memory read+write time independently and returns whichever is larger. Finally, kernel launch overhead is added. This matters because priority fusion reduces the number of kernels (by one in each step) and thus reduces kernel launch overhead.

Device time = max(compute time, memory read+write time) + kernel launch overhead
This and the other equations in this post gloss over some details to keep things brief and simple (see #8170 to dig deeper).
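To make the top-level formula concrete, here is a minimal Python sketch of the roofline-style estimate. The launch overhead constant and the example inputs are placeholders chosen for illustration; they are not the values used by XLA.

```python
# Minimal sketch of the roofline-style device time estimate described above.
# The constant and inputs are illustrative placeholders, not XLA's values.

KERNEL_LAUNCH_OVERHEAD_S = 5e-6  # assumed fixed per-kernel launch cost


def device_time_s(compute_time_s: float, memory_time_s: float) -> float:
    """Device time = max(compute time, memory read+write time) + launch overhead."""
    return max(compute_time_s, memory_time_s) + KERNEL_LAUNCH_OVERHEAD_S


# Example: a kernel estimated at 40 us of compute and 120 us of memory traffic
# is modeled as memory bound, plus the launch overhead.
print(device_time_s(40e-6, 120e-6))  # -> 0.000125
```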
Compute time
Compute time is roughly estimated as the computational cost divided by the achievable parallelism. The relevant code lives in xla/service/gpu/model/gpu_performance_model_base.cc and xla/service/gpu/model/gpu_performance_model.cc.
Compute time = compute cost / compute parallelism
XLA uses `HloCostAnalysis` and its specialization `GpuHloCostAnalysis` to traverse HLO computations and sum up the computation cost. These classes implement callback functions that return the cost of HLO ops on a given hardware platform. The number of floating point ops is scaled by the cost per op (in clock cycles) to arrive at an estimate (see xla/service/gpu/model/hlo_op_profiles_data.h).

The achievable parallelism of an HLO computation on GPU is modeled as the total number of CUDA threads. It depends on the codegen emitter that will be used. For some HLO ops, XLA GPU has dedicated emitters with high parallelism. An example is XLA GPU’s reduction emitter, which codegens parallel reductions using warp shuffles and shared memory for increased parallelism.
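As a rough illustration, a heavily simplified version of this estimate could look like the sketch below. The clock rate and the per-op cycle table are made-up placeholders; the real per-platform tables live in hlo_op_profiles_data.h.

```python
# Illustrative sketch only: op counts are scaled by a per-op cost in clock
# cycles and divided by the parallelism the chosen emitter can achieve.
# The clock rate and per-op costs below are assumptions, not XLA's tables.

CLOCK_HZ = 1.4e9  # assumed GPU clock rate
CYCLES_PER_OP = {"add": 1, "multiply": 1, "divide": 10, "exp": 20}


def compute_time_s(op_counts: dict, num_cuda_threads: int) -> float:
    """Compute time ~= total op cost in cycles / (parallelism * clock rate)."""
    total_cycles = sum(CYCLES_PER_OP[op] * n for op, n in op_counts.items())
    return total_cycles / (num_cuda_threads * CLOCK_HZ)


# Example: 1M adds and 1M exps emitted with 64K CUDA threads of parallelism.
print(compute_time_s({"add": 1_000_000, "exp": 1_000_000}, 64 * 1024))
```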
Memory read+write time
Not surprisingly, memory read+write times are computed based on the amount of data read+written and the memory bandwidth of the given GPU architecture.
operand read time = (operand size * operand utilization) / ((memory bandwidth / coalescing waste factor) * caching speedup)
Operands are not always read entirely. HLO ops like `slice` can reduce the data that is read, which is captured as an operand utilization of less than one. Operand utilization can also be greater than one for operands that are read multiple times. This can be caused by HLO ops that read the same data for different output threads, like `broadcast` or `reverse`. Operand utilization is computed by the `HloCostAnalysis` already mentioned in the previous section.

The coalescing waste factor assumes fully strided access, i.e. just one element per cache line is used and the remaining bandwidth is wasted. Since V100, the default transaction size from DRAM to L2 cache is 64 B. Strided access on an f32 type, for instance, would be modeled as a waste factor of 16 (= 64 B / 4 B), i.e. only one sixteenth of the transferred bytes is useful.
The caching speedup is determined by a rather crude heuristic based on the amount of data read from an operand. Small amounts of data are assumed to fit into L1 or L2 cache and the memory bandwidth is multiplied by a constant factor. Each parameter is modeled independently; there is no model for cache evictions due to reads of multiple small parameters.
There are a few subtleties omitted in the description above: Coalescing waste factors are applied to the first read from DRAM only, but not to subsequent reads from cache. Moreover, the memory bandwidth is degraded if the achieved occupancy of the GPU kernel is expected to be poor. This may happen for HLO fusions that process small amounts of data and exhibit little parallelism.
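Putting these pieces together, a heavily simplified sketch of the per-operand read time (and of the write time discussed next) is shown below. The bandwidth, transaction size, cache capacity, and speedup constants are assumptions for illustration; the actual model uses per-architecture values and more nuance.

```python
# Illustrative sketch of the memory read/write time estimates; all constants
# are assumed example values, not the ones used by XLA.

PEAK_BANDWIDTH_BPS = 2.0e12      # assumed HBM bandwidth in bytes/second
TRANSACTION_BYTES = 64           # assumed DRAM -> L2 transaction size
CACHE_SPEEDUP = 3.0              # assumed bonus when an operand fits in cache
CACHE_BYTES = 40 * 1024 * 1024   # assumed cache capacity


def operand_read_time_s(operand_bytes: int, utilization: float,
                        element_bytes: int, coalesced: bool) -> float:
    # Coalescing is binary in the current model: fully coalesced or fully strided.
    waste_factor = 1.0 if coalesced else TRANSACTION_BYTES / element_bytes
    # Crude caching heuristic: small operands get a bandwidth bonus.
    caching_speedup = CACHE_SPEEDUP if operand_bytes <= CACHE_BYTES else 1.0
    effective_bandwidth = PEAK_BANDWIDTH_BPS / waste_factor * caching_speedup
    return operand_bytes * utilization / effective_bandwidth


def write_time_s(output_bytes: int) -> float:
    # Writes are assumed to be coalesced.
    return output_bytes / PEAK_BANDWIDTH_BPS


# Example: a 256 MiB f32 operand read once, strided vs. coalesced.
operand = 256 * 1024 * 1024
print(operand_read_time_s(operand, 1.0, 4, coalesced=False))  # ~16x slower
print(operand_read_time_s(operand, 1.0, 4, coalesced=True))
```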
write time = output size / memory bandwidth
The write time is estimated in a similar fashion. Writes are assumed to be coalesced. HLO ops that write in place (`DynamicUpdateSlice`) are modeled with an output size corresponding to the in-place update rather than the entire output buffer.

Known shortcomings and upcoming improvements
“All models are wrong, but some are useful” --George Box.
The HLO cost model is certainly no exception and we are aware of several shortcomings that should be addressed to make the model more useful.
Read coalescing is determined heuristically by looking for combinations of HLO ops that are known to be problematic (see xla/service/gpu/model/coalescing_analysis.cc). The code prevents pathological cases, e.g. “don’t fuse `reduce` with `reduce`” or “don’t fuse `transpose` of the most minor dimensions”. The heuristics are necessarily imperfect and incomplete. Moreover, the notion of coalescing is binary, i.e. reads are modeled as either fully coalesced or fully strided. The XLA GPU team is working on a more principled approach building on HLO indexing analysis. Reasoning about memory coalescing becomes feasible once we know which elements/slices of the inputs are read to compute an element of the output. HLO indexing maps provide exactly this information for given HLO computations.

Similarly, operand utilization can be determined more accurately with HLO indexing analysis. The HLO indexing maps immediately show which parts of the operands are read. Moreover, output-to-input indexing maps with more than one entry per output element indicate that operands need to be read more than once (see the section "One input, several indexing maps" in the HLO indexing analysis documentation).
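As a toy illustration of the idea (this is not XLA’s indexing analysis API, just a sketch under simplified assumptions), one can walk an output-to-input indexing map to estimate operand utilization and to check whether neighbouring output elements read neighbouring input elements:

```python
# Toy example: derive utilization and a crude coalescing signal from an
# output-to-input indexing map, represented here as a plain Python function.
import itertools


def analyze(map_fn, out_shape, in_shape):
    in_size = 1
    for d in in_shape:
        in_size *= d
    touched = []
    for out_idx in itertools.product(*(range(d) for d in out_shape)):
        in_idx = map_fn(out_idx)
        linear = 0
        for d, i in zip(in_shape, in_idx):
            linear = linear * d + i  # row-major linearization
        touched.append(linear)
    # Reads per input element: < 1 for slices, > 1 for broadcast-like reuse.
    utilization = len(touched) / in_size
    # Crude coalescing proxy: consecutive output elements read (almost) consecutive inputs.
    coalesced = all(abs(b - a) <= 1 for a, b in zip(touched, touched[1:]))
    return utilization, coalesced


# Transpose of a 4x8 input: every element is read exactly once, but reads are strided.
print(analyze(lambda o: (o[1], o[0]), out_shape=(8, 4), in_shape=(4, 8)))  # (1.0, False)
```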
In the following scatter plot, each dot depicts a CUDA kernel corresponding to an HLO fusion. The position of each dot shows the measured vs. predicted kernel execution time across a range of models. An ideal cost model would put all dots along the diagonal line. For dots above the diagonal, the predicted cost is too low; for dots below the diagonal, the predicted cost is too high.
The plot shows a mostly linear correlation between predicted and measured cost. There are outliers, mostly within one order of magnitude. The predictions seem least accurate for very cheap kernels at the lower left. We are working on improving accuracy, but found the HLO cost model in this state useful to drive fusion decisions in XLA GPU. Priority fusion targets a single codegen option, namely HLO emitters, and doesn’t require an actual time estimate - “fuse if modeled cost is lower” works fine. However, we want to evolve priority fusion to target multiple different codegen options. More on this in the next section.
The modeling of caching effects is very crude. Small operands are assumed to be cached, without looking at the size of operands that compete for cache space. The XLA GPU team looked into more fine-grained cache control via hints, e.g. to avoid caching operands that are known to be read just once. However, experiments did not show clear wins, most likely due to the overheads of changing CUDA cache configs and the lower quality of LLVM optimizations for caching intrinsics.
Generalizing the HLO cost model
The XLA GPU team plans to generalize the HLO cost model to Triton codegen for SoftMax/LayerNorm. This will be discussed in a separate RFC. The gist of it is that the HLO cost model is mostly codegen-agnostic. It should be generalizable to Triton by changing the way parallelism and load coalescing are determined. The fusion capabilities of Triton SoftMax/LayerNorm codegen in XLA GPU will be more limited, at least initially. The HLO cost model will help make the right trade-off.
Generalizing to GEMMs is trickier, because inputs are re-read, which makes prologue fusion become unprofitable (compute bound) more quickly. While matrix multiplies have a fixed amount of compute (the number of FMAs), their memory access pattern depends on shapes and tile sizes. In general, the amount of recomputation goes down as the tile size goes up, allowing for more fusion. On the other hand, fusion has an impact on tile size selection, e.g. fusing an s8->bf16 convert into one of the operands allows selecting a larger tile size. We anticipate the need for a cheap way to predict good tile sizes during fusion passes, and the continued use of auto-tuning to refine tile sizes after fusion decisions are finalized.
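A back-of-the-envelope sketch, assuming perfect reuse within an output tile and ignoring the output writes, illustrates the re-read trade-off; the shapes and tile sizes are example values, not XLA’s tuning choices.

```python
# Simplified model of HBM traffic for an MxK times KxN GEMM with output tiling:
# the LHS is re-read once per column of output tiles, the RHS once per row.
import math


def gemm_bytes_read(m: int, n: int, k: int, tile_m: int, tile_n: int,
                    elem_bytes: int = 2) -> int:
    tiles_m = math.ceil(m / tile_m)
    tiles_n = math.ceil(n / tile_n)
    lhs_reads = m * k * tiles_n
    rhs_reads = k * n * tiles_m
    return (lhs_reads + rhs_reads) * elem_bytes


# 4096^3 bf16 GEMM: doubling the tile size halves the input traffic.
print(gemm_bytes_read(4096, 4096, 4096, 128, 128))
print(gemm_bytes_read(4096, 4096, 4096, 64, 64))  # ~2x more bytes read
```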
Combined use with 3rd party cost models
Comparing predictions across different cost models is interesting when the corresponding codegen options have overlapping capabilities; only in such cases does XLA GPU actually have a choice. A number of options exist for GEMMs: on NVIDIA GPUs there are cuBLAS, cuBLASLt, cuDNN Graph GEMMs, and Triton. Today, XLA GPU primarily targets cuBLAS and Triton for GEMMs and relies on auto-tuning to pick the fastest option.
More recently, libraries have started to extend their fusion capabilities. cuDNN Graph GEMMs, for instance, support certain types of prologue and epilogue fusions (see https://docs.nvidia.com/deeplearning/cudnn/developer/graph-api.html#generic-runtime-fusion-engines). Similarly, Triton can flexibly fuse into GEMMs. Prologues and epilogues can alternatively be code-generated via HLO emitters. Hence, the number of options available to XLA GPU is increasing and making good choices will become more important. Auto-tuning remains an option, but it is more expensive than cost model predictions. For priority fusion, auto-tuning would be prohibitively expensive, because the search space can be very large.
We anticipate that XLA GPU will use a mix of cost models provided by hardware vendors (potentially closed source), its own HLO cost model, and some amount of auto-tuning. Comparing different cost models requires a common unit; something proportional to time (like microseconds or cycles) would obviously work. The fusion priority would then be determined by getting an estimate for the fused and unfused cost from each model and taking the minimum on each side:

priority = min(unfused_cost_i) - min(fused_cost_i), where *_cost_i is estimated by the i-th cost model.

A linear regression may be needed to make models comparable, e.g. to adjust for differences between average-case and best-case predictors.
Since achieving full comparability between cost models can be difficult, if possible at all, a more relaxed comparison that does not require direct interaction between the models is an option. For a potential fusion decision:

priority = max(unfused_cost_i - fused_cost_i)
This doesn’t mean that the model that says “yes” will be used to emit the code, but we hope that the fusion decision was correct overall. After the priority fusion pass, we run autotuning to pick the best emitter for each fusion.
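The two variants can be sketched side by side; the three cost estimates per candidate fusion below are hypothetical numbers chosen only to show how the resulting priorities differ.

```python
# Sketch of combining estimates from several cost models into a fusion priority.
# Costs are hypothetical values in microseconds from three hypothetical models.

def priority_min_of_models(unfused_costs, fused_costs):
    # Comparable-units variant: take the best estimate on each side.
    return min(unfused_costs) - min(fused_costs)


def priority_per_model(unfused_costs, fused_costs):
    # Relaxed variant: each model votes with its own delta; the largest
    # predicted win determines the priority.
    return max(u - f for u, f in zip(unfused_costs, fused_costs))


unfused = [120.0, 110.0, 150.0]  # e.g. HLO cost model, vendor model, Triton model
fused = [90.0, 115.0, 100.0]
print(priority_min_of_models(unfused, fused))  # 20.0
print(priority_per_model(unfused, fused))      # 50.0
```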