Simulating critical paths with optimization heuristics #105

Open
kvignesh1420 opened this issue Feb 22, 2024 · 7 comments

kvignesh1420 commented Feb 22, 2024

🚀 Motivation and context

Context: The CriticalPathAnalysis.critical_path_analysis(...) API provides access to a networkX DiGraph CPGraph, which is constructed using multiple CallStackGraph objects pertaining to specific process and thread pairs. Currently, the support for analysis on a CPGraph is limited to computing summary statistics based on the "boundedness" of the ops.

Motivation: Add functionality and best practices for simulating optimizations to compute/communication-bound ops using the trace dataframes and efficiently recomputing the critical paths (as also described in the future-work section of the docs).

Description

A possible approach to implement this functionality is as follows:

Setup:

  • Let us consider a CPGraph object that has been initialized with the original trace dataframe as cp_graph.
  • A call to cp_graph.critical_path() computes the critical_path_edges_set, critical_path_nodes_set, and critical_path_events_set sets.
  • Now, calling get_critical_path_breakdown() returns a dataframe (say cp_bkdwn_df) which has information about the duration and boundedness of ops. For example:
| event_idx | duration | type | s_name | cat | pid | tid | stream | index | bound_by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 359318.0 | 234598.0 | critical_path_operator | aten::mul_ | 1817.0 | 891 | 891 | -1.0 | 359438.0 | cpu_bound |

NOTE: Ideally, aten::mul_ would be executed on a GPU, but for large models where parameter offloading is required, the optimizer might perform such ops on the CPU to update its state.

Approach: (Proof of Concept)

  1. An intuitive first step is to sort these ops by cp_bkdwn_df.duration in descending order and check which ops have the highest latency.
  2. Next, if the user wishes to simulate a situation in which the duration of op aten::mul_ is reduced by 10%, then the dur entry in the trace dataframe cp_graph.trace_df corresponding to the event_idx (in this case 359318) can be modified.
  3. Alternatively, we can create a copy of cp_graph.trace_df such as cp_graph.trace_sim_df, which is used for simulations.
  4. We can retain the cp_graph.full_trace_df as a ground truth reference.
  5. To this end, we can modify the nodes/edges in cp_graph based on the new duration and recompute the longest path in the DAG using nx.dag_longest_path(cp_graph, weight="weight").
  6. The user can now iteratively modify cp_graph.trace_sim_df and continue to simulate various optimizations. For instance:
    • Simulate a 10% reduction in aten::mul_.
    • Obtain the new critical path graph with breakdowns and summary.
    • Simulate a 15% reduction in NCCLAllReduce on the newly simulated graph.
    • Obtain the new critical path graph with breakdowns and summary.

Essentially, we provide a simple API (design to be discussed) so that the user can tweak the dataframes and simulate different settings. Additionally, we retain the original full_trace dataframe in case we have to revert to the original critical path.
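Below is a minimal sketch of steps 1-5, assuming the cp_graph and cp_bkdwn_df objects from the setup above. The trace_sim_df indexing and the edge attribute used to map an event onto graph edges are illustrative assumptions, not an existing API.

```python
import networkx as nx

# Step 1: rank critical-path ops by duration to find the highest-latency ones.
slowest_ops = cp_bkdwn_df.sort_values("duration", ascending=False)
print(slowest_ops.head(10))

# Steps 2-4: simulate on a copy so the original trace stays intact.
cp_graph.trace_sim_df = cp_graph.trace_df.copy()
event_idx = 359318  # aten::mul_ from the breakdown above
# Assumes the trace dataframe is indexed by event_idx.
cp_graph.trace_sim_df.loc[event_idx, "dur"] *= 0.9  # simulate a 10% reduction

# Step 5: push the new duration into the graph and recompute the longest path.
for u, v, attrs in cp_graph.edges(data=True):
    if attrs.get("event_idx") == event_idx:  # hypothetical edge attribute
        attrs["weight"] *= 0.9
new_critical_path = nx.dag_longest_path(cp_graph, weight="weight")
```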

Alternatives

An alternative approach is for the user to manually modify CPGraph attributes and call the internal graph-construction methods.

Additional context

Open to code contributions and discussions.

kvignesh1420 commented Feb 28, 2024

gentle ping: @anupambhatnagar @briancoutinho @fengxizhou

@briancoutinho

@kvignesh1420 I'm so sorry I missed this notification. Let me take a look and respond by today.

briancoutinho self-assigned this Mar 14, 2024
@briancoutinho

@kvignesh1420 thank you for sharing and bringing this up :) Simulation is definitely one area we can develop on top of the cp_graph; that was actually the intent behind returning the networkx DiGraph to the user.
Within Meta, we have actually been using this for simulations of sorts too.

Approach

Some thoughts on the proof-of-concept ideas

> Next, if the user wishes to simulate a situation in which the duration of op aten::mul_ is reduced by 10%, then the dur entry in the trace dataframe cp_graph.trace_df corresponding to the event_idx (in this case 359318) can be modified.

I would recommend not changing the trace dataframe itself, because the timestamps ('ts') of the events would also need to be updated if, say, the duration of a possibly dependent earlier operation is reduced. There is also some complex logic that examines ts and duration to build call stacks.

Instead, we can modify the cp_graph.edges. A sketch of the approach:

  1. Basically, we loop through all the edges of type "OPERATOR_KERNEL".
  2. For a given edge, we look up the associated event or operator (a function get_event_attribution_for_edge() was added to help with this).
  3. Once we know the event, we can modify the duration by changing the "weight" of the edge. For example, simulating a 2x faster GPU could mean updating all GPU events with weight = weight / 2.
  4. Then simply call critical_path() again to recompute the new critical path.

The above is not in any notebook yet, but as a user one could follow these steps; it is a viable approach that folks have actually been using.
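A minimal sketch of that loop, again assuming the cp_graph from above; the edge-type representation, the signature of get_event_attribution_for_edge(), and the is_gpu_event() predicate are assumptions for illustration.

```python
def is_gpu_event(event) -> bool:
    # Hypothetical predicate: the breakdown above shows stream == -1.0 for a
    # CPU-bound op, so assume GPU kernels run on a non-negative stream.
    return event is not None and event.get("stream", -1) >= 0

# Step 1: loop through all edges of type "OPERATOR_KERNEL".
for u, v, attrs in cp_graph.edges(data=True):
    if attrs.get("type") != "OPERATOR_KERNEL":  # edge-type representation assumed
        continue
    # Step 2: look up the event attributed to this edge (signature assumed).
    event = cp_graph.get_event_attribution_for_edge((u, v))
    # Step 3: simulate a 2x faster GPU by halving the edge weight.
    if is_gpu_event(event):
        attrs["weight"] = attrs["weight"] / 2

# Step 4: recompute the critical path on the modified graph.
cp_graph.critical_path()
```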

Nuances

  • Some edges in the graph are added based on GPU queue size (a kernel-launch delay edge is only added if the queue is empty). We should add kernel-launch delay edges for all GPU launches, just setting weight = 0 if the kernel was delayed due to queueing.
  • Without the above, it is possible that we speed things up and a GPU kernel launches before its CPU counterpart. Call these causality edges!
  • I already added this change, but there seems to be some bug where it impacts the actual critical path. Debugging it.

Future work

Since simulation is pretty user-specific, it can be hard to come up with a generic API. Perhaps the user could provide an update(e, w) -> w' function that updates each event/edge to produce the newer weight, as sketched below.
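For instance, a callback-style entry point could look like the hypothetical sketch below; simulate_critical_path() and its signature are not an existing API.

```python
from typing import Any, Callable

def simulate_critical_path(cp_graph, update: Callable[[Any, float], float]):
    """Apply a user-supplied update(event, weight) -> new_weight to every
    edge, then recompute the critical path. Hypothetical API sketch."""
    for u, v, attrs in cp_graph.edges(data=True):
        event = cp_graph.get_event_attribution_for_edge((u, v))  # helper from above
        attrs["weight"] = update(event, attrs["weight"])
    return cp_graph.critical_path()

# Example: halve the weight of every edge attributed to a GPU event
# (is_gpu_event() as sketched earlier in this thread).
# simulate_critical_path(cp_graph, lambda e, w: w / 2 if is_gpu_event(e) else w)
```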

Also, expect a significant speedup once some bad O(n^2) searching in the algorithm gets improved.

Let me know what you think

kvignesh1420 commented Mar 19, 2024

@briancoutinho this sounds like a good starting point. Thanks for the pointers. However, I just wanted to discuss the following scenario:

"If the user were to reduce the weights of multiple edges in the current critical path graph (in an attempt to simulate a faster op), then what is the possibility that there does not exist another cp graph, whose sum of edges is greater than the currently simulated one?"

If such a scenario has a reasonable probability of occurrence, then the users may misinterpret the results of the simulated cp graph. For example:

  1. The user obtains cp_g1 from the original trace df t_df.
  2. (Based on your suggestion) The user modifies a few edges of cp_g1 (simulating optimizations) to obtain cp_g1_sim.
  3. Observe that cp_g1 and cp_g1_sim have the same set of nodes and edges (but different edge weights).
  4. Here comes the tricky part: after modifying the weights of the edges, if one were to compute a fresh cp graph based on t_df (/call stacks) to obtain cp_g2, will cp_g2 and cp_g1 share the same set of nodes and edges?

Let me know if this makes sense :)

@briancoutinho

@kvignesh1420 Thanks for your response.
"If the user were to reduce the weights of multiple edges in the current critical path graph (in an attempt to simulate a faster op), then what is the possibility that there does not exist another cp graph, whose sum of edges is greater than the currently simulated one?"

My claim is that the cp_graph does not change. Let's say we consider its nodes and edges (V, E).
The nodes all correspond to events in t_df, and the duration edges between them represent CPU and GPU spans; these remain the same even if we optimize CPU or GPU times. To clarify: the weights of edges will change, but the graph structure should remain the same.

Next, we consider CUDA synchronization edges: these should remain the same, as they represent control dependencies between kernels and CPU operations. Likewise for the CPU operator dependencies, which are in serial order.

Lastly, kernel-launch and kernel-kernel delay edges:

  • Kernel-launch edges are the tricky ones. I added a change that always adds a kernel-launch edge even if the launch is not delayed by a previous kernel (in that case it is a 0-weight edge meant to retain the causal relation between the CPU launcher and the GPU kernel). This is disabled today due to some bug; need to look at it. code pointer
  • Kernel-kernel edges: these should remain the same, as the order of execution of kernels remains the same.

So all edges and nodes essentially remain the same in theory, even if we modify t_df. This equivalence makes it much easier to work directly with cp_graph. Let me know if this makes sense.
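One way to sanity-check this claim empirically, assuming cp_g2 can be rebuilt from the modified t_df via the existing CriticalPathAnalysis API (the rebuild step is elided here):

```python
def same_structure(g1, g2) -> bool:
    """True if two directed graphs share the same node and edge sets,
    ignoring edge weights."""
    return set(g1.nodes) == set(g2.nodes) and set(g1.edges) == set(g2.edges)

# Expected per the claim above: structure is preserved, only weights differ.
# assert same_structure(cp_g1, cp_g1_sim) and same_structure(cp_g1, cp_g2)
```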

@kvignesh1420

@briancoutinho thanks for the pointers. I will take a look at the kernel-launch delay bug as well. Also, I just wanted to know a few things:

  1. For visualizing the overlaid cp graph, do you use the open-source https://ui.perfetto.dev/ for large-model traces, or does Meta have an internal fork of Perfetto?
  2. Do we have a sample (small) trace file that I can use to play around with the bug? Does this file suffice?

Thanks again!

briancoutinho commented Mar 28, 2024

@kvignesh1420 thank you for offering to help on the kernel-launch delay bug :)

  1. Yeah, I am typically using chrome://tracing just out of habit; I remember the key maps and UI navigation on it. Perfetto is also common, and people are using it at Meta.
  2. Hmm, not sure if I can share that internal trace; let me get back on it.
