Simulating critical paths with optimization heuristics #105

Open
kvignesh1420 opened this issue Feb 22, 2024 · 7 comments

kvignesh1420 commented Feb 22, 2024

🚀 Motivation and context

Context: The CriticalPathAnalysis.critical_path_analysis(...) API provides access to a networkX DiGraph CPGraph, which is constructed using multiple CallStackGraph objects pertaining to specific process and thread pairs. Currently, the support for analysis on a CPGraph is limited to computing summary statistics based on the "boundedness" of the ops.

Motivation: Add functionality and best practices for simulating optimizations to compute/communication-bound ops using the trace dataframes and efficiently recomputing the critical paths (as also described in the future-work section of the docs).

Description

A possible approach to implement this functionality is as follows:

Setup:

  • Let us consider a CPGraph object that has been initialized with the original trace dataframe as cp_graph.
  • A call to cp_graph.critical_path() computes the critical_path_edges_set, critical_path_nodes_set, and critical_path_events_set sets.
  • Now, calling get_critical_path_breakdown() returns a dataframe (say cp_bkdwn_df) which has information about the duration and boundedness of ops. For example:
| event_idx | duration | type | s_name | cat | pid | tid | stream | index | bound_by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 359318.0 | 234598.0 | critical_path_operator | aten::mul_ | 1817.0 | 891 | 891 | -1.0 | 359438.0 | cpu_bound |

NOTE: Ideally, aten::mul_ would be executed on a GPU, but for large models where parameter offloading is required, the optimizer might perform such ops on the CPU to update its state.

Approach: (Proof of Concept)

  1. An intuitive first step is to sort these ops by cp_bkdwn_df.duration in descending order and check which ops have the highest latency.
  2. Next, if the user wishes to simulate a situation in which the duration of op aten::mul_ is reduced by 10%, then the dur entry in the trace dataframe cp_graph.trace_df corresponding to the event_idx (in this case 359318) can be modified.
  3. Alternatively, we can create a copy of cp_graph.trace_df such as cp_graph.trace_sim_df, which is used for simulations.
  4. We can retain the cp_graph.full_trace_df as a ground truth reference.
  5. To this end, we can modify the nodes/edges in cp_graph based on the new duration and recompute the longest path in the DAG using nx.dag_longest_path(cp_graph, weight="weight").
  6. The user can now iteratively modify cp_graph.trace_sim_df and continue to simulate various optimizations. For instance:
    • Simulate a 10% reduction in aten::mul_.
    • Obtain the new critical path graph with breakdowns and summary.
    • Simulate a 15% reduction in NCCLAllReduce on the newly simulated graph.
    • Obtain the new critical path graph with breakdowns and summary.

Essentially, we provide a simple API (design to be discussed) so that the user can tweak the dataframes and simulate different settings. Additionally, we retain the original full_trace dataframe in case we have to revert to the original critical path.
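Below is a minimal sketch of steps 1-5, assuming the cp_graph and cp_bkdwn_df objects from the setup above. The trace_sim_df indexing and the edge attribute used to map an event onto graph edges are illustrative assumptions, not an existing API.

```python
import networkx as nx

# Step 1: rank critical-path ops by duration to find the highest-latency ones.
slowest_ops = cp_bkdwn_df.sort_values("duration", ascending=False)
print(slowest_ops.head(10))

# Steps 2-4: simulate on a copy so the original trace stays intact.
cp_graph.trace_sim_df = cp_graph.trace_df.copy()
event_idx = 359318  # aten::mul_ from the breakdown above
# Assumes the trace dataframe is indexed by event_idx.
cp_graph.trace_sim_df.loc[event_idx, "dur"] *= 0.9  # simulate a 10% reduction

# Step 5: push the new duration into the graph and recompute the longest path.
for u, v, attrs in cp_graph.edges(data=True):
    if attrs.get("event_idx") == event_idx:  # hypothetical edge attribute
        attrs["weight"] *= 0.9
new_critical_path = nx.dag_longest_path(cp_graph, weight="weight")
```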

Alternatives

An alternative approach is for the user to manually modify CPGraph attributes and call the internal graph-construction methods.

Additional context

Open to code contributions and discussions.

kvignesh1420 commented Feb 28, 2024

gentle ping: @anupambhatnagar @briancoutinho @fengxizhou

@briancoutinho

@kvignesh1420 I'm so sorry I missed this notification. Let me take a look and respond by today.

briancoutinho self-assigned this Mar 14, 2024
@briancoutinho

@kvignesh1420 thank you for sharing and bringing this up :) Simulation is definitely one area we can develop on top of the cp_graph; that was actually the intent behind returning the networkx DiGraph to the user.
Within Meta, we have actually been using this for simulations of sorts too.

Approach

Some thoughts on the proof-of-concept ideas

> Next, if the user wishes to simulate a situation in which the duration of op aten::mul_ is reduced by 10%, then the dur entry in the trace dataframe cp_graph.trace_df corresponding to the event_idx (in this case 359318) can be modified.

I would recommend not changing the trace dataframe itself, because the timestamps ('ts') of the events would also need to be updated if, say, the duration of a possibly dependent earlier operation is reduced. There is also some complex logic that examines ts and duration to build call stacks.

Instead, we can modify the cp_graph.edges. A sketch of the approach:

  1. Basically, we loop through all the edges of type "OPERATOR_KERNEL".
  2. For a given edge, we look up the associated event or operator (a function get_event_attribution_for_edge() was added to help with this).
  3. Once we know the event, we can modify the duration by changing the "weight" of the edge. For example, simulating a 2x faster GPU could mean updating all GPU events with weight = weight / 2.
  4. Then simply call critical_path() again to recompute the new critical path.

The above is not in any notebook yet, but as a user one could follow these steps; it is a viable approach that folks have actually been using.
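A minimal sketch of that loop, again assuming the cp_graph from above; the edge-type representation, the signature of get_event_attribution_for_edge(), and the is_gpu_event() predicate are assumptions for illustration.

```python
def is_gpu_event(event) -> bool:
    # Hypothetical predicate: the breakdown above shows stream == -1.0 for a
    # CPU-bound op, so assume GPU kernels run on a non-negative stream.
    return event is not None and event.get("stream", -1) >= 0

# Step 1: loop through all edges of type "OPERATOR_KERNEL".
for u, v, attrs in cp_graph.edges(data=True):
    if attrs.get("type") != "OPERATOR_KERNEL":  # edge-type representation assumed
        continue
    # Step 2: look up the event attributed to this edge (signature assumed).
    event = cp_graph.get_event_attribution_for_edge((u, v))
    # Step 3: simulate a 2x faster GPU by halving the edge weight.
    if is_gpu_event(event):
        attrs["weight"] = attrs["weight"] / 2

# Step 4: recompute the critical path on the modified graph.
cp_graph.critical_path()
```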

Nuances

  • Some edges in the graph are added based on GPU queue size (a kernel-launch delay edge is only added if the queue is empty). We should add kernel-launch delay edges for all GPU launches, just setting weight = 0 if the kernel was delayed due to queueing.
  • Without the above, it is possible that we speed things up and a GPU kernel launches before its CPU counterpart. Call these causality edges!
  • I already added this change, but there seems to be some bug where it impacts the actual critical path. Debugging it.

Future work

Since simulation is pretty user-specific, it can be hard to come up with a generic API. Perhaps the user could provide an update(e, w) -> w' function that updates each event/edge to produce the newer weight, as sketched below.
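For instance, a callback-style entry point could look like the hypothetical sketch below; simulate_critical_path() and its signature are not an existing API.

```python
from typing import Any, Callable

def simulate_critical_path(cp_graph, update: Callable[[Any, float], float]):
    """Apply a user-supplied update(event, weight) -> new_weight to every
    edge, then recompute the critical path. Hypothetical API sketch."""
    for u, v, attrs in cp_graph.edges(data=True):
        event = cp_graph.get_event_attribution_for_edge((u, v))  # helper from above
        attrs["weight"] = update(event, attrs["weight"])
    return cp_graph.critical_path()

# Example: halve the weight of every edge attributed to a GPU event
# (is_gpu_event() as sketched earlier in this thread).
# simulate_critical_path(cp_graph, lambda e, w: w / 2 if is_gpu_event(e) else w)
```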

Also, expect a significant speedup once some bad O(n^2) searching in the algorithm gets improved.

Let me know what you think

kvignesh1420 commented Mar 19, 2024

@briancoutinho this sounds like a good starting point. Thanks for the pointers. However, I just wanted to discuss the following scenario:

"If the user were to reduce the weights of multiple edges in the current critical path graph (in an attempt to simulate a faster op), then what is the possibility that there does not exist another cp graph, whose sum of edges is greater than the currently simulated one?"

If such a scenario has a reasonable probability of occurrence, then the users may misinterpret the results of the simulated cp graph. For example:

  1. The user obtains cp_g1 from the original trace df t_df.
  2. (Based on your suggestion) The user modifies a few edges of cp_g1 (simulating optimizations) to obtain cp_g1_sim.
  3. Observe that cp_g1 and cp_g1_sim have the same set of nodes and edges (but different edge weights).
  4. Here comes the tricky part: after modifying the weights of the edges, if one were to compute a fresh cp graph based on t_df (/call stacks) to obtain cp_g2, will cp_g2 and cp_g1 share the same set of nodes and edges?

Let me know if this makes sense :)

@briancoutinho

@kvignesh1420 Thanks for your response.
"If the user were to reduce the weights of multiple edges in the current critical path graph (in an attempt to simulate a faster op), then what is the possibility that there does not exist another cp graph, whose sum of edges is greater than the currently simulated one?"

My claim is that the cp_graph does not change. Let's say we consider its nodes and edges (V, E).
The nodes all correspond to events in t_df, and the duration edges between them represent CPU and GPU spans; these remain the same even if we optimize CPU or GPU times. To clarify: the weights of edges will change, but the graph structure should remain the same.

Next, we consider CUDA synchronization edges: these should remain the same, as they represent control dependencies between kernels and CPU operations. Likewise for the CPU operator dependencies, which are in serial order.

Lastly, kernel-launch and kernel-kernel delay edges:

  • Kernel-launch edges are the tricky ones. I added a change that always adds a kernel-launch edge even if the launch is not delayed by a previous kernel (in that case it is a 0-weight edge meant to retain the causal relation between the CPU launcher and the GPU kernel). This is disabled today due to some bug; need to look at it. code pointer
  • Kernel-kernel edges: these should remain the same, as the order of execution of kernels remains the same.

So all edges and nodes essentially remain the same in theory, even if we modify t_df. This equivalence makes it much easier to work directly with cp_graph. Let me know if this makes sense.
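One way to sanity-check this claim empirically, assuming cp_g2 can be rebuilt from the modified t_df via the existing CriticalPathAnalysis API (the rebuild step is elided here):

```python
def same_structure(g1, g2) -> bool:
    """True if two directed graphs share the same node and edge sets,
    ignoring edge weights."""
    return set(g1.nodes) == set(g2.nodes) and set(g1.edges) == set(g2.edges)

# Expected per the claim above: structure is preserved, only weights differ.
# assert same_structure(cp_g1, cp_g1_sim) and same_structure(cp_g1, cp_g2)
```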

@kvignesh1420

@briancoutinho thanks for the pointers. I will take a look at the kernel-launch delay bug as well. Also, I just wanted to know a few things:

  1. For visualizing the overlaid cp graph, do you use the open-source https://ui.perfetto.dev/ for large-model traces, or does Meta have an internal fork of Perfetto?
  2. Do we have a sample (small) trace file that I can use to play around with the bug? Does this file suffice?

Thanks again!

briancoutinho commented Mar 28, 2024

@kvignesh1420 thank you for offering to help on the kernel-launch delay bug :)

  1. Yeah, I am typically using chrome://tracing just out of habit; I remember the key maps and UI navigation on it. Perfetto is also common, and people are using it at Meta.
  2. Hmm, not sure if I can share that internal trace; let me get back on it.
