Simulating critical paths with optimization heuristics #105
Comments
gentle ping: @anupambhatnagar @briancoutinho @fengxizhou |
@kvignesh1420 I'm so sorry I missed this notification. Let me take a look and respond by today |
@kvignesh1420 thank you for sharing and bringing this up :) Simulation is definitely one area we can develop on top of the cp_graph. That was actually the intent behind returning the networkx DiGraph to the user.
Approach
Some thoughts on the proof-of-concept ideas
I would recommend not changing the trace dataframe itself. This is because we also have the timestamp 'ts' of the events that need to be updated if say the duration of any possibly dependent earlier operation is reduced. There is also some complex logic examining the ts and duration to build call stacks. Instead, we can modify the
The above is not really added in any notebook yet. But as a user one could follow the above steps. This is a viable approach folks have been using actually.
Nuances
Future work
Since simulation is pretty user specific it can be a bit hard to come up with a generic API. Like the user could provide a
Also, expect a significant speed up due to some bad O(n^2) searching in the algorithm getting improved. Let me know what you think |
@briancoutinho this sounds like a good starting point. Thanks for the pointers. However, I just wanted to discuss the following scenario: "If the user were to reduce the weights of multiple edges in the current critical path graph (in an attempt to simulate a faster op), then what is the possibility that there does not exist another cp graph, whose sum of edges is greater than the currently simulated one?" If such a scenario has a reasonable probability of occurrence, then the users may misinterpret the results of the simulated cp graph. For example:
Let me know if this makes sense :) |
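The scenario raised in this comment can be made concrete with a toy example. The sketch below is not HTA code: the graph, node names and weights are made up for illustration, and the longest path is computed with a small pure-Python routine to stay dependency-free (on a networkx DiGraph, `nx.dag_longest_path(G, weight="weight")` serves the same purpose). It shows that shrinking an edge on the current critical path can hand the critical path over to a different chain.

```python
from collections import defaultdict

def longest_path(edges):
    """Return (length, path) of the longest weighted path in a DAG.

    `edges` maps (u, v) -> non-negative weight.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for (u, v), w in edges.items():
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm gives a topological order of the nodes.
    order = []
    ready = sorted(n for n in nodes if indeg[n] == 0)
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # Relax edges in topological order, keeping the best predecessor.
    dist = {n: 0 for n in nodes}
    prev = {}
    for u in order:
        for v, w in succ[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                prev[v] = u

    end = max(dist, key=dist.get)
    path = [end]
    while path[-1] in prev:
        path.append(prev[path[-1]])
    return dist[end], path[::-1]

# Hypothetical toy graph: two parallel chains between "start" and "end".
edges = {
    ("start", "cpu_op"): 100, ("cpu_op", "end"): 10,
    ("start", "gpu_kernel"): 90, ("gpu_kernel", "end"): 10,
}
before = longest_path(edges)

# Simulate a 2x faster cpu_op by halving the weight of its edge.
simulated = dict(edges)
simulated[("start", "cpu_op")] = 50
after = longest_path(simulated)

print(before)  # (110, ['start', 'cpu_op', 'end'])
print(after)   # (100, ['start', 'gpu_kernel', 'end'])
```

In this toy case the edge was shortened by 50 units, yet the end-to-end length only drops by 10 because the other chain becomes critical. Recomputing the longest path after every change, rather than subtracting the saving from the old critical path, avoids the misreading described above.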
@kvignesh1420 Thanks for your response. My claim is that the cp_graph does not change. Let's say we consider its nodes and edges (V,E). Next, we consider CUDA synchronization edges; these should remain the same as they represent control dependencies between kernels/CPU operations. Likewise, the CPU operator dependencies, which are in serial order. Lastly, kernel-launch and kernel-kernel delay edges.
So all edges and nodes essentially remain the same in theory, even if we modify t_df. This equivalence makes it much easier to work directly with cp_graph. Let me know if this makes sense. |
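One way to read the claim above is that a simulation only rescales weights on a fixed set of edges. The following sketch illustrates that reading with a plain dictionary standing in for the graph; `scale_weights`, the node names and the weights are all hypothetical and not part of HTA.

```python
def scale_weights(edges, scale):
    """Return a new (u, v) -> weight mapping with selected weights scaled.

    `scale` maps (u, v) -> factor. Naming an edge that is not in `edges`
    raises KeyError, so a simulation cannot add or remove structure.
    """
    unknown = set(scale) - set(edges)
    if unknown:
        raise KeyError(f"edges not in graph: {sorted(unknown)}")
    return {e: w * scale.get(e, 1.0) for e, w in edges.items()}

# Hypothetical edges, one per dependency type mentioned in the comment.
original = {
    ("cpu_op_begin", "cpu_op_end"): 40.0,    # CPU operator duration
    ("cpu_op_end", "kernel_begin"): 5.0,     # kernel-launch delay
    ("kernel_begin", "kernel_end"): 120.0,   # kernel duration
    ("kernel_end", "sync_end"): 2.0,         # synchronization dependency
}

# Simulate a kernel that runs 10% faster.
simulated = scale_weights(original, {("kernel_begin", "kernel_end"): 0.9})

same_structure = set(simulated) == set(original)
changed = {e for e in original if simulated[e] != original[e]}
print(same_structure, changed)
```

Because the edge set is identical before and after, the longest path can be recomputed on the reweighted graph without rebuilding it from the trace.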
@briancoutinho thanks for the pointers. I will take a look at the kernel-launch delay bug as well. Also, just wanted to know a few things:
Thanks again! |
@kvignesh1420 thank you for offering to help on the kernel-launch delay bug :)
|
🚀 Motivation and context
Context: The CriticalPathAnalysis.critical_path_analysis(...) API provides access to a networkX DiGraph CPGraph, which is constructed using multiple CallStackGraph objects pertaining to specific process and thread pairs. Currently, the support for analysis on a CPGraph is limited to computing summary statistics based on the "boundedness" of the ops.
Motivation: Add functionality and best practices for simulating optimizations to compute/communication bound ops using the trace dataframes and efficiently recomputing the critical paths. (As also described in the future work section in docs)
Description
A possible approach to implement this functionality is as follows:
Setup:
CPGraph object that has been initialized with the original trace dataframe as cp_graph.
cp_graph.critical_path() computes the critical_path_edges_set, critical_path_nodes_set, and critical_path_events_set sets.
get_critical_path_breakdown() returns a dataframe (say cp_bkdwn_df) which has information about the duration and boundedness of ops. For example:
NOTE: Ideally aten::mul_ will be executed on a GPU, but for large models where parameter offloading is required, the optimizer might perform such ops on a CPU to update its state.
Approach: (Proof of Concept)
Sort cp_bkdwn_df.duration in descending order and check which ops have the highest latency.
If the duration of op aten::mul_ is reduced by 10%, then the dur entry in the trace dataframe cp_graph.trace_df corresponding to the event_idx (in this case 359318) can be modified.
cp_graph.trace_df such as cp_graph.trace_sim_df, which is used for simulations.
cp_graph.full_trace_df as a ground truth reference.
cp_graph based on the new duration and recompute the longest path in the dag using nx.dag_longest_path(self, weight="weight").
cp_graph.trace_sim_df and continue to simulate various optimizations. For instance:
aten::mul_
NCCLAllReduce
on the newly simulated graph.
Essentially, we provide simple API(s) (design to be discussed) such that the user can tweak the dataframes and simulate settings. Additionally, we retain the original full_trace dataframe in case we have to revert to the original critical path.
Alternatives
An alternative approach is for the user to manually modify CPGraph attributes and call internal construct graph methods.
Additional context
Open to code contributions and discussions.
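The bookkeeping part of the proof-of-concept approach in the Description (a working copy for simulations plus an untouched ground-truth copy) could look roughly like the sketch below. It uses plain dictionaries in place of the trace dataframes, and `TraceSimulation`, the durations, and the second event index are hypothetical; only `aten::mul_`, event index 359318 and the 10% reduction come from the description. It does not adjust the `ts` of later events, which is the concern raised in the comments about editing the trace directly.

```python
import copy

class TraceSimulation:
    """Hypothetical helper: applies duration changes to a working copy
    while keeping the original events as a ground-truth reference."""

    def __init__(self, events):
        # events maps event_idx -> {"name": str, "dur": number}
        self.full_trace = copy.deepcopy(events)  # never modified
        self.trace_sim = copy.deepcopy(events)   # edited by simulations

    def scale_duration(self, event_idx, factor):
        """Multiply the simulated duration of one event by `factor`."""
        self.trace_sim[event_idx]["dur"] *= factor

    def reset(self):
        """Discard all simulated changes and return to the original trace."""
        self.trace_sim = copy.deepcopy(self.full_trace)

# Durations and the second event are made up for illustration.
events = {
    359318: {"name": "aten::mul_", "dur": 2000},
    400000: {"name": "all_reduce (hypothetical)", "dur": 5000},
}

sim = TraceSimulation(events)
sim.scale_duration(359318, 0.9)   # aten::mul_ becomes 10% faster
sim.scale_duration(400000, 0.5)   # stack a second what-if on top

simulated_dur = sim.trace_sim[359318]["dur"]
original_dur = sim.full_trace[359318]["dur"]

sim.reset()                       # revert to the original trace
reverted_dur = sim.trace_sim[359318]["dur"]
```

Keeping the two copies separate means successive what-ifs can be stacked on the working copy while the original durations stay available for comparison or for a full reset.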