Chakra Converter
The `PyTorchConverter` class is responsible for converting Chakra host and device execution traces from JSON format into the Chakra protobuf format. The converted traces are suitable for simulation or further analysis. This class handles parsing the JSON traces, converting nodes, establishing dependencies, and writing the final protobuf trace to an output file.
The conversion process involves several key steps:
- Loading and Parsing JSON Trace: The converter loads the JSON trace and parses it to create a mapping of node IDs to `PyTorchNode` instances. It also extracts metadata and root nodes.
- Establishing Relationships: Parent-child relationships are established based on control dependencies present in the trace. The converter handles specific cases for GPU operations, `record_param_comms`, and NCCL nodes.
- Converting Nodes to Protobuf Format: Each `PyTorchNode` is converted into a `ChakraNode` (protobuf format). The converter handles node types, attributes, and special cases for communication operations.
- Dependency Conversion: Control dependencies are converted into data dependencies, which are necessary for simulation. This step ensures that the execution order is correctly represented.
- Removing Dangling Nodes: Nodes that are not connected (neither parents nor children) are removed to simplify the execution graph.
- Cyclic Dependency Check: The converter checks for cyclic dependencies to ensure the execution graph is a Directed Acyclic Graph (DAG), which is essential for simulation.
- Writing Protobuf Trace: The final execution trace, including metadata and nodes, is written into a protobuf file.
- Simulation (Optional): If simulation is enabled, the converter simulates the execution of the nodes to validate the correctness of the conversion and the dependencies.
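As a rough illustration of the first two steps, the sketch below parses a tiny made-up trace and builds a parent-to-children map from control dependencies. The trace contents, the assumption that each node stores its caller's ID in a `ctrl_deps` field, and the helper names `parse_nodes` and `build_parent_child_map` are all hypothetical. This is not the converter's own code, which builds `PyTorchNode` objects and handles the GPU, `record_param_comms`, and NCCL special cases.

```python
import json
from collections import defaultdict

# Hypothetical, heavily trimmed host trace. Assumption: each node records
# the ID of its caller in "ctrl_deps" (its control dependency).
SAMPLE_TRACE = """
{
  "schema": "example",
  "nodes": [
    {"id": 1, "name": "thread_root", "ctrl_deps": 0},
    {"id": 2, "name": "aten::linear", "ctrl_deps": 1},
    {"id": 3, "name": "aten::addmm", "ctrl_deps": 2},
    {"id": 4, "name": "aten::relu", "ctrl_deps": 1}
  ]
}
"""


def parse_nodes(trace_text):
    """Map node ID -> raw node dict (a stand-in for PyTorchNode instances)."""
    trace = json.loads(trace_text)
    return {node["id"]: node for node in trace["nodes"]}


def build_parent_child_map(node_map):
    """Group node IDs under their caller; nodes with no known caller are roots."""
    children = defaultdict(list)
    roots = []
    for nid, node in node_map.items():
        parent = node["ctrl_deps"]
        if parent in node_map:
            children[parent].append(nid)
        else:
            roots.append(nid)
    return dict(children), roots


node_map = parse_nodes(SAMPLE_TRACE)
parent_to_children, root_ids = build_parent_child_map(node_map)
print(parent_to_children)  # {1: [2, 4], 2: [3]}
print(root_ids)            # [1]
```

Keeping the node map and the parent-to-children map as separate structures mirrors the step list above: parsing produces the ID mapping, and relationship building is a second pass over it.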
The key methods are:

- `convert(input_filename, output_filename, simulate)`: The main method to perform the conversion process from JSON to protobuf format.
- `load_json_execution_traces(input_filename)`: Loads the JSON execution traces from a file.
- `parse_json_trace(json_trace)`: Parses the JSON trace and creates `PyTorchNode` instances.
- `establish_parent_child_relationships(json_node_map, json_node_root_nids)`: Establishes parent-child relationships among nodes based on control dependencies.
- `convert_json_to_protobuf_nodes(json_node_map, protobuf_node_map)`: Converts `PyTorchNode` instances into protobuf `ChakraNode` instances.
- `convert_ctrl_dep_to_data_dep(json_node_map, protobuf_node_map, chakra_node)`: Converts control dependencies into data dependencies suitable for simulation.
- `remove_dangling_nodes(protobuf_node_map)`: Removes nodes that are not connected to the execution graph.
- `identify_cyclic_dependencies(protobuf_node_map)`: Checks for cyclic dependencies in the execution graph to ensure it is a DAG.
- `write_protobuf_execution_trace(output_filename, json_metadata, protobuf_node_map)`: Writes the converted execution trace into a protobuf file.
- `simulate_execution(json_node_map, protobuf_node_map, parent_to_children_map)`: Simulates the execution of the nodes to validate the correctness of the conversion and dependencies.
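The two graph checks in this list, `remove_dangling_nodes` and `identify_cyclic_dependencies`, can be pictured with plain dictionaries. The sketch below is a simplified stand-in, not the converter's implementation: it assumes the graph is a dict from node ID to the list of IDs that node depends on, and the helper names `remove_dangling` and `unresolved_nodes` are hypothetical. Cycle detection here uses Kahn's topological sort, chosen only because it is short and iterative.

```python
from collections import deque


def remove_dangling(deps):
    """Drop nodes that depend on nothing and that nothing depends on."""
    referenced = {d for ds in deps.values() for d in ds}
    return {nid: ds for nid, ds in deps.items() if ds or nid in referenced}


def unresolved_nodes(deps):
    """Return the sorted IDs that can never run (on or behind a cycle).

    An empty list means the graph is a DAG.
    """
    pending = {nid: 0 for nid in deps}        # unmet dependency count
    dependents = {nid: [] for nid in deps}    # reverse edges
    for nid, ds in deps.items():
        for d in ds:
            if d in deps:                     # ignore IDs outside the graph
                pending[nid] += 1
                dependents[d].append(nid)

    ready = deque(nid for nid, count in pending.items() if count == 0)
    while ready:
        nid = ready.popleft()
        for child in dependents[nid]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    return sorted(nid for nid, count in pending.items() if count > 0)


# Toy graph: 1 -> 2 -> 3 is a chain, 4 is dangling, 5 and 6 form a cycle.
graph = {1: [], 2: [1], 3: [2], 4: [], 5: [6], 6: [5]}

pruned = remove_dangling(graph)
print(sorted(pruned))           # [1, 2, 3, 5, 6]

stuck = unresolved_nodes(pruned)
print(stuck)                    # [5, 6]
```

Returning the stuck node IDs rather than a bare boolean makes it easier to report which part of the graph is affected, in the spirit of detecting and reporting cycles before simulation.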
Key considerations:

- Control vs. Data Dependencies: In the original Chakra host execution traces, control dependencies represent the caller-callee relationships. However, for simulation purposes, these need to be converted into data dependencies to reflect the actual execution order.
- Inter-Thread Dependencies: The converter handles inter-thread dependencies to ensure accurate simulation of operations that span multiple threads.
- Communication Operations: Special attention is given to communication operations like NCCL calls. The converter identifies collective communication types and handles attributes like communication size and process group names.
- Cyclic Dependencies: The converter ensures that the final execution graph is acyclic. Cyclic dependencies can cause simulations to hang or fail, so they are detected and reported.
- Dangling Nodes: Nodes that are neither parents nor children are considered dangling and are removed to simplify the execution graph.
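To make the control-versus-data distinction concrete, here is one simplified way to turn a caller-callee tree into an execution-order chain. It rests on two assumptions that may not hold for real traces: all nodes run on a single CPU thread, and node IDs increase in the order operators started. The function name `ctrl_to_data_deps` is hypothetical, and the actual `convert_ctrl_dep_to_data_dep` additionally deals with GPU operations, communication nodes, and inter-thread dependencies.

```python
def ctrl_to_data_deps(parent_to_children, root_id):
    """Walk the call tree in pre-order and chain each node to the one before it.

    parent_to_children: dict of caller ID -> list of callee IDs.
    Returns a dict of node ID -> list of IDs it must wait for.
    """
    data_deps = {}
    previous = None
    stack = [root_id]
    while stack:
        nid = stack.pop()
        # The first visited node waits for nothing; every later node waits
        # for the node visited immediately before it.
        data_deps[nid] = [] if previous is None else [previous]
        previous = nid
        # Push callees in descending ID order so the smallest ID pops first.
        stack.extend(sorted(parent_to_children.get(nid, []), reverse=True))
    return data_deps


# Call tree: node 1 calls 2 and 4; node 2 calls 3.
call_tree = {1: [2, 4], 2: [3]}
chain = ctrl_to_data_deps(call_tree, 1)
print(chain)  # {1: [], 2: [1], 3: [2], 4: [3]}
```

In the call tree, node 4 is only a sibling of node 2, yet in the resulting chain it waits for node 3, because node 3 runs inside node 2 and therefore before node 4 starts. That ordering is what a simulator needs and what the caller-callee view alone does not express.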