Analyzer reads the log files correctly but doesn't show data #148

Open
oabuhamdan opened this issue Jun 11, 2024 · 0 comments
Labels: bug (Something isn't working), needs triage

🐛 Describe the bug

I collect PyTorch profiler data and try to analyze it with HTA. The traces parse fine, but the analysis calls fail.

Note that viewing the same data in TensorBoard works fine.

>>> analyzer = TraceAnalysis(trace_dir = "gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3")
2024-06-11 12:23:57,307 - hta - trace.py:L389 - INFO - gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3
2024-06-11 12:23:57,307 - hta - trace_file.py:L94 - INFO - Rank to trace file map:
{1: 'gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t03_53384.1718116821592181759.pt.trace.json', 0: 'gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t01_49177.1718116821717695876.pt.trace.json', 2: 'gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t07_27283.1718116821676864190.pt.trace.json'}
2024-06-11 12:23:57,307 - hta - trace.py:L535 - INFO - ranks=[0, 1, 2]
2024-06-11 12:23:58,418 - hta - trace.py:L118 - INFO - Parsed gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t07_27283.1718116821676864190.pt.trace.json time = 1.10 seconds 
2024-06-11 12:23:58,418 - hta - trace.py:L118 - INFO - Parsed gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t03_53384.1718116821592181759.pt.trace.json time = 1.10 seconds 
2024-06-11 12:23:58,620 - hta - trace.py:L118 - INFO - Parsed gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t01_49177.1718116821717695876.pt.trace.json time = 1.31 seconds 
>>> idle_time_df = analyzer.get_idle_time_breakdown()

ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 idle_time_df = analyzer.get_idle_time_breakdown()

File ~/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/trace_analysis.py:462, in TraceAnalysis.get_idle_time_breakdown(self, ranks, streams, visualize, visualize_pctg, show_idle_interval_stats, consecutive_kernel_delay)
    460 interval_df_list: List[pd.DataFrame] = []
    461 for rank in ranks:
--> 462     idle_time_r_df, interval_r_df = BreakdownAnalysis.get_idle_time_breakdown(
    463         self.t,
    464         consecutive_kernel_delay,
    465         rank,
    466         streams,
    467         visualize,
    468         visualize_pctg,
    469         show_idle_interval_stats,
    470     )
    471     idle_time_df_list.append(idle_time_r_df)
    472     if interval_r_df is not None:

File ~/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/analyzers/breakdown_analysis.py:547, in BreakdownAnalysis.get_idle_time_breakdown(cls, t, consecutive_kernel_delay, rank, streams, visualize, visualize_pctg, show_idle_interval_stats)
    544     if idle_interval_df is not None:
    545         interval_stats_list.append(idle_interval_df)
--> 547 result_df = pd.concat(result_list)
    549 idle_category_name_map = {
    550     member.value: name.lower()
    551     for name, member in IdleTimeType.__members__.items()
    552 }
    553 result_df.rename(mapper=idle_category_name_map, axis=0, inplace=True)

File ~/PycharmProjects/DNNResearchProject/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    379 elif copy and using_copy_on_write():
    380     copy = False
--> 382 op = _Concatenator(
    383     objs,
    384     axis=axis,
    385     ignore_index=ignore_index,
    386     join=join,
    387     keys=keys,
    388     levels=levels,
    389     names=names,
    390     verify_integrity=verify_integrity,
    391     copy=copy,
    392     sort=sort,
    393 )
    395 return op.get_result()

File ~/PycharmProjects/DNNResearchProject/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:445, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    442 self.verify_integrity = verify_integrity
    443 self.copy = copy
--> 445 objs, keys = self._clean_keys_and_objs(objs, keys)
    447 # figure out what our result ndim is going to be
    448 ndims = self._get_ndims(objs)

File ~/PycharmProjects/DNNResearchProject/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:507, in _Concatenator._clean_keys_and_objs(self, objs, keys)
    504     objs_list = list(objs)
    506 if len(objs_list) == 0:
--> 507     raise ValueError("No objects to concatenate")
    509 if keys is None:
    510     objs_list = list(com.not_none(*objs_list))

ValueError: No objects to concatenate
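For context, the traceback above bottoms out in `pd.concat` receiving an empty list, which suggests `get_idle_time_breakdown` built no per-stream DataFrames for this rank (e.g. no GPU kernel events were matched in the trace). A minimal, hypothetical repro of just the pandas failure mode (`result_list` here stands in for the list built inside `BreakdownAnalysis`, not the actual HTA code):

```python
import pandas as pd

# If no idle-time DataFrames are collected, the list passed to pd.concat
# is empty, and pandas raises exactly the ValueError seen above.
result_list = []  # hypothetical stand-in for the empty per-stream results
try:
    pd.concat(result_list)
except ValueError as e:
    print(e)  # prints "No objects to concatenate"
```

So the error is a symptom, not the root cause: the real question is why no kernel events are found in these traces.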
>>> comm_comp_overlap_df = analyzer.get_comm_comp_overlap()

/Users/osamaabuhamdan/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/analyzers/communication_analysis.py:73: RuntimeWarning:

invalid value encountered in scalar divide

/Users/osamaabuhamdan/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/analyzers/communication_analysis.py:73: RuntimeWarning:

invalid value encountered in scalar divide

/Users/osamaabuhamdan/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/analyzers/communication_analysis.py:73: RuntimeWarning:

invalid value encountered in scalar divide
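The repeated `RuntimeWarning` (one per rank) is consistent with the same root cause: if both the numerator and denominator of the overlap ratio are zero (no matching kernel/communication time), a NumPy scalar division yields NaN and emits this warning. A hypothetical minimal repro, not the actual `communication_analysis.py` code:

```python
import warnings
import numpy as np

# Dividing a zero NumPy scalar by a zero NumPy scalar produces NaN and
# emits "invalid value encountered in ... divide" as a RuntimeWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ratio = np.float64(0.0) / np.float64(0.0)

print(np.isnan(ratio))  # True
print(any(issubclass(w.category, RuntimeWarning) for w in caught))  # True
```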

Steps to reproduce

This is how I collect the data; the `profiler.step()` call is inside the `train` method.

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=10, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(f'./logs/{args.exp_name}'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
) as profiler:
    for epoch in range(args.epochs):
        train(train_sampler, epoch, device, train_loader, ddp_model, loss_criterion, optimizer, profiler, scaler,
              precisions[args.precision])
        validate(epoch, device, validation_loader, ddp_model, loss_criterion)
        scheduler.step()

and

>>> torch.profiler.kineto_available()
True

Expected behavior

I expect the analysis calls to complete and return data.

Environment

Python Version: 3.11.7
Torch Version: 2.3.1+cu121

I tried both installing HTA 0.2.0 with pip and building it from source; neither worked.

System I collect on:
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.8.2003 (Core)
Release: 7.8.2003
Codename: Core

System I analyze on:
System Version: macOS 14.5 (23F79)
Kernel Version: Darwin 23.5.0
Secure Virtual Memory: Enabled
System Integrity Protection: Enabled
Model Name: MacBook Air
Chip: Apple M2
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 8 GB
System Firmware Version: 10151.121.1
OS Loader Version: 10151.121.1
Activation Lock Status: Disabled

Additional Info

No response
