🐛 Describe the bug
I collect PyTorch profiler data and try to use the HTA analyzer. Parsing works fine, but the analysis fails.
Note that viewing the data in TensorBoard works fine.
>>> analyzer = TraceAnalysis(trace_dir = "gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3")
2024-06-11 12:23:57,307 - hta - trace.py:L389 - INFO - gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3
2024-06-11 12:23:57,307 - hta - trace_file.py:L94 - INFO - Rank to trace file map:
{1: 'gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t03_53384.1718116821592181759.pt.trace.json', 0: 'gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t01_49177.1718116821717695876.pt.trace.json', 2: 'gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t07_27283.1718116821676864190.pt.trace.json'}
2024-06-11 12:23:57,307 - hta - trace.py:L535 - INFO - ranks=[0, 1, 2]
2024-06-11 12:23:58,418 - hta - trace.py:L118 - INFO - Parsed gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t07_27283.1718116821676864190.pt.trace.json time = 1.10 seconds
2024-06-11 12:23:58,418 - hta - trace.py:L118 - INFO - Parsed gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t03_53384.1718116821592181759.pt.trace.json time = 1.10 seconds
2024-06-11 12:23:58,620 - hta - trace.py:L118 - INFO - Parsed gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/r1t01_49177.1718116821717695876.pt.trace.json time = 1.31 seconds
>>> idle_time_df = analyzer.get_idle_time_breakdown()
ValueError Traceback (most recent call last)
Cell In[7], line 1
----> 1 idle_time_df = analyzer.get_idle_time_breakdown()
File ~/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/trace_analysis.py:462, in TraceAnalysis.get_idle_time_breakdown(self, ranks, streams, visualize, visualize_pctg, show_idle_interval_stats, consecutive_kernel_delay)
460 interval_df_list: List[pd.DataFrame] = []
461 for rank in ranks:
--> 462 idle_time_r_df, interval_r_df = BreakdownAnalysis.get_idle_time_breakdown(
463 self.t,
464 consecutive_kernel_delay,
465 rank,
466 streams,
467 visualize,
468 visualize_pctg,
469 show_idle_interval_stats,
470 )
471 idle_time_df_list.append(idle_time_r_df)
472 if interval_r_df is not None:
File ~/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/analyzers/breakdown_analysis.py:547, in BreakdownAnalysis.get_idle_time_breakdown(cls, t, consecutive_kernel_delay, rank, streams, visualize, visualize_pctg, show_idle_interval_stats)
544 if idle_interval_df is not None:
545 interval_stats_list.append(idle_interval_df)
--> 547 result_df = pd.concat(result_list)
549 idle_category_name_map = {
550 member.value: name.lower()
551 for name, member in IdleTimeType.__members__.items()
552 }
553 result_df.rename(mapper=idle_category_name_map, axis=0, inplace=True)
File ~/PycharmProjects/DNNResearchProject/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
379 elif copy and using_copy_on_write():
380 copy = False
--> 382 op = _Concatenator(
383 objs,
384 axis=axis,
385 ignore_index=ignore_index,
386 join=join,
387 keys=keys,
388 levels=levels,
389 names=names,
390 verify_integrity=verify_integrity,
391 copy=copy,
392 sort=sort,
393 )
395 return op.get_result()
File ~/PycharmProjects/DNNResearchProject/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:445, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
442 self.verify_integrity = verify_integrity
443 self.copy = copy
--> 445 objs, keys = self._clean_keys_and_objs(objs, keys)
447 # figure out what our result ndim is going to be
448 ndims = self._get_ndims(objs)
File ~/PycharmProjects/DNNResearchProject/venv/lib/python3.11/site-packages/pandas/core/reshape/concat.py:507, in _Concatenator._clean_keys_and_objs(self, objs, keys)
504 objs_list = list(objs)
506 if len(objs_list) == 0:
--> 507 raise ValueError("No objects to concatenate")
509 if keys is None:
510 objs_list = list(com.not_none(*objs_list))
ValueError: No objects to concatenate
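As far as I can tell, the ValueError itself is just pandas refusing to concatenate an empty list; the real problem is that result_list in BreakdownAnalysis.get_idle_time_breakdown ends up empty, i.e. no idle-time frames were built for this rank. A minimal reproduction of the pandas side:

    import pandas as pd

    # pd.concat raises exactly this error when handed an empty list, which is
    # what happens when the breakdown produces no per-stream results.
    pd.concat([])  # ValueError: No objects to concatenate

The comm/comp overlap analysis also misbehaves: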
>>> comm_comp_overlap_df = analyzer.get_comm_comp_overlap()
/Users/osamaabuhamdan/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/analyzers/communication_analysis.py:73: RuntimeWarning:
invalid value encountered in scalar divide
/Users/osamaabuhamdan/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/analyzers/communication_analysis.py:73: RuntimeWarning:
invalid value encountered in scalar divide
/Users/osamaabuhamdan/PycharmProjects/DNNResearchProject/DistributedDeepLearningExperiments/process_results/HolisticTraceAnalysis/hta/analyzers/communication_analysis.py:73: RuntimeWarning:
invalid value encountered in scalar divide
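Both symptoms (the empty concat and the 0/0 RuntimeWarnings, one per rank) would be consistent with HTA finding no GPU kernel events in the traces. As a rough sanity check (the file name is taken from the rank-to-file map above, and I am assuming the standard Chrome-trace layout PyTorch exports, where GPU kernels carry cat == "kernel"):

    import json

    path = ("gpu_resnet18_cifar10_ddp_batch1024_precisionfp32_nodes3/"
            "r1t01_49177.1718116821717695876.pt.trace.json")
    with open(path) as f:
        trace = json.load(f)

    # Count events whose category marks them as GPU kernels.
    kernels = [e for e in trace.get("traceEvents", []) if e.get("cat") == "kernel"]
    print(len(kernels), "kernel events found")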
Steps to reproduce
This is how I collect the data; the profiler.step() call is inside the train method (a sketch of that loop follows the snippet below).
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=10, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler(f'./logs/{args.exp_name}'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as profiler:
    for epoch in range(args.epochs):
        train(train_sampler, epoch, device, train_loader, ddp_model, loss_criterion,
              optimizer, profiler, scaler, precisions[args.precision])
        validate(epoch, device, validation_loader, ddp_model, loss_criterion)
        scheduler.step()
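For context, a minimal sketch of the relevant part of train (the argument list matches the call above, but the body is an illustrative reconstruction, not my exact code); the key detail is that profiler.step() runs once per batch so the wait/warmup/active schedule advances:

    def train(train_sampler, epoch, device, train_loader, model, loss_criterion,
              optimizer, profiler, scaler, precision):
        # Illustrative body -- the essential piece for profiling is the
        # profiler.step() call at the end of every batch.
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            profiler.step()  # advance the torch.profiler schedule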
Expected behavior
I expect the analysis to work!
Environment
Python Version: 3.11.7
Torch Version: '2.3.1+cu121'
I installed HTA version 0.2.0 with pip and also built it from source. Neither worked.
System I collect on:
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.8.2003 (Core)
Release: 7.8.2003
Codename: Core
System I analyze on:
System Version: macOS 14.5 (23F79)
Kernel Version: Darwin 23.5.0
Secure Virtual Memory: Enabled
System Integrity Protection: Enabled
Model Name: MacBook Air
Chip: Apple M2
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 8 GB
System Firmware Version: 10151.121.1
OS Loader Version: 10151.121.1
Activation Lock Status: Disabled
Additional Info
No response