Thank you very much for your excellent work! However, I have run into a frustrating problem while trying to reproduce the results.
Background: I replaced the ResNet50 backbone in fbocc-r50-cbgs_depth_16f_16x4_20e.py with ConvNeXtV2-Base, leaving all other parameters unchanged. I then followed the example in start.md and ran ./tools/dist_train.sh ./occupancy_configs/fb_occ/fbocc-r50-cbgs_depth_16f_16x4_20e.py 2 to train on two GPUs.
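In case it matters, the backbone swap was a config override roughly like the sketch below. The `ConvNeXtV2` type name, checkpoint path, and channel numbers are placeholders for illustration (the actual class name depends on how ConvNeXtV2 is registered in your environment), not the exact values from my config:

```python
# Sketch of the backbone override on top of the released FB-OCC config.
# All names/values below are illustrative placeholders, not my exact config.
_base_ = ['./fbocc-r50-cbgs_depth_16f_16x4_20e.py']

model = dict(
    img_backbone=dict(
        _delete_=True,               # drop the original ResNet50 settings
        type='ConvNeXtV2',           # hypothetical registered backbone class
        arch='base',
        out_indices=(0, 1, 2, 3),
        init_cfg=dict(type='Pretrained',
                      checkpoint='path/to/convnextv2_base.pth')),
    img_neck=dict(
        # ConvNeXtV2-Base stage widths differ from ResNet50's
        # (256/512/1024/2048), so the neck input channels must be adjusted.
        in_channels=[128, 256, 512, 1024]))
```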
Problem: Training ran successfully for two epochs, but an error was raised during the third epoch. Although the error does not appear to be specific to FB-OCC, I would appreciate your help.
Detailed Error Information:
...
2024-10-30 19:18:05,512 - mmdet - INFO - Iter [4000/39980] lr: 2.000e-04, eta: 19:56:44, time: 2.128, data_time: 0.016, memory: 26404, loss_voxel_ce_c_0: 1.1668, loss_voxel_sem_scal_c_0: 6.0262, loss_voxel_geo_scal_c_0: 1.1700, loss_voxel_lovasz_c_0: 0.8033, loss_depth: 4.5904, loss: 13.7567, grad_norm: 438604
Traceback (most recent call last):
  File "./tools/train.py", line 373, in <module>
    main()
  File "./tools/train.py", line 362, in main
    train_model(
  File "/path/to/my/workspace/mmdetection3d/mmdet3d/apis/train.py", line 28, in train_model
    train_detector(
  File "/opt/conda/envs/mtbev/lib/python3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/mtbev/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/envs/mtbev/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/opt/conda/envs/mtbev/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/opt/conda/envs/mtbev/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/opt/conda/envs/mtbev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/mtbev/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/path/to/my/workspace/mmdetection3d/mmdet3d/models/detectors/base.py", line 59, in forward
    return self.forward_train(**kwargs)
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 430, in forward_train
    results = self.extract_feat(
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 384, in extract_feat
    results.update(self.extract_img_bev_feat(img, img_metas, **kwargs))
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 362, in extract_img_bev_feat
    bev_feat = self.fuse_history(bev_feat, img_metas, img[6])
  File "/opt/conda/envs/mtbev/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 207, in fuse_history
    assert (self.history_seq_ids != seq_ids)[~start_of_sequence].sum() == 0, \
AssertionError: tensor([555, 555, 555, 555], device='cuda:1'), tensor([965, 965, 965, 965], device='cuda:1'), tensor([False, False, False, False], device='cuda:1')
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 414487) of binary: /opt/conda/envs/mtbev/bin/python
...
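From my reading of the traceback, the assertion in fuse_history checks that, for every sample not flagged as the start of a new sequence, the cached BEV history comes from the same scene (sequence id) as the current frame. The sketch below just reproduces that check with the tensors from my error message; it is a simplified illustration, not the actual FB-OCC code:

```python
import torch

# Values taken from the AssertionError above (rank cuda:1).
history_seq_ids = torch.tensor([555, 555, 555, 555])  # scene ids of the cached BEV history
seq_ids = torch.tensor([965, 965, 965, 965])           # scene ids of the current batch
start_of_sequence = torch.tensor([False, False, False, False])  # no sample marked as a new scene

# For samples that are NOT the start of a new sequence, the cached history
# must belong to the same scene as the current frame. Here all four samples
# come from scene 965 while the history is from scene 555, yet none of them
# is flagged as a sequence start, so the check fails.
mismatch = (history_seq_ids != seq_ids)[~start_of_sequence]
assert mismatch.sum() == 0, (history_seq_ids, seq_ids, start_of_sequence)
```

So the failure seems to mean the scene changed without start_of_sequence being set for that sample, i.e. the sequence bookkeeping got out of sync on one of the ranks.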
It's worth mentioning that I used your data prep tool exclusively to prepare the data.
I've also encountered a similar problem. Have you resolved it? Thanks.
No, I still have no idea what causes it. Unfortunately, for unrelated reasons I'm no longer working on the Occ project, but I still look forward to replies from the author or other practitioners.