Problem with "assert (self.history_seq_ids != seq_ids)[~start_of_sequence].sum() == 0" during training #46

Open
polyethylene16 opened this issue Oct 31, 2024 · 2 comments

Comments

@polyethylene16

polyethylene16 commented Oct 31, 2024

Thank you very much for your excellent work! However, I have run into a frustrating problem while trying to reproduce the results.

  • Background: I replaced the ResNet50 backbone in fbocc-r50-cbgs_depth_16f_16x4_20e.py with ConvNeXtV2-Base, keeping all other parameters the same (a rough sketch of the config change is included at the end of this post). I then followed the example in start.md and ran ./tools/dist_train.sh ./occupancy_configs/fb_occ/fbocc-r50-cbgs_depth_16f_16x4_20e.py 2 to train on two devices.
  • Problem: Training ran successfully for two epochs, but an error was raised during the third epoch. Although the error does not appear to be specific to FB-OCC, I would appreciate your help.
  • Detailed Error Information:
...
2024-10-30 19:18:05,512 - mmdet - INFO - Iter [4000/39980]   lr:2.000e-04, eta:19:56:44,  time:  2.128,  data_time:  0.016,  memory:  26404,  loss_voxel_ce_c_0:  1.1668,  loss_voxel_sem_scal_c_0:  6.0262,  loss_voxel_geo_scal_c_0:  1.1700,  loss_voxel_lovasz_c_0:  0.8033,  loss_depth:  4.5904,  loss:  13.7567,  grad_norm:  438604
Traceback (most recent call last):
  File "./tools/train.py", line 373, in <module>
     main()
  File " ./tools/train.py", line 362, in main
     train_model(
  File "/path/to/my/workspace/mmdetection3d/mmdet3d/apis/train.py", line 28, in train_model
     train_detector(
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
     runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
     iter_runner(iter_loaders[i], **kwards)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
     outputs = self.model.train_step(data_batch, self.optimizer, **kwards)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
     output = self.module.train_step(*inputs[0], **kwards[0])
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmdet/models/detectors/base.py", in line 237, in train_step
     losses = self(**data)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
     return forward_call(*input, **kwards)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmcv/runner/fp_utils.py", line 128, in new_func
     output = old_func(*new_args, **new_kwards)
  File "/path/to/my/workspace/mmdetion3d/mmdet3d/models/detectors/base.py", line 59, in forward
     return self.forward_train(**kwards)
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 430, in forward_train
     results= self.extract_feat(
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 384, in extract_feat
     results.update(self.extract_img_bev_feat(img, img_metas, **kwargs))
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 362, in extract_img_bev_feat
     bev_feat = self.fuse_history(bev_feat, img_metas, img[6])
  File "opt/conda/evs/mtbev/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
     output = old_func(*new_args, **new_kwards)
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 207, in fuse_history
     assert (self.history_seq_ids != seq_ids)[~start_of_sequence].sum() == 0, \
AssertionError: tensor([555, 555, 555, 555], device='cuda:1'), tensor([965, 965, 965, 965], device='cuda:1'), tensor([False, False, False, False], device='cuda:1')
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 414487) of binary: /opt/conda/envs/mtbev/bin/python
...
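
For what it's worth, here is my reading of the assertion that fails. The snippet below is a minimal, standalone sketch that reproduces the check using the tensor values printed in the error message; the variable names are copied from the traceback, the real fuse_history method of course does much more than this, and running the snippet raises the same AssertionError as in the log above.

import torch

# Values mirroring the failed assertion's message (device annotations dropped):
# the cached BEV history was built from sequence 555, while the current batch
# belongs to sequence 965, and no sample is flagged as the start of a sequence.
history_seq_ids = torch.tensor([555, 555, 555, 555])
seq_ids = torch.tensor([965, 965, 965, 965])
start_of_sequence = torch.tensor([False, False, False, False])

# The check requires that every sample whose sequence id differs from the cached
# history id is also marked as a sequence start; otherwise temporal BEV features
# from one scene would be fused into another.
assert (history_seq_ids != seq_ids)[~start_of_sequence].sum() == 0, \
    f"{history_seq_ids}, {seq_ids}, {start_of_sequence}"

In the printed tensors, all four samples on cuda:1 carry sequence id 965 while the history still holds 555, yet none of them is marked as a sequence start, which is exactly the condition the assertion forbids.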

It's worth mentioning that I used your data prep tool exclusively to prepare the data.
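
For completeness, the backbone swap described in the background above looks roughly like the sketch below. I am reconstructing it from memory, so the exact type name, key names (img_backbone, img_neck, out_indices, ...) and channel numbers are assumptions and may not match the actual config files.

# Hypothetical excerpt of my modified fbocc-r50-cbgs_depth_16f_16x4_20e.py; only
# the image backbone (and the matching neck input channels) is changed, everything
# else is kept exactly as in the original config.
model = dict(
    img_backbone=dict(
        type='ConvNeXt',          # ConvNeXtV2-Base variant registered in my environment
        arch='base',
        out_indices=(2, 3),       # feed the last two stages to the neck, as with ResNet50
        init_cfg=dict(type='Pretrained', checkpoint='path/to/convnextv2_base.pth'),
    ),
    img_neck=dict(
        in_channels=[512, 1024],  # ConvNeXt-Base stage channels replacing ResNet50's [1024, 2048]
    ),
)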

@KUGDXL

KUGDXL commented Dec 29, 2024

I've also encountered a similar problem. Have you resolved it? Thanks.

@polyethylene16
Author

> I've also encountered a similar problem. Have you resolved it? Thanks.

No, I still haven't figured it out. Unfortunately, I'm no longer working on the Occ project for other reasons, but I still look forward to a reply from the authors or other practitioners.
