out of memory #80

Open
Jackie-LJQ opened this issue Jan 22, 2021 · 2 comments

Jackie-LJQ commented Jan 22, 2021

I get a CUDA out-of-memory error every time I resume training, but I don't get the error if I load the initial weights and train from the first epoch. I'm using my own dataset, but I think it's more likely that something is wrong with the distributed training. Any suggestions on how I should check the code?

Training from epoch 0 works fine:

Epoch: [0][0/167] Time: 9.341s (9.341s) Speed: 2.1 samples/s Data: 7.671s (7.671s) Stage0-heatmaps: 2.215e-03 (2.215e-03) Stage1-heatmaps: 6.406e-04 (6.406e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 4.953e-08 (4.953e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][0/167] Time: 9.341s (9.341s) Speed: 2.1 samples/s Data: 7.873s (7.873s) Stage0-heatmaps: 1.990e-03 (1.990e-03) Stage1-heatmaps: 5.832e-04 (5.832e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 4.789e-08 (4.789e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][100/167] Time: 0.539s (0.651s) Speed: 37.1 samples/s Data: 0.000s (0.101s) Stage0-heatmaps: 4.487e-04 (1.019e-03) Stage1-heatmaps: 4.257e-04 (5.118e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 3.724e-07 (4.452e-07) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][100/167] Time: 0.541s (0.651s) Speed: 36.9 samples/s Data: 0.000s (0.099s) Stage0-heatmaps: 4.705e-04 (1.050e-03) Stage1-heatmaps: 4.493e-04 (5.196e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 3.321e-07 (4.364e-07) Stage1-pull: 0.000e+00 (0.000e+00)
=> saving checkpoint to output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3

Continuing training from the checkpoint produces the error:

Target Transforms (if any): None=> loading checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar'
=> loading checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar'
=> loaded checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar' (epoch 5)
=> loaded checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar' (epoch 5)
Epoch: [5][0/167] Time: 9.577s (9.577s) Speed: 2.1 samples/s Data: 8.164s (8.164s) Stage0-heatmaps: 1.595e-04 (1.595e-04) Stage1-heatmaps: 7.866e-05 (7.866e-05) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 6.155e-08 (6.155e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [5][0/167] Time: 9.665s (9.665s) Speed: 2.1 samples/s Data: 7.976s (7.976s) Stage0-heatmaps: 1.904e-04 (1.904e-04) Stage1-heatmaps: 8.872e-05 (8.872e-05) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 5.090e-08 (5.090e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Traceback (most recent call last):
File "tools/dist_train.py", line 323, in
main()
File "tools/dist_train.py", line 115, in main
args=(ngpus_per_node, args, final_output_dir, tb_log_dir)
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/dist_train.py", line 285, in main_worker
final_output_dir, tb_log_dir, writer_dict, fp16=cfg.FP16.ENABLED)
File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/../lib/core/trainer.py", line 76, in do_train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 52.00 MiB (GPU 0; 7.80 GiB total capacity; 5.73 GiB already allocated; 27.31 MiB free; 5.86 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fa6f3122536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7fa6f336bf1e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7fa6f336cf9e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x135 (0x7fa6f5f00535 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xf7a66b (0x7fa6f44f866b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xfc3f57 (0x7fa6f4541f57 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1075389 (0x7fa730a7c389 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x10756c7 (0x7fa730a7c6c7 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0xe3c42e (0x7fa73084342e in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: at::TensorIterator::fast_set_up() + 0x5cf (0x7fa7308442af in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: at::TensorIterator::build() + 0x4c (0x7fa730844b6c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x146 (0x7fa730845216 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: at::native::mul(at::Tensor const&, at::Tensor const&) + 0x3a (0x7fa730564eba in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0xf76ef8 (0x7fa6f44f4ef8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #14: + 0x10c3ec0 (0x7fa730acaec0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0x2d2e779 (0x7fa732735779 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x10c3ec0 (0x7fa730acaec0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::Tensor::mul(at::Tensor const&) const + 0xf0 (0x7fa73f108ab0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #18: torch::autograd::generated::PowBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1a6 (0x7fa7322caa06 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0x2d89c05 (0x7fa732790c05 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fa73278df03 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fa73278ece2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fa732787359 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fa73eec64d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #24: + 0xbd66f (0x7fa73ff9766f in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #25: + 0x76db (0x7fa742c906db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #26: clone + 0x3f (0x7fa742fc988f in /lib/x86_64-linux-gnu/libc.so.6)

root@a2bff378da93:/kpoints/HigherHRNet-Human-Pose-Estimation# /usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
len(cache))
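A likely suspect when the OOM appears only on resume: torch.load without a map_location restores every checkpoint tensor onto the GPU it was saved from (usually cuda:0), so each spawned rank puts an extra copy of the model and optimizer state on GPU 0 on top of what is already resident there. Below is a minimal sketch of loading the checkpoint onto CPU first; the function name, the file path, and the state_dict/optimizer/epoch keys are illustrative, not necessarily the repo's actual ones.

import torch

def resume(model, optimizer, checkpoint_path):
    # Load onto CPU so spawned ranks do not pile the checkpoint onto GPU 0.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])
    # Optimizer.load_state_dict casts the restored state to the device of the
    # matching parameters, so nothing is left stranded on the CPU.
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint.get("epoch", 0)

If the checkpoint must stay on GPU, map_location={"cuda:0": f"cuda:{rank}"} (with rank being the hypothetical per-process index) redirects it to each rank's own device instead of cuda:0.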

@chenmingjian

I had the same problem. :(

wusaisa commented Oct 30, 2023

I had the same problem. Did you solve it?
