out of memory #80

Open
Jackie-LJQ opened this issue Jan 22, 2021 · 2 comments

Jackie-LJQ commented Jan 22, 2021

I get a CUDA out-of-memory error every time I resume training, but I don't get the error if I load the initial weights and train from the first epoch. I'm using my own dataset, but I think it's more likely that something is wrong with the distributed training. Any suggestions on how I should check the code?

Training from epoch 0 works fine:

Epoch: [0][0/167] Time: 9.341s (9.341s) Speed: 2.1 samples/s Data: 7.671s (7.671s) Stage0-heatmaps: 2.215e-03 (2.215e-03) Stage1-heatmaps: 6.406e-04 (6.406e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 4.953e-08 (4.953e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][0/167] Time: 9.341s (9.341s) Speed: 2.1 samples/s Data: 7.873s (7.873s) Stage0-heatmaps: 1.990e-03 (1.990e-03) Stage1-heatmaps: 5.832e-04 (5.832e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 4.789e-08 (4.789e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][100/167] Time: 0.539s (0.651s) Speed: 37.1 samples/s Data: 0.000s (0.101s) Stage0-heatmaps: 4.487e-04 (1.019e-03) Stage1-heatmaps: 4.257e-04 (5.118e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 3.724e-07 (4.452e-07) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][100/167] Time: 0.541s (0.651s) Speed: 36.9 samples/s Data: 0.000s (0.099s) Stage0-heatmaps: 4.705e-04 (1.050e-03) Stage1-heatmaps: 4.493e-04 (5.196e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 3.321e-07 (4.364e-07) Stage1-pull: 0.000e+00 (0.000e+00)
=> saving checkpoint to output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3

Continuing training from the checkpoint produces the error:

Target Transforms (if any): None=> loading checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar'
=> loading checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar'
=> loaded checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar' (epoch 5)
=> loaded checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar' (epoch 5)
Epoch: [5][0/167] Time: 9.577s (9.577s) Speed: 2.1 samples/s Data: 8.164s (8.164s) Stage0-heatmaps: 1.595e-04 (1.595e-04) Stage1-heatmaps: 7.866e-05 (7.866e-05) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 6.155e-08 (6.155e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [5][0/167] Time: 9.665s (9.665s) Speed: 2.1 samples/s Data: 7.976s (7.976s) Stage0-heatmaps: 1.904e-04 (1.904e-04) Stage1-heatmaps: 8.872e-05 (8.872e-05) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 5.090e-08 (5.090e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Traceback (most recent call last):
File "tools/dist_train.py", line 323, in
main()
File "tools/dist_train.py", line 115, in main
args=(ngpus_per_node, args, final_output_dir, tb_log_dir)
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/dist_train.py", line 285, in main_worker
final_output_dir, tb_log_dir, writer_dict, fp16=cfg.FP16.ENABLED)
File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/../lib/core/trainer.py", line 76, in do_train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 52.00 MiB (GPU 0; 7.80 GiB total capacity; 5.73 GiB already allocated; 27.31 MiB free; 5.86 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fa6f3122536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7fa6f336bf1e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7fa6f336cf9e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x135 (0x7fa6f5f00535 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xf7a66b (0x7fa6f44f866b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xfc3f57 (0x7fa6f4541f57 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1075389 (0x7fa730a7c389 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x10756c7 (0x7fa730a7c6c7 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0xe3c42e (0x7fa73084342e in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: at::TensorIterator::fast_set_up() + 0x5cf (0x7fa7308442af in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: at::TensorIterator::build() + 0x4c (0x7fa730844b6c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x146 (0x7fa730845216 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: at::native::mul(at::Tensor const&, at::Tensor const&) + 0x3a (0x7fa730564eba in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0xf76ef8 (0x7fa6f44f4ef8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #14: + 0x10c3ec0 (0x7fa730acaec0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0x2d2e779 (0x7fa732735779 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x10c3ec0 (0x7fa730acaec0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::Tensor::mul(at::Tensor const&) const + 0xf0 (0x7fa73f108ab0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #18: torch::autograd::generated::PowBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1a6 (0x7fa7322caa06 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0x2d89c05 (0x7fa732790c05 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fa73278df03 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fa73278ece2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fa732787359 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fa73eec64d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #24: + 0xbd66f (0x7fa73ff9766f in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #25: + 0x76db (0x7fa742c906db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #26: clone + 0x3f (0x7fa742fc988f in /lib/x86_64-linux-gnu/libc.so.6)

root@a2bff378da93:/kpoints/HigherHRNet-Human-Pose-Estimation# /usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
len(cache))
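A likely suspect when the OOM appears only on resume: torch.load without a map_location restores every checkpoint tensor onto the GPU it was saved from (usually cuda:0), so each spawned rank puts an extra copy of the model and optimizer state on GPU 0 on top of what is already resident there. Below is a minimal sketch of loading the checkpoint onto CPU first; the function name, the file path, and the state_dict/optimizer/epoch keys are illustrative, not necessarily the repo's actual ones.

import torch

def resume(model, optimizer, checkpoint_path):
    # Load onto CPU so spawned ranks do not pile the checkpoint onto GPU 0.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])
    # Optimizer.load_state_dict casts the restored state to the device of the
    # matching parameters, so nothing is left stranded on the CPU.
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint.get("epoch", 0)

If the checkpoint must stay on GPU, map_location={"cuda:0": f"cuda:{rank}"} (with rank being the hypothetical per-process index) redirects it to each rank's own device instead of cuda:0.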

@chenmingjian

I had the same problem. :(

wusaisa commented Oct 30, 2023

I had the same problem. Did you solve it?
