I get a CUDA out-of-memory error every time I resume training from a checkpoint, but there is no error if I load the initial weights and train from the first epoch. I am using my own dataset, but I suspect the problem is more likely something in the distributed training setup. Any suggestions on how I should check the code?
Training from epoch 0 works fine; continuing training fails with the error below:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/dist_train.py", line 285, in main_worker
final_output_dir, tb_log_dir, writer_dict, fp16=cfg.FP16.ENABLED)
File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/../lib/core/trainer.py", line 76, in do_train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 52.00 MiB (GPU 0; 7.80 GiB total capacity; 5.73 GiB already allocated; 27.31 MiB free; 5.86 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fa6f3122536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7fa6f336bf1e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7fa6f336cf9e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x135 (0x7fa6f5f00535 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xf7a66b (0x7fa6f44f866b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xfc3f57 (0x7fa6f4541f57 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1075389 (0x7fa730a7c389 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x10756c7 (0x7fa730a7c6c7 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0xe3c42e (0x7fa73084342e in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: at::TensorIterator::fast_set_up() + 0x5cf (0x7fa7308442af in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: at::TensorIterator::build() + 0x4c (0x7fa730844b6c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x146 (0x7fa730845216 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: at::native::mul(at::Tensor const&, at::Tensor const&) + 0x3a (0x7fa730564eba in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0xf76ef8 (0x7fa6f44f4ef8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #14: + 0x10c3ec0 (0x7fa730acaec0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0x2d2e779 (0x7fa732735779 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x10c3ec0 (0x7fa730acaec0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::Tensor::mul(at::Tensor const&) const + 0xf0 (0x7fa73f108ab0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #18: torch::autograd::generated::PowBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1a6 (0x7fa7322caa06 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0x2d89c05 (0x7fa732790c05 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fa73278df03 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fa73278ece2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fa732787359 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fa73eec64d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #24: + 0xbd66f (0x7fa73ff9766f in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #25: + 0x76db (0x7fa742c906db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #26: clone + 0x3f (0x7fa742fc988f in /lib/x86_64-linux-gnu/libc.so.6)
root@a2bff378da93:/kpoints/HigherHRNet-Human-Pose-Estimation# /usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
len(cache))
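One thing worth checking when the OOM appears only on resume: torch.load by default restores each tensor onto the device it was saved from (usually GPU 0), so every spawned worker can end up holding an extra copy of the model and optimizer state on GPU 0 before training even starts. Below is a minimal sketch of loading the checkpoint onto the CPU first; the file name, dictionary keys, and helper name are illustrative assumptions, not the repository's actual resume code.

```python
import os
import torch

def resume_checkpoint(model, optimizer, output_dir):
    # Hypothetical resume helper; file name and dict keys are assumptions,
    # not the repository's actual checkpoint layout.
    ckpt_file = os.path.join(output_dir, 'checkpoint.pth')
    if not os.path.isfile(ckpt_file):
        return 0  # nothing to resume, start from epoch 0

    # map_location='cpu' keeps the loaded tensors off the GPU; the default
    # mapping would restore them all onto the device they were saved from
    # (usually GPU 0), in every spawned worker process.
    checkpoint = torch.load(ckpt_file, map_location='cpu')

    model.load_state_dict(checkpoint['state_dict'])     # assumed key
    optimizer.load_state_dict(checkpoint['optimizer'])  # assumed key
    return checkpoint['epoch']                          # assumed key
```

If the resume path in the repo calls a plain torch.load(path), switching to map_location='cpu' (or mapping to the local rank's device) is worth trying before digging further into the distributed setup.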