CUDA out of memory. #3

Open
soichih opened this issue Jul 8, 2020 · 4 comments

soichih commented Jul 8, 2020

I've tested this App with a small test .trk file (generated by TractSeg as tck then converted to trk). It has ~22k fibers

[screenshot: input .trk file]

When I run the App on gpu1, it fails with the following error message.

tcktransform: applying spatial transformation to tracks... [==================================================]
Traceback (most recent call last):
  File "/tractogram_filtering/tractogram_filtering.py", line 204, in <module>
    logits = classifier(points)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tractogram_filtering/models/dec.py", line 58, in forward
    x2 = self.conv2(x1, batch)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/edge_conv.py", line 83, in forward
    return super(DynamicEdgeConv, self).forward(x, edge_index)
  File "/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/edge_conv.py", line 46, in forward
    return self.propagate(edge_index, x=x)
  File "/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/message_passing.py", line 263, in propagate
    out = self.message(**msg_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/edge_conv.py", line 49, in message
    return self.nn(torch.cat([x_i, x_j - x_i], dim=1))
RuntimeError: CUDA out of memory. Tried to allocate 2.75 GiB (GPU 0; 10.76 GiB total capacity; 6.37 GiB already allocated; 1.27 GiB free; 8.68 GiB reserved in total by PyTorch)

I believe the "6.37 GiB already allocated" is from this App itself, as I don't see any other process running on gpu1 at the moment. Is there something wrong with my data? Maybe it doesn't work with full brain tractography?
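For reference, one way to confirm how much of the "already allocated" memory comes from this process itself is to print PyTorch's own counters right before the failing call. A minimal sketch only; the helper name and the call site are illustrative, and it assumes PyTorch >= 1.4 for memory_reserved:

    import torch

    def log_gpu_memory(tag=""):
        """Print how much GPU memory this process holds vs. what PyTorch's caching allocator has reserved."""
        dev = torch.cuda.current_device()
        total = torch.cuda.get_device_properties(dev).total_memory
        allocated = torch.cuda.memory_allocated(dev)  # tensors currently held by this process
        reserved = torch.cuda.memory_reserved(dev)    # blocks cached by PyTorch, not yet returned to the driver
        print(f"[{tag}] allocated={allocated / 2**30:.2f} GiB, "
              f"reserved={reserved / 2**30:.2f} GiB, total={total / 2**30:.2f} GiB")

    # e.g. call log_gpu_memory("before classifier") right before `logits = classifier(points)`
    # in tractogram_filtering.py to see what the App itself is using.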

Here is the provenance graph for my input trk data.

[screenshot: provenance graph of the input trk data]

@pietroastolfi
Member

No, there is no problem with the data. The app is designed to run on entire tractograms.

The problem is that the current implementation requires ~10 GB of GPU memory, but from your log it seems only 8.68 GiB were reserved by PyTorch.

I just tried to rerun the app on my private test repository and hit out of memory as well, but now with this message: RuntimeError: CUDA out of memory. Tried to allocate 1.44 GiB (GPU 0; 10.76 GiB total capacity; 3.11 GiB already allocated; 1.10 GiB free; 3.71 GiB reserved in total by PyTorch).
From this message I imagine that right now something else is loaded on the GPU, because only 3.71 GiB were reserved by PyTorch.
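If the full tractogram really needs ~10 GB in one shot, one possible workaround would be to run the classifier on chunks of streamlines instead of the whole tractogram at once. This is a rough sketch only: `classifier` and `points` follow the traceback above, everything else is illustrative, and the real model would likely need the torch_geometric batch vector rebuilt per chunk:

    import torch

    @torch.no_grad()
    def predict_in_chunks(classifier, points, chunk_size=4096):
        """Run inference chunk by chunk to cap peak GPU memory (illustrative only)."""
        outputs = []
        for start in range(0, points.size(0), chunk_size):
            chunk = points[start:start + chunk_size].cuda(non_blocking=True)
            outputs.append(classifier(chunk).cpu())   # move logits back to CPU right away
            del chunk
            torch.cuda.empty_cache()                  # return cached blocks between chunks
        return torch.cat(outputs, dim=0)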

I noticed that in the #PBS lines I merged from your pull request there is one saying vram 8GB. Does this refer to the main RAM or to the GPU RAM?


soichih commented Jul 8, 2020

#PBS vmem only applies to the main memory. So it sounds like some invisible memory is stuck on the GPU? I saw the same error message when I ran it through PSC Bridges, so I'm a bit skeptical of the invisible-memory theory, though.
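One way to check the invisible-memory theory is to ask the driver which compute processes are currently holding memory on that GPU. A minimal sketch that just shells out to nvidia-smi (it assumes nvidia-smi is on the PATH and Python >= 3.7 for capture_output):

    import subprocess

    def list_gpu_processes():
        """Return pid, process name and GPU memory for every compute process the NVIDIA driver reports."""
        out = subprocess.run(
            ["nvidia-smi",
             "--query-compute-apps=pid,process_name,used_memory",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or "no compute processes found"

    print(list_gpu_processes())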


pietroastolfi commented Jul 8, 2020

Yesterday I was able to run it on both gpu1 and Bridges. Here's the screenshot of the Bridges run:
[screenshot: Bridges run]

The input in this case was a trk with ~40k streamlines, so I'm sure it's not because of the size of the data you used.

I agree that invisible memory getting stuck is not plausible, but it's very strange that today I'm getting this error on gpu1: RuntimeError: CUDA out of memory. Tried to allocate 1.44 GiB (GPU 0; 10.76 GiB total capacity; 3.11 GiB already allocated; 1.10 GiB free; 3.71 GiB reserved in total by PyTorch), while yesterday I ran it smoothly more than once on the same gpu1.


soichih commented Jul 9, 2020

@pietroastolfi I couldn't find any process that might be holding the extra GPU memory. It looks like your code itself is allocating the extra memory.

The question is, why did it work a few days ago and why doesn't it work now? I see that the container you are using, pietroastolfi/tractogram-filtering:gpu, was updated 19 hours ago. What exactly did you change in this container?

Also, I'd like to propose the following:

  1. Instead of storing the whole application inside the container, store only the dependencies of your App (pytorch, numpy, etc.) that should not change very often, and feed the script you might be editing (like tractogram_filtering.py) from the github repo into the container at run time.
  2. Related to 1): store the main script as part of the same branch that contains tractogram_filtering.py (on master) and register that branch with brainlife.
  3. Every time you make an update to your container, tag it with a version number (like gpu-1.3, etc.) so you know which version of the container was used to run your code.

@soichih soichih closed this as completed Jul 9, 2020
@soichih soichih reopened this Jul 9, 2020