CUDA out of memory. #3

Open
soichih opened this issue Jul 8, 2020 · 4 comments

soichih commented Jul 8, 2020

I've tested this App with a small test .trk file (generated by TractSeg as tck then converted to trk). It has ~22k fibers

[screenshot: input .trk file]

When I run the App on gpu1, it fails with the following error message.

tcktransform: applying spatial transformation to tracks... [==================================================]
Traceback (most recent call last):
  File "/tractogram_filtering/tractogram_filtering.py", line 204, in <module>
    logits = classifier(points)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tractogram_filtering/models/dec.py", line 58, in forward
    x2 = self.conv2(x1, batch)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/edge_conv.py", line 83, in forward
    return super(DynamicEdgeConv, self).forward(x, edge_index)
  File "/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/edge_conv.py", line 46, in forward
    return self.propagate(edge_index, x=x)
  File "/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/message_passing.py", line 263, in propagate
    out = self.message(**msg_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch_geometric/nn/conv/edge_conv.py", line 49, in message
    return self.nn(torch.cat([x_i, x_j - x_i], dim=1))
RuntimeError: CUDA out of memory. Tried to allocate 2.75 GiB (GPU 0; 10.76 GiB total capacity; 6.37 GiB already allocated; 1.27 GiB free; 8.68 GiB reserved in total by PyTorch)

I believe the "6.37 GiB already allocated" is from this App itself, as I don't see any other process running on gpu1 at the moment. Is there something wrong with my data? Maybe it doesn't work with full brain tractography?
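For reference, one way to confirm how much of the "already allocated" memory comes from this process itself is to print PyTorch's own counters right before the failing call. A minimal sketch only; the helper name and the call site are illustrative, and it assumes PyTorch >= 1.4 for memory_reserved:

    import torch

    def log_gpu_memory(tag=""):
        """Print how much GPU memory this process holds vs. what PyTorch's caching allocator has reserved."""
        dev = torch.cuda.current_device()
        total = torch.cuda.get_device_properties(dev).total_memory
        allocated = torch.cuda.memory_allocated(dev)  # tensors currently held by this process
        reserved = torch.cuda.memory_reserved(dev)    # blocks cached by PyTorch, not yet returned to the driver
        print(f"[{tag}] allocated={allocated / 2**30:.2f} GiB, "
              f"reserved={reserved / 2**30:.2f} GiB, total={total / 2**30:.2f} GiB")

    # e.g. call log_gpu_memory("before classifier") right before `logits = classifier(points)`
    # in tractogram_filtering.py to see what the App itself is using.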

Here is the provenance graph for my input trk data.

[screenshot: provenance graph of the input trk data]

@pietroastolfi
Member

No, there is no problem with the data. The app is designed to run on entire tractograms.

The problem is that the current implementation requires ~10 GB of GPU memory, but from your log it seems only 8.68 GiB were reserved by PyTorch.

I just tried to rerun the app on my private test repository and hit out of memory as well, but now with this message: RuntimeError: CUDA out of memory. Tried to allocate 1.44 GiB (GPU 0; 10.76 GiB total capacity; 3.11 GiB already allocated; 1.10 GiB free; 3.71 GiB reserved in total by PyTorch).
From this message I imagine that right now something else is loaded on the GPU, because only 3.71 GiB were reserved by PyTorch.
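If the full tractogram really needs ~10 GB in one shot, one possible workaround would be to run the classifier on chunks of streamlines instead of the whole tractogram at once. This is a rough sketch only: `classifier` and `points` follow the traceback above, everything else is illustrative, and the real model would likely need the torch_geometric batch vector rebuilt per chunk:

    import torch

    @torch.no_grad()
    def predict_in_chunks(classifier, points, chunk_size=4096):
        """Run inference chunk by chunk to cap peak GPU memory (illustrative only)."""
        outputs = []
        for start in range(0, points.size(0), chunk_size):
            chunk = points[start:start + chunk_size].cuda(non_blocking=True)
            outputs.append(classifier(chunk).cpu())   # move logits back to CPU right away
            del chunk
            torch.cuda.empty_cache()                  # return cached blocks between chunks
        return torch.cat(outputs, dim=0)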

I noticed that in the #PBS lines I merged from your pull request there is one saying vram 8GB. Does this refer to the main RAM or to the GPU RAM?


soichih commented Jul 8, 2020

#PBS vmem only applies to the main memory. So it sounds like some invisible memory is stuck on the GPU? I saw the same error message when I ran it through PSC Bridges, so I'm a bit skeptical of the invisible-memory theory, though.
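One way to check the invisible-memory theory is to ask the driver which compute processes are currently holding memory on that GPU. A minimal sketch that just shells out to nvidia-smi (it assumes nvidia-smi is on the PATH and Python >= 3.7 for capture_output):

    import subprocess

    def list_gpu_processes():
        """Return pid, process name and GPU memory for every compute process the NVIDIA driver reports."""
        out = subprocess.run(
            ["nvidia-smi",
             "--query-compute-apps=pid,process_name,used_memory",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or "no compute processes found"

    print(list_gpu_processes())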


pietroastolfi commented Jul 8, 2020

Yesterday I was able to run it on both gpu1 and Bridges. Here's the screenshot of the Bridges run:
[screenshot: Bridges run]

The input in this case was a trk with ~40k streamlines, so I'm sure it's not because of the size of the data you used.

I agree that invisible memory getting stuck is not plausible, but it's very strange that today I'm getting this error on gpu1: RuntimeError: CUDA out of memory. Tried to allocate 1.44 GiB (GPU 0; 10.76 GiB total capacity; 3.11 GiB already allocated; 1.10 GiB free; 3.71 GiB reserved in total by PyTorch), while yesterday I ran it smoothly more than once on the same gpu1.


soichih commented Jul 9, 2020

@pietroastolfi I couldn't find any process that might be holding the extra GPU memory. It looks like your code itself is allocating the extra memory.

The question is, why did it work a few days ago and why doesn't it work now? I see that the container you are using, pietroastolfi/tractogram-filtering:gpu, was updated 19 hours ago. What exactly did you change in this container?

Also, I'd like to propose the following:

  1. Instead of storing the whole application inside the container, store only the dependencies of your App (pytorch, numpy, etc.) that should not change very often, and feed the script you might be editing (like tractogram_filtering.py) from the github repo into the container at run time.
  2. Related to 1): store the main script as part of the same branch that contains tractogram_filtering.py (on master) and register that branch with brainlife.
  3. Every time you make an update to your container, tag it with a version number (like gpu-1.3, etc.) so you know which version of the container was used to run your code.

@soichih soichih closed this as completed Jul 9, 2020
@soichih soichih reopened this Jul 9, 2020