DDP training freezes immediately #17389
Comments
Are you able to get other DDP jobs to run?

    import lightning as L
    from lightning.pytorch.demos.boring_classes import BoringModel

    ngpus = 3
    model = BoringModel()
    trainer = L.Trainer(max_epochs=10, devices=ngpus)
    trainer.fit(model)
Ah. Thanks for the reduction. No, this doesn't seem to work either. Again, I get
and then nothing.
I don't understand why torchvision is outputting an error here, as it wasn't in the script. Did you install PyTorch using Miniconda or pip?

    conda create -n testenv python=3.9
    conda activate testenv
    pip install torch torchvision lightning
    python -c "import torch; print(torch.__version__)"

Also do …
Hi, was this ever fixed? I'm running into the same issue using …
@shoang22 The exact problem was never really identified. It looks like a problem with the installation. Have you tried creating a clean environment with the above steps?
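
One way to check whether an installation is internally consistent is to list the versions of the related packages side by side. The environment in this report pairs torch 2.0.0 with torchvision 0.14.1 and torchaudio 0.13.1, both of which were released alongside torch 1.13.1, so a mismatch of that kind is worth ruling out. A minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative and not part of Lightning or PyTorch:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Illustrative list: the packages that show up in this report.
    names = ["torch", "torchvision", "torchaudio", "lightning", "pytorch-lightning"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the environment used for training, rather than in a fresh shell, shows what the training process would actually import.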
I did try a clean install, but the problem persisted. I was, however, able to solve the problem. I was running my script on a SLURM cluster. It turns out that I needed to include …
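
The thread does not preserve which setting was missing, so the following is not the fix itself. When a DDP job hangs under SLURM, it can still help to print what the scheduler exposes to the process and compare it with the `devices` value passed to the `Trainer`. A minimal sketch: the variable names are standard SLURM environment variables, while the helper `slurm_summary` is hypothetical:

```python
import os

# Standard SLURM environment variables describing the job layout.
SLURM_VARS = (
    "SLURM_JOB_ID",
    "SLURM_NNODES",
    "SLURM_NTASKS",
    "SLURM_NTASKS_PER_NODE",
    "SLURM_PROCID",
    "SLURM_LOCALID",
)


def slurm_summary(environ=None):
    """Return {variable: value or None} for the SLURM variables above."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in SLURM_VARS}


if __name__ == "__main__":
    for name, value in slurm_summary().items():
        print(f"{name}={value}")
```

A value of `None` means SLURM did not define that variable for the process, for example because the corresponding option was not given in the submission script.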
Bug description
I'm trying to run a job with several GPUs. My script immediately gets stuck after outputting:
What version are you seeing the problem on?
2.0+ and 1.9.x
How to reproduce the bug
Error messages and logs
Environment
Current environment
- GPU:
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
  - NVIDIA RTX A6000
- available: True
- version: 11.7
- lightning: 2.0.1
- lightning-cloud: 0.5.32
- lightning-utilities: 0.7.0
- pytorch-lightning: 1.9.3
- torch: 2.0.0
- torchaudio: 0.13.1
- torchmetrics: 0.11.1
- torchvision: 0.14.1
- absl-py: 1.4.0
- aiohttp: 3.8.4
- aiosignal: 1.3.1
- altair: 4.2.2
- anyio: 3.6.2
- appdirs: 1.4.4
- arrow: 1.2.3
- asttokens: 2.2.1
- astunparse: 1.6.3
- async-timeout: 4.0.2
- attrs: 22.2.0
- backcall: 0.2.0
- backports.functools-lru-cache: 1.6.4
- beautifulsoup4: 4.12.0
- black: 23.3.0
- blessed: 1.20.0
- brotlipy: 0.7.0
- cachetools: 5.3.0
- certifi: 2022.12.7
- cffi: 1.15.1
- charset-normalizer: 2.0.4
- click: 8.1.3
- cmake: 3.26.1
- colorama: 0.4.6
- contourpy: 1.0.7
- croniter: 1.3.8
- cryptography: 38.0.4
- cycler: 0.11.0
- dateutils: 0.6.12
- debugpy: 1.5.1
- decorator: 5.1.1
- deepdiff: 6.3.0
- deepxde: 1.8.0
- dnspython: 2.3.0
- docker-pycreds: 0.4.0
- email-validator: 1.3.1
- entrypoints: 0.4
- exceptiongroup: 1.1.0
- executing: 1.2.0
- fastapi: 0.88.0
- filelock: 3.10.7
- flatbuffers: 23.1.21
- flit-core: 3.6.0
- fonttools: 4.38.0
- frozenlist: 1.3.3
- fsspec: 2023.1.0
- gast: 0.4.0
- gitdb: 4.0.10
- gitpython: 3.1.31
- google-auth: 2.16.1
- google-auth-oauthlib: 0.4.6
- google-pasta: 0.2.0
- gpustat: 1.0.0
- grpcio: 1.51.1
- h11: 0.14.0
- h5py: 3.8.0
- hcpdenn: 0.0.1
- httpcore: 0.16.3
- httptools: 0.5.0
- httpx: 0.23.3
- idna: 3.4
- importlib-metadata: 6.0.0
- importlib-resources: 5.12.0
- iniconfig: 2.0.0
- inquirer: 3.1.3
- ipykernel: 6.15.0
- ipython: 8.10.0
- itsdangerous: 2.1.2
- jax: 0.3.25
- jaxlib: 0.3.25+cuda11.cudnn82
- jedi: 0.18.2
- jinja2: 3.1.2
- joblib: 1.2.0
- jsonschema: 4.17.3
- jupyter-client: 7.0.6
- jupyter-core: 4.12.0
- keras: 2.11.0
- kiwisolver: 1.4.4
- libclang: 15.0.6.1
- lightning: 2.0.1
- lightning-cloud: 0.5.32
- lightning-utilities: 0.7.0
- lit: 16.0.0
- markdown: 3.4.1
- markdown-it-py: 2.2.0
- markupsafe: 2.1.2
- matplotlib: 3.7.0
- matplotlib-inline: 0.1.6
- mdurl: 0.1.2
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- ml-dtypes: 0.0.4
- mpmath: 1.3.0
- multidict: 6.0.4
- mypy-extensions: 1.0.0
- nest-asyncio: 1.5.6
- networkx: 3.0
- numpy: 1.23.5
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-cupti-cu11: 11.7.101
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- nvidia-cufft-cu11: 10.9.0.58
- nvidia-curand-cu11: 10.2.10.91
- nvidia-cusolver-cu11: 11.4.0.1
- nvidia-cusparse-cu11: 11.7.4.91
- nvidia-ml-py: 11.495.46
- nvidia-nccl-cu11: 2.14.3
- nvidia-nvtx-cu11: 11.7.91
- oauthlib: 3.2.2
- opt-einsum: 3.3.0
- ordered-set: 4.1.0
- orjson: 3.8.9
- packaging: 23.0
- pandas: 1.5.3
- parso: 0.8.3
- pathspec: 0.11.1
- pathtools: 0.1.2
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 9.3.0
- pip: 22.3.1
- platformdirs: 3.2.0
- pluggy: 1.0.0
- pooch: 1.6.0
- prompt-toolkit: 3.0.36
- protobuf: 3.19.6
- psutil: 5.9.4
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- pyaml: 21.10.1
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pybind11: 2.10.3
- pycparser: 2.21
- pydantic: 1.10.7
- pygments: 2.14.0
- pyjwt: 2.6.0
- pyopenssl: 22.0.0
- pyparsing: 3.0.9
- pyrsistent: 0.19.3
- pysocks: 1.7.1
- pytest: 7.2.1
- python-dateutil: 2.8.2
- python-dotenv: 1.0.0
- python-editor: 1.0.4
- python-multipart: 0.0.6
- pytorch-lightning: 1.9.3
- pytz: 2022.7.1
- pyyaml: 6.0
- pyzmq: 19.0.2
- readchar: 4.0.5
- requests: 2.28.1
- requests-oauthlib: 1.3.1
- rfc3986: 1.5.0
- rich: 13.3.3
- rsa: 4.9
- scienceplots: 2.0.1
- scikit-learn: 1.2.1
- scikit-optimize: 0.9.0
- scikit-sparse: 0.4.8
- scipy: 1.10.1
- seaborn: 0.12.2
- sentry-sdk: 1.16.0
- setproctitle: 1.3.2
- setuptools: 65.6.3
- six: 1.16.0
- sklearn: 0.0.post1
- smmap: 5.0.0
- sniffio: 1.3.0
- soupsieve: 2.4
- stack-data: 0.6.2
- starlette: 0.22.0
- starsessions: 1.3.0
- sympy: 1.11.1
- tensorboard: 2.11.2
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- tensorflow: 2.11.0
- tensorflow-addons: 0.19.0
- tensorflow-estimator: 2.11.0
- tensorflow-io-gcs-filesystem: 0.30.0
- termcolor: 2.2.0
- theseus-ai: 0.1.4
- threadpoolctl: 3.1.0
- tomli: 2.0.1
- toolz: 0.12.0
- torch: 2.0.0
- torchaudio: 0.13.1
- torchmetrics: 0.11.1
- torchvision: 0.14.1
- tornado: 6.2
- tqdm: 4.64.1
- traitlets: 5.9.0
- triton: 2.0.0
- typeguard: 2.13.3
- typing-extensions: 4.4.0
- ujson: 5.7.0
- urllib3: 1.26.14
- uvicorn: 0.21.1
- uvloop: 0.17.0
- wandb: 0.13.10
- watchfiles: 0.19.0
- wcwidth: 0.2.6
- websocket-client: 1.5.1
- websockets: 10.4
- werkzeug: 2.2.3
- wheel: 0.38.4
- wrapt: 1.14.1
- yarl: 1.8.2
- zipp: 3.14.0
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.9.16
- version: #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023
More info
No response
cc @justusschock @awaelchli