Installation problem #105

Open
caiobarrosv opened this issue Jun 27, 2024 · 5 comments

caiobarrosv commented Jun 27, 2024

Guys, I'm having a lot of problems trying to execute the train.py file.

OS: Ubuntu 22.04
Graphics card: RTX 3060
Driver version: 545.29.06

I installed CUDA 11.8 and configured bashrc accordingly:

export PATH=/usr/local/cuda-11.8/bin:$PATH 
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH

The output of nvcc --version command is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Then I followed these steps:

$ # Clone the repo.
$ git clone https://github.com/SuLvXiangXin/zipnerf-pytorch.git
$ cd zipnerf-pytorch

$ # Make a conda environment.
$ conda create --name zipnerf python=3.9
$ conda activate zipnerf

$ # Install requirements.
$ pip install -r requirements.txt

These are the packages installed after running pip install -r requirements.txt:

_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
absl-py                   2.1.0                    pypi_0    pypi
accelerate                0.31.0                   pypi_0    pypi
asttokens                 2.4.1                    pypi_0    pypi
bzip2                     1.0.8                hd590300_5    conda-forge
ca-certificates           2024.6.2             hbcca054_0    conda-forge
certifi                   2024.6.2                 pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
contourpy                 1.2.1                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
exceptiongroup            1.2.1                    pypi_0    pypi
executing                 2.0.1                    pypi_0    pypi
filelock                  3.15.4                   pypi_0    pypi
fonttools                 4.53.0                   pypi_0    pypi
fsspec                    2024.6.0                 pypi_0    pypi
gin-config                0.5.0                    pypi_0    pypi
grpcio                    1.64.1                   pypi_0    pypi
huggingface-hub           0.23.4                   pypi_0    pypi
idna                      3.7                      pypi_0    pypi
imageio                   2.34.2                   pypi_0    pypi
imageio-ffmpeg            0.5.1                    pypi_0    pypi
importlib-metadata        8.0.0                    pypi_0    pypi
importlib-resources       6.4.0                    pypi_0    pypi
ipython                   8.18.1                   pypi_0    pypi
jedi                      0.19.1                   pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
joblib                    1.4.2                    pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
lazy-loader               0.4                      pypi_0    pypi
ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0              h77fa898_13    conda-forge
libgomp                   13.2.0              h77fa898_13    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
markdown                  3.6                      pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.9.0                    pypi_0    pypi
matplotlib-inline         0.1.7                    pypi_0    pypi
mediapy                   1.2.2                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
ncurses                   6.5                  h59595ed_0    conda-forge
networkx                  3.2.1                    pypi_0    pypi
ninja                     1.11.1.1                 pypi_0    pypi
numpy                     2.0.0                    pypi_0    pypi
nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12         8.9.2.26                 pypi_0    pypi
nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.5.40                  pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
opencv-contrib-python     4.10.0.84                pypi_0    pypi
opencv-python             4.10.0.84                pypi_0    pypi
openssl                   3.3.1                h4ab18f5_1    conda-forge
packaging                 24.1                     pypi_0    pypi
parso                     0.8.4                    pypi_0    pypi
pexpect                   4.9.0                    pypi_0    pypi
pillow                    10.3.0                   pypi_0    pypi
pip                       24.0               pyhd8ed1ab_0    conda-forge
plyfile                   1.0.3                    pypi_0    pypi
prompt-toolkit            3.0.47                   pypi_0    pypi
protobuf                  4.25.3                   pypi_0    pypi
psutil                    6.0.0                    pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
pygments                  2.18.0                   pypi_0    pypi
pymeshlab                 2023.12.post1            pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
python                    3.9.19          h0755675_0_cpython    conda-forge
python-dateutil           2.9.0.post0              pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
rawpy                     0.22.0                   pypi_0    pypi
readline                  8.2                  h8228510_1    conda-forge
requests                  2.32.3                   pypi_0    pypi
safetensors               0.4.3                    pypi_0    pypi
scikit-image              0.24.0                   pypi_0    pypi
scikit-learn              1.5.0                    pypi_0    pypi
scipy                     1.13.1                   pypi_0    pypi
setuptools                70.1.1             pyhd8ed1ab_0    conda-forge
six                       1.16.0                   pypi_0    pypi
stack-data                0.6.3                    pypi_0    pypi
sympy                     1.12.1                   pypi_0    pypi
tensorboard               2.17.0                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorboardx              2.6.2.2                  pypi_0    pypi
threadpoolctl             3.5.0                    pypi_0    pypi
tifffile                  2024.6.18                pypi_0    pypi
tk                        8.6.13          noxft_h4845f30_101    conda-forge
torch                     2.3.1                    pypi_0    pypi
tqdm                      4.66.4                   pypi_0    pypi
traitlets                 5.14.3                   pypi_0    pypi
trimesh                   4.4.1                    pypi_0    pypi
triton                    2.3.1                    pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
tzdata                    2024a                h0c530f3_0    conda-forge
urllib3                   2.2.2                    pypi_0    pypi
wcwidth                   0.2.13                   pypi_0    pypi
werkzeug                  3.0.3                    pypi_0    pypi
wheel                     0.43.0             pyhd8ed1ab_1    conda-forge
xatlas                    0.0.9                    pypi_0    pypi
xz                        5.2.6                h166bdaf_0    conda-forge
zipp                      3.19.2                   pypi_0    pypi

The first problem arises when I try to run the following command:

$ pip install ./extensions/cuda

The output is:

The detected CUDA version (11.8) mismatches the version that was used to compile
PyTorch (12.1). Please make sure to use the same CUDA versions.

It happens because the torch installed from the requirements.txt uses cuda 12.1:

$ python -c "import torch; print(torch.version.cuda)"
12.1
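
A quick way to catch this mismatch before building the extension is to compare the release reported by `nvcc --version` with `torch.version.cuda`. The sketch below only parses text; the helper names `parse_nvcc_release` and `cuda_versions_match` are hypothetical, and it assumes the `release X.Y` wording seen in the nvcc output earlier in this post.

```python
import re
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Return the 'X.Y' release found in `nvcc --version` text, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """True when the toolkit release equals the CUDA version PyTorch was built for."""
    return parse_nvcc_release(nvcc_output) == torch_cuda


# Sample line copied from the nvcc output in this issue.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample))           # 11.8
print(cuda_versions_match(sample, "12.1"))  # False
```

In a real environment the two arguments would come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` and `torch.version.cuda`.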

Therefore, I changed the CUDA version to 12.1 in bashrc:

export PATH=/usr/local/cuda-12.1/bin:$PATH 
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH 

The g++ version is:

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Even after changing the cuda version, I get a lot of errors:
extension_cuda.log

I then uninstalled ninja:

pip uninstall ninja

And changed the string -std=c++14 to -std=c++17 in extensions/cuda/setup.py.
After this change, everything compiles:

(zipnerf) caio@caio:~/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch$ python -m pip install ./extensions/cuda
Processing ./extensions/cuda
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: cuda_backend
  Building wheel for cuda_backend (setup.py) ... done
  Created wheel for cuda_backend: filename=cuda_backend-0.0.0-cp39-cp39-linux_x86_64.whl size=3176457 sha256=f9076cdabe55877cebfc06f06e4b2258980dbaf3012d27fd26e3f6c6d445e421
  Stored in directory: /tmp/pip-ephem-wheel-cache-jw40zyr5/wheels/af/21/ed/afe122eadd56c6f1f7a8fefee691e9ef1def176f5abb977063
Successfully built cuda_backend
Installing collected packages: cuda_backend
Successfully installed cuda_backend-0.0.0

Finally, I tried to install torch-scatter for CUDA 12.1 and torch 2.3.1 (the version installed from requirements.txt):

$  pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
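
Note that the index URL in this command names torch 2.1.0, while the environment has torch 2.3.1. As a sketch, the index page name can be derived from the installed versions instead of being typed by hand. This assumes the PyG wheel index follows the `torch-<major>.<minor>.0+<cuda tag>.html` pattern; `pyg_wheel_index` is a hypothetical helper, not part of any library.

```python
def pyg_wheel_index(torch_version: str, cuda_version: str) -> str:
    """Build a PyG wheel index URL from torch.__version__ and torch.version.cuda.

    Assumption: index pages are named after the x.y.0 release of PyTorch.
    """
    # Drop any local build suffix such as '+cu121', then keep major.minor only.
    major, minor = torch_version.split("+")[0].split(".")[:2]
    cuda_tag = "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{major}.{minor}.0+{cuda_tag}.html"


print(pyg_wheel_index("2.3.1", "12.1"))
```

The printed URL can then be passed to `pip install torch-scatter -f <url>`; whether a matching wheel exists for a given combination still has to be checked on the index page.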

When I try to run the train.py script with the bicycle dataset I get:

(base) (base) caio@caio:~/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch$ /home/caio/anaconda3/envs/zipnerf/bin/python /home/caio/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch/train.py
Traceback (most recent call last):
  File "/home/caio/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch/train.py", line 15, in <module>
    from internal import datasets
  File "/home/caio/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch/internal/datasets.py", line 22, in <module>
    from .pycolmap import pycolmap
  File "/home/caio/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch/internal/pycolmap/pycolmap/__init__.py", line 4, in <module>
    from .scene_manager import SceneManager
  File "/home/caio/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch/internal/pycolmap/pycolmap/scene_manager.py", line 21, in <module>
    class SceneManager:
  File "/home/caio/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch/internal/pycolmap/pycolmap/scene_manager.py", line 22, in SceneManager
    INVALID_POINT3D = np.uint64(-1)
OverflowError: Python integer -1 out of bounds for uint64

Any thoughts on how to solve it? Thank you very much :)

@caiobarrosv
Author

Update:

If I change

INVALID_POINT3D = np.uint64(-1)

to

INVALID_POINT3D = np.uint64(2**64 - 1)
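
Both spellings are meant to produce the largest uint64 value; NumPy 2.0.0 (the version in the package list) raises on the out-of-range Python integer -1, as the traceback shows. A version-independent alternative, offered as a sketch rather than the maintainers' fix, is to ask NumPy for the maximum directly:

```python
import numpy as np

# Largest value representable in uint64, used as the "invalid point" sentinel.
# np.iinfo avoids hard-coding 2**64 - 1 and never passes a negative integer.
INVALID_POINT3D = np.uint64(np.iinfo(np.uint64).max)

print(int(INVALID_POINT3D) == 2**64 - 1)  # True
```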

The code now works up to this point:

(zipnerf) caio@caio:~/2_company/OmniverseGaussianSplatIsaacSimProject/splats_folders/zipnerf-pytorch$ python -m train
2024-06-27 12:40:19: Config(dataset_loader='llff', batching='all_images', batch_size=8192, patch_size=1, factor=4, multiscale=False, multiscale_levels=4, forward_facing=False, render_path=False, llffhold=8, llff_use_all_images_for_training=False, llff_use_all_images_for_testing=False, use_tiffs=False, compute_disp_metrics=False, compute_normal_metrics=False, disable_multiscale_loss=False, randomized=True, near=2.0, far=6.0, exp_name='test', data_dir='/home/caio/2_company/OmniverseGaussianSplatIsaacSimProject/datasets/bicycle', vocab_tree_path=None, render_chunk_size=65536, num_showcase_images=5, deterministic_showcase=True, vis_num_rays=16, vis_decimate=0, dpcpp_backend=False, importance_sampling=False, max_steps=25000, early_exit_steps=None, checkpoint_every=5000, resume_from_checkpoint=True, checkpoints_total_limit=1, gradient_scaling=False, print_every=100, train_render_every=500, data_loss_type='charb', charb_padding=0.001, data_loss_mult=1.0, data_coarse_loss_mult=0.0, interlevel_loss_mult=0.0, anti_interlevel_loss_mult=0.01, orientation_loss_mult=0.0, orientation_coarse_loss_mult=0.0, orientation_loss_target='normals_pred', predicted_normal_loss_mult=0.0, predicted_normal_coarse_loss_mult=0.0, hash_decay_mults=0.1, lr_init=0.01, lr_final=0.001, lr_delay_steps=5000, lr_delay_mult=1e-08, adam_beta1=0.9, adam_beta2=0.99, adam_eps=1e-15, grad_max_norm=0.0, grad_max_val=0.0, distortion_loss_mult=0.005, opacity_loss_mult=0.0, eval_only_once=True, eval_save_output=True, eval_save_ray_data=False, eval_render_interval=1, eval_dataset_limit=2147483647, eval_quantize_metrics=True, eval_crop_borders=0, render_video_fps=60, render_video_crf=18, render_path_frames=120, z_variation=0.0, z_phase=0.0, render_dist_percentile=0.5, render_dist_curve_fn=<ufunc 'log'>, render_path_file=None, render_resolution=None, render_focal=None, render_camtype=None, render_spherical=False, render_save_async=True, render_spline_keyframes=None, render_spline_n_interp=30, render_spline_degree=5, 
render_spline_smoothness=0.03, render_spline_interpolate_exposure=False, rawnerf_mode=False, exposure_percentile=97.0, num_border_pixels_to_mask=0, apply_bayer_mask=False, autoexpose_renders=False, eval_raw_affine_cc=False, zero_glo=False, valid_weight_thresh=0.05, isosurface_threshold=20, mesh_voxels=134217728, visibility_resolution=512, mesh_radius=1.0, mesh_max_radius=10.0, std_value=0.0, compute_visibility=False, extract_visibility=True, decimate_target=-1, vertex_color=True, vertex_projection=True, tsdf_radius=2.0, tsdf_resolution=512, truncation_margin=5.0, tsdf_max_radius=10.0)
2024-06-27 12:40:19: Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: no

Warning: image_path not found for reconstruction
Warning: image_path not found for reconstruction                                                                                                                                          
2024-06-27 12:40:24: Checkpoint does not exist. Starting a new training run.                                                                                                              
2024-06-27 12:40:24: Number of parameters being optimized: 130306473
2024-06-27 12:40:24: Begin training...
Training:   0%|                                                                                                                                                 | 0/25000 [00:00<?, ?it/s]
2024-06-27 12:40:25: Error!
Traceback (most recent call last):
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/caio/.local/share/Trash/files/zipnerf-pytorch.2/train.py", line 390, in <module>
    app.run(main)
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/caio/.local/share/Trash/files/zipnerf-pytorch.2/train.py", line 169, in main
    renderings, ray_history = model(
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/caio/.local/share/Trash/files/zipnerf-pytorch.2/internal/models.py", line 229, in forward
    ray_results = mlp(
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/caio/.local/share/Trash/files/zipnerf-pytorch.2/internal/models.py", line 523, in forward
    raw_grad_density = torch.autograd.grad(
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/site-packages/torch/autograd/__init__.py", line 412, in grad
    result = _engine_run_backward(
  File "/home/caio/anaconda3/envs/zipnerf/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

@caiobarrosv
Author

caiobarrosv commented Jun 27, 2024

Update:

I also had to change this part of the code inside the forward method:

if self.disable_density_normals:
    raw_density, x, means_contract = self.predict_density(means, stds, rand=rand, no_warp=no_warp)
    raw_grad_density = None
    normals = None
else:
    with torch.enable_grad():
        means.requires_grad_(True)
        raw_density, x, means_contract = self.predict_density(means, stds, rand=rand, no_warp=no_warp)
        d_output = torch.ones_like(raw_density, requires_grad=False, device=raw_density.device)
        raw_grad_density = torch.autograd.grad(
            outputs=raw_density,
            inputs=means,
            grad_outputs=d_output,
            create_graph=True,
            retain_graph=True,
            only_inputs=True, 
            allow_unused=True
        )[0]
    if raw_grad_density is not None:
        raw_grad_density = raw_grad_density.mean(-2)
        # Compute normal vectors as negative normalized density gradient.
        # We normalize the gradient of raw (pre-activation) density because
        # it's the same as post-activation density, but is more numerically stable
        # when the activation function has a steep or flat gradient.
        normals = -ref_utils.l2_normalize(raw_grad_density)
    else:
        # Handle the case where raw_grad_density is None
        normals = None

and then change the batch size (in the file zipnerf-pytorch/internal/configs.py) to:

batch_size: int = 2 ** 12  # The number of rays/pixels in each batch.

I also had to change the version of matplotlib and numpy to make it work:

pip install matplotlib==3.7.3
pip install numpy==1.26.4
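
Pinning NumPy below 2.0 also sidesteps the `np.uint64(-1)` error shown earlier, since NumPy 1.26 still accepts that call. A small sketch for confirming what the environment actually has installed, using only the standard library (`report_versions` is a hypothetical helper):

```python
from importlib import metadata


def report_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(report_versions(["numpy", "matplotlib", "torch"]))
```

Running it inside the `zipnerf` conda environment shows whether the pins took effect or whether a later install replaced them.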

Now the training is executing and will take almost 3 hours to train the bicycle scene :(
Also, the results are really poor

image

@LiHaodong0217

So 3 hours is longer than you expected?

@LiHaodong0217

I have the same problem as you. Maybe it is because the download link has changed.

@Lee-JaeWon

@caiobarrosv
Is there no way to do this with CUDA 11.8?


3 participants