Skip to content

Commit

Permalink
translation faq finished (#33)
Browse files Browse the repository at this point in the history
* translation faq finished

* translate_err_fixed

* translate_err_fixed

* translate_err_fixed

* translate_err_fixed

* translate_err_fixed

* translate_err_fixed

* translate_err_fixed

* translate_err_fixed

* punctuation err fixed

* Update faq.md

* Update faq.md

* linted

* fix end of files

* 解决了一些符号问题

* en fixed

* en fixed

* 0 fixed

* 0 fixed

* check it word by word

* check docs

* check docs

Co-authored-by: hx <[email protected]>
Co-authored-by: yangxue <[email protected]>
Co-authored-by: jbwang1997 <[email protected]>
  • Loading branch information
4 people authored Feb 26, 2022
1 parent 559b467 commit 7c9a592
Show file tree
Hide file tree
Showing 2 changed files with 53 additions and 58 deletions.
6 changes: 3 additions & 3 deletions docs/en/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ We list some common troubles faced by many users and their corresponding solutio
2. If those symbols are PyTorch symbols (e.g., symbols containing caffe, aten, and TH), check whether the PyTorch version is the same as that used for compiling mmcv.
3. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV are built by and running on the same environment.

- setuptools.sandbox.UnpickleableException: DistutilsSetupError("each element of 'ext_modules' option must be an Extension instance or 2-tuple")
- "setuptools.sandbox.UnpickleableException: DistutilsSetupError("each element of 'ext_modules' option must be an Extension instance or 2-tuple")"

1. If you are using miniconda rather than anaconda, check whether Cython is installed as indicated in [#3379](https://github.com/open-mmlab/mmdetection/issues/3379).
You need to manually install Cython first and then run command `pip install -r requirements.txt`.
Expand Down Expand Up @@ -68,14 +68,14 @@ We list some common troubles faced by many users and their corresponding solutio
2. Reduce the learning rate: the learning rate might be too large due to some reasons, e.g., change of batch size. You can rescale them to the value that could stably train the model.
3. Extend the warmup iterations: some models are sensitive to the learning rate at the start of the training. You can extend the warmup iterations, e.g., change the `warmup_iters` from 500 to 1000 or 2000.
4. Add gradient clipping: some models requires gradient clipping to stabilize the training process. The default of `grad_clip` is `None`, you can add gradient clippint to avoid gradients that are too large, i.e., set `optimizer_config=dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))` in your config file. If your config does not inherits from any basic config that contains `optimizer_config=dict(grad_clip=None)`, you can simply add `optimizer_config=dict(grad_clip=dict(max_norm=35, norm_type=2))`.
- GPU out of memory"
- "GPU out of memory"
1. There are some scenarios when there are large amounts of ground truth boxes, which may cause OOM during target assignment. You can set `gpu_assign_thr=N` in the config of assigner thus the assigner will calculate box overlaps through CPU when there are more than N GT boxes.
2. Set `with_cp=True` in the backbone. This uses the sublinear strategy in PyTorch to reduce GPU memory cost in the backbone.
3. Try mixed precision training using following the examples in `config/fp16`. The `loss_scale` might need further tuning for different models.

- "RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one"
1. This error indicates that your module has parameters that were not used in producing loss. This phenomenon may be caused by running different branches in your code in DDP mode.
2. You can set ` find_unused_parameters = True` in the config to solve the above problems or find those unused parameters manually.
2. You can set `find_unused_parameters = True` in the config to solve the above problems or find those unused parameters manually.

## Evaluation

Expand Down
105 changes: 50 additions & 55 deletions docs/zh_cn/faq.md
Original file line number Diff line number Diff line change
@@ -1,90 +1,85 @@
# Frequently Asked Questions
# 常见问题解答

We list some common troubles faced by many users and their corresponding solutions here. Feel free to enrich the list if you find any frequent issues and have ways to help others to solve them. If the contents here do not cover your issue, please create an issue using the [provided templates](https://github.com/open-mmlab/mmdetection/blob/master/.github/ISSUE_TEMPLATE/error-report.md/) and make sure you fill in all required information in the template.
我们在这里列出了使用时的一些常见问题及其相应的解决方案。 如果您发现有一些问题被遗漏,请随时提 PR 丰富这个列表。如果您无法在此获得帮助,请使用 [issue模板](https://github.com/open-mmlab/mmdetection/blob/master/.github/ISSUE_TEMPLATE/error-report.md/) 创建问题,但是请在模板中填写所有必填信息,这有助于我们更快定位问题。

## MMCV Installation
## MMCV 安装相关

- Compatibility issue between MMCV and MMDetection; "ConvWS is already registered in conv layer"; "AssertionError: MMCV==xxx is used but incompatible. Please install mmcv>=xxx, <=xxx."
- MMCV MMDetection 的兼容问题: "ConvWS is already registered in conv layer"; "AssertionError: MMCV==xxx is used but incompatible. Please install mmcv>=xxx, <=xxx."

Please install the correct version of MMCV for the version of your MMDetection following the [installation instruction](https://mmdetection.readthedocs.io/en/latest/get_started.html#installation).
请按 [安装说明](https://mmdetection.readthedocs.io/zh_CN/latest/get_started.html#installation) 为你的 MMDetection 安装正确版本的 MMCV。

- "No module named 'mmcv.ops'"; "No module named 'mmcv._ext'".

1. Uninstall existing mmcv in the environment using `pip uninstall mmcv`.
2. Install mmcv-full following the [installation instruction](https://mmcv.readthedocs.io/en/latest/#installation).
原因是安装了 `mmcv` 而不是 `mmcv-full`

## PyTorch/CUDA Environment
1. 使用 `pip uninstall mmcv` 卸载。
2. 根据 [安装说明](https://mmcv.readthedocs.io/zh/latest/#installation) 安装 `mmcv-full`

## PyTorch/CUDA 环境相关

- "RTX 30 series card fails when building MMCV or MMDet"

1. Temporary work-around: do `MMCV_WITH_OPS=1 MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80' pip install -e .`.
The common issue is `nvcc fatal : Unsupported gpu architecture 'compute_86'`. This means that the compiler should optimize for sm_86, i.e., nvidia 30 series card, but such optimizations have not been supported by CUDA toolkit 11.0.
This work-around modifies the compile flag by adding `MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80'`, which tells `nvcc` to optimize for **sm_80**, i.e., Nvidia A100. Although A100 is different from the 30 series card, they use similar ampere architecture. This may hurt the performance but it works.
2. PyTorch developers have updated that the default compiler flags should be fixed by [pytorch/pytorch#47585](https://github.com/pytorch/pytorch/pull/47585). So using PyTorch-nightly may also be able to solve the problem, though we have not tested it yet.
1. 常见报错信息为 `nvcc fatal: Unsupported gpu architecture 'compute_86'` 意思是你的编译器应该为 sm_86 进行优化,例如, 英伟达 30 系列的显卡,但这样的优化 CUDA toolkit 11.0 并不支持。
此解决方案通过添加 `MMCV_WITH_OPS=1 MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80' pip install -e .` 来修改编译标志,这告诉编译器 `nvcc`**sm_80** 进行优化,例如 Nvidia A100,尽管 A100 不同于 30 系列的显卡,但他们使用相似的图灵架构。这种解决方案可能会丧失一些性能但的确有效。
2. PyTorch 开发者已经在 [pytorch/pytorch#47585](https://github.com/pytorch/pytorch/pull/47585) 更新了 PyTorch 默认的编译标志,所以使用 PyTorch-nightly 可能也能解决这个问题,但是我们对此并没有验证这种方式是否有效。

- "invalid device function" or "no kernel image is available for execution".

1. Check if your cuda runtime version (under `/usr/local/`), `nvcc --version` and `conda list cudatoolkit` version match.
2. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV are built for the correct GPU architecture.
You may need to set `TORCH_CUDA_ARCH_LIST` to reinstall MMCV.
The GPU arch table could be found [here](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list),
i.e. run `TORCH_CUDA_ARCH_LIST=7.0 pip install mmcv-full` to build MMCV for Volta GPUs.
The compatibility issue could happen when using old GPUS, e.g., Tesla K80 (3.7) on colab.
3. Check whether the running environment is the same as that when mmcv/mmdet has compiled.
For example, you may compile mmcv using CUDA 10.0 but run it on CUDA 9.0 environments.
1. 检查您的 cuda 运行时版本(一般在 `/usr/local/`)、指令 `nvcc --version` 显示的版本以及 `conda list cudatoolkit` 指令显式的版本是否匹配。
2. 通过运行 `python mmdet/utils/collect_env.py` 来检查是否为当前的GPU架构编译了正确的 PyTorch、torchvision 和 MMCV,你可能需要设置 `TORCH_CUDA_ARCH_LIST` 来重新安装 MMCV。可以参考 [GPU 架构表](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list),即通过运行 `TORCH_CUDA_ARCH_LIST=7.0 pip install mmcv-full` 为 Volta GPU 编译 MMCV。这种架构不匹配的问题一般会出现在使用一些旧型号的 GPU 时候,例如,Tesla K80。
3. 检查运行环境是否与 mmcv/mmdet 编译时相同,例如,您可能使用 CUDA 10.0 编译 MMCV,但在 CUDA 9.0 环境中运行它。

- "undefined symbol" or "cannot open xxx.so".

1. If those symbols are CUDA/C++ symbols (e.g., libcudart.so or GLIBCXX), check whether the CUDA/GCC runtimes are the same as those used for compiling mmcv,
i.e. run `python mmdet/utils/collect_env.py` to see if `"MMCV Compiler"`/`"MMCV CUDA Compiler"` is the same as `"GCC"`/`"CUDA_HOME"`.
2. If those symbols are PyTorch symbols (e.g., symbols containing caffe, aten, and TH), check whether the PyTorch version is the same as that used for compiling mmcv.
3. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV are built by and running on the same environment.
1. 如果这些 symbols 属于 CUDA/C++ (例如,libcudart.so 或者 GLIBCXX),检查 CUDA/GCC 运行时环境是否与编译 MMCV 的一致。例如使用 `python mmdet/utils/collect_env.py` 检查 `"MMCV Compiler"`/`"MMCV CUDA Compiler"` 是否和 `"GCC"`/`"CUDA_HOME"` 一致。
2. 如果这些 symbols 属于 PyTorch,(例如,symbols containing caffe、aten 和 TH), 检查当前 PyTorch 版本是否与编译 MMCV 的版本一致。
3. 运行 `python mmdet/utils/collect_env.py` 检查 PyTorch、torchvision、MMCV 等的编译环境与运行环境一致。

- setuptools.sandbox.UnpickleableException: DistutilsSetupError("each element of 'ext_modules' option must be an Extension instance or 2-tuple")
- "setuptools.sandbox.UnpickleableException: DistutilsSetupError("each element of 'ext_modules' option must be an Extension instance or 2-tuple")"

1. If you are using miniconda rather than anaconda, check whether Cython is installed as indicated in [#3379](https://github.com/open-mmlab/mmdetection/issues/3379).
You need to manually install Cython first and then run command `pip install -r requirements.txt`.
2. You may also need to check the compatibility between the `setuptools`, `Cython`, and `PyTorch` in your environment.
1. 如果你在使用 miniconda 而不是 anaconda,检查是否正确的安装了 Cython 如 [#3379](https://github.com/open-mmlab/mmdetection/issues/3379)。您需要先手动安装 Cpython 然后运命令 `pip install -r requirements.txt`
2. 检查环境中的 `setuptools``Cython``PyTorch` 相互之间版本是否匹配。

- "Segmentation fault".
1. Check you GCC version and use GCC 5.4. This usually caused by the incompatibility between PyTorch and the environment (e.g., GCC < 4.9 for PyTorch). We also recommend the users to avoid using GCC 5.5 because many feedbacks report that GCC 5.5 will cause "segmentation fault" and simply changing it to GCC 5.4 could solve the problem.

2. Check whether PyTorch is correctly installed and could use CUDA op, e.g. type the following command in your terminal.
1. 检查 GCC 的版本并使用 GCC 5.4,通常是因为 PyTorch 版本与 GCC 版本不匹配 (例如,对于 Pytorch GCC < 4.9),我们推荐用户使用 GCC 5.4,我们也不推荐使用 GCC 5.5, 因为有反馈 GCC 5.5 会导致 "segmentation fault" 并且切换到 GCC 5.4 就可以解决问题。
2. 检查是是否 PyTorch 被正确的安装并可以使用 CUDA 算子,例如在终端中键入如下的指令。

```shell
python -c 'import torch; print(torch.cuda.is_available())'
```
```shell
python -c 'import torch; print(torch.cuda.is_available())'
```

And see whether they could correctly output results.
并判断是否是否返回 True。

3. If Pytorch is correctly installed, check whether MMCV is correctly installed.
3. 如果 `torch` 的安装是正确的,检查是否正确编译了 MMCV

```shell
python -c 'import mmcv; import mmcv.ops'
```
```shell
python -c 'import mmcv; import mmcv.ops'
```

If MMCV is correctly installed, then there will be no issue of the above two commands.
如果 MMCV 被正确的安装了,那么上面的两条指令不会有问题。

4. If MMCV and Pytorch is correctly installed, you man use `ipdb`, `pdb` to set breakpoints or directly add 'print' in mmdetection code and see which part leads the segmentation fault.
4. 如果 MMCV 与 PyTorch 都被正确安装了,则使用 `ipdb``pdb` 设置断点,直接查找哪一部分的代码导致了 `segmentation fault`

## Training
## Training 相关

- "Loss goes Nan"
1. Check if the dataset annotations are valid: zero-size bounding boxes will cause the regression loss to be Nan due to the commonly used transformation for box regression. Some small size (width or height are smaller than 1) boxes will also cause this problem after data augmentation (e.g., instaboost). So check the data and try to filter out those zero-size boxes and skip some risky augmentations on the small-size boxes when you face the problem.
2. Reduce the learning rate: the learning rate might be too large due to some reasons, e.g., change of batch size. You can rescale them to the value that could stably train the model.
3. Extend the warmup iterations: some models are sensitive to the learning rate at the start of the training. You can extend the warmup iterations, e.g., change the `warmup_iters` from 500 to 1000 or 2000.
4. Add gradient clipping: some models requires gradient clipping to stabilize the training process. The default of `grad_clip` is `None`, you can add gradient clippint to avoid gradients that are too large, i.e., set `optimizer_config=dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))` in your config file. If your config does not inherits from any basic config that contains `optimizer_config=dict(grad_clip=None)`, you can simply add `optimizer_config=dict(grad_clip=dict(max_norm=35, norm_type=2))`.
- ’GPU out of memory"
1. There are some scenarios when there are large amounts of ground truth boxes, which may cause OOM during target assignment. You can set `gpu_assign_thr=N` in the config of assigner thus the assigner will calculate box overlaps through CPU when there are more than N GT boxes.
2. Set `with_cp=True` in the backbone. This uses the sublinear strategy in PyTorch to reduce GPU memory cost in the backbone.
3. Try mixed precision training using following the examples in `config/fp16`. The `loss_scale` might need further tuning for different models.
1. 检查数据的标注是否正常,长或宽为 0 的框可能会导致回归 loss 变为 nan,一些小尺寸(宽度或高度小于 1)的框在数据增强(例如,instaboost)后也会导致此问题。因此,可以检查标注并过滤掉那些特别小甚至面积为 0 的框,并关闭一些可能会导致 0 面积框出现数据增强。
2. 降低学习率:由于某些原因,例如 batch size 大小的变化,导致当前学习率可能太大。您可以降低为可以稳定训练模型的值。
3. 延长 warm up 的时间:一些模型在训练初始时对学习率很敏感,您可以把 `warmup_iters` 从 500 更改为 1000 或 2000。
4. 添加 gradient clipping: 一些模型需要梯度裁剪来稳定训练过程。默认的 `grad_clip``None`,你可以在 config 设置 `optimizer_config=dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))`。 如果你的 config 没有继承任何包含 `optimizer_config=dict(grad_clip=None)`,你可以直接设置 `optimizer_config=dict(grad_clip=dict(max_norm=35, norm_type=2))`

- "GPU out of memory"
1. 存在大量 ground truth boxes 或者大量 anchor 的场景,可能在 assigner 会 OOM。您可以在 assigner 的配置中设置 `gpu_assign_thr=N`,这样当超过 N 个 GT boxes 时,assigner 会通过 CPU 计算 IoU。
2. 在 backbone 中设置 `with_cp=True`。这使用 PyTorch 中的 `sublinear strategy` 来降低 backbone 占用的 GPU 显存。
3. 使用 `config/fp16` 中的示例尝试混合精度训练。`loss_scale` 可能需要针对不同模型进行调整。

- "RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one"
1. This error indicates that your module has parameters that were not used in producing loss. This phenomenon may be caused by running different branches in your code in DDP mode.
2. You can set ` find_unused_parameters = True` in the config to solve the above problems or find those unused parameters manually.
1. 错误表明,您的模块有没用于产生损失的参数,这种现象可能是由于在 DDP 模式下运行代码中的不同分支造成的。
2. 您可以在配置中设置 `find_unused_parameters = True` 来解决上述问题,或者手动查找那些未使用的参数。

## Evaluation
## Evaluation 相关

- COCO Dataset, AP or AR = -1
1. According to the definition of COCO dataset, the small and medium areas in an image are less than 1024 (32\*32), 9216 (96\*96), respectively.
2. If the corresponding area has no object, the result of AP and AR will set to -1.
- 使用 COCO Dataset 的测评接口时,测评结果中 AP 或者 AR = -1
1. 根据 COCO 数据集的定义,一张图像中的中等物体与小物体面积的阈值分别为 921696\*96)与 1024(32\*32)。
2. 如果在某个区间没有物体即 GT,AP AR 将被设置为 -1。

0 comments on commit 7c9a592

Please sign in to comment.