Commit: Fix more linting errors
dgaliffiAMD committed May 16, 2024
1 parent 4cef352 commit 0f49636
Showing 20 changed files with 318 additions and 142 deletions.
10 changes: 5 additions & 5 deletions .gitlab/issue_templates/example.md
# Example checklist

- Elaboration
  - [ ] Example concept is described and agreed upon
- Implementation
  - [ ] Example is implemented
- Internal review
  - [ ] Internal code review is done
- External review
  - [ ] Upstreaming PR is opened, external review is done
- Done
  - [ ] Example merged to upstream
12 changes: 7 additions & 5 deletions .gitlab/merge_request_templates/example.md
## Notes for the reviewer

_The reviewer should acknowledge all these topics._
<insert notes>

## Checklist before merge

- [ ] CMake support is added
  - [ ] Dependencies are copied via `IMPORTED_RUNTIME_ARTIFACTS` if applicable
- [ ] GNU Make support is added (Linux)
- [ ] Visual Studio project is added for VS2017, 2019, 2022 (Windows) (use [the script](https://projects.streamhpc.com/departments/knowledge/employee-handbook/-/wikis/Projects/AMD/Libraries/examples/Adding-Visual-Studio-Projects-to-new-examples#scripts))
  - [ ] DLL dependencies are copied via `<Content Include`
  - [ ] Visual Studio project is added to `ROCm-Examples-vs*.sln` (ROCm)
  - [ ] Visual Studio project is added to `ROCm-Examples-Portable-vs*.sln` (ROCm/CUDA) if applicable
- [ ] Inline code documentation is added
- [ ] README is added according to template
  - [ ] Related READMEs, ToC are updated
- [ ] The CI passes for Linux/ROCm, Linux/CUDA, Windows/ROCm, Windows/CUDA.
157 changes: 82 additions & 75 deletions AI/MIGraphX/Quantization/Running-Quantized-ResNet50-via-MIGraphX.md

# Running quantized ResNet50 via MIGraphX

## Summary

This example walks through the dynamo Post Training Quantization (PTQ) workflow for running a quantized model using torch_migraphx.

## Prerequisites

- You must follow the installation instructions for the torch_migraphx library in [README.md](README.md) before using this example.

## Steps for running a quantized model using torch_migraphx

1. Use torch.export and quantize_pt2e APIs to perform quantization.

   **Note**: The export API call is considered a prototype feature at the time this tutorial was written. Some call signatures may be modified in the future.

   ```python
   import torch
   from torchvision import models
   from torch._export import capture_pre_autograd_graph
   from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
   ```

   ```python
   import torch_migraphx
   from torch_migraphx.dynamo.quantization import MGXQuantizer

   model_fp32 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
   input_fp32 = torch.randn(2, 3, 28, 28)

   torch_fp32_out = model_fp32(input_fp32)
   ```

   ```python
   model_export = capture_pre_autograd_graph(model_fp32, (input_fp32, ))
   ```

   Use the pt2e API to prepare, calibrate, and convert the model. Torch-MIGraphX provides a custom Quantizer for performing quantization that is compatible with MIGraphX.

   ```python
   quantizer = MGXQuantizer()
   m = prepare_pt2e(model_export, quantizer)

   # pseudo-calibrate
   with torch.no_grad():
       for _ in range(10):
           m(torch.randn(2, 3, 28, 28))

   q_m = convert_pt2e(m)
   torch_qout = q_m(input_fp32)
   ```

2. Lower the quantized model to MIGraphX. This step is the same as lowering any other model using torch.compile!

   ```python
   mgx_mod = torch.compile(q_m, backend='migraphx').cuda()
   mgx_out = mgx_mod(input_fp32.cuda())

   print(f"PyTorch FP32 (Gold Value):\n{torch_fp32_out}")
   print(f"PyTorch INT8 (Fake Quantized):\n{torch_qout}")
   print(f"MIGraphX INT8:\n{mgx_out}")
   ```

3. Performance

   Do a quick test to measure the performance gain from using quantization.

   ```python
   import copy
   import torch._dynamo

   # We will use this function to benchmark all modules:
   def benchmark_module(model, inputs, iterations=100):
       model(*inputs)
       torch.cuda.synchronize()

       start_event = torch.cuda.Event(enable_timing=True)
       end_event = torch.cuda.Event(enable_timing=True)

       start_event.record()

       for _ in range(iterations):
           model(*inputs)
       end_event.record()
       torch.cuda.synchronize()

       return start_event.elapsed_time(end_event) / iterations

   # Benchmark MIGraphX INT8
   mgx_int8_time = benchmark_module(mgx_mod, [input_fp32.cuda()])
   torch._dynamo.reset()

   # Benchmark MIGraphX FP32
   mgx_module_fp32 = torch.compile(copy.deepcopy(model_fp32), backend='migraphx').cuda()
   mgx_module_fp32(input_fp32.cuda())
   mgx_fp32_time = benchmark_module(mgx_module_fp32, [input_fp32.cuda()])
   torch._dynamo.reset()

   # Benchmark MIGraphX FP16
   mgx_module_fp16 = torch.compile(copy.deepcopy(model_fp32).half(), backend='migraphx').cuda()
   input_fp16 = input_fp32.cuda().half()
   mgx_module_fp16(input_fp16)
   mgx_fp16_time = benchmark_module(mgx_module_fp16, [input_fp16])

   print(f"{mgx_fp32_time=:0.4f}ms")
   print(f"{mgx_fp16_time=:0.4f}ms")
   print(f"{mgx_int8_time=:0.4f}ms")
   ```

Note that these performance gains (or lack of gains) will vary depending on the specific hardware in use.
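
As a quick accuracy check to accompany the timing numbers, the outputs printed in step 2 can also be compared numerically. This is a minimal illustrative sketch, assuming `torch_fp32_out`, `torch_qout`, and `mgx_out` from the steps above are still in scope:

```python
import torch

# Quantization introduces rounding error, so compare with a loose tolerance.
max_err = (mgx_out.cpu() - torch_qout).abs().max().item()
print(f"Max abs difference, MIGraphX INT8 vs fake-quantized: {max_err:.4f}")

# Top-1 predictions usually agree even when the raw logits differ slightly.
top1_match = (mgx_out.cpu().argmax(dim=1) == torch_fp32_out.argmax(dim=1)).float().mean().item()
print(f"Top-1 agreement with FP32: {top1_match:.2%}")
```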
36 changes: 30 additions & 6 deletions HIP-Basic/assembly_to_executable/README.md
# HIP-Basic Assembly to Executable Example

## Description

This example shows how to manually compile and link a HIP application from device assembly. Pre-generated assembly files are compiled into an _offload bundle_, a bundle of device object files, and then linked with the host object file to produce the final executable.

Building HIP executables from device assembly can be useful, for example, to experiment with specific instructions or to perform specific optimizations, and it can help with debugging.

### Building

- Build with Makefile: to compile for specific GPU architectures, optionally provide the `HIP_ARCHITECTURES` variable. Provide the architectures separated by semicolons, as in the following example.

```shell
make HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102"
```

- Build with CMake:

```shell
cmake -S . -B build -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102"
cmake --build build
```

On Windows the path to the RC compiler may be needed: `-DCMAKE_RC_COMPILER="C:/Program Files (x86)/Windows Kits/path/to/x64/rc.exe"`.
The HIP SDK for Windows does not support the gfx942 device architecture.

## Generating device assembly

This example ships with pre-generated device assembly; however, such assembly files can also be created from HIP source code using `hipcc` by passing the `-S` and `--cuda-device-only` flags. The former instructs the compiler to generate human-readable assembly instead of machine code, and the latter instructs it to compile only the device part of the program. The assembly files for this example were generated as follows:

```shell
$ROCM_INSTALL_DIR/bin/hipcc -S --cuda-device-only --offload-arch=gfx803 --offload-arch=gfx900 --offload-arch=gfx906 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx942 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 main.hip
```

The user may modify the `--offload-arch` flag to build for other architectures and choose to either enable or disable extra device code-generation features such as `xnack` or `sram-ecc`, which can be specified as `--offload-arch=<arch>:<feature>+` to enable it or `--offload-arch=<arch>:<feature>-` to disable it. Multiple features may be present, separated by colons.
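
For example, enabling `xnack` for gfx90a follows the `<arch>:<feature>+` pattern described above. This is an illustrative invocation only; pick the architecture and features appropriate for your target:

```shell
$ROCM_INSTALL_DIR/bin/hipcc -S --cuda-device-only --offload-arch=gfx90a:xnack+ main.hip
```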

## Build Process

A HIP binary consists of a regular host executable, which has an offload bundle containing device code embedded inside it. This offload bundle contains object files for each of the target devices that it is compiled for, and is loaded at runtime to provide the machine code for the current device. A HIP executable can be built from device assembly files and host HIP code according to the following process:

1. The `main.hip` file is compiled with `hipcc` to an object file containing only host code, using the `--cuda-host-only` option. `main.hip` is a program that launches a simple kernel to compute the square of each element of a vector. The `-c` option is required to prevent the compiler from creating an executable, making it emit an object file containing the compiled host code instead.

```shell
$ROCM_INSTALL_DIR/bin/hipcc -c --cuda-host-only main.hip
```
Note: using `-bundle-align=4096` only works with ROCm 4.0 and newer compilers. Also, the architectures must match the `--offload-arch` values used when compiling to assembly.

4. The offload bundle is embedded inside an object file that can be linked with the object file containing the host code. The offload bundle must be placed in the `.hip_fatbin` section, after the symbol `__hip_fatbin`. This can be done by creating an assembly file that places the offload bundle in the appropriate section using the `.incbin` directive:

```nasm
.type __hip_fatbin,@object
; Tell the assembler to place the offload bundle in the appropriate section.
; Include the binary
.incbin "offload_bundle.hipfb"
```

This file can then be assembled using `llvm-mc` as follows:

```shell
$ROCM_INSTALL_DIR/llvm/bin/llvm-mc -triple <host target> -o main_device.o hip_obj_gen.mcin --filetype=obj
```

5. Finally, using the system linker, hipcc, or clang, the host object and device objects are linked into an executable:

```shell
<ROCM_PATH>/hip/bin/hipcc -o hip_assembly_to_executable main.o main_device.o
```

### Visual Studio 2019

The above compilation steps are implemented in Visual Studio through Custom Build Steps and Custom Build Tools:

- The host compilation from step 1 is performed by adding extra options to the source file, under `main.hip -> properties -> C/C++ -> Command Line`:

```shell
Additional Options: --cuda-host-only
```

- Each device assembly `.s` file has a custom build tool associated with it, which performs the operation corresponding to step 2 from the previous section:

```shell
Command Line: "$(ClangToolPath)clang++" -o "$(IntDir)%(FileName).o" "%(Identity)" -target amdgcn-amd-amdhsa -mcpu=gfx90a
Description: Compiling Device Assembly %(Identity)
Output: $(IntDir)%(FileName).o
Execute Before: ClCompile
```

- Steps 3 and 4 are implemented using a custom build step:

```shell
Command Line:
"$(ClangToolPath)clang-offload-bundler" -type=o -bundle-align=4096 -targets=host-x86_64-pc-windows-msvc,hipv4-amdgcn-amd-amdhsa--gfx803,hipv4-amdgcn-amd-amdhsa--gfx900,hipv4-amdgcn-amd-amdhsa--gfx906,hipv4-amdgcn-amd-amdhsa--gfx908,hipv4-amdgcn-amd-amdhsa--gfx90a,hipv4-amdgcn-amd-amdhsa--gfx1030,hipv4-amdgcn-amd-amdhsa--gfx1100,hipv4-amdgcn-amd-amdhsa--gfx1101,hipv4-amdgcn-amd-amdhsa--gfx1102 -input=nul "-input=$(IntDir)main_gfx803.o" "-input=$(IntDir)main_gfx900.o" "-input=$(IntDir)main_gfx906.o" "-input=$(IntDir)main_gfx908.o" "-input=$(IntDir)main_gfx90a.o" "-input=$(IntDir)main_gfx1030.o" "-input=$(IntDir)main_gfx1100.o" "-input=$(IntDir)main_gfx1101.o" "-input=$(IntDir)main_gfx1102.o" "-output=$(IntDir)offload_bundle.hipfb"
cd $(IntDir) && "$(ClangToolPath)llvm-mc" -triple host-x86_64-pc-windows-msvc "hip_obj_gen_win.mcin" -o "main_device.obj" --filetype=obj
Additional Dependencies: $(IntDir)main_gfx803.o;$(IntDir)main_gfx900.o;$(IntDir)main_gfx906.o;$(IntDir)main_gfx908.o;$(IntDir)main_gfx90a.o;$(IntDir)main_gfx1030.o;$(IntDir)main_gfx1100.o;$(IntDir)main_gfx1101.o;$(IntDir)main_gfx1102.o;$(IntDir)hip_objgen_win.mcin;%(Inputs)
Execute Before: ClCompile
```

- Finally, step 5 is implemented by passing additional inputs to the linker in `project -> properties -> Linker -> Input`:

```shell
Additional Dependencies: $(IntDir)main_device.obj;%(AdditionalDependencies)
```

This example depends on the following tools:
`rocm-llvm` is installed with most ROCm installations.

## Used API surface

### HIP runtime

- `hipFree`
- `hipGetDeviceProperties`
- `hipGetLastError`
9 changes: 7 additions & 2 deletions HIP-Basic/bandwidth/README.md
# Cookbook Bandwidth Example

## Description

This example measures the memory bandwidth capacity of GPU devices. It performs memcpy from host to GPU device, GPU device to host, and within a single GPU.

### Application flow

1. User command-line arguments are parsed and test parameters initialized. If there are no command-line arguments then the test parameters are initialized with default values.
2. Bandwidth tests are launched.
3. If the memory type for the test is set to `-memory pageable` then the host-side data is instantiated in `std::vector<unsigned char>`. If the memory type for the test is set to `-memory pinned` then the host-side data is instantiated in `unsigned char*` and allocated using `hipHostMalloc`.
4. Device-side storage is allocated using `hipMalloc` in `unsigned char*`.
5. Memory transfer is performed `trail` times using `hipMemcpy` for pageable memory or using `hipMemcpyAsync` for host-allocated pinned memory.
6. The time of the memory transfer operations is measured and then used to calculate the bandwidth.
7. All device memory is freed using `hipFree` and all host-allocated pinned memory is freed using `hipHostFree`.

## Key APIs and Concepts

The program uses HIP pageable and pinned memory. It is important to note that the pinned memory is allocated using `hipHostMalloc` and is destroyed using `hipHostFree`. The HIP memory transfer routine `hipMemcpyAsync` will behave synchronously if the host memory is not pinned. Therefore, it is important to allocate pinned host memory using `hipHostMalloc` for `hipMemcpyAsync` to behave asynchronously.
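
To make the distinction concrete, here is a minimal illustrative sketch of a pinned-memory host-to-device bandwidth measurement. It is not the example's actual source; the buffer size and trial count are arbitrary, and error checking is omitted for brevity:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    constexpr size_t size = 64 * 1024 * 1024; // 64 MiB test buffer (arbitrary)
    constexpr int trials = 20;                // arbitrary trial count

    unsigned char* d_buf = nullptr;
    unsigned char* h_pinned = nullptr;
    hipMalloc(reinterpret_cast<void**>(&d_buf), size);
    // Pinned host memory: required for hipMemcpyAsync to be truly asynchronous.
    hipHostMalloc(reinterpret_cast<void**>(&h_pinned), size, hipHostMallocDefault);

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, nullptr);
    for (int i = 0; i < trials; ++i) {
        hipMemcpyAsync(d_buf, h_pinned, size, hipMemcpyHostToDevice, nullptr);
    }
    hipEventRecord(stop, nullptr);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    const double gib_per_s =
        static_cast<double>(size) * trials / (ms / 1000.0) / (1024.0 * 1024.0 * 1024.0);
    std::printf("Host-to-device bandwidth (pinned): %.2f GiB/s\n", gib_per_s);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipHostFree(h_pinned); // pinned memory must be freed with hipHostFree
    hipFree(d_buf);
    return 0;
}
```

Replacing `h_pinned` with a pageable buffer (for example, `std::vector` storage) would make each `hipMemcpyAsync` call behave synchronously, which is exactly the behavior described above.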

## Demonstrated API Calls

### HIP runtime

- `hipMalloc`
- `hipMemcpy`
- `hipMemcpyAsync`