Commit: Fix more linting errors
dgaliffiAMD committed May 16, 2024
1 parent 4cef352 commit 0f49636
Showing 20 changed files with 318 additions and 142 deletions.
10 changes: 5 additions & 5 deletions .gitlab/issue_templates/example.md
# Example checklist

- Elaboration
  - [ ] Example concept is described and agreed upon
- Implementation
  - [ ] Example is implemented
- Internal review
  - [ ] Internal code review is done
- External review
  - [ ] Upstreaming PR is opened, external review is done
- Done
  - [ ] Example merged to upstream
12 changes: 7 additions & 5 deletions .gitlab/merge_request_templates/example.md
## Notes for the reviewer

_The reviewer should acknowledge all these topics._
<insert notes>

## Checklist before merge

- [ ] CMake support is added
  - [ ] Dependencies are copied via `IMPORTED_RUNTIME_ARTIFACTS` if applicable
- [ ] GNU Make support is added (Linux)
- [ ] Visual Studio project is added for VS2017, 2019, 2022 (Windows) (use [the script](https://projects.streamhpc.com/departments/knowledge/employee-handbook/-/wikis/Projects/AMD/Libraries/examples/Adding-Visual-Studio-Projects-to-new-examples#scripts))
  - [ ] DLL dependencies are copied via `<Content Include`
  - [ ] Visual Studio project is added to `ROCm-Examples-vs*.sln` (ROCm)
  - [ ] Visual Studio project is added to `ROCm-Examples-Portable-vs*.sln` (ROCm/CUDA) if applicable
- [ ] Inline code documentation is added
- [ ] README is added according to template
  - [ ] Related READMEs, ToC are updated
- [ ] The CI passes for Linux/ROCm, Linux/CUDA, Windows/ROCm, Windows/CUDA.
157 changes: 82 additions & 75 deletions AI/MIGraphX/Quantization/Running-Quantized-ResNet50-via-MIGraphX.md

# Running quantized ResNet50 via MIGraphX

## Summary

This example walks through the dynamo Post Training Quantization (PTQ) workflow for running a quantized model using torch_migraphx.

## Prerequisites

- You must follow the installation instructions for the torch_migraphx library in [README.md](README.md) before using this example.

## Steps for running a quantized model using torch_migraphx

1. Use torch.export and quantize_pt2e APIs to perform quantization.

   **Note**: The export API call is considered a prototype feature at the time this tutorial was written. Some call signatures may be modified in the future.

   ```python
   import torch
   from torchvision import models
   from torch._export import capture_pre_autograd_graph
   from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
   ```

   ```python
   import torch_migraphx
   from torch_migraphx.dynamo.quantization import MGXQuantizer

   model_fp32 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
   input_fp32 = torch.randn(2, 3, 28, 28)

   torch_fp32_out = model_fp32(input_fp32)
   ```

   ```python
   model_export = capture_pre_autograd_graph(model_fp32, (input_fp32, ))
   ```

   Use the pt2e API to prepare, calibrate, and convert the model. Torch-MIGraphX provides a custom Quantizer for performing quantization that is compatible with MIGraphX.

   ```python
   quantizer = MGXQuantizer()
   m = prepare_pt2e(model_export, quantizer)

   # pseudo-calibrate
   with torch.no_grad():
       for _ in range(10):
           m(torch.randn(2, 3, 28, 28))

   q_m = convert_pt2e(m)
   torch_qout = q_m(input_fp32)
   ```

2. Lower the quantized model to MIGraphX. This step is the same as lowering any other model using torch.compile!

   ```python
   mgx_mod = torch.compile(q_m, backend='migraphx').cuda()
   mgx_out = mgx_mod(input_fp32.cuda())

   print(f"PyTorch FP32 (Gold Value):\n{torch_fp32_out}")
   print(f"PyTorch INT8 (Fake Quantized):\n{torch_qout}")
   print(f"MIGraphX INT8:\n{mgx_out}")
   ```

3. Performance

   Do a quick test to measure the performance gain from using quantization.

   ```python
   import copy
   import torch._dynamo

   # We will use this function to benchmark all modules:
   def benchmark_module(model, inputs, iterations=100):
       model(*inputs)
       torch.cuda.synchronize()

       start_event = torch.cuda.Event(enable_timing=True)
       end_event = torch.cuda.Event(enable_timing=True)

       start_event.record()

       for _ in range(iterations):
           model(*inputs)
       end_event.record()
       torch.cuda.synchronize()

       return start_event.elapsed_time(end_event) / iterations

   # Benchmark MIGraphX INT8
   mgx_int8_time = benchmark_module(mgx_mod, [input_fp32.cuda()])
   torch._dynamo.reset()

   # Benchmark MIGraphX FP32
   mgx_module_fp32 = torch.compile(copy.deepcopy(model_fp32), backend='migraphx').cuda()
   mgx_module_fp32(input_fp32.cuda())
   mgx_fp32_time = benchmark_module(mgx_module_fp32, [input_fp32.cuda()])
   torch._dynamo.reset()

   # Benchmark MIGraphX FP16
   mgx_module_fp16 = torch.compile(copy.deepcopy(model_fp32).half(), backend='migraphx').cuda()
   input_fp16 = input_fp32.cuda().half()
   mgx_module_fp16(input_fp16)
   mgx_fp16_time = benchmark_module(mgx_module_fp16, [input_fp16])

   print(f"{mgx_fp32_time=:0.4f}ms")
   print(f"{mgx_fp16_time=:0.4f}ms")
   print(f"{mgx_int8_time=:0.4f}ms")
   ```

Note that these performance gains (or lack of gains) will vary depending on the specific hardware in use.
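
As a quick accuracy check to accompany the timing numbers, the outputs printed in step 2 can also be compared numerically. This is a minimal illustrative sketch, assuming `torch_fp32_out`, `torch_qout`, and `mgx_out` from the steps above are still in scope:

```python
import torch

# Quantization introduces rounding error, so compare with a loose tolerance.
max_err = (mgx_out.cpu() - torch_qout).abs().max().item()
print(f"Max abs difference, MIGraphX INT8 vs fake-quantized: {max_err:.4f}")

# Top-1 predictions usually agree even when the raw logits differ slightly.
top1_match = (mgx_out.cpu().argmax(dim=1) == torch_fp32_out.argmax(dim=1)).float().mean().item()
print(f"Top-1 agreement with FP32: {top1_match:.2%}")
```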
36 changes: 30 additions & 6 deletions HIP-Basic/assembly_to_executable/README.md
# HIP-Basic Assembly to Executable Example

## Description

This example shows how to manually compile and link a HIP application from device assembly. Pre-generated assembly files are compiled into an _offload bundle_, a bundle of device object files, and then linked with the host object file to produce the final executable.

Building HIP executables from device assembly can be useful, for example, to experiment with specific instructions or to perform specific optimizations, and it can help with debugging.

### Building

- Build with Makefile: to compile for specific GPU architectures, optionally provide the `HIP_ARCHITECTURES` variable. Provide the architectures separated by semicolons, as in the following example.

```shell
make HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102"
```

- Build with CMake:

```shell
cmake -S . -B build -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102"
cmake --build build
```

On Windows the path to the RC compiler may be needed: `-DCMAKE_RC_COMPILER="C:/Program Files (x86)/Windows Kits/path/to/x64/rc.exe"`.
The HIP SDK for Windows does not support the gfx942 device architecture.

## Generating device assembly

This example ships with pre-generated device assembly; however, such assembly files can also be created from HIP source code using `hipcc` by passing the `-S` and `--cuda-device-only` flags. The former instructs the compiler to generate human-readable assembly instead of machine code, and the latter instructs it to compile only the device part of the program. The assembly files for this example were generated as follows:

```shell
$ROCM_INSTALL_DIR/bin/hipcc -S --cuda-device-only --offload-arch=gfx803 --offload-arch=gfx900 --offload-arch=gfx906 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx942 --offload-arch=gfx1030 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 main.hip
```

The user may modify the `--offload-arch` flag to build for other architectures and choose to either enable or disable extra device code-generation features such as `xnack` or `sram-ecc`, which can be specified as `--offload-arch=<arch>:<feature>+` to enable it or `--offload-arch=<arch>:<feature>-` to disable it. Multiple features may be present, separated by colons.
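
For example, enabling `xnack` for gfx90a follows the `<arch>:<feature>+` pattern described above. This is an illustrative invocation only; pick the architecture and features appropriate for your target:

```shell
$ROCM_INSTALL_DIR/bin/hipcc -S --cuda-device-only --offload-arch=gfx90a:xnack+ main.hip
```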

## Build Process

A HIP binary consists of a regular host executable, which has an offload bundle containing device code embedded inside it. This offload bundle contains object files for each of the target devices that it is compiled for, and is loaded at runtime to provide the machine code for the current device. A HIP executable can be built from device assembly files and host HIP code according to the following process:

1. The `main.hip` file is compiled with `hipcc` to an object file containing only host code, using the `--cuda-host-only` option. `main.hip` is a program that launches a simple kernel to compute the square of each element of a vector. The `-c` option is required to prevent the compiler from creating an executable, making it emit an object file containing the compiled host code instead.

```shell
$ROCM_INSTALL_DIR/bin/hipcc -c --cuda-host-only main.hip
```
Note: using `-bundle-align=4096` only works with ROCm 4.0 and newer compilers. Also, the architectures must match the `--offload-arch` values used when compiling to assembly.

4. The offload bundle is embedded inside an object file that can be linked with the object file containing the host code. The offload bundle must be placed in the `.hip_fatbin` section, after the symbol `__hip_fatbin`. This can be done by creating an assembly file that places the offload bundle in the appropriate section using the `.incbin` directive:

```nasm
.type __hip_fatbin,@object
; Tell the assembler to place the offload bundle in the appropriate section.
; Include the binary
.incbin "offload_bundle.hipfb"
```

This file can then be assembled using `llvm-mc` as follows:

```shell
$ROCM_INSTALL_DIR/llvm/bin/llvm-mc -triple <host target> -o main_device.o hip_obj_gen.mcin --filetype=obj
```

5. Finally, using the system linker, hipcc, or clang, the host object and device objects are linked into an executable:

```shell
<ROCM_PATH>/hip/bin/hipcc -o hip_assembly_to_executable main.o main_device.o
```

### Visual Studio 2019

The above compilation steps are implemented in Visual Studio through Custom Build Steps and Custom Build Tools:

- The host compilation from step 1 is performed by adding extra options to the source file, under `main.hip -> properties -> C/C++ -> Command Line`:

```shell
Additional Options: --cuda-host-only
```

- Each device assembly `.s` file has a custom build tool associated with it, which performs the operation corresponding to step 2 from the previous section:

```shell
Command Line: "$(ClangToolPath)clang++" -o "$(IntDir)%(FileName).o" "%(Identity)" -target amdgcn-amd-amdhsa -mcpu=gfx90a
Description: Compiling Device Assembly %(Identity)
Output: $(IntDir)%(FileName).o
Execute Before: ClCompile
```

- Steps 3 and 4 are implemented using a custom build step:

```shell
Command Line:
"$(ClangToolPath)clang-offload-bundler" -type=o -bundle-align=4096 -targets=host-x86_64-pc-windows-msvc,hipv4-amdgcn-amd-amdhsa--gfx803,hipv4-amdgcn-amd-amdhsa--gfx900,hipv4-amdgcn-amd-amdhsa--gfx906,hipv4-amdgcn-amd-amdhsa--gfx908,hipv4-amdgcn-amd-amdhsa--gfx90a,hipv4-amdgcn-amd-amdhsa--gfx1030,hipv4-amdgcn-amd-amdhsa--gfx1100,hipv4-amdgcn-amd-amdhsa--gfx1101,hipv4-amdgcn-amd-amdhsa--gfx1102 -input=nul "-input=$(IntDir)main_gfx803.o" "-input=$(IntDir)main_gfx900.o" "-input=$(IntDir)main_gfx906.o" "-input=$(IntDir)main_gfx908.o" "-input=$(IntDir)main_gfx90a.o" "-input=$(IntDir)main_gfx1030.o" "-input=$(IntDir)main_gfx1100.o" "-input=$(IntDir)main_gfx1101.o" "-input=$(IntDir)main_gfx1102.o" "-output=$(IntDir)offload_bundle.hipfb"
cd $(IntDir) && "$(ClangToolPath)llvm-mc" -triple host-x86_64-pc-windows-msvc "hip_obj_gen_win.mcin" -o "main_device.obj" --filetype=obj
Additional Dependencies: $(IntDir)main_gfx803.o;$(IntDir)main_gfx900.o;$(IntDir)main_gfx906.o;$(IntDir)main_gfx908.o;$(IntDir)main_gfx90a.o;$(IntDir)main_gfx1030.o;$(IntDir)main_gfx1100.o;$(IntDir)main_gfx1101.o;$(IntDir)main_gfx1102.o;$(IntDir)hip_objgen_win.mcin;%(Inputs)
Execute Before: ClCompile
```

- Finally, step 5 is implemented by passing additional inputs to the linker in `project -> properties -> Linker -> Input`:

```shell
Additional Dependencies: $(IntDir)main_device.obj;%(AdditionalDependencies)
```

This example depends on the following tools:
`rocm-llvm` is installed with most ROCm installations.

## Used API surface

### HIP runtime

- `hipFree`
- `hipGetDeviceProperties`
- `hipGetLastError`
9 changes: 7 additions & 2 deletions HIP-Basic/bandwidth/README.md
# Cookbook Bandwidth Example

## Description

This example measures the memory bandwidth capacity of GPU devices. It performs memcpy from host to GPU device, GPU device to host, and within a single GPU.

### Application flow

1. User command-line arguments are parsed and test parameters initialized. If there are no command-line arguments then the test parameters are initialized with default values.
2. Bandwidth tests are launched.
3. If the memory type for the test is set to `-memory pageable` then the host-side data is instantiated in `std::vector<unsigned char>`. If the memory type for the test is set to `-memory pinned` then the host-side data is instantiated in `unsigned char*` and allocated using `hipHostMalloc`.
4. Device-side storage is allocated using `hipMalloc` in `unsigned char*`.
5. Memory transfer is performed `trail` times using `hipMemcpy` for pageable memory or using `hipMemcpyAsync` for host-allocated pinned memory.
6. The time of the memory transfer operations is measured and then used to calculate the bandwidth.
7. All device memory is freed using `hipFree` and all host-allocated pinned memory is freed using `hipHostFree`.

## Key APIs and Concepts

The program uses HIP pageable and pinned memory. It is important to note that the pinned memory is allocated using `hipHostMalloc` and is destroyed using `hipHostFree`. The HIP memory transfer routine `hipMemcpyAsync` will behave synchronously if the host memory is not pinned. Therefore, it is important to allocate pinned host memory using `hipHostMalloc` for `hipMemcpyAsync` to behave asynchronously.
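
To make the distinction concrete, here is a minimal illustrative sketch of a pinned-memory host-to-device bandwidth measurement. It is not the example's actual source; the buffer size and trial count are arbitrary, and error checking is omitted for brevity:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    constexpr size_t size = 64 * 1024 * 1024; // 64 MiB test buffer (arbitrary)
    constexpr int trials = 20;                // arbitrary trial count

    unsigned char* d_buf = nullptr;
    unsigned char* h_pinned = nullptr;
    hipMalloc(reinterpret_cast<void**>(&d_buf), size);
    // Pinned host memory: required for hipMemcpyAsync to be truly asynchronous.
    hipHostMalloc(reinterpret_cast<void**>(&h_pinned), size, hipHostMallocDefault);

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, nullptr);
    for (int i = 0; i < trials; ++i) {
        hipMemcpyAsync(d_buf, h_pinned, size, hipMemcpyHostToDevice, nullptr);
    }
    hipEventRecord(stop, nullptr);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    const double gib_per_s =
        static_cast<double>(size) * trials / (ms / 1000.0) / (1024.0 * 1024.0 * 1024.0);
    std::printf("Host-to-device bandwidth (pinned): %.2f GiB/s\n", gib_per_s);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipHostFree(h_pinned); // pinned memory must be freed with hipHostFree
    hipFree(d_buf);
    return 0;
}
```

Replacing `h_pinned` with a pageable buffer (for example, `std::vector` storage) would make each `hipMemcpyAsync` call behave synchronously, which is exactly the behavior described above.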

## Demonstrated API Calls

### HIP runtime

- `hipMalloc`
- `hipMemcpy`
- `hipMemcpyAsync`