Fixing more markdown linting errors.
dgaliffiAMD committed May 16, 2024
1 parent e8f5ce3 commit 4cef352
Showing 33 changed files with 881 additions and 446 deletions.
26 changes: 15 additions & 11 deletions AI/MIGraphX/Quantization/README.md
# MIGraphX - Torch Examples

# Summary

The examples in this subdirectory showcase the functionality for executing quantized models using MIGraphX. The Torch-MIGraphX integration library is used to achieve this, where PyTorch is used to quantize models, and MIGraphX is used to execute them on AMD GPUs.

For more information, refer to the [Torch-MIGraphX](https://github.com/ROCmSoftwarePlatform/torch_migraphx/tree/master) library.

The quantization workflow consists of two main steps:

- Generate quantization parameters

- Convert relevant operations in the model's computational graph to use the quantized datatype

### Generating quantization parameters

There are three main methods for computing quantization parameters:

- Dynamic Quantization:

  - Model weights are pre-quantized; input/activation quantization parameters are computed dynamically at runtime

- Static Post Training Quantization (PTQ):

  - Quantization parameters are computed via calibration. Calibration involves calculating statistical attributes for relevant model nodes using provided sample input data

- Static Quantization Aware Training (QAT):

- Quantization parameters are calibrated during the training process

**Note**: All three of these techniques are supported by PyTorch (at least in prototype form), and so the examples leverage PyTorch's quantization APIs to perform this step.
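
Independently of which method is used, the static schemes ultimately produce a scale and zero-point from the gathered statistics. As a framework-agnostic sketch (not PyTorch's actual API; all names here are hypothetical), asymmetric `uint8` parameters can be derived from calibration min/max values like this:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helper for illustration only; PyTorch's observers compute
// these parameters internally during calibration.
struct QParams
{
    float        scale;
    std::uint8_t zero_point;
};

// Derive asymmetric uint8 quantization parameters from the observed
// min/max of a tensor (the "statistical attributes" gathered in PTQ).
QParams compute_qparams(float min_val, float max_val)
{
    // Extend the range to contain 0 so that real 0.0f maps exactly to an
    // integer value, as most backends require.
    min_val = std::min(min_val, 0.0f);
    max_val = std::max(max_val, 0.0f);
    const float scale = (max_val - min_val) / 255.0f;
    return {scale, static_cast<std::uint8_t>(std::round(-min_val / scale))};
}

// Quantize a single float value with the computed parameters.
std::uint8_t quantize(float x, QParams q)
{
    const float v = std::round(x / q.scale) + static_cast<float>(q.zero_point);
    return static_cast<std::uint8_t>(std::clamp(v, 0.0f, 255.0f));
}
```

For an observed range of $[-1, 3]$ this yields a scale of $4/255$ and a zero-point of $64$, so $-1$ maps to $0$ and $3$ maps to $255$.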

### Converting and executing the quantized model

As of the latest PyTorch release, there is no support for executing quantized models on GPUs directly through the framework. To execute these quantized models, use AMD's graph optimizer, MIGraphX, which is built using the ROCm stack. The [torch_migraphx](https://github.com/ROCmSoftwarePlatform/torch_migraphx) library provides a friendly interface for optimizing PyTorch models using the MIGraphX graph optimizer.

The examples show how to use this library to convert and execute PyTorch quantized models on GPUs using MIGraphX.

## Torch-MIGraphX

Torch-MIGraphX integrates AMD's graph inference engine with the PyTorch ecosystem. It provides a `mgx_module` object that may be invoked in the same manner as any other torch module, but utilizes the MIGraphX inference engine internally.

This library currently supports two paths for lowering:

- FX Tracing: Uses the tracing API provided by the `torch.fx` library.

- Dynamo Backend: Importing torch_migraphx automatically registers the "migraphx" backend that can be used with the `torch.compile` API.

### Installation instructions

Refer to the [Torch_MIGraphX](https://github.com/ROCmSoftwarePlatform/torch_migraphx/blob/master/README.md) page for Docker and source installation instructions.

11 changes: 11 additions & 0 deletions Applications/bitonic_sort/README.md
# Applications Bitonic Sort Example

## Description

This example showcases a GPU implementation of the [bitonic sort](https://en.wikipedia.org/wiki/Bitonic_sorter) and uses it to sort an array of $n$ elements in increasing (or decreasing) order. Another implementation of this algorithm exists in rocPRIM and could be used instead; rocPRIM's algorithm would likely offer improved performance.

A sequence $\{x_n\}_{n=1}^m$ is called bitonic if it possesses one of the following two properties:
Below is presented an example of how an array of length 8 would be ordered increasingly:
![bitonic_sort.svg](bitonic_sort.svg)

### Application flow

1. Parse user input.
2. Allocate and initialize host input array and make a copy for the CPU comparison.
3. Define a number of constants for kernel execution.
8. Compare the array obtained with the CPU implementation of the bitonic sort and print to standard output the result.
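
The CPU implementation used for the comparison in the last step can be sketched as follows (an illustrative host-side version, not the example's actual code; the array length must be a power of 2):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Host-side bitonic sort. `ascending` selects increasing ("inc") or
// decreasing ("dec") order; the length of `a` must be a power of 2.
void bitonic_sort(std::vector<int>& a, bool ascending = true)
{
    const std::size_t n = a.size();
    // k is the size of the bitonic sequences merged at this stage; in a GPU
    // version, each (k, j) substage would typically be one kernel launch.
    for (std::size_t k = 2; k <= n; k *= 2)
    {
        // j is the distance between the elements compared in this substage.
        for (std::size_t j = k / 2; j > 0; j /= 2)
        {
            for (std::size_t i = 0; i < n; ++i)
            {
                const std::size_t l = i ^ j;  // partner index
                if (l <= i)
                    continue;  // each pair is handled once
                // The comparison direction alternates between blocks of size k.
                const bool up = ((i & k) == 0);
                if ((a[i] > a[l]) == (up == ascending))
                    std::swap(a[i], a[l]);
            }
        }
    }
}
```

Sorting `{3, 7, 1, 8, 2, 6, 5, 4}` with `ascending = true` produces `{1, 2, 3, 4, 5, 6, 7, 8}`.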

### Command line interface

There are three options available:

- `-h` displays information about the available parameters and their default values.
- `-l <length>` sets `length` as the number of elements of the array that will be sorted. It must be a power of $2$. Its default value is $2^{15}$.
- `-s <sort>` sets `sort` as the type of sorting that we want for our array: decreasing ("dec") or increasing ("inc"). The default value is "inc".

## Key APIs and Concepts

- Device memory is allocated with `hipMalloc` and deallocated with `hipFree`.

- With `hipMemcpy` data bytes can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`).

- `hipEventCreate` creates events, which are used in this example to measure the kernels' execution time. `hipEventRecord` starts recording an event, and `hipEventSynchronize` waits for all the previous work in the stream up to the point when the specified event was recorded. With these three functions the start and stop times of the kernel can be measured, and with `hipEventElapsedTime` the kernel execution time in milliseconds can be obtained. Lastly, `hipEventDestroy` destroys an event.

- `myKernelName<<<...>>>` queues kernel execution on the device. All the kernels are launched on the default stream, `hipStreamDefault`, meaning that these executions are performed in order. `hipGetLastError` returns the last error produced by any runtime API call, making it possible to check whether any kernel launch resulted in an error.

## Demonstrated API Calls

### HIP runtime

#### Device symbols

- `blockDim`
- `blockIdx`
- `threadIdx`

#### Host symbols

- `__global__`
- `hipEvent_t`
- `hipEventCreate`
17 changes: 15 additions & 2 deletions Applications/convolution/README.md
# Applications Convolution Example

## Description

This example showcases a simple GPU implementation for calculating the [discrete convolution](https://en.wikipedia.org/wiki/Convolution#Discrete_convolution). The key point of this implementation is that in the GPU kernel each thread calculates the convolution value for one element of the resulting grid.

Constant memory is used for storing the mask. Constant memory is a read-only memory that is limited in size, but offers faster access times than regular memory. Furthermore, on some architectures it has a separate cache, so accessing constant memory can reduce the pressure on the memory system.

### Application flow

1. Default values for the size of the grid, mask and the number of iterations for the algorithm execution are set.
2. Command line arguments are parsed.
3. Host memory is allocated for the input, output and the mask. Input data is initialized with random numbers between 0 and 256.
9. If requested, the convolved grid, the input grid, and the reference results are printed to standard output.
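
As a host-side reference for what each GPU thread computes, the convolution of the whole grid can be sketched as follows (an illustrative version assuming a square, odd-sized mask and zero values outside the grid; the example's actual boundary handling may differ):

```cpp
#include <vector>

// Host-side 2D convolution reference. Each output element is the weighted
// sum of its neighborhood under the mask; elements outside the grid are
// treated as zero (an assumption for this sketch).
std::vector<float> convolve2d(const std::vector<float>& in,
                              int width, int height,
                              const std::vector<float>& mask,
                              int mask_size /* odd */)
{
    std::vector<float> out(in.size(), 0.0f);
    const int r = mask_size / 2;  // mask radius
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
        {
            float acc = 0.0f;
            for (int my = 0; my < mask_size; ++my)
                for (int mx = 0; mx < mask_size; ++mx)
                {
                    const int ix = x + mx - r;
                    const int iy = y + my - r;
                    // Skip out-of-bounds neighbors (treated as zero).
                    if (ix >= 0 && ix < width && iy >= 0 && iy < height)
                        acc += in[iy * width + ix] * mask[my * mask_size + mx];
                }
            out[y * width + x] = acc;
        }
    return out;
}
```

In the GPU kernel, the two outer loops disappear: each thread of the 2-dimensional grid computes the accumulation for its own `(x, y)` element.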

### Command line interface

There are five parameters available:

- `-h` displays information about the available parameters and their default values.
- `-x width` sets the grid size in the x direction. Default value is 4096.
- `-y height` sets the grid size in the y direction. Default value is 4096.
- `-p` Toggles the printing of the input, reference and output grids.
- `-i iterations` sets the number of times that the algorithm will be applied to the (same) grid. It must be an integer greater than 0. Its default value is 10.

## Key APIs and Concepts

- For this GPU implementation of the simple convolution calculation, the main kernel (`convolution`) is launched in a 2-dimensional grid. Each thread computes the convolution for one element of the resulting grid.

- Device memory is allocated with `hipMalloc` which is later freed by `hipFree`.

- Constant memory is declared in global scope for the mask, using the `__constant__` qualifier. The size of the object stored in constant memory must be available at compile time. Later the memory is initialized with `hipMemcpyToSymbol`.

- With `hipMemcpy` data can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`).

- `myKernelName<<<...>>>` queues the kernel execution on the device. All the kernels are launched on the default stream, `hipStreamDefault`, meaning that these executions are performed in order. `hipGetLastError` returns the last error produced by any runtime API call, making it possible to check whether any kernel launch resulted in an error.

- `hipEventCreate` creates the events used to measure kernel execution time, `hipEventRecord` starts recording an event and `hipEventSynchronize` waits for all the previous work in the stream when the specified event was recorded. These three functions can be used to measure the start and stop times of the kernel, and with `hipEventElapsedTime` the kernel execution time (in milliseconds) can be obtained. With `hipEventDestroy` the created events are freed.

## Demonstrated API Calls

### HIP runtime

#### Device symbols

- `blockIdx`
- `blockDim`
- `threadIdx`

#### Host symbols

- `__global__`
- `__constant__`
- `hipEventCreate`
14 changes: 13 additions & 1 deletion Applications/floyd_warshall/README.md
# Applications Floyd-Warshall Example

## Description

This example showcases a GPU implementation of the [Floyd-Warshall algorithm](https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm), which computes the shortest path between each pair of nodes in a given directed and (in this case) complete graph $G = (V, E, \omega)$. The key point of this implementation is that each kernel launch represents a step $k$ of the traditional CPU-implemented algorithm. Therefore, the kernel is launched as many times as the graph has nodes $\left(n = \vert V \vert \right)$.

In this example, there are `iterations` (consecutive) executions of the algorithm on the same graph. As each execution requires an unmodified graph input, multiple copy operations are required. Hence, the performance of the example can be improved by using _pinned memory_.
Expand All @@ -10,6 +11,7 @@ Pinned memory is simply a special kind of memory that cannot be paged out the ph
Therefore, using pinned memory saves significant time needed to copy from/to host memory. In this example, performances is improved by using this type of memory, given that there are `iterations` (consecutive) executions of the algorithm on the same graph.

### Application flow

1. Default values for the number of nodes of the graph and the number of iterations for the algorithm execution are set.
2. Command line arguments are parsed (if any) and the previous values are updated.
3. A number of constants are defined for kernel execution and input/output data size.
8. The mean time in milliseconds needed for each iteration is printed to standard output.
9. The results obtained are compared with the CPU implementation of the algorithm. The result of the comparison is printed to the standard output.
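
The CPU implementation used for the validation in the last step is the classic triple loop, sketched below. Each iteration of the outer loop over $k$ plays the role of one kernel launch in the GPU version:

```cpp
#include <vector>

// Classic CPU Floyd-Warshall over an n x n matrix of edge weights, where
// d[i * n + j] holds the currently known shortest distance from i to j.
// Each iteration of k corresponds to one kernel launch in the GPU version.
void floyd_warshall(std::vector<unsigned int>& d, int n)
{
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
            {
                // Is the path i -> k -> j shorter than the best path so far?
                const unsigned int via_k = d[i * n + k] + d[k * n + j];
                if (via_k < d[i * n + j])
                    d[i * n + j] = via_k;
            }
}
```

For instance, on a 3-node complete graph with $\omega(0,1)=4$, $\omega(1,2)=1$ and $\omega(2,0)=1$ (and large weights on the remaining edges), the computed shortest distance from node 0 to node 2 is $4+1=5$.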


### Command line interface

There are three parameters available:

- `-h` displays information about the available parameters and their default values.
- `-n nodes` sets `nodes` as the number of nodes of the graph to which the Floyd-Warshall algorithm will be applied. It must be a (positive) multiple of `block_size` (= 16). Its default value is 16.
- `-i iterations` sets `iterations` as the number of times that the algorithm will be applied to the (same) graph. It must be an integer greater than 0. Its default value is 1.

## Key APIs and Concepts

- For this GPU implementation of the Floyd-Warshall algorithm, the main kernel (`floyd_warshall_kernel`) is launched in a 2-dimensional grid. Each thread in the grid computes the shortest path between two nodes of the graph at a certain step $k$ $\left(0 \leq k < n \right)$. The threads compare the previously computed shortest paths using only the nodes in $V'=\{v_0,v_1,...,v_{k-1}\} \subseteq V$ as intermediate nodes with the paths that include node $v_k$ as an intermediate node, and take the shortest option. Therefore, the kernel is launched $n$ times.

- For improved performance, pinned memory is used to pass the results obtained in each iteration to the next one. With `hipHostMalloc` pinned host memory (accessible by the device) can be allocated, and `hipHostFree` frees it. In this example, host pinned memory is allocated using the `hipHostMallocMapped` flag, which indicates that `hipHostMalloc` must map the allocation into the address space of the current device. Beware that an excessive allocation of pinned memory can slow down the host execution, as the program is left with less physical memory available to map the rest of the virtual addresses used.

- Device memory is allocated using `hipMalloc`, which is later freed using `hipFree`.

- With `hipMemcpy` data bytes can be transferred from host to device (using `hipMemcpyHostToDevice`) or from device to host (using `hipMemcpyDeviceToHost`), among others.

- `myKernelName<<<...>>>` queues the kernel execution on the device. All the kernels are launched on the default stream, `hipStreamDefault`, meaning that these executions are performed in order. `hipGetLastError` returns the last error produced by any runtime API call, making it possible to check whether any kernel launch resulted in an error.

- `hipEventCreate` creates the events used to measure kernel execution time, `hipEventRecord` starts recording an event and `hipEventSynchronize` waits for all the previous work in the stream up to the point when the specified event was recorded. With these three functions the start and stop times of the kernel can be measured, and with `hipEventElapsedTime` the kernel execution time (in milliseconds) can be obtained.

## Demonstrated API Calls

### HIP runtime

#### Device symbols

- `blockIdx`
- `blockDim`
- `threadIdx`

#### Host symbols

- `__global__`
- `hipEventCreate`
- `hipEventDestroy`
6 changes: 5 additions & 1 deletion Applications/histogram/README.md
# Applications: Histogram Example

## Description

This program showcases a GPU kernel and its invocation of a histogram computation over a byte (`unsigned char`) array. A histogram constructs a table with the counts of each discrete value.
The diagram below showcases a 4-bin histogram over an 8-element array:

This is solved by striding over the input in such a way that each thread accesses a different bank.

![A diagram illustrating bank conflicts and solution using striding.](bank_conflict_reduction.svg)


### Application flow

1. Define and allocate inputs and outputs on host.
2. Allocate the memory on device and copy the input.
3. Launch the histogram kernel.
6. Verify the results on host.
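
The host-side verification in step 6 amounts to a sequential histogram, which can be sketched as follows (an illustrative version with one bin per possible byte value; the example's actual bin count may differ):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Sequential reference histogram over a byte array: bins[v] counts how many
// input elements have value v. The GPU result is compared against this table
// bin by bin.
std::array<unsigned int, 256> histogram_reference(const std::vector<std::uint8_t>& data)
{
    std::array<unsigned int, 256> bins{};  // zero-initialized counts
    for (const std::uint8_t v : data)
        ++bins[v];
    return bins;
}
```

The GPU kernel computes the same counts in parallel, using striding and atomic increments to avoid the bank conflicts described above.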

### Key APIs and concepts

- _Bank conflicts._ Memory is stored across multiple banks. Elements in banks are stored in 4-byte words. Each thread within a wavefront should access different banks to ensure high throughput.
- `__ffs(int input)` finds the 1-index of the first set least significant bit of the input.
- `__syncthreads()` halts this thread until all threads within the same block have reached this point.
### HIP runtime

#### Device symbols

- `blockDim`
- `blockIdx`
- `threadIdx`
- `__shared__`

#### Host symbols

- `__global__`
- `hipEvent_t`
- `hipEventCreate`
