Vitis accelerator #991

Open
wants to merge 44 commits into base: main

Commits (44)
143350b
Initial version of VitisAccelerator backend:
axiotisk Jun 1, 2023
a616c28
fixing discrepancies post-merge
alex-yang-upenn May 16, 2024
31f5851
reverting unnecessary changes
alex-yang-upenn May 16, 2024
c16888c
final adjustments
alex-yang-upenn May 16, 2024
e1af21a
minor fixes and testing notebook
alex-yang-upenn May 17, 2024
432b1f5
minor fixes
alex-yang-upenn May 17, 2024
43c5a93
Updated host code and added more board support
alex-yang-upenn May 17, 2024
bbcffe3
cleaned up c++ code generation and added build functionality
alex-yang-upenn May 19, 2024
30d6b51
Added ability to use numpy array as I/O + CNN fixes
alex-yang-upenn May 24, 2024
793060c
Optimizations for reading dat + copytree bugfix
alex-yang-upenn May 25, 2024
d07d1ab
updated testing notebook
alex-yang-upenn May 25, 2024
00ae141
Cleaned-up host code + improved .dat generation
alex-yang-upenn May 28, 2024
5a1ae52
fixed testing notebook
alex-yang-upenn May 28, 2024
f29b5a4
build() signature alignment + xcl update + write_host() overwrite
alex-yang-upenn Jun 3, 2024
1289a4d
Fix VCK5000 part definition
Jun 13, 2024
22076b7
Documentation draft
axiotisk Jun 14, 2024
f2b59fa
Default directives + HLS Clock control
alex-yang-upenn Jun 19, 2024
45dfd8b
implementing hw quant option
alex-yang-upenn Jun 28, 2024
0ed662e
Update makefile
Jun 13, 2024
05c83c8
Fix vck5000 detection in makefile
Jun 13, 2024
f82f683
Remove messageDb from config file now that it is handled in makefile
Jun 13, 2024
172f6f1
build dir name + versal packaging + ultraclean
alex-yang-upenn Jun 21, 2024
76ba4ed
minor fixes
alex-yang-upenn Jun 28, 2024
18924fd
Fix Makefile template and Makefile generation
Jul 1, 2024
25a0a7c
Python black formating
Jul 1, 2024
2a89e4e
Apply pre-commit suggested changes (formating)
Jul 1, 2024
4eed1ae
Update manifest and remove developpement requirement.txt
Jul 1, 2024
1c5a4e5
Update documentation.
Jul 1, 2024
5780d2d
Documentation update
alex-yang-upenn Jul 1, 2024
6e129da
fixing build() behavior + documentation
alex-yang-upenn Jul 1, 2024
d560c75
Whitespace cleanup
Jul 2, 2024
3074d8c
Fix missing parameter in create_initial_config() (due to rebase)
Jul 2, 2024
86ff4d3
Remove duplication in documentation
axiotisk Jul 4, 2024
62bb04b
Fix pre-commit
axiotisk Jul 5, 2024
a752f77
fixing spacing in generated code
alex-yang-upenn Jul 6, 2024
5376926
Fix typo
Jul 9, 2024
5be9569
Update bulild():
Jul 9, 2024
92c0692
Add a target parameter to hardware_predict()
Jul 9, 2024
f69a950
Update documentation.
Jul 10, 2024
236e7db
Setup emu in Makefile and edit tb_input_features in host
axiotisk Jul 19, 2024
51e9fd5
Backend and Makefile fixes for emulation
axiotisk Jul 23, 2024
1a3908d
Update host code for clarity & better data handling
alex-yang-upenn Aug 2, 2024
a8e8466
Allowing flexibility with platforms
alex-yang-upenn Aug 7, 2024
18f7fc7
VitisAccelerator Host code refactor:
Jan 12, 2025
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -5,3 +5,5 @@ graft contrib
recursive-include hls4ml/templates *
global-exclude .git .gitmodules .gitlab-ci.yml
include hls4ml/backends/vivado_accelerator/supported_boards.json
include hls4ml/backends/vitis_accelerator/supported_boards.json
include hls4ml/backends/vitis_accelerator/vivado_directives.json
109 changes: 109 additions & 0 deletions docs/backend/accelerator.rst
@@ -75,3 +75,112 @@ The ``predict`` method will send the input data to the PL and return the output

nn = NeuralNetworkOverlay('hls4ml_nn.bit', X_test.shape, y_test.shape)
y_hw, latency, throughput = nn.predict(X_test, profile=True)

================
VitisAccelerator
================

The ``VitisAccelerator`` backend leverages the `Vitis System Design Flow <https://www.xilinx.com/products/design-tools/vitis.html#design-flows>`_ to automate and simplify the creation of an hls4ml project targeting `AMD Alveo PCIe accelerators <https://www.amd.com/en/products/accelerators/alveo.html>`_.
The Vitis accelerator backend has been tested with the following boards:

* `Alveo u50 <https://www.xilinx.com/products/boards-and-kits/alveo/u50.html>`_
* `Alveo u55c <https://www.xilinx.com/products/boards-and-kits/alveo/u55c.html>`_
* `Alveo u250 <https://www.xilinx.com/products/boards-and-kits/alveo/u250.html>`_
* `Versal vck5000 <https://www.xilinx.com/products/boards-and-kits/vck5000.html>`_

Kernel wrapper
==============

To integrate with the Vitis System Design Flow and run on an accelerator, the generated ``hls4ml`` model must be encapsulated and built as a Vitis kernel (``.xo`` file) and linked into a binary file (``.xclbin``) during the implementation step. On the host side, standard C++ code using either `OpenCL <https://xilinx.github.io/XRT/master/html/opencl_extension.html>`_ or `XRT API <https://xilinx.github.io/XRT/master/html/xrt_native_apis.html>`_ can be used to download the ``.xclbin`` file to the accelerator card and use any kernel it contains.

The ``VitisAccelerator`` backend automatically generates a kernel wrapper, a host code example, and a Makefile to build the project.

**Note:** The current implementation of the kernel wrapper code is oriented toward throughput benchmarking rather than general inference use (see :ref:`here<hardware_predict-method>`). It can nonetheless be further customized to fit specific applications.

Options
=======

As PCIe accelerators are not suitable for ultra-low latency applications, it is assumed that they are used for high-throughput applications. To accommodate this, the backend supports the following options to optimize the kernel for throughput:

* ``num_kernel``: Number of kernel instances to implement in the hardware architecture.
* ``num_thread``: Number of host threads used to exercise the kernels in the host application.
* ``batchsize``: Number of samples to be processed in a single kernel execution.

Additionally, the backend offers the following options to customize the implementation:

* ``board``: The target board, must match one entry in ``supported_boards.json``.
* ``clock_period``: The target clock period in ns.
* ``hw_quant``: Whether arbitrary-precision quantization is performed in hardware. If True, quantization is performed in hardware and floats are used at the kernel interface; otherwise it is performed in software and arbitrary-precision types are used at the interface. (Defaults to ``False``.)
* ``vivado_directives``: A list of strings to be added under the ``[Vivado]`` section of the generated ``accelerator_card.cfg`` link configuration file. Can be used to add custom directives to the Vivado project.
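
These options are stored in the ``AcceleratorConfig`` section of the project configuration. As a rough illustration, the entries written by the backend's ``create_initial_config`` look like the following sketch (key names taken from ``vitis_accelerator_backend.py`` in this pull request; the values are only examples):

.. code-block:: Python

   config['AcceleratorConfig'] = {
       'Board': 'alveo-u55c',       # must match a key in supported_boards.json
       'Platform': None,            # None presumably falls back to the board's default platform
       'Num_Kernel': 4,             # compute units (CUs) implemented on the FPGA
       'Num_Worker': 8,             # host threads driving the CUs
       'Batchsize': 8192,           # samples processed per kernel execution
       'HW_Quant': False,           # quantize in hardware (float interface) or in software
       'Vivado_Directives': None,   # extra lines for the [Vivado] section of accelerator_card.cfg
   }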

Build workflow
==============

When the ``build`` method is called, the following options affect the build process (see the sketch after this list):

* ``reset``: If True, clears files generated during previous build processes (equivalent to ``make clean`` in the build folder).
* ``target``: Can be one of ``hw``, ``hw_emu``, or ``sw_emu``, defining which build target to use (default is ``hw``).
* ``debug``: If True, compiles the C++ host code and the HLS kernel in debug mode.
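
For example, once a model has been converted with the ``VitisAccelerator`` backend, a minimal build invocation could look like the following sketch (``hls_model`` is assumed to be the converted model):

.. code-block:: Python

   # Clean any previous build products and build for the hardware-emulation target
   hls_model.build(reset=True, target='hw_emu')

   # Default build: full hardware synthesis and implementation (target='hw')
   hls_model.build()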

Once the project is generated, it is possible to run the build steps manually by using one of the following ``make`` targets in the generated project directory:

* ``host``: Compiles the host application.
* ``hls``: Produces only the kernel's object file.
* ``xclbin``: Produces only the kernel's .xclbin file.
* ``clean``: Removes all generated files.
* ``run``: Runs the host application using the ``.xclbin`` file and the input data present in ``tb_data/tb_input_features.dat``.

It is also possible to run the full build process by calling ``make`` without any target. Modifications to the ``accelerator_card.cfg`` file can be done manually before running the build process (e.g., to change the clock period or add an additional ``.xo`` kernel to the build).
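
A minimal manual flow in the generated project directory could look like the following sketch (the ``TARGET`` environment variable mirrors the ``target`` option of ``build()``):

.. code-block:: Bash

   make clean            # remove files from previous builds
   TARGET=hw make        # full build: kernel object, .xclbin link, and host application
   TARGET=hw make run    # run on tb_data/tb_input_features.dat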

Host code
=========

Once built, the host program can be run to load the board and perform inferences:

.. code-block:: Bash

./host

By default, all Compute Units (CUs) on all compatible devices will be used, with three worker threads per CU.

The generated host application supports the following options to tweak the execution:

* ``-d``: device BDF to use (can be specified multiple times)
* ``-x``: XCLBIN path
* ``-i``: input feature file
* ``-o``: output feature file
* ``-c``: maximum computing units count to use
* ``-n``: number of worker threads to use
* ``-r``: number of repetitions of the input feature file (for artificially increasing the data size for benchmarking purposes)
* ``-v``: enable verbose output
* ``-h``: print help

The following example shows how to limit execution to one device, one CU, and one worker thread:

.. code-block:: Bash

./host -d 0000:c1:00.1 -c 1 -n 1

Example
=======

The following example is a modified version of `hls4ml example 7 <https://github.com/fastmachinelearning/hls4ml-tutorial/blob/master/part7_deployment.ipynb>`_.

.. code-block:: Python

import hls4ml
hls_model = hls4ml.converters.convert_from_keras_model(
model,
hls_config=config,
output_dir='model_3/hls4ml_prj_vitis_accel',
backend='VitisAccelerator',
board='alveo-u55c',
num_kernel=4,
num_thread=8,
batchsize=8192,
hw_quant=False,
vivado_directives=["prop=run.impl_1.STEPS.PLACE_DESIGN.ARGS.DIRECTIVE=Explore"]
)
hls_model.compile()
hls_model.build()
y = hls_model.hardware_predict(X_test) # Limited to batchsize * num_kernel * num_thread for now
21 changes: 21 additions & 0 deletions docs/ir/modelgraph.rst
@@ -102,3 +102,24 @@ The trace method is an advanced version of the ``predict`` method. It's used to

#We also support a similar function for keras
keras_trace = hls4ml.model.profiling.get_ymodel_keras(keras_model, X)

----

.. _hardware_predict-method:

``hardware_predict`` method
===========================

A specialized version of the ``predict`` method, for the VitisAccelerator backend after a successful build. Runs the project on the FPGA and obtains predictions for the supplied numpy array.

**Note:** The host code being run under the hood is an example written for generic benchmarking purposes, helpful for validating projects and gauging maximum throughput. It should be further adapted for more specific applications. Currently, the maximum number of input samples that can be processed is ``batchsize * num_cu * num_buffer``. If the input array exceeds that size, the additional samples will be ignored.

An optional ``target`` argument can be used to specify the target emulation mode (``hw``, ``sw_emu``, ``hw_emu``) to run the project on. The default is ``hw``.

.. code-block:: python

# Suppose that you already have input array X
# Note that you have to do both hls_model.compile() and hls_model.build(), ensuring the
# .xclbin file is successfully created, before using hardware_predict

y = hls_model.hardware_predict(X)
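
For example, to exercise a project built for software emulation instead of the actual card (a sketch; it assumes the corresponding emulation binary was produced beforehand, e.g. with ``hls_model.build(target="sw_emu")``):

.. code-block:: python

   y_emu = hls_model.hardware_predict(X, target="sw_emu")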
6 changes: 4 additions & 2 deletions hls4ml/backends/__init__.py
@@ -3,17 +3,19 @@
from hls4ml.backends.oneapi.oneapi_backend import OneAPIBackend
from hls4ml.backends.quartus.quartus_backend import QuartusBackend
from hls4ml.backends.symbolic.symbolic_backend import SymbolicExpressionBackend
from hls4ml.backends.vitis_accelerator.vitis_accelerator_config import VitisAcceleratorConfig # noqa: F401
from hls4ml.backends.vivado.vivado_backend import VivadoBackend
from hls4ml.backends.vivado_accelerator.vivado_accelerator_backend import VivadoAcceleratorBackend
from hls4ml.backends.vivado_accelerator.vivado_accelerator_config import VivadoAcceleratorConfig # noqa: F401

from hls4ml.backends.catapult.catapult_backend import CatapultBackend # isort: skip

from hls4ml.backends.vitis.vitis_backend import VitisBackend # isort: skip
from hls4ml.backends.vitis_accelerator.vitis_accelerator_backend import VitisAcceleratorBackend # isort: skip


register_backend('Vivado', VivadoBackend)
register_backend('VivadoAccelerator', VivadoAcceleratorBackend)
register_backend('Vitis', VitisBackend)
register_backend('VitisAccelerator', VitisAcceleratorBackend)
register_backend('Quartus', QuartusBackend)
register_backend('Catapult', CatapultBackend)
register_backend('SymbolicExpression', SymbolicExpressionBackend)
Empty file.
Empty file.
34 changes: 34 additions & 0 deletions hls4ml/backends/vitis_accelerator/passes/feature_check.py
@@ -0,0 +1,34 @@
from hls4ml.model.optimizer import OptimizerPass


class ValidateConvImplementation(OptimizerPass):
def match(self, node):
return 'Conv' in node.class_name

def transform(self, model, node):
if node.get_attr('implementation', 'linebuffer') == 'encoded':
print(
f'WARNING: "Encoded" implementation in "{node.name}" ({node.class_name}) is not supported in Vitis backend. '
'Switching to "LineBuffer" implementation.'
)
node.set_attr('implementation', 'linebuffer')


class ValidateStrategy(OptimizerPass):
_resource_layer_cls = ['Conv1D', 'Conv2D', 'Dense']

def match(self, node):
is_resource_layer = len([layer_cls for layer_cls in self._resource_layer_cls if layer_cls in node.class_name]) > 0
is_resource_strategy = node.model.config.is_resource_strategy(node)

return is_resource_layer and is_resource_strategy

def transform(self, model, node):
n_in, _ = model.config.backend.get_layer_mult_size(node)
rf = node.get_attr('reuse_factor')
if rf > n_in and rf % n_in > 0:
print(
f'WARNING: "Resource" strategy in "{node.name}" ({node.class_name}) may have suboptimal QoR in Vitis '
'backend due to use of "urem" cores.\n'
'Consider using a different ReuseFactor or switching to "Latency" strategy.'
)
26 changes: 26 additions & 0 deletions hls4ml/backends/vitis_accelerator/supported_boards.json
@@ -0,0 +1,26 @@
{
"alveo-u55c": {
"board_type": "alveo",
"part": "xcu55c-fsvh2892-2L-e",
"platform": ["xilinx_u55c_gen3x16_xdma_3_202210_1"],
"memory": {"type": "hbm", "channels": 32, "capacity": 16}
},
"alveo-u50": {
"board_type": "alveo",
"part": "xcu50-fsvh2104-2-e",
"platform": ["xilinx_u50_gen3x16_xdma_5_202210_1"],
"memory": {"type": "hbm", "channels": 32, "capacity": 8}
},
"alveo-u250": {
"board_type": "alveo",
"part": "xcu250-figd2104-2L-e",
"platform": ["xilinx_u250_xdma_201830_2"],
"memory": {"type": "ddr", "channels": 4, "capacity": 64}
},
"vck5000": {
"board_type": "versal",
"part": "xcvc1902-vsvd1760-2MP-e-S",
"platform": ["xilinx_vck5000_gen4x8_qdma_2_202220_1"],
"memory":{"type": "ddr", "channels": 3, "capacity": 12}
}
}
165 changes: 165 additions & 0 deletions hls4ml/backends/vitis_accelerator/vitis_accelerator_backend.py
@@ -0,0 +1,165 @@
import os
import subprocess
import sys

import numpy as np

from hls4ml.backends import VitisBackend, VivadoBackend
from hls4ml.model.flow import get_flow, register_flow


class VitisAcceleratorBackend(VitisBackend):
def __init__(self):
super(VivadoBackend, self).__init__(name="VitisAccelerator")
self._register_layer_attributes()
self._register_flows()

def create_initial_config(
self,
board="alveo-u55c",
platform=None,
part=None,
clock_period=5,
clock_uncertainty='27%',
io_type="io_parallel",
num_kernel=1,
num_worker=1,
batchsize=8192,
hw_quant=False,
vivado_directives=None,
**_,
):
"""
Create initial accelerator config with default parameters

Args:
board: one of the keys defined in supported_boards.json
clock_period: clock period passed to hls project
io_type: io_parallel or io_stream
num_kernel: how many compute units to create on the fpga
num_worker: how many threads the host cpu uses to drive each CU on the fpga
batchsize: how many samples to process within a single buffer on the fpga
vivado_directives: Directives passed down to Vivado that control the hardware synthesis and implementation steps
Returns:
populated config
"""
board = board if board is not None else "alveo-u55c"
config = super().create_initial_config(part, clock_period, clock_uncertainty, io_type)
config["AcceleratorConfig"] = {}
config["AcceleratorConfig"]["Board"] = board
config["AcceleratorConfig"]["Platform"] = platform
config["AcceleratorConfig"]["Num_Kernel"] = num_kernel
config["AcceleratorConfig"]["Num_Worker"] = num_worker
config["AcceleratorConfig"]["Batchsize"] = batchsize
config["AcceleratorConfig"]["HW_Quant"] = hw_quant
config["AcceleratorConfig"]["Vivado_Directives"] = vivado_directives
return config

def build(
self,
model,
reset=False,
target="hw",
debug=False,
**kwargs,
):
self._validate_target(target)

if "linux" in sys.platform:

curr_dir = os.getcwd()
os.chdir(model.config.get_output_dir())

command = f"TARGET={target} "

if debug:
command += "DEBUG=1 "

command += " make all"

# Cleaning
if reset:
os.system(f"TARGET={target} make clean")

# Pre-loading libudev
ldconfig_output = subprocess.check_output(["ldconfig", "-p"]).decode("utf-8")
for line in ldconfig_output.split("\n"):
if "libudev.so" in line and "x86" in line:
command = "LD_PRELOAD=" + line.split("=>")[1].strip() + " " + command
break
os.system(command)

os.chdir(curr_dir)
else:
raise Exception("Currently untested on non-Linux OS")

def numpy_to_dat(self, model, x):
if len(model.get_input_variables()) != 1:
raise Exception("Currently unsupported for multi-input/output projects")

# Verify numpy array of correct shape
expected_shape = model.get_input_variables()[0].size()
actual_shape = np.prod(x.shape[1:])
if expected_shape != actual_shape:
raise Exception(f"Input shape mismatch, got {x.shape}, expected (_, {expected_shape})")

# Write to tb_data/tb_input_features.dat
samples = x.reshape(x.shape[0], -1)
input_dat = f"{model.config.get_output_dir()}/tb_data/tb_input_features.dat"
np.savetxt(input_dat, samples, fmt="%.4e")

def dat_to_numpy(self, model):
expected_shape = model.get_output_variables()[0].size()
output_file = f"{model.config.get_output_dir()}/tb_data/hw_results.dat"
y = np.loadtxt(output_file, dtype=float).reshape(-1, expected_shape)
return y

def hardware_predict(self, model, x, target="hw", debug=False, profilingRepeat=-1):
# Build the environment-variable prefix for the make invocation
command = ""
if debug:
command += "DEBUG=1 "
if isinstance(profilingRepeat, int) and profilingRepeat > 0:
command += "PROFILING_DATA_REPEAT_COUNT=" + str(profilingRepeat) + " "
self._validate_target(target)

self.numpy_to_dat(model, x)

currdir = os.getcwd()
os.chdir(model.config.get_output_dir())
command += "TARGET=" + target + " make run"
os.system(command)
os.chdir(currdir)

return self.dat_to_numpy(model)

def _register_flows(self):
validation_passes = [
"vitisaccelerator:validate_conv_implementation",
"vitisaccelerator:validate_strategy",
]
validation_flow = register_flow(
"validation",
validation_passes,
requires=["vivado:init_layers"],
backend=self.name,
)

# Any potential templates registered specifically for Vitis backend
template_flow = register_flow(
"apply_templates",
self._get_layer_templates,
requires=["vivado:init_layers"],
backend=self.name,
)

writer_passes = ["make_stamp", "vitisaccelerator:write_hls"]
self._writer_flow = register_flow("write", writer_passes, requires=["vitis:ip"], backend=self.name)

ip_flow_requirements = get_flow("vivado:ip").requires.copy()
ip_flow_requirements.insert(ip_flow_requirements.index("vivado:init_layers"), validation_flow)
ip_flow_requirements.insert(ip_flow_requirements.index("vivado:apply_templates"), template_flow)

self._default_flow = register_flow("ip", None, requires=ip_flow_requirements, backend=self.name)

def _validate_target(self, target):
if target not in ["hw", "hw_emu", "sw_emu"]:
raise Exception("Invalid target, must be one of 'hw', 'hw_emu' or 'sw_emu'")