Logcumsumexp_1.0 #996

Closed
wants to merge 17 commits into from
19f77c8
[Feature](mluOpAdamW): add param check for adam_w (#983)
gggghja Apr 2, 2024
f595288
[Feature](bangc-ops): update proto-repo. (#986)
PetrelYy Apr 2, 2024
9d9195c
[Fix](mlu-ops): Fix independent_build.sh return code. (#987)
mahxn0 Apr 3, 2024
6c604cc
[Feature](mlu-ops): remove the compiling to target 200 series when mp…
DanieeelLiu Apr 3, 2024
649be31
[Fix](mlu-ops): Fix the softlink relation in rpm package. (#991)
DanieeelLiu Apr 7, 2024
35f4f10
[Fix](mluOpAdamW): fix invoke kernel error (#997)
gggghja Apr 12, 2024
f57d8c9
[Doc](mlu-ops): update release note (#1001)
PetrelYy Apr 15, 2024
b94e228
[Feature](mluOpGenerateProposalsV2):check nan/inf state and reconstru…
mahxn0 Apr 19, 2024
b6e4b09
[Feature](mluOpGenerateProposalsV2):update.
mahxn0 Apr 25, 2024
223363a
[Fix](mluOpDeformRoiPoolBackward, mluOpRoiAlignRotated*): Roi ops get…
chqy99 Apr 28, 2024
a427550
[Fix](mluOpGenerateProposalsV2):fix and update nan/inf support state.
mahxn0 Apr 30, 2024
f18ba84
[Fix](mluOpGenerateProposalsV2):fix and update nan/inf support state.
mahxn0 Apr 30, 2024
b62fa54
[Fix](mluOpGenerateProposalsV2):fix and update nan/inf support state.
mahxn0 Apr 30, 2024
91badd8
[Fix](mluOpGenerateProposalsV2):fix and update nan/inf support state.
mahxn0 Apr 30, 2024
f510ac5
[Fix](mluOpGenerateProposalsV2): Test pipeline.
mahxn0 May 9, 2024
b16b811
[Fix](mlu-ops): Fix compiling error on Kylin (#1022)
DanieeelLiu May 10, 2024
cb42431
[Docs](mluOpAdamW): Update design docs (#1024)
gggghja May 14, 2024
2 changes: 1 addition & 1 deletion .github/workflows/daily.yaml
@@ -12,7 +12,7 @@ jobs:
strategy:
matrix:
runner: [mlu370-m8]
- mlu_ops_version : [1.1.0]
+ mlu_ops_version : [1.1.1]
cntoolkit_version : [3.8.4]
cnnl_version: [1.23.2]
runs-on: ${{matrix.runner}}
2 changes: 1 addition & 1 deletion .github/workflows/mluops_all_system_ci.yaml
@@ -30,7 +30,7 @@ jobs:
strategy:
matrix:
runner: [mlu370-m8]
- mlu_ops_version : [1.1.0]
+ mlu_ops_version : [1.1.1]
cntoolkit_version : [3.8.4]
cnnl_version: [1.23.2]
os: [ubuntu20.04, centos7, centos8, kylin10]
2 changes: 1 addition & 1 deletion .github/workflows/mluops_ci.yaml
@@ -39,7 +39,7 @@ jobs:
strategy:
matrix:
runner: [mlu370-m8]
- mlu_ops_version : [v1.1.0]
+ mlu_ops_version : [v1.1.1]
runs-on: [yellow]
steps:
- uses: actions/checkout@v3
2 changes: 1 addition & 1 deletion build.property
@@ -1,5 +1,5 @@
{
- "version": "1.1.0-1",
+ "version": "1.1.1-1",
"python": "3.6.0",
"build_requires": {"cntoolkit": ["release","3.8.4-1"],
"cnnl":["release","1.23.2-1"],
9 changes: 9 additions & 0 deletions docs/api_guide/update.rst
@@ -3,6 +3,15 @@ Update History

This section lists contents that were made for each product release.

* V1.1.1

**Date:** April 12, 2024

**Changes:**

- None.


* V1.1.0

**Date:** March 28, 2024
4 changes: 2 additions & 2 deletions docs/design_docs/adam_w/adam_w.md
Original file line number Diff line number Diff line change
@@ -177,7 +177,7 @@ The adamw operator is element-wise, so it only needs to be split by data volume
2. Because NRAM storage is limited, each core cannot process its assigned data in one pass and must iterate; this even split also leaves a remainder. To keep that remainder at zero, the split must align each core's data to the length a single loop iteration can process.
3. After each core loops over its assigned data, the last core processes the rem_for_all data. Since most MLUs have multiple compute cores, splitting the computation across cores in parallel greatly improves performance; the split tasks require no inter-core communication, so the task type is block.

- Multi-core split: the memory-copy instructions in the software pipeline require addresses to be multiples of 128, so every partition's start address must be a multiple of 128. Concretely, the whole vector is first divided into 128-byte blocks, the blocks are distributed across cores with one share per core, and the first core then computes the elements left over from blocking plus the blocks left over from distribution.
+ Multi-core split: the compute instructions in the software pipeline require addresses to be multiples of 128, so every partition's start address must be a multiple of 128. Concretely, the whole vector is first divided into 128-byte blocks, the blocks are distributed across cores with one share per core, and the first core then computes the elements left over from blocking plus the blocks left over from distribution.

Single-core split: because on-chip NRAM space is limited, a core cannot process all of its assigned data in one pass when the volume is large; it loops instead, handling a portion of the elements per iteration, where the count is determined by the NRAM space allotted to that vector.

@@ -235,4 +235,4 @@ Add the necessary log messages to the bangc code, such as input sizes and data types

### 4.2 已经过优化的规模说明

(First commit; none yet)
(First commit; none yet)
34 changes: 22 additions & 12 deletions docs/design_docs/generate_proposals_v2/generate_proposals_v2.md
@@ -8,6 +8,7 @@
| Version | Reviser | Date | Description |
| ----- | ------ | ------- | ------- |
| V1.0 | 谷中豪 | 2022-08-22 | First commit |
| V1.1 | 马向军 | 2024-04-19 | nan/inf support and performance optimization |

* #### Content description
This is the design document of the `generate_proposals_v2` operator, covering requirements analysis, interface design, solution design, and performance-optimization notes.
@@ -396,8 +397,8 @@ int rem_num = per_core_num % seg_pad_k;
// | HWA | taskDim | taskDim |
```

- #### 3.1.2 createAndRemoveBoxes implementation
- ##### 3.1.2.1 Computing per-core data volume and offsets in createAndRemoveBoxes
+ #### 3.1.2 FilterBoxes implementation
+ ##### 3.1.2.1 Computing per-core data volume and offsets in FilterBoxes
```c++
// Compute the data volume per cluster
int rem_num = pre_nms_top_n % taskDimY;
@@ -415,31 +416,31 @@
int rem_num = per_core_num % seg_pad_1;
```

- ##### 3.1.2.2 createAndRemoveBox implementation
+ ##### 3.1.2.2 FilterBoxes implementation
1. Load the scores, anchors, bbox_deltas, and variances data from GDRAM and split it evenly across each core's nram space; each core loads per_core_num elements, seg_pad_1 of them per loop iteration;

2. After each load, use bang_ge to obtain the mask of nram scores greater than or equal to k_score;

3. Using bang_collect with the step-2 mask, gather the `scores`, `anchors`, `bbox_deltas`, and `variances` values at positions where the mask is 1; `scores` needs one collect, while `anchors`, `bbox_deltas`, and `variances` each collect their four components separately; each loop iteration collects seg_pad_1 elements;

- 4. With the collected data, create proposals following the createbox computation;
+ 4. With the collected data, create proposals following the proposalsBoxesDecode computation;

5. Following the removeSmallBox method, generate a new mask2 and use bang_collect to drop proposals whose width or height is below min_size, compacting the valid proposals; this completes the computation of one loop iteration;

6. Save the proposals created in this iteration to workspace; if the core's data is not fully processed, return to step 2;<br>

- ##### 3.1.2.3 createAndRemoveBox nram and workspace layout
+ ##### 3.1.2.3 FilterBoxes nram and workspace layout
```c++
// nram: reload scores, anchors, bbox_deltas, variances from workspace, seg_pad_1 = max_nram_size / (13 + X)
// | scores | anchors | bbox_deltas | variances | nram_temp |
// | seg_pad_1 | 4 * seg_pad_1 | 4 * seg_pad_1 | 4 * seg_pad_1 | X * seg_pad_1 |

- // workspace: proposals produced by createAndRemoveBox are stored in workspace, along with their corresponding scores
+ // workspace: proposals produced by FilterBoxes are stored in workspace, along with their corresponding scores
// | scores | proposals | scores_tmp | proposals_tmp | collect_num |
// | AHW | AHW * 4 | AHW | AHW * 4 | taskDim |
```

- ##### 3.1.2.4 createbox computation
+ ##### 3.1.2.4 proposalsBoxesDecode computation
a. From the anchor's two corner coordinates (xmin,ymin,xmax,ymax), compute box_anchor's center (cx, cy) and the anchor's width and height;<br>
```c++
offset = pixes_offset? 1.0 : 0;
@@ -473,7 +474,7 @@ proposals[1] = Max(Min(oymin, im_shape[0] - offset), 0.);
proposals[2] = Max(Min(oxmax, im_shape[1] - offset), 0.);
proposals[3] = Max(Min(oymax, im_shape[0] - offset), 0.);
```
- ##### 3.1.2.5 removeSmallBoxs computation
+ ##### 3.1.2.5 filterBoxes computation
1. From the proposals' two corner coordinates, compute the proposal width box_w and height box_h;

2. With bang_ge, obtain the masks comparing box_w and box_h against min_size, denoted mask_w and mask_h;
@@ -556,7 +557,7 @@ __mlu_func__ void mluOpsGeneratorProposalsV2Kernel(){
int core_offset =(coreId < rem_core_num) ? coreId * per_core_num : coreId * per_core_num + rem_core_num;
...
getTopKVal();
- createBox();
+ proposalsBoxesDecode();
removeSmallBox();
nms();
...
@@ -600,8 +601,8 @@ __mlu_func__ void getTopKVal(T * scores, T * bbox_deltas, T *anchors, T *varianc
}
}

- // createbox implementation
- // `createbox` generates proposals from the input anchor, bbox_deltas, and variances coordinates;
+ // proposalsBoxesDecode implementation
+ // `proposalsBoxesDecode` generates proposals from the input anchor, bbox_deltas, and variances coordinates;

// output = exp(input)
__mlu_func__ void calcExp(T *output, const T *input, const int length) {
@@ -614,7 +615,7 @@ __mlu_func__ void calcExp(T *output, const T *input, const int length) {
#endif
}
// Generate proposals
- __mlu_func__ void createBox(const T* anchor, const T *deltas, const T *var, const int deal_size, T * proposals, T *nram_temp, bool pixes_offset = true){
+ __mlu_func__ void proposalsBoxesDecode(const T* anchor, const T *deltas, const T *var, const int deal_size, T * proposals, T *nram_temp, bool pixes_offset = true){
T *axmin = anchor;
T *aymin = anchor + deal_size;
T *axmax= anchor + 2 * deal_size;
@@ -768,6 +769,15 @@ __mlu_func__ void removeSmallBox(T * boxes, T *scores, const T *im_size,
__bang_collect(scores, scores, mask_result, deal_size);
}
```
### Optimization plan
1. The host calls a topk kernel that outputs the top-k scores in descending order together with their indexes, replacing the nram-side kernel that output only the k-th largest score. The original implementation used many scalar top computations and branches and performed poorly. The load IO volume is unchanged before and after; the gain comes from compute efficiency.

2. Gather only the top-k scores according to the indexes. Concretely:
- from the indexes, compute the gather offset offset_score for the scores and gather the corresponding scores to nram
- compute the gather offsets for deltaboxes, anchors, and variances as offset = 4 * offset_score
- transpose the gathered deltaboxes, anchors, and variances in nram so that x1, y1, x2, y2 are contiguous
- the compute stage no longer needs compare and select ops; the rest matches the proposalsBoxesDecode computation, reducing the amount of work
- the load IO volume drops to N * topk + 4 * N * topk + topk + topk (topk <= H * W * A; in real networks H * W * A is far larger than topk)

### 3.3 Splitting (task splitting, multi-core splitting)
**Splitting strategy**
2 changes: 1 addition & 1 deletion docs/design_docs/roi_align_rotated/roi_align_rotated.md
@@ -61,7 +61,7 @@ The roi_align_rotated operator is used in the FOTS network structure; with bilinear interpolation
| spatial_scale | scaling factor of rois on the feature map | input | float | / | none |
| aligned | whether pixels in rois need an offset | input | bool | / | none |
| clockwise | whether rotation is clockwise | input | bool | / | none |
- | output_desc | descriptor of the output data | input | | / | output must be 4-D; dim 0 equals rois dim 0, dim 1 equals pooled_height, dim 2 equals pooled_width, dim 3 equals featrues dim 3 |
+ | output_desc | descriptor of the output data | input | | / | output must be 4-D; dim 0 equals rois dim 0, dim 1 equals pooled_height, dim 2 equals pooled_width, dim 3 equals features dim 3 |
| output | mlu start address of the output data | output | half, float | NHWC | none |
#### 1.3.1 roi_align_rotated_backward
| Parameter | Meaning | Type (input/output) | Supported types | Layout | Size limits |
21 changes: 21 additions & 0 deletions docs/release_notes/mlu_ops.rst
@@ -45,6 +45,27 @@ Cambricon MLU-OPS has the following features:
| Cambricon MLU-OPS v1.0.z | x86_64 | MLU370 |
+-----------------------------+------------------------+--------------------------------+

v1.1.1
-----------------

Feature changes
~~~~~~~~~~~~~~~~~~~~~

- No new features.

Fixed issues
~~~~~~~~~~~~~~~~~~~~~

- Fixed the following issues:

  * Fixed a functional issue in the profiling tool when handling test cases with identical names.
  * Fixed a functional issue in the mluOpAdamW operator caused by a missing task-type assignment.

Known issues
~~~~~~~~~~~~~~~~~~~~~

- None.


v1.1.0
-----------------
11 changes: 9 additions & 2 deletions docs/user_guide/2_update_history/index.rst
@@ -1,8 +1,16 @@
Update History
==============

- * **V1.1.0**
+ * **V1.1.1**

**Update date**: April 12, 2024

**Update contents**:

- No operator updates.


* **V1.1.0**
**Update date**: March 28, 2024

**Update contents**:

@@ -12,7 +20,6 @@
+ :ref:`adam_w`
+ :ref:`exec_fft`


* **V1.0.0**

**更新时间**:2024年2月6日
4 changes: 3 additions & 1 deletion independent_build.sh
@@ -395,7 +395,7 @@ fi
if [ "${MLUOP_BUILD_PREPARE_ONLY}" = "ON" ]; then
prog_log_info "You have called prepare cntoolkit explicitly."
prepare_cntoolkit
- exit -1
+ exit 0
elif [ "${MLUOP_BUILD_PREPARE}" = "ON" ]; then
prepare_cntoolkit
build_requires_version_check
Expand Down Expand Up @@ -452,8 +452,10 @@ if [ "${MLUOP_PACKAGE_INFO_SET}" = "ON" ]; then
mkdir -p ${PACKAGE_DIR}
mkdir -p ${PACKAGE_DIR}/include
mkdir -p ${PACKAGE_DIR}/lib64
mkdir -p ${PACKAGE_DIR}/samples

cp -rf ${BUILD_DIR}/lib/libmluops.so* ${PACKAGE_DIR}/lib64
cp -r samples/* ${PACKAGE_DIR}/samples
cp mlu_op.h ${PACKAGE_DIR}/include

TEST_DIR="test_workspace/mluops"
18 changes: 7 additions & 11 deletions installer/centos7.5/SPECS/mluops-independent.spec
@@ -1,11 +1,11 @@
%define __spec_install_post /usr/lib/rpm/brp-compress || :
%define debug_package %{nil}
%define neuware_dir /usr/local/neuware
- %define build_dir build
+ %define build_dir package

Name: mluops
Summary: The Machine Learning Unit OPerators
- Version: 1.1.0
+ Version: 1.1.1
Release: 1%{?dist}
License: Cambricon Release License
Vendor: Cambricon Inc.
@@ -47,13 +47,9 @@ The Machine Learning Unit OPerators.
bash independent_build.sh -t %{_packagetype}

%install
install -d $RPM_BUILD_ROOT%{neuware_dir}/lib64
install -d $RPM_BUILD_ROOT%{neuware_dir}/include
strip %{build_dir}%{neuware_dir}/lib64/libmluops.so*
cp -rf %{build_dir}/* $RPM_BUILD_ROOT
install -d $RPM_BUILD_ROOT/etc/ld.so.conf.d
strip %{build_dir}/lib/libmluops.so*
cp %{build_dir}/lib/libmluops.so* $RPM_BUILD_ROOT%{neuware_dir}/lib64/
cp mlu_op.h $RPM_BUILD_ROOT%{neuware_dir}/include/
cp -r samples $RPM_BUILD_ROOT%{neuware_dir}/
cp $RPM_SOURCE_DIR/neuware-env.conf $RPM_BUILD_ROOT/etc/ld.so.conf.d/

%clean
@@ -62,15 +58,15 @@ cp $RPM_SOURCE_DIR/neuware-env.conf $RPM_BUILD_ROOT/etc/ld.so.conf.d/

%files
%defattr (-, root, root)
%{neuware_dir}/include/mlu_op.h
%{neuware_dir}/lib64/libmluops.so*
%{neuware_dir}/samples/mlu-ops
%{neuware_dir}/*
/etc/ld.so.conf.d/neuware-env.conf

%post -p /sbin/ldconfig
%postun -p /sbin/ldconfig

%changelog
* Thu Apr 12 2024 Cambricon Software Team <[email protected]>
- release mluops v1.1.1
* Thu Mar 28 2024 Cambricon Software Team <[email protected]>
- release mluops v1.1.0
* Tue Feb 6 2024 Cambricon Software Team <[email protected]>
4 changes: 4 additions & 0 deletions installer/independent/debian/changelog
@@ -1,3 +1,7 @@
mluops (1.1.1-1.ubuntu16.04) xenial; urgency=medium

* Release mluops v1.1.1

mluops (1.1.0-1.ubuntu16.04) xenial; urgency=medium

* Release mluops v1.1.0
15 changes: 13 additions & 2 deletions kernels/adam_w/adam_w.cpp
@@ -97,6 +97,17 @@ mluOpAdamW(mluOpHandle_t handle, const mluOpAdamWDescriptor_t adamw_desc,
PARAM_CHECK("[mluOpAdamW]", momentum_desc != nullptr);
PARAM_CHECK("[mluOpAdamW]", velocity_desc != nullptr);
PARAM_CHECK("[mluOpAdamW]", grad_desc != nullptr);
PARAM_CHECK("[mluOpAdamW]", param_desc->dtype == MLUOP_DTYPE_FLOAT)
PARAM_CHECK("[mluOpAdamW]", paramh_desc->dtype == MLUOP_DTYPE_BFLOAT16)
PARAM_CHECK("[mluOpAdamW]", momentum_desc->dtype == MLUOP_DTYPE_FLOAT)
PARAM_CHECK("[mluOpAdamW]", velocity_desc->dtype == MLUOP_DTYPE_FLOAT)
PARAM_CHECK("[mluOpAdamW]", grad_desc->dtype == MLUOP_DTYPE_BFLOAT16)

PARAM_CHECK_LE("[mluOpAdamW]", beta1, 1.0)
PARAM_CHECK_GE("[mluOpAdamW]", beta1, 0.0)
PARAM_CHECK_LE("[mluOpAdamW]", beta2, 1.0)
PARAM_CHECK_GE("[mluOpAdamW]", beta2, 0.0)
PARAM_CHECK("[mluOpAdamW]", epsilon > 0)

size_t param_dims = 0;
size_t paramh_dims = 0;
@@ -235,12 +246,12 @@ mluOpAdamW(mluOpHandle_t handle, const mluOpAdamWDescriptor_t adamw_desc,
return MLUOP_STATUS_ARCH_MISMATCH;
}
case CNRT_FUNC_TYPE_UNION1: {
- VLOG(5) << "Launch Kernel MLUUnionKernelApplyAdamW<<<Union"
+ VLOG(5) << "Launch Kernel KernelApplyAdamW<<<Union"
<< k_type / CORE_DIM << ", " << k_dim.x << ", " << k_dim.y << ", "
<< k_dim.z << ">>>";
CHECK_RETURN(
"[mluOpAdamW]",
- MLUUnionKernelApplyAdamW(
+ KernelApplyAdamW(
k_dim, k_type, handle->queue, (void *)param, (void *)param_h,
(void *)grad, (void *)momentum, (void *)velocity, lr, beta1,
beta2, bias1, bias2, epsilon, adamw_desc->weight_decay,
2 changes: 1 addition & 1 deletion kernels/adam_w/adam_w.h
@@ -33,7 +33,7 @@ struct mluOpAdamWStruct {
bool use_nesterov = false;
};

- mluOpStatus_t MLUOP_WIN_API MLUUnionKernelApplyAdamW(
+ mluOpStatus_t MLUOP_WIN_API KernelApplyAdamW(
const cnrtDim3_t k_dim, const cnrtFunctionType_t k_type,
const cnrtQueue_t queue, void *param, void *param_h, void *grad,
void *momentum, void *velocity, float lr, float beta1, float beta2,