ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend #6869
Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster. This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait! |
Thanks for your comment. This PR is a very early implementation and could be a good starting point for a Qualcomm QNN backend for GGML. It would be better if domain experts from Qualcomm got involved in this effort after it is accepted by the community. I personally think this PR is also an example of the GGML way: try crazy ideas, build wild demos, and push the edge of what’s possible. One more thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to make it easier for community developers to participate in developing and verifying the QNN backend. |
Yes, it would be useful to have an example or instructions how to run this. In the meantime, simply setting up the |
Thanks for your guidance. I'll study how to use test-backend-ops.cpp to validate the QNN backend. |
You would need to modify Line 411 in 5477041
|
Thanks for your help, it's really helpful. I'm working on adapting test-backend-ops.cpp to the QNN backend on Android. |
@ggerganov, @slaren, I'm sorry to interrupt you. Adapting test-backend-ops.cpp to the QNN backend is done, and it works as expected on a Xiaomi 14 (Qualcomm SM8650-AB Snapdragon 8 Gen 3). Could you take a moment to look at it? Thanks. BTW, the design and implementation of test-backend-ops.cpp is really excellent; I had never noticed this file/feature before. Also, should README-qnn.md be removed? |
This review comment is very useful and I have modified the code accordingly.
Thank you very much.
qnn_instance * instance = nullptr;
std::string graph_name = "ggml_op_qnn_add";
Qnn_GraphHandle_t graph_handle = nullptr;
Qnn_Tensor_t * tensor_0 = nullptr;
I created a PR on your fork to simplify the binding from Qnn_Tensor_t to ggml_tensor; please have a look if you have time: zhouwg#2
 * mul_mat_f16_f32: src0 is F16 and src1 is F32.
 * mul_mat_q_f32: src0 is quantized (Q4_0, Q4_1, ...), and src1 is F32.
 */
static void ggml_qnn_mul_mat(ggml_backend_qnn_context * ctx,
It looks like graphExecute failed with error 6004. Maybe we can use it to find the root cause here.
To reproduce, you could use my patch, which initializes the test tensor with constant values:
llama.cpp-5e18cdc-init the test array with const values.patch
It only changes the tensor initialization in the unit test so that the failure is easier to reproduce.
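A minimal sketch of why constant inputs help with reproduction, assuming plain host-side float buffers rather than real ggml tensors. The helpers `make_const_tensor`, `reference_add`, and `count_mismatches` are hypothetical names for illustration and are not part of the patch or of test-backend-ops.cpp. With constant inputs, every output element of an element-wise add has the same known value, so any element the backend gets wrong stands out immediately.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: a host-side buffer of n elements, all set to `value`.
std::vector<float> make_const_tensor(std::size_t n, float value) {
    return std::vector<float>(n, value);
}

// CPU reference for element-wise add. Assumes a and b have the same size.
std::vector<float> reference_add(const std::vector<float> & a,
                                 const std::vector<float> & b) {
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Count the elements that differ from the single expected value by more than tol.
std::size_t count_mismatches(const std::vector<float> & got,
                             float expected, float tol) {
    std::size_t bad = 0;
    for (float v : got) {
        if (std::fabs(v - expected) > tol) {
            ++bad;
        }
    }
    return bad;
}
```

In a real test, the buffer returned by the backend under test would be passed to `count_mismatches` in place of the CPU reference output.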
Problem 1: I tried to build in Termux. Problem 2: the QNN SDK cannot be obtained without an account. |
GGML_CALL static bool ggml_backend_qnn_offload_op(ggml_backend_t backend,const ggml_tensor * tensor) {
    ggml_backend_qnn_context * ctx = (ggml_backend_qnn_context *) backend->context;

    return ggml_qnn_compute_forward(ctx, nullptr, (ggml_tensor *) tensor);
}
This function only needs to return true or false, but it must not execute the operation. The purpose of this function is to determine if an operation should be executed in this backend, even if it would require copying weights to the backend memory. As it is, this will either prevent the backend from working entirely, or it will cause many operations to be run twice.
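A minimal sketch of the separation described in this review comment, using hypothetical stand-in types (`sketch_tensor`, `sketch_backend_context`) so that it compiles without ggml or the QNN SDK. None of these names come from the PR. The batch-size threshold is an assumed heuristic for deciding whether offloading is worth the weight copy; it is not something the comment prescribes.

```cpp
#include <cstdint>

// Hypothetical stand-ins for ggml types, so the sketch compiles on its own.
enum class sketch_op { ADD, MUL, MUL_MAT, SOFT_MAX };

struct sketch_tensor {
    sketch_op op;
    int64_t   ne[4]; // number of elements per dimension
};

struct sketch_backend_context {
    int     executed_ops   = 0;  // incremented only when an op actually runs
    int64_t min_batch_size = 32; // assumed threshold, for illustration only
};

// Side-effect-free capability check: which ops this backend implements.
bool sketch_supports_op(const sketch_tensor & t) {
    switch (t.op) {
        case sketch_op::ADD:
        case sketch_op::MUL:
        case sketch_op::MUL_MAT:
            return true;
        default:
            return false;
    }
}

// Offload predicate: answers "should this op run here?" and nothing more.
// The context is const, so the predicate cannot change backend state.
bool sketch_offload_op(const sketch_backend_context & ctx, const sketch_tensor & t) {
    return sketch_supports_op(t) && t.ne[1] >= ctx.min_batch_size;
}

// Execution is a separate entry point, called when the graph is computed.
bool sketch_compute_forward(sketch_backend_context & ctx, const sketch_tensor & t) {
    if (!sketch_supports_op(t)) {
        return false;
    }
    ctx.executed_ops++; // stands in for building and running the backend graph
    return true;
}
```

Keeping the predicate `const` with respect to the context is one way to make the "decide, don't execute" contract hard to violate by accident.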
Yeah, I've actually tried to address those comments on my branch. I also created a PR in this fork and asked for a review several days ago, but there has been no response from the original author so far.
I will spend some time on my fork over the next few weeks and add more operators then.
Sad to see such a great PR being blocked by endless complaints. @slaren @chraac Can't we just focus on the correctness of this groundbreaking PR? If this PR produces correct results, then there is NO reason to block it. All other problems should be discussed in new PRs or issues. I hold that this is the very first step we need to take. Then users around the world would have a chance to engage with it and improve its efficiency and the other things you are worried about. |
Hi @ao-zz, thank you for your comment. As for the PR not getting approval, as I said before:
And as you can see in my previous post, there is some work we should do before merging, from my point of view:
I have also created a small refactoring PR on this fork; you can have a look. |
currently is public for everyone. |
This work has finished the SnapDragon CPU/GPU/NPU overhead: |
code not yet released, as you asked last week |
Hi, it looks like this PR has been inactive for a while. I've made some changes on my local fork based on this PR, including:
For anyone interested in this PR, please take a look at my fork. Comments and feedback are appreciated! |
FYI this is exactly what |
please support in termux. |
After a brief review of the |
Hi @myan-o , For problem 1, I propose implementing a CMake parameter that allows users to customize the default path for dependent libraries. For 2, maybe you can refer to this comment:
|
Great work @zhouwg! I worked for Qualcomm until early this year and I am quite used to using Qualcomm AI SDK, and I want to help you to get these things done. I think I can help you implement the to-do items that you have for the QNN backend. Let me catch up on your work in a few weeks about this PR and update more later. |
Hi @yeonseok-zeticai, this branch has been inactive for some time. In recent weeks, I've undertaken some refactoring on my own branch. If you're interested, please have a look. My branch is also based on this PR: |
@chraac, is there any special reason for the inactivity? I can see the work you've done over the past 2 weeks on your branch. I'll catch up on your work as well. |
Sorry for misleading, when I mentioned My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring |
Thanks so much, and thanks that a real QNN expert is joining.
This PR cannot be accepted by the maintainer/author of the ggml backend subsystem, although I begged for PR approval again and again before 06/15/2024. I understand this decision given the above reasons. However, I'm not sure whether there is a double standard in how PR approval is considered, although I sincerely thank the maintainer/author of the ggml backend subsystem for the help and have a fully positive opinion of this great, compact, high-performance device-side AI inference framework: One more thing: I felt a little disappointed on 06/15/2024 that the maintainers of this great open-source AI project cannot understand what the GFW brings to programmers/developers in mainland China and have some misunderstandings about what I think about it, although I really love my country: |
Thanks for your help and continued efforts on this PR (BTW, I have read the source code in your personal fork of llama.cpp, although I think putting everything related in one single source file might be a better idea). Your PR in my personal fork of llama.cpp does not make sense: That is why your PR was not merged into my personal fork of llama.cpp. Thanks for your understanding. |
Your test-backend-ops.cpp is good and well designed, but it is not robust or easy to understand enough, and there is an unknown issue with ggml-qnn.cpp. That is why I provide a standalone, easy-to-understand UT (some code is borrowed from your test-backend-ops.cpp) for ggml-qnn.cpp. Thanks for your understanding. |
As I said before,
Also, you can have a look at my new refactoring branch; next I will utilize the existing |
I do not want to argue this point with you again; please see my opinion in this PR. I am a little surprised by, and thankful for, your continued efforts on this PR (which I personally think are exactly the same as this PR but with more advanced C++ language features).
I'm sorry about that, because I feel great disappointment and have had no positive interest in this PR since 06/15/2024.
|
No worries, your effort on adding the QNN backend won't be wasted. You've done excellent work. I'll continue iterating on my branch, and as this backend garners more attention, we're hopeful it can be integrated into the upstream project in the future. |
@zhouwg Such comments are completely inappropriate. As I already mentioned in #6210 (comment), this will not be tolerated. Therefore I’ve decided to block you from the projects. |
Self Reported Review Complexity
Purpose
Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023 with a market share of 70.1 percent.
Qualcomm is currently the No. 1 mobile SoC semiconductor company in the world (MediaTek's market share was No. 1 in Q1 2024, but I personally think Qualcomm is the real No. 1 mobile SoC vendor). The Hexagon NPU in the Qualcomm Snapdragon 8 Gen 3 was designed for generative AI, delivering 98% faster performance and 40% improved performance-per-watt for sustained AI inferencing, which makes the Hexagon NPU the leading processor for on-device AI inferencing.
QNN(Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:
ggml is a very compact, well-designed, highly optimized, high-performance C/C++ machine learning framework/library. This PR aims to add Qualcomm's QNN backend to ggml and focuses on one question: how to make maximal use of the Hexagon NPU with the well-designed, compact ggml machine learning framework.
Status
The data path works as expected with whisper.cpp and llama.cpp using the QNN backend, verified on both low-end and high-end Android phones based on Qualcomm mobile SoCs.
4x performance gain for GGML_OP_MUL_MAT using the QNN CPU backend with 1 thread on a high-end Android phone equipped with a flagship Qualcomm Snapdragon 8 Gen 3 mobile SoC (released in Oct 2023). The performance of GGML_OP_MUL_MAT could be improved much further with the QNN NPU (aka Hexagon Tensor Processor) backend once we know the secrets (QNN RPC, multithreading in the NPU backend, ...) of Qualcomm's NPU.
A dedicated Android command line program (for the purpose of UT) works as expected on a high-end Android phone equipped with the Qualcomm SM8650-AB Snapdragon 8 Gen 3 and on low-end Android phones equipped with Qualcomm's low-end mobile SoCs (the QNN NPU backend does not work on low-end Qualcomm Android phones).
QNN's RPC feature (which is useful for the QNN NPU (aka HTP/DSP) backend) is used in this PR and works as expected. More than 2 GB of ion memory can be used to offload ggml tensors in the cgraph to the NPU on an Android phone equipped with the Qualcomm Snapdragon 8 Gen 3.
This PR is a Minimum Viable PR: a functional PR in the ggml community. It would be a great help for other community programmers/developers/AI experts to contribute code and ideas to the GGML QNN backend if this PR were approved and merged into the master branch. Together we could reach the final target: make maximal use of the Hexagon NPU with the well-designed, compact ggml machine learning framework. This might be exactly the GGML way in the GGML community.
Todo
Qualcomm's QNN backend for GGML has some todo tasks before it can be used in a real commercial application:
Other GGML ops are not yet implemented using the QNN API. I provide a GENERAL approach to this problem in a standalone PR that refines the ggml backend subsystem to make mixed inference between CPU & GPU / CPU & NPU easy for ANY ggml backend (whose ggml_backend_xxx_buffer_is_host returns true). This approach works as expected with whisper inference and llama inference in my personal ggml learning & study project.
Add support for more quantized data types (an AI expert is needed here)
Performance fine-tuning: the performance of the existing ggml QNN backend is worse than that of the original ggml because some sophisticated Qualcomm-specific technologies are not used in this PR, and the power of Qualcomm's state-of-the-art NPU (Hexagon Tensor Processor) is not yet utilized in this PR (I know the direction but am limited by my knowledge of real/hardcore AI tech). Performance fine-tuning of the ggml QNN NPU backend is a long-term task. The following is an example:
How to verify QNN backend or participate in development activity of GGML QNN backend
I provide a dedicated Android command line program and scripts in this PR for the purpose of UT on Android devices.
A suitable reviewer should be familiar with the source code of ggml and with the Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK or other parts of Qualcomm's AI software stack. Skills in real/hardcore AI tech are a plus (adding more quantized data types and implementing more GGML ops (or kernels) requires that skillset) but are not essential for this PR. Some notes for potential reviewers:
Any GGML community programmer/developer/AI expert who is interested in the GGML QNN backend can use or extend the dedicated Android command line program to verify the GGML QNN backend. Reviews are greatly welcomed and appreciated.