How do we compare against other toolkits #59
How do you test DNNL performance?

@pengzhao-intel we used this:

Thanks for the information @XapaJIaMnu

@pengzhao-intel to give you some background about this particular benchmark:

Is the lossless-ness of

All intgemm operations are packed (though the formats are not necessarily the same). "Shifted" refers to adding a constant to work around Intel's unsigned * signed instruction.
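To make the shifted correction concrete, here is a scalar sketch of the arithmetic. It is illustrative only, not intgemm's API or kernel: the function name `shifted_dot` is hypothetical, and real kernels vectorize this with SIMD unsigned * signed multiplies.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of the "shifted" trick (hypothetical function, not intgemm's
// API). Shifting a by +128 makes it unsigned, so the hardware's
// unsigned * signed multiply applies:
//   (a + 128) * b = a * b + 128 * b
// The correction 128 * sum(b) depends only on b (the weights), so it can be
// precomputed once and folded into the bias term.
int32_t shifted_dot(const int8_t* a, const int8_t* b, std::size_t k) {
  int32_t acc = 0;
  int32_t correction = 0;
  for (std::size_t i = 0; i < k; ++i) {
    uint8_t a_shifted = static_cast<uint8_t>(a[i] + 128);  // now in [0, 255]
    acc += static_cast<int32_t>(a_shifted) * b[i];
    correction += 128 * b[i];  // precomputable: depends only on b
  }
  return acc - correction;  // equals the true signed dot product
}
```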
This pull request adds wrappers to the intgemm matrix multiplication library: https://github.com/kpu/intgemm . A performance comparison with DNNL (aka MKL-DNN) is at kpu/intgemm#59. The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The purpose of this issue is to add similar functionality to Sockeye: awslabs/sockeye#771 . Quantized Sockeye runs 2.95x as fast. One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.

intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction on every multiply by exploiting the fact that most neural network parameters are near 0.

Because x86 only offers an unsigned * signed instruction and most people want signed * signed, there are two strategies one can take:

1. Add 128 to the data so it is unsigned. But that biases the output. DNNL calculates this bias on the fly by summing weights, then subtracts it out during GEMM. intgemm calculates this bias in advance, so it can be subtracted from the bias term with no runtime overhead. A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
2. Emulate signed * signed by normalizing the sign bit into the second argument. This requires extra instructions in the hot loop but keeps the accumulator small, so it's less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2 (a scalar sketch of it follows below). Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.
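A scalar sketch of strategy 2, sign normalization. Again, this is illustrative rather than intgemm's kernel; `sign_normalized_dot` is a hypothetical name, and on real hardware the idea maps naturally onto SSSE3's pabsb/psignb instructions.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of strategy 2 (sign normalization); hypothetical function,
// not intgemm's kernel. The sign of a moves onto b:
//   a * b == |a| * (sign(a) * b)
// so the hardware sees an unsigned |a| and a signed second operand. Products
// keep the same magnitude as the true signed products, so a 16-bit
// accumulator saturates no sooner than plain signed * signed would.
// Assumes b[i] != -128 (quantizers typically clip to [-127, 127]).
int32_t sign_normalized_dot(const int8_t* a, const int8_t* b, std::size_t k) {
  int32_t acc = 0;
  for (std::size_t i = 0; i < k; ++i) {
    uint8_t a_mag = static_cast<uint8_t>(a[i] < 0 ? -a[i] : a[i]);  // |a|
    int8_t b_adj = static_cast<int8_t>(a[i] < 0 ? -b[i] : b[i]);    // sign(a)*b
    acc += static_cast<int32_t>(a_mag) * b_adj;
  }
  return acc;
}
```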
Benchmark results were posted for three targets (tables omitted here):
- SSSE3 (tested on a Mac)
- AVX2 (tested on my laptop)
- AVX512VNNI
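The backends above are chosen at runtime. Here is a minimal sketch of CPUID-based dispatch using GCC/Clang builtins; intgemm's actual dispatch mechanism differs, and the kernel names are hypothetical stand-ins for real per-ISA GEMM backends.

```cpp
#include <cstdio>

// Hypothetical per-ISA kernels standing in for real GEMM backends.
void gemm_ssse3()      { std::puts("SSSE3 backend"); }
void gemm_avx2()       { std::puts("AVX2 backend"); }
void gemm_avx512bw()   { std::puts("AVX512BW backend"); }
void gemm_avx512vnni() { std::puts("AVX512VNNI backend"); }

// Pick the widest instruction set the CPU reports, checked once at startup.
void (*select_backend())() {
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx512vnni")) return gemm_avx512vnni;
  if (__builtin_cpu_supports("avx512bw"))   return gemm_avx512bw;
  if (__builtin_cpu_supports("avx2"))       return gemm_avx2;
  return gemm_ssse3;  // SSSE3 baseline assumed present
}

int main() { select_backend()(); }
```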
Marian uses fbgemm Packed, which does unsigned * signed and unquantizes to floats after. We should aim for those numbers. For comparison, use https://github.com/XapaJIaMnu/gemmbench
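For context on "unquantizes to floats after", here is a small sketch of the quantize, integer-multiply, unquantize round trip. The scale choice and rounding here are illustrative assumptions, not fbgemm's or intgemm's exact scheme.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Sketch of quantize -> integer multiply -> unquantize-to-float. The scale
// (assumed value range [-2, 2]) and round-to-nearest are illustrative
// assumptions, not any particular library's scheme.
int main() {
  const float a = 0.37f, b = -0.82f;
  const float scale = 127.0f / 2.0f;  // map [-2, 2] onto int8 range
  const int8_t qa = static_cast<int8_t>(std::lround(a * scale));
  const int8_t qb = static_cast<int8_t>(std::lround(b * scale));
  const int32_t qprod = static_cast<int32_t>(qa) * qb;  // integer GEMM core
  const float unquantized = qprod / (scale * scale);    // back to float
  std::printf("exact %.4f  quantized %.4f\n",
              static_cast<double>(a * b), static_cast<double>(unquantized));
  return 0;
}
```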