Computation of compression parameters via OpenVINO models #2727
base: develop
Conversation
@lru_cache(None)
def log_once(level: int, message: str) -> None:
NNCF already has a solution for single logging with DuplicateFilter:
nncf/nncf/torch/quantization/algo.py, line 627 in d90d285:
dup_filter = DuplicateFilter()  # so that the overflow fix warning is only logged once
Thanks for the suggestion!
With the current approach the given message is logged exactly once. The problem is that to achieve the same behavior with DuplicateFilter, it would have to be applied at a very high level, e.g. before the apply() method. That is not a good idea, because there may be log messages which we would like to be logged multiple times during the algorithm's run time.
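For reference, a minimal sketch of the lru_cache-based approach being discussed, assuming the quoted signature; the logger import and the example message are assumptions, not taken from the PR:

```python
import logging
from functools import lru_cache

from nncf.common.logging import nncf_logger  # assumed logger location


@lru_cache(None)
def log_once(level: int, message: str) -> None:
    # Each distinct (level, message) pair is cached after the first call,
    # so the log statement executes exactly once per unique message.
    nncf_logger.log(level, message)


log_once(logging.WARNING, "BF16 constants were cast to FP32.")  # illustrative message, logged
log_once(logging.WARNING, "BF16 constants were cast to FP32.")  # skipped: result served from cache
```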
        return item in self._cache


def cache_results(cache: ResultsCacheContainer) -> Callable:  # type: ignore
It looks like you implemented a general solution for function output caching based on memoization techniques. functools already has such an implementation: https://docs.python.org/dev/library/functools.html#functools.cache. What do you think about using it?
The implemented cache_results decorator has some advantages over lru_cache:
- There is access to the cache object, which is helpful for clearing the cache if needed.
- It allows disabling caching on demand (via the disable_caching argument; see the sketch below).
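A minimal sketch of what such a decorator could look like, assuming a simple dict-backed container; the actual ResultsCacheContainer and cache_results in this PR may differ in details:

```python
from functools import wraps
from typing import Any, Callable, Dict, Tuple


class ResultsCacheContainer:
    """Hypothetical dict-backed cache with explicit clear/disable controls."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[Any, ...], Any] = {}
        self.enabled = True

    def clear(self) -> None:
        self._cache.clear()

    def __contains__(self, item: Tuple[Any, ...]) -> bool:
        return item in self._cache

    def __getitem__(self, item: Tuple[Any, ...]) -> Any:
        return self._cache[item]

    def __setitem__(self, item: Tuple[Any, ...], value: Any) -> None:
        self._cache[item] = value


def cache_results(cache: ResultsCacheContainer) -> Callable:
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, disable_caching: bool = False, **kwargs: Any) -> Any:
            # Bypass the cache entirely when caching is disabled for this call.
            if disable_caching or not cache.enabled:
                return func(*args, **kwargs)
            # Assumes hashable positional arguments and string keyword names.
            key = (func.__name__, args, tuple(sorted(kwargs.items())))
            if key in cache:
                return cache[key]
            result = func(*args, **kwargs)
            cache[key] = result
            return result

        return wrapper

    return decorator
```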
@@ -107,16 +110,17 @@ def cnt_if_op(model: ov.Model, cnt: int) -> int:
    return cnt_if_op(model, 0)


-def get_const_value(const_node: ov.Node) -> np.ndarray:
+def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: Optional[bool] = True) -> np.ndarray:
Suggested change:
-def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: Optional[bool] = True) -> np.ndarray:
+def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: bool = True) -> np.ndarray:
The suggestion is not clear. The argument is still optional, isn't it?
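A short illustration of the distinction the suggestion likely refers to (this reading is inferred, not stated in the thread): in a type annotation, Optional[bool] means "bool or None", which is independent of whether the argument has a default value:

```python
from typing import Optional


def f(cast_bf16_to_fp32: Optional[bool] = True) -> None:
    """Annotation declares that None is an accepted value, i.e. Union[bool, None]."""


def g(cast_bf16_to_fp32: bool = True) -> None:
    """Still optional at the call site thanks to the default, but None is not an accepted value."""


f(None)  # accepted by type checkers
g(None)  # flagged by type checkers, even though it runs at runtime
```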
@@ -40,12 +40,23 @@ def num_bits(self):
        """
        return 8 if self.mode in [CompressWeightsMode.INT8_SYM, CompressWeightsMode.INT8_ASYM] else 4

    @property
    def is_int_asym(self):
Suggested change:
-    def is_int_asym(self):
+    def is_asymmetric_mode(self):
Would is_asym_mode do? The current name is quite short, which is helpful because it is often used inside different complex conditions.
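A small sketch of the kind of property under discussion; the stand-in class and the exact set of asymmetric modes are assumptions for illustration, not taken from the diff:

```python
from dataclasses import dataclass

from nncf import CompressWeightsMode


@dataclass
class _WeightCompressionConfigSketch:
    """Hypothetical stand-in for the config class touched by this diff."""

    mode: CompressWeightsMode

    @property
    def is_asym_mode(self) -> bool:
        # Assumed: the asymmetric integer compression modes.
        return self.mode in (CompressWeightsMode.INT8_ASYM, CompressWeightsMode.INT4_ASYM)


config = _WeightCompressionConfigSketch(mode=CompressWeightsMode.INT4_ASYM)
assert config.is_asym_mode  # a short name keeps complex conditions readable
```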
# Infer the model
inputs = [inp.data for inp in inputs]
if ov_model_params.return_ov_tensors:
    infer_request = compiled_model.create_infer_request()
If you use the cache, I believe you can cache the infer request to avoid creating an instance on every call. Did you try it?
I've tried it now and observe no difference in time or memory.
# Infer the model
inputs = [inp.data for inp in inputs]
if ov_model_params.return_ov_tensors:
Could you briefly explain why you use different APIs for model inference, such as model(input) and an infer request? Is there any advantage to this?
This was not on purpose. I changed it so that inference is done via an infer request in both scenarios.
Update: after this change I actually noticed increased compression time and peak memory, so there is some difference. I reverted this back to run inference through the __call__ method when possible.
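For reference, a minimal sketch of the two OpenVINO inference paths being compared here; the toy model is illustrative and not the PR's generated compression model:

```python
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as opset

# A trivial compiled model standing in for the generated compression model.
x = opset.parameter([1, 4], dtype=np.float32, name="x")
model = ov.Model([opset.multiply(x, x)], [x], "square")
compiled_model = ov.compile_model(model, "CPU")

inputs = [np.ones((1, 4), dtype=np.float32)]

# Path 1: call the compiled model directly (__call__); results come back as numpy arrays.
out_np = compiled_model(inputs)[0]

# Path 2: explicit infer request; outputs can be retrieved as ov.Tensor objects,
# which matters when OV tensors should be returned instead of numpy arrays.
infer_request = compiled_model.create_infer_request()
infer_request.infer(inputs)
out_tensor = infer_request.get_output_tensor(0)  # ov.Tensor
```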
@@ -0,0 +1,519 @@
# Copyright (c) 2024 Intel Corporation
Do you think this approach offers a good opportunity for extension? I mean, if a developer wants to add a new function, what should they implement?
I proposed a more easily extendable approach in the last round of review, but after a discussion we decided to turn it down. As a reminder, the reasons were:
- The main contribution of the current PR from a development perspective is the addition of the implementation of OV graphs for compressed computation.
- Implementing a more extendable approach requires quite a lot of additional logic, e.g. a dispatcher implementation. At this point there is no guarantee that the newly added approach will be extended, so this logic is not yet required.
- If there is a need for extension in the future, a separate PR for it can always be created.
Let's have a discussion offline if in your opinion the situation has changed.
    compressed_weights = calculate_quantized_weight(weight, config, scale, zero_point)
    return compressed_weights, scale, zero_point

from nncf.quantization.algorithms.weight_compression.openvino_modeling import OVModelParameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My opinion is that a developer has to write a ton of code to use a function powered by OpenVINO, and I assume that in this form few people will use it. You should think about how to simplify it.
Compression and decompression are quite complex functions that may have different signatures depending on the availability of certain input parameters. This is mainly the reason behind the amount of code. A simpler function would have a shorter definition and usage.
Also, decomposing the different backend implementations would definitely help, but this discussion leads to #2727 (comment).
Changes
- weight_lowering.py:
  - do_int_quantization() is used for computing a compressed weight. Possible signatures:
    - weight -> compressed_weight, scale, (zero_point for asymmetric compression)
    - weight, scale, (zero_point) -> compressed_weight, scale, (zero_point)
  - calculate_quantized_dequantized_weight() is used for computing a decompressed weight. Possible signatures:
    - weight -> decompressed_weight
    - weight, scale, (zero_point) -> decompressed_weight
    - weight -> decompressed_weight, compressed_weight, scale, (zero_point)
    - weight, scale, (zero_point) -> decompressed_weight, compressed_weight, scale, (zero_point)
  - The returned scale and zero_point are the same as the ones given as input (if they were given at all).
- A new NNCF Tensor backend based on openvino.Tensor. The implementation for this backend is limited to only the required functionality, e.g. addition of OV Tensors is not supported because it is not needed.
- Support for bf16, u4 and i4 data types. For example, bf16 constants are read from an OpenVINO LLM and given as inputs to a compressing OpenVINO model; u4 and i4 compressed weights are seamlessly inserted into the resulting compressed OpenVINO model.
- Added an as_numpy_tensor() method to convert an NNCF Tensor to the numpy backend. Currently only the OV -> NP conversion is required.
Measured scenarios: data-free asymmetric compression, data-free symmetric compression, data-aware compression.

Reason for changes
Reducing model compression time. Only the OpenVINO model compression backend is affected.
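For context, a minimal runnable example of the user-facing entry point whose OpenVINO backend this PR speeds up; the toy model below is illustrative and not part of the PR:

```python
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as opset

import nncf

# Toy OpenVINO model: y = x @ W with a 64x64 FP32 weight constant.
x = opset.parameter([1, 64], dtype=np.float32, name="x")
weight = opset.constant(np.random.rand(64, 64).astype(np.float32))
y = opset.matmul(x, weight, transpose_a=False, transpose_b=False)
model = ov.Model([y], [x], "toy_linear")

# Data-free weight compression (INT8 asymmetric by default). With this PR, the
# compression parameters on the OpenVINO backend are computed via auxiliary OV models.
compressed_model = nncf.compress_weights(model)
# Other modes, e.g. nncf.CompressWeightsMode.INT4_SYM, are selected via the `mode` argument.
```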
Related tickets
139047

Tests
- tests/openvino/native/quantization/test_ov_modeling_compression.py::test_quantization_alignment -- checks alignment with the reference numpy implementation
- tests/openvino/native/test_openvino_modeling.py -- checks OV modeling framework hyperparameters
- tests/openvino/native/test_tensor.py -- NNCF OV Tensor backend tests

Validation jobs:
- NNCF/job/manual/job/post_training_weight_compression/286/