Computation of compression parameters via OpenVINO models #2727

Open · wants to merge 94 commits into develop from compress-via-openvino
Conversation

@nikita-savelyevv (Collaborator) commented Jun 11, 2024

Changes

  • Implemented OpenVINO model graphs which are used to calculate compressed and decompressed weights. Since these models are compiled, the calculation becomes significantly faster, especially for larger models and int4 compression.
  • This functionality is exposed through two methods in weight_lowering.py (see the usage sketch after this list):
    • do_int_quantization() is used for computing a compressed weight. Possible signatures:
      • weight -> compressed_weight, scale, (zero_point for asymmetric compression)
      • weight, scale, (zero_point) -> compressed_weight, scale, (zero_point)
    • calculate_quantized_dequantized_weight() is used for computing a decompressed weight. Possible signatures:
      • weight -> decompressed_weight
      • weight, scale, (zero_point) -> decompressed_weight
      • weight -> decompressed_weight, compressed_weight, scale, (zero_point)
      • weight, scale, (zero_point) -> decompressed_weight, compressed_weight, scale, (zero_point)
    • Output scale and zero_point are the same as the ones given as input (if they were given at all).
    • Computation is done via OV models only if the openvino package is installed and the input tensors are not torch tensors.
  • Introduced a new NNCF Tensor backend for storing instances of openvino.Tensor. The implementation for this backend is limited to only the required functionality; e.g., addition of OV Tensors is not supported because it is not needed.
    • The introduction of OV Tensors is required for seamless handling of tensors in the bf16, u4 and i4 data types. For example, bf16 constants are read from an OpenVINO LLM and given as inputs to a compressing OpenVINO model, and u4 and i4 compressed weights are seamlessly inserted into the resulting compressed OpenVINO model.
    • Added an as_numpy_tensor() method to convert an NNCF Tensor to the numpy backend. Currently, only the OV -> NP conversion is required.
  • All calculations are aligned with the reference numpy implementation. Some performance and memory sacrifices had to be made to achieve this alignment.
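A hedged usage sketch of the two entry points described above. The function names come from the PR and the import path follows the package seen elsewhere in this PR; the config/weight objects and keyword names (e.g. precomputed_scale, precomputed_zero_point) are illustrative assumptions, not necessarily the exact API:

    # Sketch only: `weight`, `config` and the keyword names are placeholders for illustration.
    from nncf.quantization.algorithms.weight_compression.weight_lowering import (
        calculate_quantized_dequantized_weight,
        do_int_quantization,
    )

    # weight -> compressed_weight, scale, (zero_point for asymmetric compression)
    compressed_weight, scale, zero_point = do_int_quantization(weight, config)

    # weight, scale, (zero_point) -> compressed_weight, scale, (zero_point);
    # the returned scale/zero_point are the same objects that were passed in.
    compressed_weight, scale, zero_point = do_int_quantization(
        weight, config, precomputed_scale=scale, precomputed_zero_point=zero_point
    )

    # weight -> decompressed_weight (the other listed signatures additionally return the
    # intermediate compressed_weight, scale and zero_point)
    decompressed_weight = calculate_quantized_dequantized_weight(weight, config)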

Data-free asymmetric compression: [results image in the original PR]

Data-free symmetric compression: [results image in the original PR]

Data-aware compression: [results image in the original PR]

Reason for changes

Reducing model compression time. Only the OpenVINO model compression backend is affected.

Related tickets

139047

Tests

  • tests/openvino/native/quantization/test_ov_modeling_compression.py::test_quantization_alignment -- checks alignment with the reference numpy implementation
  • tests/openvino/native/test_openvino_modeling.py -- checks OV modeling framework hyperparameters
  • tests/openvino/native/test_tensor.py -- NNCF OV Tensor backend tests

Validation jobs:

@github-actions bot added the NNCF Common, NNCF OpenVINO and NNCF PTQ labels Jun 11, 2024
@nikita-savelyevv force-pushed the compress-via-openvino branch 4 times, most recently from 55cafaa to a68a63d on July 3, 2024 18:31
@nikita-savelyevv force-pushed the compress-via-openvino branch 4 times, most recently from 6b98ddd to 3d9faa4 on July 16, 2024 14:19
@nikita-savelyevv force-pushed the compress-via-openvino branch 6 times, most recently from 1c85732 to b527cac on September 6, 2024 11:11
@github-actions bot added the documentation label Sep 6, 2024
@nikita-savelyevv force-pushed the compress-via-openvino branch 2 times, most recently from ac3ea02 to 2a3a63c on September 11, 2024 12:59
@nikita-savelyevv force-pushed the compress-via-openvino branch 2 times, most recently from fe30c13 to 19ea412 on October 21, 2024 08:52
@nikita-savelyevv force-pushed the compress-via-openvino branch 3 times, most recently from eef34f8 to ca3447c on October 26, 2024 13:40


@lru_cache(None)
def log_once(level: int, message: str) -> None:
Contributor:
NNCF already has a solution for single logging with DuplicateFilter:

dup_filter = DuplicateFilter() # so that the overflow fix warning is only logged once

Collaborator (Author):
Thanks for the suggestion!

With the current approach, the given message will be logged exactly once. The problem is that to achieve the same behavior with the duplicate filter, it would need to be applied at a very high level, e.g. around the apply() method. That is not a good idea, because there may be some log messages which we would like to be logged multiple times while the algorithm runs.
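For reference, a minimal sketch of the lru_cache-based approach being defended here (assuming nncf_logger from nncf.common.logging is the underlying project logger; not copied verbatim from the PR):

    from functools import lru_cache

    from nncf.common.logging import nncf_logger


    @lru_cache(None)
    def log_once(level: int, message: str) -> None:
        # lru_cache ensures the body runs only once per unique (level, message) pair,
        # so each distinct message is emitted exactly once for the process lifetime,
        # without attaching a filter to the logger at a higher level.
        nncf_logger.log(level, message)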

return item in self._cache


def cache_results(cache: ResultsCacheContainer) -> Callable: # type: ignore
Contributor:
It looks like you implemented a general solution for caching function outputs based on memoization. functools already has such an implementation: https://docs.python.org/dev/library/functools.html#functools.cache. What do you think about using it?

Collaborator (Author):
The implemented cache_results decorator has some advantages over lru_cache (see the sketch after this list):

  • There is access to the cache object. This is helpful for clearing the cache if needed.
  • It allows disabling caching on demand (via the disable_caching argument).
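A rough sketch of what such a decorator could look like, to make the comparison with functools.lru_cache concrete. The names mirror the PR (ResultsCacheContainer, cache_results, disable_caching), but the bodies are assumptions, not the actual implementation:

    from functools import wraps
    from typing import Any, Callable, Dict


    class ResultsCacheContainer:
        # An inspectable cache object: unlike lru_cache's hidden storage, the caller
        # can clear it or check membership directly.
        def __init__(self) -> None:
            self._cache: Dict[Any, Any] = {}

        def clear(self) -> None:
            self._cache.clear()

        def __contains__(self, item: Any) -> bool:
            return item in self._cache

        def __getitem__(self, item: Any) -> Any:
            return self._cache[item]

        def __setitem__(self, item: Any, value: Any) -> None:
            self._cache[item] = value


    def cache_results(cache: ResultsCacheContainer) -> Callable:
        def decorator(func: Callable) -> Callable:
            @wraps(func)
            def wrapper(*args: Any, disable_caching: bool = False, **kwargs: Any) -> Any:
                # Caching can be bypassed per call, which lru_cache does not offer.
                if disable_caching:
                    return func(*args, **kwargs)
                # Assumes hashable arguments, as with lru_cache.
                key = (func.__name__, args, tuple(sorted(kwargs.items())))
                if key in cache:
                    return cache[key]
                result = func(*args, **kwargs)
                cache[key] = result
                return result

            return wrapper

        return decorator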

nncf/openvino/graph/node_utils.py: two review threads (outdated, resolved)
@@ -107,16 +110,17 @@ def cnt_if_op(model: ov.Model, cnt: int) -> int:
return cnt_if_op(model, 0)


def get_const_value(const_node: ov.Node) -> np.ndarray:
def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: Optional[bool] = True) -> np.ndarray:
Contributor:
Suggested change
def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: Optional[bool] = True) -> np.ndarray:
def get_const_value(const_node: ov.Node, cast_bf16_to_fp32: bool = True) -> np.ndarray:

Collaborator (Author):
The suggestion is not clear. The argument is still optional, isn't it?
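For context on the typing nuance behind the suggestion (general Python typing, not specific to this PR): Optional[bool] means "bool or None", which is unrelated to whether the parameter has a default value, so a defaulted flag that never accepts None is more precisely annotated as a plain bool. Illustrative variants:

    from typing import Optional

    # Optional[bool] is shorthand for Union[bool, None]; it tells the caller that None
    # is an accepted value -- it does not mean "parameter with a default".
    def variant_a(const_node, cast_bf16_to_fp32: Optional[bool] = True): ...

    # A defaulted flag that is always a real bool is annotated as a plain bool.
    def variant_b(const_node, cast_bf16_to_fp32: bool = True): ...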

@@ -40,12 +40,23 @@ def num_bits(self):
"""
return 8 if self.mode in [CompressWeightsMode.INT8_SYM, CompressWeightsMode.INT8_ASYM] else 4

@property
def is_int_asym(self):
Contributor:
Suggested change
def is_int_asym(self):
def is_asymmetric_mode(self):

Collaborator (Author), Jan 14, 2025:
Would is_asym_mode do? The current name is quite short, which is helpful because it is often used inside complex conditions.

# Infer the model
inputs = [inp.data for inp in inputs]
if ov_model_params.return_ov_tensors:
infer_request = compiled_model.create_infer_request()
Contributor:
If you use the cache, I believe you can cache the infer request to avoid creating an instance on every call. Did you try it?

Collaborator (Author):
I've tried it now and observe no difference in time/memory
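For reference, the kind of caching being discussed might look like the sketch below (names are hypothetical, not from the PR; as noted above, it showed no measurable time/memory difference in this case):

    # Hypothetical sketch: reuse one infer request per compiled model instead of
    # creating a new one on every call.
    _INFER_REQUEST_CACHE = {}


    def get_infer_request(compiled_model):
        key = id(compiled_model)
        if key not in _INFER_REQUEST_CACHE:
            _INFER_REQUEST_CACHE[key] = compiled_model.create_infer_request()
        return _INFER_REQUEST_CACHE[key]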


# Infer the model
inputs = [inp.data for inp in inputs]
if ov_model_params.return_ov_tensors:
Contributor:
Could you briefly explain why you use different APIs for model inference, such as model(input) and an infer request? Is there any advantage to this?

Collaborator (Author), Jan 14, 2025:
This was not on purpose. I changed it so that inference is done via an infer request in both scenarios.

Update: after this change I actually noticed increased compression time and peak memory, so there is some difference. I reverted this to run inference through the __call__ method when possible.
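For clarity, the two OpenVINO inference paths being compared (a minimal sketch with illustrative variable names; which one is faster evidently depends on the case, as noted above):

    # Path 1: CompiledModel.__call__ -- the compiled model manages its own request
    # internally and returns numpy-backed results.
    results = compiled_model(inputs)

    # Path 2: an explicit infer request -- used when the raw openvino.Tensor outputs
    # need to be retrieved and returned as-is.
    infer_request = compiled_model.create_infer_request()
    infer_request.infer(inputs)
    outputs = [infer_request.get_output_tensor(i) for i in range(len(compiled_model.outputs))]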

@@ -0,0 +1,519 @@
# Copyright (c) 2024 Intel Corporation
Contributor:
Do you think this approach offers a good opportunity for extension? I mean, if a developer wants to add a new function, what would they need to implement?

Collaborator (Author):
I proposed a more easily extendable approach in the last round of review, but after a discussion we decided to turn it down. As a reminder, the reasons were:

  1. The main contribution of the current PR from a development perspective is the addition of OV graph implementations for compression-related computation.
  2. Implementing a more extendable approach requires quite a lot of additional logic, e.g. a dispatcher implementation. At this point, there is no guarantee that the newly added approach will be extended, so this logic is not yet required.
  3. If there is a need for extension in the future, a separate PR can always be created.

Let's discuss offline if, in your opinion, the situation has changed.

compressed_weights = calculate_quantized_weight(weight, config, scale, zero_point)
return compressed_weights, scale, zero_point

from nncf.quantization.algorithms.weight_compression.openvino_modeling import OVModelParameters
Contributor:
My opinion is that a developer has to write a ton of code to use a function powered by OpenVINO, and I assume that in this form few people will use it. You should think about how to simplify it.

Collaborator (Author):
Compression and decompression are quite complex functions that may have different signatures depending on the availability of certain input parameters. This is the main reason for the amount of code. A simpler function would have a shorter definition and usage.

Also, decomposing the different backend implementations would definitely help, but that discussion leads to #2727 (comment).

@github-actions bot added the NNCF TF label Jan 14, 2025
Labels: documentation, NNCF Common, NNCF OpenVINO, NNCF PTQ, NNCF TF