Consultation on ratio and quality

by savvadesogle

Hello. Could you please clarify the following points:

  1. Am I correct in understanding that the model with a ratio of 0.8 is quantized with 80% of layers in int4 and 20% of layers in int8?
  2. Why was this ratio chosen? And have you measured the perplexity compared to the baseline model?
  3. Does it make sense to re-quantize the model now with the new version of Optimum to improve performance, or will there be no difference?
Reply from the OpenVINO Toolkit org:

Hi @savvadesogle

Thanks for your questions! Please note that this model was quantized using NNCF (https://github.com/openvinotoolkit/nncf), which is a compression tool for OpenVINO models.

  1. Not exactly. By default, NNCF quantizes all layers to int4 except for embeddings, convolutions, and the last linear layer, which are kept in int8 - this corresponds to ratio=1. When ratio < 1, NNCF uses a sensitivity metric to pick the int4-eligible layers whose quantization hurts accuracy the most and keeps them in int8 instead, so ratio=0.8 means roughly 80% of the int4-eligible weights end up in int4 and the remaining 20% in int8 (see the sketch after this list). You can find more details in https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md
  2. The ratio was chosen as the best trade-off between accuracy and performance: in this particular case, a ratio above 0.8 results in a significant accuracy drop. When we measure accuracy, we usually use several benchmarks and metrics, including perplexity (a rough sketch of a perplexity check follows below). You can always try different values of the compression parameters and pick the ones that work best for your use case.
  3. If you re-quantize with the same parameter values, most likely there will be no difference.
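
As a concrete illustration of points 1 and 3, here is a minimal sketch of how such a model can be compressed with NNCF. The file paths and the group_size value are assumptions for illustration, not necessarily what was used for this model:

```python
import nncf
import openvino as ov

core = ov.Core()
# Placeholder path to the OpenVINO IR of the original (uncompressed) model.
model = core.read_model("openvino_model.xml")

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,  # baseline precision for most layers
    ratio=0.8,       # ~80% of int4-eligible weights in int4, the rest in int8
    group_size=128,  # assumed group size; the actual value may differ
)

ov.save_model(compressed_model, "openvino_model_int4.xml")
```

The same can be done through optimum-intel by passing OVWeightQuantizationConfig(bits=4, ratio=0.8) to OVModelForCausalLM.from_pretrained, which calls NNCF under the hood.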

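And for point 2, a rough sketch of how perplexity can be checked for a compressed model. The model ID and evaluation text are placeholders; real evaluations use a proper benchmark corpus and a harness such as lm-evaluation-harness:

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_id = "OpenVINO/your-model-id"  # placeholder model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

text = "Some representative evaluation text ..."  # placeholder corpus
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits

# Shift so that each position predicts the next token, average the
# cross-entropy over the sequence, and exponentiate to get perplexity.
shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
shift_labels = enc["input_ids"][:, 1:].reshape(-1)
loss = torch.nn.functional.cross_entropy(shift_logits, shift_labels)
print("perplexity:", torch.exp(loss).item())
```

Running this once on the int4 model and once on the baseline gives a quick first-order comparison.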