LLMs

Model Quantization, Are You Using Llama to Full Potential?

Aug 12, 2024

5 min read

Model quantization is a technique that has long helped ML developers, whether they are constrained by compute or simply working around an unwieldy architecture. At its core, model quantization converts model weights from high-precision data types to lower-precision ones to ease computation and deployment.

In this blog, we will explore how model quantization might be the way forward for a tricky LLM deployment, helping you strike a healthy balance between performance and budget when deploying larger models such as Llama 3.1 405B.

What is Quantization?

Stepping back a bit, model quantization is the process of mapping values from a continuous (effectively infinite) range to a smaller set of discrete values, easing the computation involved in training and deploying machine learning models. In the context of LLMs, we are simply converting the model’s weights from high-precision data types to lower-precision ones.

LLMs, at the end of the day, are neural networks whose weights are stored as huge arrays of numbers in formats such as Float64, Float32, Float16, and more. The chosen data type determines how many “digits” each weight keeps in memory, which directly impacts the memory and compute required to run the model and, in turn, how cheap the hardware can be.
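To make the impact concrete, here is a rough back-of-the-envelope sketch of how much memory the weights of Llama 3.1 405B take at different precisions. It only counts the weights and ignores activations, the KV cache, and framework overhead:

```python
# Approximate weight memory for a 405B-parameter model at different precisions.
# Ignores activations, the KV cache, and framework overhead.
params = 405e9

for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
    gigabytes = params * bytes_per_param / 1e9
    print(f"{dtype:8s} ~ {gigabytes:,.0f} GB")

# float32 ~ 1,620 GB, float16 ~ 810 GB, int8 ~ 405 GB, int4 ~ 202 GB
```

Going from Float16 to 4-bit cuts the weight footprint by roughly 4x, which is exactly what makes single-node deployment of very large models plausible.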

How does Quantization Work?

The steps of quantization generally remain the same no matter which model you apply it to, so let us walk through the broad steps before covering the specific techniques used for large language models:

1. Scaling: The first step scales the range of the original values into the range representable by the lower-precision format. This is achieved with a scale factor that maps the high-precision values onto the lower-precision range. (Note: this is where most of the model’s precision is lost.)

2. Rounding: The scaled values are then rounded to the nearest value that can be represented at the lower precision. This step introduces some rounding error, but careful choice of the scale factor keeps it small.

3. Mapping: Lastly, the rounded values are mapped to the discrete set of values the lower-precision format permits. For example, in 8-bit quantization each floating-point value is mapped to one of 256 possible integer values. A small numeric sketch of these three steps follows below.
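As an illustration, here is a minimal NumPy sketch of symmetric 8-bit quantization of a weight tensor. The weight values are made up, but the scale-round-map steps mirror the list above:

```python
import numpy as np

# Made-up high-precision weights standing in for one layer of a model
weights = np.array([-0.62, -0.10, 0.03, 0.48, 0.91], dtype=np.float32)

# 1. Scaling: pick a scale factor that maps the largest magnitude onto the int8 range
scale = np.abs(weights).max() / 127.0

# 2. Rounding: divide by the scale and round to the nearest integer
quantized = np.round(weights / scale).astype(np.int8)   # values in [-127, 127]

# 3. Mapping back (dequantization) when the weights are used at inference time
dequantized = quantized.astype(np.float32) * scale

print(quantized)               # e.g. [-87 -14   4  67 127]
print(dequantized - weights)   # the small precision error introduced by rounding
```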

Now let us dive deeper into the specific methods of Quantization for LLMs.

Different Types of Quantization

Quantization can be classified into two main types: Post-Training Quantization and Quantization-Aware Training. Both come with their own set of pros and cons, which we will shed light on to help you make better decisions.

Beyond these two types, another factor that significantly influences model performance is the choice between 4-bit, 6-bit, and 8-bit quantization, which also shapes the specific technique we use; more on this later.

Post-Training Quantization (PTQ)

PTQ is performed after the model has been fully trained. It converts the weights (and sometimes activations) of the pre-trained model to lower-precision values without retraining, drastically reducing the memory taken up by the model’s parameters. A minimal sketch follows the pros and cons below.

Pros: Straightforward to implement and requires no extra training time or cost.

Cons: The method is known to degrade model performance.
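To give a feel for the workflow, here is a minimal PyTorch sketch of post-training dynamic quantization applied to a toy network. The layer sizes are made up; PTQ on a real LLM is usually done through libraries such as AutoGPTQ or bitsandbytes, covered later:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained model; in practice this would be a full LLM
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
model.eval()

# Post-training dynamic quantization: Linear weights are converted to int8
# after training, with no retraining required.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.no_grad():
    print(quantized_model(x).shape)   # same interface, smaller int8 weights
```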

Quantization-Aware Training (QAT)

QAT is performed alongside the training of the LLM. It simulates the effects of quantization during training so that the model becomes more robust to running inference at lower precision. A minimal sketch follows the pros and cons below.

Pros: The model typically performs better than with PTQ because it learns to compensate for the lower precision.

Cons: More expensive and requires longer training time.
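For intuition, here is a minimal PyTorch sketch of quantization-aware training on a toy model. The layers and training loop are placeholders; QAT on a full LLM is far more involved:

```python
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

# Toy stand-in for the network being trained
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
model.train()
model.qconfig = get_default_qat_qconfig("fbgemm")

qat_model = prepare_qat(model)   # inserts fake-quantization ops so training "feels" int8

# ... run the normal training loop here; the model learns around quantization noise ...

qat_model.eval()
int8_model = convert(qat_model)  # produce the real int8 model once training is done
```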

Different Quantization Techniques

Building on the two types above, we will now discuss the most widely used quantization techniques for LLMs. Applied with a careful understanding of their parameters and the precision they assign to the weights, these techniques can help models of every size.

Head over to Dev.to to check out code samples and start your model quantization today with Tune Studio using GPTQ, which comes built into our deployed models!

Disclaimer: We will not be covering techniques such as SLoRA and QLoRA, as we consider them more about finetuning the model than quantization. However, you can check out our article on SLoRA to read more!

GGML

GGML (named after its creator, Georgi Gerganov) is a versatile tensor library and quantization format focused on reducing computational complexity and memory footprint, making models capable of running on lower-end hardware such as CPUs and GPUs with less than 8 GB of VRAM.

This method, however, may introduce some performance overhead due to the complexity of handling its graph-based operations.

Some popular methods for the implementation are:

1. Static Quantization GGML: Quantizes weights and activations to a fixed precision before inference.

2. Dynamic Quantization GGML: Dynamically quantizes the activations during inference, keeping the weights statically quantized. A short usage sketch follows below.
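As a usage sketch, loading and running such a model looks roughly like this, assuming llama-cpp-python is installed and you already have a quantized GGUF file (the successor to the original GGML format); the file name below is hypothetical:

```python
from llama_cpp import Llama

# Hypothetical path to a 4-bit quantized GGUF model produced with llama.cpp
llm = Llama(model_path="./llama-3.1-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

output = llm("Explain model quantization in one sentence:", max_tokens=64)
print(output["choices"][0]["text"])
```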

GPTQ

GPTQ (short for Generative Pre-trained Transformer Quantization) is one of the most popular quantization techniques, designed to handle the high-dimensional weight matrices found in LLMs. It accounts for a large share of the quantized LLMs available today and enjoys extensive library support through tools such as AutoGPTQ.

What GPTQ gains in precision control, however, it pays for with a resource-hungry quantization process. Because that process is long and expensive, keeping a keen eye on the quantization parameters is key; you want them right before kicking it off.

Some popular methods/libraries for the implementation are:

1. AutoGPTQ: An extensive library designed to automate the process of quantizing transformer-based large language models such as the GPT family. A quantization sketch follows below.
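For reference, here is a minimal sketch of 4-bit GPTQ quantization through the Hugging Face transformers integration. It assumes the optimum and auto-gptq backends are installed, a GPU is available, and you have access to the model id shown (an assumption; any causal LM works); quantizing a genuinely large model takes significant time and memory:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization driven by a small calibration dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

model.save_pretrained("llama-3.1-8b-instruct-gptq-4bit")
tokenizer.save_pretrained("llama-3.1-8b-instruct-gptq-4bit")
```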

NF4

NF4, short for 4-bit NormalFloat, is a specialized 4-bit data type designed to minimize the performance impact of quantization while maximizing efficiency. It has proven to work exceptionally well with extra-large language models of more than 50B parameters, bringing the hardware requirement for inference down from a cluster of A100s to, in some cases, a single A100 GPU.

Unlike GGML, with which you may run inference on CPUs, this technique is specifically designed for quantizing and running models on GPUs.

Some popular methods/libraries for the implementation are:

1. bitsandbytes: An open-source library for quantizing LLMs, supporting both 8-bit and 4-bit (including NF4) loading.

2. Symmetric Quantization: Uses a symmetric range around zero for quantizing weights and activations.

3. Asymmetric Quantization: Allows different ranges for positive and negative values for added flexibility. A loading sketch follows below.
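As a reference, here is a minimal sketch of loading a model in NF4 through the transformers and bitsandbytes integration. The model id is an assumption, and a CUDA GPU with enough VRAM is required:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"   # assumed model id; any causal LM works

# 4-bit NF4 loading via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # the NormalFloat 4-bit data type
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
)
```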

Conclusion

In this blog, we explored model quantization, how to go about implementing these techniques, and the relevant resources for doing so.

We invite you to stay “Tuned” for more tutorials around quantization, a detailed explanation of techniques such as RTN, AWQ, and SmoothQuant, and how you can also deploy quantized extra-large language models within minutes!

Written by

Aryan Kargwal

Data Evangelist

Edited by

Abhishek Mishra

DevRel Engineer