Question 1

What is model quantization and why does it matter?

Accepted Answer

Model quantization reduces the numerical precision of a neural network's weights, and sometimes its activations, from higher-precision formats such as FP32 (32-bit floating point) to lower-precision formats such as INT8 (8-bit integer) or INT4 (4-bit integer). It matters because large language models need a great deal of VRAM. The weights of a 7-billion-parameter model take about 28GB in FP32 (4 bytes per parameter) but only about 3.5GB at INT4 (half a byte per parameter), before counting the KV cache and runtime overhead. That makes it possible to run capable models on consumer GPUs instead of expensive data center hardware. The tradeoff is some loss of model quality, which grows as the bit width shrinks.
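The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough weights-only estimate: the function name `weight_memory_gb` and the nominal count of 7e9 parameters are assumptions made for illustration, and real "7B" models differ slightly in size.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_7b = 7e9  # nominal parameter count (assumption); real models vary

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(params_7b, bits):.1f} GB")
```

The estimate leaves out the KV cache, activations, and any per-group scale metadata a quantized format stores, so actual usage will be somewhat higher.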

Question 2

What is the difference between GPTQ and GGUF quantization?

Accepted Answer

GPTQ and GGUF are both popular ways to run quantized models, but they are not the same kind of thing: GPTQ is a quantization algorithm, while GGUF is a model file format with its own family of quantization schemes. GPTQ is a post-training quantization method for GPT-style models that uses calibration data to minimize accuracy loss. Its kernels run primarily on NVIDIA GPUs using CUDA, and it is optimized for fast GPU inference. GGUF (often expanded as GPT-Generated Unified Format) is the format used by llama.cpp. It supports CPU inference as well as offloading some or all layers to a GPU, which makes it more flexible on consumer hardware. GGUF models come in various quantization levels such as Q4_K_M, Q5_K_M, and Q8_0, each trading size against quality differently. Quality at the same nominal bit width is broadly comparable and depends on the model and the specific quantization type, so the more reliable distinction is that GPTQ targets GPU-only inference while GGUF offers broader hardware compatibility.

Question 3

How much quality do you lose with quantization?

Accepted Answer

Quality loss from quantization depends on the method, bit width, and model size, so the figures here are rough guides that vary by model and benchmark. FP16 and BF16 have negligible quality loss relative to FP32 (under 0.1% perplexity increase) and are considered lossless for practical purposes. INT8 quantization typically shows less than 0.5% perplexity degradation, which is imperceptible in most applications. 4-bit formats such as INT4 and GPTQ 4-bit show roughly a 1.5-2.5% perplexity increase, which may be noticeable in complex reasoning tasks but is acceptable for general use. Larger models tolerate quantization better than smaller ones because they have more redundancy. As a result, a 70B model quantized to INT4 often outperforms a 7B model at FP16, although it still needs more memory (about 35GB versus 14GB for the weights).
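To see why fewer bits cost more accuracy, the sketch below quantizes a list of synthetic weights with plain symmetric round-to-nearest and measures how far the restored values drift from the originals. It is an illustration under stated assumptions, not how GPTQ or the GGUF quantization types work: those use per-group scales, calibration data, or both. The helper names and the Gaussian test weights are made up for this example, and weight error is only a loose proxy for perplexity.

```python
import random

def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    restored = []
    for w in weights:
        q = round(w / scale)              # nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        restored.append(q * scale)
    return restored

def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    restored = quantize_dequantize(weights, bits)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic weights (assumption)

err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit mean abs error: {err8:.6f}")
print(f"4-bit mean abs error: {err4:.6f}")
```

With one shared scale for the whole list, the 4-bit grid has only 15 levels against 255 for 8-bit, so each weight lands further from its nearest level on average.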

Question 4

How do I choose the right quantization level for my hardware?

Accepted Answer

Start by determining your available VRAM, then work backwards to find the highest-quality quantization that fits. As a rule of thumb, the model weights should take no more than 85-90% of available VRAM, leaving the rest for the KV cache and overhead. With 24GB of VRAM (RTX 4090), a 13B model fits at 8-bit (about 13GB of weights) and a model of around 30B fits at Q4 (about 15-17GB). A 13B model at FP16 needs about 26GB and a 70B model at Q4 about 35GB, so neither fits entirely in 24GB. A 70B Q4 model can still run from a GGUF file with some layers offloaded to the GPU and the rest held in system RAM, at reduced speed. For 8GB VRAM (RTX 4060), a 7B model at Q4-Q5 quantization works well. For CPU inference using GGUF, you need enough system RAM to hold the model plus KV cache. Prefer a higher bit width whenever your hardware allows it, since Q5 and Q8 preserve noticeably more quality than Q4 in complex reasoning and code generation tasks.
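The "work backwards from VRAM" procedure can be sketched as a small helper. The function names, the default 85% headroom, and the candidate bit widths are assumptions chosen for illustration. The bit widths are nominal: GGUF types such as Q4_K_M store scale metadata and use somewhat more than their nominal bits per weight, so treat the result as a starting point.

```python
def weights_gb(params_billion, bits):
    """Weights-only size: billions of parameters times bytes per parameter."""
    return params_billion * bits / 8

def pick_bits(params_billion, vram_gb, headroom=0.85, candidates=(16, 8, 5, 4)):
    """Return the largest candidate bit width whose weights fit the budget, or None."""
    budget_gb = vram_gb * headroom        # leave the remainder for KV cache and overhead
    for bits in sorted(candidates, reverse=True):
        if weights_gb(params_billion, bits) <= budget_gb:
            return bits
    return None                           # nothing fits; consider CPU offloading

for params, vram in [(7, 8), (13, 24), (70, 24)]:
    print(f"{params}B on {vram}GB -> {pick_bits(params, vram)}")
```

A result of `None` means even the smallest candidate does not fit in VRAM alone, which is the case where GGUF with partial GPU offloading becomes the practical option.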

Model Quantization Savings Calculator

Formula
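A weights-only estimate consistent with the figures in Question 1 can be written as follows. The symbols are notation chosen here rather than taken from the calculator: N is the parameter count, b is the bits per weight, and the subscripts "src" and "tgt" mark the original and quantized formats. The estimate ignores the KV cache, activations, and quantization metadata.

```latex
M(b) = \frac{N \cdot b}{8}\ \text{bytes},
\qquad
\text{savings} = 1 - \frac{M(b_{\text{tgt}})}{M(b_{\text{src}})}
               = 1 - \frac{b_{\text{tgt}}}{b_{\text{src}}}
```

Going from FP32 to INT4 gives 1 - 4/32 = 87.5% savings, and going from FP16 to a 4-bit format gives 1 - 4/16 = 75%.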

Worked Examples

Example 1: Llama 2 7B: FP32 to INT4

Example 2: Mistral 7B: FP16 to GPTQ-4bit
