Why Quantization Matters
A 70B parameter model in FP16 requires 140GB of VRAMβmore than any single GPU offers. Quantization compresses models by reducing numerical precision, enabling you to run large models on smaller GPUs.
| Precision | Bits/Param | 70B Model Size | Quality Loss |
|---|---|---|---|
| FP32 | 32 | 280GB | None |
| FP16/BF16 | 16 | 140GB | Negligible |
| INT8 | 8 | 70GB | ~1% |
| INT4 | 4 | 35GB | ~2-3% |
With 4-bit quantization, you can run a 70B model on a single 48GB GPU!
Quantization Methods Compared
GPTQ (GPU Quantization)
Best for: GPU inference with maximum quality
- Uses calibration data to minimize quantization error
- Requires GPU for quantization process
- Excellent quality at 4-bit
- Works with vLLM, TGI, Transformers
AWQ (Activation-Aware Quantization)
Best for: Speed-optimized GPU inference
- Protects "salient" weights that affect output most
- Faster inference than GPTQ in many cases
- Native vLLM support with fused kernels
- Best speed/quality tradeoff for production
GGUF (llama.cpp format)
Best for: CPU inference, Apple Silicon, edge devices
- Successor to GGML format
- Multiple quantization levels (Q4_K_M, Q5_K_M, etc.)
- Works on CPU with optional GPU offload
- Used by llama.cpp, Ollama, LM Studio
π‘ Quick Recommendation
GPU server: Use AWQ for best speed
Quality critical: Use GPTQ
CPU/Mac/Edge: Use GGUF Q4_K_M
Using Pre-Quantized Models
The easiest approach is downloading pre-quantized models from HuggingFace:
# AWQ models (TheBloke is the main quantizer)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-70B-Chat-AWQ",
device_map="auto",
trust_remote_code=True
)
# GPTQ models
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-70B-Chat-GPTQ",
device_map="auto",
trust_remote_code=True
)
Quantizing Your Own Models
GPTQ Quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "meta-llama/Llama-3.1-8B-Instruct"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Calibration dataset (crucial for quality)
calibration_data = [
"The meaning of life is",
"Machine learning is a subset of",
"In a recent study, researchers found",
# Add 100-500 diverse examples
]
# GPTQ config
gptq_config = GPTQConfig(
bits=4, # 4-bit quantization
dataset=calibration_data,
tokenizer=tokenizer,
group_size=128, # Higher = better quality, larger size
desc_act=True, # Activation order (better quality)
)
# Quantize
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=gptq_config,
device_map="auto"
)
# Save
model.save_pretrained("./llama-3.1-8b-gptq")
tokenizer.save_pretrained("./llama-3.1-8b-gptq")
AWQ Quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "./llama-3.1-8b-awq"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantization config
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM" # or "GEMV" for batch size 1
}
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
GGUF Conversion
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Install requirements
pip install -r requirements.txt
# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py /path/to/model \
--outfile model-f16.gguf \
--outtype f16
# Quantize to 4-bit
./llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m
Serving Quantized Models
vLLM (Best for Production)
# AWQ model
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 4096
# GPTQ model
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
--quantization gptq \
--tensor-parallel-size 2
Text Generation Inference (TGI)
docker run --gpus all -p 8080:80 \
-v /data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id TheBloke/Llama-2-70B-Chat-AWQ \
--quantize awq
llama.cpp (GGUF)
# Build llama.cpp with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Serve model
./llama-server \
-m model-q4_k_m.gguf \
-ngl 99 \ # GPU layers to offload
-c 4096 \ # Context length
--host 0.0.0.0 --port 8080
Benchmarks: Quality vs Speed
Testing Llama-2-70B on various tasks (RTX 4090):
| Format | MMLU | Tokens/sec | VRAM |
|---|---|---|---|
| FP16 (baseline) | 69.8% | 12 | 140GB |
| GPTQ 4-bit | 68.9% | 35 | 38GB |
| AWQ 4-bit | 68.7% | 42 | 38GB |
| GGUF Q4_K_M | 68.2% | 28 | 40GB |
Key takeaway: AWQ gives best speed with minimal quality loss.
β οΈ Quality Considerations
Quantization affects different tasks differently. Always benchmark on YOUR specific use case before deploying quantized models in production.
Quantization Levels Explained
GGUF Quantization Types
- Q8_0: 8-bit, ~1% quality loss, largest
- Q6_K: 6-bit, ~1.5% loss, good balance
- Q5_K_M: 5-bit medium, ~2% loss
- Q4_K_M: 4-bit medium, ~3% loss, recommended
- Q4_K_S: 4-bit small, ~3.5% loss
- Q3_K_M: 3-bit, ~5% loss, smallest usable
- Q2_K: 2-bit, significant loss, experimental
The "_K" and "_M" Suffixes
K-quants use different bit depths for different layer types, improving quality. M (medium) vs S (small) controls the balance between size and quality.
Mixed Precision Strategies
Advanced: Keep sensitive layers in higher precision:
from transformers import BitsAndBytesConfig
# Keep first/last layers in FP16, middle in INT4
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # Quantize the quantization constants
llm_int8_skip_modules=["lm_head", "embed_tokens"] # Keep in FP16
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto"
)
Cost Savings with Quantization
Running Llama-2-70B on GPUBrazil:
| Setup | GPUs Needed | Cost/Hour |
|---|---|---|
| FP16 (140GB) | 2x A100 80GB | $2.40 |
| AWQ 4-bit (38GB) | 1x A100 80GB | $1.20 |
| AWQ 4-bit (38GB) | 1x L40S 48GB | $0.79 |
50-70% cost reduction with quantization!
Run 70B Models Affordably
Quantized models + GPUBrazil = enterprise AI at startup prices.
Get $5 Free Credit βBest Practices
- Start with pre-quantized models β TheBloke has quantized hundreds of popular models
- Use AWQ for production GPU inference β best speed/quality
- Benchmark on your data β quantization impact varies by task
- Q4_K_M for GGUF β best balance for most use cases
- Consider the use case β creative writing tolerates more quantization than code generation
Conclusion
Quantization is essential for cost-effective LLM deployment. With 4-bit quantization, you can:
- Run 70B models on single GPUs
- Reduce costs by 50-70%
- Maintain 97%+ quality
- Increase inference speed
Start with AWQ models from HuggingFace and deploy on GPUBrazil to maximize your compute budget.