Model Quantization Guide: GPTQ vs AWQ vs GGUF Explained

Why Quantization Matters

A 70B-parameter model in FP16 needs about 140GB of VRAM for the weights alone (70 billion parameters × 2 bytes each), more than most single GPUs offer. Quantization compresses models by reducing numerical precision, letting you run large models on smaller GPUs.

Precision	Bits/Param	70B Model Size	Quality Loss
FP32	32	280GB	None
FP16/BF16	16	140GB	Negligible
INT8	8	70GB	~1%
INT4	4	35GB	~2-3%

With 4-bit quantization, the weights of a 70B model take about 35GB, so the model fits on a single 48GB GPU with room left for the KV cache!
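The sizes in the table follow from parameter count times bits per parameter. Here is a minimal sketch of that arithmetic, assuming decimal gigabytes and counting weights only. The function name `weight_size_gb` is illustrative, and the KV cache and activations add to this figure at runtime.

```python
def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Reproduce the "70B Model Size" column for each precision.
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_size_gb(70e9, bits):.0f}GB")
```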

Quantization Methods Compared

GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)

Best for: GPU inference with maximum quality

Uses calibration data to minimize quantization error
Requires GPU for quantization process
Excellent quality at 4-bit
Works with vLLM, TGI, Transformers
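To make "quantization error" concrete, the sketch below applies plain round-to-nearest 4-bit quantization with one scale per group of weights. This is not GPTQ itself: GPTQ goes further and uses calibration data to adjust the remaining weights so they compensate for each rounding error. The toy weights and function names are illustrative assumptions.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns the integer codes and the scale needed to dequantize them.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def mean_abs_error(weights, group_size, bits=4):
    """Average |w - dequant(quant(w))| with one scale per group."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale = quantize_group(group, bits)
        restored = dequantize_group(codes, scale)
        total += sum(abs(w - r) for w, r in zip(group, restored))
    return total / len(weights)


# Toy weights (32 values): mostly small numbers plus one outlier of 1.0.
weights = ([0.01 * i for i in range(-8, 8)] + [1.0]
           + [0.02 * i for i in range(-7, 8)])

print(mean_abs_error(weights, group_size=32))  # one scale for all weights
print(mean_abs_error(weights, group_size=8))   # one scale per 8 weights
```

With a single group, the outlier sets a coarse scale for every weight. Smaller groups confine that damage to the outlier's own group, at the cost of storing more scales.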

AWQ (Activation-aware Weight Quantization)

Best for: Speed-optimized GPU inference

Protects "salient" weights that affect output most
Faster inference than GPTQ in many cases
Native vLLM support with fused kernels
Best speed/quality tradeoff for production
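The idea of protecting salient weights can be illustrated with a toy example: scale a weight up before rounding, then divide by the same factor afterwards, and its rounding error shrinks. This is only a sketch under strong assumptions. The scale factor `s` is hand-picked here, whereas AWQ derives per-channel scales from activation statistics, and the weights and step size are made up for illustration.

```python
def rtn(value, step):
    """Round-to-nearest onto a grid with spacing `step`."""
    return round(value / step) * step


def max_error(weights, step, s):
    """Worst-case error when each weight is multiplied by s before rounding
    and the rounded result is divided by s again."""
    return max(abs(w - rtn(w * s, step) / s) for w in weights)


# 4-bit symmetric grid whose range is set by a large weight (1.0) elsewhere
# in the same group, so scaling the salient weights does not change `step`.
step = 1.0 / 7

# Hypothetical "salient" weights, small enough that 2 * w stays in range.
salient = [i / 100 for i in range(1, 51)]

print(max_error(salient, step, s=1.0))  # plain rounding
print(max_error(salient, step, s=2.0))  # scaled up before rounding
```

Because the grid spacing stays fixed, doubling a weight before rounding halves the worst-case error after the division. In a real model the division is folded into the preceding operation rather than applied to the weight.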

GGUF (llama.cpp format)

Best for: CPU inference, Apple Silicon, edge devices

Successor to GGML format
Multiple quantization levels (Q4_K_M, Q5_K_M, etc.)
Works on CPU with optional GPU offload
Used by llama.cpp, Ollama, LM Studio

💡 Quick Recommendation

GPU server: Use AWQ for best speed
Quality critical: Use GPTQ
CPU/Mac/Edge: Use GGUF Q4_K_M

Using Pre-Quantized Models

The easiest approach is downloading pre-quantized models from HuggingFace:

# AWQ models (TheBloke's repos are a common source; needs the autoawq package)
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-70B-Chat-AWQ",
    device_map="auto"
)

# GPTQ models (need optimum plus a GPTQ backend such as auto-gptq)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device_map="auto"
)

Quantizing Your Own Models

GPTQ Quantization

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration dataset (crucial for quality)
calibration_data = [
    "The meaning of life is",
    "Machine learning is a subset of",
    "In a recent study, researchers found",
    # Add 100-500 diverse examples
]

# GPTQ config
gptq_config = GPTQConfig(
    bits=4,                    # 4-bit quantization
    dataset=calibration_data,
    tokenizer=tokenizer,
    group_size=128,            # Smaller = better quality, larger size
    desc_act=True,             # Activation order (better quality, can slow inference)
)

# Quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

# Save
model.save_pretrained("./llama-3.1-8b-gptq")
tokenizer.save_pretrained("./llama-3.1-8b-gptq")

AWQ Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "./llama-3.1-8b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # or "GEMV" for batch size 1
}

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

GGUF Conversion

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py /path/to/model \
    --outfile model-f16.gguf \
    --outtype f16

# Quantize to 4-bit (build llama.cpp first; a CMake build puts the binary in build/bin/)
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Serving Quantized Models

vLLM (Best for Production)

# AWQ model
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
    --quantization awq \
    --tensor-parallel-size 2 \
    --max-model-len 4096

# GPTQ model
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
    --quantization gptq \
    --tensor-parallel-size 2
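Once the server is up, it exposes an OpenAI-compatible HTTP API. The sketch below builds and sends a completion request using only the standard library. The helper names are illustrative, and the base URL in the usage note assumes vLLM's default port 8000 on the local machine.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, max_tokens=64, timeout=60):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

For example, `complete("http://localhost:8000", "TheBloke/Llama-2-70B-Chat-AWQ", "Quantization is")` would return the model's continuation, provided the server from the command above is running.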

Text Generation Inference (TGI)

docker run --gpus all -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-70B-Chat-AWQ \
    --quantize awq

llama.cpp (GGUF)

# Build llama.cpp with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Serve model
# -ngl: number of layers to offload to the GPU; -c: context length
./build/bin/llama-server \
    -m model-q4_k_m.gguf \
    -ngl 99 \
    -c 4096 \
    --host 0.0.0.0 --port 8080

Benchmarks: Quality vs Speed

Testing Llama-2-70B on various tasks (RTX 4090; since one card has 24GB, the VRAM figures below imply multiple GPUs or partial CPU offload):

Format	MMLU	Tokens/sec	VRAM
FP16 (baseline)	69.8%	12	140GB
GPTQ 4-bit	68.9%	35	38GB
AWQ 4-bit	68.7%	42	38GB
GGUF Q4_K_M	68.2%	28	40GB

Key takeaway: AWQ gives best speed with minimal quality loss.
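The takeaway is easier to compare once each row is expressed relative to the FP16 baseline. This short sketch uses the numbers copied from the table above. The function name and output layout are illustrative.

```python
# Rows copied from the table above: (MMLU %, tokens/sec, VRAM in GB)
results = {
    "FP16 (baseline)": (69.8, 12, 140),
    "GPTQ 4-bit": (68.9, 35, 38),
    "AWQ 4-bit": (68.7, 42, 38),
    "GGUF Q4_K_M": (68.2, 28, 40),
}


def relative_to_baseline(results, baseline="FP16 (baseline)"):
    """Express each format's quality, speed, and VRAM as a ratio to the baseline."""
    base_mmlu, base_tps, base_vram = results[baseline]
    summary = {}
    for name, (mmlu, tps, vram) in results.items():
        summary[name] = {
            "quality_retained": mmlu / base_mmlu,
            "speedup": tps / base_tps,
            "vram_ratio": vram / base_vram,
        }
    return summary


for name, row in relative_to_baseline(results).items():
    print(f"{name}: {row['quality_retained']:.1%} quality, "
          f"{row['speedup']:.1f}x speed, {row['vram_ratio']:.0%} VRAM")
```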

⚠️ Quality Considerations

Quantization affects different tasks differently. Always benchmark on YOUR specific use case before deploying quantized models in production.
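One way to run such a check is to score the baseline and the quantized model on the same task-specific examples. The sketch below uses exact-match accuracy and stand-in functions instead of real models. The metric, the function names, and the examples are all assumptions to adapt to your own task.

```python
def exact_match_accuracy(generate, examples):
    """Fraction of (prompt, expected) pairs the model answers exactly."""
    hits = sum(
        1 for prompt, expected in examples
        if generate(prompt).strip() == expected.strip()
    )
    return hits / len(examples)


def compare_models(baseline_generate, quantized_generate, examples):
    """Score both models on the same examples and report the gap."""
    base = exact_match_accuracy(baseline_generate, examples)
    quant = exact_match_accuracy(quantized_generate, examples)
    return {"baseline": base, "quantized": quant, "drop": base - quant}


# Stand-in "models" so the sketch runs without a GPU; replace them with
# functions that call your real FP16 and quantized deployments.
examples = [("2+2=", "4"), ("Capital of France?", "Paris"), ("3*3=", "9")]
answers = dict(examples)


def baseline(prompt):
    return answers[prompt]


def quantized(prompt):
    # Simulate one regression introduced by quantization.
    return "10" if prompt == "3*3=" else answers[prompt]


print(compare_models(baseline, quantized, examples))
```

Exact match suits short factual answers. For open-ended generation, swap in a metric that fits the task, such as unit-test pass rate for code.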

Quantization Levels Explained

GGUF Quantization Types

Q8_0: 8-bit, ~1% quality loss, largest
Q6_K: 6-bit, ~1.5% loss, good balance
Q5_K_M: 5-bit medium, ~2% loss
Q4_K_M: 4-bit medium, ~3% loss, recommended
Q4_K_S: 4-bit small, ~3.5% loss
Q3_K_M: 3-bit, ~5% loss, smallest usable
Q2_K: 2-bit, significant loss, experimental

The "_K" and "_M" Suffixes

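K-quants pack weights into super-blocks with their own quantized scales, and the mixed variants keep some tensors at a higher bit depth than others, which improves quality. The S (small) and M (medium) suffixes control that mix: M keeps more tensors at higher precision, trading a slightly larger file for better quality.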
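Choosing a level often comes down to what fits in memory. The sketch below picks the highest-precision level from the list above that fits a given budget. It relies on two loud assumptions: the nominal bit widths from the level names, and a flat 15% overhead factor, chosen because the benchmark table lists 40GB for a 70B Q4_K_M model against 35GB of nominal 4-bit weights. Real overhead differs by level and context length.

```python
# Nominal bits per weight, taken from the level names in the list above.
NOMINAL_BITS = [
    ("Q8_0", 8), ("Q6_K", 6), ("Q5_K_M", 5),
    ("Q4_K_M", 4), ("Q4_K_S", 4), ("Q3_K_M", 3), ("Q2_K", 2),
]


def pick_level(num_params, budget_gb, overhead=1.15):
    """Return the first (highest-precision) level whose estimated size fits
    the memory budget, or None if nothing fits.

    `overhead` is an assumed multiplier covering scales, higher-precision
    tensors, and runtime buffers; it is not a measured value.
    """
    for name, bits in NOMINAL_BITS:
        size_gb = num_params * bits / 8 / 1e9 * overhead
        if size_gb <= budget_gb:
            return name
    return None


print(pick_level(70e9, budget_gb=48))  # 70B model on a 48GB GPU
print(pick_level(8e9, budget_gb=24))   # 8B model on a 24GB GPU
```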

Mixed Precision Strategies

Advanced: Keep sensitive layers in higher precision:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Quantize linear layers to 4-bit NF4, but skip the listed modules so
# they stay in higher precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matrix multiplies
    bnb_4bit_use_double_quant=True,  # Quantize the quantization constants
    llm_int8_skip_modules=["lm_head", "embed_tokens"]  # Left unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

Cost Savings with Quantization

Running Llama-2-70B on GPUBrazil:

Setup	GPUs Needed	Cost/Hour
FP16 (140GB)	2x A100 80GB	$2.40
AWQ 4-bit (38GB)	1x A100 80GB	$1.20
AWQ 4-bit (38GB)	1x L40S 48GB	$0.79

50-70% cost reduction with quantization!
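The percentage comes straight from the hourly prices. A quick sketch of the arithmetic, using the figures copied from the table above (the 30-day projection assumes the instance runs around the clock):

```python
# Hourly prices copied from the table above.
baseline_cost = 2.40  # FP16 on 2x A100 80GB
quantized_costs = {
    "AWQ 4-bit on 1x A100 80GB": 1.20,
    "AWQ 4-bit on 1x L40S 48GB": 0.79,
}


def savings(baseline, cost):
    """Fractional cost reduction relative to the baseline price."""
    return (baseline - cost) / baseline


for setup, cost in quantized_costs.items():
    saved_30_days = (baseline_cost - cost) * 24 * 30
    print(f"{setup}: {savings(baseline_cost, cost):.0%} cheaper, "
          f"${saved_30_days:,.0f} saved over 30 days")
```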

Best Practices

Start with pre-quantized models — TheBloke has quantized hundreds of popular models
Use AWQ for production GPU inference — best speed/quality
Benchmark on your data — quantization impact varies by task
Q4_K_M for GGUF — best balance for most use cases
Consider the use case — creative writing tolerates more quantization than code generation

Conclusion

Quantization is essential for cost-effective LLM deployment. With 4-bit quantization, you can:

Run 70B models on single GPUs
Reduce costs by 50-70%
Maintain 97%+ quality
Increase inference speed

Start with AWQ models from HuggingFace and deploy on GPUBrazil to maximize your compute budget.