Why Quantization Matters

A 70B-parameter model in FP16 needs about 140GB of VRAM for its weights alone, more than most single GPUs offer. Quantization compresses models by reducing numerical precision, letting you run large models on smaller GPUs.

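| Precision | Bits/Param | 70B Model Size | Quality Loss |
|-----------|------------|----------------|--------------|
| FP32      | 32         | 280GB          | None         |
| FP16/BF16 | 16         | 140GB          | Negligible   |
| INT8      | 8          | 70GB           | ~1%          |
| INT4      | 4          | 35GB           | ~2-3%        |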

With 4-bit quantization, you can run a 70B model on a single 48GB GPU!
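
The sizes in the table follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. A minimal sketch of that calculation is below; the function name is illustrative, and it counts weights only, so the KV cache and activations need extra headroom on top.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


# Reproduce the 70B column of the table for each precision.
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name:>9}: {weight_memory_gb(70e9, bits):.0f} GB")
```

At 4 bits the weights take about 35GB, which is why a 48GB card leaves room for the KV cache.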

Quantization Methods Compared

GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)

Best for: GPU inference with maximum quality

AWQ (Activation-Aware Weight Quantization)

Best for: Speed-optimized GPU inference

GGUF (llama.cpp format)

Best for: CPU inference, Apple Silicon, edge devices

💡 Quick Recommendation

GPU server: Use AWQ for best speed
Quality critical: Use GPTQ
CPU/Mac/Edge: Use GGUF Q4_K_M

Using Pre-Quantized Models

The easiest approach is downloading pre-quantized models from HuggingFace:

# AWQ models (TheBloke is the main quantizer)
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-70B-Chat-AWQ",
    device_map="auto",
    trust_remote_code=True
)

# GPTQ models
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=True
)

Quantizing Your Own Models

GPTQ Quantization

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration dataset (crucial for quality)
calibration_data = [
    "The meaning of life is",
    "Machine learning is a subset of",
    "In a recent study, researchers found",
    # Add 100-500 diverse examples
]

# GPTQ config
gptq_config = GPTQConfig(
    bits=4,                    # 4-bit quantization
    dataset=calibration_data,
    tokenizer=tokenizer,
    group_size=128,            # Smaller = better quality, larger size
    desc_act=True,             # Activation order (better quality)
)

# Quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

# Save
model.save_pretrained("./llama-3.1-8b-gptq")
tokenizer.save_pretrained("./llama-3.1-8b-gptq")
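
To see why `group_size` trades quality against size, here is a toy sketch of group-wise 4-bit quantization in plain Python. It uses simple round-to-nearest with one scale and minimum per group, which is an assumption made for illustration: GPTQ itself goes further and uses the calibration data to compensate for rounding error. The random weights and helper names are hypothetical stand-ins.

```python
import random


def quantize_groupwise(weights, bits=4, group_size=128):
    """Round-to-nearest asymmetric quantization with one (scale, min) per group.

    Returns the dequantized weights so the error can be measured directly.
    """
    levels = 2 ** bits - 1
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Fall back to 1.0 so a constant group does not divide by zero.
        scale = (hi - lo) / levels or 1.0
        for w in group:
            q = round((w - lo) / scale)         # integer code in [0, levels]
            dequantized.append(lo + q * scale)  # map the code back to a float
    return dequantized


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Toy stand-in for one row of a weight matrix.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

for g in (32, 128, 1024):
    err = mean_squared_error(weights, quantize_groupwise(weights, group_size=g))
    print(f"group_size={g:>4}: MSE={err:.5f}")
```

Smaller groups track the local range of the weights more closely, so the error drops, but each group stores its own scale and minimum, so the file grows.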

AWQ Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "./llama-3.1-8b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # or "GEMV" for batch size 1
}

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

GGUF Conversion

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py /path/to/model \
    --outfile model-f16.gguf \
    --outtype f16

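# Build the llama.cpp tools (binaries land in build/bin)
cmake -B build
cmake --build build --config Release

# Quantize to 4-bit
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m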

Serving Quantized Models

vLLM (Best for Production)

# AWQ model
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
    --quantization awq \
    --tensor-parallel-size 2 \
    --max-model-len 4096

# GPTQ model
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
    --quantization gptq \
    --tensor-parallel-size 2
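
Once the server is up, it can be queried over HTTP. The sketch below assumes vLLM's OpenAI-compatible server is reachable on localhost at its default port 8000 and uses the `/v1/completions` route; the helper names are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(prompt, model="TheBloke/Llama-2-70B-Chat-AWQ",
                             base_url="http://localhost:8000", max_tokens=64):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": model,          # must match the model ID passed to `vllm serve`
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the AWQ server from the first command running, `complete("Quantization reduces")` returns the model's continuation as a string.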

Text Generation Inference (TGI)

docker run --gpus all -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-70B-Chat-AWQ \
    --quantize awq
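
TGI uses its own request shape rather than the OpenAI one. A minimal sketch of a client for its `/generate` route, assuming the container above is reachable on localhost port 8080 (per the `-p 8080:80` mapping); the helper names are illustrative.

```python
import json
import urllib.request


def build_tgi_request(prompt, base_url="http://localhost:8080", max_new_tokens=64):
    """Build a POST request for TGI's /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running container)."""
    with urllib.request.urlopen(build_tgi_request(prompt, **kwargs)) as resp:
        return json.load(resp)["generated_text"]
```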

llama.cpp (GGUF)

# Build llama.cpp with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Serve model
# -ngl: number of layers to offload to the GPU
# -c:   context length
./build/bin/llama-server \
    -m model-q4_k_m.gguf \
    -ngl 99 \
    -c 4096 \
    --host 0.0.0.0 --port 8080

Benchmarks: Quality vs Speed

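Testing Llama-2-70B on MMLU, with throughput and memory use (RTX 4090; a single card has 24GB of VRAM, so the configurations below would need multiple cards or CPU offload):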

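| Format          | MMLU  | Tokens/sec | VRAM  |
|-----------------|-------|------------|-------|
| FP16 (baseline) | 69.8% | 12         | 140GB |
| GPTQ 4-bit      | 68.9% | 35         | 38GB  |
| AWQ 4-bit       | 68.7% | 42         | 38GB  |
| GGUF Q4_K_M     | 68.2% | 28         | 40GB  |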

Key takeaway: AWQ gives best speed with minimal quality loss.
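
The takeaway can be read straight off the table by expressing each format relative to the FP16 baseline. The sketch below only re-uses the numbers from the table above; the function name is illustrative.

```python
# Figures copied from the benchmark table: (MMLU %, tokens/sec, VRAM GB).
results = {
    "FP16 (baseline)": (69.8, 12, 140),
    "GPTQ 4-bit": (68.9, 35, 38),
    "AWQ 4-bit": (68.7, 42, 38),
    "GGUF Q4_K_M": (68.2, 28, 40),
}


def compare_to_baseline(results, baseline="FP16 (baseline)"):
    """Return MMLU drop (percentage points) and speedup versus the baseline."""
    base_mmlu, base_tps, _ = results[baseline]
    summary = {}
    for name, (mmlu, tps, _vram) in results.items():
        if name == baseline:
            continue
        summary[name] = {
            "mmlu_drop_pts": round(base_mmlu - mmlu, 1),
            "speedup": round(tps / base_tps, 2),
        }
    return summary


for name, row in compare_to_baseline(results).items():
    print(f"{name}: -{row['mmlu_drop_pts']} pts MMLU, {row['speedup']}x speed")
```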

⚠️ Quality Considerations

Quantization affects different tasks differently. Always benchmark on YOUR specific use case before deploying quantized models in production.
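
One way to run such a check is a small harness that feeds the same task-specific examples to the baseline and the quantized model and compares accuracy and latency. This is a sketch under stated assumptions: the two model functions are hypothetical stubs to be replaced with real calls, and exact-match scoring is only one possible metric.

```python
import time


def benchmark(generate, examples):
    """Run `generate(prompt) -> str` over (prompt, expected) pairs.

    Returns exact-match accuracy and average wall-clock seconds per example.
    """
    correct = 0
    start = time.perf_counter()
    for prompt, expected in examples:
        if generate(prompt).strip() == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(examples),
        "sec_per_example": elapsed / len(examples),
    }


# Hypothetical stand-ins: replace with calls to your baseline and quantized models.
def baseline_model(prompt):
    return {"2+2=": "4", "Capital of France?": "Paris"}.get(prompt, "")


def quantized_model(prompt):
    return {"2+2=": "4", "Capital of France?": "Lyon"}.get(prompt, "")


examples = [("2+2=", "4"), ("Capital of France?", "Paris")]
base = benchmark(baseline_model, examples)
quant = benchmark(quantized_model, examples)
print(f"accuracy drop: {base['accuracy'] - quant['accuracy']:.2f}")
```

In practice the example set should come from your own workload, with enough cases for the accuracy difference to be meaningful.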

Quantization Levels Explained

GGUF Quantization Types

The "_K" and "_M" Suffixes

K-quants use different bit widths for different tensors within the model, which improves quality at a given file size. The final suffix sets the balance between size and quality: S (small), M (medium), or L (large).
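
Because bit depths are mixed, the effective size depends on the average bits per weight rather than on a single number. A minimal sketch of that arithmetic follows; the 20/80 split is a hypothetical illustration, not the real Q4_K_M layout, and it ignores the per-block scale overhead.

```python
def average_bits(mix):
    """mix maps bit width -> fraction of parameters stored at that width."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(bits * frac for bits, frac in mix.items())


# Hypothetical split, for illustration only: 20% of weights at 6 bits, 80% at 4 bits.
mix = {6: 0.2, 4: 0.8}
avg = average_bits(mix)
size_gb = 70e9 * avg / 8 / 1e9  # weights only, 1 GB = 1e9 bytes
print(f"{avg:.2f} bits/weight -> about {size_gb:.1f} GB for 70B parameters")
```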

Mixed Precision Strategies

Advanced: Keep sensitive layers in higher precision:

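import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same model as in the GPTQ example; swap in your own
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Quantize linear layers to 4-bit NF4, but leave the listed modules unquantized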
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Quantize the quantization constants
    llm_int8_skip_modules=["lm_head", "embed_tokens"]  # Keep in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

Cost Savings with Quantization

Running Llama-2-70B on GPUBrazil:

| Setup            | GPUs Needed  | Cost/Hour |
|------------------|--------------|-----------|
| FP16 (140GB)     | 2x A100 80GB | $2.40     |
| AWQ 4-bit (38GB) | 1x A100 80GB | $1.20     |
| AWQ 4-bit (38GB) | 1x L40S 48GB | $0.79     |

50-70% cost reduction with quantization!
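
The percentage follows from the hourly prices in the table. A short sketch of the calculation is below; the monthly figure assumes the instance runs continuously for 30 days, which is an assumption added for illustration.

```python
# Hourly prices copied from the table above (USD).
setups = {
    "FP16, 2x A100 80GB": 2.40,
    "AWQ 4-bit, 1x A100 80GB": 1.20,
    "AWQ 4-bit, 1x L40S 48GB": 0.79,
}


def savings_pct(baseline_cost, cost):
    """Percentage saved relative to the baseline hourly cost."""
    return (baseline_cost - cost) / baseline_cost * 100


baseline = setups["FP16, 2x A100 80GB"]
for name, cost in setups.items():
    monthly = cost * 24 * 30  # assumes 24/7 usage for 30 days
    print(f"{name}: ${monthly:,.0f}/month, {savings_pct(baseline, cost):.0f}% saved")
```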

Best Practices

  1. Start with pre-quantized models: TheBloke has quantized hundreds of popular models
  2. Use AWQ for production GPU inference: best speed/quality
  3. Benchmark on your data: quantization impact varies by task
  4. Q4_K_M for GGUF: best balance for most use cases
  5. Consider the use case: creative writing tolerates more quantization than code generation

Conclusion

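Quantization is essential for cost-effective LLM deployment. With 4-bit quantization, you can run a 70B model on a single 48GB GPU, cut hourly GPU cost by roughly 50-70%, and keep quality loss to a few percent.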

Start with AWQ models from HuggingFace and deploy on GPUBrazil to maximize your compute budget.