Most hosted AI APIs bill the same way: per token. Every token in and every token out has a price. At first it feels cheap, just cents per call. But once the product grows, the agent pipeline runs 24/7, and you process millions of documents, the per-token bill becomes a tax that only goes up. The good news: there's another way to pay for inference, and at scale it's far cheaper.

⚡ TL;DR

API = cost per token (grows linearly, forever). Self-hosting = cost per GPU-hour (fixed, however many tokens you push through, until the GPU saturates). Above a certain volume, hosting your own open-source LLM is much cheaper, and it gives you data sovereignty and zero lock-in.

How cost per token works

Closed APIs bill input (your prompt) and output (the answer) separately. Reference prices per million tokens (always confirm each vendor's current values):

| Model (reference) | Input / 1M tokens | Output / 1M tokens |
| --- | --- | --- |
| Frontier model (top) | ~$15 | ~$75 |
| Mid-tier model | ~$3 | ~$15 |
| Lightweight model | ~$1 | ~$5 |

The problem isn't the unit price — it's the linearity. Double the usage, double the bill. There's no economy of scale: the millionth token costs the same as the first.

How cost per GPU-hour works

When you host an open-source model (GLM-5.2, Llama, Qwen, DeepSeek...) on your own GPU, the logic flips: you pay for the GPU per hour, and it processes as many tokens as fit within its capacity. A modern GPU serving with vLLM and batching can sustain high throughput, which can add up to billions of tokens per month for a fixed cost.

In other words: the more tokens you push through the same instance, the lower your effective cost per token. It's the opposite of the API model.
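
A minimal sketch of that inverse relationship, assuming the ~$1,200/month H100 figure used in the example further down and volumes that stay within the GPU's capacity. The function name is hypothetical:

```python
def effective_cost_per_million(gpu_cost_per_month: float, tokens_per_month: float) -> float:
    """Dollars per 1M tokens when the monthly GPU bill is fixed."""
    if tokens_per_month <= 0:
        raise ValueError("tokens_per_month must be positive")
    return gpu_cost_per_month / (tokens_per_month / 1_000_000)

GPU_COST = 1200.0  # assumption: the article's reference cost for one H100, 24/7

# Same GPU bill, growing volume: the effective price per token keeps falling.
for volume in (100e6, 400e6, 1e9, 2e9):
    cost = effective_cost_per_million(GPU_COST, volume)
    print(f"{volume / 1e6:>6,.0f}M tokens/month -> ${cost:.2f} per 1M tokens")
```

At 400M tokens/month this works out to about $3 per 1M tokens; at 2B tokens/month it is about $0.60, provided the GPU can actually serve that volume.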

The math: a real example

Imagine a product processing 300M input + 100M output tokens per month (typical of RAG and agents, which read a lot of context). Let's compare.

Via API (cost per token)
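
Using the reference prices from the table above, the API side of the comparison can be sketched as follows. The prices are this article's approximate reference values, not any vendor's current list price, and the function name is hypothetical:

```python
# Reference prices in dollars per 1M tokens: (input, output).
# Assumption: approximate values from the article's table.
PRICES = {
    "frontier": (15.0, 75.0),
    "mid-tier": (3.0, 15.0),
    "lightweight": (1.0, 5.0),
}

def api_monthly_cost(input_millions: float, output_millions: float, tier: str) -> float:
    """Monthly API bill in dollars; input and output are billed separately."""
    price_in, price_out = PRICES[tier]
    return input_millions * price_in + output_millions * price_out

# The example workload: 300M input + 100M output tokens per month.
for tier in PRICES:
    print(f"{tier}: ${api_monthly_cost(300, 100, tier):,.0f}/month")
```

For the example workload this gives about $2,400/month at mid-tier prices and about $12,000/month at frontier prices. Doubling both volumes doubles each figure.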

Self-hosted (cost per hour)

A dedicated H100 running the open-source model 24/7 on GPUBrazil costs roughly $1,200/month (about 730 hours, or around $1.64 per hour). And that same H100 has plenty of headroom: with batching it serves far more than the example's 400M tokens, potentially billions per month.
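
One way to sanity-check the headroom claim for your own workload is to convert a monthly volume into the average throughput the GPU must sustain, then compare it with what you measure on your own model and prompts. This is a rough sketch: it averages over the whole month, so real traffic with peaks needs extra margin, and the function name is hypothetical:

```python
HOURS_PER_MONTH = 730  # the article's figure for a 24/7 month

def required_tokens_per_second(tokens_per_month: float, hours: float = HOURS_PER_MONTH) -> float:
    """Average tokens/second needed to process the monthly volume in the given hours."""
    return tokens_per_month / (hours * 3600)

print(round(required_tokens_per_second(400e6)))  # the example's 400M tokens/month
print(round(required_tokens_per_second(2e9)))    # 5x the example volume
```

The example's 400M tokens/month averages out to roughly 152 tokens per second (input and output combined), and 2B tokens/month to roughly 761.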

| Scenario (400M tokens/month) | Monthly cost | Spare capacity |
| --- | --- | --- |
| API (mid-tier) | ≈ $2,400 | |
| API (frontier) | ≈ $12,000 | |
| H100 self-hosted (24/7) | ≈ $1,200 | Very high (fits 3–5× the volume) |

💡 The key insight

With self-hosting, the cost does not rise when you double the volume, until the GPU saturates. If your usage grows to 1–2 billion tokens/month with the same input/output mix, the API bill would reach roughly $6,000–$12,000 at mid-tier prices and $30,000–$60,000 at frontier prices, while your H100 keeps costing ~$1,200. That's where the gap stops being nice and becomes enormous.
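
The break-even volume can be estimated from a blended per-token price. The sketch below assumes the example's mix of 75% input and 25% output tokens, the article's reference prices, and a fixed ~$1,200/month GPU. It counts only GPU rental, and the function name is hypothetical:

```python
def break_even_millions(gpu_cost_per_month: float, price_in: float,
                        price_out: float, input_share: float = 0.75) -> float:
    """Monthly volume (in millions of tokens) at which the API bill equals the GPU bill."""
    # Blended API price per 1M tokens for the assumed input/output mix.
    blended = input_share * price_in + (1 - input_share) * price_out
    return gpu_cost_per_month / blended

GPU_COST = 1200.0  # assumption: the article's reference H100 cost

print(f"mid-tier: {break_even_millions(GPU_COST, 3.0, 15.0):.0f}M tokens/month")
print(f"frontier: {break_even_millions(GPU_COST, 15.0, 75.0):.0f}M tokens/month")
```

Under these assumptions the GPU pays for itself at about 200M tokens/month against mid-tier pricing and about 40M tokens/month against frontier pricing. A different input/output mix shifts both numbers.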

It doesn't have to be 24/7 — or an H100

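Two tweaks make the math even better for smaller workloads. First, billing is hourly, so you can turn the GPU on only during usage windows (business hours, batches, peaks) and off the rest of the time. Second, a smaller GPU, paired with a smaller or quantized model, lowers the per-hour cost.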
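
Running the GPU only part of the month can be estimated like this. The hourly rate is implied by the article's ~$1,200 for 730 hours rather than a quoted price, and the 8 hours × 22 days schedule is an assumed example:

```python
MONTHLY_24_7 = 1200.0  # assumption: the article's reference cost for a 24/7 H100
HOURS_24_7 = 730
HOURLY_RATE = MONTHLY_24_7 / HOURS_24_7  # roughly $1.64/hour, derived, not quoted

def part_time_cost(hours_per_day: float, days_per_month: float,
                   rate: float = HOURLY_RATE) -> float:
    """Monthly GPU cost when the instance runs only during usage windows."""
    return hours_per_day * days_per_month * rate

# Assumed schedule: business hours only, 8 hours a day on 22 working days.
print(f"${part_time_cost(8, 22):.0f}/month")
```

With that schedule the bill drops to roughly $289/month, as long as the shorter window still leaves enough hours to process your volume.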

When the API still wins

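Let's be honest: self-hosting isn't always the answer. At low volume the per-token bill stays below the cost of keeping a GPU, so the API tends to win, and the same goes for spikes and rare cases.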

That's why the winning pattern is often hybrid: self-host the bulk of the predictable volume and send spikes or rare cases to a frontier API.

Bonus: you don't only pay in money

Cost per token isn't just dollars. Self-hosting also gives you data sovereignty and zero lock-in.

Run the numbers on your own GPU

Get free credit, spin up an open-source LLM by the hour and watch your cost per token drop.

FAQ

When is self-hosting an LLM cheaper than using an API?

When volume is high and predictable. APIs bill per token (the bill grows linearly with usage); self-hosting bills per GPU-hour (fixed, however much volume you push through, up to the GPU's capacity). The break-even is the point where the monthly per-token bill exceeds the cost of keeping the GPU; in practice, that is hundreds of millions to billions of tokens/month. For low volume, the API tends to win.

How do I calculate whether to switch from an API to my own GPU?

Estimate your tokens/month and multiply by the API's per-token price. Compare with the cost of a GPU for the hours you actually need (billed hourly here). If the GPU serves your volume within capacity and costs less than the API, it's worth it — plus you gain sovereignty and zero lock-in.

Do I need to keep the GPU running 24/7?

No. Hourly billing means you can turn it on only during usage windows (business hours, batches, peaks) and off the rest of the time. A smaller GPU, paired with a smaller or quantized model, also reduces the per-hour cost.

Conclusion

The right question isn't "API or my own GPU?" but "which part of my volume makes more sense under each pricing model?". For predictable, growing usage, which is where the cost lives, swapping "per token" for "per GPU-hour" can cut the bill in half or more, and delivers data sovereignty on top. Start by measuring your tokens/month, run a GPU for a few hours, and compare against your API invoice. The numbers usually speak for themselves.
