Token Economics: How Self-Hosting Cuts AI Costs

Q: When is self-hosting an LLM cheaper than using an API?

When token volume is high and predictable. APIs bill per token, a cost that grows linearly with usage; self-hosting bills per GPU-hour, a fixed cost that covers as many tokens as the GPU can serve. The break-even comes when the monthly per-token bill would exceed the cost of keeping a GPU running: in practice, workloads of hundreds of millions to billions of tokens per month. For low or sporadic volume, the API is usually cheaper.

Q: Do I need to keep the GPU running 24/7?

No. Because billing is hourly, you can turn the GPU on only during usage windows (business hours, batch jobs, peaks) and off the rest of the time. For variable workloads this cuts the fixed cost substantially. Smaller GPUs running quantized models also lower the per-hour cost.

Nearly every hosted AI API bills the same way: per token. Every token in and every token out has a price. At first it feels cheap: cents per call. But as the product grows, the agent pipeline runs 24/7, and you process millions of documents, the per-token bill becomes a tax that only goes up. The good news: there's another way to pay for inference, and at scale it can be far cheaper.

⚡ TL;DR

API = cost per token (grows linearly with usage). Self-hosting = cost per GPU-hour (fixed, however many tokens you push through, up to the GPU's capacity). Above a certain volume, hosting your own open-source LLM is much cheaper, and it gives you data sovereignty and zero lock-in.

How cost per token works

Closed APIs bill input (your prompt) and output (the answer) separately. Reference prices per million tokens (always confirm each vendor's current values):

Model (reference)	Input / 1M tok	Output / 1M tok
Frontier model (top)	~$15	~$75
Mid-tier model	~$3	~$15
Lightweight model	~$1	~$5

The problem isn't the unit price — it's the linearity. Double the usage, double the bill. There's no economy of scale: the millionth token costs the same as the first.
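The linearity is easy to check in a few lines of Python. This is a minimal sketch using the reference prices from the table above; the function name and the 50M/10M workload are illustrative assumptions, not vendor figures.

```python
# Reference prices in USD per 1M tokens (input, output), copied from the
# table above. Real vendor prices differ; confirm current values.
REFERENCE_PRICES = {
    "frontier": (15.0, 75.0),
    "mid-tier": (3.0, 15.0),
    "lightweight": (1.0, 5.0),
}


def api_monthly_cost(tier, input_tokens, output_tokens):
    """Return the monthly API bill in USD for the given token volumes."""
    input_price, output_price = REFERENCE_PRICES[tier]
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price


# Hypothetical workload: 50M input + 10M output tokens per month.
base = api_monthly_cost("mid-tier", 50e6, 10e6)
doubled = api_monthly_cost("mid-tier", 100e6, 20e6)
print(f"base: ${base:,.0f}  doubled: ${doubled:,.0f}")  # base: $300  doubled: $600
```

Doubling both volumes doubles the bill exactly, because no term in the formula depends on how much you have already spent.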

How cost per GPU-hour works

When you host an open-source model (GLM-5.2, Llama, Qwen, DeepSeek...) on your own GPU, the logic flips: you pay for the GPU per hour, and it processes as many tokens as fit within its capacity. A modern GPU serving with vLLM and batching can sustain high aggregate throughput. For scale, 1 billion tokens per month is about 380 tokens per second around the clock, so billions of tokens per month at a fixed cost are within reach, depending on model size and context length.

In other words: the more tokens you push through the same instance, the lower your effective cost per token. It's the opposite of the API model.
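To make that inverse relationship concrete, here is a small sketch that spreads a fixed monthly GPU bill over different volumes and reports the average throughput each volume requires. The $1,200 monthly cost is an illustrative assumption, and whether a given GPU actually sustains the required rate depends on the model and serving setup.

```python
HOURS_PER_MONTH = 730  # the approximation used in this article
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def effective_cost_per_million(gpu_monthly_usd, tokens_per_month):
    """USD per 1M tokens when a fixed monthly GPU bill is spread over the volume."""
    return gpu_monthly_usd / (tokens_per_month / 1e6)


def required_tokens_per_second(tokens_per_month):
    """Average throughput the GPU must sustain around the clock."""
    return tokens_per_month / SECONDS_PER_MONTH


GPU_MONTHLY_USD = 1200.0  # assumption: illustrative fixed cost, not a quote

for volume in (100e6, 400e6, 1e9, 2e9):
    cost = effective_cost_per_million(GPU_MONTHLY_USD, volume)
    rate = required_tokens_per_second(volume)
    print(f"{volume / 1e6:>6,.0f}M tok/month -> ${cost:.2f} per 1M, {rate:,.0f} tok/s sustained")
```

Under these assumptions the effective price falls from $12.00 per 1M tokens at 100M tokens/month to $0.60 at 2B, as long as the GPU can keep up with the required rate.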

The math: a real example

Imagine a product processing 300M input + 100M output tokens per month (typical of RAG and agents, which read a lot of context). Let's compare.

Via API (cost per token)

Mid-tier model: 300 × $3 + 100 × $15 = $900 + $1,500 = $2,400/month (volumes in millions of tokens, prices per 1M tokens).
Frontier model: 300 × $15 + 100 × $75 = $4,500 + $7,500 = $12,000/month.

Self-hosted (cost per hour)

A dedicated H100 running the open-source model 24/7 on GPUBrazil costs roughly $1,200/month (about 730 hours, or about $1.64 per hour). That same H100 has plenty of headroom: with batching it can serve well beyond the example's 400M tokens, on the order of 1.2 to 2 billion tokens per month (3–5× the example volume), depending on the model.

Scenario (400M tok/month)	Monthly cost	Spare capacity
API — mid-tier	≈ $2,400	—
API — frontier	≈ $12,000	—
H100 self-hosted (24/7)	≈ $1,200	Very high (fits 3–5× the volume)
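The comparison table can be reproduced with a short script, which makes it easy to plug in your own volumes and prices. This is a sketch under the article's reference figures; the function and variable names are illustrative.

```python
def api_cost(input_millions, output_millions, input_price, output_price):
    """API bill in USD; volumes in millions of tokens, prices per 1M tokens."""
    return input_millions * input_price + output_millions * output_price


INPUT_M, OUTPUT_M = 300, 100  # the example workload, in millions of tokens
GPU_MONTHLY_USD = 1200        # the article's approximate H100 figure

scenarios = {
    "API mid-tier": api_cost(INPUT_M, OUTPUT_M, 3, 15),
    "API frontier": api_cost(INPUT_M, OUTPUT_M, 15, 75),
    "H100 self-hosted (24/7)": GPU_MONTHLY_USD,
}

for name, cost in scenarios.items():
    ratio = cost / GPU_MONTHLY_USD
    print(f"{name:<24} ${cost:>7,}  ({ratio:.0f}x the GPU cost)")
```

At this volume the mid-tier API costs 2x the GPU and the frontier API 10x, before counting the GPU's spare capacity.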

💡 The key insight

With self-hosting, the cost does not rise when you double the volume, until the GPU saturates. If your usage grows to 1–2 billion tokens/month at the same 3:1 input/output mix, the API bill would reach roughly $6,000–$12,000 on the mid-tier model and $30,000–$60,000 on the frontier model, while your H100 keeps costing ~$1,200. That's where the gap stops being nice and becomes enormous.
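A related question is where the two cost curves cross. The sketch below estimates the break-even volume from a blended per-token price; it assumes the example's 3:1 input/output mix and the ~$1,200/month GPU figure, and the function names are illustrative.

```python
def blended_price_per_million(input_price, output_price, input_share):
    """Average USD per 1M tokens for a given share of input tokens (0..1)."""
    return input_share * input_price + (1 - input_share) * output_price


def break_even_tokens(gpu_monthly_usd, blended_price):
    """Monthly token volume at which the API bill equals the GPU bill."""
    return gpu_monthly_usd / blended_price * 1e6


GPU_MONTHLY_USD = 1200.0
INPUT_SHARE = 0.75  # 300M input out of 400M total, as in the example

for tier, (inp, out) in {"mid-tier": (3.0, 15.0), "frontier": (15.0, 75.0)}.items():
    price = blended_price_per_million(inp, out, INPUT_SHARE)
    volume = break_even_tokens(GPU_MONTHLY_USD, price)
    print(f"{tier}: ${price:.2f} per 1M blended, break-even at {volume / 1e6:,.0f}M tokens/month")
```

With these inputs the break-even lands at about 200M tokens/month against the mid-tier API and 40M against the frontier API. The result only holds if the GPU can actually serve that volume.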

It doesn't have to be 24/7 — or an H100

Two tweaks make the math even better for smaller workloads:

Turn it on only when you use it. Since billing is hourly, run the GPU during business hours or in batches and shut it off the rest of the time. Running 10 h/day every day costs about 42% of 24/7; running 10 h/day on weekdays only costs about 30%.
Use the right GPU. Quantized models run well on cheaper GPUs (RTX A6000, L40). See how to choose the right GPU for your model.
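The effect of running part-time can be estimated as follows. This sketch derives an hourly rate from the ~$1,200/month figure and assumes nothing is billed while the instance is off, which is worth confirming with your provider; the 22 working days per month is also an assumption.

```python
HOURS_24_7 = 730                 # monthly hours used in this article
HOURLY_RATE = 1200 / HOURS_24_7  # about $1.64/h, derived from ~$1,200/month


def monthly_gpu_cost(hours_per_day, days_per_month, hourly_rate=HOURLY_RATE):
    """GPU bill in USD when the instance runs only part of the time."""
    return hours_per_day * days_per_month * hourly_rate


always_on = monthly_gpu_cost(24, 365 / 12)
every_day = monthly_gpu_cost(10, 365 / 12)
weekdays = monthly_gpu_cost(10, 22)  # assumption: 22 working days per month

for label, cost in [("24/7", always_on), ("10 h/day", every_day), ("10 h weekdays", weekdays)]:
    print(f"{label:<14} ${cost:>6,.0f}  ({cost / always_on:.0%} of 24/7)")
```

The same function works for a cheaper GPU: pass its hourly rate as hourly_rate.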

When the API still wins

Let's be honest — self-hosting isn't always the answer:

Low or sporadic volume: if you make a few calls a day, paying per token is cheaper than keeping (or even spinning up) a GPU.
Unpredictable spikes: the API scales instantly with nothing to manage.
Zero MLOps: if you don't want to operate infrastructure, the API delivers simplicity.

That's why the winning pattern is often hybrid: self-host the bulk of the predictable volume and send spikes or rare cases to a frontier API.
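A rough way to cost the hybrid pattern is to treat the GPU as a fixed fee up to some monthly capacity and bill only the overflow at API prices. The capacity figure below is hypothetical, and modelling overflow by monthly totals is a simplification, since real spikes are about instantaneous load rather than monthly volume.

```python
def hybrid_monthly_cost(total_tokens, gpu_capacity_tokens, gpu_monthly_usd, api_price_per_million):
    """Self-host up to the GPU's capacity; send any overflow to the API."""
    overflow = max(0.0, total_tokens - gpu_capacity_tokens)
    return gpu_monthly_usd + overflow / 1e6 * api_price_per_million


GPU_MONTHLY_USD = 1200.0
GPU_CAPACITY = 1.5e9      # hypothetical: tokens/month one GPU can serve
API_BLENDED_PRICE = 30.0  # frontier reference prices at a 3:1 input/output mix

for total in (400e6, 1.5e9, 1.6e9):
    hybrid = hybrid_monthly_cost(total, GPU_CAPACITY, GPU_MONTHLY_USD, API_BLENDED_PRICE)
    api_only = total / 1e6 * API_BLENDED_PRICE
    print(f"{total / 1e6:>6,.0f}M tokens: hybrid ${hybrid:,.0f} vs API-only ${api_only:,.0f}")
```

In this sketch only the 100M tokens above capacity are billed per token, so the hybrid bill at 1.6B tokens is $4,200 instead of $48,000 for API-only.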

Bonus: you don't only pay in money

Cost per token isn't just dollars. Self-hosting also gives you:

Data sovereignty: prompts and documents never leave your instance — crucial for regulated data.
No lock-in: you control price and availability. No vendor can raise the price of your model or retire it overnight.
Local latency: a GPU near your users avoids long round-trips to overseas servers.

Run the numbers on your own GPU

Get free credit, spin up an open-source LLM by the hour and watch your cost per token drop.


FAQ

When is self-hosting an LLM cheaper than using an API?

When volume is high and predictable. APIs bill per token (grows linearly with usage); self-hosting bills per GPU-hour (fixed, up to the GPU's capacity). The break-even is when the monthly per-token bill would exceed the cost of keeping the GPU running: in practice, hundreds of millions to billions of tokens/month. For low volume, the API tends to win.

How do I calculate whether to switch from an API to my own GPU?

Estimate your input and output tokens per month and multiply each by the API's per-token price. Compare the total with the cost of a GPU for the hours you actually need (billing is hourly). If the GPU serves your volume within its capacity and costs less than the API, it's worth it, and you also gain sovereignty and zero lock-in.

Do I need to keep the GPU running 24/7?

No. Hourly billing means you can turn it on only during usage windows (business hours, batches, peaks) and off the rest of the time. Smaller GPUs running quantized models also reduce the per-hour cost.

Conclusion

The right question isn't "API or my own GPU?" but "which part of my volume makes more sense under each pricing model?". For predictable, growing usage, which is where the cost lives, swapping "per token" for "per GPU-hour" can cut the bill in half or more, and delivers data sovereignty on top. Start by measuring your tokens/month, run a GPU for a few hours, and compare against your API invoice. The numbers usually speak for themselves.