Plan B When an AI Model Vanishes: A Continuity Playbook

In June 2026, Anthropic disabled Claude Fable 5 and Mythos 5 for all customers, complying with a US government directive. Other Claude models, such as Opus 4.8, stayed online — but anyone who depended specifically on those two woke up to a broken product. This article is the playbook you wish you'd read before that morning.

⚡ Playbook TL;DR

(1) Abstract your calls behind an OpenAI-compatible interface. (2) Keep a self-hosted open-source fallback (vLLM/TGI) warm on a GPU. (3) Use a router (LiteLLM) for automatic failover. (4) Test the fallback periodically. (5) Version your model weights.

Step 1 — Abstract behind an OpenAI-compatible interface

The most common mistake is coupling your code to one vendor's SDK. The fix is to talk to every model through the same interface. The de facto standard is OpenAI's /v1/chat/completions API, which many hosted providers expose as a compatible endpoint and which open-source servers such as vLLM and TGI implement. Switching models then means changing a base_url (and a model name), not rewriting your app.

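# Every call goes through an OpenAI-compatible client
import os

from openai import OpenAI

primary = OpenAI(
    base_url="https://api.primary-vendor.com/v1",
    # Read the key from the environment instead of hard-coding it
    api_key=os.environ["PRIMARY_API_KEY"],
)

def chat(messages, model="primary-model"):
    # Returns the SDK's ChatCompletion object; the reply text is in
    # .choices[0].message.content
    return primary.chat.completions.create(model=model, messages=messages)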
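To make "swapping a base_url" literally a configuration change, the endpoint, key, and model name can live in environment variables instead of code. The sketch below assumes a hypothetical variable prefix (for example `LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL`); these names are illustrative, not a convention of any library, and the block itself uses only the standard library.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Backend:
    """Everything needed to reach one OpenAI-compatible endpoint."""
    base_url: str
    api_key: str
    model: str


def load_backend(prefix: str, env: Mapping[str, str] = os.environ) -> Backend:
    """Read <PREFIX>_BASE_URL, <PREFIX>_API_KEY and <PREFIX>_MODEL (hypothetical names)."""
    try:
        return Backend(
            base_url=env[f"{prefix}_BASE_URL"].rstrip("/"),
            api_key=env[f"{prefix}_API_KEY"],
            model=env[f"{prefix}_MODEL"],
        )
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable {missing}") from None


# Usage with the OpenAI SDK (requires `pip install openai`):
#   cfg = load_backend("LLM")
#   client = OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)
#   client.chat.completions.create(model=cfg.model, messages=messages)
```

Moving to the self-hosted fallback is then a matter of pointing the same three variables at the vLLM instance and restarting, with no code change.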

Step 2 — Keep a self-hosted fallback warm

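The real plan B is an open-source model running on your own GPUs. With the vLLM template in the GPUBrazil Console, you can spin up an OpenAI-compatible endpoint in minutes, serving DeepSeek, Qwen 3, Llama 4, or Mistral models. Size the instance to the model: a 235B-parameter mixture-of-experts model such as Qwen3-235B-A22B needs a multi-GPU node, while smaller dense models (roughly 8B to 32B parameters) fit on a single large GPU. Since billing is hourly in reais via Pix, you decide when to keep it running.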

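# OpenAI client pointed at your vLLM on GPUBrazil
from openai import OpenAI

fallback = OpenAI(
    base_url="https://your-instance.gpubrazil.com/v1",
    # vLLM only checks this key if the server was started with --api-key
    api_key="your-local-key",
)

resp = fallback.chat.completions.create(
    # Must match the model name vLLM serves (the Hugging Face repo ID by default)
    model="Qwen/Qwen3-235B-A22B",
    messages=[{"role": "user", "content": "Summarize this contract."}],
)
print(resp.choices[0].message.content)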
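Before routing traffic to a freshly booted instance, it helps to confirm the server is up and serving the expected model. vLLM's OpenAI-compatible server lists its loaded models at `GET /v1/models`; the sketch below polls that endpoint with the standard library until the model appears or a deadline passes. The `wait_until_ready` helper is hypothetical, and its timeout and polling interval are arbitrary defaults.

```python
import json
import time
import urllib.request


def wait_until_ready(base_url: str, model: str, api_key: str = "",
                     timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll <base_url>/models until `model` is listed; return False if the deadline passes."""
    url = base_url.rstrip("/") + "/models"
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req, timeout=10) as resp:
                body = json.load(resp)
            # OpenAI-style list response: {"data": [{"id": "<model name>"}, ...]}
            if any(m.get("id") == model for m in body.get("data", [])):
                return True
        except (OSError, ValueError):
            # OSError covers connection refused, timeouts and HTTP errors (URLError);
            # ValueError covers a non-JSON reply while the server is still booting
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the instance above, the call would be `wait_until_ready("https://your-instance.gpubrazil.com/v1", "Qwen/Qwen3-235B-A22B", api_key="your-local-key")`, typically run right after boot and before flipping traffic.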

Step 3 — The automatic failover pattern

With both ends speaking the same protocol, failover becomes a try/except: if the primary call fails, send the same request to the self-hosted model. Users get an answer instead of an error, though it may be slower or slightly different in style.

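import logging

import openai

log = logging.getLogger(__name__)

def chat_with_failover(messages, model="primary-model"):
    try:
        # Short timeout and a single retry, so a dead primary
        # doesn't stall the request for minutes before failover
        return primary.with_options(timeout=15, max_retries=1).chat.completions.create(
            model=model, messages=messages
        )
    except openai.APIError as e:
        # Primary unavailable (suspended, unstable, blocked, model removed):
        # log the cause, then fall back to the self-hosted open-source model
        log.warning("Primary model failed (%s); using fallback", e)
        return fallback.chat.completions.create(
            model="Qwen/Qwen3-235B-A22B", messages=messages
        )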
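The same idea extends to more than two backends, say a primary vendor, a second vendor, and the self-hosted model. Below is a minimal, SDK-agnostic sketch: each backend is a hypothetical `(name, call)` pair where `call` takes the messages and returns the reply text, so the chain can be unit-tested without any network. It tries backends in order and, if every one fails, raises an error listing each failure rather than only the last one.

```python
from typing import Callable, Sequence

Messages = list[dict[str, str]]
NamedBackend = tuple[str, Callable[[Messages], str]]


class AllBackendsFailed(RuntimeError):
    """Raised when every backend in the chain raised an exception."""


def chat_with_chain(backends: Sequence[NamedBackend], messages: Messages) -> tuple[str, str]:
    """Try each backend in order; return (backend_name, reply) from the first success."""
    errors: list[str] = []
    for name, call in backends:
        try:
            return name, call(messages)
        except Exception as exc:  # broad on purpose: any failure moves on to the next backend
            errors.append(f"{name}: {exc!r}")
    raise AllBackendsFailed("; ".join(errors) or "no backends configured")
```

Returning the backend name alongside the reply makes it easy to log or meter how often traffic is actually landing on the fallback. Each `call` can be a one-line lambda around `primary` or `fallback` that returns `.choices[0].message.content`.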

Step 4 — Routing with LiteLLM (managed failover)

For something more robust than a try/except, use a proxy/router like LiteLLM. It puts multiple models behind a single interface, with automatic failover, load balancing, and cost limits. Your app talks to LiteLLM; LiteLLM decides which backend to hit.

Step 5 — Test the fallback and version the weights

A plan B no one tests is a plan B that doesn't exist. Two practices close out the playbook:

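Test periodically: schedule a game day where you simulate the primary going down and verify that the fallback boots, that traffic actually reaches it, and that its answers on a fixed set of your real prompts are of acceptable quality.
Version the weights: keep your own copy of the chosen open-source model's weights, recording the exact upstream revision (for example, the Hugging Face commit hash) and file checksums. If the source repository disappears or changes, you still have exactly the model you validated.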
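A game day can start from a small automated check. The sketch below is a hypothetical `smoke_test` helper: it sends fixed prompts to any `ask(prompt) -> str` function and flags errors, answers missing expected keywords, and slow responses. Keyword matching is only a crude proxy for quality, an assumption to replace with a proper evaluation on your own prompts, and the 30-second latency budget is arbitrary.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckResult:
    prompt: str
    ok: bool
    latency_s: float
    reason: str


def smoke_test(ask: Callable[[str], str],
               cases: Sequence[tuple[str, Sequence[str]]],
               max_latency_s: float = 30.0) -> list[CheckResult]:
    """Run each (prompt, expected_keywords) case through `ask` and grade the answer."""
    results = []
    for prompt, keywords in cases:
        start = time.monotonic()
        try:
            answer = ask(prompt)
        except Exception as exc:  # any error counts as a failed check
            results.append(CheckResult(prompt, False, time.monotonic() - start, repr(exc)))
            continue
        latency = time.monotonic() - start
        missing = [k for k in keywords if k.lower() not in answer.lower()]
        if missing:
            reason = f"missing keywords: {missing}"
        elif latency > max_latency_s:
            reason = f"too slow: {latency:.1f}s"
        else:
            reason = "ok"
        results.append(CheckResult(prompt, reason == "ok", latency, reason))
    return results
```

Pointing `ask` at a function that calls the fallback client and returns `.choices[0].message.content`, then running the check on a schedule, turns the game day into a routine signal instead of a yearly fire drill.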
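For the weights, one way to pin down "exactly the model you validated" is a checksum manifest. The sketch below walks a local model directory (for example, one downloaded with `huggingface_hub.snapshot_download(repo_id=..., revision=<commit hash>, local_dir=...)`) and writes the upstream revision plus the SHA-256 of every file to JSON; re-running `verify_manifest` before a failover confirms the copy has not changed. The helper names and the manifest filename are illustrative choices.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = "weights-manifest.json"  # illustrative filename


def sha256_file(path: Path, chunk: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB weight shards don't fill RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def file_hashes(model_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to model_dir) to its SHA-256, skipping the manifest."""
    return {
        p.relative_to(model_dir).as_posix(): sha256_file(p)
        for p in sorted(model_dir.rglob("*"))
        if p.is_file() and p.name != MANIFEST
    }


def write_manifest(model_dir: Path, revision: str) -> Path:
    """Record the upstream revision plus per-file hashes next to the weights."""
    out = model_dir / MANIFEST
    out.write_text(json.dumps({"revision": revision, "files": file_hashes(model_dir)}, indent=2))
    return out


def verify_manifest(model_dir: Path) -> bool:
    """True only if exactly the recorded files exist with identical hashes."""
    recorded = json.loads((model_dir / MANIFEST).read_text())
    return recorded["files"] == file_hashes(model_dir)
```

Storing the manifest alongside a backup of the weights (object storage, a second disk) means a failover can start from a copy you have already validated, even if the upstream repository is gone.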

💡 Why a dedicated GPU

Beyond continuity, running the fallback on your own dedicated instance keeps your data under your control (relevant for Brazil's LGPD data-protection law), since prompts and responses handled by the fallback never pass through a third-party model API. You pay by the hour in reais, with no capex or FX risk.

Build your plan B today, not in the next crisis

Spin up an open-source vLLM fallback on a dedicated GPU in minutes.


Frequently asked questions

What do I do when the AI model my company depends on gets switched off?

Have a plan B ready before the crisis: abstract your calls behind an OpenAI-compatible interface, keep a self-hosted open-source model (vLLM or TGI) warm on a GPU, and use a router like LiteLLM for automatic failover. When the primary model goes down, traffic routes to the fallback with no code change.

What happened to Claude Fable 5 and Mythos 5?

In June 2026, following a US government directive, Anthropic disabled Claude Fable 5 and Mythos 5 for all customers. Other Claude models, such as Opus 4.8, were unaffected. The episode showed that a production model can vanish due to decisions outside your control.

How do I keep a self-hosted fallback cheap without leaving it running all the time?

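Keep the model weights versioned and a vLLM template ready on GPUBrazil so you can spin up the fallback quickly, and test it periodically to make sure it boots and responds. Since billing is hourly in reais, you pay for the GPU only while the fallback is running, which avoids paying for idle hardware. The trade-off is start-up time: a cold fallback has to boot and load its weights before it can serve, so keep it warm if your product cannot tolerate that gap.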

Conclusion

The Fable 5 and Mythos 5 suspension wasn't an isolated event — it was a rehearsal of what can happen to any third-party dependency. Business continuity in AI isn't luck: it's architecture. Abstract, keep a self-hosted fallback alive on a dedicated GPU, route with failover, and test. So the next time a model vanishes, your product keeps answering.