Cloud Providers
Detailed guide to cloud LLM providers — OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Cohere, and OpenRouter — with pricing, model specs, and configuration.
Each cloud provider is accessed through LiteLLM using the provider/model-name format. Set your API key and model in environment variables.
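As a sketch of that pattern (the default model name below is illustrative, not prescriptive):

```python
import os

# LiteLLM model strings use the provider/model-name format; the prefix
# before the first "/" tells LiteLLM which provider API to call.
MODEL = os.environ.get("LITELLM_MODEL", "openai/gpt-5-mini")

def completion_kwargs(ticket_text: str) -> dict:
    """Build the kwargs passed to litellm.completion() (illustrative)."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": ticket_text}],
    }

# Actual call (requires `pip install litellm` and the provider's API key):
# import litellm
# response = litellm.completion(**completion_kwargs("My invoice is wrong"))
# print(response.choices[0].message.content)
```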
OpenAI
The most widely used LLM provider, with a broad ecosystem and fast release cadence. The GPT-5 family launched in 2025, and the GPT-4.1 series offers OpenAI's largest context windows at 1M tokens.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| gpt-5 | 400K | $1.25 | $10.00 | Flagship model, strongest reasoning |
| gpt-5-mini | 400K | $0.125 | $1.00 | Best value for most tasks |
| gpt-5-nano | 400K | $0.05 | $0.40 | Ultra-cheap for high-volume classification |
| gpt-4.1 | 1M | $2.00 | $8.00 | Largest context window, code-optimized |
| gpt-4.1-mini | 1M | $0.40 | $1.60 | 1M context at low cost |
GPT-4o is legacy and no longer recommended for new deployments.
Configuration
OPENAI_API_KEY=sk-...
LITELLM_MODEL=openai/gpt-5-mini
Strengths for support
- Excellent function calling for structured data extraction
- Fast response times for real-time chat
- Batch API available at 50% discount for async tasks (triage, QA scoring)
- GPT-5-nano is extremely cheap for high-volume classification and routing
- GPT-4.1 offers 1M context for processing entire conversation histories
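To make the cost points concrete, here is a back-of-envelope sketch using the list prices from the table above; the ticket volumes are invented for illustration.

```python
# (input $/M tokens, output $/M tokens) from the pricing table above
PRICES = {
    "gpt-5-nano": (0.05, 0.40),
    "gpt-5-mini": (0.125, 1.00),
}

def monthly_cost(model: str, tickets: int, in_tok: int, out_tok: int,
                 batch: bool = False) -> float:
    """Estimate monthly spend; batch=True applies the 50% Batch API discount."""
    p_in, p_out = PRICES[model]
    cost = (tickets * in_tok / 1e6) * p_in + (tickets * out_tok / 1e6) * p_out
    return cost / 2 if batch else cost

# 100k tickets/month at ~500 input + 50 output tokens each:
# gpt-5-nano realtime: 50M * $0.05 + 5M * $0.40 = $4.50
# gpt-5-nano batched:  $2.25
```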
Anthropic
Known for safety research, long context, and strong reasoning capabilities. The Claude 4.6 family offers up to 1M token context windows.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| claude-opus-4.6 | 1M | $5.00 | $25.00 | Maximum capability, complex reasoning |
| claude-sonnet-4.6 | 1M | $3.00 | $15.00 | Best reasoning-to-cost ratio |
| claude-haiku-4.5 | 200K | $1.00 | $5.00 | Fast and affordable |
Older models (Claude Sonnet 4, Claude Haiku 3.5) are deprecated.
Configuration
ANTHROPIC_API_KEY=sk-ant-...
LITELLM_MODEL=anthropic/claude-sonnet-4.6
Strengths for support
- Extended thinking mode for complex QA evaluations
- Strong instruction following reduces prompt engineering effort
- Excellent at nuanced tone and empathy in draft replies
- 1M context on Opus and Sonnet handles entire knowledge bases in a single call
Google
Offers massive context windows and competitive pricing, especially on the Flash tier. Gemini 2.5 Flash is currently the most popular model on OpenRouter by weekly usage.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| gemini-2.5-pro | 1M | $1.25 | $10.00 | Strongest Google model, 1M context |
| gemini-2.5-flash | 1M | $0.30 | $2.50 | Most popular model by usage, great value |
| gemini-3-flash-preview | 1M | $0.50 | $3.00 | Next-gen preview, improved reasoning |
Gemini 2.0 Flash is deprecated as of June 2026. Gemma 3 27B is the current open-weight model (available via Ollama and OpenRouter).
Configuration
GEMINI_API_KEY=...
LITELLM_MODEL=google/gemini-2.5-flash
Strengths for support
- 1M token context is ideal for knowledge base article analysis
- Gemini 2.5 Flash offers excellent quality at low cost for classification
- Grounding with Google Search for fact-checking responses
- Native multimodal support for image-based support tickets
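For image-based tickets, LiteLLM accepts the OpenAI-style content-list message format and translates it for Gemini. A sketch (the screenshot URL is a placeholder):

```python
def image_ticket_message(question: str, screenshot_url: str) -> dict:
    """OpenAI-style multimodal message; LiteLLM translates this for Gemini."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": screenshot_url}},
        ],
    }

# litellm.completion(model="google/gemini-2.5-flash",
#                    messages=[image_ticket_message("What error is shown?",
#                                                   "https://example.com/shot.png")])
```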
xAI
A newer provider offering competitive models with strong reasoning and very large context windows. Grok 4.1 Fast provides 2M tokens of context -- the largest available from any cloud provider.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| grok-4 | 256K | $3.00 | $15.00 | Flagship reasoning model |
| grok-4.1-fast | 2M | $0.20 | $0.50 | Largest context window available, very fast |
Configuration
XAI_API_KEY=...
LITELLM_MODEL=xai/grok-4.1-fast
Strengths for support
- 2M token context on Grok 4.1 Fast can process entire ticket histories and knowledge bases together
- Grok 4.1 Fast is extremely cost-effective at $0.20/$0.50 per million tokens
- Strong reasoning capabilities on Grok 4 for complex QA and analysis
- Fast inference speeds suitable for real-time support workflows
DeepSeek
Chinese AI lab offering extremely cost-effective models. DeepSeek V3.2 is one of the cheapest capable models available, and R1 provides strong reasoning at a fraction of the cost of competitors.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| deepseek-v3.2 | 164K | $0.28 | $0.42 | Extremely cheap general-purpose model |
| deepseek-r1 | 64K | $0.70 | $2.50 | Reasoning model with chain-of-thought |
DeepSeek V3.2 replaced the earlier V3 model and is also available as deepseek-chat.
Configuration
DEEPSEEK_API_KEY=...
LITELLM_MODEL=deepseek/deepseek-v3.2
Strengths for support
- V3.2 at $0.28/$0.42 per million tokens is ideal for high-volume triage and classification
- R1 provides strong reasoning at a fraction of the cost of GPT-5 or Claude Opus 4.6 (output tokens run roughly 4-10x cheaper)
- Good multilingual performance
- Open-weight R1 model also available for local deployment via Ollama
Mistral
European AI provider with strong multilingual capabilities and EU data residency options.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| mistral-large-3 | 128K | $2.00 | $6.00 | Flagship model, strong reasoning |
| mistral-medium-3 | 128K | $1.00 | $3.00 | Balanced quality and cost |
| mistral-small-3.1 | 128K | $0.20 | $0.60 | Very cost-effective |
| codestral | 256K | $0.20 | $0.60 | Optimized for code generation |
Configuration
MISTRAL_API_KEY=...
LITELLM_MODEL=mistral/mistral-small-3.1
Strengths for support
- EU-hosted endpoints for GDPR compliance -- data stays in Europe
- Strong multilingual performance across European languages
- Mistral Small 3.1 offers excellent quality-to-cost for classification tasks
- Open-weight versions available for on-premises deployment
Cohere
Specializes in retrieval-augmented generation (RAG) with built-in citation grounding.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| command-a | 128K | Contact | Contact | RAG-optimized flagship (replaces Command R+) |
Configuration
COHERE_API_KEY=...
LITELLM_MODEL=cohere/command-a
Strengths for support
- Citation grounding -- responses include source references from your knowledge base
- Built-in RAG pipeline reduces integration complexity
- Strong reranking capabilities for search result quality
- Embed models for efficient semantic search
OpenRouter
A unified API gateway that provides access to 300+ models from every major provider through a single API key. Instead of managing separate API keys for OpenAI, Anthropic, Google, and others, you route everything through OpenRouter.
How it works
OpenRouter acts as a proxy -- you send requests using the openrouter/ prefix, and OpenRouter forwards them to the underlying provider. Pricing is the provider's base rate plus a small markup (typically 0--15% depending on the model). Open-source models are available at very competitive rates since OpenRouter aggregates capacity from multiple inference providers. Gemini 2.5 Flash is currently the most popular model on the platform by weekly usage.
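The markup can be checked against the tables in this guide; for example, gpt-5-mini lists at $0.125/M input direct versus ~$0.14/M via OpenRouter:

```python
def markup_pct(direct_price: float, router_price: float) -> float:
    """Percentage markup OpenRouter adds over the provider's base rate."""
    return (router_price / direct_price - 1) * 100

# markup_pct(0.125, 0.14) is about 12%, inside the stated 0-15% range.
```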
Popular models via OpenRouter
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| openai/gpt-5-mini | 400K | ~$0.14 | ~$1.15 | OpenAI via OpenRouter |
| anthropic/claude-sonnet-4.6 | 1M | ~$3.00 | ~$15.00 | Anthropic via OpenRouter |
| google/gemini-2.5-flash | 1M | ~$0.30 | ~$2.50 | Most popular model on OpenRouter |
| google/gemini-2.5-pro | 1M | ~$1.25 | ~$10.00 | Google via OpenRouter |
| xai/grok-4.1-fast | 2M | ~$0.20 | ~$0.50 | Largest context available |
| meta-llama/llama-4-scout | 10M | $0.15 | $0.60 | Open model, hosted inference |
| meta-llama/llama-4-maverick | 400K | $0.25 | $1.00 | Open model, hosted inference |
| meta-llama/llama-3.3-70b | 128K | $0.10 | $0.32 | Mature open model, very cheap |
| google/gemma-3-27b | 128K | $0.10 | $0.20 | Open model, hosted inference |
| moonshot/kimi-k2 | 128K | $0.57 | $2.30 | Strong multilingual reasoning |
| deepseek/deepseek-r1 | 64K | ~$0.70 | ~$2.50 | Reasoning model |
Configuration
OPENROUTER_API_KEY=sk-or-...
LITELLM_MODEL=openrouter/google/gemini-2.5-flash
OpenRouter uses the format openrouter/provider/model in LiteLLM. Set a single OPENROUTER_API_KEY and you can access any model they support.
Strengths for support
- Single API key for every provider -- simplifies secrets management and billing
- Model fallback -- configure automatic failover to a backup model if the primary is unavailable
- Open model hosting -- run Llama 4 Scout, Gemma 3, and other open models without your own GPU infrastructure
- Usage dashboard -- centralized view of spend across all models and providers
- No vendor lock-in -- switch between providers by changing one string, no API key rotation needed
- Free tier models -- some open models are available with free credits for experimentation
When to use OpenRouter vs direct APIs
| Scenario | Recommendation |
|---|---|
| Using a single provider | Direct API (lower cost, fewer hops) |
| Using 2+ providers | OpenRouter (one key, unified billing) |
| Experimenting with models | OpenRouter (instant access to everything) |
| Using open models without GPUs | OpenRouter (hosted Llama 4, Gemma 3, etc.) |
| Maximum latency control | Direct API (no proxy overhead) |
| Enterprise with existing contracts | Direct API (use negotiated rates) |
| Need the largest context window | OpenRouter with xAI Grok 4.1 Fast (2M tokens) |
Model routing and fallbacks
OpenRouter supports automatic model routing -- if one provider is down or slow, requests can fall back to an alternative:
import litellm

# Primary: Claude Sonnet 4.6; fallback: GPT-5-mini, both via OpenRouter
async def answer(messages: list[dict]):
    return await litellm.acompletion(
        model="openrouter/anthropic/claude-sonnet-4.6",
        messages=messages,
        fallbacks=["openrouter/openai/gpt-5-mini"],
    )
API key management
Store API keys in your .env file (never commit them):
# .env -- direct provider keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
XAI_API_KEY=...
DEEPSEEK_API_KEY=...
MISTRAL_API_KEY=...
COHERE_API_KEY=...
# Or use a single OpenRouter key for all providers
OPENROUTER_API_KEY=sk-or-...
LiteLLM automatically picks up the right key based on the model prefix. You can also configure keys programmatically via litellm.api_key.
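The prefix-to-key mapping can be sketched as follows; this is a simplified illustration of the behavior described above, not LiteLLM's actual lookup code:

```python
# Maps the provider prefix of a model string to the env var holding its key
# (mirrors the prefixes used throughout this guide; illustrative only).
PREFIX_TO_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GEMINI_API_KEY",
    "xai": "XAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "cohere": "COHERE_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

def env_var_for(model: str) -> str:
    """Which env var supplies the key for a given provider/model string."""
    return PREFIX_TO_ENV[model.split("/", 1)[0]]
```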
Rate limits
All providers enforce rate limits. For production deployments:
- OpenAI: Request tier upgrades via the dashboard for higher TPM (tokens per minute)
- Anthropic: Workspaces allow per-team rate limit allocation
- Google: Vertex AI provides higher quotas than the consumer API
- xAI: Rate limits scale with usage tier; contact for enterprise quotas
- DeepSeek: Rate limits are generous on paid tiers; may throttle during peak demand
- Mistral: Contact sales for enterprise rate limits
- OpenRouter: Rate limits vary by underlying provider; dashboard shows real-time usage
LiteLLM supports automatic retries with exponential backoff via the num_retries parameter on completion calls.
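A sketch of what an exponential backoff schedule looks like; the delays below are illustrative, not LiteLLM's exact internal timing:

```python
def backoff_delays(num_retries: int, base: float = 1.0) -> list:
    """Exponential backoff: wait base * 2**i seconds before retry attempt i."""
    return [base * (2 ** i) for i in range(num_retries)]

# With 3 retries the waits grow 1s, 2s, 4s before giving up.
# In LiteLLM, retries are enabled per call, e.g.:
# litellm.completion(model=..., messages=..., num_retries=3)
```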