Cloud Providers
Detailed guide to cloud LLM providers — OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Cohere, and OpenRouter — with pricing, model specs, and configuration.
Each cloud provider is accessed through LiteLLM using the provider/model-name format. Set your API key and model in environment variables.
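As a sketch of that pattern (the default model name below is illustrative, not prescriptive):

```python
import os

# LiteLLM model strings use the provider/model-name format; the prefix
# before the first "/" tells LiteLLM which provider API to call.
MODEL = os.environ.get("LITELLM_MODEL", "openai/gpt-5-mini")

def completion_kwargs(ticket_text: str) -> dict:
    """Build the kwargs passed to litellm.completion() (illustrative)."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": ticket_text}],
    }

# Actual call (requires `pip install litellm` and the provider's API key):
# import litellm
# response = litellm.completion(**completion_kwargs("My invoice is wrong"))
# print(response.choices[0].message.content)
```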
OpenAI
The most widely used LLM provider, with a broad ecosystem and fast release cadence. The GPT-5 family launched in 2025, and the GPT-4.1 series offers OpenAI's largest context windows at 1M tokens.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| gpt-5 | 400K | $1.25 | $10.00 | Flagship model, strongest reasoning |
| gpt-5-mini | 400K | $0.125 | $1.00 | Best value for most tasks |
| gpt-5-nano | 400K | $0.05 | $0.40 | Ultra-cheap for high-volume classification |
| gpt-4.1 | 1M | $2.00 | $8.00 | Largest context window, code-optimized |
| gpt-4.1-mini | 1M | $0.40 | $1.60 | 1M context at low cost |
GPT-4o is legacy and no longer recommended for new deployments.
Configuration
OPENAI_API_KEY=sk-...
LITELLM_MODEL=openai/gpt-5-mini
Strengths for support
- Excellent function calling for structured data extraction
- Fast response times for real-time chat
- Batch API available at 50% discount for async tasks (triage, QA scoring)
- GPT-5-nano is extremely cheap for high-volume classification and routing
- GPT-4.1 offers 1M context for processing entire conversation histories
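To make the cost points concrete, here is a back-of-envelope sketch using the list prices from the table above; the ticket volumes are invented for illustration.

```python
# (input $/M tokens, output $/M tokens) from the pricing table above
PRICES = {
    "gpt-5-nano": (0.05, 0.40),
    "gpt-5-mini": (0.125, 1.00),
}

def monthly_cost(model: str, tickets: int, in_tok: int, out_tok: int,
                 batch: bool = False) -> float:
    """Estimate monthly spend; batch=True applies the 50% Batch API discount."""
    p_in, p_out = PRICES[model]
    cost = (tickets * in_tok / 1e6) * p_in + (tickets * out_tok / 1e6) * p_out
    return cost / 2 if batch else cost

# 100k tickets/month at ~500 input + 50 output tokens each:
# gpt-5-nano realtime: 50M * $0.05 + 5M * $0.40 = $4.50
# gpt-5-nano batched:  $2.25
```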
Anthropic
Known for safety research, long context, and strong reasoning capabilities. The Claude 4.6 family offers up to 1M token context windows.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| claude-opus-4.6 | 1M | $5.00 | $25.00 | Maximum capability, complex reasoning |
| claude-sonnet-4.6 | 1M | $3.00 | $15.00 | Best reasoning-to-cost ratio |
| claude-haiku-4.5 | 200K | $1.00 | $5.00 | Fast and affordable |
Older models (Claude Sonnet 4, Claude Haiku 3.5) are deprecated.
Configuration
ANTHROPIC_API_KEY=sk-ant-...
LITELLM_MODEL=anthropic/claude-sonnet-4.6
Strengths for support
- Extended thinking mode for complex QA evaluations
- Strong instruction following reduces prompt engineering effort
- Excellent at nuanced tone and empathy in draft replies
- 1M context on Opus and Sonnet handles entire knowledge bases in a single call
Google
Offers massive context windows and competitive pricing, especially on the Flash tier. Gemini 2.5 Flash is currently the most popular model on OpenRouter by weekly usage.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| gemini-2.5-pro | 1M | $1.25 | $10.00 | Strongest Google model, 1M context |
| gemini-2.5-flash | 1M | $0.30 | $2.50 | Most popular model by usage, great value |
| gemini-3-flash-preview | 1M | $0.50 | $3.00 | Next-gen preview, improved reasoning |
Gemini 2.0 Flash is deprecated as of June 2026. Gemma 3 27B is the current open-weight model (available via Ollama and OpenRouter).
Configuration
GEMINI_API_KEY=...
LITELLM_MODEL=google/gemini-2.5-flash
Strengths for support
- 1M token context is ideal for knowledge base article analysis
- Gemini 2.5 Flash offers excellent quality at low cost for classification
- Grounding with Google Search for fact-checking responses
- Native multimodal support for image-based support tickets
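For image-based tickets, LiteLLM accepts the OpenAI-style content-list message format and translates it for Gemini. A sketch (the screenshot URL is a placeholder):

```python
def image_ticket_message(question: str, screenshot_url: str) -> dict:
    """OpenAI-style multimodal message; LiteLLM translates this for Gemini."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": screenshot_url}},
        ],
    }

# litellm.completion(model="google/gemini-2.5-flash",
#                    messages=[image_ticket_message("What error is shown?",
#                                                   "https://example.com/shot.png")])
```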
xAI
A newer provider offering competitive models with strong reasoning and very large context windows. Grok 4.1 Fast provides 2M tokens of context -- the largest available from any cloud provider.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| grok-4 | 256K | $3.00 | $15.00 | Flagship reasoning model |
| grok-4.1-fast | 2M | $0.20 | $0.50 | Largest context window available, very fast |
Configuration
XAI_API_KEY=...
LITELLM_MODEL=xai/grok-4.1-fast
Strengths for support
- 2M token context on Grok 4.1 Fast can process entire ticket histories and knowledge bases together
- Grok 4.1 Fast is extremely cost-effective at $0.20/$0.50 per million tokens
- Strong reasoning capabilities on Grok 4 for complex QA and analysis
- Fast inference speeds suitable for real-time support workflows
DeepSeek
Chinese AI lab offering extremely cost-effective models. DeepSeek V3.2 is one of the cheapest capable models available, and R1 provides strong reasoning at a fraction of the cost of competitors.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| deepseek-v3.2 | 164K | $0.28 | $0.42 | Extremely cheap general-purpose model |
| deepseek-r1 | 64K | $0.70 | $2.50 | Reasoning model with chain-of-thought |
DeepSeek V3.2 replaced the earlier V3 model and is also available as deepseek-chat.
Configuration
DEEPSEEK_API_KEY=...
LITELLM_MODEL=deepseek/deepseek-v3.2
Strengths for support
- V3.2 at $0.28/$0.42 per million tokens is ideal for high-volume triage and classification
- R1 provides strong reasoning at a fraction of the cost of GPT-5 or Claude Opus 4.6 (output tokens run roughly 4-10x cheaper)
- Good multilingual performance
- Open-weight R1 model also available for local deployment via Ollama
Mistral
European AI provider with strong multilingual capabilities and EU data residency options.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| mistral-large-3 | 128K | $2.00 | $6.00 | Flagship model, strong reasoning |
| mistral-medium-3 | 128K | $1.00 | $3.00 | Balanced quality and cost |
| mistral-small-3.1 | 128K | $0.20 | $0.60 | Very cost-effective |
| codestral | 256K | $0.20 | $0.60 | Optimized for code generation |
Configuration
MISTRAL_API_KEY=...
LITELLM_MODEL=mistral/mistral-small-3.1
Strengths for support
- EU-hosted endpoints for GDPR compliance -- data stays in Europe
- Strong multilingual performance across European languages
- Mistral Small 3.1 offers excellent quality-to-cost for classification tasks
- Open-weight versions available for on-premises deployment
Cohere
Specializes in retrieval-augmented generation (RAG) with built-in citation grounding.
Models
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| command-a | 128K | Contact | Contact | RAG-optimized flagship (replaces Command R+) |
Configuration
COHERE_API_KEY=...
LITELLM_MODEL=cohere/command-a
Strengths for support
- Citation grounding -- responses include source references from your knowledge base
- Built-in RAG pipeline reduces integration complexity
- Strong reranking capabilities for search result quality
- Embed models for efficient semantic search
OpenRouter
A unified API gateway that provides access to 300+ models from every major provider through a single API key. Instead of managing separate API keys for OpenAI, Anthropic, Google, and others, you route everything through OpenRouter.
How it works
OpenRouter acts as a proxy -- you send requests using the openrouter/ prefix, and OpenRouter forwards them to the underlying provider. Pricing is the provider's base rate plus a small markup (typically 0--15% depending on the model). Open-source models are available at very competitive rates since OpenRouter aggregates capacity from multiple inference providers. Gemini 2.5 Flash is currently the most popular model on the platform by weekly usage.
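The markup can be checked against the tables in this guide; for example, gpt-5-mini lists at $0.125/M input direct versus ~$0.14/M via OpenRouter:

```python
def markup_pct(direct_price: float, router_price: float) -> float:
    """Percentage markup OpenRouter adds over the provider's base rate."""
    return (router_price / direct_price - 1) * 100

# markup_pct(0.125, 0.14) is about 12%, inside the stated 0-15% range.
```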
Popular models via OpenRouter
| Model | Context | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| openai/gpt-5-mini | 400K | ~$0.14 | ~$1.15 | OpenAI via OpenRouter |
| anthropic/claude-sonnet-4.6 | 1M | ~$3.00 | ~$15.00 | Anthropic via OpenRouter |
| google/gemini-2.5-flash | 1M | ~$0.30 | ~$2.50 | Most popular model on OpenRouter |
| google/gemini-2.5-pro | 1M | ~$1.25 | ~$10.00 | Google via OpenRouter |
| xai/grok-4.1-fast | 2M | ~$0.20 | ~$0.50 | Largest context available |
| meta-llama/llama-4-scout | 10M | $0.15 | $0.60 | Open model, hosted inference |
| meta-llama/llama-4-maverick | 400K | $0.25 | $1.00 | Open model, hosted inference |
| meta-llama/llama-3.3-70b | 128K | $0.10 | $0.32 | Mature open model, very cheap |
| google/gemma-3-27b | 128K | $0.10 | $0.20 | Open model, hosted inference |
| moonshot/kimi-k2 | 128K | $0.57 | $2.30 | Strong multilingual reasoning |
| deepseek/deepseek-r1 | 64K | ~$0.70 | ~$2.50 | Reasoning model |
Configuration
OPENROUTER_API_KEY=sk-or-...
LITELLM_MODEL=openrouter/google/gemini-2.5-flash
OpenRouter uses the format openrouter/provider/model in LiteLLM. Set a single OPENROUTER_API_KEY and you can access any model they support.
Strengths for support
- Single API key for every provider -- simplifies secrets management and billing
- Model fallback -- configure automatic failover to a backup model if the primary is unavailable
- Open model hosting -- run Llama 4 Scout, Gemma 3, and other open models without your own GPU infrastructure
- Usage dashboard -- centralized view of spend across all models and providers
- No vendor lock-in -- switch between providers by changing one string, no API key rotation needed
- Free tier models -- some open models are available with free credits for experimentation
When to use OpenRouter vs direct APIs
| Scenario | Recommendation |
|---|---|
| Using a single provider | Direct API (lower cost, fewer hops) |
| Using 2+ providers | OpenRouter (one key, unified billing) |
| Experimenting with models | OpenRouter (instant access to everything) |
| Using open models without GPUs | OpenRouter (hosted Llama 4, Gemma 3, etc.) |
| Maximum latency control | Direct API (no proxy overhead) |
| Enterprise with existing contracts | Direct API (use negotiated rates) |
| Need the largest context window | OpenRouter with xAI Grok 4.1 Fast (2M tokens) |
Model routing and fallbacks
OpenRouter supports automatic model routing -- if one provider is down or slow, requests can fall back to an alternative:
import litellm

# Primary: Claude Sonnet 4.6; fallback: GPT-5-mini, both via OpenRouter
async def answer(messages: list[dict]):
    return await litellm.acompletion(
        model="openrouter/anthropic/claude-sonnet-4.6",
        messages=messages,
        fallbacks=["openrouter/openai/gpt-5-mini"],
    )
API key management
Store API keys in your .env file (never commit them):
# .env -- direct provider keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
XAI_API_KEY=...
DEEPSEEK_API_KEY=...
MISTRAL_API_KEY=...
COHERE_API_KEY=...
# Or use a single OpenRouter key for all providers
OPENROUTER_API_KEY=sk-or-...
LiteLLM automatically picks up the right key based on the model prefix. You can also configure keys programmatically via litellm.api_key.
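The prefix-to-key mapping can be sketched as follows; this is a simplified illustration of the behavior described above, not LiteLLM's actual lookup code:

```python
# Maps the provider prefix of a model string to the env var holding its key
# (mirrors the prefixes used throughout this guide; illustrative only).
PREFIX_TO_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GEMINI_API_KEY",
    "xai": "XAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "cohere": "COHERE_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

def env_var_for(model: str) -> str:
    """Which env var supplies the key for a given provider/model string."""
    return PREFIX_TO_ENV[model.split("/", 1)[0]]
```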
Rate limits
All providers enforce rate limits. For production deployments:
- OpenAI: Request tier upgrades via the dashboard for higher TPM (tokens per minute)
- Anthropic: Workspaces allow per-team rate limit allocation
- Google: Vertex AI provides higher quotas than the consumer API
- xAI: Rate limits scale with usage tier; contact for enterprise quotas
- DeepSeek: Rate limits are generous on paid tiers; may throttle during peak demand
- Mistral: Contact sales for enterprise rate limits
- OpenRouter: Rate limits vary by underlying provider; dashboard shows real-time usage
LiteLLM supports automatic retries with exponential backoff via the num_retries parameter on completion calls.
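A sketch of what an exponential backoff schedule looks like; the delays below are illustrative, not LiteLLM's exact internal timing:

```python
def backoff_delays(num_retries: int, base: float = 1.0) -> list:
    """Exponential backoff: wait base * 2**i seconds before retry attempt i."""
    return [base * (2 ** i) for i in range(num_retries)]

# With 3 retries the waits grow 1s, 2s, 4s before giving up.
# In LiteLLM, retries are enabled per call, e.g.:
# litellm.completion(model=..., messages=..., num_retries=3)
```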