AI Support Guide

Cloud Providers

Detailed guide to cloud LLM providers — OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Cohere, and OpenRouter — with pricing, model specs, and configuration.

Each cloud provider is accessed through LiteLLM using the provider/model-name format (for example, openai/gpt-5-mini). Set your API key and model in environment variables.

OpenAI

The most widely used LLM provider, with a broad ecosystem and fast release cadence. The GPT-5 family launched in 2025, and the GPT-4.1 series offers OpenAI's largest context windows at 1M tokens.

Models

Model        | Context | Input $/M | Output $/M | Notes
gpt-5        | 400K    | $1.25     | $10.00     | Flagship model, strongest reasoning
gpt-5-mini   | 400K    | $0.125    | $1.00      | Best value for most tasks
gpt-5-nano   | 400K    | $0.05     | $0.40      | Ultra-cheap for high-volume classification
gpt-4.1      | 1M      | $2.00     | $8.00      | Largest context window, code-optimized
gpt-4.1-mini | 1M      | $0.40     | $1.60      | 1M context at low cost

GPT-4o is legacy and no longer recommended for new deployments.

Configuration

OPENAI_API_KEY=sk-...
LITELLM_MODEL=openai/gpt-5-mini

Strengths for support

  • Excellent function calling for structured data extraction
  • Fast response times for real-time chat
  • Batch API available at 50% discount for async tasks (triage, QA scoring)
  • GPT-5-nano is extremely cheap for high-volume classification and routing
  • GPT-4.1 offers 1M context for processing entire conversation histories
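
The function-calling bullet above can be sketched with LiteLLM's OpenAI-compatible tools parameter. The tool name and field schema below are illustrative, not part of any provider API:

```python
import json

# Hypothetical tool schema for pulling structured fields out of a ticket.
TICKET_TOOL = {
    "type": "function",
    "function": {
        "name": "record_ticket_fields",
        "description": "Record structured fields extracted from a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                "product": {"type": "string"},
                "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                "summary": {"type": "string"},
            },
            "required": ["product", "severity", "summary"],
        },
    },
}

def build_extraction_request(ticket_text: str) -> dict:
    """Build the kwargs for a litellm.completion() call that forces the tool."""
    return {
        "model": "openai/gpt-5-mini",
        "messages": [{"role": "user", "content": ticket_text}],
        "tools": [TICKET_TOOL],
        "tool_choice": {"type": "function", "function": {"name": "record_ticket_fields"}},
    }

def extract_fields(ticket_text: str) -> dict:
    """Run the extraction (requires OPENAI_API_KEY and the litellm package)."""
    import litellm  # local import so the helpers above work without it
    response = litellm.completion(**build_extraction_request(ticket_text))
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)
```

Forcing tool_choice makes the model return structured arguments rather than free text, which keeps downstream parsing trivial.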

Anthropic

Known for safety research, long context, and strong reasoning capabilities. The Claude 4.6 family offers up to 1M token context windows.

Models

Model             | Context | Input $/M | Output $/M | Notes
claude-opus-4.6   | 1M      | $5.00     | $25.00     | Maximum capability, complex reasoning
claude-sonnet-4.6 | 1M      | $3.00     | $15.00     | Best reasoning-to-cost ratio
claude-haiku-4.5  | 200K    | $1.00     | $5.00      | Fast and affordable

Older models (Claude Sonnet 4, Claude Haiku 3.5) are deprecated.

Configuration

ANTHROPIC_API_KEY=sk-ant-...
LITELLM_MODEL=anthropic/claude-sonnet-4.6

Strengths for support

  • Extended thinking mode for complex QA evaluations
  • Strong instruction following reduces prompt engineering effort
  • Excellent at nuanced tone and empathy in draft replies
  • 1M context on Opus and Sonnet handles entire knowledge bases in a single call
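
As a sketch of the single-call knowledge base pattern, the helper below packs articles into one system prompt; the article tags are just a prompt convention, not an Anthropic API feature:

```python
def build_kb_messages(articles: list[tuple[str, str]], question: str) -> list[dict]:
    """Concatenate (title, body) knowledge base articles into one system prompt."""
    kb = "\n\n".join(
        f'<article title="{title}">\n{body}\n</article>' for title, body in articles
    )
    return [
        {"role": "system", "content": "Answer using only these articles:\n\n" + kb},
        {"role": "user", "content": question},
    ]

def answer_from_kb(articles: list[tuple[str, str]], question: str) -> str:
    """Single long-context call (requires ANTHROPIC_API_KEY and litellm)."""
    import litellm  # local import so build_kb_messages works without it
    response = litellm.completion(
        model="anthropic/claude-sonnet-4.6",
        messages=build_kb_messages(articles, question),
    )
    return response.choices[0].message.content
```

With a 1M token window, even a few hundred articles usually fit without a retrieval step.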

Google

Offers massive context windows and competitive pricing, especially on the Flash tier. Gemini 2.5 Flash is currently the most popular model on OpenRouter by weekly usage.

Models

Model                  | Context | Input $/M | Output $/M | Notes
gemini-2.5-pro         | 1M      | $1.25     | $10.00     | Strongest Google model, 1M context
gemini-2.5-flash       | 1M      | $0.30     | $2.50      | Most popular model by usage, great value
gemini-3-flash-preview | 1M      | $0.50     | $3.00      | Next-gen preview, improved reasoning

Gemini 2.0 Flash is deprecated as of June 2026. Gemma 3 27B is the current open-weight model (available via Ollama and OpenRouter).

Configuration

GEMINI_API_KEY=...
LITELLM_MODEL=gemini/gemini-2.5-flash

Strengths for support

  • 1M token context is ideal for knowledge base article analysis
  • Gemini 2.5 Flash offers excellent quality at low cost for classification
  • Grounding with Google Search for fact-checking responses
  • Native multimodal support for image-based support tickets
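
A minimal sketch of an image-based ticket using OpenAI-style content parts, which LiteLLM translates for Gemini; the URL is illustrative:

```python
def build_image_ticket_message(question: str, image_url: str) -> list[dict]:
    """OpenAI-style multimodal message; LiteLLM maps it to Gemini's format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

def describe_screenshot(question: str, image_url: str) -> str:
    """Ask Gemini about an attached screenshot (requires GEMINI_API_KEY)."""
    import litellm  # local import so the message builder works without it
    response = litellm.completion(
        model="gemini/gemini-2.5-flash",  # LiteLLM prefix for the Gemini API
        messages=build_image_ticket_message(question, image_url),
    )
    return response.choices[0].message.content
```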

xAI

A newer provider offering competitive models with strong reasoning and very large context windows. Grok 4.1 Fast provides 2M tokens of context -- the largest of any proprietary cloud model.

Models

Model         | Context | Input $/M | Output $/M | Notes
grok-4        | 256K    | $3.00     | $15.00     | Flagship reasoning model
grok-4.1-fast | 2M      | $0.20     | $0.50      | Largest context window available, very fast

Configuration

XAI_API_KEY=...
LITELLM_MODEL=xai/grok-4.1-fast

Strengths for support

  • 2M token context on Grok 4.1 Fast can process entire ticket histories and knowledge bases together
  • Grok 4.1 Fast is extremely cost-effective at $0.20/$0.50 per million tokens
  • Strong reasoning capabilities on Grok 4 for complex QA and analysis
  • Fast inference speeds suitable for real-time support workflows

DeepSeek

Chinese AI lab offering extremely cost-effective models. DeepSeek V3.2 is one of the cheapest capable models available, and R1 provides strong reasoning at a fraction of the cost of competitors.

Models

Model         | Context | Input $/M | Output $/M | Notes
deepseek-v3.2 | 164K    | $0.28     | $0.42      | Extremely cheap general-purpose model
deepseek-r1   | 64K     | $0.70     | $2.50      | Reasoning model with chain-of-thought

DeepSeek V3.2 replaced the earlier V3 model and is also available as deepseek-chat.

Configuration

DEEPSEEK_API_KEY=...
LITELLM_MODEL=deepseek/deepseek-v3.2

Strengths for support

  • V3.2 at $0.28/$0.42 per million tokens is ideal for high-volume triage and classification
  • R1 provides strong reasoning at a fraction of the cost of GPT-5 or Claude Opus 4.6
  • Good multilingual performance
  • Open-weight R1 model also available for local deployment via Ollama
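
The high-volume triage bullet can be sketched as a constrained classification call; the label taxonomy below is illustrative:

```python
LABELS = ("billing", "bug", "how_to", "account", "other")  # example taxonomy

def build_triage_prompt(ticket_text: str) -> list[dict]:
    """Ask for exactly one label so the reply is trivially machine-readable."""
    return [
        {
            "role": "system",
            "content": "Classify the ticket into exactly one label: "
            + ", ".join(LABELS)
            + ". Reply with the label only.",
        },
        {"role": "user", "content": ticket_text},
    ]

def parse_label(raw: str) -> str:
    """Normalize the model's reply; fall back to 'other' on anything unexpected."""
    label = raw.strip().lower()
    return label if label in LABELS else "other"

def triage(ticket_text: str) -> str:
    """Classify one ticket (requires DEEPSEEK_API_KEY and litellm)."""
    import litellm  # local import so the helpers above work without it
    response = litellm.completion(
        model="deepseek/deepseek-v3.2",
        messages=build_triage_prompt(ticket_text),
        max_tokens=5,
        temperature=0,
    )
    return parse_label(response.choices[0].message.content)
```

At $0.28/$0.42 per million tokens, classifying every inbound ticket this way costs a small fraction of a cent each.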

Mistral

European AI provider with strong multilingual capabilities and EU data residency options.

Models

Model             | Context | Input $/M | Output $/M | Notes
mistral-large-3   | 128K    | $2.00     | $6.00      | Flagship model, strong reasoning
mistral-medium-3  | 128K    | $1.00     | $3.00      | Balanced quality and cost
mistral-small-3.1 | 128K    | $0.20     | $0.60      | Very cost-effective
codestral         | 256K    | $0.20     | $0.60      | Optimized for code generation

Configuration

MISTRAL_API_KEY=...
LITELLM_MODEL=mistral/mistral-small-3.1

Strengths for support

  • EU-hosted endpoints for GDPR compliance -- data stays in Europe
  • Strong multilingual performance across European languages
  • Mistral Small 3.1 offers excellent quality-to-cost for classification tasks
  • Open-weight versions available for on-premises deployment

Cohere

Specializes in retrieval-augmented generation (RAG) with built-in citation grounding.

Models

Model     | Context | Input $/M | Output $/M | Notes
command-a | 128K    | Contact   | Contact    | RAG-optimized flagship (replaces Command R+)

Configuration

COHERE_API_KEY=...
LITELLM_MODEL=cohere/command-a

Strengths for support

  • Citation grounding -- responses include source references from your knowledge base
  • Built-in RAG pipeline reduces integration complexity
  • Strong reranking capabilities for search result quality
  • Embed models for efficient semantic search

OpenRouter

A unified API gateway that provides access to 300+ models from every major provider through a single API key. Instead of managing separate API keys for OpenAI, Anthropic, Google, and others, you route everything through OpenRouter.

How it works

OpenRouter acts as a proxy -- you send requests using the openrouter/ prefix, and OpenRouter forwards them to the underlying provider. Pricing is the provider's base rate plus a small markup (typically 0--15% depending on the model). Open-source models are available at very competitive rates since OpenRouter aggregates capacity from multiple inference providers. Gemini 2.5 Flash is currently the most popular model on the platform by weekly usage.

Model                       | Context | Input $/M | Output $/M | Notes
openai/gpt-5-mini           | 400K    | ~$0.14    | ~$1.15     | OpenAI via OpenRouter
anthropic/claude-sonnet-4.6 | 1M      | ~$3.00    | ~$15.00    | Anthropic via OpenRouter
google/gemini-2.5-flash     | 1M      | ~$0.30    | ~$2.50     | Most popular model on OpenRouter
google/gemini-2.5-pro       | 1M      | ~$1.25    | ~$10.00    | Google via OpenRouter
xai/grok-4.1-fast           | 2M      | ~$0.20    | ~$0.50     | Largest context available
meta-llama/llama-4-scout    | 10M     | $0.15     | $0.60      | Open model, hosted inference
meta-llama/llama-4-maverick | 400K    | $0.25     | $1.00      | Open model, hosted inference
meta-llama/llama-3.3-70b    | 128K    | $0.10     | $0.32      | Mature open model, very cheap
google/gemma-3-27b          | 128K    | $0.10     | $0.20      | Open model, hosted inference
moonshot/kimi-k2            | 128K    | $0.57     | $2.30      | Strong multilingual reasoning
deepseek/deepseek-r1        | 64K     | ~$0.70    | ~$2.50     | Reasoning model

Configuration

OPENROUTER_API_KEY=sk-or-...
LITELLM_MODEL=openrouter/google/gemini-2.5-flash

OpenRouter uses the format openrouter/provider/model in LiteLLM. Set a single OPENROUTER_API_KEY and you can access any model they support.

Strengths for support

  • Single API key for every provider -- simplifies secrets management and billing
  • Model fallback -- configure automatic failover to a backup model if the primary is unavailable
  • Open model hosting -- run Llama 4 Scout, Gemma 3, and other open models without your own GPU infrastructure
  • Usage dashboard -- centralized view of spend across all models and providers
  • No vendor lock-in -- switch between providers by changing one string, no API key rotation needed
  • Free tier models -- some open models are available with free credits for experimentation
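
The one-string switch can be made explicit with a tiny helper; the candidate list below is illustrative:

```python
# Illustrative candidates, in preference order.
CANDIDATES = [
    ("google", "gemini-2.5-flash"),
    ("openai", "gpt-5-mini"),
    ("anthropic", "claude-sonnet-4.6"),
]

def openrouter_model(provider: str, model: str) -> str:
    """Compose the openrouter/provider/model string LiteLLM expects."""
    return f"openrouter/{provider}/{model}"

def first_working_model(messages: list[dict]):
    """Try each candidate in order with the single OPENROUTER_API_KEY (sketch)."""
    import litellm  # local import so openrouter_model works without it
    for provider, model in CANDIDATES:
        try:
            return litellm.completion(
                model=openrouter_model(provider, model), messages=messages
            )
        except Exception:
            continue  # provider down or over quota; move to the next candidate
    raise RuntimeError("all candidate models failed")
```

Because every candidate routes through the same key, swapping providers never requires a credential change.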

When to use OpenRouter vs direct APIs

Scenario                           | Recommendation
Using a single provider            | Direct API (lower cost, fewer hops)
Using 2+ providers                 | OpenRouter (one key, unified billing)
Experimenting with models          | OpenRouter (instant access to everything)
Using open models without GPUs     | OpenRouter (hosted Llama 4, Gemma 3, etc.)
Maximum latency control            | Direct API (no proxy overhead)
Enterprise with existing contracts | Direct API (use negotiated rates)
Need the largest context window    | OpenRouter with xAI Grok 4.1 Fast (2M tokens)

Model routing and fallbacks

OpenRouter supports automatic model routing -- if one provider is down or slow, requests can fall back to an alternative:

import litellm

# Primary: Claude Sonnet 4.6 via OpenRouter; fall back to GPT-5-mini on failure
async def draft_reply(messages: list[dict]):
    return await litellm.acompletion(
        model="openrouter/anthropic/claude-sonnet-4.6",
        messages=messages,
        fallbacks=["openrouter/openai/gpt-5-mini"],
    )

API key management

Store API keys in your .env file (never commit them):

# .env -- direct provider keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
XAI_API_KEY=...
DEEPSEEK_API_KEY=...
MISTRAL_API_KEY=...
COHERE_API_KEY=...

# Or use a single OpenRouter key for all providers
OPENROUTER_API_KEY=sk-or-...

LiteLLM automatically picks up the right key based on the model prefix. You can also configure keys programmatically via litellm.api_key.

Rate limits

All providers enforce rate limits. For production deployments:

  • OpenAI: Request tier upgrades via the dashboard for higher TPM (tokens per minute)
  • Anthropic: Workspaces allow per-team rate limit allocation
  • Google: Vertex AI provides higher quotas than the consumer API
  • xAI: Rate limits scale with usage tier; contact for enterprise quotas
  • DeepSeek: Rate limits are generous on paid tiers; may throttle during peak demand
  • Mistral: Contact sales for enterprise rate limits
  • OpenRouter: Rate limits vary by underlying provider; dashboard shows real-time usage

LiteLLM supports automatic retries with exponential backoff via the num_retries parameter on its completion calls.
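
For callers who want explicit control over the retry schedule, a stdlib-only backoff wrapper looks like this (the schedule parameters are arbitrary):

```python
import random
import time

def with_backoff(call, max_retries: int = 4, base_delay: float = 0.5):
    """Retry a zero-argument callable with exponential backoff plus jitter.

    Sketch for rate-limit handling; wrap the LiteLLM call in a lambda, e.g.
    with_backoff(lambda: litellm.completion(model=..., messages=...)).
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; surface the last error
            # 0.5s, 1s, 2s, ... doubled each attempt, with up to 2x jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Jitter spreads retries from concurrent workers so they do not all hit the provider at the same instant.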
