Model Selection Guide
Which LLM to use for each AI support capability — recommendations by use case, budget, and requirements.
Choosing the right model for each service is the single most impactful decision for both quality and cost. This guide provides concrete recommendations based on each service's workload characteristics.
Decision factors
Before selecting a model, consider:
- Privacy requirements -- Can customer data leave your infrastructure? If not, use local models.
- Budget -- What is your monthly LLM spend target?
- Latency -- Is the service real-time (chat reply) or async (batch QA scoring)?
- Quality threshold -- Is "good enough" acceptable (triage) or does quality directly impact customers (reply drafting)?
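The four factors above can be sketched as a simple decision helper. This is an illustrative sketch, not part of any Simpli service: the factor names, the $100 budget threshold, and the tier labels are assumptions chosen to mirror the tiers in the tables below.

```python
# Sketch: map the four decision factors to a tier recommendation.
# Thresholds and tier names are illustrative assumptions, not values
# prescribed by this guide.

def choose_tier(private_data: bool, monthly_budget_usd: float,
                realtime: bool, quality_critical: bool) -> str:
    """Suggest a deployment tier from the decision factors."""
    if private_data:
        return "local"      # customer data must not leave your infrastructure
    if quality_critical:
        return "quality"    # output is shown to or sent to customers
    if monthly_budget_usd < 100 or realtime:
        return "budget"     # cheap, fast models for simple real-time tasks
    return "balanced"

print(choose_tier(private_data=False, monthly_budget_usd=50,
                  realtime=True, quality_critical=False))  # budget
```

In practice the privacy check dominates: if data cannot leave your infrastructure, the cloud columns in the tables below are off the table entirely.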
Per-service recommendations
Simpli Triage
Task: Classify tickets into categories and assign urgency levels.
This is a structured classification task with short input and output. Speed and cost matter more than nuanced reasoning.
| Tier | Cloud model | Local model | Why |
|---|---|---|---|
| Budget | GPT-5-nano ($0.05/$0.40) or DeepSeek V3.2 ($0.28/$0.42) | Gemma 3 27B | Classification doesn't need large models |
| Balanced | GPT-5-mini ($0.125/$1) | Phi-4 | Reliable, low cost |
| Quality | GPT-4.1-mini ($0.40/$1.60) | Gemma 3 27B | Better edge-case handling |
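A triage request through LiteLLM might look like the sketch below. The category list, urgency scale, and JSON response shape are assumptions for illustration; Simpli Triage's actual prompt and schema may differ.

```python
# Sketch of a triage classification call via LiteLLM.
# CATEGORIES and the JSON schema are hypothetical.
import json
import os

CATEGORIES = ["billing", "technical", "account", "other"]  # hypothetical

def build_triage_messages(ticket_text: str) -> list[dict]:
    """Build a chat payload asking for category + urgency as JSON."""
    return [
        {"role": "system",
         "content": ("Classify the support ticket. Reply with JSON only: "
                     f'{{"category": one of {CATEGORIES}, '
                     '"urgency": "low" | "medium" | "high"}}')},
        {"role": "user", "content": ticket_text},
    ]

def classify_ticket(ticket_text: str) -> dict:
    """Send the payload to whichever model LITELLM_MODEL names."""
    from litellm import completion  # pip install litellm
    resp = completion(
        model=os.environ.get("LITELLM_MODEL", "openai/gpt-5-nano"),
        messages=build_triage_messages(ticket_text),
    )
    return json.loads(resp.choices[0].message.content)
```

Because input and output are both short, the per-request cost on a nano-class model is fractions of a cent, which is why the budget tier is usually sufficient here.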
# Recommended
LITELLM_MODEL=openai/gpt-5-nano
# Local alternative
LITELLM_MODEL=ollama/gemma3:27b
Simpli Reply

Task: Generate draft responses for customer support tickets.
Quality is critical -- these drafts are shown to agents and may be sent to customers. Tone, empathy, and accuracy all matter. Because drafts are long, output token cost dominates.
| Tier | Cloud model | Local model | Why |
|---|---|---|---|
| Budget | DeepSeek V3.2 ($0.28/$0.42) | Llama 4 Scout | Decent quality at minimal cost |
| Balanced | Claude Sonnet 4.6 ($3/$15) | Llama 4 Scout | Best reasoning-to-cost ratio |
| Quality | Claude Opus 4.6 ($5/$25) | Llama 4 Scout | Maximum empathy and nuance |
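A draft-generation call might look like the sketch below. The system prompt and the `max_tokens` cap are illustrative assumptions; since output tokens dominate drafting cost, capping them is one lever for the budget tier.

```python
# Sketch of a reply-draft request via LiteLLM.
# The prompt wording and token cap are hypothetical.
import os

def build_reply_messages(ticket_text: str, agent_notes: str = "") -> list[dict]:
    """Chat payload asking for an empathetic, accurate draft reply."""
    system = ("You draft support replies for a human agent to review. "
              "Be empathetic, accurate, and concise.")
    user = f"Ticket:\n{ticket_text}\n\nAgent notes:\n{agent_notes}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def draft_reply(ticket_text: str, agent_notes: str = "") -> str:
    from litellm import completion  # pip install litellm
    resp = completion(
        model=os.environ.get("LITELLM_MODEL",
                             "anthropic/claude-sonnet-4-6-20260401"),
        messages=build_reply_messages(ticket_text, agent_notes),
        max_tokens=600,  # cap output: output tokens dominate drafting cost
    )
    return resp.choices[0].message.content
```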
# Recommended
LITELLM_MODEL=anthropic/claude-sonnet-4-6-20260401
# Local alternative
LITELLM_MODEL=ollama/llama4
Simpli QA
Task: Score support conversations against quality rubrics.
Requires strong reasoning to evaluate multi-turn conversations. Must understand nuance -- did the agent show empathy? Was the resolution complete? Inputs are long (full conversations), but the output is structured.
| Tier | Cloud model | Local model | Why |
|---|---|---|---|
| Budget | GPT-5-mini ($0.125/$1) | Qwen 3 32B | Acceptable for basic rubrics |
| Balanced | Claude Sonnet 4.6 ($3/$15) | Qwen 3 32B | Strong reasoning, fair scoring |
| Quality | Claude Opus 4.6 ($5/$25) | Qwen 3 32B | Deep reasoning with chain-of-thought |
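Because QA output is structured, it is worth validating the model's scores before storing them. The rubric criteria and the 1-5 scale below are assumptions for illustration, not Simpli QA's actual schema.

```python
# Sketch: validate a model's JSON rubric scores before persisting them.
# RUBRIC and the 1-5 scale are hypothetical.
import json

RUBRIC = ["empathy", "accuracy", "resolution_complete"]  # hypothetical

def parse_scores(raw_json: str) -> dict[str, int]:
    """Parse and range-check rubric scores returned by the model."""
    scores = json.loads(raw_json)
    out = {}
    for criterion in RUBRIC:
        value = int(scores[criterion])  # KeyError if a criterion is missing
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion} score {value} out of range 1-5")
        out[criterion] = value
    return out
```

Rejecting malformed or out-of-range scores up front keeps one bad completion from skewing aggregate quality metrics.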
# Recommended
LITELLM_MODEL=anthropic/claude-sonnet-4-6-20260401
# Local alternative
LITELLM_MODEL=ollama/qwen3:32b
Simpli Sentiment
Task: Analyze sentiment and detect escalation risk.
Simple classification task -- even simpler than triage. Optimize for speed and cost.
| Tier | Cloud model | Local model | Why |
|---|---|---|---|
| Budget | GPT-5-nano ($0.05/$0.40) or DeepSeek V3.2 ($0.28/$0.42) | Gemma 3 27B | Cheapest options available |
| Balanced | Gemini 2.5 Flash ($0.30/$2.50) | Gemma 3 27B | Fast, reliable |
| Quality | GPT-5-mini ($0.125/$1) | Phi-4 | Better nuance detection |
# Recommended
LITELLM_MODEL=openai/gpt-5-nano
# Local alternative
LITELLM_MODEL=ollama/gemma3:27b
Simpli KB
Task: Knowledge base article analysis, gap detection, semantic search.
Long context is the key requirement -- articles can be lengthy and gap analysis needs to consider the entire knowledge base. Quality of analysis matters for actionable recommendations.
| Tier | Cloud model | Local model | Why |
|---|---|---|---|
| Budget | Gemini 2.5 Flash ($0.30/$2.50) | Llama 4 Scout | 1M+ context at minimal cost |
| Balanced | Gemini 2.5 Pro ($1.25/$10) | Llama 4 Scout | Best long-context quality |
| Quality | Claude Sonnet 4.6 ($3/$15) | Llama 4 Scout (10M ctx) | Deep analysis, long context |
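When long context is the constraint, a cheap pre-check helps decide whether the whole knowledge base fits in one request or needs chunking. The 4-characters-per-token heuristic below is a coarse assumption; use a real tokenizer for billing-accurate counts.

```python
# Sketch: rough context-fit check before a whole-KB gap-analysis request.
# The ~4 chars/token estimate is a heuristic assumption.

def fits_in_context(texts: list[str], context_tokens: int = 1_000_000) -> bool:
    """Estimate whether the combined texts fit a model's context window."""
    est_tokens = sum(len(t) for t in texts) // 4  # ~4 chars per token
    return est_tokens <= context_tokens

print(fits_in_context(["a" * 400] * 100))  # 100 short articles -> True
```

If the check fails, either chunk the knowledge base into per-category batches or move up to a longer-context model such as the 10M-context local option in the table.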
# Recommended
LITELLM_MODEL=google/gemini-2.5-pro
# Local alternative
LITELLM_MODEL=ollama/llama4
Quick reference
| Service | Best cloud | Best via OpenRouter | Best local | Key requirement |
|---|---|---|---|---|
| Triage | GPT-5-nano | openrouter/openai/gpt-5-nano | Gemma 3 27B | Speed, low cost |
| Reply | Claude Sonnet 4.6 | openrouter/anthropic/claude-sonnet-4.6 | Llama 4 Scout | Quality, empathy |
| QA | Claude Sonnet 4.6 | openrouter/anthropic/claude-sonnet-4.6 | Qwen 3 32B | Reasoning |
| Sentiment | GPT-5-nano | openrouter/google/gemma-3-27b | Gemma 3 27B | Speed, low cost |
| KB | Gemini 2.5 Pro | openrouter/google/gemini-2.5-pro | Llama 4 Scout | Long context |
Multi-model setup
You can use different models for different services. Each service reads its own LITELLM_MODEL environment variable:
# docker-compose.yml environment per service — direct APIs
simpli-triage:
  LITELLM_MODEL: openai/gpt-5-nano
simpli-reply:
  LITELLM_MODEL: anthropic/claude-sonnet-4-6-20260401
simpli-qa:
  LITELLM_MODEL: anthropic/claude-sonnet-4-6-20260401
simpli-sentiment:
  LITELLM_MODEL: openai/gpt-5-nano
simpli-kb:
  LITELLM_MODEL: google/gemini-2.5-pro
OpenRouter single-key alternative
If you prefer managing a single API key, route everything through OpenRouter:
# docker-compose.yml — all services via OpenRouter
simpli-triage:
  LITELLM_MODEL: openrouter/openai/gpt-5-nano
  OPENROUTER_API_KEY: sk-or-...
simpli-reply:
  LITELLM_MODEL: openrouter/anthropic/claude-sonnet-4.6
  OPENROUTER_API_KEY: sk-or-...
simpli-qa:
  LITELLM_MODEL: openrouter/anthropic/claude-sonnet-4.6
  OPENROUTER_API_KEY: sk-or-...
simpli-sentiment:
  LITELLM_MODEL: openrouter/google/gemma-3-27b
  OPENROUTER_API_KEY: sk-or-...
simpli-kb:
  LITELLM_MODEL: openrouter/google/gemini-2.5-pro
  OPENROUTER_API_KEY: sk-or-...
Note that the sentiment service uses gemma-3-27b via OpenRouter -- this is a hosted open model at $0.10/$0.20 per M tokens, significantly cheaper than GPT-5-nano for simple classification. See the Cloud Providers page for more on OpenRouter's model catalog and pricing.
DeepSeek V3.2 ($0.28/$0.42) is also worth considering as a cost-effective middle ground between local models (free but require GPU hardware) and premium cloud APIs. It handles classification and simple generation well enough for budget-tier workloads while costing a fraction of models like Claude Sonnet 4.6 or GPT-4.1.
This tiered approach uses budget models for simple classification (triage, sentiment) and premium models for tasks where quality directly impacts output (reply, QA). See the Cost Optimization page for estimated savings.
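The savings from this split can be estimated directly from the per-million-token prices quoted in the tables above. The ticket volume and per-request token counts below are illustrative assumptions.

```python
# Sketch: monthly cost from per-million-token prices ($in/$out per M).
# Volumes and token counts are illustrative assumptions.

def monthly_cost(requests: int, in_tokens: int, out_tokens: int,
                 price_in: float, price_out: float) -> float:
    """USD cost for `requests` calls at the given per-M-token prices."""
    return requests * (in_tokens * price_in + out_tokens * price_out) / 1e6

# 50k triage calls/month, ~500 input / ~50 output tokens each:
# GPT-5-nano ($0.05/$0.40) vs Claude Sonnet 4.6 ($3/$15)
print(monthly_cost(50_000, 500, 50, 0.05, 0.40))   # 2.25
print(monthly_cost(50_000, 500, 50, 3.00, 15.00))  # 112.5
```

At these assumed volumes, routing triage to a nano-class model is roughly 50x cheaper than sending the same traffic to a premium model, which is the core argument for the per-service split.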