AI Support Guide

Cost Optimization

Strategies for reducing LLM costs in customer support — model tiering, prompt engineering, caching, local inference, batch processing, and budget tracking.

LLM costs scale with token volume. A support operation handling 100K tickets per month can easily spend thousands on API calls. These strategies help you control costs without sacrificing quality where it matters.

Understanding LLM costs

LLM pricing has two components:

  • Input tokens: The prompt you send (system instructions, conversation history, context)
  • Output tokens: The model's response (typically 2--5x more expensive per token)

A typical support interaction generates 500--2,000 input tokens and 200--800 output tokens depending on the service.
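As a rough sketch of how these token counts translate to dollars, the helper below uses GPT-5-nano's rates from the tiering table; the token counts are illustrative mid-range values, not measurements:

```python
# Per-interaction cost at GPT-5-nano rates ($0.05 in / $0.40 out per 1M tokens).
INPUT_PRICE_PER_M = 0.05   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.40  # USD per 1M output tokens

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call given its token counts."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

cost = interaction_cost(1_000, 400)  # a mid-range ticket
print(f"${cost:.6f} per interaction, ${cost * 100_000:.2f} per 100K tickets")
```

Output tokens dominate here despite being the smaller count, which is why trimming verbose completions pays off.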

Strategy 1: Model tiering

Not every task needs the most capable model. Match model quality to task complexity.

| Service | Task type | Recommended cloud model (cost/M tokens, in/out) | Cost tier | Recommended local model |
|---|---|---|---|---|
| Simpli Triage | Classification | GPT-5-nano ($0.05/$0.40) | Low | Gemma 3 27B |
| Simpli Sentiment | Classification | DeepSeek V3.2 ($0.28/$0.42) | Low | Gemma 3 27B |
| Simpli Reply | Generation | Claude Sonnet 4.6 ($3/$15) | Medium | Llama 4 Scout |
| Simpli QA | Evaluation | Claude Sonnet 4.6 ($3/$15) | Medium | Qwen 3 32B |
| Simpli KB | Analysis | Gemini 2.5 Pro ($1.25/$10) | Medium | Llama 4 Scout |

Using GPT-5-nano for triage instead of GPT-4.1 reduces per-ticket classification cost by 97% with minimal quality loss on structured classification tasks. DeepSeek V3.2 is an even cheaper alternative at $0.28/$0.42 per M tokens for sentiment analysis.
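A tiering scheme like the table above can be as simple as a task-to-model map; the model identifiers and the `model_for` helper below are illustrative, not a prescribed API:

```python
# Hypothetical task-to-model routing following the tiering table.
# Model identifier strings are illustrative and may differ per provider/version.
MODEL_TIERS = {
    "triage": "openai/gpt-5-nano",          # cheap structured classification
    "sentiment": "deepseek/deepseek-chat",  # cheapest classification tier
    "reply": "anthropic/claude-sonnet-4-6", # quality-sensitive generation
    "qa": "anthropic/claude-sonnet-4-6",
    "kb": "gemini/gemini-2.5-pro",
}

def model_for(task: str) -> str:
    """Unknown tasks fall back to the cheapest tier rather than the priciest."""
    return MODEL_TIERS.get(task, MODEL_TIERS["triage"])
```

Defaulting unknown tasks to the cheap tier keeps a misconfigured caller from silently burning premium-model budget.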

Strategy 2: Prompt optimization

Shorter prompts mean fewer input tokens:

  • Trim system prompts: Remove verbose instructions. Use structured output formats (json_object mode) to reduce both prompt and completion tokens.
  • Limit conversation history: For triage and sentiment, only send the latest 2--3 messages instead of the full thread.
  • Use few-shot wisely: 2 examples are often enough -- 10 examples cost 5x more with diminishing returns.
  • Avoid restating the question: Don't repeat the user's message in the system prompt.
# Before: ~800 input tokens
system = """You are a customer support triage agent. Your job is to classify
incoming tickets into categories. The categories are: billing, technical,
account, feature_request, complaint. Please analyze the ticket carefully
and return your classification along with a confidence score..."""

# After: ~200 input tokens
system = """Classify the ticket. Return JSON: {"category": str, "confidence": float}
Categories: billing, technical, account, feature_request, complaint"""
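Truncating conversation history (the second bullet above) can be a simple slice that preserves the system message; a minimal sketch, assuming OpenAI-style message dicts:

```python
def truncate_history(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Keep the system message (if any) plus the last `keep_last` turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

thread = [{"role": "system", "content": "Classify the ticket."}] + [
    {"role": "user", "content": f"message {i}"} for i in range(10)
]
print(len(truncate_history(thread)))  # 4: system message + last 3 turns
```

For triage and sentiment this drops the bulk of a long thread's input tokens while keeping the most recent context the classifier actually needs.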

Strategy 3: Caching

Avoid paying for the same response twice.

LiteLLM built-in caching

import litellm

# In-memory cache (development)
litellm.cache = litellm.Cache()

# Redis cache (production)
litellm.cache = litellm.Cache(
    type="redis",
    host="localhost",
    port=6379,
)

Caching works best for:

  • Repeated triage of similar tickets
  • Knowledge base article suggestions (same article, different tickets)
  • Macro/template suggestions with common patterns

Semantic caching

LiteLLM supports semantic caching that matches semantically similar (not just identical) prompts. This increases cache hit rates significantly for support workloads where customers phrase the same issue differently.
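A sketch of enabling it with LiteLLM's Redis-backed semantic cache; the parameter names follow LiteLLM's documentation but may differ across versions, and a running Redis instance plus an embedding model are required:

```python
import litellm

# Semantic cache over Redis: prompts within the similarity threshold of a
# cached prompt return the cached response instead of a new completion.
litellm.cache = litellm.Cache(
    type="redis-semantic",
    host="localhost",
    port=6379,
    similarity_threshold=0.8,  # how close two prompts must be to count as a hit
    redis_semantic_cache_embedding_model="text-embedding-3-small",
)
```

Tune the threshold carefully: too loose and customers with different issues get each other's answers; too strict and the hit rate collapses back to exact-match caching.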

Strategy 4: Local models for development

Use Ollama locally during development to eliminate cloud API costs entirely:

# .env.development
LITELLM_MODEL=ollama/gemma3:27b
OLLAMA_API_BASE=http://localhost:11434

# .env.production
LITELLM_MODEL=openai/gpt-5-nano
OPENAI_API_KEY=sk-...

Developers iterating on prompts and testing integrations can burn through thousands of API calls. Routing these through a local model saves real money.
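The split above is just environment configuration; a minimal sketch of resolving the model at runtime, with the variable name mirroring the `.env` files:

```python
import os

# Dev machines leave LITELLM_MODEL unset (or point it at Ollama);
# production sets it to a cloud model.
def resolve_model(default: str = "ollama/gemma3:27b") -> str:
    return os.environ.get("LITELLM_MODEL", default)

os.environ["LITELLM_MODEL"] = "openai/gpt-5-nano"
print(resolve_model())  # openai/gpt-5-nano
```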

Strategy 5: Batch processing

For non-real-time tasks, use batch APIs at reduced rates:

  • OpenAI Batch API: 50% discount, results within 24 hours
  • Ideal for: QA scoring, bulk triage, sentiment analysis backfills
# Note: parallel real-time calls (e.g. litellm.batch_completion) bill at
# standard rates. The 50% discount requires the asynchronous Batch API,
# which takes an uploaded JSONL file of requests -- via the OpenAI SDK:
from openai import OpenAI

client = OpenAI()
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
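The Batch API consumes a JSONL file where each line is one self-contained request; a sketch of building that file for bulk triage, assuming ticket texts in a list (the `custom_id` scheme and system prompt are illustrative):

```python
import json

def build_batch_file(tickets: list[str], path: str = "requests.jsonl") -> None:
    """Write one Batch API request per ticket as a JSONL line."""
    with open(path, "w") as f:
        for i, text in enumerate(tickets):
            request = {
                "custom_id": f"ticket-{i}",  # used to match results to tickets
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-5-nano",
                    "messages": [
                        {"role": "system", "content": "Classify the ticket."},
                        {"role": "user", "content": text},
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")
```

Results come back as a JSONL file keyed by `custom_id`, so stable IDs are what let you join classifications back to tickets.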

Strategy 6: OpenRouter for consolidated billing

If you use multiple providers, OpenRouter simplifies cost management with a single bill and usage dashboard. It also lets you access hosted open-source models (Llama 4 Scout, Gemma 3 27B) without running your own GPU infrastructure -- often cheaper than cloud providers for open models.

# Access open models without GPUs
LITELLM_MODEL=openrouter/meta-llama/llama-4-scout    # $0.15/$0.60 per M tokens
LITELLM_MODEL=openrouter/google/gemma-3-27b           # $0.10/$0.20 per M tokens

OpenRouter's pricing for open models can be significantly cheaper than commercial APIs for comparable quality, making it an effective middle ground between running your own hardware and paying full cloud rates.

Strategy 7: DeepSeek for budget workloads

DeepSeek V3.2 at $0.28/$0.42 per M tokens is among the cheapest capable cloud models available. It handles classification and simple generation tasks well, making it ideal for high-volume, cost-sensitive services like sentiment analysis and triage.

# DeepSeek for budget classification
LITELLM_MODEL=deepseek/deepseek-chat    # V3.2: $0.28/$0.42 per M tokens

# DeepSeek R1 for reasoning tasks at a fraction of cloud cost
LITELLM_MODEL=deepseek/deepseek-reasoner  # R1: $0.70/$2.50 per M tokens

For comparison, DeepSeek V3.2's output price is roughly on par with GPT-5-nano's ($0.42 vs $0.40 per M tokens), though GPT-5-nano remains cheaper on input ($0.05 vs $0.28), while DeepSeek R1 provides strong reasoning capabilities at a fraction of Claude Opus 4.6 or GPT-4.1 pricing.

Cost tracking with CostTracker

Use a cost tracking module to monitor spend in real time:

from cost_tracker import CostTracker, TokenUsage

tracker = CostTracker()

# Record each LLM call
async def generate_draft(request):
    response = await litellm.acompletion(
        model=settings.litellm_model,
        messages=messages,
    )
    cost = tracker.record_from_response(settings.litellm_model, response)
    logger.info(
        "llm_call",
        model=cost.model,
        tokens=cost.usage.total_tokens,
        cost=str(cost.total_cost),
    )
    return response

# Check spend at any time
summary = tracker.summary()
for model, stats in summary["models"].items():
    print(f"{model}: {stats['calls']} calls, ${stats['total_cost']}")
print(f"Total: ${summary['total_cost']}")

Enable cost tracking per service via environment variable:

COST_TRACKING_ENABLED=true
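The `cost_tracker` module shown above is project-specific; a minimal self-contained sketch of the same idea, with an illustrative price table (class and names here are hypothetical, not the module's actual API):

```python
from collections import defaultdict

# Illustrative prices (USD per 1M tokens, input/output) from the tiering table.
PRICES = {"openai/gpt-5-nano": (0.05, 0.40)}

class MiniCostTracker:
    def __init__(self):
        self.calls = defaultdict(int)
        self.cost = defaultdict(float)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Accumulate per-model call counts and dollar cost; return this call's cost."""
        p_in, p_out = PRICES[model]
        call_cost = (input_tokens * p_in + output_tokens * p_out) / 1_000_000
        self.calls[model] += 1
        self.cost[model] += call_cost
        return call_cost

tracker = MiniCostTracker()
tracker.record("openai/gpt-5-nano", 1_000, 400)
print(f"${tracker.cost['openai/gpt-5-nano']:.6f}")  # $0.000210
```

In practice you would pull token counts from the provider's usage field (LiteLLM exposes them on the response object) rather than counting them yourself.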

Budget estimation

Estimated monthly cloud API costs per service at different ticket volumes. Assumes average ticket complexity and GPT-5-nano/DeepSeek V3.2 for classification, Claude Sonnet 4.6 for generation/evaluation.

| Service | 1K tickets/mo | 10K tickets/mo | 100K tickets/mo |
|---|---|---|---|
| Triage (GPT-5-nano) | $0.05 | $0.45 | $4.50 |
| Sentiment (DeepSeek V3.2) | $0.04 | $0.35 | $3.50 |
| Reply (Claude Sonnet 4.6) | $9.00 | $90 | $900 |
| QA (Claude Sonnet 4.6) | $12.00 | $120 | $1,200 |
| KB analysis (Gemini 2.5 Pro) | $3.00 | $30 | $300 |
| Total | ~$24 | ~$241 | ~$2,408 |

With model tiering and caching, these costs can typically be reduced by 30--60%.

Using local models for triage and sentiment drops the total by another 10--15% since those services are the highest-volume but lowest-cost components. Switching Reply to DeepSeek V3.2 where draft quality requirements are relaxed can cut the largest line item dramatically.
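Since the rows scale linearly with ticket volume, the totals and the savings range are easy to sanity-check; a quick sketch using the 10K-ticket column:

```python
# Per-service monthly cost at 10K tickets/mo, from the budget table (USD).
monthly = {"triage": 0.45, "sentiment": 0.35, "reply": 90.0, "qa": 120.0, "kb": 30.0}
total = sum(monthly.values())

print(round(total, 2))  # ~241, matching the Total row
# After the 30% and 60% reductions from tiering and caching:
print(round(total * 0.7, 2), round(total * 0.4, 2))
```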
