Local & Open-Source Models
Run LLMs locally with Ollama, vLLM, or llama.cpp — setup guides for Llama 4, Gemma 3, Mistral, Qwen 3, DeepSeek, and Phi-4.
Running models locally eliminates API costs, keeps data on-premises, and can reduce latency for certain workloads. AI support services work with local models through LiteLLM's unified interface — no code changes required.
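LiteLLM's convention is a `provider/model` string that selects the backend without touching application code. A minimal sketch of that routing idea (the resolver below is illustrative, not part of LiteLLM's API; the default ports match the Ollama and OpenAI-compatible servers configured later in this guide):

```python
# Illustrative sketch: map a LiteLLM-style "provider/model" string to a
# local endpoint. Swapping "ollama/gemma3" for "openai/google/gemma-3-27b-it"
# changes the backend with no application-code changes.
DEFAULT_BASES = {
    "ollama": "http://localhost:11434",            # Ollama's default port
    "openai": "http://localhost:8000/v1",          # vLLM / llama.cpp OpenAI-compatible servers
}

def resolve_endpoint(model_id: str) -> tuple[str, str]:
    """Split 'provider/model' and return (base_url, model_name)."""
    provider, _, model_name = model_id.partition("/")
    if provider not in DEFAULT_BASES:
        raise ValueError(f"unknown provider prefix: {provider!r}")
    return DEFAULT_BASES[provider], model_name

base_url, model = resolve_endpoint("ollama/gemma3")
```

In a real deployment the same switch happens through the `LITELLM_MODEL` environment variable shown in the setup sections below.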
Why run locally
- Privacy: Customer data never leaves your infrastructure. No third-party data processing agreements needed.
- Cost: Zero per-token cost after hardware investment. Ideal for high-volume workloads like ticket classification and QA scoring.
- Latency: No network round-trip for on-premises deployments; small models can reach sub-100 ms time to first token on modern GPUs.
- Air-gapped: Works without internet connectivity, suitable for regulated industries and government deployments.
- Development: Free experimentation during development and testing without burning API credits.
Runtimes
| Runtime | Setup | GPU support | API format | Best for |
|---|---|---|---|---|
| Ollama | One-line install | NVIDIA, Apple Silicon | OpenAI-compatible | Development, small teams |
| vLLM | pip install | NVIDIA (CUDA) | OpenAI-compatible | Production, high throughput |
| llama.cpp | Build from source | NVIDIA, AMD, Apple Silicon, CPU | Custom + OpenAI-compatible | Edge, constrained hardware |
Ollama is the easiest starting point. It handles model downloading, quantization, and serving behind a single CLI. For production deployments with multiple concurrent users, vLLM provides continuous batching and PagedAttention for significantly higher throughput. llama.cpp is the most portable option, running on everything from Raspberry Pi to multi-GPU servers.
Ollama setup
The fastest way to get started with local models.
Install and run
```shell
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull gemma3
ollama pull llama4
ollama pull qwen3

# Serve (runs on port 11434 by default)
ollama serve
```

Configure with LiteLLM

```shell
LITELLM_MODEL=ollama/gemma3
OLLAMA_API_BASE=http://localhost:11434
```

Docker Compose
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  triage-service:
    image: your-org/triage-service:latest
    environment:
      LITELLM_MODEL: ollama/gemma3:27b
      OLLAMA_API_BASE: http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
```

For Apple Silicon Macs, remove the `deploy.resources` block and prefer running Ollama natively rather than inside Docker Desktop: containers on macOS cannot access the GPU, while the native Ollama app uses Metal acceleration automatically.
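Since `depends_on` only waits for the container to start, not for the model server to be ready, a dependent service may want to poll Ollama before routing traffic. A minimal sketch (the `probe` function is injectable so the loop can be tested without a live server; against a real instance the default probe issues a GET to `/api/tags`, Ollama's model-list endpoint):

```python
import time
import urllib.request

def wait_for_ollama(base_url: str, timeout_s: float = 60.0, probe=None) -> bool:
    """Poll the Ollama HTTP API until it responds or the timeout expires."""
    if probe is None:
        def probe(url: str) -> bool:
            try:
                with urllib.request.urlopen(f"{url}/api/tags", timeout=2) as resp:
                    return resp.status == 200
            except OSError:
                return False
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        time.sleep(1.0)
    return False
```

Calling `wait_for_ollama("http://ollama:11434")` at service startup avoids a burst of connection errors while the model loads.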
vLLM setup
Optimized for production throughput with continuous batching and PagedAttention.
Install and serve
```shell
pip install vllm

# Serve a model with an OpenAI-compatible API
vllm serve google/gemma-3-27b-it --port 8000

# For MoE models like Llama 4 Scout
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --port 8000 --tensor-parallel-size 2

# For quantized models
vllm serve google/gemma-3-27b-it --quantization awq --port 8000
```

Configure with LiteLLM

```shell
LITELLM_MODEL=openai/google/gemma-3-27b-it
OPENAI_API_BASE=http://localhost:8000/v1
OPENAI_API_KEY=dummy  # vLLM doesn't require a real key
```

vLLM excels at high-concurrency scenarios: it can serve multiple Simpli services simultaneously with efficient GPU memory management. Use `--tensor-parallel-size` to shard large models across multiple GPUs.
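Concurrency in vLLM is ultimately bounded by KV-cache memory, which PagedAttention allocates per sequence. A rough back-of-envelope sketch of that budget (the formula is standard; the model dimensions below are assumed, Gemma-3-27B-like values for illustration, not the official config):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size for one sequence, in GB (FP16 by default).

    Per token the cache stores a key and a value vector (factor of 2)
    for every layer and KV head.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1e9

# Assumed dimensions: 62 layers, 16 KV heads, head_dim 128, 8K context.
per_seq_gb = kv_cache_gb(62, 16, 128, context_tokens=8192)
```

Dividing the VRAM left over after weights by this per-sequence figure gives a rough ceiling on concurrent long-context requests, and explains why `--tensor-parallel-size` helps even when weights alone would fit on one GPU.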
llama.cpp setup
Lightweight C++ inference engine that runs on CPUs and a wide range of GPUs.
Build and run
```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Download a GGUF model (from Hugging Face or similar)
# Example: Gemma 3 27B quantized
./build/bin/llama-cli -m models/gemma-3-27b-it-Q4_K_M.gguf

# Run as OpenAI-compatible server
./build/bin/llama-server -m models/gemma-3-27b-it-Q4_K_M.gguf --port 8080
```

Configure with LiteLLM

```shell
LITELLM_MODEL=openai/gemma-3-27b-it
OPENAI_API_BASE=http://localhost:8080/v1
OPENAI_API_KEY=dummy
```

Best for edge deployments or machines without NVIDIA GPUs: llama.cpp supports Apple Metal, AMD ROCm, and pure CPU inference (slower but functional).
Model catalog
Llama 4 Scout (Meta, April 2025)
Meta's mixture-of-experts architecture delivers frontier-class performance with efficient inference. Scout uses 16 experts with 1 active, totaling 109B parameters but only 17B active during inference. The standout feature is its 10M token context window — the largest among open models by a wide margin.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B active / 109B total | 10M | 24 GB (Q4) | ollama/llama4 | 16 experts, multimodal (text + image) |
Llama 4 Maverick (Meta, April 2025)
The larger Llama 4 variant with 128 experts and 400B total parameters. Approaches frontier model quality but requires multi-GPU infrastructure.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Llama 4 Maverick | 17B active / 400B total | 1M | 80 GB+ (Q4) | ollama/llama4:maverick | 128 experts, multimodal, requires multi-GPU |
Gemma 3 (Google, March 2025)
Google's open model built on Gemini 2.0 research. The 27B dense model offers strong multimodal capabilities (text and image input) across 140+ languages. Runs on a single GPU, making it one of the most practical high-quality open models available.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Gemma 3 27B | 27B dense | 128K | 18 GB (Q4) | ollama/gemma3 | Multimodal, 140+ languages |
Gemma 3 is an excellent default choice for AI support workloads. The 27B model handles ticket classification, draft generation, and QA scoring well, and it fits on consumer hardware with Q4 quantization.
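A ticket-classification call against a local model usually constrains the output to a fixed label set and normalizes whatever the model returns. A sketch of that pattern (the label set and prompt are hypothetical examples, not part of any Simpli API):

```python
# Hypothetical label set for illustration.
LABELS = ["billing", "bug", "feature_request", "account", "other"]

PROMPT = (
    "Classify this support ticket into exactly one category: "
    + ", ".join(LABELS)
    + ".\nReply with only the category name.\n\nTicket: {ticket}"
)

def parse_label(model_output: str) -> str:
    """Normalize a model reply to a known label, falling back to 'other'."""
    cleaned = model_output.strip().strip(".").lower()
    return cleaned if cleaned in LABELS else "other"
```

The fallback to `"other"` matters with smaller local models, which occasionally reply with a sentence instead of the bare label.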
Qwen 3 (Alibaba, 2025)
The Qwen 3 family spans a wide range of sizes, from tiny 0.6B models for edge deployment to a 235B MoE model that rivals frontier APIs. A key feature is thinking/non-thinking mode switching — the model can produce chain-of-thought reasoning on demand or respond directly for lower latency.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Qwen 3 8B | 8B dense | 128K | 6 GB (Q4) | ollama/qwen3:8b | Fast, lightweight |
| Qwen 3 14B | 14B dense | 128K | 10 GB (Q4) | ollama/qwen3:14b | Good quality/speed balance |
| Qwen 3 32B | 32B dense | 128K | 20 GB (Q4) | ollama/qwen3:32b | Strong reasoning |
| Qwen 3 30B-A3B | 30B total / 3B active MoE | 128K | 4 GB (Q4) | ollama/qwen3:30b-a3b | MoE, very efficient |
| Qwen 3 235B-A22B | 235B total / 22B active MoE | 128K | 48 GB (Q4) | ollama/qwen3:235b-a22b | Near-frontier MoE |
Qwen 3.5 (Alibaba, February 2026)
The latest generation of the Qwen family, building on Qwen 3 with improved reasoning and instruction following.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Qwen 3.5 27B | 27B dense | 128K | 18 GB (Q4) | ollama/qwen3.5:27b | Dense, strong all-around |
| Qwen 3.5 35B-A3B | 35B total / 3B active MoE | 128K | 4 GB (Q4) | ollama/qwen3.5:35b-a3b | Efficient MoE |
| Qwen 3.5 122B-A10B | 122B total / 10B active MoE | 128K | 24 GB (Q4) | ollama/qwen3.5:122b-a10b | High quality MoE |
| Qwen 3.5 397B-A17B | 397B total / 17B active MoE | 128K | 80 GB+ (Q4) | ollama/qwen3.5:397b-a17b | Largest Qwen, multi-GPU |
DeepSeek R1 (DeepSeek)
Reasoning-focused model with transparent chain-of-thought. The distilled variants bring R1-style reasoning to smaller, more deployable sizes.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| DeepSeek R1 Distill 8B | 8B | 128K | 6 GB (Q4) | ollama/deepseek-r1:8b | Lightweight reasoning |
| DeepSeek R1 Distill 32B | 32B | 128K | 20 GB (Q4) | ollama/deepseek-r1:32b | Strong reasoning at moderate size |
Useful for QA scoring workflows where you want to see the model's reasoning process. The chain-of-thought output can be logged alongside scores for auditing.
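R1-style models emit their reasoning inside `<think>...</think>` tags ahead of the final answer, so logging it separately is a simple string split. A minimal sketch (assumes the tag format above; helper name is ours):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate R1-style <think> reasoning from the final answer.

    Returns (reasoning, answer); reasoning is empty if no think block.
    """
    m = THINK_RE.search(response)
    if not m:
        return "", response.strip()
    reasoning = m.group(1).strip()
    answer = (response[:m.start()] + response[m.end():]).strip()
    return reasoning, answer
```

In a QA-scoring pipeline, the first element goes to the audit log and the second is parsed for the score.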
Phi-4 Reasoning (Microsoft, March 2026)
A compact 15B multimodal reasoning model from Microsoft. Supports both vision and text input. Released under the MIT license, making it one of the most permissively licensed capable models available.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Phi-4 Reasoning | 15B | 32K | 10 GB (Q4) | ollama/phi-4 | Multimodal (vision + text), MIT license |
Excellent for classification tasks in AI support — its reasoning capabilities help with nuanced ticket triage decisions.
Mistral Small 3.2 (Mistral)
Open-weight model with strong multilingual capabilities and built-in function calling support. The 128K context window and 24B parameter count hit a practical sweet spot.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Mistral Small 3.2 | 24B | 128K | 16 GB (Q4) | ollama/mistral-small | Multilingual, function calling |
Hardware recommendations
| Setup | VRAM | Recommended models |
|---|---|---|
| MacBook Pro 16 GB | Shared 16 GB | Gemma 3 12B (Q4), Phi-4 (Q4), Qwen 3 8B |
| MacBook Pro 32 GB | Shared 32 GB | Llama 4 Scout (Q4), Qwen 3 32B (Q4), Qwen 3.5 27B (Q4) |
| RTX 4090 | 24 GB | Gemma 3 27B (Q4), Llama 4 Scout (Q4), Mistral Small 3.2, Qwen 3.5 122B-A10B (Q4) |
| 2x RTX 4090 | 48 GB | Qwen 3 235B-A22B MoE (Q4), DeepSeek R1 32B (Q8) |
| A100 80 GB | 80 GB | Any single-GPU model above at Q8 or FP16 |
| Multi-GPU server | 160 GB+ | Llama 4 Maverick, Qwen 3.5 397B-A17B |
For most AI support deployments, a single RTX 4090 or a 32 GB Apple Silicon Mac covers the practical model range. MoE models like Llama 4 Scout and the Qwen 3/3.5 MoE variants are particularly efficient because only a fraction of total parameters are active per token.
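The VRAM figures in these tables follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight, plus a few GB of runtime overhead. A sketch of that estimate (bits-per-weight values are approximate):

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM weight size in GB for a model of
    `total_params_b` billion parameters. Excludes KV cache and runtime
    overhead, which typically add a few GB on top."""
    return total_params_b * bits_per_weight / 8

# Sanity check against the catalog above: a 27B model at ~4.85 bits/weight
# (roughly Q4_K_M) needs about 16 GB for weights alone, consistent with
# the 18 GB figure quoted for Gemma 3 27B once overhead is included.
approx_gb = weights_gb(27, 4.85)
```

Note that for MoE models the *total* parameter count is what must fit in memory, even though only the active experts contribute compute per token.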
Quantization guide
| Method | Quality impact | Size reduction | When to use |
|---|---|---|---|
| FP16 | None | Baseline | When VRAM is not a constraint |
| Q8 | Minimal | ~50% | Good default for production |
| Q6_K | Very small | ~60% | Quality-sensitive production workloads |
| Q4_K_M | Small | ~75% | Best balance of quality and size |
| Q4_K_S | Moderate | ~75% | When memory is tight |
| Q3_K_M | Noticeable | ~80% | Last resort before dropping model size |
For AI support tasks (classification, drafting, scoring), Q4_K_M quantization typically preserves 95%+ of full-precision quality while dramatically reducing memory requirements. Start with Q4_K_M and only move to Q8 or FP16 if you observe quality degradation in your specific workload.
MoE models like Llama 4 Scout and the Qwen 3/3.5 MoE variants benefit especially from quantization: all expert weights must stay resident in memory even though only a few experts are active per token, so quantizing the full parameter count is what brings these models within single-GPU VRAM budgets, while their per-token compute remains close to that of a much smaller dense model.
Cloud Providers
Detailed guide to cloud LLM providers — OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Cohere, and OpenRouter — with pricing, model specs, and configuration.
Cost Optimization
Strategies for reducing LLM costs in customer support — model tiering, prompt engineering, caching, local inference, batch processing, and budget tracking.