Local & Open-Source Models
Run LLMs locally with Ollama, vLLM, or llama.cpp — setup guides for Llama 4, Gemma 3, Mistral, Qwen 3, DeepSeek, and Phi-4.
Running models locally eliminates API costs, keeps data on-premises, and can reduce latency for certain workloads. AI support services work with local models through LiteLLM's unified interface — no code changes required.
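LiteLLM's convention is a `provider/model` string that selects the backend without touching application code. A minimal sketch of that routing idea (the resolver below is illustrative, not part of LiteLLM's API; the default ports match the Ollama and OpenAI-compatible servers configured later in this guide):

```python
# Illustrative sketch: map a LiteLLM-style "provider/model" string to a
# local endpoint. Swapping "ollama/gemma3" for "openai/google/gemma-3-27b-it"
# changes the backend with no application-code changes.
DEFAULT_BASES = {
    "ollama": "http://localhost:11434",            # Ollama's default port
    "openai": "http://localhost:8000/v1",          # vLLM / llama.cpp OpenAI-compatible servers
}

def resolve_endpoint(model_id: str) -> tuple[str, str]:
    """Split 'provider/model' and return (base_url, model_name)."""
    provider, _, model_name = model_id.partition("/")
    if provider not in DEFAULT_BASES:
        raise ValueError(f"unknown provider prefix: {provider!r}")
    return DEFAULT_BASES[provider], model_name

base_url, model = resolve_endpoint("ollama/gemma3")
```

In a real deployment the same switch happens through the `LITELLM_MODEL` environment variable shown in the setup sections below.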
Why run locally
- Privacy: Customer data never leaves your infrastructure. No third-party data processing agreements needed.
- Cost: Zero per-token cost after hardware investment. Ideal for high-volume workloads like ticket classification and QA scoring.
- Latency: No network round-trip for on-premises deployments; small models can reach sub-100 ms time to first token on modern GPUs.
- Air-gapped: Works without internet connectivity, suitable for regulated industries and government deployments.
- Development: Free experimentation during development and testing without burning API credits.
Runtimes
| Runtime | Setup | GPU support | API format | Best for |
|---|---|---|---|---|
| Ollama | One-line install | NVIDIA, Apple Silicon | OpenAI-compatible | Development, small teams |
| vLLM | pip install | NVIDIA (CUDA) | OpenAI-compatible | Production, high throughput |
| llama.cpp | Build from source | NVIDIA, AMD, Apple Silicon, CPU | Custom + OpenAI-compatible | Edge, constrained hardware |
Ollama is the easiest starting point. It handles model downloading, quantization, and serving behind a single CLI. For production deployments with multiple concurrent users, vLLM provides continuous batching and PagedAttention for significantly higher throughput. llama.cpp is the most portable option, running on everything from Raspberry Pi to multi-GPU servers.
Ollama setup
The fastest way to get started with local models.
Install and run
```shell
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull gemma3
ollama pull llama4
ollama pull qwen3

# Serve (runs on port 11434 by default)
ollama serve
```

Configure with LiteLLM

```shell
LITELLM_MODEL=ollama/gemma3
OLLAMA_API_BASE=http://localhost:11434
```

Docker Compose
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  triage-service:
    image: your-org/triage-service:latest
    environment:
      LITELLM_MODEL: ollama/gemma3:27b
      OLLAMA_API_BASE: http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
```

For Apple Silicon Macs, remove the `deploy.resources` block and prefer running Ollama natively rather than inside Docker Desktop: containers on macOS cannot access the GPU, while the native Ollama app uses Metal acceleration automatically.
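Since `depends_on` only waits for the container to start, not for the model server to be ready, a dependent service may want to poll Ollama before routing traffic. A minimal sketch (the `probe` function is injectable so the loop can be tested without a live server; against a real instance the default probe issues a GET to `/api/tags`, Ollama's model-list endpoint):

```python
import time
import urllib.request

def wait_for_ollama(base_url: str, timeout_s: float = 60.0, probe=None) -> bool:
    """Poll the Ollama HTTP API until it responds or the timeout expires."""
    if probe is None:
        def probe(url: str) -> bool:
            try:
                with urllib.request.urlopen(f"{url}/api/tags", timeout=2) as resp:
                    return resp.status == 200
            except OSError:
                return False
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        time.sleep(1.0)
    return False
```

Calling `wait_for_ollama("http://ollama:11434")` at service startup avoids a burst of connection errors while the model loads.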
vLLM setup
Optimized for production throughput with continuous batching and PagedAttention.
Install and serve
```shell
pip install vllm

# Serve a model with an OpenAI-compatible API
vllm serve google/gemma-3-27b-it --port 8000

# For MoE models like Llama 4 Scout
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --port 8000 --tensor-parallel-size 2

# For quantized models
vllm serve google/gemma-3-27b-it --quantization awq --port 8000
```

Configure with LiteLLM

```shell
LITELLM_MODEL=openai/google/gemma-3-27b-it
OPENAI_API_BASE=http://localhost:8000/v1
OPENAI_API_KEY=dummy  # vLLM doesn't require a real key
```

vLLM excels at high-concurrency scenarios: it can serve multiple Simpli services simultaneously with efficient GPU memory management. Use `--tensor-parallel-size` to shard large models across multiple GPUs.
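Concurrency in vLLM is ultimately bounded by KV-cache memory, which PagedAttention allocates per sequence. A rough back-of-envelope sketch of that budget (the formula is standard; the model dimensions below are assumed, Gemma-3-27B-like values for illustration, not the official config):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size for one sequence, in GB (FP16 by default).

    Per token the cache stores a key and a value vector (factor of 2)
    for every layer and KV head.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1e9

# Assumed dimensions: 62 layers, 16 KV heads, head_dim 128, 8K context.
per_seq_gb = kv_cache_gb(62, 16, 128, context_tokens=8192)
```

Dividing the VRAM left over after weights by this per-sequence figure gives a rough ceiling on concurrent long-context requests, and explains why `--tensor-parallel-size` helps even when weights alone would fit on one GPU.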
llama.cpp setup
Lightweight C++ inference engine that runs on CPUs and a wide range of GPUs.
Build and run
```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Download a GGUF model (from Hugging Face or similar)
# Example: Gemma 3 27B quantized
./build/bin/llama-cli -m models/gemma-3-27b-it-Q4_K_M.gguf

# Run as OpenAI-compatible server
./build/bin/llama-server -m models/gemma-3-27b-it-Q4_K_M.gguf --port 8080
```

Configure with LiteLLM

```shell
LITELLM_MODEL=openai/gemma-3-27b-it
OPENAI_API_BASE=http://localhost:8080/v1
OPENAI_API_KEY=dummy
```

Best for edge deployments or machines without NVIDIA GPUs: llama.cpp supports Apple Metal, AMD ROCm, and pure CPU inference (slower but functional).
Model catalog
Llama 4 Scout (Meta, April 2025)
Meta's mixture-of-experts architecture delivers frontier-class performance with efficient inference. Scout uses 16 experts with 1 active, totaling 109B parameters but only 17B active during inference. The standout feature is its 10M token context window — the largest among open models by a wide margin.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B active / 109B total | 10M | 24 GB (Q4) | ollama/llama4 | 16 experts, multimodal (text + image) |
Llama 4 Maverick (Meta, April 2025)
The larger Llama 4 variant with 128 experts and 400B total parameters. Approaches frontier model quality but requires multi-GPU infrastructure.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Llama 4 Maverick | 17B active / 400B total | 1M | 80 GB+ (Q4) | ollama/llama4:maverick | 128 experts, multimodal, requires multi-GPU |
Gemma 3 (Google, March 2025)
Google's open model built on Gemini 2.0 research. The 27B dense model offers strong multimodal capabilities (text and image input) across 140+ languages. Runs on a single GPU, making it one of the most practical high-quality open models available.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Gemma 3 27B | 27B dense | 128K | 18 GB (Q4) | ollama/gemma3 | Multimodal, 140+ languages |
Gemma 3 is an excellent default choice for AI support workloads. The 27B model handles ticket classification, draft generation, and QA scoring well, and it fits on consumer hardware with Q4 quantization.
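A ticket-classification call against a local model usually constrains the output to a fixed label set and normalizes whatever the model returns. A sketch of that pattern (the label set and prompt are hypothetical examples, not part of any Simpli API):

```python
# Hypothetical label set for illustration.
LABELS = ["billing", "bug", "feature_request", "account", "other"]

PROMPT = (
    "Classify this support ticket into exactly one category: "
    + ", ".join(LABELS)
    + ".\nReply with only the category name.\n\nTicket: {ticket}"
)

def parse_label(model_output: str) -> str:
    """Normalize a model reply to a known label, falling back to 'other'."""
    cleaned = model_output.strip().strip(".").lower()
    return cleaned if cleaned in LABELS else "other"
```

The fallback to `"other"` matters with smaller local models, which occasionally reply with a sentence instead of the bare label.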
Qwen 3 (Alibaba, 2025)
The Qwen 3 family spans a wide range of sizes, from tiny 0.6B models for edge deployment to a 235B MoE model that rivals frontier APIs. A key feature is thinking/non-thinking mode switching — the model can produce chain-of-thought reasoning on demand or respond directly for lower latency.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Qwen 3 8B | 8B dense | 128K | 6 GB (Q4) | ollama/qwen3:8b | Fast, lightweight |
| Qwen 3 14B | 14B dense | 128K | 10 GB (Q4) | ollama/qwen3:14b | Good quality/speed balance |
| Qwen 3 32B | 32B dense | 128K | 20 GB (Q4) | ollama/qwen3:32b | Strong reasoning |
| Qwen 3 30B-A3B | 30B total / 3B active MoE | 128K | 4 GB (Q4) | ollama/qwen3:30b-a3b | MoE, very efficient |
| Qwen 3 235B-A22B | 235B total / 22B active MoE | 128K | 48 GB (Q4) | ollama/qwen3:235b-a22b | Near-frontier MoE |
Qwen 3.5 (Alibaba, February 2026)
The latest generation of the Qwen family, building on Qwen 3 with improved reasoning and instruction following.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Qwen 3.5 27B | 27B dense | 128K | 18 GB (Q4) | ollama/qwen3.5:27b | Dense, strong all-around |
| Qwen 3.5 35B-A3B | 35B total / 3B active MoE | 128K | 4 GB (Q4) | ollama/qwen3.5:35b-a3b | Efficient MoE |
| Qwen 3.5 122B-A10B | 122B total / 10B active MoE | 128K | 24 GB (Q4) | ollama/qwen3.5:122b-a10b | High quality MoE |
| Qwen 3.5 397B-A17B | 397B total / 17B active MoE | 128K | 80 GB+ (Q4) | ollama/qwen3.5:397b-a17b | Largest Qwen, multi-GPU |
DeepSeek R1 (DeepSeek)
Reasoning-focused model with transparent chain-of-thought. The distilled variants bring R1-style reasoning to smaller, more deployable sizes.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| DeepSeek R1 Distill 8B | 8B | 128K | 6 GB (Q4) | ollama/deepseek-r1:8b | Lightweight reasoning |
| DeepSeek R1 Distill 32B | 32B | 128K | 20 GB (Q4) | ollama/deepseek-r1:32b | Strong reasoning at moderate size |
Useful for QA scoring workflows where you want to see the model's reasoning process. The chain-of-thought output can be logged alongside scores for auditing.
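R1-style models emit their reasoning inside `<think>...</think>` tags ahead of the final answer, so logging it separately is a simple string split. A minimal sketch (assumes the tag format above; helper name is ours):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate R1-style <think> reasoning from the final answer.

    Returns (reasoning, answer); reasoning is empty if no think block.
    """
    m = THINK_RE.search(response)
    if not m:
        return "", response.strip()
    reasoning = m.group(1).strip()
    answer = (response[:m.start()] + response[m.end():]).strip()
    return reasoning, answer
```

In a QA-scoring pipeline, the first element goes to the audit log and the second is parsed for the score.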
Phi-4 Reasoning (Microsoft, March 2026)
A compact 15B multimodal reasoning model from Microsoft. Supports both vision and text input. Released under the MIT license, making it one of the most permissively licensed capable models available.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Phi-4 Reasoning | 15B | 32K | 10 GB (Q4) | ollama/phi-4 | Multimodal (vision + text), MIT license |
Excellent for classification tasks in AI support — its reasoning capabilities help with nuanced ticket triage decisions.
Mistral Small 3.2 (Mistral)
Open-weight model with strong multilingual capabilities and built-in function calling support. The 128K context window and 24B parameter count hit a practical sweet spot.
| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Mistral Small 3.2 | 24B | 128K | 16 GB (Q4) | ollama/mistral-small | Multilingual, function calling |
Hardware recommendations
| Setup | VRAM | Recommended models |
|---|---|---|
| MacBook Pro 16 GB | Shared 16 GB | Gemma 3 12B (Q4), Phi-4 (Q4), Qwen 3 8B |
| MacBook Pro 32 GB | Shared 32 GB | Llama 4 Scout (Q4), Qwen 3 32B (Q4), Qwen 3.5 27B (Q4) |
| RTX 4090 | 24 GB | Gemma 3 27B (Q4), Llama 4 Scout (Q4), Mistral Small 3.2, Qwen 3.5 122B-A10B (Q4) |
| 2x RTX 4090 | 48 GB | Qwen 3 235B-A22B MoE (Q4), DeepSeek R1 32B (Q8) |
| A100 80 GB | 80 GB | Any single-GPU model above at Q8 or FP16 |
| Multi-GPU server | 160 GB+ | Llama 4 Maverick, Qwen 3.5 397B-A17B |
For most AI support deployments, a single RTX 4090 or a 32 GB Apple Silicon Mac covers the practical model range. MoE models like Llama 4 Scout and the Qwen 3/3.5 MoE variants are particularly efficient because only a fraction of total parameters are active per token.
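The VRAM figures in these tables follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight, plus a few GB of runtime overhead. A sketch of that estimate (bits-per-weight values are approximate):

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM weight size in GB for a model of
    `total_params_b` billion parameters. Excludes KV cache and runtime
    overhead, which typically add a few GB on top."""
    return total_params_b * bits_per_weight / 8

# Sanity check against the catalog above: a 27B model at ~4.85 bits/weight
# (roughly Q4_K_M) needs about 16 GB for weights alone, consistent with
# the 18 GB figure quoted for Gemma 3 27B once overhead is included.
approx_gb = weights_gb(27, 4.85)
```

Note that for MoE models the *total* parameter count is what must fit in memory, even though only the active experts contribute compute per token.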
Quantization guide
| Method | Quality impact | Size reduction | When to use |
|---|---|---|---|
| FP16 | None | Baseline | When VRAM is not a constraint |
| Q8 | Minimal | ~50% | Good default for production |
| Q6_K | Very small | ~60% | Quality-sensitive production workloads |
| Q4_K_M | Small | ~75% | Best balance of quality and size |
| Q4_K_S | Moderate | ~75% | When memory is tight |
| Q3_K_M | Noticeable | ~80% | Last resort before dropping model size |
For AI support tasks (classification, drafting, scoring), Q4_K_M quantization typically preserves 95%+ of full-precision quality while dramatically reducing memory requirements. Start with Q4_K_M and only move to Q8 or FP16 if you observe quality degradation in your specific workload.
MoE models like Llama 4 Scout and the Qwen 3/3.5 MoE variants benefit especially from quantization: all expert weights must stay resident in memory even though only a few experts are active per token, so quantizing the full parameter count is what brings these models within single-GPU VRAM budgets, while their per-token compute remains close to that of a much smaller dense model.
Cloud Providers
Detailed guide to cloud LLM providers — OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Cohere, and OpenRouter — with pricing, model specs, and configuration.
Cost Optimization
Strategies for reducing LLM costs in customer support — model tiering, prompt engineering, caching, local inference, batch processing, and budget tracking.