AI Support Guide

Local & Open-Source Models

Run LLMs locally with Ollama, vLLM, or llama.cpp — setup guides for Llama 4, Gemma 3, Mistral, Qwen 3, DeepSeek, and Phi-4.

Running models locally eliminates API costs, keeps data on-premises, and can reduce latency for certain workloads. AI support services work with local models through LiteLLM's unified interface — no code changes required.
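A minimal sketch of what "no code changes" means in practice: application code builds the same arguments regardless of provider, and only environment variables change. The helper name `completion_kwargs` is illustrative, not part of LiteLLM; the variable names follow the configuration sections in this guide.

```python
import os

def completion_kwargs(messages):
    """Build provider-agnostic arguments for litellm.completion().

    Swapping a hosted model for a local one is an environment change
    (e.g. LITELLM_MODEL=ollama/gemma3 plus OLLAMA_API_BASE), not a
    code change. Helper and variable names are illustrative.
    """
    kwargs = {
        "model": os.environ.get("LITELLM_MODEL", "ollama/gemma3"),
        "messages": messages,
    }
    api_base = os.environ.get("OLLAMA_API_BASE")
    if api_base:
        kwargs["api_base"] = api_base
    return kwargs

os.environ["LITELLM_MODEL"] = "ollama/gemma3"
os.environ["OLLAMA_API_BASE"] = "http://localhost:11434"
kwargs = completion_kwargs([{"role": "user", "content": "Classify this ticket: ..."}])
# kwargs can be passed straight to litellm.completion(**kwargs)
```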

Why run locally

  • Privacy: Customer data never leaves your infrastructure. No third-party data processing agreements needed.
  • Cost: Zero per-token cost after hardware investment. Ideal for high-volume workloads like ticket classification and QA scoring.
  • Latency: No network round-trip for on-premises deployments; smaller models can reach sub-100 ms time to first token on modern GPUs.
  • Air-gapped: Works without internet connectivity, suitable for regulated industries and government deployments.
  • Development: Free experimentation during development and testing without burning API credits.
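The cost trade-off is simple arithmetic: divide the hardware investment by the API's per-token price to find the break-even volume. A sketch with hypothetical prices (it ignores power and operations costs):

```python
def breakeven_tokens(hardware_cost_usd, api_price_per_mtok_usd):
    """Tokens processed before local hardware pays for itself,
    ignoring power and ops costs (rough estimate only)."""
    return hardware_cost_usd / api_price_per_mtok_usd * 1_000_000

# Hypothetical numbers: a $2,000 GPU vs an API charging $2 per million tokens.
tokens = breakeven_tokens(2_000, 2.0)
print(f"{tokens:,.0f} tokens")  # 1,000,000,000 tokens
```

High-volume workloads like ticket classification can cross that threshold quickly; low-volume ones may never recoup the hardware.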

Runtimes

| Runtime | Setup | GPU support | API format | Best for |
|---|---|---|---|---|
| Ollama | One-line install | NVIDIA, Apple Silicon | OpenAI-compatible | Development, small teams |
| vLLM | pip install | NVIDIA (CUDA) | OpenAI-compatible | Production, high throughput |
| llama.cpp | Build from source | NVIDIA, AMD, Apple Silicon, CPU | Custom + OpenAI-compatible | Edge, constrained hardware |

Ollama is the easiest starting point. It handles model downloading, quantization, and serving behind a single CLI. For production deployments with multiple concurrent users, vLLM provides continuous batching and PagedAttention for significantly higher throughput. llama.cpp is the most portable option, running on everything from Raspberry Pi to multi-GPU servers.

Ollama setup

The fastest way to get started with local models.

Install and run

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull gemma3
ollama pull llama4
ollama pull qwen3

# Serve (runs on port 11434 by default)
ollama serve

Configure with LiteLLM

LITELLM_MODEL=ollama/gemma3
OLLAMA_API_BASE=http://localhost:11434
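To verify the server is reachable before wiring up LiteLLM, you can hit Ollama's native /api/chat endpoint directly. A standard-library-only sketch that builds (but does not send) such a request:

```python
import json
import urllib.request

def ollama_chat_request(base_url, model, prompt):
    """Build a POST request against Ollama's native /api/chat endpoint.

    This is the JSON shape Ollama expects; helper name is illustrative.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = ollama_chat_request("http://localhost:11434", "gemma3", "Say hi")
# urllib.request.urlopen(req) returns the completion once `ollama serve` is up.
```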

Docker Compose

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  triage-service:
    image: your-org/triage-service:latest
    environment:
      LITELLM_MODEL: ollama/gemma3:27b
      OLLAMA_API_BASE: http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:

For Apple Silicon Macs running Docker Desktop, remove the deploy.resources block — Metal acceleration is used automatically.

vLLM setup

Optimized for production throughput with continuous batching and PagedAttention.

Install and serve

pip install vllm

# Serve a model with OpenAI-compatible API
vllm serve google/gemma-3-27b-it --port 8000

# For MoE models like Llama 4 Scout
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --port 8000 --tensor-parallel-size 2

# For quantized models
vllm serve google/gemma-3-27b-it --quantization awq --port 8000

Configure with LiteLLM

LITELLM_MODEL=openai/google/gemma-3-27b-it
OPENAI_API_BASE=http://localhost:8000/v1
OPENAI_API_KEY=dummy  # vLLM doesn't require a real key

vLLM excels at high-concurrency scenarios — it can serve multiple Simpli services simultaneously with efficient GPU memory management. Use --tensor-parallel-size to shard large models across multiple GPUs.
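A rough way to pick --tensor-parallel-size is to divide the model's weight footprint (plus headroom for KV cache and activations) by per-GPU VRAM. The helper below is a heuristic sketch, not a vLLM API, and the 1.2 overhead factor is an assumption; actual usage depends on context length, batch size, and dtype:

```python
import math

def min_tensor_parallel(model_vram_gb, gpu_vram_gb, overhead=1.2):
    """Smallest tensor-parallel degree that fits a model across identical GPUs.

    Heuristic only: `overhead` pads for KV cache and activations.
    """
    needed = model_vram_gb * overhead
    size = max(1, math.ceil(needed / gpu_vram_gb))
    # vLLM requires the attention-head count to be divisible by the
    # tensor-parallel size, so rounding up to a power of two is a safe default.
    while size & (size - 1):
        size += 1
    return size

print(min_tensor_parallel(60, 24))  # a ~60 GB checkpoint on 24 GB cards -> 4
```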

llama.cpp setup

Lightweight C++ inference engine that runs on CPUs and a wide range of GPUs.

Build and run

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build -j

# Download a GGUF model (from Hugging Face or similar)
# Example: Gemma 3 27B quantized
./build/bin/llama-cli -m models/gemma-3-27b-it-Q4_K_M.gguf

# Run as OpenAI-compatible server
./build/bin/llama-server -m models/gemma-3-27b-it-Q4_K_M.gguf --port 8080

Configure with LiteLLM

LITELLM_MODEL=openai/gemma-3-27b-it
OPENAI_API_BASE=http://localhost:8080/v1
OPENAI_API_KEY=dummy

Best for edge deployments or machines without NVIDIA GPUs — supports Apple Metal, AMD ROCm, and pure CPU inference (slower but functional).
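GGUF files encode the quantization level in the filename by community convention, as in the commands above. A small parser for that convention — the pattern is an approximation of common naming, not a formal spec:

```python
import re

def parse_gguf_name(filename):
    """Split a GGUF filename into model name and quantization tag.

    Follows the common community convention <model>-<QUANT>.gguf
    (e.g. gemma-3-27b-it-Q4_K_M.gguf); not an official specification.
    """
    pattern = r"(?P<model>.+?)-(?P<quant>Q\d+(?:_[01K])?(?:_[SML])?|F16|F32|BF16)\.gguf$"
    m = re.match(pattern, filename)
    if not m:
        raise ValueError(f"unrecognized GGUF filename: {filename}")
    return m.group("model"), m.group("quant")

print(parse_gguf_name("gemma-3-27b-it-Q4_K_M.gguf"))  # ('gemma-3-27b-it', 'Q4_K_M')
```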

Model catalog

Llama 4 Scout (Meta, April 2025)

Meta's mixture-of-experts architecture delivers frontier-class performance with efficient inference. Scout uses 16 experts with 1 active, totaling 109B parameters but only 17B active during inference. The standout feature is its 10M token context window — the largest among open models by a wide margin.

| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B active / 109B total | 10M | 24 GB (Q4) | ollama/llama4 | 16 experts, multimodal (text + image) |

Llama 4 Maverick (Meta, April 2025)

The larger Llama 4 variant with 128 experts and 400B total parameters. Approaches frontier model quality but requires multi-GPU infrastructure.

| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Llama 4 Maverick | 17B active / 400B total | 1M | 80 GB+ (Q4) | ollama/llama4:maverick | 128 experts, multimodal, requires multi-GPU |

Gemma 3 (Google, March 2025)

Google's open model built on Gemini 2.0 research. The 27B dense model offers strong multimodal capabilities (text and image input) across 140+ languages. Runs on a single GPU, making it one of the most practical high-quality open models available.

| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Gemma 3 27B | 27B dense | 128K | 18 GB (Q4) | ollama/gemma3 | Multimodal, 140+ languages |

Gemma 3 is an excellent default choice for AI support workloads. The 27B model handles ticket classification, draft generation, and QA scoring well, and it fits on consumer hardware with Q4 quantization.

Qwen 3 (Alibaba, 2025)

The Qwen 3 family spans a wide range of sizes, from tiny 0.6B models for edge deployment to a 235B MoE model that rivals frontier APIs. A key feature is thinking/non-thinking mode switching — the model can produce chain-of-thought reasoning on demand or respond directly for lower latency.

| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Qwen 3 8B | 8B dense | 128K | 6 GB (Q4) | ollama/qwen3:8b | Fast, lightweight |
| Qwen 3 14B | 14B dense | 128K | 10 GB (Q4) | ollama/qwen3:14b | Good quality/speed balance |
| Qwen 3 32B | 32B dense | 128K | 20 GB (Q4) | ollama/qwen3:32b | Strong reasoning |
| Qwen 3 30B-A3B | 30B total / 3B active MoE | 128K | 4 GB (Q4) | ollama/qwen3:30b-a3b | MoE, very efficient |
| Qwen 3 235B-A22B | 235B total / 22B active MoE | 128K | 48 GB (Q4) | ollama/qwen3:235b-a22b | Near-frontier MoE |

Qwen 3.5 (Alibaba, February 2026)

The latest generation of the Qwen family, building on Qwen 3 with improved reasoning and instruction following.

| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Qwen 3.5 27B | 27B dense | 128K | 18 GB (Q4) | ollama/qwen3.5:27b | Dense, strong all-around |
| Qwen 3.5 35B-A3B | 35B total / 3B active MoE | 128K | 4 GB (Q4) | ollama/qwen3.5:35b-a3b | Efficient MoE |
| Qwen 3.5 122B-A10B | 122B total / 10B active MoE | 128K | 24 GB (Q4) | ollama/qwen3.5:122b-a10b | High quality MoE |
| Qwen 3.5 397B-A17B | 397B total / 17B active MoE | 128K | 80 GB+ (Q4) | ollama/qwen3.5:397b-a17b | Largest Qwen, multi-GPU |

DeepSeek R1 (DeepSeek)

Reasoning-focused model with transparent chain-of-thought. The distilled variants bring R1-style reasoning to smaller, more deployable sizes.

| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| DeepSeek R1 Distill 8B | 8B | 128K | 6 GB (Q4) | ollama/deepseek-r1:8b | Lightweight reasoning |
| DeepSeek R1 Distill 32B | 32B | 128K | 20 GB (Q4) | ollama/deepseek-r1:32b | Strong reasoning at moderate size |

Useful for QA scoring workflows where you want to see the model's reasoning process. The chain-of-thought output can be logged alongside scores for auditing.
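R1-style models emit their reasoning inside `<think>…</think>` tags before the final answer. A sketch of splitting the two so the chain-of-thought can be logged for audit while only the answer reaches the user (helper name is illustrative):

```python
def split_reasoning(text):
    """Separate an R1-style model's chain-of-thought from its final answer.

    Returns (reasoning, answer); reasoning is empty if no <think> block
    is present.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return "", text.strip()  # no reasoning block present
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>The ticket mentions a refund, so it is billing.</think>Category: billing"
)
print(answer)  # Category: billing
```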

Phi-4 Reasoning (Microsoft, March 2026)

A compact 15B multimodal reasoning model from Microsoft. Supports both vision and text input. Released under the MIT license, making it one of the most permissively licensed capable models available.

| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Phi-4 Reasoning | 15B | 32K | 10 GB (Q4) | ollama/phi-4 | Multimodal (vision + text), MIT license |

Excellent for classification tasks in AI support — its reasoning capabilities help with nuanced ticket triage decisions.

Mistral Small 3.2 (Mistral)

Open-weight model with strong multilingual capabilities and built-in function calling support. The 128K context window and 24B parameter count hit a practical sweet spot.

| Variant | Parameters | Context | VRAM | LiteLLM ID | Notes |
|---|---|---|---|---|---|
| Mistral Small 3.2 | 24B | 128K | 16 GB (Q4) | ollama/mistral-small | Multilingual, function calling |

Hardware recommendations

| Setup | VRAM | Recommended models |
|---|---|---|
| MacBook Pro 16 GB | Shared 16 GB | Gemma 3 27B (Q4), Phi-4 (Q4), Qwen 3 8B |
| MacBook Pro 32 GB | Shared 32 GB | Llama 4 Scout (Q4), Qwen 3 32B (Q4), Qwen 3.5 27B (Q4) |
| RTX 4090 | 24 GB | Gemma 3 27B (Q4), Llama 4 Scout (Q4), Mistral Small 3.2, Qwen 3.5 122B-A10B (Q4) |
| 2x RTX 4090 | 48 GB | Qwen 3 235B-A22B MoE (Q4), DeepSeek R1 32B (FP16) |
| A100 80 GB | 80 GB | Most single models at Q8 or FP16 |
| Multi-GPU server | 160 GB+ | Llama 4 Maverick, Qwen 3.5 397B-A17B |

For most AI support deployments, a single RTX 4090 or a 32 GB Apple Silicon Mac covers the practical model range. MoE models like Llama 4 Scout and the Qwen 3/3.5 MoE variants are particularly efficient because only a fraction of total parameters are active per token.
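A quick way to sanity-check a deployment plan is to compare a card's VRAM against the Q4 footprints quoted in the catalog above, leaving headroom for KV cache and runtime overhead. The footprint numbers are copied from the tables above; the 2 GB headroom default is an assumption, and this is a heuristic, not a guarantee:

```python
# Q4 VRAM figures (GB) quoted in the model catalog above.
MODEL_VRAM_GB = {
    "gemma3:27b": 18,
    "llama4-scout": 24,
    "mistral-small-3.2": 16,
    "qwen3:32b": 20,
    "deepseek-r1:32b": 20,
}

def models_that_fit(vram_gb, headroom_gb=2):
    """Models whose quoted Q4 footprint fits while leaving `headroom_gb`
    for KV cache and runtime overhead (rough heuristic)."""
    return sorted(
        name for name, need in MODEL_VRAM_GB.items()
        if need + headroom_gb <= vram_gb
    )

print(models_that_fit(24))  # comfortable candidates for a single RTX 4090
```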

Quantization guide

| Method | Quality impact | Size reduction | When to use |
|---|---|---|---|
| FP16 | None | Baseline | When VRAM is not a constraint |
| Q8 | Minimal | ~50% | Good default for production |
| Q6_K | Very small | ~60% | Quality-sensitive production workloads |
| Q4_K_M | Small | ~75% | Best balance of quality and size |
| Q4_K_S | Moderate | ~75% | When memory is tight |
| Q3_K_M | Noticeable | ~80% | Last resort before dropping model size |

For AI support tasks (classification, drafting, scoring), Q4_K_M quantization typically preserves 95%+ of full-precision quality while dramatically reducing memory requirements. Start with Q4_K_M and only move to Q8 or FP16 if you observe quality degradation in your specific workload.
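Weight size scales roughly linearly with bytes per parameter, so you can estimate a model's footprint before downloading anything. The bytes-per-parameter figures below are approximations (K-quants use mixed precision, so real files vary a little), and the estimate covers weights only, not KV cache or runtime overhead:

```python
BYTES_PER_PARAM = {   # approximate averages; real GGUF files vary slightly
    "FP16": 2.0,
    "Q8": 1.0,
    "Q6_K": 0.85,
    "Q4_K_M": 0.6,
    "Q3_K_M": 0.45,
}

def weight_size_gb(n_params_billion, quant):
    """Approximate size of the weights alone, in GB.

    Excludes KV cache and runtime overhead, so actual VRAM use is higher.
    """
    return n_params_billion * BYTES_PER_PARAM[quant]

print(round(weight_size_gb(27, "FP16")))    # a 27B model at full precision
print(round(weight_size_gb(27, "Q4_K_M")))  # the same model at Q4_K_M
```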

MoE models like Llama 4 Scout and the Qwen 3/3.5 MoE variants still need all expert weights resident, so quantization applies to the total parameter count — it is often what makes these models fit at all. The small active-parameter count reduces compute per token, and some runtimes can offload rarely used expert weights to system RAM, but it does not by itself shrink the weight footprint.
