Post

Run AI Models Locally for Free: Complete Guide by RAM Size (8 GB to 64 GB)

A practical guide to running open-source AI models locally on your machine for coding, text generation, image creation, audio transcription, and more. Includes model recommendations by RAM size, setup instructions for Ollama, and integration with VS Code, terminal tools (Aider, Fabric, ShellGPT, Mods), and other apps.

Run AI Models Locally for Free: Complete Guide by RAM Size (8 GB to 64 GB)

You don’t need a cloud subscription or an expensive GPU to use AI in your daily workflow. Open-source models have matured to the point where a regular laptop can run capable AI for coding, writing, image generation, audio transcription, and more — completely free, completely private.

This guide covers everything: how to pick the right model for your hardware, how to set it up, and how to integrate it into your actual workflow.


Table of Contents


Why Run AI Locally?

Before diving into models and setup, here’s why local AI is worth your time:

  • Zero cost — no API fees, no subscriptions, no token limits
  • Full privacy — your code, documents, and data never leave your machine
  • Works offline — airports, trains, remote locations — no internet needed
  • No rate limits — run as many queries as your hardware allows
  • Customizable — fine-tune models, adjust parameters, create custom system prompts

The trade-off is straightforward: local models are smaller and less capable than cloud giants like GPT-4o or Claude. But for 80% of daily tasks — code completions, explaining errors, summarizing documents, generating images — they’re more than enough.


Quick Decision: Pick Your Tools

Before diving into the technical details, here’s the big picture. Running AI locally requires two things:

  1. A runtime — software that loads and serves models on your machine
  2. A workflow tool — the interface you actually interact with (editor extension, terminal command, chat UI)

Pick one from each table, install them, then scroll to your RAM tier to choose the right model.

Don’t know where to start? Install Ollama + the Continue VS Code extension. You’ll be up and running in 5 minutes.

Runtimes: How You Load & Serve Models

ToolTypeBest ForPlatformOpen SourceInstall
OllamaCLI + API serverMost users — simplest setup, massive ecosystemMac, Linux, Windowsbrew install ollama or download
LM StudioDesktop GUIBeginners, model browsing, “Chat with Docs”Mac, Linux, Windows❌ (free)Download app
oMLXCLI serverApple Silicon speed demons, coding agents (5x TTFT)Mac only✅ (Apache 2.0)GitHub
Unsloth StudioWeb GUIModel comparison arena, fine-tuning, observabilityMac, Linux, Windows✅ (AGPL-3.0)GitHub
PinokioApp storeNon-technical — one-click install for 160+ AI toolsMac, Linux, WindowsDownload
LocalAIAPI serverMulti-modal (text, image, audio, embeddings) in one APIMac, Linux, WindowsDocs

Start here: If you’re unsure, install Ollama. It’s the most widely supported runtime — nearly every tool in the next table connects to it.

Workflow Tools: How You Use the Models

Terminal / CLI Tools

ToolBest ForKey FeatureInstallConnects To
AiderAI pair programmingEdits files directly, auto-commits with gitpipx install aider-chatOllama, LM Studio, cloud
FabricPrompt patterns (summarize, extract, write)100+ crowdsourced prompt templates, pipe anythinggo install github.com/danielmiessler/fabric@latestOllama
ShellGPTShell command generationForgot a ffmpeg flag? Just askpip install shell-gptOllama (OpenAI-compat)
ModsPipe stdin → LLM → formatted outputUnix philosophy — compose with any CLIbrew install charmbracelet/tap/modsOllama
ChatbladeScripting & JSON outputChain prompts in bash, structured responsespip install chatbladeOllama (OpenAI-compat)

Editor Extensions

ToolEditorBest ForKey FeatureInstall
ContinueVS CodeCode chat + tab autocomplete@file context, inline edits, multi-modelMarketplace
ClineVS CodeAgentic coding (creates/edits files autonomously)Runs terminal, iterates on errorsMarketplace
Roo CodeVS CodeCline fork with extra featuresSame workflow, community-drivenMarketplace

GUI / Chat Apps

ToolBest ForKey FeatureInstallConnects To
Open WebUIChatGPT-like UI, PDF/doc uploadMulti-model, history, RAG built-indocker run (docs)Ollama
GooseAutonomous agent, MCP extensionsDesktop + CLI, 70+ extensions, subagentsDownloadOllama
JoaniumFile-aware automations, schedulingReads project files, GitHub/Gmail/Calendar integrationsDownloadOllama + 10 providers
JanSimple offline ChatGPT alternativeClean chat UI, zero cloud dependencyDownloadBuilt-in + Ollama
GPT4AllOne-click local chat + document Q&AMinimal setup, non-technical friendlyDownloadBuilt-in
EnchantediOS/macOS native chatFree on App Store, smooth native UXApp StoreOllama

Coding Agents (Autonomous)

Note: Coding agents work best with 14B+ parameter models (16 GB RAM minimum). On 8 GB, they’ll be limited to simple tasks.

ToolBest ForKey FeatureInstallConnects To
OpenCodeTerminal + VS Code + desktop agent75+ providers, MIT license, 153k+ starscurl -fsSL https://opencode.ai/install.sh \| bashOllama
OpenHandsSandboxed autonomous codingReads repos, edits files, runs tests, iteratespip install openhands or DockerOllama
OpenHumanPersonal AI with memory + app integrationsGmail, Slack, Notion, GitHub — desktop app (Rust)DownloadOllama, LM Studio
Hermes AgentSelf-improving personal AILearns from experience, persistent memory, schedulingInstall scriptOllama

Non-Text Tools (Audio, Image, TTS)

TaskToolTypeInstallNotes
Speech-to-textWhisper.cppCLIbrew install whisper-cppBest local transcription
Text-to-speechPiperCLIpip install piper-tts30+ languages, runs on CPU
Image generationDraw ThingsmacOS appMac App Store (free)Native Metal acceleration
Image generationComfyUIWeb UIgit clone + pip installLinux/Windows, node-based
Image generationStability MatrixDesktopDownloadOne-click SD installer

Typical setup for a developer: Ollama (runtime) + Continue (editor) + Aider or Mods (terminal). Total install time: ~10 minutes. Then pick a model from your RAM tier below.


How Local AI Models Work (Quick Primer)

graph LR
    A[You type a prompt] --> B[Ollama / Runtime]
    B --> C[Model loaded in RAM/VRAM]
    C --> D[Inference on CPU/GPU]
    D --> E[Response generated]

When you run a model locally:

  1. The model weights (a large file, typically 2-20 GB) are loaded into your RAM or GPU memory
  2. A runtime like Ollama manages the model, accepts prompts via an API, and returns responses
  3. Inference (generating the response) happens on your CPU or GPU — Apple Silicon Macs use the unified memory GPU, which is efficient for this

Key Concept: Quantization

Quantization is compressing a model’s weights from high-precision numbers (16-bit floats) to lower-precision ones (4-bit integers). Think of it like reducing image quality from PNG to JPEG — the file gets much smaller, you lose some detail, but it’s usually good enough. Without quantization, a 14B model would be ~28 GB (won’t fit on most laptops). Quantized to Q4, it’s ~9 GB — fits comfortably on 16 GB RAM.

Models come in different quantization levels that trade quality for size:

QuantizationQualitySize ReductionWhen to Use
F16 (full)BestNone (baseline)Only if you have massive RAM
Q8Near-perfect~50% smallerBest quality-to-size ratio
Q6_KExcellent~58% smallerSweet spot for most users
Q4_K_MGood~70% smallerDefault for most Ollama models
Q3_KAcceptable~75% smallerWhen RAM is very tight
Q2_KDegraded~80% smallerLast resort

Most models on Ollama default to Q4_K_M — a good balance. You don’t need to worry about this unless you’re optimizing for a specific RAM budget.

Pro tip: A bigger model at lower quantization (e.g., 14B at Q3) often outperforms a smaller model at higher quantization (e.g., 7B at Q8). When a model barely fits your RAM, try the next smaller quantization rather than dropping to a smaller model.

Look for “Unsloth Dynamic” GGUFs: When downloading models from Hugging Face, you’ll often see versions uploaded by Unsloth. Their “Dynamic 2.0” quantization intelligently varies precision per layer — giving important layers higher precision and less critical ones lower. The result: better quality at the same file size (benchmarks show +1% accuracy while being 2 GB smaller than standard quants). If you see both a regular GGUF and an Unsloth GGUF for the same model, prefer the Unsloth version.


The RAM Budget Rule

Not all your RAM is available for AI models. Here’s the realistic breakdown:

Total RAMOS + Apps OverheadAvailable for ModelsPractical Model Size Limit
8 GB~4 GB~4 GBUp to 3B-7B parameter models
16 GB~5 GB~11 GBUp to 7B-14B parameter models
24 GB~6 GB~18 GBUp to 14B-22B parameter models
32 GB~8 GB~24 GBUp to 22B-35B parameter models
64 GB~10 GB~54 GBUp to 70B-80B parameter models (MoE)

Rule of thumb: The model file size (shown by ollama list) should be at most 80% of your available RAM. Going beyond that causes memory swapping, which makes inference painfully slow.

Got 36-48 GB? MacBook Pros with M4 Pro/Max chips come in 36, 48, 64, and 128 GB configurations. At 36-48 GB, follow the 32 GB recommendations with more headroom — you can use larger context windows or keep two models loaded simultaneously. At 64 GB, use the 64 GB section. At 128 GB+, you can run truly massive models like Llama 3.1 70B at full Q8 quality or Qwen3-Coder-Next with large context.


Got a GPU? How It Changes Everything

The rest of this guide is organized by RAM because that’s the universal constraint — everyone has RAM, not everyone has a dedicated GPU. But if you do have a GPU, it fundamentally changes what you can run and how fast.

Why GPU Matters for AI

GPUs have hundreds of parallel cores optimized for the matrix math that drives AI inference. A model loaded into GPU memory (VRAM) runs 3-10x faster than the same model on CPU. The key constraint shifts from total RAM to VRAM — your GPU’s dedicated memory.

What is VRAM? VRAM (Video RAM) is memory physically built into your graphics card — separate from your system RAM. It’s ultra-fast memory that only the GPU can access directly. When people say “my RTX 4070 has 12 GB,” they mean 12 GB of VRAM. Your system RAM (16/32 GB) and VRAM (8/12/24 GB) are independent pools — a model loaded into VRAM runs dramatically faster because the GPU doesn’t need to fetch data over the slower system bus.

graph TD
    A[Model File on Disk] --> B{Fits in VRAM?}
    B -->|Yes - Full GPU| C["3-10x faster<br/>60-150 tok/s"]
    B -->|Partially| D["Split GPU+CPU<br/>20-50 tok/s"]
    B -->|No| E["CPU only<br/>5-20 tok/s"]

Step 1: Identify Your GPU

macOS (Apple Silicon)

Apple Silicon (M1/M2/M3/M4) uses unified memory — the GPU and CPU share the same RAM pool. There’s no separate VRAM. This is actually an advantage: your entire RAM is available to the GPU.

1
2
3
4
5
6
# Check your chip and memory
system_profiler SPHardwareDataType | grep -E "Chip|Memory"

# Example output:
# Chip: Apple M2 Pro
# Memory: 16 GB

Your “GPU memory” = your total RAM. Ollama and llama.cpp automatically use the Metal GPU on Apple Silicon — no setup needed.

What is Metal? Metal is Apple’s GPU programming framework — the equivalent of NVIDIA’s CUDA. It lets software (like Ollama, llama.cpp, Whisper.cpp) run AI computations directly on Apple Silicon’s built-in GPU cores. You don’t install it separately — it’s part of macOS. When you see “Metal acceleration,” it simply means the tool is using your Mac’s GPU instead of just the CPU, which makes inference 2-4x faster.

Linux (NVIDIA)

1
2
3
4
5
6
7
8
# Check if NVIDIA GPU is detected
nvidia-smi

# Example output shows:
# GPU Name: NVIDIA GeForce RTX 4070 Ti
# Memory: 12288 MiB (12 GB VRAM)
# Driver Version: 560.35.03
# CUDA Version: 12.6

If nvidia-smi is not found, you need to install NVIDIA drivers:

1
2
3
4
5
# Ubuntu/Debian
sudo apt install nvidia-driver-560

# Or use the NVIDIA CUDA toolkit
# See: https://docs.ollama.com/gpu

Linux (AMD)

1
2
3
4
5
6
7
8
# Check for AMD GPU
rocminfo | grep -E "Name|Marketing"

# Or simpler
lspci | grep -i "vga\|3d"

# Check VRAM
rocm-smi --showmeminfo vram

AMD GPU support requires ROCm. Supported cards: RX 6000/7000 series, Radeon PRO, Instinct. See Ollama AMD docs.

Windows (NVIDIA/AMD/Intel)

1
2
3
4
5
# PowerShell — check GPU name and VRAM
Get-CimInstance Win32_VideoController | Select-Object Name, AdapterRAM

# Or open Task Manager → Performance → GPU
# Shows GPU name, dedicated memory (VRAM), and utilization

For NVIDIA specifically:

1
nvidia-smi

Step 2: Understand VRAM vs RAM

ScenarioWhat Determines Model SizeSpeed
Apple Silicon (M1-M4)Total unified RAM (shared between CPU & GPU)Fast — GPU uses all available memory
Dedicated NVIDIA GPUVRAM is primary; system RAM is backup for overflowFastest when model fits fully in VRAM
AMD GPU (ROCm)VRAM (same as NVIDIA, but Linux-only)Fast when supported
Intel Arc GPUVRAM (limited Ollama support)Moderate
No dedicated GPUSystem RAM only, CPU inferenceSlowest

Key insight: On a system with 32 GB RAM + 12 GB VRAM (e.g., RTX 4070 Ti), the VRAM is what matters most for speed. A 7B model fits entirely in 12 GB VRAM and runs at 60-100+ tok/s. The 32 GB RAM becomes relevant only for models that exceed your VRAM.

Step 3: Model Selection by VRAM

If you have a dedicated NVIDIA/AMD GPU, choose models based on VRAM, not system RAM:

VRAMBest Model SizeExamplesExpected Speed
6 GB3B-7B (Q4)Qwen2.5-Coder 3B, Gemma 3 4B, Llama 3.2 3B40-80 tok/s
8 GB7B (Q4)Llama 3.1 8B, Mistral 7B, Qwen2.5-Coder 7B50-100 tok/s
12 GB7B (Q8) or 14B (Q4)Qwen2.5-Coder 14B, Phi-4 14B, Gemma 4 E4B30-70 tok/s
16 GB14B (Q6) or 22B (Q4)Codestral 22B, Mistral Small 24B25-50 tok/s
24 GB22B (Q6) or 30B (Q4)Qwen3 30B-A3B, Gemma 4 26B, Qwen3-Coder 30B20-45 tok/s
48 GB70B (Q4)Llama 3.1 70B, Qwen3 32B (full precision)15-30 tok/s

Speeds approximate for NVIDIA RTX 4000-series. Older cards (RTX 3000) are ~20-30% slower.

Compared to CPU-only: A 7B model on CPU might give you 10-20 tok/s. The same model fully loaded in 8 GB VRAM gives 60-100 tok/s. That’s the difference between “usable” and “feels instant.”

Step 4: How GPU Offloading Works

Ollama (and llama.cpp) automatically handles GPU offloading:

  • Full offload: Model fits entirely in VRAM → maximum speed
  • Partial offload: Model is split — some layers on GPU, rest on CPU → faster than CPU-only, slower than full GPU
  • No offload: No compatible GPU or VRAM too small → CPU-only

You can control this with environment variables:

1
2
3
4
5
# Force specific number of GPU layers (advanced)
OLLAMA_NUM_GPU_LAYERS=35 ollama run qwen2.5-coder:14b

# Disable GPU entirely (useful for testing)
OLLAMA_NO_GPU=1 ollama run llama3.1

To check what Ollama is actually using:

1
2
3
4
5
6
7
# See GPU utilization while a model is running
ollama ps

# NVIDIA: watch GPU memory and utilization in real-time
watch -n 1 nvidia-smi

# macOS: check GPU usage in Activity Monitor → GPU History

Practical Decision Tree

graph TD
    A["What hardware do you have?"] --> B{Apple Silicon?}
    B -->|Yes| C["Use RAM tables in this guide<br/>GPU is automatic via Metal"]
    B -->|No| D{Dedicated NVIDIA/AMD GPU?}
    D -->|Yes| E["Check VRAM with nvidia-smi<br/>Size models to VRAM"]
    D -->|No| F["Use RAM tables in this guide<br/>Expect 3-5x slower than Apple Silicon"]
    E --> G{"Model fits in VRAM?"}
    G -->|Fully| H["Best case: 60-150 tok/s<br/>Run the largest model that fits"]
    G -->|Partially| I["Good: 20-50 tok/s<br/>Split between GPU and CPU"]
    G -->|Not at all| J["Falls back to CPU<br/>Consider a smaller model"]

Summary: How GPU Changes the Rules

Without GPU (CPU-only)With GPU
Model size limited by RAMModel size limited by VRAM (for full speed)
5-25 tok/s typical30-150 tok/s typical
All RAM recommendations in this guide apply directlyUse VRAM table above for model sizing
Larger context windows eat into available RAMContext window uses VRAM too — budget accordingly
One model at a time on ≤16 GBCan keep model in VRAM + run apps normally (system RAM stays free)

Apple Silicon users: You already have the GPU advantage built-in — Metal acceleration is automatic. The RAM-based tables in this guide already account for GPU usage via unified memory. No extra setup needed.

No GPU? No problem. Every model in this guide runs on CPU. A GPU makes things faster, but it’s not required. If you’re on an Intel/AMD laptop without a discrete GPU, follow the RAM-based recommendations and expect slower speeds.


Context Window: Why It Matters

The context window is how much text a model can “see” at once — your prompt, the conversation history, and the response all share this window. For coding, this is critical:

Context SizeWhat It MeansGood For
4K tokens~3,000 wordsShort Q&A, simple completions
8K tokens~6,000 wordsSingle-file code review, short conversations
32K tokens~24,000 wordsMulti-file context, longer conversations
128K tokens~96,000 wordsEntire codebase context, long documents
256K tokens~192,000 wordsVery large documents, extensive code analysis

Important: Ollama often defaults to a conservative 2048 tokens (or the model’s minimum) to save RAM. You usually need to explicitly set a larger context:

1
2
3
4
5
6
7
8
9
# To set context window, use the API or a Modelfile
# Or interactively in the prompt type: /set parameter num_ctx 32768

# Or in API calls
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b-a3b",
  "prompt": "your prompt here",
  "options": { "num_ctx": 32768 }
}'

RAM impact: Larger context windows consume more RAM. On 16 GB, stick to 8K-16K context. On 32 GB, you can comfortably use 32K-64K.

ModelMax ContextDefault on Ollama
Gemma 4 (all sizes)128K-256K2048
Qwen3 (all sizes)32K-128K2048
Qwen3-Coder 30B128K2048
Llama 3.1 8B128K2048
Codestral 22B32K2048
Mistral Small 24B128K2048

Always override the default if you need more context for coding or document analysis.


Model Recommendations by RAM and Use Case

Audio & TTS (all RAM tiers): For speech-to-text, use Whisper.cpp — the large-v3-turbo q5 model (~0.57 GB) is the best value at any RAM tier. On 8 GB use small (~0.5 GB) if RAM is tight. For text-to-speech, use Piper (~60-100 MB per voice, runs on CPU). Both are lightweight and run alongside any LLM without conflict.

8 GB RAM — The Essentials

With 8 GB, you can run one small model at a time. Close unnecessary apps (especially browsers with many tabs) to free up memory.

Text & Chat

ModelSizeCommandContextStrengths
Gemma 4 E2B~7.2 GBollama pull gemma4:e2b128KGoogle’s latest, vision built-in. ⚠️ Tight fit — requires closing all apps, will use swap on 8 GB
Gemma 3 4B~3.3 GBollama pull gemma3:4b128KEfficient small model, good general knowledge
Phi-4 Mini 3.8B~2.5 GBollama pull phi4-mini16KMicrosoft’s small model, strong reasoning for its size
Llama 3.2 3B~2 GBollama pull llama3.2:3b128KMeta’s compact model, fast and capable

Coding

ModelSizeCommandContextStrengths
Qwen2.5-Coder 3B~2 GBollama pull qwen2.5-coder:3b32KBest small coding model, fill-in-the-middle support
DeepSeek-Coder 1.3B~0.8 GBollama pull deepseek-coder:1.3b16KUltra-light, good for autocomplete only

Image Understanding (Vision)

ModelSizeCommandStrengths
Gemma 4 E2B~7.2 GBollama pull gemma4:e2bVision built-in — describe images, read diagrams
MiniCPM-V 3B~2 GBollama pull minicpm-vLighter vision model, works better on 8 GB

8 GB verdict: You can do basic chat, simple code completions, light image understanding, and audio transcription. Don’t expect multi-turn complex reasoning or large codebase analysis. Gemma 4 E2B is the most capable option but leaves almost no headroom.


16 GB RAM — The Sweet Spot for Most Developers

This is where local AI becomes genuinely useful. You can run 7B-14B models comfortably.

Text & Chat

ModelSizeCommandContextStrengths
Gemma 4 E4B~9.6 GBollama pull gemma4128KGoogle’s latest, vision + text, excellent quality
Gemma 3 12B~8.1 GBollama pull gemma3:12b128KExcellent general-purpose, multimodal
Llama 3.1 8B~4.7 GBollama pull llama3.1128KMeta’s workhorse, great instruction following
Mistral 7B~4.1 GBollama pull mistral32KFast, good at structured output and summarization
Phi-4 14B~9 GBollama pull phi416KMicrosoft’s reasoning model, punches above its weight

Coding

ModelSizeCommandContextStrengths
Qwen2.5-Coder 7B~4.7 GBollama pull qwen2.5-coder:7b32KBest coding model at this size, excellent completions
Qwen2.5-Coder 14B~9 GBollama pull qwen2.5-coder:14b32KStronger code generation, fits tight on 16 GB
DeepSeek-Coder-V2 Lite 16B~9 GBollama pull deepseek-coder-v2:16b128KMoE architecture, good at code generation

Image Understanding (Vision)

ModelSizeCommandStrengths
Gemma 4 E4B~9.6 GBollama pull gemma4Built-in vision — handles text + images in one model
Gemma 3 12B~8.1 GBollama pull gemma3:12bBuilt-in vision, slightly older but proven
LLaVA 13B~8 GBollama pull llava:13bDedicated vision model, good image analysis

16 GB verdict: This is where local AI becomes a real productivity tool. You get solid coding assistance, good chat, image understanding, and audio transcription. Run one model at a time for best performance.


24 GB RAM — Power User Territory

You can run larger models and even keep two smaller models loaded simultaneously.

Text & Chat

ModelSizeCommandContextStrengths
Gemma 4 26B (MoE)~18 GBollama pull gemma4:26b256KGoogle’s latest MoE, vision + text, 256K context
Qwen3 14B~9 GBollama pull qwen3:14b128KStrong reasoning, supports thinking mode
Gemma 3 27B~17 GBollama pull gemma3:27b128KExcellent quality, proven multimodal
Mistral Small 24B~14 GBollama pull mistral-small128KGreat at structured tasks, function calling

Coding

ModelSizeCommandContextStrengths
Codestral 22B~13 GBollama pull codestral32KMistral’s dedicated coding model, 80+ languages
Qwen2.5-Coder 14B~9 GBollama pull qwen2.5-coder:14b32KBest dedicated coding model at this tier
DeepSeek-Coder-V2 16B~9 GBollama pull deepseek-coder-v2:16b128KGood at code generation and explanation

Image Understanding (Vision)

ModelSizeCommandStrengths
Gemma 4 26B~18 GBollama pull gemma4:26bBest vision at this tier, 256K context
Gemma 3 27B~17 GBollama pull gemma3:27bBuilt-in vision, excellent quality

With 24 GB, you can run a chat model + a small autocomplete model simultaneously:

1
2
3
4
5
# Chat/reasoning model (~14 GB)
ollama pull mistral-small

# Tab-autocomplete model (~4.7 GB)
ollama pull qwen2.5-coder:7b

24 GB verdict: You get near-cloud-quality responses for most tasks. The dual-model setup (reasoning + autocomplete) is a game-changer for coding workflows. Gemma 4 26B is the standout pick if you run one model at a time.


32 GB RAM — Maximum Local AI Experience

This is the best consumer-level experience. You can run the largest open models and multi-model setups.

Text & Chat

ModelSizeCommandContextStrengths
Qwen3.6 35B-A3B~24 GBollama pull qwen3.6:35b256K2026 MoE — 3B active, 256K context, agentic coding + vision
Qwen3 30B-A3B~18 GBollama pull qwen3:30b-a3b128KMoE — 30B knowledge, 3B inference speed. Best value
Gemma 4 26B (MoE)~18 GBollama pull gemma4:26b256KGoogle’s latest, vision + text, 256K context
Gemma 4 31B (Dense)~20 GBollama pull gemma4:31b256KDense model, highest quality Gemma 4
Command-R 35B~20 GBollama pull command-r128KCohere’s model, excellent at RAG and tool use

Coding

ModelSizeCommandContextStrengths
Qwen3-Coder 30B~19 GBollama pull qwen3-coder128KPurpose-built for code, agentic workflows, Apache 2.0
Qwen3.6 35B-A3B~24 GBollama pull qwen3.6:35b256KNewer, stronger reasoning, 256K context
Qwen3 30B-A3B~18 GBollama pull qwen3:30b-a3b128KFast MoE with strong code abilities
Codestral 22B~13 GBollama pull codestral32KDedicated coding model, 80+ languages

Image Understanding (Vision)

ModelSizeCommandStrengths
Gemma 4 31B~20 GBollama pull gemma4:31bBest local vision model, 256K context
Gemma 4 26B~18 GBollama pull gemma4:26bMoE variant, faster inference
Llama 3.2 Vision 11B~7 GBollama pull llama3.2-visionGood vision, leaves room for other models
1
2
3
4
5
6
7
# Primary reasoning/chat (~18 GB)
ollama pull qwen3:30b-a3b

# Tab-autocomplete for coding (~4.7 GB)
ollama pull qwen2.5-coder:7b

# Keep ~9 GB free for OS + apps

32 GB verdict: You get a genuinely powerful local AI setup. Qwen3.6 35B-A3B is the new standout (256K context, agentic-ready). Qwen3 30B-A3B remains the fastest value pick. For dedicated coding, Qwen3-Coder 30B is the king.


64 GB RAM — Cloud-Killer Territory

With 64 GB of unified memory (M4 Max, M3 Ultra) or system RAM + VRAM, you can run the largest MoE models that rival cloud coding APIs. This is where local AI truly competes with Claude and GPT for agentic coding.

Text & Chat

ModelSizeCommandContextStrengths
Qwen3-Coder-Next~46 GBollama pull qwen3-coder-next131K80B total / 3B active — best local coding agent model of 2026
Llama 3.1 70B~40 GBollama pull llama3.1:70b128KMeta’s flagship, excellent instruction following
Qwen3.6 35B-A3B (Q8)~36 GBollama pull qwen3.6:35b-a3b-q8_0256KFull-quality Qwen3.6 without quantization loss
Qwen3 32B (Dense)~20 GBollama pull qwen3:32b128KDense model, leaves headroom for large context

Coding

ModelSizeCommandContextStrengths
Qwen3-Coder-Next~46 GBollama pull qwen3-coder-next131KPurpose-built for coding agents. 80B MoE, 3B active — SWE-Bench ~70%
Qwen3-Coder 30B (Q8)~32 GBollama pull qwen3-coder:30b-q8_0128KFull-quality Qwen3-Coder without quantization loss
Qwen3.6 35B-A3B~24 GBollama pull qwen3.6:35b256KFast agentic model, 256K context for large repos
DeepSeek-V2.5 236B (Q2)~50 GBCommunity GGUF128KMoE giant at aggressive quantization, experimental

Image Understanding (Vision)

ModelSizeCommandStrengths
Gemma 4 31B (Q8)~33 GBollama pull gemma4:31b-q8_0Best vision, full quality, 256K context
Llama 3.2 Vision 90B~55 GBollama pull llama3.2-vision:90bMeta’s largest vision model
1
2
3
4
5
6
7
# Primary coding agent (~46 GB)
ollama pull qwen3-coder-next

# Lightweight autocomplete (~4.7 GB) — runs alongside
ollama pull qwen2.5-coder:7b

# Keep ~13 GB free for OS + apps + context

Or for a versatile non-coding setup:

1
2
3
4
5
# Reasoning + general purpose (~40 GB)
ollama pull llama3.1:70b

# Vision model when needed (~24 GB) — swap with llama3.1
ollama pull qwen3.6:35b

64 GB verdict: This is cloud-killer territory. Qwen3-Coder-Next (80B MoE, 3B active) matches paid coding APIs for agentic workflows — file edits, test runs, multi-step debugging — completely free and private. If you’re buying a Mac for local AI coding, 64 GB unified memory is the sweet spot.


Setup Guide: Ollama (Text, Code, Vision Models)

Ollama is the easiest way to run LLMs locally. It handles model downloading, quantization, and serves an API — all in one tool.

Step 1: Install Ollama

PlatformInstall Command / MethodRequirements
macOSDownload from ollama.com/download/mac or brew install ollamamacOS Sonoma 14+, Apple M-series or Intel
Linuxcurl -fsSL https://ollama.com/install.sh \| shAny modern distro. GPU setup optional
WindowsDownload from ollama.com/download/windowsWindows 10 22H2+. No admin needed

After install, verify with ollama --version. The API runs at http://localhost:11434.

Storage: Models live in ~/.ollama/models/ (macOS/Linux) or %HOMEPATH%\.ollama\models (Windows). Budget 20-50 GB free disk space. On Windows, set the OLLAMA_MODELS env var to move models to another drive. See platform-specific docs for macOS, Linux, Windows.

Step 2: Pull a Model

1
2
3
4
5
6
7
8
# Example: pull Gemma 4 (default E4B, ~9.6 GB)
ollama pull gemma4

# Example: pull a coding model
ollama pull qwen2.5-coder:7b

# List downloaded models
ollama list

Step 3: Run and Chat

1
2
3
4
5
6
7
8
# Interactive chat
ollama run gemma4

# Ask a coding question
ollama run qwen2.5-coder:7b "Write a Python function to merge two sorted lists"

# To run with larger context window, create a Modelfile with PARAMETER num_ctx 32768
# Or use the interactive prompt: /set parameter num_ctx 32768

Step 4: Verify the API

Ollama exposes a local API at http://localhost:11434:

1
2
3
4
5
6
7
8
9
# Check downloaded models
curl http://localhost:11434/api/tags

# Send a prompt via API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Explain Docker networking in 3 sentences",
  "stream": false
}'

Useful Ollama Commands

1
2
3
4
5
6
ollama list              # Show downloaded models with sizes
ollama ps                # Show currently loaded models in memory
ollama rm <model>        # Delete a model to free disk space
ollama show <model>      # Show model details (size, quantization, context, license)
ollama cp <src> <dest>   # Copy a model (useful for custom Modelfile configs)
ollama pull <model>      # Download or update a model

Prefer a visual API explorer? The Ollama REST API Postman Collection has pre-built requests for generate, chat, structured output, JSON mode, and model management — great for testing before writing code. For programmatic use, see the official Python and JavaScript client libraries.


Beyond Ollama: Other Free Tools to Run Local AI

The Quick Decision table above covers the most popular options. This section lists specialized tools for users who need lower-level control, production serving, or platform-specific hardware.

Low-Level Inference Engines

ToolWhat It IsBest ForOpen Source
llama.cppThe C/C++ inference engine Ollama is built onMaximum control over quantization, context, and parameters✅ Yes
MLX / mlx-lmApple’s native framework for Apple SiliconFastest inference on Macs — up to 4x faster than llama.cpp for some models✅ Yes

Production & High-Throughput Serving

ToolWhat It IsBest ForOpen Source
vLLMProduction inference server with continuous batching and PagedAttentionMulti-user serving, high concurrency. V1 engine supports text, audio, embeddings, multimodal✅ Yes
SGLangHigh-throughput serving framework from UC BerkeleyStructured output, constrained decoding, production API serving✅ Yes
TGIHugging Face’s inference server with built-in observabilityTeams already in the HF ecosystem, metrics-heavy deployments✅ Yes

Platform-Specific & Niche

ToolWhat It IsBest ForOpen Source
Docker Model RunnerRun GGUF models directly from Docker DesktopTeams already in container workflows — pull models like Docker imagesPartial
LemonadeAMD’s tool for Ryzen AI NPU hardwareAMD laptop users with dedicated NPUs — includes MCP tool calling✅ Yes

Setup Guide: Whisper.cpp (Audio Transcription)

Whisper.cpp runs OpenAI’s Whisper speech-to-text model locally using optimized C++ code. It’s fast on Apple Silicon and modern CPUs.

Step 1: Install via Homebrew

1
2
3
4
5
# Install whisper-cpp (macOS)
brew install whisper-cpp

# Verify — the CLI binary is called whisper-cli
whisper-cli --help

Note: The Homebrew package installs the binary as whisper-cli, not whisper-cpp. It does not include a model download script — you need to download GGML model files manually.

Step 2: Download a Model

Models are hosted at huggingface.co/ggerganov/whisper.cpp. Download the .bin file that matches your RAM budget:

1
2
3
4
5
6
# Create a directory for models
mkdir -p ~/.local/share/whisper-cpp

# Recommended: large-v3-turbo quantized (574 MB) — best speed/quality ratio
curl -L -o ~/.local/share/whisper-cpp/ggml-large-v3-turbo-q5_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo-q5_0.bin

Which Model to Choose?

Model FileSizeSpeedAccuracyBest For
ggml-tiny.bin78 MBFastestBasicQuick tests
ggml-base.bin148 MBVery fastDecentClear speech, low RAM
ggml-small.bin488 MBFastGoodMeetings, podcasts
ggml-medium.bin1.53 GBModerateVery goodAccented speech, noisy audio
ggml-large-v3-turbo-q5_0.bin574 MBFastExcellentBest pick — large quality at medium speed
ggml-large-v3.bin3.1 GBSlowBestProfessional transcription

Pro tip: The large-v3-turbo model is a distilled version of large-v3 — nearly the same accuracy but ~4x faster. The q5_0 quantized variant (574 MB) is the sweet spot for most users.

Multilingual vs English-only: Files with .en in the name (e.g., ggml-medium.en.bin) are English-only and slightly more accurate for English. Files without .en support all languages. The large-v3-turbo is multilingual only.

To download a different model, swap the filename in the URL:

1
2
3
4
5
6
7
# Example: download small model (488 MB)
curl -L -o ~/.local/share/whisper-cpp/ggml-small.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin

# Example: download medium English-only (1.53 GB)
curl -L -o ~/.local/share/whisper-cpp/ggml-medium.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.en.bin

Step 3: Transcribe

1
2
3
4
5
6
7
8
# Basic transcription
whisper-cli -m ~/.local/share/whisper-cpp/ggml-large-v3-turbo-q5_0.bin -f recording.wav

# Output as SRT subtitles
whisper-cli -m ~/.local/share/whisper-cpp/ggml-large-v3-turbo-q5_0.bin -f meeting.wav --output-srt

# Output with timestamps (plain text)
whisper-cli -m ~/.local/share/whisper-cpp/ggml-large-v3-turbo-q5_0.bin -f meeting.wav --output-txt

Supported Audio Formats

Whisper.cpp works best with 16-bit WAV at 16 kHz. Convert other formats first:

1
2
3
4
5
# Convert MP3 to WAV using ffmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav

# Convert M4A (iPhone recording) to WAV
ffmpeg -i voice-memo.m4a -ar 16000 -ac 1 -c:a pcm_s16le output.wav

Speed estimates on Apple M2, will vary by hardware.


Setup Guide: Piper (Text-to-Speech)

Piper is a fast, local neural text-to-speech engine. It runs entirely on CPU, needs minimal RAM (~60-100 MB per voice), and supports 30+ languages.

Install

1
2
# Install via pip
pip install piper-tts

Download a Voice Model

Piper requires a model file (.onnx) and its config (.onnx.json). Browse available voices at huggingface.co/rhasspy/piper-voices.

1
2
3
4
5
6
7
8
9
# Create a directory for voice models
mkdir -p ~/.local/share/piper-voices

# Download a US English voice (medium quality, ~60 MB)
curl -L -o ~/.local/share/piper-voices/en_US-lessac-medium.onnx \
  https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx

curl -L -o ~/.local/share/piper-voices/en_US-lessac-medium.onnx.json \
  https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

Generate Speech

1
2
3
4
5
6
7
8
9
10
11
# Generate speech to a WAV file
echo "Hello, this is a test of local text to speech." | \
  piper -m ~/.local/share/piper-voices/en_US-lessac-medium.onnx -f output.wav

# Play directly (macOS)
echo "The build failed with 3 errors." | \
  piper -m ~/.local/share/piper-voices/en_US-lessac-medium.onnx -f temp.wav && afplay temp.wav && rm temp.wav

# Pipe from a file
cat notes.txt | \
  piper -m ~/.local/share/piper-voices/en_US-lessac-medium.onnx -f notes-audio.wav

Note: Piper does not have a --list-voices flag. You choose a voice by downloading its .onnx + .onnx.json files and passing the path via -m. Browse voices with audio samples at rhasspy.github.io/piper-samples.

Use Cases for Local TTS

  • Accessibility — screen reader alternative for your own tools
  • Content creation — narrate blog posts or documentation
  • Notifications — audio alerts from CI/CD pipelines or monitoring
  • Language learning — hear pronunciation in 30+ languages
  • Proofreading — hearing your writing read aloud catches errors your eyes miss

Setup Guide: Image Generation (Stable Diffusion)

For generating images locally, use Stable Diffusion via tools optimized for your hardware.

On macOS (Apple Silicon)

Draw Things is a free, native macOS/iOS app that runs Stable Diffusion models efficiently on Apple Silicon:

  • Download from the Mac App Store (free)
  • Built-in model browser — download SDXL, SD 1.5, or FLUX models
  • Uses Metal GPU acceleration — fast on M1/M2/M3/M4 chips
  • No terminal setup needed

On Linux/Windows

Use ComfyUI or Stable Diffusion WebUI (Automatic1111):

1
2
3
4
5
6
# ComfyUI (recommended — node-based, flexible)
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python main.py
# Open http://localhost:8188 in your browser

Easiest Install: Stability Matrix or Pinokio

If you don’t want to deal with Python environments and git clones:

  • Stability Matrix (Lykos AI) — Multi-platform package manager for Stable Diffusion. One-click install of ComfyUI, A1111, Forge. Manages Python environments, models, and extensions automatically. Open-source.
  • Pinokio — App store-like launcher for 160+ local AI tools including ComfyUI, Whisper, TTS, and more. Browse, click install, run. No terminal needed.

Image Model Recommendations by RAM

RAMModelSizeQualityGeneration Time
8 GBSD 1.5~2 GBBasic, 512x512~10-20 sec
16 GBSDXL~6.5 GBGood, 1024x1024~15-30 sec
24 GBSDXL + refiner~12 GBHigh quality with refinement~30-60 sec
32 GBFLUX.1-dev~12 GBState-of-the-art, best prompt adherence~20-40 sec

Times estimated on Apple M2 Pro. NVIDIA GPUs are typically 2-3x faster for image generation.


Customizing Models with Modelfile

Ollama’s Modelfile lets you create custom model configurations — set system prompts, adjust temperature, change context length, and more. This is powerful for creating specialized assistants.

Example: Custom Coding Assistant

Create a file called Modelfile.coding:

1
2
3
4
5
6
7
8
9
10
11
12
13
FROM qwen2.5-coder:14b

# Set a larger context window for code
PARAMETER num_ctx 32768

# Lower temperature for more deterministic code output
PARAMETER temperature 0.3

# System prompt for coding assistance
SYSTEM """You are an expert software engineer. You write clean, well-documented,
production-ready code. You follow best practices for the language being used.
When reviewing code, you focus on bugs, security issues, and performance.
Always explain your reasoning briefly."""

Build and run it:

1
2
3
4
5
# Create the custom model
ollama create coding-assistant -f Modelfile.coding

# Use it
ollama run coding-assistant "Review this function for bugs: ..."

Example: Document Summarizer

1
2
3
4
5
6
7
8
9
10
FROM gemma4

PARAMETER num_ctx 65536
PARAMETER temperature 0.2

SYSTEM """You are a document analysis assistant. When given text, you:
1. Provide a concise summary (3-5 sentences)
2. List key points as bullet points
3. Identify any action items or decisions
Be concise and factual. Never add information not present in the source."""

Example: Creative Writing Helper

1
2
3
4
5
6
7
8
9
FROM qwen3:30b-a3b

PARAMETER temperature 0.8
PARAMETER top_p 0.9
PARAMETER num_ctx 16384

SYSTEM """You are a creative writing assistant. You help with brainstorming,
drafting, and editing. Your suggestions are vivid and original. You match
the tone and style the user is going for."""

Integrate with Your Editor & Workflow

Already picked your tools from the Quick Decision table? Here’s how to configure each one. The real power of local AI comes when it’s integrated into your editor and terminal — below are detailed setup instructions.

Continue is the most popular open-source AI coding extension. It supports Ollama natively.

Install:

  1. Open VS Code → Extensions → Search “Continue” → Install
  2. Click the Continue icon in the sidebar → Settings (gear icon)
  3. Edit the config:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
models:
  - name: Gemma 4
    provider: ollama
    model: gemma4
    apiBase: http://localhost:11434

  - name: Qwen3-Coder 30B
    provider: ollama
    model: qwen3-coder
    apiBase: http://localhost:11434

  - name: Qwen2.5 Coder 14B
    provider: ollama
    model: qwen2.5-coder:14b
    apiBase: http://localhost:11434

tabAutocompleteModel:
  provider: ollama
  model: qwen2.5-coder:7b
  apiBase: http://localhost:11434

What you get:

  • Sidebar chat — ask questions about your code, get explanations, generate functions
  • Inline editing — select code, press Ctrl+I / Cmd+I, describe the change
  • Tab autocomplete — code completions as you type (uses the smaller model)
  • Context awareness — reference files with @file, codebase with @codebase
  • Document analysis — drag PDFs or docs into the chat for summarization and Q&A

Option 2: Cline (Agentic Coding)

Cline is an open-source VS Code extension that acts as an autonomous coding agent — it can create files, run terminal commands, and iterate on code with your approval:

  1. Install the Cline extension from VS Code Marketplace
  2. In settings, set provider to “Ollama”
  3. Set the endpoint to http://localhost:11434
  4. Select your model (recommend 32K+ context for best results)

Option 3: Open WebUI (Browser-Based Chat)

For a ChatGPT-like interface that connects to your local models:

1
2
3
4
5
6
# Run with Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 — it auto-detects all your Ollama models. Great for:

  • Longer conversations with full chat history
  • Document analysis — upload PDFs, Word docs, or text files and ask questions about them
  • Sharing with team members on your local network
  • Comparing responses from different models side-by-side

Option 4: Aider (AI Pair Programming in Terminal)

Aider is an open-source terminal-based AI pair programming tool (44k+ GitHub stars, 6.8M+ installs). It edits files directly in your codebase, auto-commits changes with sensible messages, and works with local Ollama models — completely free, completely private.

Why Aider stands out for local AI:

  • Works directly on your existing codebase (not a sandbox)
  • Auto-commits changes with git — easy to diff, undo, or review
  • Maps your entire repo for context-aware edits
  • Supports 100+ programming languages
  • Voice-to-code, linting, testing integration

Install Aider:

1
2
3
4
5
6
7
8
# One-liner (installs aider + python 3.12 if needed)
curl -LsSf https://aider.chat/install.sh | sh

# Or via pipx
pipx install aider-chat

# Or via uv
uv tool install --force --python python3.12 aider-chat@latest

Connect Aider to Ollama:

1
2
3
4
5
6
7
8
9
# Set the Ollama endpoint (default, usually already correct)
export OLLAMA_API_BASE=http://127.0.0.1:11434

# Pull a good coding model
ollama pull qwen2.5-coder:14b

# Start aider with your local model
cd /path/to/your/project
aider --model ollama_chat/qwen2.5-coder:14b

Use ollama_chat/ prefix (not ollama/) for chat-optimized interactions.

Recommended local models for Aider:

RAMModelCommand
8 GBQwen2.5 Coder 7Baider --model ollama_chat/qwen2.5-coder:7b
16 GBQwen2.5 Coder 14Baider --model ollama_chat/qwen2.5-coder:14b
24 GB+Codestral 22Baider --model ollama_chat/codestral:22b
32 GB+Qwen2.5 Coder 32Baider --model ollama_chat/qwen2.5-coder:32b

Context window note: Ollama defaults to a 2K context window, which is too small for real coding. Aider auto-adjusts this for each request, but you can also fix it with a .aider.model.settings.yml file in your project:

1
2
3
- name: ollama_chat/qwen2.5-coder:14b
  extra_params:
    num_ctx: 32768

Example session:

1
2
3
4
5
6
7
8
9
10
$ cd ~/projects/my-app
$ aider --model ollama_chat/qwen2.5-coder:14b

Aider v0.82.0
Model: ollama_chat/qwen2.5-coder:14b

> Add pagination to the /users API endpoint

# Aider edits your files, shows the diff, and auto-commits with:
# "feat: add pagination to /users endpoint with limit/offset params"

Aider also works with cloud models (DeepSeek, Claude, GPT-4o) if you want to mix local and cloud:

1
2
aider --model deepseek --api-key deepseek=<key>
aider --model sonnet --api-key anthropic=<key>

Option 5: Fabric (AI Prompt Patterns from Terminal)

Fabric is a CLI framework with 100+ crowdsourced AI prompt “patterns” — highly optimized prompts for tasks like summarizing articles, extracting wisdom from videos, writing essays, or analyzing security reports. You pipe any text through it.

Install:

1
2
go install github.com/danielmiessler/fabric@latest
fabric --setup   # select Ollama as your provider

Requires Go — install with brew install go if you don’t have it.

Usage with local models:

1
2
3
4
5
6
7
8
# Summarize an article
cat article.txt | fabric --pattern summarize --model llama3

# Extract key insights from a YouTube transcript
yt --transcript "https://youtube.com/watch?v=..." | fabric --pattern extract_wisdom --model qwen2.5

# Write a blog post from notes
cat notes.md | fabric --pattern write_essay --model llama3

Option 6: ShellGPT (sgpt)

ShellGPT is a command-line productivity tool for generating shell commands, code snippets, and general text. Forget a complex ffmpeg or tar command? Just ask.

Install:

1
pip install shell-gpt

Connect to local models — edit ~/.config/shell_gpt/.sgptrc:

1
2
3
OPENAI_API_HOST=http://localhost:11434/v1
OPENAI_API_KEY=not-needed
DEFAULT_MODEL=qwen2.5-coder:14b

Usage:

1
2
3
4
5
6
7
8
# Generate a shell command
sgpt --shell "find all markdown files modified in the last 2 days"

# Generate code
sgpt --code "python function to merge two sorted lists"

# General questions
sgpt "explain the difference between TCP and UDP"

Option 7: Mods (by Charmbracelet)

Mods is built for piping — take stdin, process it with an LLM, get beautifully formatted Markdown output.

Install:

1
brew install charmbracelet/tap/mods

Connect to local models — edit ~/.config/mods/mods.yml:

1
2
3
4
5
6
apis:
  ollama:
    base-url: http://localhost:11434/v1
    models:
      llama3:
        max-input-chars: 32000

Usage:

1
2
3
4
5
6
7
8
# Summarize git commits into release notes
git log --oneline -20 | mods "Summarize these into release notes"

# Explain error output
npm test 2>&1 | mods "What went wrong and how to fix it?"

# Review a diff
git diff | mods "Review this code change for bugs"

Option 8: Chatblade

Chatblade is a versatile CLI for LLM interactions — pipe output, format as JSON, and create complex prompt chains in bash.

Install:

1
pip install chatblade

Connect to local models:

1
2
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=not-needed

Usage:

1
2
3
4
5
# Pipe and process
cat error.log | chatblade "What's the root cause?"

# JSON output for scripting
chatblade -e "list 5 python testing libraries" | jq .

Connecting Any CLI Tool to Local Models

Most terminal AI tools support OpenAI’s API format. Since Ollama is compatible at http://localhost:11434/v1, add this to ~/.zshrc:

1
2
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=not-needed

This covers sgpt, chatblade, and most other tools that accept OPENAI_API_BASE or OPENAI_BASE_URL. LM Studio uses port 1234 instead.

Upgrade & Uninstall Terminal AI Tools

Keep your tools current or remove them cleanly:

ToolUpgradeUninstall
Ollamabrew upgrade ollamabrew uninstall ollama && rm -rf ~/.ollama
Aideraider --upgrade or pipx upgrade aider-chatpipx uninstall aider-chat
Fabricgo install github.com/danielmiessler/fabric@latestrm $(which fabric)
ShellGPTpip install --upgrade shell-gptpip uninstall shell-gpt
Modsbrew upgrade charmbracelet/tap/modsbrew uninstall mods
Chatbladepip install --upgrade chatbladepip uninstall chatblade
Continue (VS Code)Auto-updates via VS Code MarketplaceUninstall from Extensions panel
Cline (VS Code)Auto-updates via VS Code MarketplaceUninstall from Extensions panel
Open WebUIdocker pull ghcr.io/open-webui/open-webui:main && docker restart open-webuidocker rm -f open-webui && docker volume rm open-webui

Tip: Models are the biggest disk consumers (8-20 GB each). Run ollama list periodically and ollama rm <model> for ones you’re not using.


Ollama Integrations: Where You Can Use Local Models

Beyond the tools covered in the Quick Decision table and setup guides above, here are additional integrations worth knowing about. Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 — any tool that supports a custom OpenAI endpoint can connect to it.

IDEs & Editors (Beyond VS Code)

ToolWhat It DoesSetup
VS Code (native)Models in Copilot Chat picker (VS Code 1.113+)ollama launch vscode
JetBrainsIntelliJ, PyCharm, WebStorm — needs JetBrains AI subscriptionSettings → AI → Ollama
XcodeXcode 26+ with Apple IntelligenceSettings → Locally Hosted
ZedNative Ollama providerConfigure → Ollama
marimoPython notebook with AI chat + code completionSettings → AI → Ollama

Coding Agents (Additional)

ToolWhat It DoesSetup
PiMinimal terminal coding agent. TypeScript skills, prompt templates, themesnpm install -g @anthropic-ai/pi → configure local model
Claude CodeAnthropic’s terminal agentRequires API key or OpenAI-compatible proxy

Chat, RAG & Automation

ToolWhat It DoesSetup
OnyxSelf-hosted chat with RAG, web search, agents, connectors (Drive, Slack, Email)Docker → Ollama provider
KotaemonRAG tool for chatting with documents — clean UI, citation supportpip install kotaemon or Docker → set Ollama endpoint
PrivateGPTIngest document collections into vector space, 100% local RAGDocker or pip → configure Ollama
KhojSelf-hosted “AI second brain” — docs, web search, scheduled automationspip install khoj or Docker → Ollama in admin settings
AnythingLLMDesktop app with built-in RAG and document chatSelect Ollama as LLM provider
n8nVisual workflow automation with Ollama nodesCredential → Ollama

Other Chat Clients

ToolWhat It DoesSetup
MstyClean multi-model chat interfaceAuto-detects local Ollama
ChatboxCross-platform desktop client for multiple AI APIsSet provider to Ollama
OpenClawPersonal AI assistant via WhatsApp/Telegram/Discord. Note: cloud APIs work more reliably than localcurl -fsSL https://openclaw.ai/install.sh \| bash

Speed Benchmarks: What to Expect on Apple Silicon

Speed is measured in tokens per second (tok/s). For reference, comfortable reading speed is about 4-5 tok/s, and fast typing speed is about 2 tok/s. Anything above 10 tok/s feels instant.

Approximate Generation Speed (tok/s)

ModelM1 (8 GB)M1 Pro (16 GB)M2 Pro (16 GB)M3 Pro (18 GB)M4 Pro (24 GB)M4 Max (32 GB)M4 Max (64 GB)
Gemma 3 4B~25~35~40~45~55~70~70
Llama 3.1 8B~20~25~30~40~50~50
Gemma 4 E4B~15~20~25~35~45~45
Qwen2.5-Coder 14B~8~12~15~22~30~30
Qwen3.6 35B-A3B (MoE)~15~25~35
Qwen3-Coder-Next (MoE)~25
Llama 3.1 70B~12

“—” means the model doesn’t fit comfortably in that RAM tier. Values are approximate and vary by prompt length, context size, and quantization. Based on community benchmarks from tps.sh and various Apple Silicon LLM benchmark reports.

Key insight: MoE models (Qwen3 30B-A3B, Gemma 4 26B) are significantly faster than dense models of similar total parameter count because they only activate a fraction of parameters per token. A 30B MoE model can be faster than a 14B dense model.

Intel/AMD & NVIDIA Comparison

  • Intel/AMD (no GPU): Expect roughly 3-5x slower speeds than Apple Silicon with the same RAM.
  • NVIDIA GPU: Closes the gap entirely — see the GPU section for VRAM-based model selection and expected speeds.

Local vs Cloud: Honest Comparison

FactorLocal (Ollama)Cloud (ChatGPT, Claude, etc.)
CostFree forever$20-200/month or per-token
Privacy100% localData sent to provider
Speed10-50 tok/s50-150 tok/s
Quality (routine tasks)90-95% of cloudBaseline
Quality (complex reasoning)60-75% of cloudBaseline
Context window8K-64K (RAM-limited)128K-200K
Offline
Setup10-30 minutesSign up and go

Use local for: routine coding, completions, Q&A, summarization, privacy-sensitive work, offline.

Use cloud for: complex multi-step reasoning, large refactors, cutting-edge capabilities.


How to Evaluate a Model Yourself

Don’t just trust benchmarks — test against your actual use cases:

  1. Create 5 test prompts from your real work (code review, generation, explanation, debugging, summarization)
  2. Run the same prompts on 2-3 candidate models: ollama run <model> "your prompt"
  3. Rate each on: correctness, speed, code quality, instruction following
  4. Check resources: ollama ps shows memory usage — ensure you have headroom

Practical tip: The “best” model is the one that gives good-enough results at a comfortable speed. A model that responds in 2 seconds often beats a better model that takes 10 seconds. Use Open WebUI to compare responses side-by-side.


Daily Workflow Examples

Here’s how local AI fits into a real developer’s day:

Morning: Code Review Help

In VS Code with Continue, select a function and press Cmd+I:

“Review this for bugs, edge cases, and potential null pointer issues”

Or in the sidebar chat:

“Explain what this function does and suggest improvements. @file:src/utils/auth.ts”

Midday: Write Documentation

1
2
ollama run qwen3:30b-a3b "Write a README section explaining how to set up
the development environment for a Next.js project with Supabase"

Afternoon: Debug an Error

Paste the error in Continue’s sidebar chat:

“I’m getting this error: TypeError: Cannot read properties of undefined (reading ‘map’). Here’s the relevant code: @file:components/RecipeList.tsx”

Late Afternoon: Summarize a PDF

Open Open WebUI, upload a PDF specification document, and ask:

“Summarize the key requirements from this document. List any breaking changes from the previous version.”

Evening: Transcribe a Meeting

1
2
3
4
5
6
# Convert the recording
ffmpeg -i meeting.m4a -ar 16000 -ac 1 -c:a pcm_s16le meeting.wav

# Transcribe with timestamps
whisper-cli -m ~/.local/share/whisper-cpp/ggml-large-v3-turbo-q5_0.bin \
  -f meeting.wav --output-srt

Weekend: Generate Blog Post Images

Open Draw Things (macOS) or ComfyUI → type a prompt → get an image for your blog post. No cloud API costs.


Quick Reference: Best Model for Each Task

Task8 GB16 GB24 GB32 GB64 GB
General chatGemma 3 4BGemma 4 E4BGemma 4 26BQwen3.6 35B-A3BLlama 3.1 70B
Coding (chat)Qwen2.5-Coder 3BQwen2.5-Coder 7BCodestral 22BQwen3-Coder 30BQwen3-Coder-Next
Tab autocompleteDeepSeek-Coder 1.3BQwen2.5-Coder 7BQwen2.5-Coder 7BQwen2.5-Coder 7BQwen2.5-Coder 7B
Image understandingMiniCPM-V 3BGemma 4 E4BGemma 4 26BGemma 4 31BGemma 4 31B (Q8)
Audio transcriptionWhisper smallWhisper large-v3-turbo q5Whisper large-v3-turbo q5Whisper large-v3Whisper large-v3
Text-to-speechPiper mediumPiper mediumPiper highPiper highPiper high
Image generationSD 1.5SDXLSDXL + refinerFLUX.1-devFLUX.1-dev
SummarizationPhi-4 MiniMistral 7BMistral Small 24BQwen3.6 35B-A3BLlama 3.1 70B
Coding agentsCodestral 22BQwen3-Coder 30BQwen3-Coder-Next
Document/PDF analysisGemma 4 E4B + Open WebUIGemma 4 26B + Open WebUIQwen3.6 35B + Open WebUILlama 3.1 70B + Open WebUI

Tips for the Best Experience

  1. Close unnecessary apps before running models — browsers with many tabs are RAM-hungry
  2. Use one model at a time on 8-16 GB RAM
  3. Unload models explicitlyollama stop <model> frees RAM immediately, or set OLLAMA_KEEP_ALIVE=0
  4. Override the default context — Ollama defaults to 2048 tokens. Set num_ctx to 16384+ for coding tasks
  5. Check model size before pullingollama show <model> shows size, quantization, and license
  6. SSD matters — models load from disk on first use. SSD = near-instant, HDD = minutes
  7. Create custom Modelfiles — a coding assistant with the right system prompt and temperature is noticeably better than defaults
  8. Keep Ollama updatedbrew upgrade ollama for regular performance improvements

Frequently Asked Questions

Can local models replace ChatGPT/Claude?

For routine tasks (code completions, explanations, summaries) — yes. For complex multi-step reasoning or large codebase analysis — cloud models still have an edge. Best approach: local for routine, cloud for complex.

Is Apple Silicon better than Intel/AMD for local AI?

Yes, significantly. Unified memory lets the GPU access all RAM directly. An M1 with 16 GB outperforms most Intel laptops with 32 GB for AI inference. See GPU section for details.

How much disk space do I need?

Budget 5-30 GB per model. A typical 2-3 model setup needs 20-50 GB free. Models live in ~/.ollama/models/.

Can I use these models commercially?

Most have permissive licenses:

  • Gemma 4: Gemma license — commercial use OK
  • Qwen3 / Qwen3-Coder: Apache 2.0 — fully open
  • Llama models: Meta community license — free under 700M MAU
  • Codestral: Mistral Non-Production License — check before commercial use
  • FLUX.1-dev: Non-commercial (use FLUX.1-schnell for commercial)
  • Whisper: MIT — fully open
  • Piper: GPL 3.0

Always verify on the model’s Ollama library page or Hugging Face page.

Do I need a GPU?

No. All models run on CPU. Apple Silicon and NVIDIA GPUs make it 2-5x faster. See GPU section.

How do I use local AI for PDF/document analysis?

Use Open WebUI (simplest — Docker, upload files in chat), Kotaemon or PrivateGPT (large document collections), or AnythingLLM (desktop RAG). All connect to Ollama.


References

This post is licensed under CC BY 4.0 by the author.