How to Run a Local AI on Your Home Lab with Ollama


A year ago, running LLMs locally was a project. Now it’s three commands. Ollama is the reason — it handles downloads, quantization, and serving, and gets out of your way.

Local LLMs are one of the most popular home lab projects right now. The appeal is straightforward: no per-token costs, nothing leaving your network, and a model that’s available 24/7 without a billing dashboard to watch.


What Ollama Does

Ollama is a local model server. Install it, pull a model (same workflow as docker pull), and it serves an API on localhost:11434. Works on Linux, Mac, and Windows. On Linux it runs as a systemd service — starts on boot, stays running.

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2
ollama run llama3.2

Three commands to a working chat interface in your terminal.
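Once the service is up, anything on the machine can talk to it over HTTP. A minimal sketch against the native /api/generate endpoint, using only the Python standard library (assumes the default port and a pulled llama3.2):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's native /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask(model: str, prompt: str) -> str:
    """Send the prompt and return the full (non-streamed) response text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# With the server running:
# print(ask("llama3.2", "Summarize what Ollama does in one sentence."))
```

With "stream": False the server returns one JSON object when generation finishes; drop that field and you get newline-delimited chunks instead.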


Hardware Requirements

This is where expectations need to be set. Models run in RAM (or VRAM with a GPU), and they’re big.

Small models (1B-7B parameters): 16GB RAM is fine. The Beelink S12 Pro and Beelink EQ12 Pro run Llama 3.2 3B without issues. Slower than a cloud API, but usable for background tasks, local automation, and chat-style interactions.

Mid-range models (13B-32B parameters): 32GB+ RAM. The Minisforum UM890 Pro is a popular choice here — Ryzen 9 8945HS, up to 96GB DDR5. The AMD integrated graphics partially accelerate inference, which gives it an edge over pure CPU. Enough headroom for a 32B model with room to run other services alongside it.

Large models (70B+): GPU territory. RTX 4090 builds are the community standard, but that’s a different article and a different budget.
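A rough rule of thumb behind these tiers: a quantized model's weights take roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and runtime. A back-of-envelope sketch (the 25% overhead factor is my assumption, not a measured figure):

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: int = 4,
                    overhead: float = 1.25) -> float:
    """Rough RAM estimate for a quantized model: weight size at the given
    quantization, inflated ~25% for KV cache and runtime (an assumption)."""
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * overhead, 1)

print(estimate_ram_gb(3))   # Llama 3.2 3B at 4-bit: ~1.9 GB, easy in 16GB
print(estimate_ram_gb(14))  # Qwen 2.5 14B at 4-bit: ~8.8 GB, wants 32GB headroom
```

Ollama's default downloads are roughly 4-bit quantizations, which is why a "7B" model fits comfortably in 16GB while a 70B model does not.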


Models People Actually Run

Llama 3.2 (3B) — The default starting point. Best quality-to-speed ratio on low-power hardware. Handles summarization, Q&A, and simple code help.

Mistral 7B — Slightly slower on the same hardware, but better at following complex instructions. Popular for local chatbot projects.

DeepSeek-R1 (7B) — The quantized 7B fits in 16GB and handles reasoning tasks surprisingly well. This one has been trending in Ollama’s model library.

Qwen 2.5 (14B) — Needs 32GB RAM, but it’s where the quality jump for coding tasks becomes obvious. If you can run it, it’s worth trying.

ollama pull mistral
ollama pull deepseek-r1:7b
ollama pull qwen2.5:14b

Plugging It Into Other Tools

Ollama exposes an OpenAI-compatible API, which means most tools that work with OpenAI can point at your local instance instead:

  • Open WebUI — Self-hosted ChatGPT interface. The most popular Ollama frontend by a wide margin.
  • Continue.dev — VS Code extension for local code completion.
  • Home Assistant — Local AI integration that actually works offline.
  • n8n — Workflow automation with local AI nodes for data processing.

Open WebUI alongside Ollama (the named volume keeps chat history across container restarts):

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Hit localhost:3000 in your browser and you have a private ChatGPT that runs entirely on your network.
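That OpenAI compatibility is concrete: the same chat-completions request shape cloud clients send works against localhost. A stdlib-only sketch of the /v1/chat/completions endpoint (model name and prompt are just examples; assumes Ollama on the default port):

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, content: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        # Ollama ignores the API key, but OpenAI clients insist one is set
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},
    )

def chat(model: str, content: str) -> str:
    """Send one user message and return the assistant's reply."""
    with urllib.request.urlopen(build_chat_request(model, content)) as resp:
        body = json.loads(resp.read())
        return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat("llama3.2", "Give me one home lab project idea."))
```

This is why most OpenAI-compatible tools only need a base URL change to go local.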


Realistic Speed

N100 mini PC, 16GB RAM, Llama 3.2 3B: 5-15 tokens/second. Readable for chat, a bit slow if you’re waiting on a long response.

Ryzen 9, 32GB RAM, Qwen 14B: 8-20 tokens/second depending on quantization. Usable for real work.

For real-time code completion or anything where latency matters, you still want a GPU or a cloud API. For background processing, automation, and having a model always available without recurring costs — local works.
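Those tokens/second figures are easy to measure yourself: Ollama’s /api/generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds), so the rate falls straight out of the arithmetic:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's /api/generate timing fields into tokens/second.
    eval_duration is reported in nanoseconds."""
    return eval_count * 1_000_000_000 / eval_duration_ns

# e.g. 180 tokens generated in 12 seconds of eval time:
print(tokens_per_second(180, 12_000_000_000))  # 15.0
```

Running ollama run llama3.2 --verbose prints a similar timing breakdown after each response, so you don’t even need to hit the API to benchmark your own hardware.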


The thing that surprises most people is how capable the smaller models are for practical tasks. They’re not GPT-4, but for summarization, classification, code help, and automation, 3B and 7B models are genuinely useful — and nothing leaves your network.


Amazon affiliate links on this page.