How to Run a Local AI on Your Home Lab with Ollama
A year ago, running LLMs locally was a project. Now it’s three commands. Ollama is the reason — it handles downloads, quantization, and serving, and gets out of your way.
This is one of the more popular things happening in home lab circles right now. The appeal is straightforward: no per-token costs, nothing leaving your network, and a model that’s available 24/7 without watching a billing dashboard.
What Ollama Does
Ollama is a local model server. Install it, pull a model (same workflow as docker pull), and it serves an API on localhost:11434. Works on Linux, Mac, and Windows. On Linux it runs as a systemd service — starts on boot, stays running.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2
ollama run llama3.2
Three commands to a working chat interface in your terminal.
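The same model is also reachable over HTTP the moment `ollama run` works, which is what makes Ollama useful beyond the terminal. A minimal sketch of calling the native API with nothing but the Python standard library — this assumes the default port (11434) and that `llama3.2` has been pulled:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint.

    stream=False asks for one complete JSON response instead of
    a stream of partial chunks."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(prompt: str, model: str = "llama3.2",
        host: str = "http://localhost:11434") -> str:
    """POST a prompt to a local Ollama instance and return the reply."""
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# ask("Summarize this log line: ...")  # requires Ollama running locally
```

Everything the rest of this article covers — Open WebUI, Home Assistant, n8n — is ultimately some wrapper around requests like this one.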
Hardware Requirements
This is where expectations need to be set. Models run in RAM (or VRAM with a GPU), and they’re big.
Small models (1B-7B parameters): 16GB RAM is fine. The Beelink S12 Pro and Beelink EQ12 Pro run Llama 3.2 3B without issues. Slower than a cloud API, but usable for background tasks, local automation, and chat-style interactions.
Mid-range models (13B-32B parameters): 32GB+ RAM. The Minisforum UM890 Pro is a popular choice here — Ryzen 9 8945HS, up to 96GB DDR5. The AMD integrated graphics partially accelerate inference, which gives it an edge over pure CPU. Enough headroom for a 32B model with room to run other services alongside it.
Large models (70B+): GPU territory. RTX 4090 builds are the community standard, but that’s a different article and a different budget.
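The RAM tiers above follow from a simple rule of thumb: a quantized model needs roughly (parameters × bits per weight ÷ 8) bytes for the weights, plus headroom for the context cache and runtime. A back-of-envelope sketch — the overhead constant here is a rough assumption, not an Ollama figure:

```python
def approx_model_ram_gb(params_billion: float, bits_per_weight: int = 4,
                        overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate for running a quantized model.

    params_billion  -- model size in billions of parameters
    bits_per_weight -- 4 for the common Q4 quantizations Ollama ships
    overhead_gb     -- context cache + runtime; a rough assumption
    """
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

# Llama 3.2 3B at Q4: ~3 GB -- comfortable on a 16GB machine
# Qwen 2.5 14B at Q4: ~8.5 GB -- why the 32GB tier exists
# Llama 3 70B at Q4: ~36.5 GB -- why 70B is GPU territory
```

Longer context windows push the overhead well past this estimate, so treat it as a floor, not a budget.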
Models People Actually Run
Llama 3.2 (3B) — The default starting point. Best quality-to-speed ratio on low-power hardware. Handles summarization, Q&A, and simple code help.
Mistral 7B — Slightly slower on the same hardware, but better at following complex instructions. Popular for local chatbot projects.
DeepSeek-R1 (7B) — The quantized 7B fits in 16GB and handles reasoning tasks surprisingly well. This one has been trending in the Ollama model library.
Qwen 2.5 (14B) — Needs 32GB RAM, but it’s where the quality jump for coding tasks becomes obvious. If you can run it, it’s worth trying.
ollama pull mistral
ollama pull deepseek-r1:7b
ollama pull qwen2.5:14b
Plugging It Into Other Tools
Ollama exposes an OpenAI-compatible API, which means most tools that work with OpenAI can point at your local instance instead:
- Open WebUI — Self-hosted ChatGPT interface. The most popular Ollama frontend by a wide margin.
- Continue.dev — VS Code extension for local code completion.
- Home Assistant — Local AI integration that actually works offline.
- n8n — Workflow automation with local AI nodes for data processing.
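What "OpenAI-compatible" means in practice: these tools send a standard chat-completions payload to `/v1/chat/completions`, and Ollama answers it — the API key is accepted but not checked. A stdlib-only sketch of that exchange, assuming the default port and a pulled `llama3.2`:

```python
import json
import urllib.request

def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def chat(user_message: str, model: str = "llama3.2",
         base_url: str = "http://localhost:11434/v1") -> str:
    """Send a request to Ollama's OpenAI-compatible endpoint."""
    body = json.dumps(build_chat_request(model, user_message)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key, but OpenAI clients always send one
            "Authorization": "Bearer ollama",
        },
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]
```

Any tool that lets you override the OpenAI base URL can be pointed at `http://localhost:11434/v1` the same way.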
Open WebUI alongside Ollama (the volume mount keeps chat history across container restarts):
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
Hit localhost:3000 in your browser and you have a private ChatGPT that runs entirely on your network.
Realistic Speed
N100 mini PC, 16GB RAM, Llama 3.2 3B: 5-15 tokens/second. Readable for chat, a bit slow if you’re waiting on a long response.
Ryzen 9, 32GB RAM, Qwen 14B: 8-20 tokens/second depending on quantization. Usable for real work.
For real-time code completion or anything where latency matters, you still want a GPU or a cloud API. For background processing, automation, and having a model always available without recurring costs — local works.
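Those token rates translate directly into wait times, which is the honest way to decide whether local is fast enough for a given task. Quick arithmetic (the cloud figure is illustrative, not a benchmark):

```python
def response_wait_seconds(response_tokens: int,
                          tokens_per_second: float) -> float:
    """How long a response of a given length takes to generate."""
    return response_tokens / tokens_per_second

# A ~300-token answer on an N100 at 10 tok/s: 30 seconds
# The same answer at a cloud-typical 80 tok/s: under 4 seconds
print(response_wait_seconds(300, 10))  # 30.0
```

Thirty seconds is fine for a background summarization job and painful for interactive coding — which is exactly the split described above.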
The thing that surprises most people is how capable the smaller models are for practical tasks. They’re not GPT-4, but for summarization, classification, code help, and automation, 3B and 7B models are genuinely useful — and nothing leaves your network.
Amazon affiliate links on this page.