Duval LC

Stop Using Ollama for Local LLMs (The Alternatives Are Easier Than You Think)



If you’ve ever dipped your toes into running AI models on your own hardware, you’ve almost certainly stumbled upon Ollama. It’s the default recommendation in tech forums, the star of YouTube tutorials, and the starting point for most self-hosting guides for local LLM inference—and for good reason. Getting a model up and running with Ollama is as simple as typing ollama run gpt-oss-20b and waiting. It’s often called the “Docker of local LLMs,” and that comparison isn’t accidental: some of Ollama’s creators come from the Docker team, bringing that same focus on simplicity to AI inference.

But Ollama’s convenience comes with hidden costs. Once you peek under the hood, it’s hard to justify sticking with it when better alternatives exist. It’s slower than necessary, locks you into opaque settings you can’t easily tweak, and the project’s direction has raised red flags for anyone who values open-source transparency. As someone who’s run local LLMs for years, I abandoned Ollama for most projects long ago—and you might want to, too.





Ollama Is Slower Than the Tools It’s Built On (And Hides the Fixes)

The biggest immediate issue with Ollama is performance. Community benchmarks and developer reports consistently show that running the same model through Ollama yields fewer tokens per second than running it directly through llama.cpp—the tool Ollama was originally built on. The gap isn’t trivial; it’s tangible when you’re waiting for output during coding, research, or agent workflows.

Much of this slowness stems from Ollama’s questionable default settings. Take the context window: for most users, it defaults to just 4,096 tokens (it was even lower before). Ollama dynamically adjusts this based on VRAM, but those adjustments only kick in for GPUs with more than 24GB of VRAM—something most casual users don’t have. Even Ollama’s own documentation admits you need at least 64,000 tokens for “tasks which require large context like web search, agents, and coding tools.”

In a world where modern models like Gemma 4 support 128K or 256K context (and new architectures reduce VRAM usage for long context), a 4K default feels outdated and restrictive. If you don’t manually increase the num_ctx setting via environment variables, commands, or the Ollama API, it becomes a bottleneck for long-context tasks—something beginners (Ollama’s target audience) won’t notice until it’s too late.
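If you do stay on Ollama for now, the context window is at least fixable. Here's a sketch of the three mechanisms mentioned above—environment variable, interactive command, and API option—with the model name and token counts purely illustrative; check Ollama's current docs for your version:

```shell
# 1. Globally, via environment variable before starting the server:
OLLAMA_CONTEXT_LENGTH=16384 ollama serve

# 2. Interactively, inside an `ollama run` session:
#    >>> /set parameter num_ctx 16384

# 3. Per request, through the API's options field:
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss-20b",
  "prompt": "Summarize this document...",
  "options": { "num_ctx": 16384 }
}'
```

Note that raising num_ctx increases VRAM usage, so pick a value your GPU can actually hold.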

To make matters worse, Ollama’s abstraction layer adds unnecessary overhead that raw llama.cpp avoids. The nullmirror team documented their switch from Ollama to llama.cpp and found consistent throughput improvements across every model they tested—with zero quality tradeoffs. Their conclusion was blunt: throughput and control mattered more than Ollama’s convenience.


The Trust Problem: Ollama’s Choices Are Hard to Ignore

Performance is a tradeoff you can choose to accept, but trust is another story—and Ollama has been losing it steadily over time.

Take the DeepSeek R1 fiasco in early 2025. When DeepSeek released its R1 model family, Ollama listed smaller, distilled variants (like DeepSeek-R1-Distill-Qwen-32B) simply as “DeepSeek-R1” in its library. This caused massive confusion: social media was flooded with users claiming they were running the full 671-billion-parameter DeepSeek-R1 on consumer hardware, when they were actually using tiny, behaviorally different distilled models. Ollama knew the difference but chose to obscure it—likely because “DeepSeek-R1” drives more downloads than the full, less catchy name. Even today, ollama run deepseek-r1 launches the 8B Qwen3-derived distilled variant, not the real thing.

Then there’s vendor lock-in. Ollama stores models with hashed filenames in a proprietary registry format, making it surprisingly hard to use your downloaded models with other tools like LM Studio or llama.cpp. If you’ve been using Ollama for months, you can’t just point another inference engine at your model files without extra work. You can bring your own GGUFs to Ollama via a Modelfile, but moving Ollama models to other platforms is a hassle—a lock-in most users don’t notice until they try to leave.
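Going the other direction—into Ollama—is at least straightforward. A minimal sketch of importing your own GGUF via a Modelfile (filenames are illustrative):

```shell
# Point a Modelfile at an existing GGUF on disk, then register it:
echo 'FROM ./my-model.Q4_K_M.gguf' > Modelfile
ollama create my-model -f Modelfile
ollama run my-model
```

The asymmetry is the point: your GGUFs can go in easily, but Ollama's hashed blobs don't come back out cleanly.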

Ollama also moved away from llama.cpp as its backend a year ago, building a custom implementation on top of ggml (the lower-level library llama.cpp uses). Their stated reason? Stability—llama.cpp moves fast and breaks things, and Ollama’s enterprise partners need reliability. On paper, that’s fair. In practice, their custom backend has reintroduced bugs llama.cpp solved years ago, including broken structured output support and other regressions.

There have also been complaints about license attribution: Ollama’s binary distributions are accused of not properly crediting the llama.cpp authors whose work it’s built on. To add insult to injury, its GUI app launched without being in the main GitHub repo, with an unclear license and no source code (it’s there now, but the botched rollout eroded trust). If your project markets itself as open-source, vagueness about what’s open at launch is unforgivable.

To Ollama’s credit, they’ve tried to make amends—adding a “Thank you” to llama.cpp authors in a blog post during the attribution controversy. But Ollama is a Y Combinator-backed startup with venture capital funding, which means its incentives aren’t purely community-driven. The confusing GUI launch, restrictive model registry, and move away from llama.cpp all point to a focus on product control over user transparency.


The Alternatives Are Easier Than You Think (You Don’t Need Ollama)

The biggest myth about Ollama is that it’s the only beginner-friendly option for local LLMs. The tools Ollama was built on are now just as easy to set up—and far more powerful.

Let’s start with llama.cpp: the C++ inference engine that powers most of the local LLM ecosystem. It gives you full control over everything Ollama hides—including an OpenAI-compatible API server, customizable context windows, and sampling parameters—with consistently better throughput. For even more speed, try ik_llama.cpp, a fork that boosts CPU and multi-GPU performance (with 3-4x improvements in some multi-GPU setups).
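A sketch of what that looks like in practice, using llama.cpp's llama-server binary (prebuilt releases are available; the model path, context size, and port here are illustrative):

```shell
# Start llama.cpp's OpenAI-compatible server with an explicit context window:
llama-server -m ./my-model.Q4_K_M.gguf -c 16384 --port 8080

# Then talk to it with any OpenAI-style client or plain curl:
curl http://localhost:8080/v1/chat/completions -d '{
  "model": "my-model",
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 0.7
}'
```

Everything Ollama hides—context size, sampling, GPU offload—is an explicit flag here, so the settings you're running with are never a mystery.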

If you want Ollama-style automatic model swapping, llama-swap does it with a single YAML config file: it sits in front of llama.cpp, routes each request to the right model, and spins models up and down as needed. llama.cpp itself also ships a built-in web UI for browser-based chat.
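A llama-swap config can be as short as a few lines per model. This sketch follows the config shape described in the project's README as I understand it—model names, paths, and context sizes are illustrative, and ${PORT} is a macro llama-swap fills in at launch; verify against the current documentation:

```yaml
# config.yaml for llama-swap (illustrative)
models:
  "qwen-8b":
    cmd: llama-server --port ${PORT} -m /models/qwen-8b.Q4_K_M.gguf -c 16384
  "coder-14b":
    cmd: llama-server --port ${PORT} -m /models/coder-14b.Q4_K_M.gguf -c 32768
```

Requests naming "qwen-8b" get routed to the first process, "coder-14b" to the second, with llama-swap starting and stopping them so only what's needed occupies VRAM.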

For a polished GUI, LM Studio is a game-changer. It supports any GGUF model, exposes all of llama.cpp’s optimization options through a clean interface, and has no proprietary format or lock-in—you can use your model files with any tool. koboldcpp is another great GUI option, offering granular control over every sampling parameter and a built-in web UI.

If you’re serving models to multiple users or running agentic workflows, vLLM is the best choice. It handles continuous batching and PagedAttention, which make a huge difference in intense workloads (it’s what I use for Claude Code on my ThinkStation PGX). And for a frontend that works with all these backends, Open WebUI plugs in seamlessly.
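Getting vLLM's OpenAI-compatible server up is a one-liner. A sketch, with the model name and flag values purely illustrative (vLLM pulls weights from Hugging Face by default):

```shell
# Serve a model with vLLM; continuous batching is on out of the box.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90
```

The same OpenAI-style clients that work against llama-server work here too, which is what makes mixing and matching these backends painless.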

None of these tools take more than a few minutes to set up. The idea that Ollama is the only beginner-friendly option falls apart once you try the alternatives—many have caught up to (and surpassed) Ollama’s ease of use.


Final Thought: It’s Time to Ditch Ollama (If You Haven’t Already)

Ollama served a crucial purpose when local LLMs were new and tooling was rough. But those days are gone. Today’s alternatives are faster, more transparent, and free of the baggage of a startup prioritizing control over user needs.

If you’re still using Ollama out of habit, it’s time to make the switch. The learning curve is minimal, and the rewards—better performance, more control, and no lock-in—are worth it. Ollama is great for your first local LLM run, but it’s not the tool you want to rely on long-term.