
You can now self-host a private AI model that actually performs

Qwen gives businesses a credible path to running powerful AI on their own infrastructure — no data leaving your servers, no usage bills, no vendor dependency.

A year ago, self-hosting an AI model meant accepting serious capability trade-offs. You could run something small and fast, or something capable — not both. That's changed.

Alibaba's Qwen series has quietly become one of the most practical options for businesses that want real AI capability without the risks that come with sending data to a third-party API.

Why businesses want private AI

The conversation usually starts with compliance. Healthcare, finance, legal — industries with strict data handling requirements have had to watch from the sidelines while competitors experiment with AI. Sending patient records or contract details to an external API isn't viable. Full stop.

But even outside regulated industries, the concerns are real. Proprietary pricing data, customer information, internal processes — most businesses have something they'd rather not route through a provider's servers, no matter how good the privacy policy looks.

There's also the cost argument. API pricing sounds cheap until you're running thousands of queries a day. At scale, a well-tuned self-hosted model often pays for itself within months.

What Qwen actually offers

Qwen3 — the current generation — comes in sizes ranging from 0.6B to 235B parameters. The smaller models run comfortably on a single consumer GPU. The mid-range models (8B–14B) hit a sweet spot, handling most business tasks — summarisation, classification, drafting, Q&A over documents — with response quality that genuinely competes with GPT-4 class models on focused work.

The 235B model, a mixture-of-experts architecture that activates roughly 22B parameters per token, approaches frontier capability when you have the hardware to run it — at a fraction of the inference cost of a dense model that size.

Crucially, these aren't hobbyist experiments. Qwen models consistently rank near the top of standard benchmarks, and the larger variants support context windows up to 128K tokens — enough to fit most real-world documents in a single request.

How to actually run it

The simplest path is Ollama. Pull a model, run it locally, hit a local API endpoint. Five minutes from zero to running inference. Good for evaluation and development.
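As a sketch of that loop, assuming Ollama is installed and a `qwen3:8b` tag is available (model names in Ollama's library change over time, so check before pulling):

```python
# One-time setup from a shell (tag name is an assumption):
#   ollama pull qwen3:8b

import requests

# Ollama serves a local HTTP API on port 11434 by default.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:8b",
        "prompt": "Summarise this in one sentence: ...",
        "stream": False,  # return the full result as one JSON object
    },
)
print(response.json()["response"])
```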

For production deployments, most teams move to vLLM — a serving framework built for throughput. It handles batching, quantisation, and concurrent requests properly. You point it at a Qwen model checkpoint, expose an OpenAI-compatible API endpoint, and swap out your existing API calls with minimal code changes.
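A minimal sketch of that swap, assuming a recent vLLM install where the `vllm serve` entry point is available and the checkpoint fits in GPU memory; the port and model name are defaults and placeholders, not requirements:

```python
# Start the server from a shell (exposes an OpenAI-compatible API on port 8000):
#   vllm serve Qwen/Qwen3-14B

from openai import OpenAI

# Point the standard OpenAI client at the local endpoint; the key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)
print(completion.choices[0].message.content)
```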

Hardware requirements are more approachable than they used to be. A Qwen3-14B model runs on a single A100 or equivalent. Quantised versions (4-bit, 8-bit) cut memory requirements significantly with modest quality trade-offs — often the right call in production, where predictable cost and latency matter more than the last point of output quality.
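The arithmetic behind those numbers is simple enough to sanity-check yourself: weights take roughly parameters times bytes per parameter, plus headroom for the KV cache and activations. A back-of-envelope sketch:

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough memory for model weights alone (excludes KV cache and activations)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"Qwen3-14B at {bits}-bit: ~{weight_memory_gb(14, bits):.0f} GB of weights")

# 16-bit: ~28 GB -> needs an A100-40GB class card
# 8-bit:  ~14 GB -> fits a 24 GB consumer GPU with room for KV cache
# 4-bit:  ~7 GB  -> comfortable on a 16 GB card
```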

For businesses without GPU infrastructure, cloud providers now offer dedicated instances where the model runs in your VPC. Data never leaves your environment even though you're not managing bare metal.

The integration picture

Self-hosted Qwen exposes an OpenAI-compatible API. That means most tooling — LangChain, LlamaIndex, custom agent frameworks — works without modification. The migration from an external API to a self-hosted endpoint is usually a one-line config change.
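With LangChain, for instance, the switch is just the base URL; the endpoint and model name below assume a local vLLM server like the one sketched earlier:

```python
from langchain_openai import ChatOpenAI

# Before: hosted API.
# llm = ChatOpenAI(model="gpt-4o")

# After: self-hosted Qwen behind an OpenAI-compatible endpoint.
llm = ChatOpenAI(
    model="Qwen/Qwen3-14B",
    base_url="http://localhost:8000/v1",
    api_key="unused",  # local servers typically don't validate this
)

print(llm.invoke("Draft a two-line status update.").content)
```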

You can also fine-tune on your own data. If your business has domain-specific language — technical terminology, internal processes, product knowledge — a fine-tuned 8B model will often outperform a generic frontier model on your specific tasks. That's not something you get with an external API.
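A minimal LoRA fine-tuning sketch using Hugging Face's TRL and PEFT libraries; the dataset path, hyperparameters, and model name are all placeholders, and exact arguments shift between library versions:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file of {"text": "..."} examples in your domain's language.
dataset = load_dataset("json", data_files="domain_examples.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",
    train_dataset=dataset,
    # LoRA trains small adapter matrices instead of all 8B weights.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(output_dir="qwen3-8b-domain-lora", num_train_epochs=1),
)
trainer.train()
```

The resulting adapter is typically small enough to version alongside your code, rather than shipping a full copy of the model.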

What to watch for

Running your own model means running your own infrastructure. That's a real cost in engineering time, not just hardware. You need monitoring, model versioning, a deployment pipeline, and someone who knows what to do when inference latency spikes.

The answer isn't necessarily to build all of this yourself. Managed self-hosted options exist — where the operational complexity is handled for you, but data stays in your environment. That's often the right call for teams that want the privacy benefits without the ops burden.

Also: model updates. External APIs silently improve over time. Self-hosted models stay at whatever version you deployed. That's a feature if you care about consistency; it's a tax if you want to keep up with improvements. Build an update process before you need it.

Whether it makes sense for you

If you're in a regulated industry, the answer is probably yes — at least for some workloads. The capability gap that used to make this a painful trade-off has largely closed.

If you're not regulated but processing large volumes, run the numbers. API costs at scale are often more than people expect, and the self-hosted break-even point has moved significantly as model quality has improved.
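As a toy illustration only, where every figure is an assumption to replace with your own prices and traffic:

```python
# Hypothetical figures: substitute your provider's pricing and real volumes.
tokens_per_query = 2_000             # prompt + completion
queries_per_day = 20_000
api_cost_per_million_tokens = 5.00   # USD, blended in/out rate (assumed)
gpu_cost_per_month = 2_500.00        # USD, dedicated A100-class instance (assumed)

monthly_tokens = tokens_per_query * queries_per_day * 30
api_monthly = monthly_tokens / 1e6 * api_cost_per_million_tokens

print(f"API:         ${api_monthly:,.0f}/month")  # -> $6,000/month at these rates
print(f"Self-hosted: ${gpu_cost_per_month:,.0f}/month (plus engineering time)")
```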

If you're still early and running low volume, an external API is likely fine. The goal is to understand the option exists and plan for it — so that privacy requirements or cost curves don't force a rushed migration later.


If you're evaluating self-hosted AI for your business and want help thinking through the architecture, get in touch. I've run these deployments across different industries and can usually shortcut the evaluation process considerably.