Qwen3 is the open model built for business

Alibaba's Qwen3 brings thinking-mode reasoning, a 235B mixture-of-experts architecture, and state-of-the-art benchmark results, at a size businesses can actually self-host.

Alibaba released Qwen3 in April 2025, and it shifted the self-hosted AI conversation. It's not just another incremental update: the architecture is different, the benchmark results are different, and, critically, the economics for businesses running their own infrastructure are different.

If you've been waiting for an open model that can handle serious business workloads without sending data to an external API, this is the closest thing to a clear answer we've had.

What changed in Qwen3

Thinking mode

Qwen3 introduces a switchable reasoning mode. You can ask the model to think through a problem step-by-step before answering — similar to how OpenAI's o-series models approach complex tasks — or bypass it entirely for fast, direct responses.

This matters for business workloads because different tasks genuinely need different behaviour. Document summarisation, ticket classification, and email drafts don't need extended reasoning. Contract review, multi-step analysis, and code generation benefit from it. One model, two modes — you choose per request.
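
Here's a minimal sketch of the per-request switch using the Hugging Face transformers chat template; the model size and prompt are illustrative, and enable_thinking is the flag Qwen documents for toggling Qwen3's reasoning phase:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-14B"  # illustrative; other Qwen3 sizes work the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Classify this support ticket: 'Invoice total is wrong.'"}]

# enable_thinking=True injects Qwen3's step-by-step reasoning phase;
# False skips it for a fast, direct answer. Set it per request.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

In thinking mode the model wraps its reasoning in <think> tags before the final answer, so anything parsing the output programmatically should strip that span.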

Mixture-of-experts architecture

The flagship Qwen3 model is 235B parameters in total but activates only about 22B per token at inference time. That's the mixture-of-experts design: a large pool of specialist sub-networks, with only a subset running for any given input.

The practical consequence is that compute per token scales with the 22B active parameters, not the full 235B, so you get 235B-scale capability at roughly 22B-scale inference cost. The full weight set still has to sit in GPU memory, which is where quantisation comes in: quantised, a model that would otherwise require eight high-end GPUs can run on two. For businesses where hardware cost is the main barrier to self-hosting, this is the change that makes it viable.
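
To make the routing idea concrete, here's a toy top-k MoE layer in PyTorch. It illustrates the general technique, not Qwen3's actual implementation; the dimensions and expert counts are made up:

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Toy mixture-of-experts layer: a router picks the top-k experts per
    token, so only a fraction of the weights does work for any given input."""

    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)       # normalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

All eight experts' weights live in memory, but each token only pays the compute cost of two of them. Scale that ratio up and you have the 235B-total, 22B-active trade-off.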

Benchmark performance

Qwen3-235B-A22B ranks competitively with GPT-4o and Claude Sonnet across standard benchmarks: coding, reasoning, instruction following, multilingual performance. The mid-size models, the 30B-A3B MoE and the 14B dense model, punch well above their weight class on most business-relevant tasks.

More importantly, the smaller models (8B–14B) are genuinely useful for production workloads, not just demos. That's the range where most businesses end up running, and Qwen3 holds up there in a way previous generations didn't always manage.

Why it's a strong fit for businesses

Data stays on your infrastructure

The case for self-hosting is stronger the more sensitive your data is. Healthcare, legal, finance, and any business handling customer data under GDPR or similar regimes can't freely route that data through an external API. Qwen3, fully self-hosted, means the model runs in your environment and data never leaves.

This isn't just a compliance argument. It's also a competitive one — your prompts, your documents, and your domain-specific usage patterns aren't training anyone else's model.

Cost becomes predictable

External API pricing is consumption-based, which is fine at low volume and painful at scale. A self-hosted Qwen3 model has fixed infrastructure costs that stay flat as usage grows, at least until you hit capacity. Teams running hundreds of thousands of requests per month typically find the break-even point arrives faster than expected.
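
A back-of-envelope version of that break-even calculation, where every number is a placeholder assumption to replace with your own quotes:

```python
# All figures are assumptions; substitute real pricing and measured token counts.
api_price_per_1m_tokens = 5.00      # blended $/1M tokens for a hosted frontier API
tokens_per_request = 2_000          # average prompt + completion
requests_per_month = 300_000

api_monthly = requests_per_month * tokens_per_request / 1_000_000 * api_price_per_1m_tokens

gpu_rental_monthly = 1_800          # one A100-class GPU, on-demand cloud rental
ops_overhead_monthly = 1_000        # fraction of an engineer's time for upkeep
self_hosted_monthly = gpu_rental_monthly + ops_overhead_monthly

print(f"API: ${api_monthly:,.0f}/mo  vs  self-hosted: ${self_hosted_monthly:,.0f}/mo")
# With these assumptions: API $3,000/mo vs self-hosted $2,800/mo,
# i.e. break-even lands right around this request volume.
```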

Fine-tuning on your domain

Qwen3 supports fine-tuning. If your business has proprietary terminology, internal processes, or domain-specific knowledge that a general model doesn't handle well, you can train on your own data. A fine-tuned 14B Qwen3 model often outperforms a generic frontier model on your specific use cases — and you own it entirely.
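
One common route is LoRA fine-tuning via Hugging Face's trl and peft libraries. A minimal sketch, assuming your examples live in a local train.jsonl; the hyperparameters are starting points, not recommendations:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Your domain data: one training example per line, e.g. {"text": "..."}
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# LoRA trains small adapter matrices instead of all 14B base weights
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model="Qwen/Qwen3-14B",
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="qwen3-14b-domain", num_train_epochs=1),
)
trainer.train()
```

The resulting adapter is a few hundred megabytes you can version, audit, and serve alongside the base model, which is what "owning it entirely" looks like in practice.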

How to run it

The two most practical options, covering evaluation through production:

Ollama is the fastest path to evaluation. Pull a model, run it, hit a local API endpoint. Useful for testing and development; not suited for multi-user production traffic.
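
For example, after an `ollama pull qwen3:14b`, Ollama exposes an OpenAI-compatible endpoint you can hit from any standard client (the port and model tag below are Ollama's defaults):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost:11434 by default;
# the api_key value is ignored but the client library requires one.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Draft a two-line reply confirming receipt of an invoice."}],
)
print(response.choices[0].message.content)
```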

vLLM is the standard for production serving. It handles concurrent requests, batching, and quantisation properly, exposes an OpenAI-compatible API, and scales with load. Most teams deploying Qwen3 in production end up here.
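
A sketch of that setup. The serve invocation is vLLM's standard CLI, and chat_template_kwargs is how recent vLLM versions pass Qwen3's thinking switch through the OpenAI-compatible API; treat both as things to verify against your vLLM version:

```python
# Start the server first (shown as a comment to keep this self-contained):
#   vllm serve Qwen/Qwen3-14B
# vLLM then exposes an OpenAI-compatible API on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[{"role": "user", "content": "List the renewal terms in this clause: ..."}],
    # Per-request thinking switch, forwarded to Qwen3's chat template
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)
```

Because both Ollama and vLLM speak the OpenAI API, application code written against one moves to the other by changing the base URL.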

Hardware for the most common configurations:

Model              VRAM (FP16 weights)   Notes
Qwen3-8B           ~16 GB                Single 24 GB consumer GPU
Qwen3-14B          ~30 GB                Single A100 or equivalent
Qwen3-30B-A3B      ~60 GB                MoE; ~3B active per token, all weights loaded
Qwen3-235B-A22B    ~470 GB               MoE; multi-GPU, 2× A100 feasible at 4-bit

Quantised versions cut these requirements roughly in half at 8-bit and to about a quarter at 4-bit, with modest quality trade-offs that most business applications don't notice.
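
As a sketch, 4-bit loading through transformers and bitsandbytes looks like this; vLLM and Ollama have their own quantisation paths, and Qwen also publishes some pre-quantised builds worth checking first:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weights cut the 14B footprint to roughly a quarter of FP16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",
    quantization_config=bnb_config,
    device_map="auto",
)
```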

The deployment question

Running your own model means owning the infrastructure. That includes model versioning, monitoring, latency management, and update cadence. For teams that want the privacy and cost benefits without the operational overhead, managed self-hosted deployments, where the ops layer is handled but data stays in your environment, are worth considering.

The question isn't whether Qwen3 is capable enough. For most business workloads, it is. The question is whether your team has the infrastructure capacity to run it, or whether you need help getting there.


If you're evaluating Qwen3 for your business — or trying to work out what size model makes sense for your workload — get in touch. I've run these deployments across regulated and unregulated industries and can usually cut the evaluation time significantly.