Running large language models locally is no longer a research novelty; it is a workflow that anyone with a mid-tier gaming laptop can adopt. Over the past month I benchmarked, tweaked, and occasionally broke a Lenovo Legion 5 fitted with an RTX 3070 8 GB GPU, searching for the fastest stable way to run the Qwen family of large language models as well as a handful of worthy challengers. The result is a repeatable playbook that balances throughput, context window, and thermal sanity, proving that meaningful AI workloads fit comfortably on hardware you can carry to a cafe.
Why bother with local inference?
Cloud APIs are convenient, but they come with latency, metered pricing, and privacy trade-offs. Running models locally solves those problems, and if you set things up right the speed can rival remote endpoints. There are three more subtle reasons to pay attention:
- Customization: You own the weights, so you can quantize, fine-tune, or chain them with other on-device tools. Need a domain-specific chatbot? Fine-tune overnight without asking a vendor’s permission.
- Reliability: An offline assistant keeps answering even when the coffee-shop Wi-Fi dies, the corporate proxy fails, or the openai.com status page turns orange.
- Rapid iteration: When a change - a prompt tweak, a system message, a Modelfile parameter - produces instant feedback, you experiment more aggressively and learn faster.
Running a model locally also changes your relationship with the technology. You start noticing details that benchmarks often hide: fan curves, thermal limits, and the surprising way a single `num_ctx` tweak can trade performance for conversational memory. Those insights make you a better engineer even when you return to server-hosted models.
The test bench: hardware and software
All benchmarks were captured on a single machine so relative scores are apples to apples:
- Laptop: Lenovo Legion 5 Gen 6.
- CPU: AMD Ryzen 7 5800H, 8 cores, 16 threads, 4.4 GHz boost.
- GPU: NVIDIA RTX 3070 Laptop, 8 GB GDDR6, 5120 CUDA cores. Effective VRAM for models: ~7.5 GiB after driver reservations.
- Sys RAM: 32 GiB DDR4-3200, dual channel.
- Storage: 1 TB NVMe SSD at 3.5 GB/s read.
- Operating system: Arch Linux 6.9-zen kernel with proprietary NVIDIA driver 550.xx.
- Software stack: Ollama 0.9.0 with `OLLAMA_NEW_ENGINE=true`, the CUDA backend enabled, and unified memory switched on (`GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`).
The unsung hero is Ollama’s Modelfile. By pinning `num_ctx` and `num_gpu` per build I could trade context window for offloaded layers with surgical precision. Because those parameters translate directly into the memory allocated to the KV cache and the weight buffers, they are your steering wheel and throttle respectively.
Models, quantization schemes, and why they matter
Seven base models made the short-list, each converted to GGUF for llama.cpp compatibility. Two of them were MoE giants, the rest dense “small” networks:
- Qwen3 30B MoE -- original GGUF Q4_K_M and an Unsloth re-quantized variant. 30 billion parameters, but only a subset is active per token because of the mixture-of-experts routing.
- Qwen3 8B and Qwen3 4B -- dense cousins for baseline speed.
- Gemma3 4B-it (Q8_0) -- suspiciously fast, possibly mislabeled FP16.
- Cogito 8B -- a Llama-architecture reference tuned for reasoning.
- Phi-4 Mini 3.8B -- tested in both Q8_0 and FP16 to experience the pain of half precision.
Quantization drives everything. Q4_K_M keeps weights tight without butchering accuracy and lets a 30B MoE run at useful speeds against 8 GB when you juggle which layers live on the GPU. Q8_0 is heavier but lets tiny models saturate the GPU and fly. FP16, as the Phi-4 run showed, is pure self-punishment on consumer silicon.
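A back-of-envelope size calculation shows why. The bytes-per-weight figures below are rough averages (real K-quant files mix block formats and carry metadata), and the parameter counts are approximate, but they make the fit-or-spill question obvious:

```python
# Rough GGUF size estimates; bytes-per-weight values are approximations,
# since K-quants mix block types and files carry extra metadata.
BYTES_PER_WEIGHT = {"Q4_K_M": 0.56, "Q8_0": 1.06, "FP16": 2.0}

def approx_size_gib(params_billion: float, quant: str) -> float:
    """Approximate on-disk / in-memory weight size in GiB."""
    return params_billion * 1e9 * BYTES_PER_WEIGHT[quant] / 2**30

for name, params, quant in [("Qwen3 30B MoE", 30.5, "Q4_K_M"),
                            ("Gemma3 4B-it", 4.3, "Q8_0"),
                            ("Phi-4 Mini", 3.8, "FP16")]:
    print(f"{name:15s} {quant:7s} ~ {approx_size_gib(params, quant):.1f} GiB")
```

The 30B at Q4_K_M lands around 16 GiB of weights, so only part of it can ever live in 7.5 GiB of usable VRAM, while a Q8_0 4B model fits whole with room left for the KV cache.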
A primer on mixture-of-experts vs dense LLMs
The Qwen3 30B is not really thirty billion parameters in the traditional sense. It is a 128-expert architecture where only eight experts activate per token, meaning roughly three billion parameters are multiplied with the hidden state while the rest sleep. That sparsity delivers near-GPT-3.5 quality at a fraction of the dense-model compute, but it complicates memory planning.
In llama.cpp a MoE layer is decomposed into multiple GEMM calls. Offload exactly N layers and you actually offload N sets of experts. Oversubscribe VRAM and CUDA spills to host memory, which is an order of magnitude slower than device GDDR. The art of optimisation is finding the layer count that keeps all active experts resident while leaving enough room for the KV cache.
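To make that budgeting concrete, here is a rough estimator in the spirit of the hand calculations I did. The per-layer weight size, KV-head count, and head dimension are illustrative placeholders, not values read from the GGUF; the real numbers come out of the model metadata and the ggml load log:

```python
# Sketch of the VRAM Tetris: offloaded weights + KV cache + runtime overhead
# must stay under what the driver leaves free (~7.5 GiB on this 8 GB card).
# All sizes below are placeholders; read the real ones from the ggml load log.

def kv_cache_gib(layers: int, ctx: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """fp16 K and V tensors for every GPU-resident layer."""
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem / 2**30

def vram_estimate_gib(layers: int, gib_per_layer: float, ctx: int,
                      kv_heads: int = 4, head_dim: int = 128,
                      overhead_gib: float = 0.6) -> float:
    """Offloaded weights + KV cache + runner overhead, in GiB."""
    return layers * gib_per_layer + kv_cache_gib(layers, ctx, kv_heads, head_dim) + overhead_gib

# e.g. 19 offloaded layers at ~0.35 GiB each with an 8k window
print(f"{vram_estimate_gib(19, 0.35, 8192):.2f} GiB")  # compare against the 7.6 GiB cliff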
Methodology: hunting the VRAM sweet spot
I treated VRAM like Tetris blocks. Each model layer offloaded to CUDA eats a chunk, each extra token in the context window allocates KV cache, and the Ollama runner itself needs overhead. Cram too much and the driver falls back to unified memory, slashing throughput. The process for every model looked like this:
- Set `num_ctx 8192`, then generate and time a 512-token prompt to get a baseline tokens per second (see the timing sketch after this list).
- Increment `num_gpu` (i.e., layers on the GPU) until speed plateaued and then crashed, noting VRAM utilisation from `nvtop` and the ggml logs.
- Repeat at 16384 and 24576 context to map the entire speed/context surface.
- Graph performance vs layer count to visualise the cliff edge.
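The timing itself needs nothing fancier than Ollama's REST endpoint on its default port. A minimal harness sketch: the model tag, prompt, and `num_predict` value are placeholders, and it reads Ollama's own eval counters rather than wall-clock time:

```python
# Minimal timing harness against Ollama's local REST API (default port 11434).
# Model tag and prompt are placeholders; num_ctx / num_gpu mirror the Modelfile.
import json
import urllib.request

def bench(model: str, prompt: str, num_ctx: int, num_gpu: int) -> float:
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_gpu": num_gpu, "num_predict": 512},
    }).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    # eval_duration is reported in nanoseconds
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

print(f"{bench('qwen3:30b-a3b', 'Summarise the history of RISC-V.', 8192, 19):.2f} tok/s")
```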
Capturing logs mattered more than I expected. Ollama prints a terse “buffer=CUDA0 size=XXXX” line for every offloaded tensor. Summing those sizes gave an exact VRAM budget, confirming that the magic 7.6 GiB mark was where performance nosedived. When the total size exceeded that by even 200 MiB, tokens per second halved.
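Summing those lines by hand gets old quickly. Here is a small parser sketch you can pipe the server log through; the regex assumes lines shaped like the `buffer=CUDA0 size=...` entries quoted above and may need adjusting for other Ollama versions:

```python
# Sum the per-tensor CUDA0 buffer sizes that Ollama/ggml print at load time.
# The regex assumes lines like "... buffer=CUDA0 size=1234.56 MiB"; adjust it
# if your Ollama version words the log differently.
import re
import sys

pattern = re.compile(r"buffer=CUDA0\s+size=([\d.]+)\s*(KiB|MiB|GiB)")
scale = {"KiB": 1 / (1024 * 1024), "MiB": 1 / 1024, "GiB": 1.0}

total_gib = 0.0
for line in sys.stdin:
    m = pattern.search(line)
    if m:
        total_gib += float(m.group(1)) * scale[m.group(2)]

# ~7% allocator overhead, per the rule of thumb above
print(f"weights on CUDA0: {total_gib:.2f} GiB (+overhead ~ {total_gib * 1.07:.2f} GiB)")
```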
Reading Ollama and llama.cpp logs like a mechanic
Until this project I skimmed server output. Now I read it line by line. Here are the greatest hits:
- `server.go:168` -- shows guessed GPU layers, often conservative for MoE models.
- `ggml.go:... buffer=CUDA0` lines -- the real weight copies; multiply by 1.07 to include allocator overhead.
- `--ctx-size` in the runner command -- if it does not match your Modelfile you have an override bug.
- `KV cache size:` printed on model load -- a direct function of context length and head dimension.
Couple those with nvidia-smi looping every second and you have a live dashboard: when usage hovers at 7.4 GiB and stays there, generation is smooth; when it spikes to 7.9 GiB then oscillates between 6 GiB and 8 GiB, unified memory has kicked in and your benchmark is invalid.
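If you would rather script that dashboard than keep a terminal of nvtop open, a small poller is enough. A sketch, assuming `nvidia-smi` is on the PATH and using the 7.6 GiB cliff from my runs as the alarm threshold:

```python
# Poll VRAM usage once per second and flag the unified-memory danger zone.
# Ctrl-C to stop.
import subprocess
import time

DANGER_GIB = 7.6  # the cliff edge observed on this 8 GB card

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    used_gib = int(out.splitlines()[0]) / 1024  # nvidia-smi reports MiB
    flag = "UNIFIED-MEMORY RISK" if used_gib >= DANGER_GIB else "ok"
    print(f"{used_gib:.2f} GiB  {flag}")
    time.sleep(1)
```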
Raw numbers: the top 10 configurations
Rank | Model | Tok/s | GPU Layers | Context | VRAM est. |
---|---|---|---|---|---|
1 | Gemma3 4B-it Q8_0 | 32.77 | 34/32* | 24 576 | 5.0 GiB |
2 | Qwen3 4B Q4_K_M | 31.41 | 25/37 | 24 576 | 6.2 GiB |
3 | Qwen3 30B Unsloth 19L@8k | 23.92 | 19/49 | 8 192 | 7.6 GiB |
4 | Qwen3 30B Unsloth 18L@8k | 23.35 | 18/49 | 8 192 | 7.4 GiB |
5 | Qwen3 30B Original 18L@8k | 22.85 | 18/49 | 8 192 | 7.2 GiB |
6 | Phi-4 Mini Q8_0 | 22.31 | ≈25/32 | 24 576 | 6.8 GiB |
7 | Qwen3 30B Original 16L@8k | 22.09 | 16/49 | 8 192 | 6.8 GiB |
8 | Qwen3 30B Original 10L@24k | 20.00 | 10/49 | 24 576 | 7.5 GiB |
9 | Cogito 8B Q8_0 | 19.84 | 28/32 | 24 576 | 7.0 GiB |
10 | Phi-4 Mini FP16 | 11.22 | FC | 8 192 | 7.9 GiB |
*34/32 means four layers spill out of the CUDA allowance and are computed on the CPU, a quirk of Gemma’s GGUF conversion.
Thermal behaviour: keeping the laptop alive
Raw speed is pointless if the machine throttles ten minutes into a marathon summarisation job. I logged GPU and CPU temperatures with `nvtop --temperature` while each benchmark ran for at least 15 minutes. Two patterns emerged:
- GPU-bound runs: Gemma3 and Qwen3 4B saturate CUDA cores, stabilising at 71-73 °C with the Legion’s fans set to ‘Performance’. Core clock holds steady at 1380 MHz.
- Memory-latency runs: The MoE giants leave part of the GPU idle due to routing overhead. They peak lower (68 °C) but fluctuate, causing fan oscillation. Enabling a custom fan curve with Tuxedo-Control eliminated micro-stutters.
Power draw hovered between 105 W and 125 W package-wide, meaning the 300 W brick still had headroom. During a continuous eight-hour coding session the palm-rest never exceeded 33 °C---your wrists will survive.
Latency vs throughput: chat isn’t batch
Benchmarks traditionally report tokens per second, but interactive chat feels sluggish if first-token latency is high. I timed four scenarios:
- Short chat: 128-token prompt, 128-token response.
- Long form: 2048-token system message, 1024-token response.
- Batch summarise: 16 documents, 512 tokens each, processed sequentially.
- Code generation: 64-token prompt, unlimited generation until stop word.
Gemma3 glittered in short chat (first token at 0.38 s), while Qwen3 30B 19L@8k won the long-form crown thanks to its deeper layers. For batch summarisation the 8B dense models edged MoE because routing overhead accumulates. Moral: pick your model not only by peak Tok/s but by the prompt pattern you actually use.
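First-token latency is easy to measure yourself: stream from the local endpoint and stop the clock at the first content chunk. A sketch, with the model tag and prompt as placeholders:

```python
# Measure first-token latency the way a chat user feels it: stream from the
# local Ollama endpoint and stop the clock on the first content chunk.
import json
import time
import urllib.request

def first_token_latency(model: str, prompt: str) -> float:
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        for line in resp:                      # newline-delimited JSON chunks
            chunk = json.loads(line)
            if chunk.get("response"):
                return time.perf_counter() - start
    return float("nan")

print(f"first token after {first_token_latency('gemma3:4b', 'Hello!'):.2f} s")
```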
Energy cost: silence the bean-counters
During the heaviest MoE run, the Kill-A-Watt meter read 128 W on the AC side. At my local rate of USD 0.23 per kWh, that equals USD 0.03 per hour. A 10 000-token coding session costs less than the coffee powering it---a trivial spend compared with cloud inference pricing.
Modelfiles you can copy-paste
Rather than burying configs in screenshots, here is a representative Modelfile for the Unsloth build, reflecting the 19-layer, 8k-context configuration from the table above (point the FROM line at wherever your GGUF lives):
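```
# Sketch of the 19-layer / 8k-context configuration; the FROM path is a
# placeholder for your local Unsloth GGUF.
FROM ./Qwen3-30B-A3B-Q4_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER num_gpu 19
PARAMETER num_thread 8
```

Build and run it with `ollama create qwen3-30b-19l -f Modelfile` followed by `ollama run qwen3-30b-19l`; swap `num_gpu` and `num_ctx` to reproduce the other rows in the table.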
Limitations and open questions
- FP16 drought: Anything above 7.6 GiB VRAM forces host-to-device streaming, so true FP16 runs are painful. RTX 4080 Laptop users will have more breathing room.
- Quantisation artefacts: Q4_K_M is mostly safe, but I noticed occasional nonsense in regex generation tasks. Q5_1 may be a better sweet spot when quality matters more than speed.
- Windows reproducibility: I tested on Arch; early reports say the same configs work under WSL 2 with CUDA 12, but I have not verified fan behaviour or token speeds.
- Inference framework lock-in: Ollama’s Modelfile is elegant but proprietary. Porting these numbers to LMStudio or llama-cpp-server required manual flag mapping. A common schema would help the community.
Conclusions: the take-away playbook
- Measure before you tweak: Watch VRAM, not just Tok/s. The fallback to unified memory is the silent killer.
- Start small, scale context: Find the layer count that maxes GPU RAM at 8k tokens, then stretch the window. It is easier than the other way around.
- Dense for chat, MoE for docs: Interactive work favours tiny high-clock models; deep reasoning across long pages favours sparsity.
- One CPU flag away from throttling: An errant llama.cpp compile without --avx2 costs you 30 % speed even if everything looks “fine”.
- Quantise with intent: Q4_K_M is the default, but invest time trying Q5_1 or Q6_K if your domain punishes precision loss.
The bigger moral is that local inference is no longer an exotic party trick. With a methodical approach---and an hour of tracing log outputs---you can push frontier-grade models through silicon that fits in a backpack. The cloud is still king for massive batched workloads; for personal dev, research, and tinkering, the 8 GB RTX 3070 has become surprisingly regal.
Next steps
My own roadmap is clear:
- Flash the Legion’s BIOS to unlock the 140 W TGP limit and rerun the MoE benchmarks.
- Test the freshly released Qwen3 235B with Sliced Attention and VRAM paging tricks.
- Port everything to a Ryzen-7840U handheld PC to answer once and for all: can a Steam Deck run GPT-3.5 quality offline?
Expect follow-up numbers and, inevitably, more melted USB-C chargers.
Did this guide help?
If you replicated the results, improved them, or discovered a better trick, drop a comment on the GitHub issue linked in the sidebar. Collective tinker-power beats any single benchmark run.