Why Local vs. Cloud GPU Benchmarking Still Matters
Benchmarking AI performance isn't just about theoretical throughput; it's about making informed decisions when deploying resource-intensive workloads. For developers, system architects, and self-hosters, the choice between local hardware and cloud platforms involves trade-offs in cost, accessibility, and raw speed. This article compares a 24B-parameter LLM's performance across six GPU setups: one consumer laptop and five high-end cloud instances, quantifying how VRAM limitations, GPU architecture, and cloud infrastructure overhead impact real-world inference.
The Testing Stack: Hardware, Software, and Methodology
All tests used the `magistral:24b-small-2506-q4_K_M` model, a 14GB quantized 24B-parameter LLM. Ollama v0.9.1-rc0 served as the inference engine with standardized settings:
- Core variables: `OLLAMA_NUM_PARALLEL=3` (concurrent requests), `OLLAMA_FLASH_ATTENTION=1` (memory-efficient attention), and `OLLAMA_NEW_ENGINE=true` (latest optimizations).
- Test prompt: a 2,048-token context load followed by a 512-token generation task (reproduced in the sketch below).
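As a reference point, here is a minimal shell sketch of how one such run can be reproduced against a local Ollama instance. The endpoint and response fields (`eval_count`, `eval_duration`) follow Ollama's documented `/api/generate` API; the prompt placeholder and the use of `jq` are assumptions for illustration, not the exact harness used in these tests.

```bash
# Standardized environment used across all test machines (from the article).
export OLLAMA_NUM_PARALLEL=3        # allow up to 3 concurrent requests
export OLLAMA_FLASH_ATTENTION=1     # memory-efficient attention
export OLLAMA_NEW_ENGINE=true       # newer inference engine path

ollama serve &                      # start the server with the variables above
sleep 5
ollama pull magistral:24b-small-2506-q4_K_M

# One benchmark request: ~2,048-token prompt, 512-token generation.
# eval_count / eval_duration (nanoseconds) give the raw generation rate.
curl -s http://localhost:11434/api/generate -d '{
  "model": "magistral:24b-small-2506-q4_K_M",
  "prompt": "<insert the ~2,048-token test context here>",
  "stream": false,
  "options": { "num_predict": 512 }
}' | jq '{tokens: .eval_count,
          seconds: (.eval_duration / 1e9),
          tok_per_s: (.eval_count / (.eval_duration / 1e9))}'
```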
Local vs. Cloud Configuration: What Was Tuned and Why
Local hardware required aggressive VRAM management:
- Laptop (RTX 3070): `OLLAMA_GPU_LAYERS=20` offloaded only 20 of the model's 41 layers to the GPU, with `OLLAMA_CONTEXT_LENGTH=8192` capping the context window.

Cloud instances (Novita.ai) used full offload:
- Streamlined setup: identical core variables plus `OLLAMA_HOST=0.0.0.0`, with no layer or context restrictions (both profiles are shown in the sketch below).
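A sketch of the two profiles, showing only the settings that differ from the shared baseline above; the variable names and values are those listed in this article, not a general-purpose recipe.

```bash
# --- Local laptop (RTX 3070, 8GB VRAM): aggressive VRAM management ---
export OLLAMA_GPU_LAYERS=20          # offload only 20 of the model's 41 layers to the GPU
export OLLAMA_CONTEXT_LENGTH=8192    # cap the context window to bound KV-cache growth

# --- Cloud instances (Novita.ai): same core variables, no restrictions ---
export OLLAMA_HOST=0.0.0.0           # expose the API on all interfaces inside the instance
# no layer or context limits -> all 41 layers and the full KV cache live in VRAM
```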
Split Inference Explained: GPU Layer Limits and CPU Offload
When models exceed VRAM capacity, frameworks split workloads between GPU and CPU. Limiting offloading to 20 layers (of 41) forced half the inference onto the laptop's CPU, creating serialization bottlenecks as data shuttled between memory systems.
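A back-of-the-envelope view of that split, using the article's numbers; the per-layer figure is a rough average, and the `ollama ps` output shown is illustrative rather than captured from these runs.

```bash
# Rough per-layer weight budget for the 14 GB, 41-layer model:
#   14 GB / 41 layers ≈ 0.34 GB per layer
#   20 layers on the GPU ≈ 6.8 GB, leaving a little headroom in 8 GB of VRAM
#   the remaining 21 layers run on the CPU, so every token crosses system RAM

# After the model loads, Ollama reports how the split actually landed:
ollama ps
# NAME                               SIZE     PROCESSOR          ...
# magistral:24b-small-2506-q4_K_M    ~15 GB   ~50%/50% CPU/GPU   (illustrative)
```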
Real-World Performance: What 8GB VRAM Gets You
The Lenovo Legion 5 (RTX 3070 Laptop, 8GB VRAM) delivered 3.66 tokens/second, and the full 512-token test took 19m 15s end to end, unusable for interactive tasks. This wasn't a test of compute power but a VRAM-induced failure: the 14GB model plus an 8K context couldn't fit in 8GB of VRAM even with Flash Attention enabled.
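To see why the 8K context matters on top of the 14GB of weights, here is a rough KV-cache estimate. The layer, head, and dimension values are assumptions for illustration, not published specs for this model, so treat the result as an order-of-magnitude figure.

```bash
# kv_bytes ≈ 2 (K and V) x layers x kv_heads x head_dim x context x 2 bytes (fp16)
awk 'BEGIN {
  layers = 41; kv_heads = 8; head_dim = 128; ctx = 8192   # assumed architecture values
  gb = 2 * layers * kv_heads * head_dim * ctx * 2 / 1024^3
  printf "KV cache at 8K context ≈ %.1f GB on top of the 14 GB of weights\n", gb
}'
# -> roughly 1.3 GB extra; the weights alone already exceed 8 GB of VRAM
```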
Takeaway: For models exceeding ~6B parameters, 8GB VRAM forces unacceptable compromises.
The Cloud GPU Lineup: L40S, 3090, A100 SXM4, 5090, 4090
Five cloud GPUs tested via Novita.ai:
- L40S (48GB): NVIDIA's data-center card
- RTX 3090 (24GB): Earlier consumer flagship
- A100 SXM4 (80GB): High-memory data-center GPU
- RTX 4090 (24GB): Previous-gen consumer flagship
- RTX 5090 (32GB): Current top-end GPU
All ran with no layer limits and the full model offloaded to VRAM.
Benchmark Results Table and Analysis
| GPU | VRAM | Environment | Inference Rate (tokens/s) | Performance vs. Baseline | Total Time (512 tokens) |
|---|---|---|---|---|---|
| RTX 3070 Laptop (Baseline) | 8GB | Bare Metal (PC) | 3.66 | - | 19m 15s |
| L40S | 48GB | Novita.ai Cloud | 5.52 | +51% | 14m 5s |
| RTX 3090 | 24GB | Novita.ai Cloud | 8.71 | +138% | 8m 15s |
| A100 SXM4 | 80GB | Novita.ai Cloud | 8.93 | +144% | 7m 42s |
| RTX 5090 | 32GB | Novita.ai Cloud | 9.16 | +150% | 7m 40s |
| RTX 4090 | 24GB | Novita.ai Cloud | 9.42 | +157% | 5m 44s |
- Consumer dominance: RTX 4090 (24GB) outperformed all others, including the 80GB A100.
- Efficiency wins: Older RTX 3090 nearly matched A100, highlighting consumer cards' per-dollar advantage for single-user inference.
- L40S anomaly: Despite its 48GB of VRAM, it lagged significantly, likely because the card is optimized for different workloads.
- VRAM ceiling: Beyond 24GB, additional memory (A100's 80GB) didn't boost speed.
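The "Performance vs. Baseline" column follows directly from the inference rates; a quick sketch that reproduces it from the table's tokens/s figures:

```bash
# Gain over the 3.66 tok/s laptop baseline: (rate / 3.66 - 1) x 100
baseline=3.66
for entry in "L40S:5.52" "RTX 3090:8.71" "A100:8.93" "RTX 5090:9.16" "RTX 4090:9.42"; do
  gpu=${entry%%:*}; rate=${entry##*:}
  awk -v g="$gpu" -v r="$rate" -v b="$baseline" \
      'BEGIN { printf "%-9s %.2f tok/s -> +%.0f%% vs. baseline\n", g, r, (r / b - 1) * 100 }'
done
# L40S      5.52 tok/s -> +51% vs. baseline
# RTX 3090  8.71 tok/s -> +138% vs. baseline
# ... (matching the table above)
```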
The Cloud "Convenience Tax" – Why the Fastest Still Isn't Fast
Despite the RTX 4090's 9.42 tok/s lead, it still fell well short of typical bare-metal figures (15–30+ tok/s, discussed below) due to:
- Noisy neighbors: Shared hardware causing contention for resources
- Virtualization tax: Container overhead vs. bare metal
- Power throttling: Datacenter GPUs running at lower TDPs
Cost Analysis: What Does This Performance Cost?
While speed is a critical metric, the cost-effectiveness of each solution is equally important. To analyze this, I used the "On Demand" hourly rates from the Novita.ai platform at the time of testing. Each test, including launching the instance, setting up the environment, pulling the model, and running the prompt, was estimated to take 30 minutes (0.5 hours) of runtime.
Here is a breakdown of the estimated cost for each cloud-based test:
| GPU | On-Demand Hourly Rate | Test Duration | Estimated Cost per Test |
|---|---|---|---|
| A100 SXM 80GB | $1.60 / hr | 30 minutes | ~$0.80 |
| L40S 48GB | $0.55 / hr | 30 minutes | ~$0.28 |
| RTX 5090 32GB | $0.50 / hr | 30 minutes | ~$0.25 |
| RTX 4090 24GB | $0.35 / hr | 30 minutes | ~$0.18 |
| RTX 3090 24GB | $0.21 / hr | 30 minutes | ~$0.11 |
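The per-test figures are simply the hourly rate multiplied by the ~0.5 hours of runtime. A small sketch that roughly reproduces them (rounding may differ by a cent) and adds a crude value metric, tokens/s per dollar-hour, by combining the two tables:

```bash
# cost per test = hourly rate x 0.5 h; value = inference rate / hourly rate
for entry in "A100:1.60:8.93" "L40S:0.55:5.52" "RTX 5090:0.50:9.16" \
             "RTX 4090:0.35:9.42" "RTX 3090:0.21:8.71"; do
  IFS=: read -r gpu rate toks <<< "$entry"
  awk -v g="$gpu" -v r="$rate" -v t="$toks" 'BEGIN {
    printf "%-9s $%.2f/hr  ~$%.2f per 30-min test  %.1f tok/s per $/hr\n", g, r, r * 0.5, t / r
  }'
done
# A100      $1.60/hr  ~$0.80 per 30-min test  5.6 tok/s per $/hr
# ... (the consumer cards land between ~18 and ~41 tok/s per $/hr)
```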
Analysis of Cost-Effectiveness
This cost breakdown adds a fascinating layer to our performance findings:
- The Most Expensive is Not the Fastest: The A100, the most expensive card to rent by a significant margin (~$0.80 per test), was not the fastest performer. It was beaten by both the RTX 4090 and the RTX 5090.
- The "Bang for the Buck" Champions: The consumer-grade cards offered incredible value. The RTX 4090, our top performer in terms of speed, was also one of the cheapest to run at just ~$0.18 per test. The older RTX 3090 was the most budget-friendly option of all, delivering strong performance for a mere ~$0.11.
- The Datacenter Value Proposition: This data suggests that for single-user, direct inference workloads, the high rental cost of premium datacenter GPUs like the A100 may not be justified if raw speed-per-dollar is the primary concern. Their strengths likely lie in other areas, such as multi-user serving, massive-scale training, or specific computational tasks not highlighted by this test.
For an individual researcher or developer looking to run inference tasks on a budget, high-end consumer cloud instances like the RTX 4090 and 3090 clearly provide the best balance of high performance and low cost on this platform.
Comparing Cloud Results to Ideal Bare-Metal Performance
Bare-metal RTX 4090s typically hit 15–30 tok/s. Here, the cloud 4090 delivered barely 60% of even the low end of that range: a "convenience tax" where a 512-token task that took 5m 44s in the cloud could plausibly run in ~2m on dedicated hardware.
Four Lessons for Anyone Optimizing LLM Performance
- VRAM is non-negotiable: Below 16GB, large-model inference bottlenecks before compute matters.
- Cloud = accessibility, not peak speed: Eliminates VRAM barriers but introduces multi-tenant overhead.
- Consumer GPUs punch above their weight: RTX 5090/4090/3090 rival data-center cards at fractional cost.
- Test actual workloads: Synthetic benchmarks lie; measure real prompts in your environment.
Bottom line: Until consumer GPUs gain 48GB+ VRAM, cloud access remains essential, but treat it as a bridge to dedicated hardware for latency-sensitive tasks.