Local LLM Performance Benchmarks 2026: Qwen, Gemma, and Ministral

genai lmstudio apple qwen gemma ministral local-llm

Today I'm investigating the Tokens Per Second (TPS), Time to First Token (TTFT), and overall output quality of local LLMs. I'll be running a series of benchmarks on a CPU-only Linux VPS and on Apple Silicon (M2 and M4) to see how Unified Memory Architecture and modern instruction sets affect local AI performance in 2026.

What I learned: CPU vs. Unified Memory Performance

Here is what I learned from this 2026 benchmarking exercise:

The Test Bench

I'm testing GGUF equivalents of nine models across three hardware configurations. The goal is to measure raw speed (Tokens Per Second), responsiveness (Time to First Token), and logical reasoning capabilities. Models:

Terminal screenshot showing the list of local LLM models being tested including Qwen, Gemma, and Ministral

I'm using LM Studio for model management and inference via the lms CLI utility, which worked perfectly in headless mode on Linux. On the Mac, however, I couldn't get it to work completely headless: I had to install the app and open the UI at least once before the lms CLI would function correctly. Here is a video of it running the qwen model:

Animated GIF demonstrating Qwen model inference in LM Studio via CLI utility

Hardware Specifications

| Platform | CPU/vCores | RAM | GPU |
| --- | --- | --- | --- |
| Linux VM | 8 vCores | 24 GB | None |
| Mac M2 | 8 Cores | 16 GB | 10 Cores |
| Mac Mini M4 | 10 Cores | 32 GB | 10 Cores |

1. Inference Performance

I was surprised by the inference speed on a CPU-only machine - the small qwen models (up to 3 billion parameters) performed quite well. When I last tested gemma models locally about 6 months ago, they were slower and worse in output quality than they are today.

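Not surprisingly, there was a large performance gap between the CPU-only machine and the Macs: depending on the model, the Macs generated tokens roughly 2 to 7 times faster. The CPU-only machine struggled as the parameter count increased, dropping to about 2 TPS on the 14B models.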

Tokens Per Second (TPS)

I used this prompt to measure the TPS:

Write a highly detailed, 1,500-word science fiction short story about a civilization living on a Dyson sphere. Focus on the engineering challenges they face and the daily life of a maintenance worker. Be extremely descriptive and do not summarize.
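For readers who want to reproduce the number with their own tooling, here is a minimal sketch of a TPS calculation. It is not how the lms CLI computes its figure; it assumes TPS means tokens generated per second of wall time after the first token arrives, and `fake_stream` is a hypothetical stand-in for a real streaming model response.

```python
import time
from typing import Iterable, Iterator


def measure_tps(token_stream: Iterable[str]) -> tuple[int, float]:
    """Consume a stream of tokens and return (token_count, tokens_per_second).

    The clock starts when the first token arrives, so prompt processing
    time is excluded (that part is what TTFT captures).
    """
    iterator = iter(token_stream)
    try:
        next(iterator)  # first token marks the end of prompt processing
    except StopIteration:
        return 0, 0.0
    start = time.perf_counter()
    count = 1
    for _ in iterator:
        count += 1
    elapsed = time.perf_counter() - start
    # Only count - 1 tokens were produced inside the timed window.
    tps = (count - 1) / elapsed if elapsed > 0 and count > 1 else 0.0
    return count, tps


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one token every delay_s seconds."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    count, tps = measure_tps(fake_stream(n_tokens=50, delay_s=0.01))
    print(f"{count} tokens at {tps:.1f} tokens/s")
```

To use it against a real model, replace `fake_stream` with any iterator that yields one item per generated token.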

Results

Bar chart comparing Tokens Per Second (TPS) performance across Linux CPU-only, Mac M2, and Mac M4 for various LLMs

| Model | Linux CPU-only, TPS | M2 Mac (16GB RAM), TPS | M4 Mac (32GB RAM), TPS | Output Quality of the generated story* |
| --- | --- | --- | --- | --- |
| qwen2.5-0.5b | 27.39 | 88.46 | 64.53 | 9 |
| qwen2.5-1.5b | 9.40 | 45.06 | 69.79 | 7 |
| qwen2.5-3b | 7.42 | 40.83 | 44.40 | 6 |
| qwen2.5-7b | 3.70 | 18.12 | 21.58 | 5 |
| qwen2.5-14b | 1.99 | 9.29 | 10.98 | 4 |
| gemma-3-1b | 18.45 | 94.40 | 126.96 | 3 |
| gemma-3-4b | 5.67 | 33.21 | 40.25 | 1 |
| ministral-3-3b | 7.28 | 36.32 | 39.84 | 8 |
| ministral-3-14b | 2.24 | 8.51 | 12.14 | 2 |

* The storytelling quality of each model was ranked by a judge LLM (Gemini), from 1 (the winner) to 9 (the loser). None of the models reached the requested 1,500-word count; most produced between 800 and 1,100 words.

Time to First Token (TTFT)

To measure the TTFT, I provided a long text and asked the model to process it and signal completion by outputting the word "READY" when finished.
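The idea can be sketched in a few lines: start a clock when the request is sent and stop it when the first chunk of output arrives. This is only an illustration under assumptions, not the tooling used for the results below; `time_to_first_token` and `fake_model` are hypothetical names, and `fake_model` simulates prompt processing with a sleep.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token(start_request: Callable[[], Iterable[str]]) -> float:
    """Return the seconds between sending a request and receiving the first chunk."""
    start = time.perf_counter()
    stream = start_request()
    for _ in stream:
        # Stop at the very first chunk; the rest of the output is irrelevant here.
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any output")


def fake_model(prefill_seconds: float) -> Iterator[str]:
    """Hypothetical stand-in: 'processes' the long prompt, then answers READY."""
    time.sleep(prefill_seconds)  # simulated prompt processing
    yield "READY"


if __name__ == "__main__":
    ttft = time_to_first_token(lambda: fake_model(prefill_seconds=0.2))
    print(f"TTFT: {ttft:.3f} s")
```

Because the long input dominates the work and the expected output is a single word, the measured time is almost entirely prompt processing.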

Results

Line chart showing Time to First Token (TTFT) comparisons for local LLMs on Linux and Apple Silicon hardware

| Model | Linux CPU-only, seconds | M2 Mac (16GB RAM), seconds | M4 Mac (32GB RAM), seconds |
| --- | --- | --- | --- |
| qwen2.5-0.5b | 5.449 | 2.442 | 1.116 |
| qwen2.5-1.5b | 17.484 | 2.391 | 2.522 |
| qwen2.5-3b | 28.946 | 5.350 | 3.927 |
| qwen2.5-7b | 62.468 | 11.533 | 5.847 |
| qwen2.5-14b | 136.175 | 22.253 | 6.057 |
| gemma-3-1b | 7.193 | 1.467 | 1.715 |
| gemma-3-4b | 69.297 | 9.161 | 4.903 |
| ministral-3-3b | 29.750 | 5.758 | 2.884 |
| ministral-3-14b | 128.380 | 20.225 | 5.554 |

On the CPU-only machine, I had to wait more than two minutes for the 14B models to start generating a response, whereas the M4 Mac started responding within about six seconds for every model.
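To put the gap in relative terms, the TTFT figures from the table above can be turned into speedup factors. The snippet below is plain arithmetic on those published numbers; the `speedup` helper is just an illustrative name.

```python
# TTFT in seconds, copied from the table above: (Linux CPU-only, M2 Mac, M4 Mac)
TTFT_SECONDS = {
    "qwen2.5-0.5b": (5.449, 2.442, 1.116),
    "qwen2.5-1.5b": (17.484, 2.391, 2.522),
    "qwen2.5-3b": (28.946, 5.350, 3.927),
    "qwen2.5-7b": (62.468, 11.533, 5.847),
    "qwen2.5-14b": (136.175, 22.253, 6.057),
    "gemma-3-1b": (7.193, 1.467, 1.715),
    "gemma-3-4b": (69.297, 9.161, 4.903),
    "ministral-3-3b": (29.750, 5.758, 2.884),
    "ministral-3-14b": (128.380, 20.225, 5.554),
}


def speedup(baseline_seconds: float, other_seconds: float) -> float:
    """How many times faster 'other' is than the baseline (lower TTFT is better)."""
    return baseline_seconds / other_seconds


if __name__ == "__main__":
    for model, (cpu, m2, m4) in TTFT_SECONDS.items():
        print(f"{model:16s} M2: {speedup(cpu, m2):5.1f}x   M4: {speedup(cpu, m4):5.1f}x")
```

For the largest Qwen model, for example, the M4 reaches its first token about 22 times sooner than the CPU-only VM.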

2. Intelligence & Logic Test

I used the classic "Sally's Brothers" riddle to test reasoning:

Sally has 3 brothers. Each of her brothers has 2 sisters. How many sisters does Sally have?

There was a notable "intelligence jump" at the 3B-4B parameter mark for Gemma and Ministral, while the Qwen 2.5 family only got it right at 14B. The result:

| Model Name | Result | Output Provided |
| --- | --- | --- |
| qwen2.5-0.5b | ❌ Wrong | 5 sisters |
| qwen2.5-1.5b | ❌ Wrong | 5 sisters |
| qwen2.5-3b | ❌ Wrong | 5 sisters |
| qwen2.5-7b | ❌ Wrong | 5 sisters |
| qwen2.5-14b | ✅ Correct | 1 sister |
| gemma-3-1b | ❌ Wrong | 2 sisters |
| gemma-3-4b | ✅ Correct | 1 sister |
| ministral-3-3b | ✅ Correct | 1 sister |
| ministral-3-14b | ✅ Correct | 1 sister |
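The reasoning behind the expected answer is short: a brother's sisters are all the girls in the family, Sally included, so Sally's own sisters are that number minus one. The sketch below encodes that step and grades the outputs from the table above. The grading rule (take the first integer in the answer) is an assumption made for illustration, not the method used to fill in the table.

```python
import re


def sallys_sisters(sisters_per_brother: int) -> int:
    """Each brother's sisters are all the girls in the family, Sally included.

    Sally's own sisters are therefore those girls minus Sally herself.
    The number of brothers does not affect the answer.
    """
    girls_in_family = sisters_per_brother
    return girls_in_family - 1


EXPECTED = sallys_sisters(sisters_per_brother=2)

# Answers copied from the table above.
OUTPUTS = {
    "qwen2.5-0.5b": "5 sisters",
    "qwen2.5-1.5b": "5 sisters",
    "qwen2.5-3b": "5 sisters",
    "qwen2.5-7b": "5 sisters",
    "qwen2.5-14b": "1 sister",
    "gemma-3-1b": "2 sisters",
    "gemma-3-4b": "1 sister",
    "ministral-3-3b": "1 sister",
    "ministral-3-14b": "1 sister",
}


def grade(answer: str) -> bool:
    """Mark an answer correct if the first integer in it equals EXPECTED."""
    match = re.search(r"\d+", answer)
    return match is not None and int(match.group()) == EXPECTED


if __name__ == "__main__":
    for model, answer in OUTPUTS.items():
        print(f"{model:16s} {'correct' if grade(answer) else 'wrong'}")
```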

3. VRAM Pressure Test

To run a stress test, I asked the models to summarize a sufficiently long text into 10 bullet points. All models were able to complete the task on every machine. While the bigger models took a lot more time to process the text, no crashes occurred.