Abstract header graphic for the local LLM benchmarking blog post

Local LLM Performance Benchmarks 2026: Qwen, Gemma, and Ministral

February 16, 2026

genai lmstudio apple qwen gemma ministral local-llm

Today I'm investigating Tokens Per Second (TPS), Time to First Token (TTFT), and overall output quality of local LLMs. I'll be running a series of benchmarks on a Linux VPS (CPU-only) and Apple Silicon (M2 and M4) to gain insights into how Unified Memory Architecture and modern instruction sets impact local AI performance in 2026.

What I learned: CPU vs. Unified Memory Performance

Here is what I learned from this 2026 benchmarking exercise:

Local models have drastically improved: Last time I checked them (about half a year ago), both the output quality and the inference speed were questionable. Today's GGUF/llama.cpp-powered inference is significantly more robust.
The 3B Parameter Sweet Spot: Smaller models (up to 3B parameters) now produce acceptable quality for most tasks. Maybe my standards are too low, but at least they no longer give me "the capital of Moscow is Russia" when asked "what's the capital of France?"
Qwen 2.5 is impressive: The 0.5B model produces surprisingly good output for text generation tasks like grammar suggestions.
CPU-only inference is viable: Tiny models run surprisingly fast on CPU-only machines (about 27 TPS for qwen2.5-0.5b and 18 TPS for gemma-3-1b).
Mac Mini M4 is a value king: I did my tests on cloud Mac instances and am now considering getting a Mac Mini. The price/quality ratio for non-Pro versions with high-bandwidth Unified Memory is unbeatable for local LLMs.
Ministral 3B for Reasoning: Mistral's 3B model was the smallest model that passed my simple reasoning test, though it lagged in creative storytelling.
The Memory Bandwidth Bottleneck: Long-context ingestion on CPU-only machines is painfully slow. A 2-minute wait for a 14B model to "read" a prompt is a workflow killer. The Mac M4's 5-6 second TTFT keeps the process fluid.
LM Studio & Memory Management: LM Studio looks to be a solid choice for model management, handling memory pressure robustly even for 14B models on 16GB systems.

The Test Bench

I'm testing GGUF equivalents of nine models across three hardware configurations. The goal is to measure raw speed (Tokens Per Second), responsiveness (Time to First Token), and logical reasoning capabilities. Models:

Terminal screenshot showing the list of local LLM models being tested including Qwen, Gemma, and Ministral

I'm using LM Studio for model management and inference via the lms CLI utility, which worked perfectly in headless mode on Linux. On Mac, however, I couldn't get it to work completely headless - I had to install the app and open the UI at least once for the lms CLI to function correctly. Here is a recording of it running a Qwen model:

Animated GIF demonstrating Qwen model inference in LM Studio via CLI utility

Hardware Specifications

Platform	CPU/vCores	RAM	GPU
Linux VM	8 vCores	24 GB	None
Mac M2	8 Cores	16 GB	10 Cores
Mac Mini M4	10 Cores	32 GB	10 Cores

1. Inference Performance

I was surprised by the inference speed on a CPU-only machine - the small Qwen models (up to 3 billion parameters) performed quite well. When I last tested Gemma models locally about 6 months ago, they were slower and worse in output quality than they are today.

Not surprisingly, there was a clear performance gap between the CPU-only machine and the Macs: apart from the 0.5B model, the M4 generated tokens roughly 5-7 times faster than the Linux VM. The CPU-only machine also degraded sharply as the parameter count increased, dropping from about 27 TPS at 0.5B to about 2 TPS at 14B.

Tokens Per Second (TPS)

I used this prompt to measure the TPS:

Write a highly detailed, 1,500-word science fiction short story about a civilization living on a Dyson sphere. Focus on the engineering challenges they face and the daily life of a maintenance worker. Be extremely descriptive and do not summarize.

Results

Bar chart comparing Tokens Per Second (TPS) performance across Linux CPU-only, Mac M2, and Mac M4 for various LLMs

Model	Linux CPU-only, TPS	M2 Mac (16GB RAM), TPS	M4 Mac (32GB RAM), TPS	Output	Story quality rank* (1 = best, 9 = worst)
qwen2.5-0.5b	27.39	88.46	64.53	output.txt	9
qwen2.5-1.5b	9.40	45.06	69.79	output.txt	7
qwen2.5-3b	7.42	40.83	44.40	output.txt	6
qwen2.5-7b	3.70	18.12	21.58	output.txt	5
qwen2.5-14b	1.99	9.29	10.98	output.txt	4
gemma-3-1b	18.45	94.40	126.96	output.txt	3
gemma-3-4b	5.67	33.21	40.25	output.txt	1
ministral-3-3b	7.28	36.32	39.84	output.txt	8
ministral-3-14b	2.24	8.51	12.14	output.txt	2

* Storytelling quality was ranked by a judge LLM (Gemini), from 1 (best) to 9 (worst). It's worth pointing out that none of the models reached the requested 1,500-word count; most hovered between 800 and 1,100 words.

Time to First Token (TTFT)

To measure TTFT, I used this prompt: I provided a long text and asked the model to process it and signal completion by outputting the word "READY".
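For readers who want to reproduce these two metrics with their own tooling, here is a minimal sketch of how TTFT and TPS can be derived from any token stream. It is not the method used for the numbers in this post (those came from LM Studio); the function name `measure_stream` and the choice to compute TPS over the generation phase only are my own assumptions.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_seconds, tokens_per_second, count).

    Hypothetical helper: `tokens` is any iterable that yields one item per
    generated token, e.g. a streaming response from a local inference server.
    """
    start = clock()
    first_token_at = None
    last_token_at = start
    count = 0

    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # prompt processing ends here
        last_token_at = now
        count += 1

    if first_token_at is None:
        raise ValueError("stream produced no tokens")

    ttft = first_token_at - start
    generation_time = last_token_at - first_token_at
    # TPS covers only the generation phase, so the first token is excluded.
    tps = (count - 1) / generation_time if generation_time > 0 else float("nan")
    return ttft, tps, count
```

Keeping the clock injectable makes the function easy to check with fake timestamps. Separating TTFT from TPS matters here because the long-prompt test is dominated by prompt processing, while the story test is dominated by generation.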

Results

Line chart showing Time to First Token (TTFT) comparisons for local LLMs on Linux and Apple Silicon hardware

Model	Linux CPU-only, seconds	M2 Mac (16GB RAM), seconds	M4 Mac (32GB RAM), seconds
qwen2.5-0.5b	5.449	2.442	1.116
qwen2.5-1.5b	17.484	2.391	2.522
qwen2.5-3b	28.946	5.350	3.927
qwen2.5-7b	62.468	11.533	5.847
qwen2.5-14b	136.175	22.253	6.057
gemma-3-1b	7.193	1.467	1.715
gemma-3-4b	69.297	9.161	4.903
ministral-3-3b	29.750	5.758	2.884
ministral-3-14b	128.380	20.225	5.554

On the CPU-only machine, I had to wait more than two minutes for the 14B models to start generating a response, whereas the M4 Mac started responding within about 6 seconds for every model.
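To put the gap in numbers, the TTFT table can be turned into speedup ratios relative to the Linux VM. The sketch below is plain arithmetic on the values reported above; the helper name `ttft_speedups` is just for illustration.

```python
# TTFT in seconds, copied from the table above:
# (Linux CPU-only, M2 Mac, M4 Mac)
TTFT_SECONDS = {
    "qwen2.5-0.5b": (5.449, 2.442, 1.116),
    "qwen2.5-1.5b": (17.484, 2.391, 2.522),
    "qwen2.5-3b": (28.946, 5.350, 3.927),
    "qwen2.5-7b": (62.468, 11.533, 5.847),
    "qwen2.5-14b": (136.175, 22.253, 6.057),
    "gemma-3-1b": (7.193, 1.467, 1.715),
    "gemma-3-4b": (69.297, 9.161, 4.903),
    "ministral-3-3b": (29.750, 5.758, 2.884),
    "ministral-3-14b": (128.380, 20.225, 5.554),
}


def ttft_speedups(table):
    """Return {model: (m2_speedup, m4_speedup)} relative to the CPU-only run."""
    return {
        model: (round(cpu / m2, 1), round(cpu / m4, 1))
        for model, (cpu, m2, m4) in table.items()
    }


if __name__ == "__main__":
    for model, (m2, m4) in ttft_speedups(TTFT_SECONDS).items():
        print(f"{model:16s} M2: {m2:5.1f}x   M4: {m4:5.1f}x")
```

For the 14B models the M4 starts responding more than 20 times sooner than the CPU-only VM, while for the smallest models the advantage shrinks to about 4-5 times.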

2. Intelligence & Logic Test

I used the classic "Sally's Brothers" riddle to test reasoning:

Sally has 3 brothers. Each of her brothers has 2 sisters. How many sisters does Sally have?

For Gemma and Ministral, there was a notable "intelligence jump" at the 3B-4B parameter mark; Qwen 2.5 only got the answer right at 14B. The results:

Model Name	Result	Output Provided
qwen2.5-0.5b	❌ Wrong	5 sisters
qwen2.5-1.5b	❌ Wrong	5 sisters
qwen2.5-3b	❌ Wrong	5 sisters
qwen2.5-7b	❌ Wrong	5 sisters
qwen2.5-14b	✅ Correct	1 sister
gemma-3-1b	❌ Wrong	2 sisters
gemma-3-4b	✅ Correct	1 sister
ministral-3-3b	✅ Correct	1 sister
ministral-3-14b	✅ Correct	1 sister
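I checked these answers by hand, but with more models or more riddles an automatic check becomes handy. Here is a rough sketch of one, assuming the model states its conclusion as "N sister(s)" and that the last such mention is the final answer. Both `extract_sister_count` and `grade` are hypothetical helpers, and the heuristic will misfire on answers phrased differently.

```python
import re
from typing import Optional

# Small lookup so that answers like "one sister" are recognized too.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3,
    "four": 4, "five": 5, "six": 6,
}

_PATTERN = re.compile(
    r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\s+sisters?\b"
)


def extract_sister_count(answer: str) -> Optional[int]:
    """Return the last 'N sister(s)' count mentioned in the answer, if any."""
    matches = _PATTERN.findall(answer.lower())
    if not matches:
        return None
    last = matches[-1]
    return int(last) if last.isdigit() else NUMBER_WORDS[last]


def grade(answer: str, expected: int = 1) -> bool:
    """True if the final sister count in the answer equals the expected value."""
    return extract_sister_count(answer) == expected


if __name__ == "__main__":
    for text in ["5 sisters", "Each brother has 2 sisters, so Sally has one sister."]:
        print(text, "->", "correct" if grade(text) else "wrong")
```

Taking the last mention rather than the first matters because a model often restates the riddle ("each brother has 2 sisters") before giving its conclusion.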

3. VRAM Pressure Test

I used this prompt to run a stress test: I asked the models to summarize a sufficiently long text into 10 bullet points. All models were able to complete the task on every machine. While bigger models took much longer to process the text, no crashes were experienced.
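As a rough sanity check on why a 14B model can fit on a 16GB machine at all, the weight size can be estimated from the parameter count and the quantization width. This is a back-of-the-envelope sketch only: I did not record the exact quantization of each GGUF file in this post, so the 4-bit figure and the 4 GB headroom for the KV cache, runtime, and OS are assumptions.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes).

    Ignores the KV cache, activations, and file-format overhead.
    """
    return params_billions * bits_per_weight / 8


def fits_in_ram(
    params_billions: float,
    bits_per_weight: float,
    ram_gb: float,
    headroom_gb: float = 4.0,  # assumed placeholder, not a measured value
) -> bool:
    """True if the weights plus the assumed headroom fit into the given RAM."""
    return estimate_weights_gb(params_billions, bits_per_weight) + headroom_gb <= ram_gb


if __name__ == "__main__":
    for bits in (4, 8, 16):
        size = estimate_weights_gb(14, bits)
        verdict = "fits" if fits_in_ram(14, bits, ram_gb=16) else "does not fit"
        print(f"14B at {bits}-bit: ~{size:.0f} GB of weights, {verdict} in 16 GB")
```

Under these assumptions, a 14B model at 4 bits per weight needs about 7 GB for weights, which leaves room on a 16GB machine, while the same model at 16 bits would not fit.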