Why the Same LLM Uses 100% GPU on Mac but 80% CPU on Windows

I didn’t start with a grand theory about hardware architecture. I just wanted to run a model locally.
Instead, I ended up debugging CPU vs GPU usage, WSL memory limits, VRAM bottlenecks, and Apple’s unified memory—and discovered something most local LLM guides don’t explain properly.
This is that story.
The Setup
Here’s what I was running:
- Windows machine
- 32GB RAM
- RTX 2060 (6GB VRAM)
- Ollama
- Model: qwen3.6 (~27GB)
I expected GPU usage to be high.
Instead, I saw this:
ollama ps
→ 84% CPU / 16% GPU
That didn’t make sense.
I have a GPU. Why is the CPU doing most of the work?
The Surprise
Then I saw the same model running on a Mac with Apple M3 Pro:
→ 100% GPU
Same model. Same tool. Completely different behavior.
So the obvious question:
Is Mac just better? Or am I doing something wrong?
The Debugging Journey
1. VRAM Bottleneck (The Real Culprit)
My GPU:
- 6GB VRAM
Model:
- ~27GB
That means:
- The model cannot fit into GPU memory
So what happens?
- Some layers → GPU
- Remaining layers → CPU
This is called partial offloading.
Result:
- The GPU holds only the few layers that fit in 6GB, so it waits on the CPU most of the time
- The CPU does most of the heavy lifting
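To make the split concrete, here is a rough back-of-the-envelope sketch. It is not how Ollama actually decides the split. The layer count (64), the assumption that all layers are the same size, and the 1.5GB held back for the KV cache and scratch buffers are all made-up values for illustration.

```python
import math

def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float, reserve_gb: float) -> int:
    """Estimate how many equally sized layers fit in VRAM after holding back a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))

# Assumed values: 64 layers and a 1.5GB reserve are hypothetical, not measured.
N_LAYERS = 64
gpu_layers = layers_on_gpu(model_gb=27.0, n_layers=N_LAYERS, vram_gb=6.0, reserve_gb=1.5)
gpu_share = gpu_layers / N_LAYERS

print(f"{gpu_layers} of {N_LAYERS} layers on GPU ({gpu_share:.0%})")
```

With these assumed inputs the estimate lands in the same ballpark as the 16% GPU share reported above, but a different layer count or reserve would shift it.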
2. The WSL Memory Trap
Then I tried running inside Windows Subsystem for Linux.
And hit this:
model requires more system memory (20.6 GiB) than is available (18.7 GiB)
Wait… I have 32GB RAM.
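Turns out:
- WSL 2 runs inside a lightweight VM that only gets a portion of the host’s RAM by default
- Nothing warns you about the cap
Fix (in %UserProfile%\.wslconfig on the Windows side, followed by wsl --shutdown so the new limit takes effect):
[wsl2]
memory=28GB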
Lesson:
Even your RAM isn’t fully usable unless you configure it.
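A quick way to confirm what WSL actually sees is to read /proc/meminfo from inside it. The sketch below parses that file and compares the available memory against a requirement such as the 20.6 GiB from the error above. The helper names and the sample numbers are made up for illustration, and Ollama's own memory accounting may differ from this simple check.

```python
def parse_meminfo_gib(meminfo_text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: GiB} mapping."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Typical line: "MemTotal:  20971520 kB"; skip lines without a kB unit.
        if len(parts) == 2 and parts[1] == "kB":
            values[name.strip()] = int(parts[0]) / (1024 ** 2)
    return values

def has_enough_memory(meminfo_text: str, required_gib: float) -> bool:
    """True if MemAvailable covers the requirement."""
    return parse_meminfo_gib(meminfo_text).get("MemAvailable", 0.0) >= required_gib

# Inside WSL (or any Linux), read the real file:
#   with open("/proc/meminfo") as f:
#       print(has_enough_memory(f.read(), required_gib=20.6))

# Made-up sample numbers, for illustration only.
sample = "MemTotal:       20971520 kB\nMemAvailable:   19608371 kB\n"
print(has_enough_memory(sample, required_gib=20.6))
```

Running the check before and after editing .wslconfig shows whether the new limit was picked up.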
3. Windows vs Linux Overhead
Even after fixing memory:
Windows → higher CPU usage
Linux/WSL → slightly better GPU usage
Why? I didn’t measure this, but the likely suspects are:
- Driver overhead
- Memory management differences
But this still didn’t explain the Mac behavior.
The Breakthrough: Unified Memory
This is where everything clicked.
Apple Silicon (like Apple M3 Pro) uses:
👉 Unified Memory Architecture (UMA)
Meaning:
- CPU and GPU share the same memory pool
- No separate VRAM
- No copying weights between RAM and VRAM
Compare that to my PC:
| Component | Memory |
|---|---|
| CPU | 32GB RAM |
| GPU | 6GB VRAM |
They are separate pools.
What this means in practice
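On Mac:
- The entire model sits in unified memory (provided the machine has enough of it)
- The GPU can access all of it
- No splitting
Result:
100% GPU
On my PC:
- Only about 6GB of the model fits in VRAM
- The rest stays in system RAM and runs on the CPU
- Intermediate results shuttle between the two for every token
Result:
84% CPU / 16% GPU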
The Truth About “100% GPU”
This is the most misunderstood part.
“100% GPU” on Mac does NOT mean it’s faster.
It means:
The GPU can address the entire model, so every layer runs on it.
It does not mean:
That the Mac is computing faster than a high-end NVIDIA GPU.
Important distinction
| Metric | What it means |
|---|---|
| GPU % (in ollama ps) | How much of the model is loaded into GPU memory, and so where computation happens |
| Tokens/sec | Actual performance |
Mac:
- High GPU usage
- Moderate throughput
High-end NVIDIA:
- Lower % (sometimes)
- Higher throughput
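To compare machines on the metric that matters, tokens per second can be computed from the timing fields Ollama returns with each response: eval_count (tokens generated) and eval_duration (nanoseconds). Below is a minimal sketch that assumes a local Ollama server on its default port. The helper names are my own, and the sample numbers at the bottom are hypothetical.

```python
import json
import urllib.request

def tokens_per_second(stats: dict) -> float:
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

def generate_stats(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Run one non-streaming generation and return the JSON response, timing fields included."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# With a local Ollama server running:
#   print(tokens_per_second(generate_stats("qwen3.6", "Explain unified memory.")))

# Hypothetical numbers: 100 tokens generated in 4 seconds.
print(tokens_per_second({"eval_count": 100, "eval_duration": 4_000_000_000}))  # 25.0
```

Running the same prompt on both machines and comparing this number says more than the CPU/GPU split does.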
Real-World Behavior
Here’s what actually matters:
| Scenario | Mac (Unified Memory) | PC (6GB GPU) |
|---|---|---|
| Large model (27GB) | Smooth | CPU-heavy |
| Small model (7B) | Smooth | Smooth |
| GPU utilization | High | Low |
| Setup friction | Low | High |
Practical Lessons
1. VRAM > GPU power (for local LLMs)
A fast GPU with low VRAM is still a bottleneck.
2. Don’t run oversized models
Just because you can load a model doesn’t mean you should.
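A rough way to pick a model size is to estimate the weights as parameters times bits per weight, divided by 8. The sketch below does that arithmetic. The 1.5GB reserve for KV cache and buffers is an assumed figure, and real quantized files are somewhat larger than this simple formula suggests.

```python
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8

def fits_in_vram(weights_gb: float, vram_gb: float, reserve_gb: float = 1.5) -> bool:
    """True if the weights fit after holding back an assumed reserve."""
    return weights_gb <= vram_gb - reserve_gb

# A 7B model at 4 bits per weight versus the ~27GB model from this post.
small = approx_weights_gb(params_billion=7, bits_per_weight=4)
print(small, fits_in_vram(small, vram_gb=6.0))   # 3.5 True
print(fits_in_vram(27.0, vram_gb=6.0))           # False
```

By this estimate a 4-bit 7B model sits comfortably inside 6GB of VRAM, which is why the small-model row in the table above looks the same on both machines.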
3. WSL needs tuning
Without .wslconfig, you’re not using your full system.
4. Memory architecture matters more than OS
This is not:
- Mac vs Windows
This is:
- Unified memory vs split memory
Bigger Takeaway
The real insight from all this:
This isn’t a software problem. It’s a hardware architecture problem.
Final Thought
When running local LLMs, we tend to focus on:
Model size
GPU specs
Frameworks
But the actual bottleneck is often:
How memory is structured and accessed
Where This Leads (for me)
This debugging journey pushed me to rethink how I’m building my local coding assistant (bonfire):
Instead of:
- Forcing large models
I’m moving toward:
- Smaller, efficient models
- Better memory utilization
- Hybrid architectures (local + cloud)
- Lower-level inference engines like llama.cpp, which expose GPU layer offloading directly
TL;DR
- Mac shows 100% GPU because unified memory lets the GPU address the whole model
- My Windows PC struggles because of its 6GB VRAM limit, not because of Windows
- WSL can hide available RAM
- “GPU usage” ≠ “performance”
- Memory architecture is the real bottleneck
If you’re building local AI tools, this is something you’ll hit sooner or later.
Better to understand it early.
Stay tuned to my blog & GitHub for my latest open-source releases (bonfire).




