Why the Same LLM Uses 100% GPU on Mac but 80% CPU on Windows

Updated April 23, 2026

I didn’t start with a grand theory about hardware architecture. I just wanted to run a model locally.

Instead, I ended up debugging CPU vs GPU usage, WSL memory limits, VRAM bottlenecks, and Apple’s unified memory—and discovered something most local LLM guides don’t explain properly.

This is that story.

The Setup

Here’s what I was running:

Windows machine
32GB RAM
RTX 2060 (6GB VRAM)
Using Ollama
Model: qwen3.6 (~27GB)

I expected GPU usage to be high.

Instead, I saw this:

ollama ps
→ 84% CPU / 16% GPU

That didn’t make sense.

I have a GPU. Why is the CPU doing most of the work?

The Surprise

Then I saw the same model running on a Mac with Apple M3 Pro:

→ 100% GPU

Same model. Same tool. Completely different behavior.

So the obvious question:

Is Mac just better? Or am I doing something wrong?

The Debugging Journey

1. VRAM Bottleneck (The Real Culprit)

My GPU:

6GB VRAM

Model:

~27GB

That means:

The model cannot fit into GPU memory

So what happens?

Some layers → GPU
Remaining layers → CPU

This is called partial offloading.

Result:

GPU holds only the small share of layers that fit in 6GB
CPU does most of the heavy lifting
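
To make partial offloading concrete, here is a rough back-of-the-envelope sketch. It assumes every layer is the same size and that some VRAM is held back for the KV cache and runtime buffers. The layer count (64) and the 1.5GB reserve are illustrative assumptions, not values read from Ollama or from the model.

```python
def split_layers(model_gb: float, n_layers: int, vram_gb: float,
                 reserved_gb: float = 1.5) -> tuple[int, int]:
    """Estimate how many layers fit on the GPU and how many stay on the CPU.

    Assumes equally sized layers; reserved_gb is a hypothetical allowance
    for the KV cache and runtime buffers.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    return gpu_layers, n_layers - gpu_layers


# 27GB model, assumed 64 layers, 6GB of VRAM
gpu, cpu = split_layers(27, 64, 6)
print(f"{gpu} layers on GPU, {cpu} layers on CPU")
# prints: 10 layers on GPU, 54 layers on CPU
```

With these made-up inputs most layers land on the CPU, which is the same shape as the split that ollama ps reported.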

2. The WSL Memory Trap

Then I tried running inside Windows Subsystem for Linux.

And hit this:

model requires more system memory (20.6 GiB) than is available (18.7 GiB)

Wait… I have 32GB RAM.

Turns out:

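WSL 2 runs Linux in a lightweight VM that doesn’t get the full system memory by default
It silently caps the RAM available to that VM

Fix (in %UserProfile%\.wslconfig, then run wsl --shutdown so the new limit applies):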

[wsl2]
memory=28GB
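
One way to confirm the limit took effect is to check how much memory the Linux side actually sees. A minimal sketch that parses the MemTotal line of /proc/meminfo; the sample text below is hypothetical, and inside WSL you would pass the real file contents instead.

```python
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    # /proc/meminfo labels the unit "kB", but the value is in KiB
    return int(match.group(1)) / (1024 * 1024)


# Hypothetical sample; not taken from my machine
SAMPLE = "MemTotal:       16384000 kB\nMemFree:         1234567 kB\n"
print(f"{mem_total_gib(SAMPLE):.1f} GiB visible to Linux")

# Inside WSL:
# print(mem_total_gib(open("/proc/meminfo").read()))
```

If the number is still well below the value set in .wslconfig, the VM was probably not restarted.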

Lesson:

Even your RAM isn’t fully usable unless you configure it.

3. Windows vs Linux Overhead

Even after fixing memory:

Windows → higher CPU usage
Linux/WSL → slightly better GPU usage

Why? The likely suspects:

Driver overhead
Memory management differences

But this still didn’t explain the Mac behavior.

The Breakthrough: Unified Memory

This is where everything clicked.

Apple Silicon (like Apple M3 Pro) uses:

👉 Unified Memory Architecture (UMA)

Meaning:

CPU and GPU share the same memory pool
No separate VRAM
No copying of weights between system RAM and VRAM

Compare that to my PC:

Component	Memory
CPU	32GB RAM
GPU	6GB VRAM

They are separate pools.

What this means in practice

On Mac:

Entire model sits in unified memory
GPU can access all of it
No splitting

Result:

100% GPU

On my PC:

Only 6GB fits in GPU
Rest spills to CPU
Constant data movement

Result:

84% CPU / 16% GPU

The Truth About “100% GPU”

This is the most misunderstood part.

“100% GPU” on Mac does NOT mean it’s faster.

It means:

The GPU has access to the entire model

Not: That it’s doing all computation faster than a high-end NVIDIA GPU

Important distinction

Metric	What it means
GPU % (in ollama ps)	How much of the model is loaded on the GPU, and so where computation happens
Tokens/sec	Actual performance

Mac:

High GPU usage
Moderate throughput

High-end NVIDIA:

Lower % (sometimes)
Higher throughput
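
To compare machines on throughput rather than on the GPU percentage, tokens per second can be computed from the timing fields Ollama returns. A minimal sketch, assuming the eval_count and eval_duration fields (duration in nanoseconds) from the response of Ollama's /api/generate endpoint. The sample values are hypothetical, not measurements from either machine.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation throughput from an Ollama-style response dict.

    Expects eval_count (generated tokens) and eval_duration (nanoseconds).
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical response fragment: 120 tokens generated in 30 seconds
sample = {"eval_count": 120, "eval_duration": 30_000_000_000}
print(tokens_per_second(sample))  # 4.0
```

Running the same prompt on both machines and comparing this number says more than the CPU/GPU split does.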

Real-World Behavior

Here’s what actually matters:

Scenario	Mac (Unified Memory)	PC (6GB GPU)
Large model (27GB)	Smooth	CPU-heavy
Small model (7B)	Smooth	Smooth
GPU share of the 27GB model	100%	16%
Setup friction	Low	High

Practical Lessons

1. VRAM > GPU power (for local LLMs)

A fast GPU with low VRAM is still a bottleneck.

2. Don’t run oversized models

Just because you can load a model doesn’t mean you should.
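
A quick size estimate helps decide whether a model is oversized before downloading it. A rough sketch: weight memory is approximately parameter count times bits per weight. It ignores the KV cache, activations, and quantization metadata, so treat the result as a lower bound. The function names and example figures are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def weights_fit(params_billion: float, bits_per_weight: float,
                budget_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weights_gb(params_billion, bits_per_weight) <= budget_gb


print(weights_gb(7, 4))       # 3.5
print(weights_fit(7, 4, 6))   # True: a 4-bit 7B model can fit in 6GB of VRAM
print(weights_fit(7, 16, 6))  # False: the same model at 16-bit is about 14GB
```

A result close to the budget should count as "does not fit", since the context window needs memory too.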

3. WSL needs tuning

Without .wslconfig, you’re not using your full system.

4. Memory architecture matters more than OS

This is not:

Mac vs Windows

This is:

Unified memory vs split memory

Bigger Takeaway

The real insight from all this:

This isn’t a software problem. It’s a hardware architecture problem.

Final Thought

When running local LLMs, we tend to focus on:

Model size
GPU specs
Frameworks

But the actual bottleneck is often:

How memory is structured and accessed

Where This Leads (for me)

This debugging journey pushed me to rethink how I’m building my local coding assistant (bonfire):

Instead of:

Forcing large models

I’m moving toward:

Smaller, efficient models
Better memory utilization
Hybrid architectures (local + cloud)
Using inference engines like llama.cpp directly

TL;DR

Mac shows 100% GPU because of unified memory
My Windows PC struggles because of its 6GB VRAM limit, not because of Windows
WSL can hide available RAM
“GPU usage” ≠ “performance”
Memory architecture is the real bottleneck

If you’re building local AI tools, this is something you’ll hit sooner or later.

Better to understand it early.

Stay tuned to my blog and GitHub for my latest open source releases (bonfire).
