Why the Same LLM Uses 100% GPU on Mac but 80% CPU on Windows


I didn’t start with a grand theory about hardware architecture. I just wanted to run a model locally.

Instead, I ended up debugging CPU vs GPU usage, WSL memory limits, VRAM bottlenecks, and Apple’s unified memory, and along the way discovered something most local LLM guides don’t explain properly.

This is that story.


The Setup

Here’s what I was running:

  • Windows machine

  • 32GB RAM

  • RTX 2060 (6GB VRAM)

  • Using Ollama

  • Model: qwen3.6 (~27GB)

I expected GPU usage to be high.

Instead, I saw this:

ollama ps
→ 84% CPU / 16% GPU

That didn’t make sense.

I have a GPU. Why is the CPU doing most of the work?


The Surprise

Then I saw the same model running on a Mac with Apple M3 Pro:

→ 100% GPU

Same model. Same tool. Completely different behavior.

So the obvious question:

Is Mac just better? Or am I doing something wrong?


The Debugging Journey

1. VRAM Bottleneck (The Real Culprit)

My GPU:

  • 6GB VRAM

Model:

  • ~27GB

That means:

  • The model cannot fit into GPU memory

So what happens?

  • Some layers → GPU

  • Remaining layers → CPU

This is called partial offloading.

Result:

  • GPU is underutilized

  • CPU does most of the heavy lifting
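
A rough way to see why the split comes out so lopsided is to estimate how many layers fit in VRAM. The sketch below is only an approximation: it assumes equally sized layers, a guessed layer count of 64, and a hypothetical 1.5 GB reserve for the KV cache and runtime buffers. None of those numbers come from Ollama itself.

```python
def estimate_offload(model_gb, vram_gb, n_layers, reserve_gb=1.5):
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is a hypothetical allowance for the KV cache, the CUDA
    context and scratch buffers; real runtimes compute this differently.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    return gpu_layers, gpu_layers / n_layers


# Assumed inputs: 27 GB model, 6 GB VRAM, 64 layers (the layer count is a guess).
gpu_layers, gpu_share = estimate_offload(model_gb=27, vram_gb=6, n_layers=64)
print(f"{gpu_layers} of 64 layers on GPU")
print(f"{round((1 - gpu_share) * 100)}% CPU / {round(gpu_share * 100)}% GPU")
```

With these assumed inputs the estimate lands close to the split reported above, but that agreement depends entirely on the guessed reserve and layer count.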


2. The WSL Memory Trap

Then I tried running inside Windows Subsystem for Linux.

And hit this:

model requires more system memory (20.6 GiB) than is available (18.7 GiB)

Wait… I have 32GB RAM.

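Turns out:

  • WSL 2 doesn’t get the full system memory by default

  • It caps the RAM visible to Linux without any warning

Fix (in %UserProfile%\.wslconfig, followed by wsl --shutdown so the new limit takes effect):

[wsl2]
memory=28GB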

Lesson:

Even your RAM isn’t fully usable unless you configure it.
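
To confirm how much memory the Linux side actually sees, before and after changing the limit, one option is to read MemTotal from /proc/meminfo inside the WSL distro. This is a minimal sketch; the helper name mem_total_gib is made up for illustration.

```python
from pathlib import Path


def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the value is reported in kB (KiB)
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


meminfo = Path("/proc/meminfo")
if meminfo.exists():  # only present on Linux, including WSL 2 distros
    total = mem_total_gib(meminfo.read_text())
    print(f"Memory visible to this environment: {total:.1f} GiB")
```

If the printed value is well below the installed RAM, the cap is still in place.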


3. Windows vs Linux Overhead

Even after fixing memory:

  • Windows → higher CPU usage

  • Linux/WSL → slightly better GPU usage

Likely reasons:

  • Driver overhead

  • Memory management differences

But this still didn’t explain the Mac behavior.


The Breakthrough: Unified Memory

This is where everything clicked.

Apple Silicon (like Apple M3 Pro) uses:

👉 Unified Memory Architecture (UMA)

Meaning:

  • CPU and GPU share the same memory pool

  • No separate VRAM

  • No copying of model weights between RAM and VRAM


Compare that to my PC:

| Component | Memory |
| --- | --- |
| CPU | 32GB RAM |
| GPU | 6GB VRAM |

They are separate pools.


What this means in practice

On Mac:

  • Entire model sits in unified memory

  • GPU can access all of it

  • No splitting

Result:

100% GPU

On my PC:

  • Only 6GB fits in GPU

  • Rest spills to CPU

  • Intermediate results cross the PCIe bus between the GPU layers and the CPU layers on every token

Result:

84% CPU / 16% GPU

The Truth About “100% GPU”

This is the most misunderstood part.

“100% GPU” on Mac does NOT mean it’s faster.

It means:

The GPU has access to the entire model

Not: That it’s doing all computation faster than a high-end NVIDIA GPU
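
It helps to read the ollama ps figure as a placement split (how much of the model sits in GPU memory versus system RAM) rather than as a live utilization gauge. The sketch below parses that value into numbers. It assumes the column looks like `100% GPU` or `84%/16% CPU/GPU`; the function names are hypothetical, and the format may differ between Ollama versions.

```python
import re


def parse_processor(column):
    """Parse a PROCESSOR value into {'cpu': pct, 'gpu': pct}.

    Assumed formats: '100% GPU', '100% CPU', or '84%/16% CPU/GPU'.
    """
    column = column.strip()
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", column)
    if single:
        pct, device = int(single.group(1)), single.group(2).lower()
        other = "gpu" if device == "cpu" else "cpu"
        return {device: pct, other: 100 - pct}
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", column)
    if split:
        return {"cpu": int(split.group(1)), "gpu": int(split.group(2))}
    raise ValueError(f"unrecognized PROCESSOR value: {column!r}")


def describe(column, model_gb):
    """Translate the split into approximate gigabytes per memory pool."""
    share = parse_processor(column)
    gpu_gb = model_gb * share["gpu"] / 100
    cpu_gb = model_gb * share["cpu"] / 100
    return f"~{gpu_gb:.1f} GB in GPU memory, ~{cpu_gb:.1f} GB in system RAM"


print(describe("84%/16% CPU/GPU", 27))
print(describe("100% GPU", 27))
```

Neither line says anything about speed; it only shows where the weights ended up.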


Important distinction

| Metric | What it means |
| --- | --- |
| GPU % | Where the model is loaded and its layers run |
| Tokens/sec | Actual performance |

Mac:

  • High GPU usage

  • Moderate throughput

High-end NVIDIA:

  • Lower % (sometimes)

  • Higher throughput
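
To compare machines on throughput rather than on GPU %, tokens/sec can be derived from the timing fields Ollama returns with a generation response: eval_count (generated tokens) and eval_duration (nanoseconds). The sketch below works on a plain dict so it runs without a server; the sample numbers are made up and are not measurements from either machine.

```python
def tokens_per_second(response):
    """Generation speed from an Ollama /api/generate-style response dict.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] * 1e9 / duration_ns


# Made-up sample values for illustration only.
sample = {"eval_count": 120, "eval_duration": 8_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")
```

Running the same prompt on both machines and comparing this number gives a fairer picture than the CPU/GPU split does.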


Real-World Behavior

Here’s what actually matters:

| Scenario | Mac (Unified Memory) | PC (6GB GPU) |
| --- | --- | --- |
| Large model (27GB) | Smooth | CPU-heavy |
| Small model (7B) | Smooth | Smooth |
| GPU utilization | High | Low |
| Setup friction | Low | High |

Practical Lessons

1. VRAM > GPU power (for local LLMs)

A fast GPU with low VRAM is still a bottleneck.


2. Don’t run oversized models

Just because you can load a model doesn’t mean you should.
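
A quick back-of-the-envelope check before downloading a model is to estimate the size of its weights as parameters × bits per weight ÷ 8. This is a simplification that ignores the KV cache, quantization metadata and runtime buffers, so treat the result as a lower bound; the function name is made up for illustration.

```python
def approx_weights_gb(params_billion, bits_per_weight):
    """Rough size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 6  # the RTX 2060 from the setup above

for bits in (16, 8, 4):
    size = approx_weights_gb(7, bits)
    verdict = "may fit" if size < VRAM_GB else "will not fit"
    print(f"7B at {bits}-bit: ~{size:.1f} GB -> {verdict} in {VRAM_GB} GB VRAM")
```

By this estimate a 7B model only has a chance of fitting in 6GB once it is quantized to around 4 bits per weight.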


3. WSL needs tuning

Without a .wslconfig, WSL can’t see all of your system’s RAM.


4. Memory architecture matters more than OS

This is not:

  • Mac vs Windows

This is:

  • Unified memory vs split memory

Bigger Takeaway

The real insight from all this:

This isn’t a software problem. It’s a hardware architecture problem.


Final Thought

When running local LLMs, we tend to focus on:

  • Model size

  • GPU specs

  • Frameworks

But the actual bottleneck is often:

How memory is structured and accessed


Where This Leads (for me)

This debugging journey pushed me to rethink how I’m building my local coding assistant (bonfire):

Instead of:

  • Forcing large models

I’m moving toward:

  • Smaller, efficient models

  • Better memory utilization

  • Hybrid architectures (local + cloud)

  • Lower-level inference engines like llama.cpp (which Ollama builds on) for finer control over offloading


TL;DR

  • Mac shows 100% GPU because of unified memory

  • The Windows PC falls back to the CPU because the model doesn’t fit in 6GB of VRAM

  • WSL can hide available RAM

  • “GPU usage” ≠ “performance”

  • Memory architecture is the real bottleneck


If you’re building local AI tools, this is something you’ll hit sooner or later.

Better to understand it early.

Stay tuned to my blog and GitHub for my latest open source releases (bonfire).