Fixing Gemma 4 OOM (Out of Memory) Errors: A Troubleshooting Guide

The Symptoms

Many users with 24GB GPUs report the following crash sequence:

1. The model loads successfully into VRAM (taking about 18-20GB for Q4_K_M).
2. The first short prompt works fine.
3. A long document is pasted, or the chat history grows.
4. The inference engine crashes with "CUDA error: out of memory", or Ollama abruptly stops responding.

Root Cause Analysis: The Hidden Cost of KV Cache

While the weights of a quantized 31B model fit within ~19GB, the attention mechanism needs additional memory to store the keys and values for every token in the context (the KV cache). This cache grows linearly with context length. If you leave the context window at a default such as 8k, or push it to 32k without a memory limit, the KV cache can quickly consume the remaining 4-5GB of your 24GB VRAM and cause a hard crash.
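To make the linear scaling concrete, here is a rough sketch of the usual KV cache estimate for a standard full-attention transformer. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Gemma 4's actual configuration; read the real values from the model's config before relying on the numbers. Models that use sliding-window attention or a quantized KV cache will need less than this estimate.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size in bytes for a full-attention transformer.

    bytes_per_value=2 assumes an fp16/bf16 cache.
    """
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Hypothetical architecture values for illustration only (NOT Gemma 4's real config).
LAYERS, KV_HEADS, HEAD_DIM = 60, 16, 128

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB")
```

Doubling the context doubles the cache, so the estimate for 8192 tokens is exactly twice the estimate for 4096.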

Fix 1: Constraining Context in Ollama

If you are using Ollama, you need to explicitly limit the context size (num_ctx) when running the model or within your API calls.

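# Start an interactive session, then set the context size at the prompt
# (ollama run does not accept a --num_ctx flag)
ollama run gemma4:31b
>>> /set parameter num_ctx 4096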

If you are using a custom Modelfile, add this parameter:

FROM gemma4:31b
PARAMETER num_ctx 4096
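For API calls, the same limit can be passed per request through the options field. The following is a minimal sketch using only the Python standard library; it assumes Ollama is listening on its default local address, and the helper names build_generate_request and generate are illustrative, not part of any Ollama client library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="gemma4:31b", num_ctx=4096):
    """Build a non-streaming /api/generate request that caps the context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context limit
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Summarize this paragraph: ...", num_ctx=4096) returns the model's reply as a string.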

Fix 2: Managing GPU Memory Utilization in vLLM

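If you are deploying via vLLM for higher throughput, note that vLLM reserves 90% of GPU memory upfront by default (--gpu-memory-utilization 0.9). For Gemma 4 31B on a 24GB card, this aggressive reservation often collides with memory already held by other processes on the same GPU, such as the desktop environment.

Explicitly lower --gpu-memory-utilization, and cap --max-model-len so the smaller KV cache budget is still sufficient: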

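# Replace the model ID with the AWQ-quantized checkpoint you are serving
# (this guide targets the 31B model); --quantization awq requires AWQ weights.
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-27b-it \
  --quantization awq \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096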
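Lowering the utilization fraction also shrinks what is left for the KV cache, which is why the command caps the context length as well. The trade-off can be sketched with a rough budget calculation. The overhead figure and the per-token cache cost are assumptions for illustration, and the function names are hypothetical; vLLM does its own profiling at startup and reports the actual KV cache capacity in its logs.

```python
def kv_budget_gib(total_vram_gib, gpu_memory_utilization, weights_gib, overhead_gib=0.5):
    """Rough memory left for the KV cache inside the share of the GPU that is reserved.

    overhead_gib is an assumed allowance for activations and runtime buffers.
    """
    reserved = total_vram_gib * gpu_memory_utilization
    return reserved - weights_gib - overhead_gib


def max_context_tokens(budget_gib, bytes_per_token):
    """Number of tokens whose keys and values fit in the given budget."""
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 2**30 // bytes_per_token)


# Figures from the text: 24GB card, ~19GB of weights.
for util in (0.90, 0.85):
    budget = kv_budget_gib(24, util, weights_gib=19)
    # Assumption: 128 KiB of KV cache per token (hypothetical, model-dependent).
    tokens = max_context_tokens(budget, bytes_per_token=128 * 1024)
    print(f"utilization {util:.2f}: ~{budget:.1f} GiB for KV cache, ~{tokens} tokens")
```

If the estimated token count falls below the value passed to --max-model-len, expect to reduce the context length further or choose a smaller quantization.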
