
Fixing Gemma 4 OOM (Out of Memory) Errors on 24GB GPUs

Quick answer: If you are getting a CUDA Out of Memory error when loading Gemma 4 27B or 31B on an RTX 3090/4090, the model weights themselves are rarely the problem. The usual cause is an unconstrained KV cache or an overly ambitious context length setting in Ollama or vLLM.

Context: This guide aggregates real-world fixes from developers on X (Twitter) and Reddit who successfully squeezed Gemma 4 31B into 24GB of VRAM by tuning inference parameters.

The Symptoms

Many users with 24GB GPUs report the following crash sequence:

  1. The model loads successfully into VRAM (taking about 18-20GB for Q4_K_M).
  2. The first short prompt works fine.
  3. A long document is pasted, or the chat history grows.
  4. The inference engine crashes with "CUDA error: out of memory", or Ollama abruptly stops responding.
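To confirm that the crash lines up with VRAM filling up, it can help to log memory usage while the context grows. The following is a minimal sketch, assuming an NVIDIA driver with `nvidia-smi` on the PATH; the helper names `parse_memory_line` and `gpu_memory` are hypothetical and chosen for this example only.

```python
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) from nvidia-smi."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def gpu_memory():
    """Return a list of (used_mib, total_mib) tuples, one per visible GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]
```

Calling `gpu_memory()` before and after pasting a long document shows how close the card is to its 24GB limit when the crash occurs.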

Root Cause Analysis: The Hidden Cost of KV Cache

While the weights of a quantized 31B model fit in about 19GB, the attention mechanism also needs memory to store the keys and values for every token in the context (the KV cache). This cache grows linearly with context length. If you leave the context window at the default 8k, or push it to 32k without memory limits, the KV cache can consume the remaining 4-5GB of your 24GB of VRAM and cause a hard crash.
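The linear growth can be made concrete with a back-of-the-envelope estimate. This is a rough sketch that assumes full attention in every layer and a 16-bit cache. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Gemma 4's actual configuration, and models that use sliding-window attention or a quantized KV cache will need less.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size for one sequence, assuming full attention in every layer."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical architecture values, for illustration only (NOT Gemma 4's real config).
LAYERS, KV_HEADS, HEAD_DIM = 60, 16, 128

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB")
```

With these placeholder numbers, the cache needs just under 2 GiB at 4k tokens, 3.75 GiB at 8k, and 15 GiB at 32k. Doubling the context doubles the cache, which is why a setting that works for short prompts can fail once the history grows.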

Fix 1: Constraining Context in Ollama

If you are using Ollama, you need to explicitly limit the context size (num_ctx) when running the model or within your API calls.

# Start the model, then cap the context from inside the interactive session
ollama run gemma4:31b
>>> /set parameter num_ctx 4096

If you are using a custom Modelfile, add this parameter:

FROM gemma4:31b
PARAMETER num_ctx 4096
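For the API route, the same limit can be passed per request through the `options` field. The following is a minimal sketch using only the Python standard library, assuming Ollama is listening on its default local port (11434) and that the `gemma4:31b` tag from above is installed; `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(prompt, num_ctx=4096, model="gemma4:31b"):
    """Build a non-streaming generate payload that caps the context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt, num_ctx=4096):
    """Send the prompt to a local Ollama server and return the generated text."""
    body = json.dumps(build_request(prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this document: ...")` then runs the request with a 4096-token window regardless of the model's default.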

Fix 2: Managing GPU Memory Utilization in vLLM

If you are deploying via vLLM for higher throughput, note that vLLM reserves 90% of the GPU memory upfront by default. For Gemma 4 31B on a 24GB card, this aggressive reservation often collides with VRAM already in use by the desktop environment and other processes.

You must explicitly lower the --gpu-memory-utilization value and cap the context with --max-model-len:

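# --quantization awq expects an AWQ-quantized checkpoint; point --model at the one you actually use
python -m vllm.entrypoints.openai.api_server \
  --model google/gemma-4-27b-it \
  --quantization awq \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096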
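The trade-off behind these two flags can be sketched with simple arithmetic. This is a rough estimate, assuming the ~19GB weight figure quoted earlier (an AWQ checkpoint may differ) and ignoring activation memory unless `overhead_gib` is set; `vllm_kv_budget_gib` is a hypothetical helper, not part of vLLM.

```python
def vllm_kv_budget_gib(total_vram_gib, gpu_memory_utilization, weights_gib, overhead_gib=0.0):
    """Rough memory left for the KV cache inside the pool vLLM reserves."""
    reserved = total_vram_gib * gpu_memory_utilization
    return reserved - weights_gib - overhead_gib


for util in (0.90, 0.85):
    kv = vllm_kv_budget_gib(24.0, util, weights_gib=19.0)
    outside = 24.0 * (1 - util)
    print(f"utilization {util:.2f}: ~{kv:.1f} GiB for KV cache, ~{outside:.1f} GiB left outside vLLM")
```

Under these assumptions, dropping from 0.90 to 0.85 frees about 1.2 GiB for everything outside vLLM but shrinks the KV cache budget from roughly 2.6 GiB to 1.4 GiB, which is why the lower utilization is paired with a short --max-model-len.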
