Gemma 4 Config Guide: Best Settings for Your Hardware & Use Case

Quick answer: the best Gemma 4 settings for most people

For the shortest practical answer: start with Gemma 4 E4B on 8 to 12 GB GPUs and smaller Macs, and with Gemma 4 26B on 16 GB GPUs or 32 GB Macs. Move to Gemma 4 31B only when you already have 24 GB-class GPU hardware or a 64 GB+ Mac and your workload clearly benefits from the extra quality.

Hardware class	Best starting point	Why
8 GB GPU / smaller local machine	E4B	Fast, stable, and much more usable than forcing a larger model that swaps.
16 GB GPU / 32 GB Mac	26B	The most important quality jump for coding, agents, and document-heavy local work.
24 GB GPU / 64 GB+ Mac	31B or high-quality 26B	Better fit for longer context and quality-first workflows, but not required for most users.

Best settings by hardware

Use these presets as recommended starting points. They are intentionally conservative because a stable local setup is more useful than a benchmark-chasing setup that constantly swaps or reloads.

Hardware	Recommended model	Recommended context	Best-practice settings
8 GB GPU or smaller Mac	`gemma4:e4b`	4K	Start with default Ollama settings, keep context modest, and avoid parallel requests.
RTX 3060 12 GB / RTX 4070 12 GB	`gemma4:e4b`	4K to 8K	Use E4B as the daily model. Only treat 26B as a test, not the default.
16 GB GPU or Mac 32 GB	`gemma4:26b`	8K	Best upgrade tier for coding, agents, and document-heavy work.
24 GB GPU or Mac 64 GB+	`gemma4:31b`	8K to 16K	Use when you actually want the larger quality tier, not just because the hardware can load it.
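
If you script your setup, the presets in the table can be encoded as a small lookup. The following is a minimal sketch, not part of any tool: the function name `recommend_gemma4`, the `Preset` type, and the exact thresholds are illustrative assumptions read off the table, and the context values are the conservative lower bounds.

```python
from typing import NamedTuple


class Preset(NamedTuple):
    model: str    # Ollama tag from the table
    context: int  # conservative starting context, in tokens


def recommend_gemma4(memory_gb: float, is_mac: bool = False) -> Preset:
    """Map available memory to the conservative presets in the table.

    memory_gb is VRAM for a discrete GPU, or unified memory for a Mac.
    The thresholds are assumptions taken from the table, not measured limits.
    """
    if is_mac:
        if memory_gb >= 64:
            return Preset("gemma4:31b", 8192)
        if memory_gb >= 32:
            return Preset("gemma4:26b", 8192)
        return Preset("gemma4:e4b", 4096)
    if memory_gb >= 24:
        return Preset("gemma4:31b", 8192)
    if memory_gb >= 16:
        return Preset("gemma4:26b", 8192)
    return Preset("gemma4:e4b", 4096)
```

For example, `recommend_gemma4(12)` returns the E4B preset at 4K, and `recommend_gemma4(32, is_mac=True)` returns the 26B preset at 8K.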

Mac 32 GB

ollama pull gemma4:26b
export OLLAMA_NUM_GPU=99
export OLLAMA_KEEP_ALIVE=-1
export OLLAMA_NUM_PARALLEL=1
ollama run gemma4:26b

Recommended settings: 8K context, single parallel request, and as few memory-heavy background apps as possible.

RTX 3060 12 GB / RTX 4070 12 GB

ollama pull gemma4:e4b
ollama run gemma4:e4b

Recommended settings: keep context around 4K for everyday use. If you need quality more than speed, move up to 26B only after confirming the machine still has memory headroom.

RTX 4080 16 GB

ollama pull gemma4:26b
ollama run gemma4:26b

Recommended settings: 8K context is the clean default. This is the first mainstream GPU tier where 26B starts feeling normal instead of forced.

RTX 4090 24 GB / Mac 64 GB+

ollama pull gemma4:31b
ollama run gemma4:31b

Recommended settings: start at 8K context, then move higher only if your workload really benefits. A larger model with a realistic context window is usually better than an oversized context that hurts latency.

Recommended Ollama settings that actually matter

Ollama's defaults are conservative. These are the variables worth setting, and what they do:

Variable	Value	Effect
`OLLAMA_NUM_GPU`	`99`	Intended to offload all possible layers to GPU/Metal; if layers run on CPU, generation slows significantly. Ollama documents `num_gpu` as a per-model option, so confirm that your version honors the environment variable.
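`OLLAMA_KEEP_ALIVE`	`-1`	Keep the model loaded indefinitely. The default is 5 minutes; after that, Ollama unloads the model and reloads it on the next request (15–30s for 26B). Set it to `-1` (any negative value works) if you're using it as a dev server. A value of `0` does the opposite and unloads the model right after each response.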
`OLLAMA_NUM_PARALLEL`	`1`	Number of parallel requests. Default is 1. Increase only if you're running multiple clients simultaneously — single-user setups are fastest at 1.
`OLLAMA_HOST`	`0.0.0.0:11434`	Expose Ollama to other machines on your network. Default binds to localhost only.

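On Mac, set these via launchctl so the Ollama menu bar app picks them up. Values set this way survive restarting the app but not a reboot, so reapply them after restarting the machine: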

launchctl setenv OLLAMA_NUM_GPU 99
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
# Then restart Ollama from the menu bar

On Linux/Windows, add them to your shell profile or set them before running ollama serve.
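
If you start the server from a script rather than a shell profile, the same variables can be passed to the process directly. This is a minimal sketch, assuming `ollama` is on your PATH. The helper names `build_ollama_env` and `start_server` are illustrative, and the sketch only sets the three variables whose behavior is described in the table with certainty.

```python
import os
import subprocess


def build_ollama_env(expose_on_lan: bool = False) -> dict:
    """Return a copy of the current environment with server settings added."""
    env = dict(os.environ)
    env["OLLAMA_KEEP_ALIVE"] = "-1"   # negative value: keep the model loaded
    env["OLLAMA_NUM_PARALLEL"] = "1"  # single-user setup
    if expose_on_lan:
        env["OLLAMA_HOST"] = "0.0.0.0:11434"  # listen on all interfaces
    return env


def start_server(expose_on_lan: bool = False) -> subprocess.Popen:
    """Start `ollama serve` with those settings (assumes ollama is installed)."""
    return subprocess.Popen(["ollama", "serve"],
                            env=build_ollama_env(expose_on_lan))
```

Building the environment separately from launching the process makes the settings easy to inspect or log before the server starts.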

Recommended context size

Ollama's default context window is 2048 tokens. That's too small for most real workflows. A coding session with a few files in context, a long document, or a multi-turn conversation will hit this limit quickly.

Create a Modelfile to set a larger default:

# Create Modelfile
cat << 'EOF' > Modelfile
FROM gemma4:26b
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.0
EOF

# Build your custom model
ollama create gemma4-26b-8k -f Modelfile

# Run it
ollama run gemma4-26b-8k
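
If you would rather not maintain a custom model, the same parameters can be sent per request through the `options` field of Ollama's `/api/chat` endpoint. The sketch below uses only the standard library. The helper names are illustrative, and it assumes a server listening on the default `localhost:11434`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint


def build_chat_payload(prompt: str, model: str = "gemma4:26b",
                       num_ctx: int = 8192) -> dict:
    """Build a non-streaming chat request mirroring the Modelfile parameters."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "num_ctx": num_ctx,
            "temperature": 0.7,
            "repeat_penalty": 1.0,
        },
    }


def chat(prompt: str) -> str:
    """Send the request and return the reply text (needs a running server)."""
    data = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

Per-request options are convenient for experiments. A Modelfile is still the better fit once you have settled on a context size, because every client then gets it by default.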

Recommended context sizes by hardware:

Hardware	Model	Safe context	Maximum before swapping
8 GB GPU	E4B	4K	8K
12 GB GPU	26B Q5	4K–8K	16K (tight)
24 GB GPU	26B Q8	16K	32K
Mac 32 GB	26B Q5	8K–16K	32K (watch memory pressure)
Mac 64 GB+	31B Q8	32K	128K (model max for larger variants)

Warning on KV cache: at very long contexts (128K+), the KV cache alone can add 10–20 GB of memory on top of the model weights. Community testing shows 31B at 262K context needs ~22 GB just for KV cache. If you're pushing long context, use --cache-type-k q4_0 in llama.cpp to quantize the key cache and reduce that memory.
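
To see why the cache grows this way, here is a back-of-the-envelope estimate. It uses the standard formula for full attention: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The architecture numbers in the example are placeholders, not Gemma 4's published configuration. The formula also treats every layer as full attention, so it overestimates for models that use sliding-window layers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """Estimate KV cache size, assuming full attention in every layer."""
    # K and V each store n_kv_heads * head_dim values per token per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens


GIB = 1024 ** 3

# Placeholder architecture, NOT Gemma 4's real configuration.
example = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                         context_tokens=8192)
print(f"{example / GIB:.2f} GiB at 8K with a 16-bit cache")  # 1.50 GiB
```

The estimate scales linearly with context, so the same placeholder model needs 16 times as much at 128K. Passing `bytes_per_element=0.5625` approximates a `q4_0` cache, which stores 32 values in 18 bytes. The function applies that size to both keys and values.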

LM Studio and GGUF settings

LM Studio is a good choice if you want a GUI and don't want to touch a terminal. A few settings worth knowing:

Setting	Where	Recommended value	Why
GPU Offload	Model load dialog	Max / Auto	Same as `OLLAMA_NUM_GPU=99` — offloads as many layers as possible to GPU
Context Length	Model load dialog	4096–8192	Default 2048 is too small; increase based on your VRAM headroom
Flash Attention	Advanced settings	On	Reduces memory and improves speed for long contexts — enable it if your model version supports it
Keep Model in Memory	Server settings	On	Prevents the 15–30s reload delay between requests
Repeat Penalty	Inference params	1.0 (disabled)	Google's official recommendation for Gemma 4 is to leave repeat penalty disabled

For model selection in LM Studio, search for Gemma 4 GGUF builds and choose Q4_K_M when your first goal is fitting the model cleanly, or a Q5 build when the machine still has real headroom. That quantization choice belongs to LM Studio and llama.cpp. In Ollama, the safer habit is still to choose the right model tier first.

Recommended settings by use case

Benchmarks tell you what a model can do in theory. This section tells you what it feels like in daily use across four common tasks.

Coding assistance

Best setup: 26B or 31B, 8K+ context, thinking mode off for quick edits and on for harder problems.

For routine coding — autocomplete, explaining a function, simple refactors — E4B is genuinely good enough and much faster. The step up to 26B A4B becomes noticeable when you're debugging something non-obvious, reasoning about architectural tradeoffs, or working across multiple files. The 31B's advantage over 26B A4B is smaller than you'd expect for coding specifically; most people can't tell the difference on typical tasks.

One thing that's different from Gemma 3: function calling is reliable now. If you're using Gemma 4 with a coding agent that makes tool calls (read file, write file, run tests), it actually follows the schema consistently without requiring special prompt engineering.

Local agents and automation

Best setup: 26B, thinking mode on, context 8K+, repeat_penalty disabled.

This is where Gemma 4 shows the biggest improvement over its predecessor. Multi-step agentic tasks — plan a task, call tools, check results, adjust — were unreliable with Gemma 3. With Gemma 4 26B A4B, they work. The native function calling means the model returns valid JSON tool calls without needing to be coaxed with elaborate system prompts.

Expect the 26B tier to feel meaningfully slower than E4B on constrained hardware, but much more capable for multi-step work. On stronger Macs and 16 GB+ GPUs, the tradeoff becomes much easier to justify.
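
To make the tool-calling loop concrete, here is a minimal sketch of the application side: a tool schema in the JSON-schema style that Ollama's `tools` field accepts, plus a dispatcher for the `tool_calls` the model returns. The `read_file` tool, the `TOOLS` registry, and `run_tool_calls` are illustrative names. The sketch assumes the response shape of Ollama's `/api/chat`, where `arguments` arrives as an already-parsed object.

```python
# Tool schema to pass in the `tools` list of a chat request.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as handle:
        return handle.read()


TOOLS = {"read_file": read_file}  # tool name -> Python callable


def run_tool_calls(message: dict) -> list:
    """Execute each tool call in an assistant message; return tool messages."""
    results = []
    for call in message.get("tool_calls", []):
        function = call["function"]
        handler = TOOLS.get(function["name"])
        if handler is None:
            output = f"unknown tool: {function['name']}"
        else:
            output = handler(**function["arguments"])  # arguments is a dict
        results.append({"role": "tool", "content": str(output)})
    return results
```

In an agent loop you would append the assistant message and the returned tool messages to the conversation, then call the model again until it stops requesting tools. Reporting unknown tools back to the model, rather than raising, gives it a chance to correct itself.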

Chat and document Q&A

Best setup: E4B for fast back-and-forth, 26B for document-heavy work where you need more nuance.

For general chat, E4B is genuinely excellent and the speed makes it feel more like a conversation. You lose some nuance on complex questions, but for everyday use you won't miss the 26B. For document Q&A, especially longer documents where the model has to synthesize information from different parts, the 26B's 256K context and better long-range recall matter. Set context to at least 8K for document work; the default 2K will truncate everything interesting.
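
One way to choose a context size for a given document is to estimate its token count and pick the smallest window that still leaves room for the reply. The sketch below is a rough heuristic. The figure of 4 characters per token is an assumption (real counts depend on the tokenizer and the language), and the helper names and the list of sizes are illustrative.

```python
CONTEXT_STEPS = (4096, 8192, 16384, 32768)  # sizes discussed in this guide


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real counts depend on the tokenizer."""
    return int(len(text) / chars_per_token) + 1


def pick_num_ctx(document: str, reply_budget: int = 1024):
    """Smallest listed context that fits the document plus a reply, else None."""
    needed = estimate_tokens(document) + reply_budget
    for size in CONTEXT_STEPS:
        if needed <= size:
            return size
    return None  # too long: split the document or retrieve chunks instead
```

A `None` result is the signal to switch from pasting the whole document to a retrieval setup.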

Local RAG setup

Best setup: 26B with 8K context, temperature 0.3, thinking mode off.

A minimal working RAG stack with Gemma 4 via Ollama:

# Install dependencies first (run in a shell):
#   pip install ollama chromadb

# Python snippet: embed documents and query
import ollama
import chromadb

# Initialize an in-memory ChromaDB collection
client = chromadb.Client()
collection = client.get_or_create_collection("docs")

# Add documents (simplified)
collection.add(
    documents=["your document text here"],
    ids=["doc1"]
)

# Retrieve the chunks closest to the question
question = "your question here"
results = collection.query(query_texts=[question], n_results=3)
context = "\n".join(results["documents"][0])

# Ask Gemma 4 with the recommended RAG settings (temperature 0.3, 8K context)
response = ollama.chat(
    model="gemma4:26b",
    messages=[{
        "role": "system",
        "content": "Answer using only the provided context. Say 'not found' if unsupported."
    }, {
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}"
    }],
    options={"temperature": 0.3, "num_ctx": 8192}
)
print(response["message"]["content"])

Key RAG-specific settings: lower temperature (0.2–0.4) reduces hallucination. Explicit grounding instructions in the system prompt ("answer only from context, say not found if unsupported") matter more than model size for RAG accuracy. E4B works well for RAG if your chunks are short; 26B helps when chunks are longer and need more synthesis.
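
Since chunk length affects which model tier you need, it helps to control it explicitly. Below is a minimal character-based chunker with overlap. It is a sketch, not a recommendation from Gemma or ChromaDB: the function name and the default sizes are assumptions, and production pipelines often split on sentences or tokens instead.

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list:
    """Split text into windows of max_chars that overlap by `overlap` chars."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

The resulting list can be passed as `documents` to `collection.add`, with one id per chunk, for example `ids=[f"doc1-{i}" for i in range(len(chunks))]`.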

Thinking mode: when to turn it on

Gemma 4 has a built-in chain-of-thought reasoning mode. It generates up to 4,000 tokens of internal reasoning before giving a final answer. This significantly improves performance on hard problems but slows down simple ones.

To enable thinking in Ollama, add a system prompt that starts with <|think|>:

ollama run gemma4:26b
# In the prompt, use a system message:
# System: <|think|> You are a helpful assistant...
# Or via API:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "messages": [
    {"role": "system", "content": "<|think|> Think step by step."},
    {"role": "user", "content": "Your hard question here"}
  ]
}'

Task	Thinking mode	Why
Math / logic problems	On	Dramatic quality improvement; AIME scores jump significantly
Complex debugging	On	Reasoning through multi-step failures benefits from chain of thought
Code generation (routine)	Off	Adds latency without meaningful quality gain on simple tasks
Chat / conversation	Off	Slows responses; feels unnatural for back-and-forth dialogue
RAG / document Q&A	Off	Grounding is more important than reasoning; thinking doesn't help much
Agent planning	On	Multi-step task planning benefits significantly from extended reasoning
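
If your application handles several of these task types, the table can be turned into a small switch that adds the `<|think|>` prefix only where it helps. This is a minimal sketch; the task keys and the `system_prompt` helper are illustrative names, and unknown tasks default to thinking off.

```python
THINK_TOKEN = "<|think|>"  # system-prompt prefix described in this section

# Task categories from the table, mapped to whether thinking helps.
THINKING_BY_TASK = {
    "math": True,
    "debugging": True,
    "routine_code": False,
    "chat": False,
    "rag": False,
    "agent_planning": True,
}


def system_prompt(task: str, instructions: str) -> str:
    """Prefix the system prompt with the think token when the task needs it."""
    if THINKING_BY_TASK.get(task, False):
        return f"{THINK_TOKEN} {instructions}"
    return instructions
```

For example, `system_prompt("math", "Be precise.")` returns `"<|think|> Be precise."`, while `system_prompt("chat", "Be brief.")` returns the instructions unchanged.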

FAQ

What are the best Gemma 4 settings for most people?

The best general starting point is E4B on smaller machines and 26B on 16 GB GPUs or 32 GB Macs. Those two tiers cover most practical local use without constant memory problems.

Should I use Ollama or LM Studio for Gemma 4?

Use Ollama if you want the fastest path to a stable local setup and simple API access. Use LM Studio if you prefer a GUI and want easier model management. For this site, Ollama remains the default recommendation because it keeps setup and troubleshooting simpler.

How much context should I give Gemma 4?

Start at 4K for chat and coding, then move to 8K-16K only when your workflow actually benefits from longer prompts or documents. Oversized context is one of the fastest ways to make a local setup feel worse.

When should I move from E4B to 26B or 31B?

Move up when your hardware already has headroom and your tasks justify it. If you mainly want faster responses, E4B is often enough. If you want better coding, agent, or document performance, 26B is the meaningful upgrade. Treat 31B as a quality-first option for higher-end machines.

Related guides

Gemma 4 VRAM requirements

Gemma 4 Ollama setup

Gemma 4 vs Qwen3