Gemma 4 best settings: recommended Ollama and LM Studio configs by hardware
If you searched for the best Gemma 4 settings or a recommended Gemma 4 config, start here: use gemma4:e4b on smaller machines, gemma4:26b on 16 GB GPUs or 32 GB Macs, and move to gemma4:31b only when you already have a 24 GB GPU or a 64 GB+ Mac. The goal is not fancy flags. The goal is a stable local setup with the right model, a realistic context window, and runtime defaults that do not fight your hardware.
Quick positioning: use this page after you have already checked hardware fit on Gemma 4 VRAM requirements and basic runtime install steps on Gemma 4 Ollama setup. This guide focuses on practical configuration choices, not first-time installation.
Quick answer: the best Gemma 4 settings for most people
If you want the shortest practical answer: start with Gemma 4 E4B on 8 to 12 GB GPUs and smaller Macs, and with Gemma 4 26B on 16 GB GPUs or 32 GB Macs. Move to Gemma 4 31B only when you already have a 24 GB-class GPU or a 64 GB+ Mac and your workload clearly benefits from the extra quality.
| Hardware class | Best starting point | Why |
|---|---|---|
| 8 GB GPU / smaller local machine | E4B | Fast, stable, and much more usable than forcing a larger model that swaps. |
| 16 GB GPU / 32 GB Mac | 26B | The most important quality jump for coding, agents, and document-heavy local work. |
| 24 GB GPU / 64 GB+ Mac | 31B or high-quality 26B | Better fit for longer context and quality-first workflows, but not required for most users. |
Best settings by hardware
Use these presets as recommended starting points. They are intentionally conservative because a stable local setup is more useful than a benchmark-chasing setup that constantly swaps or reloads.
| Hardware | Recommended model | Recommended context | Best-practice settings |
|---|---|---|---|
| 8 GB GPU or smaller Mac | gemma4:e4b | 4K | Start with default Ollama settings, keep context modest, and avoid parallel requests. |
| RTX 3060 12 GB / RTX 4070 12 GB | gemma4:e4b | 4K to 8K | Use E4B as the daily model. Only treat 26B as a test, not the default. |
| 16 GB GPU or Mac 32 GB | gemma4:26b | 8K | Best upgrade tier for coding, agents, and document-heavy work. |
| 24 GB GPU or Mac 64 GB+ | gemma4:31b | 8K to 16K | Use when you actually want the larger quality tier, not just because the hardware can load it. |
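The presets in the table can be folded into a small lookup helper. The sketch below is illustrative only: the names `Preset` and `pick_preset` are hypothetical, the thresholds are assumptions read off the table rather than an official sizing rule, and it returns the conservative end of each context range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    model: str    # Ollama model tag from the table
    context: int  # conservative starting context, in tokens


def pick_preset(memory_gb: float, is_mac: bool = False) -> Preset:
    """Map GPU VRAM (or Mac unified memory) to a starting preset.

    Thresholds are assumptions taken from the table: 16/24 GB for
    discrete GPUs, 32/64 GB for Macs with unified memory.
    """
    big, mid = (64, 32) if is_mac else (24, 16)
    if memory_gb >= big:
        return Preset("gemma4:31b", 8192)
    if memory_gb >= mid:
        return Preset("gemma4:26b", 8192)
    return Preset("gemma4:e4b", 4096)
```

For example, `pick_preset(12)` returns the E4B preset at 4K, while `pick_preset(32, is_mac=True)` returns 26B at 8K.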
Mac 32 GB
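ollama pull gemma4:26b
# These variables are read by the Ollama server, so set them before it starts
export OLLAMA_NUM_GPU=99
# -1 keeps the model loaded; 0 would unload it right after each response
export OLLAMA_KEEP_ALIVE=-1
export OLLAMA_NUM_PARALLEL=1
ollama run gemma4:26b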
Recommended settings: 8K context, single parallel request, and as few memory-heavy background apps as possible.
RTX 3060 12 GB / RTX 4070 12 GB
ollama pull gemma4:e4b
ollama run gemma4:e4b
Recommended settings: keep context around 4K for everyday use. If your real need is quality over speed, move up only after confirming the machine is still comfortable.
RTX 4080 16 GB
ollama pull gemma4:26b
ollama run gemma4:26b
Recommended settings: 8K context is the clean default. This is the first mainstream GPU tier where 26B starts feeling normal instead of forced.
RTX 4090 24 GB / Mac 64 GB+
ollama pull gemma4:31b
ollama run gemma4:31b
Recommended settings: start at 8K context, then move higher only if your workload really benefits. A larger model with a realistic context window is usually better than an oversized context that hurts latency.
Recommended Ollama settings that actually matter
Ollama's defaults are conservative. These are the variables worth setting, and what they do:
| Variable | Value | Effect |
|---|---|---|
| OLLAMA_NUM_GPU | 99 | Offload all possible layers to GPU/Metal. Without this on Mac, layers may run on CPU and generation slows significantly. |
| OLLAMA_KEEP_ALIVE | -1 | Keep the model loaded indefinitely. The default is 5 minutes; after that, Ollama unloads the model and reloads it on the next request (15–30s for 26B). A value of 0 does the opposite and unloads the model immediately after each response. Set it to -1 if you're using Ollama as a dev server. |
| OLLAMA_NUM_PARALLEL | 1 | Number of parallel requests. Default is 1. Increase only if you're running multiple clients simultaneously; single-user setups are fastest at 1. |
| OLLAMA_HOST | 0.0.0.0:11434 | Expose Ollama to other machines on your network. Default binds to localhost only. |
On Mac, set these via launchctl so they persist across restarts:
launchctl setenv OLLAMA_NUM_GPU 99
launchctl setenv OLLAMA_KEEP_ALIVE -1
# Then restart Ollama from the menu bar
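On Linux, where Ollama usually runs as a systemd service, add them as Environment= lines with systemctl edit ollama.service and restart the service; a shell profile only affects an ollama serve that you start by hand. On Windows, set them as user environment variables and restart Ollama.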
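If you start the server yourself rather than through the desktop app or a service manager, the same variables can be passed to that one process only. A minimal sketch, assuming `ollama` is on your PATH; the helper names `server_env` and `start_server` are hypothetical, and it sets three of the variables from the table.

```python
import os
import subprocess


def server_env(keep_alive="-1", num_parallel=1, host="127.0.0.1:11434", base=None):
    """Return a copy of the environment with Ollama server settings applied.

    base defaults to the current process environment; pass a dict to
    start from something else.
    """
    env = dict(os.environ if base is None else base)
    env["OLLAMA_KEEP_ALIVE"] = str(keep_alive)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    env["OLLAMA_HOST"] = host
    return env


def start_server(env):
    """Launch `ollama serve` as a child process with the given environment."""
    return subprocess.Popen(["ollama", "serve"], env=env)
```

Calling `start_server(server_env())` starts a server that keeps models loaded and listens on localhost only; nothing is changed for other processes or for later sessions.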
Recommended context size
Ollama's default context window is 2048 tokens. That's too small for most real workflows. A coding session with a few files in context, a long document, or a multi-turn conversation will hit this limit quickly.
Create a Modelfile to set a larger default:
# Create Modelfile
cat << 'EOF' > Modelfile
FROM gemma4:26b
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.0
EOF
# Build your custom model
ollama create gemma4-26b-8k -f Modelfile
# Run it
ollama run gemma4-26b-8k
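A Modelfile bakes the context size into a named model. The same parameters can also be sent per request through the `options` field of Ollama's `/api/chat` endpoint, which is handy for trying a context size before committing to it. A minimal sketch using only the standard library; the helper names are hypothetical, and the URL assumes the default local address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local address (assumption)


def build_chat_request(prompt, model="gemma4:26b", num_ctx=8192,
                       temperature=0.7, repeat_penalty=1.0):
    """Build an /api/chat payload that mirrors the Modelfile parameters."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream
        "options": {
            "num_ctx": num_ctx,
            "temperature": temperature,
            "repeat_penalty": repeat_penalty,
        },
    }


def send(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With a server running, `send(build_chat_request("Summarize this file", num_ctx=16384))` tests a 16K window without creating a new model.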
Recommended context sizes by hardware:
| Hardware | Model | Safe context | Maximum before swapping |
|---|---|---|---|
| 8 GB GPU | E4B | 4K | 8K |
| 12 GB GPU | 26B Q5 | 4K–8K | 16K (tight) |
| 24 GB GPU | 26B Q8 | 16K | 32K |
| Mac 32 GB | 26B Q5 | 8K–16K | 32K (watch memory pressure) |
| Mac 64 GB+ | 31B Q8 | 32K | 128K (model max for larger variants) |
Warning on KV cache: at very long contexts (128K+), the KV cache alone can add 10–20 GB of memory on top of the model weights. Community testing shows 31B at 262K context needs ~22 GB just for KV cache. If you're pushing long context, use --cache-type-k q4_0 in llama.cpp to compress KV cache memory.
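The KV cache grows linearly with context length, which is why long contexts get expensive so quickly. For a standard transformer with full attention on every layer, a rough estimate is 2 × layers × KV heads × head dimension × context tokens × bytes per element. The sketch below applies that formula. The layer and head counts are hypothetical placeholders, not Gemma 4's actual architecture, and models that use sliding-window attention on some layers need less than this upper bound.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2.0):
    """Estimate KV cache size assuming full attention on every layer.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem: 2.0 for f16; about 0.5625 for q4_0 (4.5 bits per value).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 1024**3


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8192, 32768, 131072):
    size = to_gib(kv_cache_bytes(context_tokens=ctx, **SHAPE))
    print(f"{ctx:>7} tokens: {size:.1f} GiB at f16")
```

With this placeholder shape, f16 works out to 1 GiB at 8K, 4 GiB at 32K, and 16 GiB at 128K, which illustrates why the cache can outgrow the weights at very long contexts.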
LM Studio and GGUF settings
LM Studio is a good choice if you want a GUI and don't want to touch a terminal. A few settings worth knowing:
| Setting | Where | Recommended value | Why |
|---|---|---|---|
| GPU Offload | Model load dialog | Max / Auto | Same as OLLAMA_NUM_GPU=99 — offloads as many layers as possible to GPU |
| Context Length | Model load dialog | 4096–8192 | Default 2048 is too small; increase based on your VRAM headroom |
| Flash Attention | Advanced settings | On | Reduces memory and improves speed for long contexts — enable it if your model version supports it |
| Keep Model in Memory | Server settings | On | Prevents the 15–30s reload delay between requests |
| Repeat Penalty | Inference params | 1.0 (disabled) | Google's official recommendation for Gemma 4 is to leave repeat penalty disabled |
For model selection in LM Studio, search for Gemma 4 GGUF builds and choose Q4_K_M when your first goal is fitting the model cleanly, or Q5 when the machine still has real headroom. That quantization choice belongs in LM Studio or llama.cpp. In Ollama, the safest habit is still to choose the right model tier first.
Recommended settings by use case
Benchmarks tell you what a model can do in theory. This section tells you what it feels like in daily use across four common tasks.
Coding assistance
Best setup: 26B or 31B, 8K+ context, thinking mode off for quick edits and on for harder problems.
For routine coding — autocomplete, explaining a function, simple refactors — E4B is genuinely good enough and much faster. The step up to 26B A4B becomes noticeable when you're debugging something non-obvious, reasoning about architectural tradeoffs, or working across multiple files. The 31B's advantage over 26B A4B is smaller than you'd expect for coding specifically; most people can't tell the difference on typical tasks.
One thing that's different from Gemma 3: function calling is reliable now. If you're using Gemma 4 with a coding agent that makes tool calls (read file, write file, run tests), it actually follows the schema consistently without requiring special prompt engineering.
Local agents and automation
Best setup: 26B, thinking mode on, context 8K+, repeat_penalty disabled.
This is where Gemma 4 shows the biggest improvement over its predecessor. Multi-step agentic tasks — plan a task, call tools, check results, adjust — were unreliable with Gemma 3. With Gemma 4 26B A4B, they work. The native function calling means the model returns valid JSON tool calls without needing to be coaxed with elaborate system prompts.
Expect the 26B tier to feel meaningfully slower than E4B on constrained hardware, but much more capable for multi-step work. On stronger Macs and 16 GB+ GPUs, the tradeoff becomes much easier to justify.
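The tool-handling half of an agent loop can be sketched as follows. The schema uses the JSON function format that Ollama's `/api/chat` accepts in its `tools` field, and tool calls come back under `message.tool_calls`. The `read_file` tool, the `TOOLS` registry, and the `run_tool_calls` helper are hypothetical names for illustration; no request is sent here.

```python
import json
from pathlib import Path

# Tool schema to pass as the `tools` field of an /api/chat request.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the text content of a file.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def read_file(path):
    """Local implementation backing the read_file tool."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical registry mapping tool names to local Python functions.
TOOLS = {"read_file": read_file}


def run_tool_calls(message, registry=TOOLS):
    """Execute each tool call in an assistant message.

    Returns tool-role messages to append to the conversation before
    sending the next request.
    """
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # tolerate arguments delivered as JSON text
            args = json.loads(args)
        handler = registry.get(fn["name"])
        output = handler(**args) if handler else f"unknown tool: {fn['name']}"
        results.append({"role": "tool", "content": str(output)})
    return results
```

In a full loop you would send the conversation with `tools=[READ_FILE_TOOL]`, pass the returned message to `run_tool_calls`, append the results, and repeat until the model answers without requesting a tool.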
Chat and document Q&A
Best setup: E4B for fast back-and-forth, 26B for document-heavy work where you need more nuance.
For general chat, E4B is genuinely excellent and the speed makes it feel more like a conversation. You lose some nuance on complex questions, but for everyday use you won't miss the 26B. For document Q&A, especially longer documents where the model has to synthesize information from different parts, the 26B's 256K context and better recall at long range matter. Set context to at least 8K for document work; the default 2K will truncate everything interesting.
Local RAG setup
Best setup: 26B with 8K context, temperature 0.3, thinking mode off.
A minimal working RAG stack with Gemma 4 via Ollama:
# Install dependencies
pip install ollama chromadb
# Python snippet: embed documents and query
import chromadb
import ollama
question = "your question here"
# Initialize an in-memory ChromaDB collection (uses Chroma's default embedding model)
client = chromadb.Client()
collection = client.get_or_create_collection("docs")
# Add documents (simplified)
collection.add(
    documents=["your document text here"],
    ids=["doc1"],
)
# Query: ask for up to 3 chunks, but never more than the collection holds
results = collection.query(
    query_texts=[question],
    n_results=min(3, collection.count()),
)
context = "\n".join(results["documents"][0])
# Ask Gemma 4, using the recommended RAG settings (8K context, temperature 0.3)
response = ollama.chat(
    model="gemma4:26b",
    messages=[
        {
            "role": "system",
            "content": "Answer using only the provided context. Say 'not found' if unsupported.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        },
    ],
    options={"temperature": 0.3, "num_ctx": 8192},
)
print(response["message"]["content"])
Key RAG-specific settings: lower temperature (0.2–0.4) reduces hallucination. Explicit grounding instructions in the system prompt ("answer only from context, say not found if unsupported") matter more than model size for RAG accuracy. E4B works well for RAG if your chunks are short; 26B helps when chunks are longer and need more synthesis.
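Because chunk length affects which model tier you need, it helps to control chunk size explicitly before adding documents to the collection. A minimal sketch of word-based chunking with overlap; the function name and the default sizes are assumptions, not tuned values.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words, so text near a boundary
    appears in both neighboring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

Each chunk can then be added to the collection with its own id, for example `f"doc1-{i}"` for the i-th chunk of the first document.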
Thinking mode: when to turn it on
Gemma 4 has a built-in chain-of-thought reasoning mode. It generates up to 4,000 tokens of internal reasoning before giving a final answer. This significantly improves performance on hard problems but slows down simple ones.
To enable thinking in Ollama, add a system prompt that starts with <|think|>:
ollama run gemma4:26b
# In the prompt, use a system message:
# System: <|think|> You are a helpful assistant...
# Or via API:
curl http://localhost:11434/api/chat -d '{
"model": "gemma4:26b",
"messages": [
{"role": "system", "content": "<|think|> Think step by step."},
{"role": "user", "content": "Your hard question here"}
]
}'
| Task | Thinking mode | Why |
|---|---|---|
| Math / logic problems | On | Dramatic quality improvement; AIME scores jump significantly |
| Complex debugging | On | Reasoning through multi-step failures benefits from chain of thought |
| Code generation (routine) | Off | Adds latency without meaningful quality gain on simple tasks |
| Chat / conversation | Off | Slows responses; feels unnatural for back-and-forth dialogue |
| RAG / document Q&A | Off | Grounding is more important than reasoning; thinking doesn't help much |
| Agent planning | On | Multi-step task planning benefits significantly from extended reasoning |
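The table can be turned into a small helper that decides per task whether to prepend the thinking marker. This sketch assumes the `<|think|>` system-prompt convention described above; the task labels and the `build_messages` helper are hypothetical.

```python
# Task labels (hypothetical) for the rows marked "On" in the table.
THINKING_TASKS = {"math", "debugging", "agent_planning"}


def build_messages(task, user_prompt, system_prompt="You are a helpful assistant."):
    """Return chat messages, enabling thinking only for tasks that benefit."""
    if task in THINKING_TASKS:
        system_prompt = f"<|think|> {system_prompt}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```

The returned list can be used directly as the `messages` field of a chat request, so routine chat and RAG calls skip the extra reasoning latency while math, debugging, and planning calls opt in.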
FAQ
What are the best Gemma 4 settings for most people?
The best general starting point is E4B on smaller machines and 26B on 16 GB GPUs or 32 GB Macs. Those two tiers cover most practical local use without constant memory problems.
Should I use Ollama or LM Studio for Gemma 4?
Use Ollama if you want the fastest path to a stable local setup and simple API access. Use LM Studio if you prefer a GUI and want easier model management. For this site, Ollama remains the default recommendation because it keeps setup and troubleshooting simpler.
How much context should I give Gemma 4?
Start at 4K for chat and coding, then move to 8K-16K only when your workflow actually benefits from longer prompts or documents. Oversized context is one of the fastest ways to make a local setup feel worse.
When should I move from E4B to 26B or 31B?
Move up when your hardware already has headroom and your tasks justify it. If you mainly want faster responses, E4B is often enough. If you want better coding, agent, or document performance, 26B is the meaningful upgrade. Treat 31B as a quality-first option for higher-end machines.