Gemma 4 Offline Knowledge Base Agent: 4090 + 31B + Ollama + LlamaIndex

Is this for you?

This case study suits three kinds of readers: those who can already run Gemma 4 locally, those preparing to build a private knowledge base Q&A system, and those wary of sending personal data to cloud APIs. If you just want to get the model running first, read the Gemma 4 Ollama Setup guide. If you aren't sure your machine is capable, check the Gemma 4 VRAM Requirements page.

Your Situation	Recommendation	Reason
24GB GPU, ready for offline RAG + Agent	Follow the main path of this article	The RTX 4090 / 3090 tier is closest to the original case, making 31B Q4_K_M feasible.
12-16GB GPU, want to build a similar project	Downgrade to Gemma 4 26B or E4B	The 31B setup will squeeze VRAM, leading to high replication costs and poor stability.
CPU only or thin laptop	Do not copy this; take a lightweight route	A knowledge base agent depends not only on model size but also on embedding, indexing, and long contexts, all of which strain low-end hardware.
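The table above can be read as a small decision rule. The sketch below is one way to encode it; the function name and the exact GB cut-offs are illustrative assumptions, and the model tags are the ones this article uses elsewhere.

```python
from typing import List


def candidate_model_tags(vram_gb: float) -> List[str]:
    """Return Gemma 4 tags to try, largest first, for a given amount of GPU memory.

    ASSUMPTION: the 24 GB and 12 GB thresholds mirror the table's tiers;
    they are not measured limits.
    """
    if vram_gb >= 24:
        return ["gemma4:31b", "gemma4:26b", "gemma4:e4b"]
    if vram_gb >= 12:
        return ["gemma4:26b", "gemma4:e4b"]
    # CPU-only or small GPUs: the article recommends a lighter route instead.
    return []
```

An empty list means the machine falls outside the tiers this case study covers, not that no local model can run on it.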

Stack Breakdown: Why This Combination Works

The core of this public case is not the generic conclusion that "Gemma 4 is strong," but that it assembles a complete offline knowledge base pipeline: inference model, local embedding, vector database, RAG framework, and optional front-end, each playing a different role.

"I'm testing an idea: building a fully offline personal knowledge base Agent with Gemma 4. All data is local, all inference is local, no API fees, no privacy issues."

Component	Author's Choice	Our Judgment
Inference Model	Gemma 4 31B, Q4_K_M quantization	Good for 24GB GPU users; unfriendly to 12-16GB machines. Consider downgrading first when reproducing.
Runtime	Ollama	This is the most reproducible part. It's cross-platform and the easiest to copy within a team.
RAG Framework	LlamaIndex	Mature enough for document Q&A and tool orchestration; great for a v1.
Embedding	nomic-embed-text	The author reported better Chinese retrieval on a mixed English/Chinese knowledge base after switching to it; check retrieval quality on your own documents before committing.
Vector DB	Chroma	Simple local persistence; suitable as a starting point for a personal knowledge base.
Frontend	AnythingLLM, later replaced by Streamlit	Shows that "getting it running" and "long-term iteration" are different goals. The author eventually chose the more flexible frontend.

Truly Transferable Experience: The most valuable part is not "I used a 4090," but that the system is broken down into replaceable parts. Even if you don't use 31B, you can keep the Ollama + LlamaIndex + Chroma + Incremental Update skeleton and swap the model for one that fits your machine.
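One way to keep the parts replaceable is to hold every choice in a single config object, so that swapping the model is a one-field change. This is a minimal sketch, not the author's code: the `StackConfig` class and its field names are hypothetical, and the default values are taken from the snippets in this article.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StackConfig:
    """Hypothetical container for the swappable pieces of the stack."""
    llm_model: str = "gemma4:31b"
    embed_model: str = "nomic-embed-text"
    vector_store: str = "chroma"
    persist_dir: str = "./storage"
    chunk_size: int = 1024
    chunk_overlap: int = 200


original = StackConfig()
# Swap only the inference model; every other component stays as it was.
smaller = replace(original, llm_model="gemma4:e4b")
```

Because the dataclass is frozen, `replace` returns a new config and leaves the original untouched, which makes it easy to compare two setups side by side.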

Real Hardware and Data Scale of the Original Case

To avoid the "looks runnable, but the environment is too different" trap, it's useful to list the author's public hardware specs:

Item	Public Specs	What It Means
GPU	RTX 4090 24GB	Leaves headroom for 31B Q4_K_M and running embeddings alongside.
CPU	AMD 7950X	Reduces CPU bottlenecks when rebuilding indexes and parsing documents.
Memory	64GB DDR5	More stable for parsing large documents and batch indexing without running out of RAM.
Storage	2TB NVMe	The knowledge base has about 1800 PDFs, Markdown, and Notion exports; it's not a toy dataset.
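To see why 24GB leaves headroom, a back-of-the-envelope estimate helps. The sketch below assumes roughly 4.85 bits per weight for Q4_K_M, a commonly quoted average that varies by architecture; it ignores the KV cache and runtime overhead, so treat the result as a lower bound.

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# ASSUMPTION: ~4.85 bits/weight for Q4_K_M; real GGUF files differ somewhat.
weights_gb = quantized_weight_gb(31, 4.85)
headroom_gb = 24 - weights_gb
print(f"weights ~ {weights_gb:.1f} GB, headroom ~ {headroom_gb:.1f} GB")
```

Under these assumptions the weights take about 18.8 GB, leaving about 5.2 GB for context and the embedding model. The same arithmetic shows why the 31B setup is a poor fit for 12-16GB cards.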

Shortest Reproduction Steps: Build the Skeleton First

If you plan to implement this on your machine, don't aim for "exactly the same" right away. A more successful sequence is: run the model -> run the embedding -> build the index -> add the Agent and write-back tools.

1. Install Ollama and pull a Gemma 4 version your machine can handle.
2. Pull nomic-embed-text and confirm the local embedding service works.
3. Build the first version of the index with a small batch of documents, rather than dumping everything in at once.
4. Get the Q&A pipeline working, then attach write-back tools like save_to_note.
5. Finally, tune long context, incremental updates, hybrid search, and the frontend UI.

ollama pull gemma4:31b
ollama pull nomic-embed-text

python -m venv .venv
source .venv/bin/activate
pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama chromadb streamlit
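Before building an index, it can help to confirm that both models are actually installed. The sketch below checks the JSON that Ollama's `/api/tags` endpoint returns; the helper names are hypothetical, and the URL assumes Ollama's default local port.

```python
import json
from typing import Iterable, List
from urllib.request import urlopen

# ASSUMPTION: Ollama is listening on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def normalize_tag(name: str) -> str:
    """Ollama treats a model name without a tag as ':latest'."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_payload: dict, required: Iterable[str]) -> List[str]:
    """Return the required models that do not appear in an /api/tags response."""
    installed = {normalize_tag(m["name"]) for m in tags_payload.get("models", [])}
    return [r for r in required if normalize_tag(r) not in installed]


def fetch_tags(url: str = OLLAMA_TAGS_URL) -> dict:
    """Ask a running Ollama server which models are installed."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

With the server running, `missing_models(fetch_tags(), ["gemma4:31b", "nomic-embed-text"])` should return an empty list once both pulls have finished.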

The core LlamaIndex code structure from the original post is below. Its value isn't that you can "copy-paste to production," but that it shows the case is built around standard RAG pipelines.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Route both generation and embeddings through the local Ollama server.
Settings.llm = Ollama(model="gemma4:31b", request_timeout=300.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# Chunking parameters, measured in tokens.
Settings.chunk_size = 1024
Settings.chunk_overlap = 200

# Load every supported file in the folder, embed it, and save the index to disk.
documents = SimpleDirectoryReader("./my_knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)
index.storage_context.persist(persist_dir="./storage")
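The snippet above persists LlamaIndex's default store, while the stack table lists Chroma. A minimal sketch of wiring Chroma in might look like the following. It is not the author's code: the function name, the `./chroma_db` path, and the collection name are hypothetical, and it assumes the extra package `llama-index-vector-stores-chroma` is installed and that `Settings.llm` and `Settings.embed_model` are already configured as shown.

```python
def build_chroma_index(docs_dir: str = "./my_knowledge_base",
                       chroma_path: str = "./chroma_db",
                       collection_name: str = "knowledge_base"):
    """Index docs_dir into a Chroma collection persisted at chroma_path."""
    # Imported lazily so this file still loads where the optional packages are absent.
    import chromadb
    from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
    from llama_index.vector_stores.chroma import ChromaVectorStore

    # A PersistentClient writes the collection to disk under chroma_path.
    client = chromadb.PersistentClient(path=chroma_path)
    collection = client.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    documents = SimpleDirectoryReader(docs_dir).load_data()
    return VectorStoreIndex.from_documents(
        documents, storage_context=storage_context, show_progress=True
    )
```

On later runs the existing collection can be reopened with `VectorStoreIndex.from_vector_store(vector_store)` instead of re-embedding every document.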

Safer Reproduction: If you don't have a 24GB GPU, don't treat gemma4:31b as the default. Swap it for gemma4:26b or gemma4:e4b, verify the pipeline is correct, then decide whether to upgrade the model tier.

Pitfalls and Fixes

The densest part of the original post is the troubleshooting log. This table extracts the problems, symptoms, and fixes.

Problem	Typical Symptom	Author's Fix
OOM / Context too large	OOM when a long document enters, unstable model startup	Cap the context at 8192 tokens (the author cites `--ctx-size 8192`; in Ollama the corresponding option is `num_ctx`) and lower `chunk_size` from 2048 to 1024.
Poor Chinese retrieval hit rate	Chinese notes not found, RAG has "bad memory"	Switch to `nomic-embed-text`, add Chinese stopwords and Hybrid Search.
Agent hallucinations and infinite loops	Model invents records or repeats tool calls	Strengthen system prompt, limit `max_iterations=8`, add self-reflection.
Poor PDF parsing	Bad extraction quality from scanned or table PDFs	Pre-process parsing before ingesting, rather than throwing dirty documents to the indexer.
Slow generation speed	Single-digit tokens/s, painful daily use	Switch to Q4_K_M and offload more layers to the GPU.
Cumbersome KB updates	Rebuilding the whole DB for every new file	Switch to incremental indexing and scheduled updates, instead of full rebuilds.
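The last row, incremental indexing, usually comes down to remembering what was indexed last time and re-processing only what changed. The sketch below shows one common approach, a content-hash manifest. It is an illustration under assumptions, not the author's implementation, and all function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, read in 1 MiB blocks to keep memory use flat."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def scan(root: Path) -> Dict[str, str]:
    """Map each file under root (by relative path) to its content hash."""
    return {str(p.relative_to(root)): file_digest(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def diff_manifest(old: Dict[str, str],
                  new: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (files to index or re-index, files to drop from the index)."""
    changed = [p for p, digest in new.items() if old.get(p) != digest]
    removed = [p for p in old if p not in new]
    return changed, removed
```

A scheduled job would save the result of `scan` after each run, then feed only the `changed` files to the indexer and delete the `removed` ones, instead of rebuilding the whole database.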

Information Reliability

To use this case responsibly, separate "publicly reproducible facts" from "results specific to the author's local environment."

Information Type	Reliability	How to Use It
The Ollama, LlamaIndex, Chroma, nomic-embed-text combo	High	Can be used directly as your v1 architecture skeleton.
Author's public code snippets and commands	Medium-High	Use to understand the pipeline, but replace model tags and params based on your hardware.
Specific numbers like "hit rate increased from 60% to 92%"	Medium	Treat as a directional indicator, not a guarantee for all knowledge bases.
Speed conclusions like "stable 28-35 t/s"	Medium	Highly dependent on hardware, quantization, and context length. Reference only for similar machines.
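Rather than relying on someone else's speed figure, you can measure your own. A non-streaming response from Ollama's `/api/generate` endpoint reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so throughput is a single division. The helper name below is hypothetical.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a non-streaming Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / duration_ns * 1e9
```

As a made-up example, a response with `eval_count` of 300 and `eval_duration` of 10,000,000,000 works out to 30 tokens/s. Measure with the context length and quantization you actually plan to use, since both change the result.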

FAQ

Does this case mean everyone should use Gemma 4 31B?

No. It proves that a 31B-tier Gemma 4 can serve as an offline knowledge base agent on a 24GB GPU, not that 31B is the only correct answer. For most users, it's more reasonable to run the pipeline with a smaller Gemma 4 version first.

If I only have a 12-16GB GPU, is this still worth following?

It's worth referencing the architecture, but not copying the model size. You can keep the Ollama, LlamaIndex, Chroma, and incremental update ideas, but downgrade the inference model to Gemma 4 26B or E4B.

What is the most underestimated problem with offline knowledge base agents?

Not the model itself, but document cleaning and indexing strategies. Scanned PDFs, unreasonable chunking, and wrong embeddings will make the system look like the "model is weak," when in fact the data input is broken.
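To make the chunking part concrete, here is a simplified sliding-window splitter showing what `chunk_size` and `chunk_overlap` control. It is a stand-in for illustration only: LlamaIndex's own splitter is sentence-aware and counts tokens with a tokenizer, whereas this sketch treats any pre-split list as tokens.

```python
from typing import List, Sequence


def sliding_chunks(tokens: Sequence[str], chunk_size: int,
                   overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` items."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

With `chunk_size=1024` and `overlap=200`, each window advances 824 tokens, so a 2048-token document yields three chunks. Larger chunks mean fewer, longer passages per query, which is why the chunk size interacts with the context limit.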

Related Guides

Gemma 4 VRAM Requirements

Gemma 4 Ollama Setup