Gemma 4 Offline Knowledge Base Agent: What to Copy and What to Change
Bottom line: If you have a 24GB GPU and want to build a fully offline personal knowledge base agent, the Gemma 4 31B + Ollama + LlamaIndex + nomic-embed-text + Chroma stack works. However, it is not a one-size-fits-all template. Its true value lies in revealing the hardware baseline, RAG components, incremental updates, and common pitfalls so you can avoid the first wave of mistakes.
Editorial Note: Gemma4Guide organizes public case studies, annotates their risks, and judges how reusable they are, so readers can decide whether a setup is worth replicating. It is not a substitute for the original code repository.
Source and Editorial Boundaries: This article is based on a hands-on post shared on X (Twitter). We turned the raw experience into actionable checklists, prerequisites, and risk boundaries. The original case is from Powerpei's public post. The hardware usage and hit-rate improvements are the author's self-reported results. We have marked which parts are reusable and which are conclusions specific to the author's local environment.
Is this for you?
This case study is best for three types of readers: those who can already run Gemma 4 locally, those preparing to build a private knowledge base Q&A system, and users worried about putting personal data on cloud APIs. If you just want to get the model running first, read the Gemma 4 Ollama Setup. If you aren't sure if your machine is capable, check the VRAM requirements page.
| Your Situation | Recommendation | Reason |
|---|---|---|
| 24GB GPU, ready for offline RAG + Agent | Follow the main path of this article | The RTX 4090 / 3090 tier is closest to the original case, making 31B Q4_K_M feasible. |
| 12-16GB GPU, want to build a similar project | Downgrade to Gemma 4 26B or E4B | The 31B setup will squeeze VRAM, leading to high replication costs and poor stability. |
| CPU only or thin laptop | Do not copy this; take a lightweight route | A knowledge base agent relies not only on model size but also on embedding, indexing, and long contexts, which will drag down performance. |
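The table above can be read as a simple decision rule. The sketch below is only an illustration: the function name `pick_model_tag` is hypothetical, and the article does not say what to do between 16GB and 24GB, so treating that band like the 12-16GB tier is an assumption.

```python
from typing import Optional


def pick_model_tag(vram_gb: float) -> Optional[str]:
    """Suggest an Ollama model tag for a GPU, or None for the lightweight route.

    Thresholds follow the table above. The 16-24 GB band is not covered by
    the article, so it is treated like the 12-16 GB tier (an assumption).
    """
    if vram_gb >= 24:
        return "gemma4:31b"
    if vram_gb >= 12:
        # If 26B proves unstable on your card, fall back to "gemma4:e4b".
        return "gemma4:26b"
    return None  # CPU only or thin laptop: do not copy this stack


suggestions = {vram: pick_model_tag(vram) for vram in (24, 16, 12, 8)}
```

A `None` result means the main path of this article is not a good fit for the machine.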
Stack Breakdown: Why This Combination Works
The core of this public case is not the generic conclusion that "Gemma 4 is strong," but that it assembles a complete offline knowledge base pipeline: inference model, local embedding, vector database, RAG framework, and optional front-end, each playing a different role.
"I'm testing an idea: building a fully offline personal knowledge base Agent with Gemma 4. All data is local, all inference is local, no API fees, no privacy issues."
| Component | Author's Choice | Our Judgment |
|---|---|---|
| Inference Model | Gemma 4 31B, Q4_K_M quantization | Good for 24GB GPU users; unfriendly to 12-16GB machines. Consider downgrading first when reproducing. |
| Runtime | Ollama | This is the most reproducible part. It's cross-platform and the easiest to copy within a team. |
| RAG Framework | LlamaIndex | Mature enough for document Q&A and tool orchestration; great for a v1. |
| Embedding | nomic-embed-text | The author reported better Chinese retrieval after switching to it. Treat that as a result on the author's corpus and test it on your own mixed English/Chinese documents. |
| Vector DB | Chroma | Simple local persistence; suitable as a starting point for a personal knowledge base. |
| Frontend | AnythingLLM, later replaced with Streamlit | Shows that "getting it running" and "long-term iteration" are different. The author eventually chose a more flexible frontend. |
Truly Transferable Experience: The most valuable part is not "I used a 4090," but that the system is broken down into replaceable parts. Even if you don't use 31B, you can keep the Ollama + LlamaIndex + Chroma + Incremental Update skeleton and swap the model for one that fits your machine.
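One way to keep the parts replaceable is to hold the stack choices in a single configuration object. This is a minimal sketch, not the author's code: the `StackConfig` class and its field names are hypothetical, and only the values come from the case.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StackConfig:
    """Component choices from the case; every field can be swapped."""
    llm_model: str = "gemma4:31b"
    embed_model: str = "nomic-embed-text"
    vector_store: str = "chroma"
    rag_framework: str = "llamaindex"
    frontend: str = "streamlit"
    incremental_updates: bool = True


# The reference setup from the original case.
reference = StackConfig()

# Same skeleton, smaller model for a machine with less VRAM.
smaller = replace(reference, llm_model="gemma4:e4b")
```

Only `llm_model` changes between the two configurations; the rest of the skeleton is carried over unchanged.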
Real Hardware and Data Scale of the Original Case
To avoid the "looks runnable, but the environment is too different" trap, it's useful to list the author's public hardware specs:
| Item | Public Specs | What It Means |
|---|---|---|
| GPU | RTX 4090 24GB | Leaves headroom for 31B Q4_K_M and running embeddings alongside. |
| CPU | AMD 7950X | Reduces CPU bottlenecks when rebuilding indexes and parsing documents. |
| Memory | 64GB DDR5 | More stable for parsing large documents and batch indexing without running out of RAM. |
| Storage | 2TB NVMe | The knowledge base has about 1800 PDFs, Markdown, and Notion exports; it's not a toy dataset. |
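A rough back-of-the-envelope estimate shows why 24GB leaves headroom for the 31B model. The bits-per-weight figure below is an assumption (a commonly quoted approximate average for Q4_K_M that varies by model), and the estimate ignores the KV cache, runtime buffers, and the embedding model, all of which draw on the remaining headroom.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    # params * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


Q4_K_M_BITS = 4.85  # assumption: rough average, not an exact figure

weights_gb = estimate_weight_gb(31, Q4_K_M_BITS)
headroom_gb = 24 - weights_gb  # what is left on a 24 GB card

print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```

Under these assumptions the weights land at roughly 19 GB, which is why longer contexts can push a 24GB card into OOM territory.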
Shortest Reproduction Steps: Build the Skeleton First
If you plan to implement this on your machine, don't aim for "exactly the same" right away. A more successful sequence is: run the model -> run the embedding -> build the index -> add the Agent and write-back tools.
- Install Ollama, and pull a Gemma 4 version your machine can handle.
- Pull nomic-embed-text and confirm the local embedding service works.
- Build the first version of the index with a small batch of documents, rather than dumping everything in at once.
- Get the Q&A pipeline working, then attach write-back tools like save_to_note.
- Finally, tune long context, incremental updates, hybrid search, and the frontend UI.
ollama pull gemma4:31b
ollama pull nomic-embed-text
python -m venv .venv
source .venv/bin/activate
pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama chromadb streamlit
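To confirm the local embedding service works before building an index, you can send one request to Ollama's REST API. The sketch below assumes Ollama is listening on its default address and uses the `/api/embed` endpoint; the helper names `build_embed_request` and `embedding_dimension` are illustrative, not part of the original post.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def build_embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build (but do not send) a POST request for one embedding."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embedding_dimension(text: str = "hello") -> int:
    """Send one request to the local server and return the vector length."""
    with urllib.request.urlopen(build_embed_request(text), timeout=30) as resp:
        body = json.load(resp)
    return len(body["embeddings"][0])
```

With the server running and the model pulled, `embedding_dimension()` should return a positive integer. A connection error usually means Ollama is not running.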
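The core LlamaIndex code structure from the original post is below. Its value isn't that you can copy-paste it to production, but that it shows the case is built on a standard RAG pipeline. Note that, as posted, the snippet persists LlamaIndex's default vector store to ./storage; it does not wire in Chroma, which would also need the llama-index-vector-stores-chroma integration that the install command above does not include.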
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
Settings.llm = Ollama(model="gemma4:31b", request_timeout=300.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.chunk_size = 1024
Settings.chunk_overlap = 200
documents = SimpleDirectoryReader("./my_knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)
index.storage_context.persist(persist_dir="./storage")
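To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sliding-window chunker. It is an illustration only: LlamaIndex's default splitter is sentence-aware and counts model tokens, whereas this sketch works on a plain list of tokens, and the function name `chunk_tokens` is hypothetical.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# A 2,500-token document with the settings from the snippet above.
tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(tokens, chunk_size=1024, chunk_overlap=200)
```

Each chunk repeats the last 200 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Smaller chunks also mean less text is passed to the model per retrieved hit.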
Safer Reproduction: If you don't have a 24GB GPU, don't treat gemma4:31b as the default. Swap it for gemma4:26b or gemma4:e4b, verify the pipeline is correct, then decide whether to upgrade the model tier.
Pitfalls and Fixes
The densest part of the original post is the troubleshooting log. This table extracts the problems, symptoms, and fixes.
| Problem | Typical Symptom | Author's Fix |
|---|---|---|
| OOM / Context too large | OOM when a long document is loaded, unstable model startup | Cap the context at --ctx-size 8192 and lower chunk_size from 2048 to 1024. (--ctx-size is llama.cpp's flag name; Ollama calls the same setting num_ctx.) |
| Poor Chinese retrieval hit rate | Chinese notes not found, RAG has "bad memory" | Switch to nomic-embed-text, add Chinese stopwords and Hybrid Search. |
| Agent hallucinations and infinite loops | Model invents records or repeats tool calls | Strengthen system prompt, limit max_iterations=8, add self-reflection. |
| Poor PDF parsing | Bad extraction quality from scanned or table PDFs | Pre-process and parse documents before ingesting them, rather than throwing dirty documents at the indexer. |
| Slow generation speed | Single-digit tokens/s, painful daily use | Switch to Q4_K_M and offload more layers to the GPU. |
| Cumbersome KB updates | Rebuilding the whole DB for every new file | Switch to incremental indexing and scheduled updates, instead of full rebuilds. |
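The last row, incremental indexing, can be sketched with a content-hash manifest. The original post does not say how the author detects changes, so this is one common approach under that assumption; all function names are hypothetical, and the actual re-indexing call is left to your RAG framework.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List, Tuple

Manifest = Dict[str, str]  # relative path -> SHA-256 of file contents


def scan(root: Path) -> Manifest:
    """Hash every file under root."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifest(old: Manifest, new: Manifest) -> Tuple[List[str], List[str]]:
    """Return (files to index or re-index, files to drop from the index)."""
    changed = sorted(path for path, digest in new.items() if old.get(path) != digest)
    removed = sorted(path for path in old if path not in new)
    return changed, removed


def load_manifest(manifest_path: Path) -> Manifest:
    """Read the manifest from the previous run, or start empty."""
    return json.loads(manifest_path.read_text()) if manifest_path.exists() else {}


def save_manifest(manifest_path: Path, manifest: Manifest) -> None:
    """Save only after indexing succeeded, so a failed run is retried."""
    manifest_path.write_text(json.dumps(manifest, indent=2))
```

A scheduled job would call `scan`, compare the result with `load_manifest` through `diff_manifest`, re-index only the changed files, delete the removed ones, and then call `save_manifest`. Keep the manifest file outside the knowledge base folder so it is not scanned itself.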
Information Reliability
To keep this page trustworthy, we separate "public reproducible facts" from "author's local environment results."
| Information Type | Reliability | How to Use It |
|---|---|---|
| The Ollama, LlamaIndex, Chroma, nomic-embed-text combo | High | Can be used directly as your v1 architecture skeleton. |
| Author's public code snippets and commands | Medium-High | Use to understand the pipeline, but replace model tags and params based on your hardware. |
| Specific numbers like "hit rate increased from 60% to 92%" | Medium | Treat as a directional indicator, not a guarantee for all knowledge bases. |
| Speed conclusions like "stable 28-35 t/s" | Medium | Highly dependent on hardware, quantization, and context length. Reference only for similar machines. |
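Because hit-rate figures depend on the knowledge base, it is worth measuring your own. The sketch below computes hit rate at k over a small hand-labeled set of questions. How the author measured the reported numbers is not stated, so this is only one reasonable definition; `retrieve` stands in for whatever retriever your pipeline exposes.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(
    labeled: Dict[str, str],
    retrieve: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected document is in the top-k results."""
    if not labeled:
        raise ValueError("labeled set must not be empty")
    hits = sum(1 for question, doc_id in labeled.items() if doc_id in retrieve(question, k))
    return hits / len(labeled)


def toy_retrieve(question: str, k: int) -> List[str]:
    """Stand-in retriever with fixed results, for illustration only."""
    fixed = {"q1": ["a.md", "b.md"], "q2": ["c.md"]}
    return fixed.get(question, [])[:k]


# q1 expects b.md (found in the top 2); q2 expects x.md (never returned).
rate = hit_rate_at_k({"q1": "b.md", "q2": "x.md"}, toy_retrieve, k=2)
```

Running the same labeled set before and after a change, such as a new embedding model or hybrid search, gives a comparison grounded in your own documents.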
FAQ
Does this case mean everyone should use Gemma 4 31B?
No. It proves that a 31B-tier Gemma 4 can serve as an offline knowledge base agent on a 24GB GPU, not that 31B is the only correct answer. For most users, it's more reasonable to run the pipeline with a smaller Gemma 4 version first.
If I only have a 12-16GB GPU, is this still worth following?
It's worth referencing the architecture, but not copying the model size. You can keep the Ollama, LlamaIndex, Chroma, and incremental update ideas, but downgrade the inference model to Gemma 4 26B or E4B.
What is the most underestimated problem with offline knowledge base agents?
Not the model itself, but document cleaning and indexing strategies. Scanned PDFs, unreasonable chunking, and wrong embeddings will make the system look like the "model is weak," when in fact the data input is broken.
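A cheap pre-ingestion check can catch scanned PDFs before they reach the indexer. The sketch below assumes you already have the text extracted per page from some PDF parser. The function name `needs_ocr` and the 200-character threshold are illustrative assumptions, not values from the original post.

```python
from typing import List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 200) -> bool:
    """Flag a document whose pages yield too little extractable text.

    Scanned PDFs often produce empty or near-empty text per page, so a low
    average character count suggests the file needs OCR or manual cleanup.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


scanned_like = needs_ocr(["", "  ", "\n"])
text_like = needs_ocr(["x" * 500] * 3)
```

Documents flagged this way can be routed to OCR or set aside, so that weak retrieval is not mistaken for a weak model.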