RAG knowledge base
Xiajiao (虾饺) includes a production-style RAG (Retrieval-Augmented Generation) stack. Upload documents and indexing runs automatically; when a question comes in, the Agent calls rag_query through Tool Calling, so your docs become its grounding layer. Pair it with Agent persistent memory: RAG supplies document facts, while memory holds user preferences and history.

An Agent retrieves from the knowledge base and answers in a structured way.
Why RAG?
LLMs are limited by:
- Knowledge cutoff — no brand-new facts after training
- Private data — no access to your internal API docs or manuals
RAG fixes this: retrieve relevant chunks first, then generate grounded answers.
| Without RAG | With RAG |
|---|---|
| “I don’t know your API shape” | “Per your API doc, auth uses Bearer Token…” |
| Higher hallucination risk | Grounded in real docs |
| No citations | Can point to source docs |
Three-stage retrieval pipeline
Layers of filtering keep only relevant context for the LLM:
┌───────────────────────────────────────┐
│                                       │
│       User question                   │
│             ↓                         │
│  ┌──── Stage 1: dual retrieval ────┐  │
│  │                                 │  │
│  │   BM25          Vector search   │  │
│  │   (keywords)    (semantics)     │  │
│  │     ↓                 ↓         │  │
│  │   set A             set B       │  │
│  └──────────┬──────────────────────┘  │
│             ↓                         │
│  ┌──── Stage 2: RRF fusion ────────┐  │
│  │   Merge lists, dedupe           │  │
│  └──────────┬──────────────────────┘  │
│             ↓                         │
│  ┌──── Stage 3: LLM rerank ────────┐  │
│  │   Score Top-N, reorder          │  │
│  │   Take Top-K for context        │  │
│  └──────────┬──────────────────────┘  │
│             ↓                         │
│  Inject into Agent → answer           │
│                                       │
└───────────────────────────────────────┘
Stage 1: dual retrieval
BM25 (lexical)
Query: "API authentication"
→ Matches chunks containing "API" and "auth"
→ Good for exact terms, symbols, identifiers
Vector (semantic)
Query: "how do we authorize requests"
→ Embedding similarity to "auth", "token", "Bearer" passages
→ Good for paraphrases and intent
Why both?
BM25 alone misses synonyms; vectors alone can drift on symbols like getUserById. Together you get precision and recall.
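To make the lexical side concrete, here is a minimal BM25 scorer in Python. It is an illustrative sketch, not Xiajiao's implementation: the naive tokenizer, the `k1=1.5` and `b=0.75` parameters, and the sample chunks are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many chunks does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # lexical search only rewards exact term matches
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            )
            score += idf * norm
        scores.append(score)
    return scores

# Hypothetical chunks, for illustration only.
chunks = [
    "All API calls must send a Bearer token in the Authorization header",
    "The API returns JSON",
    "getUserById loads a user record",
]
exact = bm25_scores("getUserById", chunks)
paraphrase = bm25_scores("how do we authorize requests", chunks)
```

In this toy corpus the identifier query scores only the chunk that contains `getUserById`, while the paraphrased query shares no token with any chunk and scores zero everywhere. That second case is the gap the vector side is meant to fill.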
Stage 2: RRF fusion
RRF score = Σ 1 / (k + rank_i)
k is a constant (often 60); rank_i is the chunk's rank in list i.
Chunks strong in both lists rise to the top after fusion.
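The fusion formula is small enough to sketch directly. The function below is a minimal Python illustration of Reciprocal Rank Fusion using the k = 60 mentioned above; the function name and the chunk IDs are hypothetical, and Xiajiao's own code may differ.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the best hit in each list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; duplicates collapse because scores is keyed by ID.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from the two Stage 1 retrievers.
bm25_hits = ["c1", "c2", "c3"]
vector_hits = ["c3", "c1", "c4"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

Here `c1` and `c3` appear in both lists, so they outrank `c2` and `c4`, which each appear in only one. Because RRF uses ranks rather than raw scores, BM25 scores and cosine similarities never need to be put on a common scale.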
Stage 3: LLM reranking
Take Top-N fused candidates; have the LLM score relevance:
System: Score each passage 0–10 for relevance to the query.
Query: "API authentication"
Passage 1: "All API calls must send Bearer Token in Header..." → 9
Passage 2: "API returns JSON..." → 3
Passage 3: "Auth module uses JWT, 24h expiry..." → 8
Inject the highest-scoring Top-K into the Agent context.
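The select-and-truncate step around the LLM call can be sketched as follows. This is a hedged Python illustration: `rerank` and `fake_llm_score` are hypothetical names, and the scorer is a stand-in that returns the canned scores from the example above instead of calling a model.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    passages: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate with score_fn and keep the top_k highest."""
    scored = [(p, score_fn(query, p)) for p in passages]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Stand-in for the LLM: canned 0-10 scores copied from the example above.
CANNED_SCORES = {
    "All API calls must send Bearer Token in Header...": 9,
    "API returns JSON...": 3,
    "Auth module uses JWT, 24h expiry...": 8,
}

def fake_llm_score(query: str, passage: str) -> float:
    """Hypothetical scorer; a real one would prompt the model and parse its reply."""
    return CANNED_SCORES[passage]

top = rerank("API authentication", list(CANNED_SCORES), fake_llm_score, top_k=2)
```

Passing the scorer in as a function keeps the ranking logic testable without a model call. A real scorer would also need to handle replies that are not valid numbers.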
Hierarchical chunking
Retrieval quality depends on chunking. Xiajiao (虾饺) uses parent/child chunks:
Two-tier design
┌────────────────────────────────────┐
│  Parent chunk (~800 chars)         │
│                                    │
│  ┌──────────┐  ┌──────────┐        │
│  │  Child   │  │  Child   │  ...   │
│  │ ~200 ch  │  │ ~200 ch  │        │
│  └──────────┘  └──────────┘        │
│                                    │
└────────────────────────────────────┘
- Child (~200 chars): retrieval granularity; tighter focus, better hits
- Parent (~800 chars): context for the model; full surrounding text
Flow: search children → map to parents → send parents to the LLM.
Why not retrieve on parents only?
Large parents mix topics; one embedding averages everything and hurts precision. Small children retrieve accurately; parents restore context.
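The search-children, return-parents flow can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: it cuts at fixed character offsets using the 800/200 sizes described above, whereas a real splitter would more likely respect sentence or heading boundaries, and both function names are hypothetical.

```python
def build_chunks(text, parent_size=800, child_size=200):
    """Split text into fixed-size parents, then split each parent into children."""
    parents, children = [], []
    for p_start in range(0, len(text), parent_size):
        parent_text = text[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(parent_text)
        for c_start in range(0, len(parent_text), child_size):
            # Each child remembers which parent it was cut from.
            children.append({
                "parent_id": parent_id,
                "text": parent_text[c_start:c_start + child_size],
            })
    return parents, children

def parents_for_hits(child_hits, children, parents):
    """Map matching child indexes to their parents, deduplicated, order kept."""
    seen, result = set(), []
    for idx in child_hits:
        pid = children[idx]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            result.append(parents[pid])
    return result

doc = "x" * 1700  # placeholder text: yields parents of 800, 800 and 100 chars
parents, children = build_chunks(doc)
context = parents_for_hits([5, 4, 8], children, parents)
```

Two child hits that fall inside the same parent yield that parent only once, so the model does not receive the same surrounding text twice.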
Supported formats
| Format | Notes | Parsing |
|---|---|---|
| PDF | Common | pdf-parse |
| TXT | Plain text | Direct read |
| Markdown | Docs | Direct read, structure kept |
| Other text | .log, .csv, etc. | Direct read |
Usage
Upload
In the web UI, open Agent settings and upload into the knowledge area:
- Open Agent settings
- Upload in the KB section
- System parses → chunks → embeds → indexes
Query
The Agent uses rag_query (see Tool Calling):
You: @Code assistant How do we call the payment API?
Code assistant: [rag_query: "payment API usage"]
→ pulls from your uploaded API doc
→ answers from retrieved text
Compared to other RAG stacks
| | Xiajiao (虾饺) RAG | Dify RAG | LangChain RAG |
|---|---|---|---|
| Retrieval | BM25 + vector + LLM rerank | Vector + keywords | Configurable |
| Chunking | Hierarchical | Fixed size | Configurable |
| Vector store | SQLite (zero extra services) | Qdrant / Weaviate | FAISS / Pinecone |
| Deploy | npm start | Docker Compose | Bring your own |
| External deps | None for vectors | Vector DB usually required | Often vector DB |
Xiajiao's RAG is less flexible than a bespoke stack, but it needs no external vector DB: everything runs in one Node process with SQLite.
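To show why SQLite alone can serve as a small vector store, here is a minimal sketch that keeps embeddings as BLOBs and runs a brute-force cosine scan. It is written in Python for brevity (Xiajiao itself runs on Node); the `chunks` table, its columns, and the hand-made 3-dimensional vectors are assumptions for illustration, not Xiajiao's actual schema or real embeddings.

```python
import math
import sqlite3
import struct

def pack(vec):
    """Serialize a list of floats to a BLOB of 32-bit floats."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    """Inverse of pack: BLOB of 32-bit floats back to a list."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)"
)
# Toy vectors standing in for real embedding-model output.
rows = [
    ("Auth uses a Bearer token", [0.9, 0.1, 0.0]),
    ("The API returns JSON", [0.1, 0.9, 0.0]),
    ("Orders move from pending to paid", [0.0, 0.1, 0.9]),
]
conn.executemany(
    "INSERT INTO chunks (text, embedding) VALUES (?, ?)",
    [(text, pack(vec)) for text, vec in rows],
)

def search(query_vec, top_k=2):
    """Brute-force scan: score every stored vector, return the top_k texts."""
    scored = [
        (cosine(query_vec, unpack(blob)), text)
        for text, blob in conn.execute("SELECT text, embedding FROM chunks")
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

results = search([1.0, 0.2, 0.0])
```

A full scan like this costs time linear in the number of chunks. That is usually acceptable for a per-Agent knowledge base, and it is the kind of trade-off a dedicated vector database avoids with approximate indexes.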
Tuning tips
1. Document quality
- Clear structure: headings, lists, paragraphs—not walls of text
- High signal, low fluff
- Consistent terminology
2. Chunk sizes
Defaults are about 200 characters per child chunk and 800 per parent. Documents with very long paragraphs may need these sizes tuned at the code level.
3. Embedding model
The default embedding model is text-embedding-3-small. With providers such as Qwen, that provider's matching embedding model is used instead. For Chinese-heavy corpora, a Chinese-optimized embedding model helps.
4. Query wording
BM25 is lexical. If “how do I pay” fails, try the doc’s exact term payment API.
5. What works well / poorly
| Works well | Works poorly |
|---|---|
| API docs (structured, crisp terms) | Scanned PDFs (bad OCR) |
| Specs with clear sections | Dense Excel exports |
| Q&A style manuals | Image-only slide decks |
| Markdown notes | Long nested legal prose |
Example trace
User: @Code assistant What order statuses exist?
Internal:
├─ BM25 (FTS5): "order status"
│ hit 1: "Statuses: pending, paid, shipped, completed, cancelled" (8.2)
│ hit 2: "POST /api/orders" (3.1)
│
├─ Vector: embed("what order statuses")
│ hit 1: "enum OrderStatus { PENDING, PAID, ... }" (0.89)
│ hit 2: "Flow: pending→paid→shipped→completed" (0.85)
│
├─ RRF → Top 5
│
└─ LLM rerank → inject top 3 chunks
Assistant:
"Per your docs there are five statuses:
1. pending
2. paid
3. shipped
4. completed
5. cancelled
Flow: pending → paid → shipped → completed; any → cancelled.
Source: API doc v2.3 §4.2"
Related docs
- Tool Calling: rag_query details and ACLs
- Agent persistent memory: personalization
- Security & privacy
- Collaboration flow
- Recipes: private KB assistant, support bots
- Architecture: RAG implementation
