
LuaN1aoAgent uses a Retrieval-Augmented Generation (RAG) system to provide the Executor with domain-specific attack payloads, bypass techniques, and vulnerability exploitation methods during task execution. Without the knowledge base, the agent relies solely on the LLM’s parametric knowledge.

Why the knowledge base matters

During execution, the Executor can call the retrieve_knowledge MCP tool to fetch relevant techniques for a specific attack scenario — for example, retrieving SQL injection payloads for a specific database, or WAF bypass sequences for a detected firewall product. The knowledge service returns semantically ranked results from a FAISS vector index built from your local documents. The agent also uses the distill_knowledge tool to write new attack insights discovered during a task back into the knowledge base, enabling accumulation of custom intelligence over time.

Step 1: Set up PayloadsAllTheThings

The recommended starter knowledge base is PayloadsAllTheThings, which contains a comprehensive set of attack payloads organized by vulnerability type.
mkdir -p knowledge_base
git clone https://github.com/swisskyrepo/PayloadsAllTheThings \
    knowledge_base/PayloadsAllTheThings
The knowledge base directory is at <project_root>/knowledge_base/. You can add any number of subdirectories with .md or .txt files — all will be indexed.

Step 2: Build the vector index

Run the knowledge base preparer to scan documents, chunk them, generate embeddings, and write the FAISS index:
cd rag
python -m rag_kdprepare
This takes a few minutes on first run, depending on your hardware and the size of the knowledge base. Subsequent runs are incremental — only new or modified files are re-vectorized.
The preparer checks file hashes (SHA-256) to detect changes. If a document has not changed since the last run, it is skipped entirely.
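The incremental check described above can be sketched as follows. This is an illustrative sketch, not the project's actual code, and it assumes the manifest is a flat JSON map from doc_id to hash (the real manifest format may differ):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file's contents for change detection."""
    h = hashlib.sha256()
    h.update(path.read_bytes())
    return h.hexdigest()

def detect_changes(kb_dir: Path, manifest_path: Path):
    """Compare current file hashes against the stored manifest.

    Returns (changed_or_new, deleted) lists of doc_ids, where a doc_id
    is the file's path relative to the project root.
    """
    old = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text())
    current = {
        str(p.relative_to(kb_dir.parent)): sha256_of(p)
        for p in kb_dir.rglob("*")
        if p.suffix in {".md", ".txt"}
    }
    changed = [d for d, h in current.items() if old.get(d) != h]
    deleted = [d for d in old if d not in current]
    return changed, deleted
```

Unchanged documents never appear in either list, which is what lets subsequent runs skip them entirely.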

What rag_kdprepare does

1. Scan the knowledge_base directory
   Walks knowledge_base/ recursively, collecting all .md and .txt files. Each file is assigned a stable doc_id based on its relative path from the project root.

2. Detect new and modified documents
   Compares SHA-256 hashes against rag/faiss_db/faiss_manifest.json. New documents and documents whose hash has changed are queued for processing; documents that no longer exist are removed from the index.

3. Chunk documents
   Passes each document through MarkdownChunker, which splits content into chunks respecting Markdown headings. Chunk sizes are controlled by environment variables:
   RAG_MIN_CHUNK_SIZE=100    # minimum characters per chunk
   RAG_MAX_CHUNK_SIZE=1000   # maximum characters per chunk

4. Generate embeddings
   Encodes chunks using a SentenceTransformer model (loaded from rag/models/all-MiniLM-L6-v2 if available locally). Falls back to an offline hash-based embedder (OfflineHasherEmbedder, 384 dimensions) if the model cannot be loaded.

5. Write to the FAISS index
   Adds L2-normalized vectors to a faiss.IndexIDMap2 wrapping a faiss.IndexFlatIP (inner product over normalized vectors equals cosine similarity). Persists the index and document store:
   rag/faiss_db/
   ├── kb.faiss              # FAISS vector index
   ├── kb_store.json         # chunk text and metadata
   └── faiss_manifest.json   # document hash manifest

Force-rebuilding the index

To force a full rebuild of the entire index:
cd rag
python -m rag_kdprepare --force-all
To force rebuilding only documents matching a specific pattern:
python -m rag_kdprepare --force-doc=SQLInjection
Both flags are also available as environment variables:
RAG_FORCE_ALL=true python -m rag_kdprepare
RAG_FORCE_DOCS=SQLInjection,XSS python -m rag_kdprepare
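One way the flag/environment equivalence could be wired up, shown as an illustrative sketch (the actual argument parsing in rag_kdprepare may differ):

```python
import argparse
import os

def resolve_force_options(argv=None):
    """Merge CLI flags with environment variables.

    Environment values act as defaults, so RAG_FORCE_ALL=true and
    --force-all are interchangeable, as are RAG_FORCE_DOCS and --force-doc.
    """
    parser = argparse.ArgumentParser(prog="rag_kdprepare")
    parser.add_argument(
        "--force-all", action="store_true",
        default=os.getenv("RAG_FORCE_ALL", "").lower() == "true")
    parser.add_argument(
        "--force-doc", default=os.getenv("RAG_FORCE_DOCS", ""))
    args = parser.parse_args(argv)
    # RAG_FORCE_DOCS accepts a comma-separated list of patterns
    patterns = [p for p in args.force_doc.split(",") if p]
    return args.force_all, patterns
```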

Step 3: Start the knowledge service

The knowledge service is a FastAPI application that exposes the FAISS index over HTTP:
python -m uvicorn rag.knowledge_service:app --port 8081
The service loads the FAISS index on startup and responds to semantic retrieval queries.

Auto-start behavior

The agent automatically starts the knowledge service if it is not already running when a task begins. The KnowledgeServiceManager in agent.py:
  1. Checks GET /health on http://127.0.0.1:8081
  2. If the service is not healthy, spawns a uvicorn subprocess with start_new_session=True
  3. Polls health every 500ms for up to 5 seconds
  4. Proceeds with the task if the service becomes healthy; logs a warning if it times out
The auto-started knowledge service process is detached from the agent. It continues running after the agent exits. On the next run, the health check will detect it is already running and skip the startup step.
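The steps above can be sketched as follows. This is an illustrative approximation of the KnowledgeServiceManager behavior described here, not its actual implementation; the function names are hypothetical:

```python
import subprocess
import sys
import time
import urllib.request

HEALTH_URL = "http://127.0.0.1:8081/health"

def is_healthy(url: str = HEALTH_URL, timeout: float = 1.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def ensure_service(poll_interval: float = 0.5, max_wait: float = 5.0) -> bool:
    """Start the knowledge service if it is not already running, then poll."""
    if is_healthy():
        return True
    # start_new_session=True detaches the child, so it outlives the agent
    subprocess.Popen(
        [sys.executable, "-m", "uvicorn",
         "rag.knowledge_service:app", "--port", "8081"],
        start_new_session=True,
    )
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if is_healthy():
            return True
        time.sleep(poll_interval)
    return False  # caller logs a warning and proceeds with the task
```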

Health check

Verify the knowledge service is running and the index is loaded:
curl http://localhost:8081/health
Expected response:
{
    "status": "healthy",
    "knowledge_base": {
        "status": "healthy",
        "total_chunks": 4821
    }
}
If status is "unavailable", the FAISS index was not found or failed to load. Re-run rag_kdprepare.

Adding custom knowledge documents

Place any .md or .txt files anywhere under knowledge_base/ and re-run rag_kdprepare. The preparer scans the entire directory tree, so subdirectory organization is up to you:
knowledge_base/
├── PayloadsAllTheThings/    # upstream knowledge base
│   ├── SQL Injection/
│   ├── XSS Injection/
│   └── ...
└── custom/                  # your organization's techniques
    ├── internal-targets.md
    ├── waf-bypass-notes.md
    └── custom-payloads.txt

The retrieve_knowledge and distill_knowledge tools

During task execution, the Executor can invoke these tools via MCP:
retrieve_knowledge
Performs a semantic similarity search against the FAISS index and returns the top-k most relevant chunks. Example query (invoked internally by the agent):
{
  "query": "MySQL blind SQL injection time-based payload",
  "top_k": 5
}
Response:
{
  "success": true,
  "query": "MySQL blind SQL injection time-based payload",
  "total_results": 5,
  "results": [
    {
      "text": "' AND SLEEP(5)--\n' AND (SELECT * FROM ...",
      "meta": { "type": "code_block", "doc_id": "knowledge_base/PayloadsAllTheThings/SQL Injection/..." }
    }
  ]
}
Timeout: 15 seconds (configurable via TOOL_TIMEOUT_RETRIEVE in .env).
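A client-side sketch of issuing such a query over HTTP. The /retrieve route name here is an assumption for illustration only; consult rag/knowledge_service.py for the actual endpoint path:

```python
import json
import urllib.request

def build_retrieve_request(base_url: str, query: str,
                           top_k: int = 5) -> urllib.request.Request:
    """Build a POST request carrying the {query, top_k} JSON body.

    NOTE: the '/retrieve' path is a hypothetical route name used for
    this sketch; the real service may expose a different route.
    """
    body = json.dumps({"query": query, "top_k": top_k}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/retrieve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def retrieve(base_url: str, query: str, top_k: int = 5) -> dict:
    """Send the query and decode the JSON response."""
    req = build_retrieve_request(base_url, query, top_k)
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.load(resp)
```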
distill_knowledge
Writes new attack insights discovered during a task back into the knowledge base as a custom document. This enables the agent to accumulate intelligence across runs. Timeout: 20 seconds (configurable via TOOL_TIMEOUT_DISTILL in .env).

Configuring the knowledge service port

The default port is 8081. Override it in .env:
KNOWLEDGE_SERVICE_PORT=8081
KNOWLEDGE_SERVICE_HOST=127.0.0.1
KNOWLEDGE_SERVICE_URL=http://127.0.0.1:8081
If you change the port, update all three variables to keep them consistent. The agent reads KNOWLEDGE_SERVICE_URL to determine where to send health checks and tool requests.
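A sketch of how these variables might be resolved in code. Only the fact that the agent reads KNOWLEDGE_SERVICE_URL is stated above; the host/port fallback shown here is an assumption for illustration:

```python
import os

def knowledge_service_url() -> str:
    """Resolve the service URL from the environment.

    Prefers the explicit KNOWLEDGE_SERVICE_URL; falls back to
    composing it from KNOWLEDGE_SERVICE_HOST and KNOWLEDGE_SERVICE_PORT
    (fallback behavior assumed, not confirmed by the docs above).
    """
    url = os.getenv("KNOWLEDGE_SERVICE_URL")
    if url:
        return url
    host = os.getenv("KNOWLEDGE_SERVICE_HOST", "127.0.0.1")
    port = os.getenv("KNOWLEDGE_SERVICE_PORT", "8081")
    return f"http://{host}:{port}"
```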