Caesar

Autonomous AI research agent. Caesar explores the web as a knowledge graph and refines its answer through adversarial self-critique — writing deeper, more creative syntheses than today's deep-research tools.

Caesar's two-phase architecture. (Left) Phase 1 — Deep Web Exploration. A dynamic policy drives a Perceive–Think–Act loop that traverses the web and writes insights into a knowledge graph and vector store. (Right) Phase 2 — Adversarial Artifact Synthesis. Insights are recalled to draft an answer, then a Generator–Verifier loop refines it via adversarial query refinement and draft merging.
Creativity score: 25.29 / 30 · +14% vs. top baseline (Gemini 3 Deep Research) · p < 0.001 (Mann–Whitney U)
Read the paper: Caesar: Deep Agentic Web Exploration for Creative Answer Synthesis · Liang, Meyerson, Miikkulainen, 2026 · DOI · PDF

Caesar is an autonomous AI research agent. Instead of summarizing a flat list of search results, it treats the web as a graph — building a dynamic knowledge graph as it explores, backtracking when it stagnates, and refining its answer through an adversarial Generator–Verifier loop. The result is deeper, more novel synthesis on the open-ended, cross-disciplinary questions retrieval alone cannot answer.

Why Caesar?

Today's deep-research agents — ChatGPT Deep Research, Perplexity, Gemini Deep Research, GPT Researcher — optimize retrieval precision over a flat sequence of documents. They produce competent summaries but fall into local minima, suffer from navigational amnesia, and converge on derivative, consensus-driven outputs.

Caesar is built differently:

- Builds a knowledge graph as it explores
- Adversarial self-critique on its own draft
- Multiple drafts, then merged into one
- Backtracks when an exploration path stalls
- Multi-provider (OpenAI / Anthropic / Gemini)
- Reproducible run logs (JSON)

Benchmark Results

A blinded three-model LLM-as-a-Judge panel (Claude Sonnet 4.5, GPT-5.2, Gemini 3 Pro) scored each answer 0–10 on three creativity dimensions: New, Useful, and Surprising (maximum total 30).

| Agent | New | Useful | Surprising | Total |
|---|---|---|---|---|
| Caesar | 8.64 | 8.38 | 8.27 | 25.29 |
| Gemini 3 Deep Research | 7.69 | 7.09 | 7.49 | 22.27 |
| Sonnet 4.5 Deep Research | 6.96 | 7.20 | 6.73 | 20.89 |
| GPT-5.2 Deep Research | 5.02 | 6.02 | 4.36 | 15.40 |

Mann–Whitney U across all settings: p < 0.001. Ablations confirm both graph exploration and the adversarial verifier loop are independently necessary. See the paper for full methodology, exploration-budget ablation, and judge bias analysis.
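For readers who want to run the same significance test on their own results, the comparison is a standard one-sided Mann–Whitney U. A minimal sketch, with made-up per-query totals rather than the paper's data:

```python
# Illustrative only: hypothetical per-query creativity totals, not the paper's data.
from scipy.stats import mannwhitneyu

caesar_totals   = [25.1, 26.0, 24.8, 25.7, 24.9]  # hypothetical Caesar scores
baseline_totals = [22.3, 21.9, 22.8, 22.1, 22.4]  # hypothetical baseline scores

# One-sided test: does Caesar's score distribution stochastically dominate the baseline's?
stat, p = mannwhitneyu(caesar_totals, baseline_totals, alternative="greater")
print(f"U = {stat}, p = {p:.4f}")
```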

Example Output

To make the difference concrete, here are excerpts from a Cross-Domain Synthesis query — a test of structural isomorphism rather than surface metaphor.

Query: Apply the mathematical structure of calculus (not the concepts, but the formal relationships) to cooking. What would a "derivative" of a recipe mean? An "integral"? Show the parallel structure.

Caesar

Imagine a recipe as a machine with knobs you can turn (amounts, times, temperatures) and gauges you read at the end (taste, texture, yield). … Derivative is sensitivity: stack the knob-to-gauge nudges into a Jacobian; sensitivities pass through subrecipes by chain, product, and quotient rules. Integral is accumulation along the cooking path. … When loops do not cancel (order matters), the nonzero loop totals are curvature. You can cure that by adding the smallest extra state — an accumulator — so increments telescope again, turning history into explicit memory.

Claude Sonnet 4.5 (baseline)

A derivative asks: "If I change this slightly, what happens?" For delicate cake, two extra minutes transforms perfection into dryness — the time derivative is steep. … Integrals measure what accumulated over the entire cooking time. … Different cooking paths can integrate to the same result. This is why sous vide works: low temperature × long time = high temperature × short time.


Full listings, the recursive insight chain that produced this answer, and per-criterion judge scores are in Appendices A and C of the paper.

How It Works

The architecture diagram at the top of the page shows the two-phase loop. Here is what each phase actually does — and why each piece earns its place.

1. Deep Web Exploration — stateful graph traversal

Caesar treats exploration as a graph traversal problem rather than a sequence of isolated retrievals. Given a user query, it bootstraps a starting page, generates a task-specific persona to focus reasoning, and then runs a recursive Perceive–Think–Act loop until its budget is exhausted.
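A minimal sketch of that control flow, assuming duck-typed `kg`, `vector_store`, `llm`, and `web` helpers. The names, frontier-scoring heuristic, and backtracking rule are illustrative, not Caesar's actual implementation:

```python
# Illustrative sketch of the Perceive-Think-Act loop. Helper names, the
# frontier-scoring heuristic, and the backtracking rule are assumptions,
# not Caesar's actual algorithm (see Section 3 of the paper).
def explore(query, start_url, budget, kg, vector_store, llm, web):
    persona = llm.generate_persona(query)        # task-specific reasoning lens
    frontier = [start_url]                       # candidate pages to visit next
    for _ in range(budget):
        if not frontier:                         # path stalled: backtrack to
            frontier = kg.unexplored_neighbors() # unexplored earlier nodes
            if not frontier:
                break
        url = max(frontier, key=lambda u: llm.score_url(u, query, persona))
        frontier.remove(url)
        page = web.fetch(url)                                 # Perceive
        insight = llm.extract_insight(page, query, persona)   # Think
        kg.add_node(url, insight)                             # Act: grow the graph
        vector_store.add(insight, metadata={"url": url})
        frontier += web.links(page)                           # widen the frontier
    return kg, vector_store
```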

2. Adversarial Artifact Synthesis — Generator–Verifier loop

Rather than a single-pass summary, Phase 2 runs as a self-correcting Generator–Verifier loop driven by the knowledge base built in Phase 1.
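A corresponding sketch under the same assumptions about helper objects; the critique format, stopping rule, and merge step here are illustrative:

```python
# Illustrative sketch of the Generator-Verifier loop. The critique format,
# stopping rule, and merge step are assumptions (see Section 4 of the paper).
def synthesize(query, vector_store, llm, max_rounds=5):
    insights = vector_store.recall(query)            # recall Phase 1 insights
    draft = llm.generate_draft(query, insights)      # Generator: first pass
    for _ in range(max_rounds):
        critique = llm.verify(query, draft)          # Verifier: adversarial critique
        if critique.passes:
            break
        # Adversarial query refinement: weaknesses become new recall queries
        extra = vector_store.recall(critique.refined_query)
        revision = llm.generate_draft(query, insights + extra, feedback=critique)
        draft = llm.merge_drafts(draft, revision)    # keep the best of both drafts
    return draft
```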

The exact algorithms (URL scoring, persona generation, insight prompting, verifier prompts) are in Sections 3–4 of the paper.

What Exploration Looks Like

The exploration policy is query-aware: the shape of the knowledge graph adapts to the kind of question being asked. The benchmark uses five query categories (Constrained Synthesis, Counterfactual Reasoning, Cross-Domain Synthesis, Meta-Creativity, and Open-Ended Synthesis), each chosen to stress a distinct creative failure mode of standard LLMs.

Figure 2. Knowledge graphs Caesar builds during exploration, one per query category. Brighter colors indicate deeper exploration from the root (red); cyan nodes are sources cited in the final artifact. Constrained (a), Cross-Domain (c), and Meta-Creativity (d) produce starburst topologies — breadth-first traversal across disjoint semantic clusters to find novel intersections. Counterfactual (b) and Open-Ended (e) produce long linear chains — depth-first traversal that drills into a specific causal path or niche.
Figure 3. A single exploration trajectory across 1000 steps for an open-ended question. Caesar starts depth-first (steps 100–400, a long linear descent), then backtracks and branches breadth-first once the initial path is exhausted (steps 600–800). The final t-SNE panel shows the diversity of topics ultimately captured in the knowledge base.

Use Cases

Caesar is built for open-ended, creative, cross-disciplinary research — the questions retrieval alone cannot answer.

FAQ

How is Caesar different from LangGraph, CrewAI, or AutoGen?

Those are orchestration frameworks: they help you wire up agents. Rome (the framework Caesar is built on) is an opinionated runtime for how agents should reason — graph-structured exploration, adversarial verification, episodic memory. Caesar is a concrete research agent built on top.

Do I need GPUs?

No. Caesar uses hosted LLM APIs (OpenAI, Anthropic, Gemini). A local ChromaDB instance handles the vector store. Runs comfortably on a laptop.
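For instance, a persistent local store takes a few lines with the chromadb client; the collection name, path, and contents below are illustrative, not Caesar's defaults:

```python
# Minimal local vector store; no hosted embedding DB required.
# Collection name, path, and contents are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./caesar_db")
insights = client.get_or_create_collection("insights")
insights.add(
    ids=["n1"],
    documents=["Curvature as non-commuting recipe steps"],
    metadatas=[{"url": "https://example.com/page"}],
)
print(insights.query(query_texts=["recipe calculus"], n_results=1))
```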

Which models are supported?

OpenAI (GPT-5 family, o-series reasoning models), Anthropic (Claude 4.5 / 4.6 / 4.7), Google (Gemini 3 Pro), and any OpenAI-compatible endpoint. Model selection is per-subsystem (exploration, synthesis, judging) via YAML config.
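A per-subsystem selection could look roughly like this; the key names and model identifiers are illustrative, not the repository's actual config schema:

```yaml
# Illustrative only: key names and model identifiers are assumptions,
# not the repository's actual schema.
models:
  exploration: claude-haiku-4.5   # fast, cheap Perceive-Think-Act steps
  synthesis: gpt-5                # stronger model for drafting and merging
  judging: gemini-3-pro           # separate judge to limit self-preference
```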

How much does a typical run cost?

A 5-iteration exploration with Claude Haiku 4.5 costs roughly $0.30 and takes about 10 minutes. A 250-iteration deep run with GPT-5.4-mini typically costs $5–$10.

Is Caesar a good ChatGPT Deep Research alternative?

Yes, for open-ended creative or cross-disciplinary research. In blinded judge evaluation, Caesar scored 25.29 vs ChatGPT Deep Research's 15.40 on creative reasoning. Caesar is designed for depth and novelty, not latency, so it is not a drop-in replacement for real-time chat scenarios.

What does a Caesar run produce?

A cited research artifact (abstract plus body, with inline references to every source URL visited), a serialized knowledge graph of the exploration trajectory, and a JSON run summary capturing tokens, cost, wall-time, pages visited, and per-draft provenance — enough to audit, reproduce, or pipe the result into a downstream system.
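A run summary might look roughly like the following; field names and values are illustrative, not the actual schema:

```json
{
  "query": "Apply the mathematical structure of calculus to cooking ...",
  "pages_visited": 212,
  "tokens": { "input": 1840000, "output": 96000 },
  "cost_usd": 7.42,
  "wall_time_seconds": 3120,
  "drafts": [
    { "round": 1, "verifier_verdict": "revise" },
    { "round": 2, "verifier_verdict": "accept" }
  ]
}
```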

Citation

```bibtex
@misc{liang26caesar,
  title={Caesar: Deep Agentic Web Exploration for Creative Answer Synthesis},
  author={Jason Liang and Elliot Meyerson and Risto Miikkulainen},
  year={2026},
  eprint={2604.20855},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2604.20855},
}
```