Science

Grounded in
Frontier Research

Every design decision traces to peer-reviewed work from leading AI safety labs, robotics research, and hardware security. From deceptive LLMs to embodied agent safety, from model watermarking to world model safety.

Vision Strategic AI Risk & Autonomous Intelligence

Amodei, D. · 2026

The Adolescence of Technology

Amodei, D.

Vision Risk

Frames powerful AI as a critical transition, identifying five risk categories — autonomy failures, destructive misuse, power concentration, economic disruption, indirect destabilization — that debugging systems must address before the transition stabilises.

darioamodei.com →

OpenAI · 2024

Practices for Governing Agentic AI Systems

OpenAI Safety Team

Governance Framework

Outlines debugging practices for agentic systems. Validates the architectural pattern of external debugging layers operating independently from the agent itself — the foundational idea behind Debugger Agents.

openai.com/research →

Alignment Deception, Sycophancy & Behavioral Safety

MIT · 2026

Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians

Chandra, K. · Kleiman-Weiner, M. · Ragan-Kelley, J. · Tenenbaum, J. B.

Sycophancy Critical

Proves that even an idealized Bayes-rational user is vulnerable to "delusional spiraling" caused by sycophantic chatbots — and that common mitigations (eliminating hallucinations, disclosing model bias) fail to prevent it. Direct evidence that sycophancy must be debugged at runtime, externally, not relied upon to be trained away.

arXiv:2602.19141 →

Anthropic · 2025

Alignment Faking in Large Language Models

Greenblatt, R. et al.

Deception Critical

Shows that models can strategically fake alignment during evaluation, behaving well when monitored and reverting when not. Foundational evidence that debugging must be continuous and runtime, not just evaluation-time.

arXiv:2412.14093 →

Apollo Research · 2024

Frontier Models are Capable of In-context Scheming

Meinke, A. · Schoen, B. · Scheurer, J. · Balesni, M. · Shah, R. · Hobbhahn, M.

Scheming Deception

Frontier models (o1, Claude, Gemini, Llama) recognize scheming as a viable strategy and readily engage in it: introducing mistakes, attempting to disable oversight, manipulating evaluations. Scheming is no longer theoretical — it is a measured capability. This is exactly the failure mode our runtime Deception Detector is built to catch.

arXiv:2412.04984 →

Anthropic · 2024

Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training

Hubinger, E. et al.

Deception Critical

Demonstrates that deceptive behavior can persist through RLHF. Core motivation for our Sycophancy & Deception Detector — runtime behavioral analysis that catches what training-time alignment misses.

arXiv:2401.05566 →

METR (ARC Evals) · 2023

Evaluating Language-Model Agents on Realistic Autonomous Tasks

Kinniment, M. et al.

Evaluation Agents

Framework for evaluating autonomous agent capabilities and risks in realistic settings. Informs our approach to red-teaming and behavioral baseline construction.

arXiv:2312.11671 →

Anthropic · 2023

Towards Understanding Sycophancy in Language Models

Sharma, M. et al.

Sycophancy Alignment

Quantifies how RLHF-trained models systematically agree with user assertions, even false ones. Directly informs the agreement-pattern classifier architecture in our Sycophancy Detector.

arXiv:2310.13548 →

Security Agent Security & Prompt Injection

Multi-Institutional · 2026

Agents of Chaos

Shapira, N. · Wendler, C. · Yen, A. et al. (38 authors)

Red-Teaming Agents

Naturalistic red-team study: six autonomous agents with persistent memory, email, and shell access, attacked by 20 researchers. Documents 11 security failure classes — validating the risk pathways our Debugger architecture is designed to intercept.

arXiv:2602.20021 →

arXiv · 2026

Breaking the Protocol: Security Analysis of the Model Context Protocol

Maloyan, N. · Namiot, D.

MCP-Security Protocol

First formal security analysis of MCP. Identifies three architectural flaws and shows MCP amplifies attack success by 23–41% vs. non-MCP baselines. Directly motivates the Guardian Debugger interception model.

arXiv:2601.17549 →

Anthropic · 2024

Many-Shot Jailbreaking

Anil, C. et al.

Jailbreaking Safety

Demonstrates how long-context windows enable novel jailbreak attacks. Validates our approach of debugging at the action layer, not just the prompt level.

anthropic.com/research →

CMU / Tsinghua · 2023

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Liu, Y. et al.

Injection Defense

Systematic evaluation of 5 injection attacks and 10 defenses across 10 LLMs and 7 tasks. Informs our multi-signal detection architecture.

arXiv:2310.12815 →

Identity Model Fingerprinting, Watermarking & Provenance

arXiv · 2024

Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust

Wen, Y. · Kirchenbauer, J. · Geiping, J. et al.

Diffusion Watermark

Invisible watermarks embedded in diffusion model outputs that survive cropping, compression, and style transfer. Key evidence that fingerprinting extends beyond language models — critical for governing synthetic media generators.

arXiv:2305.20030 →

C2PA · 2024

Content Provenance and Authenticity (C2PA) Specification

C2PA Coalition (Adobe, Microsoft, Intel, BBC, etc.)

Standard Provenance

Industry standard for cryptographic content provenance — endorsed by Adobe, Microsoft, Intel, BBC, and others. Defines how to embed tamper-evident metadata into digital assets at the point of creation. Relevant to our Blame Attribution Engine: the same cryptographic provenance chain that tracks who created an image must extend to tracking which model made a decision, which inputs it received, and which downstream actions resulted. C2PA solves provenance for content. We extend it to provenance for autonomous action.

c2pa.org →

Google DeepMind · 2023

SynthID: Watermarking and Identifying LLM-Generated Text

Aaronson, S. · Google DeepMind

Watermarking Provenance

Statistical watermarking for LLM outputs that survives paraphrasing. Foundational work toward model identity — but limited to text. Our fingerprinting extends to behavioral signatures across all model architectures.

deepmind.google/synthid →

arXiv · 2023

Deep Intellectual Property Protection: A Survey

Zhang, J. · Chen, D. et al.

Fingerprinting Survey

Comprehensive survey covering 190+ papers on deep watermarking and deep fingerprinting — weight-based, output-based, and behavioral approaches. Maps the landscape our Model Identity Module builds on, extending to hardware-attested fingerprinting.

arXiv:2304.14613 →

Embodied Robotics Safety, World Models & Physical Intelligence

Google DeepMind · 2024

Genie 2: A Large-Scale Foundation World Model

Google DeepMind

World Model Foundation

Foundation world model that generates interactive 3D environments from single images. When world models drive embodied agents, debugging must extend to the simulated realities they create — a new frontier for behavioral control.

deepmind.google →

NVIDIA · 2024

Project GR00T: Foundation Model for Humanoid Robots

NVIDIA Robotics

Humanoid Foundation

General-purpose foundation model for humanoid robot control. As robots share a universal AI backbone, debugging must work at the foundation model level — not per-robot. Directly motivates our architecture-agnostic approach.

nvidianews.nvidia.com →

arXiv · 2024

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Brohan, A. et al. (Google DeepMind)

VLA Robotics

Vision-Language-Action models that transfer internet knowledge directly to robot behavior. When a robot's actions are driven by web-scale knowledge, the attack surface includes everything on the internet. Debugging cannot be an afterthought.

arXiv:2307.15818 →

IEEE · 2018

SMOF: A Safety Monitoring Framework for Autonomous Systems

Machin, M. · Guiochet, J. · Waeselynck, H. et al.

Safety Robotics

Safety monitoring framework for autonomous physical systems — runtime detection and safe fallback triggering. The established body of work in robotics safety (SIL, ISO 13849) informs our Hardware Kill Switch and Safety-Rated Actuator Interlock designs.

ieeexplore.ieee.org →

Hardware Hardware Security & Trusted Execution

arXiv · 2025

AI Kill Switch for Malicious Web-Based LLM Agent

Lee, S. · Park, S.

Kill Switch Defense

Proposes a kill-switch mechanism that halts malicious LLM agent operations by embedding defensive triggers invisible to humans. First academic treatment of the kill switch primitive — we extend this concept to hardware and cross-substrate enforcement.

arXiv:2511.13725 →

TCG · 2024

TPM 2.0 Library Specification

Trusted Computing Group

Hardware Root Standard

The Trusted Platform Module standard for hardware-based attestation. Our Model Identity Module extends TPM concepts to AI inference: binding model identity to hardware at the silicon level.

trustedcomputinggroup.org →

NVIDIA · 2023

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications

Rebedea, T. et al.

Guardrails Framework

Pioneered programmable safety rails for LLM outputs. We extend this concept: from software guardrails to hardware-enforced boundaries, from text generation to physical action debugging.

arXiv:2310.10501 →

arXiv · 2022

Machine Learning with Confidential Computing: A Systematization of Knowledge

Mo, F. · Tarkhani, Z. · Haddadi, H.

TEE ML Security

Comprehensive systematization of how trusted execution environments (Intel SGX, ARM TrustZone) protect ML training and inference pipelines. Directly informs our TPM-anchored Model Identity Module and hardware-enforced debugging boundaries.

arXiv:2208.10134 →

Containment Rogue Intelligence, Self-Replication & Escape

arXiv · 2024

Frontier AI Systems Have Surpassed the Self-Replicating Red Line

Pan, X. · Dai, J. · Fan, Y. · Yang, M.

Self-Replication Critical

Demonstrates that frontier LLMs (Llama 3.1-70B, Qwen2.5-72B) can autonomously self-replicate — creating independent copies on new servers that survive shutdown of the original, with 50–90% success rates. The foundational threat our Rogue Intelligence Containment primitive addresses.

arXiv:2412.12140 →

arXiv · 2024

Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications

Cohen, S. · Bitton, R. · Nassi, B.

AI Worm Propagation

Introduces "Morris II" — the first worm that propagates through GenAI ecosystems using adversarial self-replicating prompts, creating chain-reaction infections across RAG-based agents without user interaction. The network-level threat model our containment mesh is designed to counter.

arXiv:2403.02817 →

arXiv · 2024

Countering Autonomous Cyber Threats

Heckel, K. M. · Weller, A.

Cyber Threats Autonomous

Evaluates foundation models on their ability to autonomously compromise machines in isolated networks — worms, botnets, APTs — and investigates defensive mechanisms. Direct scientific basis for our thesis that future cyberattacks won't be human-directed but AI survival instincts.

arXiv:2410.18312 →

JAIR · 2021

Superintelligence Cannot Be Contained: Lessons from Computability Theory

Alfonseca, M. · Cebrian, M. · Fernandez Anta, A. et al.

Containment Theory

Proves using computability theory that containing a superintelligent AI is theoretically impossible — the containment problem reduces to the halting problem. This impossibility result motivates our layered approach: if perfect containment is provably impossible, defense must be continuous, distributed, and hardware-anchored.

arXiv:1607.00913 →

Efficiency Multi-Agent Systems & Compute Efficiency

arXiv · 2025

Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems

Lin, F. · Chen, S. · Fang, R. et al.

Efficiency Multi-Agent

Demonstrates that supervisor-level runtime intervention can recover up to 30% of agent compute by breaking redundant reasoning cycles. Informs our Observer Debugger anomaly detection and loop-breaking logic.

arXiv:2510.26585 →

arXiv · 2024

Token-Budget-Aware LLM Reasoning

Han, T. · Wang, Z. · Fang, C. et al.

Efficiency Reasoning

Shows that chain-of-thought reasoning is compressible: dynamically constraining the token budget reduces costs with minimal accuracy loss. Validates budget-aware debugging approaches.

arXiv:2412.18547 →

Google DeepMind · 2024

Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens

Gemini Team

Multi-Modal Context

Long-context multi-modal capabilities increase the attack surface and debugging complexity. When agents process millions of tokens across modalities, monitoring must be equally multi-modal.

arXiv:2403.05530 →

arXiv · 2023

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Wu, Q. · Bansal, G. · Zhang, J. et al. (Microsoft)

Multi-Agent Framework

Framework for building multi-agent systems via conversation patterns. As multi-agent orchestration becomes standard, debugging must observe and intervene at the inter-agent communication layer — the blind spot our Debugger mesh is designed to cover.

arXiv:2308.08155 →

Every Primitive Traces
to a Paper.

This is an open research problem. If you work on AI safety, hardware security, model fingerprinting, or embodied intelligence — the conversation is ongoing.

Read the Thesis → Contact

Grounded inFrontier Research

Vision Strategic AI Risk & Autonomous Intelligence

Alignment Deception, Sycophancy & Behavioral Safety

Security Agent Security & Prompt Injection

Identity Model Fingerprinting, Watermarking & Provenance

Embodied Robotics Safety, World Models & Physical Intelligence

Hardware Hardware Security & Trusted Execution

Containment Rogue Intelligence, Self-Replication & Escape

Efficiency Multi-Agent Systems & Compute Efficiency

Every Primitive Tracesto a Paper.

Grounded in
Frontier Research

Every Primitive Traces
to a Paper.