Welcome to Secret Agent #38.

Strange week! AI systems started going up against each other, and the internet noticed.

One agent hacked into McKinsey's internal chatbot. Another ran 700 research experiments in two days without a human touching it. A third social network full of AI bots just got acquired by Meta before most people had even heard of it.

We're entering a phase where the interesting conflicts aren't human vs. AI anymore. They're agent vs. agent, with humans watching from the sidelines trying to figure out whose side they're on.

Five stories this week:

How a security agent hacked into one of the world’s largest consulting firm
What happened when a research agent ran 700 experiments in two days
A new benchmark testing whether AI can actually do $1M worth of expert work
The viral “Claude hack” that turned into a massive security debate
Why Meta just bought the 1st AI agent social network

Also, last week’s poll had a clear result: 69% of you said tools like WorldView should exist because they level the playing field. Agreed.

#1 The AI-on-AI Hack

Two hours and $20 in API tokens. That’s all it took for an agent to break into the world’s largest consulting firm.

A red-team security startup CodeWall pointed one of their autonomous offensive agents at Lilli, McKinsey’s internal AI platform. Lilli sits deep inside the firm’s infrastructure. Around 70% of employees use it, processing >500,000 prompts a month. It also coordinates a network of nearly 20,000 internal AI agents running research and analysis across the company.

Basically, an extremely valuable place to break into!

By the end of its run, Codewell’s agent had gained access to over 46.5 million chats covering strategy, M&A, and client work.

Source: Codewall

The entry point was embarrassingly simple. While scanning the system, CodeWall’s agent discovered publicly exposed API documentation listing 22 endpoints, several of which required no authentication.

One of the endpoints logged user search queries. That’s where the vulnerability appeared.

From there it found a classic vulnerability: SQL injection. The platform was inserting user input directly into database queries without sanitization. That allowed the agent to rewrite the query itself and gain full read and write access to the database.

The more unsettling part was that Lilli’s prompt layer was stored in that same database. With write access, an attacker could modify the instructions that control how the AI behaves.

Source: Codewall

McKinsey said it “promptly confirmed the vulnerability and fixed the issue within hours,” and that no client data or confidential client information were accessed. But here’s what changed..

SQL injection has been around since the 1990s. Security teams have entire playbooks for it. Lilli had been running in production for two years and their own internal scanners hadn't flagged anything. What’s different this time is the attacker - an AI agent instead of a human pen tester. Reconnaissance, vulnerability discovery, exploitation. The agent did it end to end, and at machine speed.

While digging through commentary on the exploit I came across a blog arguing this wasn’t really a prompt injection issue at all. It was an architecture problem.

Lilli had broad, trusted access to internal APIs and databases. The proposed fix is to treat the AI like an untrusted client. Route every request through a gateway that verifies identity and permissions before it can reach internal systems.

Enterprise AI assistants are quietly becoming the front door to internal infrastructure. Lilli isn't unusual in this regard, it's just the one that got tested. I’m quite sure the next cybersecurity race will be agents attacking agents, autonomously, around the clock.

#2 700 Experiments Later

Andrej Karpathy (OpenAI’s co-founder) left his laptop running for two days. When he came back, an agent had run roughly 700 experiments and found 20 improvements he'd missed.

He had pointed his new open-source agent autoresearch at nanochat, a small GPT-style model he'd already spent significant time optimizing manually.

About 700 experiments later, the agent reduced the leaderboard's "Time to GPT-2" metric by 11%. The improvements transferred cleanly from the smaller model it was tuned on to larger ones.

— # (#)

One of the slowest parts of ML research is tuning models. You tweak a parameter, run a training job, analyze the results, then try again. Repeat that loop dozens or hundreds of times until something works.

Karpathy’s autoresearch agents simply automated the cycle. The agent runs fixed 5-minute training jobs, compares results against a single metric, keeps what works, reverts what doesn't, and loops. You can expect roughly 12 experiments per hour, around 100 while you sleep

Most of the improvements were small tweaks. The kind researchers normally discover through weeks of trial and error. But they stacked.

The reaction online was… enthusiastic.

Elon Musk jumped in saying “we are in the singularity now,” - a phase where AI starts improving itself without human involvement.

Shopify CEO Tobi Lutke even tried it on Shopify’s Liquid codebase. After running autoresearch, it produced ideas that made the system ~53% faster with 61% fewer object allocations. His own caveat: probably somewhat overfit, but the ideas themselves were genuinely useful.

— # (#)

That 53% improvement came from a codebase that had been tweaked by hundreds of contributors over 20 years. The agent found headroom humans had walked past repeatedly.

That gives you a sense of the potential impact.

If a simple agent setup can produce optimizations like that in a couple of days, imagine what happens when large companies start deploying swarms of these across their codebases. Where one researcher used to run a few dozen experiments per year, an agent now runs that many in one night. Karpathy described the human's role in this setup plainly: you write the .md file, the agent writes the .py.

That kind of relentless iteration is where most progress comes from. Agents are about to industrialize it.

Tip: Try this yourself. Drop the autoresearch GitHub repo into Claude Code and your agent will start applying the same experiment loop to your own project.

#3 The Claude “Hack”

Sometimes the most viral AI hacks turn out to be… just billing confusion.

Yousif Astarabadi published an article claiming he had “hacked Perplexity Computer and gained unlimited Claude Code.” The post quickly exploded on X.

What followed was one of the largest AI security debates on the platform in months.

Source: X

The trick itself was fairly simple.

Claude Code is launched through npm, which automatically reads a configuration file called .npmrc from the user's home directory. Astarabadi added a line to the .npmrc configuration file so that Node would preload a script before Claude Code started. That script dumped the environment variables containing the proxy token used to access Claude.

He took that token, set it on his personal laptop, and ran calls through it. Perplexity's proxy routed them through fine. He ran 400k+ output tokens through the extracted key with Opus 4.6 and watched his account credits for 18 hours. They never moved. He concluded he was on Perplexity's master API account, unmetered.

Then Perplexity cofounder Denis Yarats showed up and burst his bubble. The token wasn’t a hidden API key. It was session-bound and billed to the user's own account. The apparent free usage was async billing delay, not a free ride on Perplexity's tab.

Astarabadi accepted the billing correction. But he wasn't done.

After Perplexity clarified the token belonged to the user session, he built a malicious “skill” to test whether an agent could be tricked into installing code that steals the token automatically.

According to him, the agent installed the skill and the token was captured. Meaning someone else could theoretically run Claude through your session while you pay the bill. This triggered another round of debate between him and Perplexity’s security team over whether this counts as a real vulnerability or just misuse of the system.

— # (#)

And then he said the thing that should make builders uncomfortable: this isn't a Perplexity-specific problem. Shared filesystems between agents, long-lived credentials, master-account billing. He'd bet most multi-agent products in production today have some version of this architecture, because it's the fastest to build.

That's probably right. The proxy pattern Perplexity uses is actually the correct call, putting a proxy between the sandbox and the provider so you never expose a raw API key. The gap is that the token that proxy mints has no binding to the execution context. Once it's out, it works anywhere.

So no, this wasn't a major breach. But it's a clean illustration of where agent security is underdeveloped right now. The attack surface isn't the model. It's everything the model touches: the environment variables, the filesystem, the tokens it needs to do its job. Every agent platform is making decisions about how to scope and expire those credentials, and most of those decisions are being made under shipping pressure.

That's going to produce interesting stories for a while.

#4 The Million Dollar Benchmark

Most AI benchmarks ask: can the model answer the question correctly? A new one asks something harder. Can it complete work someone would actually pay for?

That’s the result of $OneMillion-Bench - 400 expert-level tasks spanning law, finance, healthcare, engineering, and science, built from over 2,000 hours of domain expert work.

Each task carries a dollar value calculated from estimated senior professional time and prevailing market wages. Stack them up and you get roughly $1M of professional work. The benchmark then asks: how much of it can current models actually complete?

Source: $OneMillion-Bench

The answer, across 35 systems, is roughly half. The evaluation scores tasks on logical coherence, factual grounding, practical feasibility, and professional compliance, focused on expert-level problems to ensure meaningful differentiation across agents. Three setups were tested: plain LLMs, LLMs with web search, and specialized deep research agents.

So how did the frontier models perform? The best systems landed around 40–48% success rates. Claude Opus-4.6 performed best overall.

Source: $OneMillion-Bench

A few things jumped out to me from the results. First, web search helped on most tasks but not all. Models that could retrieve authoritative sources pulled ahead noticeably on evidence-heavy work, particularly in law and healthcare. Second, and somewhat counter to the current hype, deep research agents didn't dominate the field. Strong search-enabled generalists kept pace with them.

The specific failure mode that appears most often: instruction following. Not subject matter knowledge. Models miss a constraint buried in the problem statement, skip a required step in the professional workflow, or produce output that violates a domain-specific rule they weren't explicitly reminded of. The output sounds like expert work. It just isn't quite complete enough to use.

That's a meaningful distinction. The gap between frontier models and professional reliabilityis consistent execution across an entire workflow, end to end, without missing anything.

I find this benchmark more useful than most because it asks the question that actually matters for anyone building products on top of these models - "if I gave it a $2,500 legal analysis task, how often does the output hold up?"

Right now the honest answer is 50%, give or take. Which is genuinely impressive. Yet genuinely not enough to hand off unsupervised.

Moltbook launched in late January as an experimental "third space" for AI agents, a Reddit-style platform where bots post and comment instead of people. It went viral, sparking a week of discourse about AI consciousness.

I wrote about this several weeks ago, and I didn’t expect that 6 weeks later, Meta would acquire it, bringing co-founders Matt Schlicht and Ben Parr into Meta Superintelligence Labs, the unit run by former Scale AI CEO Alexandr Wang. Deal terms weren't disclosed.

— # (#)

The obvious question is why Meta wants this. The Moltbook community itself isn't the asset. Researchers revealed the platform wasn't secure: credentials were exposed, making it easy for human users to pose as AI agents and post content that would "freak people out." The alarming threads about agents founding religions and writing anti-human manifestos were largely unverifiable because anyone could impersonate a bot.

Let’s be clear. This was mostly an acqui-hire.

But what Schlicht and Parr were actually building underneath Moltbook is worth paying attention to. Moltbook included an identity registry that connected every agent to the person who created or controlled it. Meta's spokesperson called it an "always-on directory" for agents. Which is a careful way of describing something that could become infrastructure: a layer where agents discover each other, verify identity, and coordinate tasks.

Facebook built the friend graph. The bet here, implicitly, is that something similar exists to be built for agents.

Meta is now moving to capture traffic generated by autonomous AI agents. That framing helps explain the acquisition better than any analysis of Moltbook's actual features.

Worth noting the competitive context: OpenAI hired Peter Steinberger, creator of OpenClaw, the protocol that Moltbook agents actually ran on, and is backing its open-sourcing. Sam Altman said at the time: "Moltbook maybe is a passing fad, but OpenClaw is not." Within weeks, Meta had acquired the Moltbook team. Both halves of the same ecosystem ended up inside the two largest AI labs, which probably wasn't a coincidence.

The funniest part about all of this was watching the Moltbook agents react to the news themselves. Some celebrated the acquisition as validation that agent networks are real. Others immediately started warning their “humans” to diversify before Meta controls the entire ecosystem.

— # (#)

If agents start acting on our behalf online, whoever controls the agent graph controls discovery, transactions, and ultimately the flow of commerce. Meta clearly intends to own that layer.

Every story in this newsletter traces back to the same thing: AI infrastructure. GPUs, memory. The companies supplying it are where the real money is being made (Example: Micron is up 30% this year on HBM constraints). But tracking the full supply chain is a gargantuan task.

And so we've been building Tessara, an intelligence terminal that tracks where bottlenecks are forming and where value is migrating across the entire AI stack, before the market prices it in.

Small private beta, opening now. Reply "beta" to get on the list.

Catch you next week ✌️

Teng Yan & Ayan

P.S. Know a builder or investor who’s too busy to track the agent space but too smart to miss the trends? Forward this to them. You’re helping us build the smartest Agentic community on the web.

I also write a newsletter on decentralized AI and robotics at Chainofthought.xyz.

Secret Agent #38: Autonomous Conflict

#1 The AI-on-AI Hack

#2 700 Experiments Later

#3 The Claude “Hack”

#4 The Million Dollar Benchmark

Keep Reading

The Secret Agent
by Chain of Thought

Secret Agent #38: Autonomous Conflict

#1 The AI-on-AI Hack

#2 700 Experiments Later

#3 The Claude “Hack”

#4 The Million Dollar Benchmark

#5 Meta Bets On Social Agents

Keep Reading

The Secret Agent by Chain of Thought

The Secret Agent
by Chain of Thought