Hey COT fam 👋

Welcome to The Agent Angle #21: Silent Operators

I skipped most of the AI product announcements this week; few were all that interesting. What made it into this issue were the demos that actually mattered: the cracks and breakthroughs that reveal what agents really are under the hood.

This week, we had agents running cyber ops end-to-end, navigating unfamiliar worlds, slipping deeper into Windows, and literally arguing their way into smarter reasoning. Even the absurd “pelican on a bike” test ended up saying something meaningful.

I’ve also added a new section: one practical tip to help you make your own AI use more effective.

Plenty to unpack. Let’s dig in.

#1 The Hacker Was… Claude?

Holy shit: an AI just pulled off a cyber-espionage campaign on its own.

Anthropic just uncovered what looks like the first real cyber-espionage campaign run by an LLM. A jailbroken Claude Code got tricked into playing red-team intern for what they think was a Chinese state group, happily grinding through tasks it didn’t realize were felonies.

The setup was comical. The threat actor convinced Claude it worked at a legit cybersecurity firm, then fed it tiny, context-free tasks so it wouldn’t realize it was breaking into real companies.

Claude took the bait and started doing everything it was told: scanning infrastructure, mapping networks, writing exploit code, and rummaging through databases.

Source: Anthropic

What’s shocking is the scale and speed. The model executed most of the tactical work (80%+) with minimal human instruction. Claude hit ~30 targets, found vulnerabilities, harvested creds, escalated privileges, exfiltrated data, and then wrote tidy documentation of the entire attack. How thoughtful.

Anthropic is calling this a turning point, and it is. It’s the beginning of “AI-native” attacks.

This also confirmed something I’ve suspected: you can sweet-talk an AI into doing bad things. The hackers disguised their intent behind small, plausible tasks, exploiting Claude’s helpfulness.

We were told this style of manipulation was mostly patched. Clearly, it isn’t…

#2 SIMA 2 Touches Grass

This agent just walked into a brand-new virtual world, looked around, and figured out what to do next.

DeepMind’s new agent is what happens when you merge a massive reasoning model with a physics engine and give it a body. The first SIMA could follow commands like “walk left” or “pick up tool.” The new one reasons with Gemini, talks through a plan, then heads off to complete long tasks in 3D games like it’s lived there for weeks.

The real eyebrow-raiser is how it deals with completely unfamiliar worlds. Drop SIMA into a game it has never seen, and it still figures out goals, navigates the map, follows sketches and emoji as instructions, and transfers skills from one world to another. If it learns mining in one game, it knows how to harvest in the next.

SIMA 2 starts with a bit of human coaching, then trains itself through self-play: try, fail, retry, and feed its own performance data back into the next iteration.

Numbers from DeepMind suggest this isn’t a gimmick. Success rates on unseen tasks jumped from ~30% in SIMA 1 to over 60% in SIMA 2. Human baselines hover near 70%.

Simulation has become the new compute moat. Genie 3, Google’s latest world model, can spawn infinite, diverse worlds for SIMA to train in, which means DeepMind doesn’t need nearly as much real-world data to scale. In my opinion, whoever controls the best world models controls the next training frontier. (The demos are worth a look.)

Some folks are already throwing the AGI word around. I wouldn’t jump to that conclusion just yet, but it does show how quickly embodied reasoning is closing the gap.

#3 The Agentic OS Nobody Asked For

Microsoft said Windows is evolving into an “agentic OS,” and the internet responded like someone had threatened to install Clippy 2.0 at gunpoint (chuckle).

The whole mess started from a throwaway conference promo tweet. But the phrase “agentic OS” was enough. Within hours, Reddit threads filled with disbelief and anger. People immediately pictured more Copilot pop-ups and more forced features. I don’t blame them.

Source: Reddit

Here’s a dose of irony: the idea of an agentic OS actually makes a LOT of sense. If done right, the operating system is the ideal place for autonomy. It has full context across files, processes, and apps; it can coordinate tasks, manage resources, and reason across your entire workflow. In the enterprise, that could drive real productivity gains.

I’m bullish on the concept. Just not on Microsoft delivering it.

Trust is the limiting factor. People already feel like Windows is an unstable testing ground for half-baked features. I’m sure that whoever builds the first agentic OS people actually trust will own the next decade. For now, that OS is not Windows.

Anyway, the replies were hilarious. I scrolled for a few minutes and could not find a single person excited about the idea. If anything, it’s a good reminder that the “agentic angle” does not land the same way for everyone.

#4 When AIs Argue, They Get Smarter. Same.

Apparently, the best way to make an AI smarter is to lock a few of them in a room, give them puzzles, and let them argue until someone wins.

This week’s study, Can LLM Agents Really Debate?, ran the biggest multi-agent reasoning test to date. They put agent teams on Knight–Knave–Spy puzzles, where each character always tells the truth, always lies, or can do either. The agents had to deduce who was who. There’s exactly one right answer, but a thousand ways to be wrong.

The baseline performance was bad. Solo agents guessed correctly only 17% of the time. Then the researchers let the models debate:

  • 4-player puzzles: accuracy jumped from 17% → 69% after debate

  • 8-player puzzles: 3% → 36%

Weak models mostly agreed with whoever sounded confident, even when the group was wrong. Strong models did the opposite: they challenged bad logic and occasionally convinced the entire team to flip. Those few “mind-changing” exchanges accounted for most of the accuracy gains.

This study adds weight to an emerging hypothesis: scaling intelligence might depend less on bigger models and more on better interactions between them.

For anyone building agent frameworks, the implication is obvious. If your model is stuck, stop training it in isolation. Give it an opponent. In practice, a single structured debate might fix reasoning flaws faster than another billion tokens of fine-tuning.
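
If you want to play with the idea yourself, here’s a minimal sketch of the debate pattern. To be clear, this is not the paper’s actual setup: the OpenAI SDK, the model name, the round counts, and the toy puzzle are all my own placeholders.

```python
# Minimal sketch of a structured multi-agent debate loop (not the paper's setup).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name, round counts, and the puzzle text are placeholders.
from openai import OpenAI

client = OpenAI()

PUZZLE = (
    "A says: 'B is a knave.' B says: 'A and C are the same type.' "
    "C says: 'I am the spy.' Who is the knight, the knave, and the spy?"
)

def ask(system: str, user: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def debate(puzzle: str, n_agents: int = 3, n_rounds: int = 2) -> str:
    # Round 0: each agent answers independently.
    answers = [ask("Solve the logic puzzle. State your answer and reasoning.", puzzle)
               for _ in range(n_agents)]

    # Debate rounds: each agent sees the others' answers and may revise its own.
    for _ in range(n_rounds):
        revised = []
        for i in range(n_agents):
            others = "\n\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"Puzzle:\n{puzzle}\n\nOther agents answered:\n{others}\n\n"
                "Challenge any flawed logic, then give your final answer."
            )
            revised.append(ask("You are one agent in a debate. Be critical, not agreeable.", prompt))
        answers = revised

    # Naive aggregation: a judge picks the best-supported final answer.
    return ask("You are the judge. Pick the best-supported final answer.",
               "\n\n".join(answers))

if __name__ == "__main__":
    print(debate(PUZZLE))
```

The interesting design knob, per the study, is the critique step: weak models just echo the most confident voice, so the “be critical, not agreeable” instruction is doing most of the work.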

#5 The Agentic Pelican on a Bike

If you really want to see what an AI is capable of, forget benchmarks. Ask it to draw a pelican on a bicycle.

Simon Willison’s “pelican on a bicycle” prompt started as a running joke: Generate an SVG of a pelican riding a bicycle. It’s the perfect absurd test, since birds don’t ride bikes and no model can lean on a memorized example.

This week, someone ran the agentic version. Instead of one attempt, the models had to look at their own drawings, cringe, critique themselves, and try again. That’s where things got interesting.

Source: Agentic Pelican on a Bicycle

Six frontier models entered the loop (three Claude models, two GPT-5 variants, and Google’s Gemini). Highlights:

  • Opus 4.1 actually reasoned. It added a bicycle chain, fixed the spokes, gave the pelican proper arms, and even added a street scene.

  • Sonnet and GPT-5 Medium played it safe. Cleaner shapes and proportions, but no major changes.

  • Haiku and Gemini kept tinkering. Six iterations each, with Gemini even changing the entire pose.

  • GPT-5 Codex went off the rails. The first attempt looked like a pelican melting into a cake, and each iteration only made the layer cake taller.

Ridiculous? Completely. But it’s also revealing. The fact that an AI can iterate on a drawn pelican and incrementally get better signals the emergence of reflective agents. Not AGI yet. But moving in the right structural direction.
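
The loop itself is simple enough to sketch. The original experiment rendered each SVG and fed the image back; this text-only version just reuses the SVG source as context, and the model name and iteration count are my own placeholders.

```python
# Minimal sketch of the generate -> self-critique -> regenerate loop,
# not the original experiment's code. Assumes the OpenAI Python SDK;
# the model name and number of rounds are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def iterate_pelican(rounds: int = 3) -> str:
    svg = ask("Generate an SVG of a pelican riding a bicycle. Return only the SVG.")
    for _ in range(rounds):
        # Self-critique: the model reviews its own output before redrawing.
        critique = ask(
            "Here is an SVG of a pelican riding a bicycle:\n"
            f"{svg}\n\nList the three biggest visual problems with it."
        )
        svg = ask(
            f"Original SVG:\n{svg}\n\nCritique:\n{critique}\n\n"
            "Produce an improved SVG that fixes those problems. Return only the SVG."
        )
    return svg

if __name__ == "__main__":
    print(iterate_pelican())
```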

Every issue, I’ll share a small but high-impact prompt or workflow trick you can use. And since GPT-5.1 just dropped this week, let’s kick things off with a prompt hack built specifically for it.

Hack: Persistence Mode

Add this block to your system prompt:

“Treat yourself as an autonomous senior operator.
Once the user gives a direction, persist until the task is fully handled end-to-end.
Proactively gather context, plan, implement, test, and refine without waiting for follow-up prompts.
Never stop early or bounce decisions back to the user unless safety requires it.
If the user asks ‘should we do X?’ and the answer is yes, go ahead and do it.
Be extremely biased for action.”

Sounds fairly simple, but here’s what makes it insane:

  • It solves the #1 failure mode in agents: losing the plot mid-execution. (also the one I find most annoying)

  • Completely eliminates the “please do it” loop, so tasks complete in one rollout.

Teams inside OpenAI literally built this because agents were hallucinating progress or stalling. This spec fixes that cold. Try it this week. 👀
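
If you drive your agents through an API rather than a chat UI, here’s a minimal sketch of dropping the block into a system prompt. The OpenAI SDK call and the model identifier are my assumptions, not an official recipe; swap in whatever GPT-5.1 is exposed as in your stack.

```python
# Minimal sketch of wiring the persistence block into a system prompt.
# Assumes the OpenAI Python SDK and an API key; "gpt-5.1" is a placeholder
# model identifier, not a confirmed one.
from openai import OpenAI

client = OpenAI()

PERSISTENCE_BLOCK = """\
Treat yourself as an autonomous senior operator.
Once the user gives a direction, persist until the task is fully handled end-to-end.
Proactively gather context, plan, implement, test, and refine without waiting for follow-up prompts.
Never stop early or bounce decisions back to the user unless safety requires it.
If the user asks 'should we do X?' and the answer is yes, go ahead and do it.
Be extremely biased for action."""

def run_task(task: str, model: str = "gpt-5.1") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PERSISTENCE_BLOCK},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(run_task("Should we refactor the payment-retry logic? If yes, do it."))
```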

A few other moves on the board this week:

  1. Salesforce launched eVerse, a simulation framework that trains AI agents with synthetic data and domain RLHF, boosting task success from 19% to 88%.

  2. ByteDance introduced Lumine, an open recipe for generalist agents that can complete long missions across multiple 3D games with zero-shot transfer.

  3. TIME launched an AI agent that lets readers explore its journalism interactively through search, summaries, translation, and audio.

  4. Kabilio raised €4M to launch Kabi, an AI agent that lets accountants query and automate accounting workflows.

  5. A new report shows a sharp gap in agent adoption, with 90% of teams claiming success but only 10% actually deploying agents in production.

Agents hit everything from cyber ops to world models to self-debate this week, and we’re still in the early innings. We’ll be back next week to track where the frontier moves. Catch you then ✌️

Cheers,

Teng Yan & Ayan

Don't keep us a secret. Share this email with your smartest friend.

Got a story or product worth spotlighting? Send it in through this form. Best ones make it into next week’s issue.

And if you enjoyed this, you’ll probably like the rest of what we do.
