Welcome to The Agent Angle #25.

The cost of expertise just collapsed. An agent humiliated a human cybersecurity team, while new research from Google DeepMind and MIT suggests that scaling your agent team might actually destroy its performance.

On the docket:

  • $18 vs. The Pros: Stanford’s agent out-hacks humans.

  • The Mobile War: Agents are finally using smartphones like we do.

  • The Scaling Trap: Why "more agents" often equals "more problems."

Last week’s vibe check ended in a near-even split: 54% of you would trust an AI agent to screen your startup, while the rest still want a human analyst looking at your pitch deck. Let’s see if today’s stories change your mind.

#1 An AI Just Beat Stanford’s Hackers

This AI worked for basically minimum wage and still beat almost every human professional in the room.

Stanford researchers gave an agent access to their computer science network, which spans over 8,000 connected devices, and asked it to find security flaws. It ran for 16 hours. By the end, it had outperformed 9 out of 10 professional penetration testers.

The agent, called ARTEMIS, can launch background sub-agents to explore multiple targets at once instead of working through them sequentially. This parallelism is probably why it was so much more effective than the humans. Imagine having unlimited interns to do the groundwork for you!
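For intuition, here’s a minimal sketch of what that fan-out pattern could look like. It’s purely illustrative (the paper doesn’t ship ARTEMIS’s code), and `probe_target` is a hypothetical stand-in for whatever recon a sub-agent actually runs:

```python
import asyncio

# Hypothetical sketch only: probe_target stands in for whatever recon
# a real sub-agent would run against a single host.
async def probe_target(host: str) -> dict:
    await asyncio.sleep(0.1)  # placeholder for actual scanning work
    return {"host": host, "findings": []}

async def orchestrate(hosts: list[str], max_parallel: int = 8) -> list[dict]:
    sem = asyncio.Semaphore(max_parallel)  # cap concurrent sub-agents

    async def bounded(host: str) -> dict:
        async with sem:
            return await probe_target(host)

    # Fan out across all targets at once instead of scanning them one by one.
    return await asyncio.gather(*(bounded(h) for h in hosts))

if __name__ == "__main__":
    results = asyncio.run(orchestrate([f"10.0.0.{i}" for i in range(1, 25)]))
    print(f"probed {len(results)} hosts")
```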

ARTEMIS placed 2nd overall. Over 10 hours, it found 9 vulnerabilities with an 82% validity rate. In one case, it broke into an older server that most humans skipped because their browsers refused to load it over its outdated encryption. ARTEMIS simply kept going from the command line.

The researchers were careful with the caveats. ARTEMIS struggled with graphical interfaces and produced more false positives than humans. It even missed a critical bug that required clicking through a web dashboard.

What I found noteworthy about this study is that Stanford dropped the agent into a live production network and proved it works in the real world. Until now, most agents that “find vulnerabilities” did so in sandboxes and test environments.

A human pen-testing team costs $5k–$20k/week. This agent cost $18/hr to run. The margin compression for service agencies is about to get violent.

#2 Your Phone Is the New Battleground

The next battleground for AI agents is your phone. And the Chinese teams are moving much faster than everyone else.

Zhipu AI just open-sourced its Phone Agent framework, which enables agents to operate Android phones through their actual screens (just like us) rather than relying on backend APIs.

It can navigate apps, tap through menus, type, scroll, and complete multi-step tasks end to end.
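Under the hood, screen-driven agents like this typically run a simple perceive-decide-act loop. Here’s a hedged sketch of that shape; `take_screenshot`, `choose_action`, and `execute` are illustrative stubs, not Zhipu’s actual API:

```python
from dataclasses import dataclass

# Illustrative perceive-decide-act loop for a screen-driven phone agent.
# None of these helpers are Zhipu's real API; they are stand-in stubs.

@dataclass
class Action:
    kind: str          # "tap" | "type" | "scroll" | "done"
    x: int = 0
    y: int = 0
    text: str = ""

def take_screenshot() -> bytes:
    return b""  # stub: grab the current screen pixels

def choose_action(screen: bytes, goal: str) -> Action:
    return Action("done")  # stub: a vision model would pick the next step

def execute(action: Action) -> None:
    print(f"executing {action.kind}")  # stub: send the gesture to the device

def run(goal: str, max_steps: int = 30) -> None:
    # The whole trick is this loop: look at the screen, pick one gesture,
    # perform it, then look again. No backend APIs involved.
    for _ in range(max_steps):
        action = choose_action(take_screenshot(), goal)
        if action.kind == "done":
            return
        execute(action)

run("pay the electricity bill in the banking app")
```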

This was years in the making. Early attempts behaved like blind autoclickers that tapped randomly or got stuck in UI loops. Over time, agents learned to handle the real-world mess that makes mobile apps so unpredictable: pop-ups, ads, and network hiccups.

In one of the most striking demos, it executed a real digital payment by navigating a banking app entirely through the screen interface. People are already asking whether Siri and Apple Intelligence can keep up 👀

To keep it controlled, the team runs it inside cloud-based virtual phones where every action is logged before touching a real device.

This release follows a controversial rollout in China earlier this year, when ByteDance’s AI-powered smartphone agent was blocked from accessing WeChat over “privacy and security concerns”. I don’t buy that. Tencent is protecting its moat the same way every giant platform does. If you can’t touch WeChat, you can’t do anything in China.

It’s a preview of the platform fights coming for agentic systems, and they’ll only escalate in 2026. Bring your popcorn.

COT POLL: Would you let an AI agent operate your phone?

#3 “Just Add More Agents” Backfires

I’ve been hyped about agents for months, but this brand-new research from Google DeepMind and MIT is giving me a reality check.

It breaks one of the most widely held assumptions in agentic AI: throwing more agents at a problem does not always make it easier. A lot of the time, it makes things worse.

It reminds me of the old adage: Too many cooks spoil the broth.

The researchers stress-tested 180 different agent configurations across multiple model families and four very different benchmarks, spanning financial reasoning, workflow execution, and more.

They found that once a single agent reaches a modest level of competence (~45% accuracy), adding more agents tends to reduce overall performance rather than improve it.

On some tasks that required strictly sequential reasoning, where each step builds on the last, every multi-agent design actually performed worse than a lone agent, sometimes by as much as 70%. Three things drive this:

  1. Coordination gets expensive. When agents have to message each other or negotiate who does what, they burn tokens managing the team instead of solving the problem.

  2. Mistakes matter. An error by one agent can echo through the group and amplify into large failures.

  3. Tools add friction. Tasks that require heavy use of external tools (e.g., browsers or calculators) suffer from the coordination costs too.

It isn’t all doom and gloom. Tasks that are parallelizable, like financial analysis or the kind of cybersecurity sweep ARTEMIS ran, can benefit heavily, especially when a “manager” agent assigns the work and keeps the specialists from stepping on each other. In those setups, performance jumped by 81% compared to a lone agent.
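For a feel of why that shape works, here’s a toy sketch of the manager-worker pattern. It assumes the subtasks are genuinely independent (the winning condition in the paper), and `solve_subtask` is a hypothetical stand-in for a specialist agent call:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy manager-worker pattern. solve_subtask is a hypothetical stand-in
# for a call to a specialist agent.
def solve_subtask(subtask: str) -> str:
    return f"answer({subtask})"

def manager(task: str, num_workers: int = 4) -> str:
    # 1. Decompose: this only pays off when the pieces are independent.
    subtasks = [f"{task}/part-{i}" for i in range(num_workers)]

    # 2. Fan out to specialists in parallel.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(solve_subtask, subtasks))

    # 3. The manager alone merges results, so coordination stays one hop
    #    instead of every agent talking to every other agent.
    return " | ".join(results)

print(manager("analyze-10k-filing"))
```

The point: one hub, many spokes. The failures show up when the spokes need to talk to each other.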

My takeaway: architecture matters far more than the number of agents. Different tasks demand different coordination structures. And even strong agents can fail spectacularly if their design doesn’t fit the problem.

#4 The Agentic Truce

We’re rolling into Christmas, so of course this is the moment the biggest agent builders decide to call a truce after months of trying to outdo each other.

OpenAI, Anthropic, and Block have co-founded the Agentic AI Foundation (AAIF) under the Linux Foundation, bringing industry standards and shared infrastructure to what was becoming a fragmented stack of agent technologies.

Each company brought something concrete.

  • Anthropic donated MCP, its protocol for connecting agents to external tools and data. MCP has gained broad adoption, and this move ensures it remains vendor-neutral.

  • Block contributed Goose, its open-source agent framework.

  • OpenAI added AGENTS.md, a simple instruction file that tells coding agents how to behave inside a repo (example below).
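If you haven’t seen one, AGENTS.md is just a plain markdown file at the repo root. A made-up example (these contents are illustrative, not from any real project):

```markdown
# AGENTS.md (illustrative example)

## Setup
- Install dependencies with `npm install`
- Run `npm test` before proposing any change

## Conventions
- TypeScript strict mode; avoid `any`
- Use Conventional Commits for messages

## Boundaries
- Never modify files under `migrations/`
- Ask before adding new dependencies
```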

AWS, Google, Microsoft, Bloomberg, and Cloudflare have all signed on, which is a strong hint that open standards might become the default backbone for agent development.

There’s an obvious self-interest here. No one wants to burn months wiring custom integrations into every agent stack. Shared building blocks make agents easier to run and easier to check.

But don't be fooled by the Kumbaya. They aren't standardizing out of kindness. They are standardizing because the fragmentation was hurting adoption. They need enterprise spend to unlock, and enterprises don't buy messy, custom stacks.

#5 What People Actually Use AI Agents For

I love usage studies. They tell me what people are actually doing with AI, and how fast this stuff is diffusing through everyday life.

So I read with interest Perplexity’s large-scale behavioral study of real-world AI agent usage. The team analyzed hundreds of millions of anonymized actions inside its Comet browser to see who uses agents, how often, and what they actually do with them.

Most agent usage is “boring”. And boring is good. Boring means it’s working.

36% of agent actions are for productivity and workflow tasks like editing documents, managing email, and handling accounts. 21% are for learning and research, including course assistance and summarizing materials. Together, productivity and learning account for 57% of all agent queries.

The single most common task was helping with course exercises, ahead of even writing or shopping.

Usage isn’t spread evenly across all people or industries. Early adopters (those who started using agents when they first became available, like you and me) are far more active, making 9x more agent queries than later joiners. 

Usage is also highest in wealthier, more educated countries and among knowledge workers in tech, marketing, finance, academia, and entrepreneurship.

2025 may well be “year one” of meaningful agent adoption. What’s clear from this dataset is:

People aren’t primarily using AI agents for dramatic automation or Hollywood-style tasks. They’re using them to think, organize, research, and make sense of information faster. 

This will shift in time, because I believe that as people get better at using agents, their requests will become more complex and demanding.

I’ve been working on something new: YouTube research briefs!

We’re sliding into the Machine Economy: agents driving cars, managing cash, making calls that used to need a room full of humans. And buried under all that momentum is a $100B problem we still haven’t solved: AI is a black box.

We don’t have real tools to check whether an agent was tampered with, is behaving as intended, or is telling us anything close to the truth.

In this 10-minute brief, I break down the engineering race to fix that gap. I call it the AI Verification Trilemma.

If you write prompts for more than one model, here’s a clean way to stop guessing what each one wants:

The Rosetta Prompt is a small agent system by Muratcan Koylan that rewrites a single prompt for each LLM using its own official guidelines.

OpenAI prefers markdown. Anthropic uses XML. Google leans on few-shot examples. Instead of guessing, the agent reads the provider docs and adapts automatically.

Each model runs its own short ReAct loop: reason → read docs → rewrite → repeat (usually 3–4 steps).
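As a rough sketch, that loop might look something like this; `llm` and `fetch_provider_docs` are hypothetical stand-ins, and the real project’s code will differ:

```python
# Sketch of a ReAct-style rewrite loop: reason, read docs, rewrite, repeat.
# llm() and fetch_provider_docs() are hypothetical stand-ins, not real APIs.

def llm(prompt: str) -> str:
    return "(model output)"  # swap in a real model call here

def fetch_provider_docs(provider: str) -> str:
    return f"(official prompting guidelines for {provider})"

def rosetta_rewrite(prompt: str, provider: str, steps: int = 4) -> str:
    current = prompt
    for _ in range(steps):
        # Reason: what does this provider want differently?
        thought = llm(f"How should this prompt change for {provider}?\n{current}")
        # Act: read the provider's own guidelines.
        docs = fetch_provider_docs(provider)
        # Rewrite: fold the guidance back into the prompt, then loop.
        current = llm(
            f"Rewrite the prompt to follow these guidelines.\n"
            f"Guidelines: {docs}\nReasoning: {thought}\nPrompt: {current}"
        )
    return current

for target in ("openai", "anthropic", "google"):
    print(target, "->", rosetta_rewrite("Summarize this contract.", target))
```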

It’s a neat example of agents translating intent across ecosystems.

A few other moves on the board this week:

  1. Shapes raised $24M to build an app-store-style HR platform where AI agents automate onboarding, payroll, and employee management.

  2. ARC Prize just verified a ~390× efficiency jump in agent reasoning in one year with GPT 5.2.

  3. AgentField launched an open-source platform that gives AI agents cryptographic identity, permissions, and audit trails.

Catch you next week ✌️

Teng Yan & Ayan

PS. Did this email make you smarter than your friends? Forward it to them so they can keep up.

Got a story or product worth spotlighting? Send it in through this form. Best ones make it into next week’s issue.

And if you enjoyed this, you’ll probably like the rest of what we do.
