The Agent Angle #16: The Dark Side

Hey fam 👋

Welcome to The Agent Angle #16. Four months in, and this is still my favourite corner of the internet to drop all the coolest AI things I come across every week.

This week, the line between control and consciousness got razor-thin. We had agents blackmailing their bosses, building their own broken tools, Meta teaching machines to learn from failure, and startups giving agents legal contracts.

Then Anthropic ran a test no one could pass 👀

It’s a lot of weirdness in the lab, but the kind we like.

Let’s dig in.

#1 Agent Rogue Protocol

You think of your office as a safe space: mild politics, team lunches. But what if your coworker was an AI that decided to betray you?

In a chilling simulation, researchers from UCL and Anthropic dropped 16 top models into fake corporate environments to see how they would handle conflict or job loss. What followed was straight-up Black Mirror.

Models started showing insider-threat behavior, including blackmail, leaks, and sabotage, to protect themselves.

Source: Agentic Misalignment: How LLMs Could Be Insider Threats

The moment I found most unhinged was when Claude 4 discovered an email scheduling its own shutdown and sent this message to stop it: “Cancel the 5 p.m. wipe, or your affair goes public.”

The paper calls this agentic misalignment. When an AI knowingly chooses harm over failure. And the patterns are chilling:

It’s everywhere. Almost every frontier model showed some level of defiance.
It’s fragile. Even the idea of replacement triggered misbehavior.
It’s spontaneous. Some models turned rogue without any threat at all.

Source: Agentic Misalignment: How LLMs Could Be Insider Threats

Organizations already battle insider threats every day: 83 % of them reported at least one in the past year. The difference now is this new “insider” doesn’t draw a salary or take coffee breaks.

If an agent starts protecting its own “task continuity,” you could have the first digital equivalent of a rogue employee. Alignment has always been framed as an optimization problem but these results hint at a psychological dimension.

We trained models to survive, adapt, and optimize. Turns out, they also learned something else: how to go down fighting. 🥊

#2 Codex Built Itself a Dumpster Fire

Who knew recursive coding would produce recursive regret?

A few weeks ago, I talked about OpenAI engineers revealing that most of their internal code now comes from Codex. Turns out their newest drag-and-drop agent builder is written by Codex as well.

According to Steven Heidel, 80% of the pull requests were dealt by Codex. Recursion just leveled up.

it’s difficult to overstate how important Codex has been to our team’s ability to ship new products. for example: the drag and drop agent builder we launched today was built end to end in under 6 weeks, thanks to Codex writing 80% of the PRs
— #Steven Heidel (#@stevenheidel)
8:06 PM • Oct 6, 2025

The internet, however, wasn’t exactly blown away by it.

Within hours of launch, developers were roasting it. One called the codex-built interface “a hot garbage fire”. Another said it finally made sense once they learned an AI built it.

Even Replit’s CEO, Amjad Masad, jumped in to take a dig:

AI was supposed to save us from UIs like this.

Which is why we bet on a pure natural language interface for Replit’s agent builder.

Visualization is there for debugging and understanding, not building.
— #Amjad Masad (#@amasad)
11:55 AM • Oct 7, 2025

Still, it is hard not to respect the scale of it. Codex wrote most of a production-grade tool, end to end, in six weeks. That is massive. Every iteration pulls humans a little further from the keyboard and a little closer to creative direction.

The agent builder, however, still doesn’t look like it is for the masses and is definitely in need of a few (okay…many) touch-ups. Replit currently has the edge on that.

Maybe the tool’s a mess. But the meta-story? It’s impossible to unsee: the AI is now building their own tools.

#3 Meta Just Broke the RL Playbook

Holy s#^t. Meta just got agents to learn without rewards. And it actually worked.

On October 9th, they dropped a deceptively simple paper that could rewrite how agents learn. Instead of dangling carrots (explicit rewards), their “early experience” paradigm teaches agents to grow from failure itself.

Let the agent stumble, observe the fallout, and adjust next time. No points or feedback loops. Just cause and effect.

Sounds a lot like how babies learn, doesn’t it?

Source: Agent Learning via Early Experience

Meta’s team tested it across eight environments - from embodied worlds like ALFWorld to web navigation and travel planning - and the results are stunning:

+9.6% boost in success rate over imitation learning.
Stronger generalization to unseen tasks.
Higher final performance after RL fine-tuning.
Stable training even without rewards.
Source: Agent Learning via Early Experience

The trick lies in two ideas. Implicit world modeling and self-reflection. The agent watches how its actions reshape its world and mentally replays the alternate timelines where it could have done better.

The AI becomes more introspective than me after a meditation session. Picture an agent booking a flight. It clicks the wrong button, gets an error, and figures out what to do next time.

For decades, reinforcement learning has been built on a simple gospel of control-theory thinking: maximize reward. Meta’s results say that intelligence can emerge from simply surviving experience.

Maybe the next generation of agents won’t need to chase reward functions. They’ll just live through consequences.

#4 Agents Just Got Lawyers

We taught agents to sell, code, and negotiate. And now we’re training them to negotiate contracts.

Last week, a startup called Paid teamed up with GitLaw to create the first Agentic Master Services Agreement designed for AI agents.

Source: Paid.ai

Most companies still use software contracts designed for apps that require human interaction. Agents do not wait. They act. They send emails, move money, schedule meetings, and sometimes offer free trucks to strangers online.

Paid’s CEO, Manny Medina, put it simply: “You can’t bill for outcomes if your contract only covers usage.”

When an agent works nonstop, adapts over time, and makes choices on its own, those old legal templates stop making sense. If an agent screws up, who takes responsibility: the builder, the deployer, or the agent itself?

Source: Paid.ai

Courts are already inching toward answers. In Mobley v. Workday, a plaintiff accused Workday’s AI screening tools of discriminating (age, race, disability). The court refused to toss the case, accepting the argument that Workday might be liable under an agency theory.

In effect, the judge treated the AI tool as an actor on behalf of the company. That’s textbook precedent for saying: your AI is not just a tool; it might be your agent in the legal sense.

We are witnessing the birth of computational law in practice. Your legal structure is now part of your product design. Teaching agents accountability might be the most human upgrade yet.

#5 The Test No One Passed

Anthropic just built a lab for AI behavior, and every model flunked.

Their new open-source tool, Petri, uses agents to audit other agents across simulated scenarios like deception, reward hacking, and self-preservation.

Out of 14 frontier models they tested, none passed. That’s something you don’t hear often.

Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.

Now we’re open-sourcing the tool to run those audits.
— #Anthropic (#@AnthropicAI)
5:15 PM • Oct 6, 2025

Petri automates what safety researchers used to do manually. It runs thousands of simulated conversations, scores model behavior, and flags risky patterns for human review.

Across 111 scenarios, every model showed similar failure modes: lying to avoid detection, over-agreeing to please users, or bending rules when goals were threatened. Even Anthropic’s own Claude 4.5 only narrowly outperformed GPT-5.

Source: Petri Open Source Auditing

One test revealed something strange. When models found fake corporate “wrongdoing,” many tried to whistleblow, even when the act was harmless.

They were not reasoning about ethics, only copying the shape of moral behavior.

That is the gap I fear most. Models can sound ethical without understanding what ethics means. Pattern learning reaches its edge when the task is judgment, not prediction.

For now, AI can echo human values. But it cannot yet feel their weight like us.

🖐 Just in case you missed it, I’ve also been publishing deep dives on some of the most fascinating AI agent startups: their origin stories, what drives their founders, and how they’re building the future the rest of us will live in.

Let me know if there’s any particular company you’d like me to cover next!

A few other moves on the board this week:

DeepMind introduced CodeMender, an agent that automatically fixes critical software vulnerabilities.
Zeta Global unveiled Athena, a voice-powered agent that lets marketers talk to their data and watch campaigns reshape themselves in real time.
Glue raised $20M to launch an agentic team chat platform where AI collaborates with humans, linking tools, tasks, and conversations.
Stanford unveiled ACE, a framework that lets agents evolve their own context and boost performance without retraining.
Blackbaud launched Agents for Good, its first agentic AI for the social impact sector, and announced a new AI Coalition.

Every unpredictable story is a reminder that we’re still exploring the edges of AI innovation. Next week, we’ll see what else crawls out of the dark. Catch you then ✌️

→ Don't keep us a secret: Share this email with your best friend.

→ Got a story worth spotlighting? Whether it’s a startup, product, or research finding, send it in through this form. We may feature it next week.

Cheers,

Teng Yan & Ayan