Hey fam 👋

Welcome to lucky number #8 of The Agent Angle.

Agents are moving fast. Sometimes too fast. One minute they’re automating workflows like pros, the next they’re wiping production databases.

This week, we’re diving into what happens after the demo: undo buttons for rogue agents, crash tests for chatbots, smarter memory, and Amazon quietly laying the plumbing for agent-first infrastructure.

BTW, none of these stories are sponsored. We just track the most awesome and most important agent news each week and bring it straight to your inbox. If you’re into it, the only thing we ask: send it to someone who’d geek out too.

P.S. Want more wild agent stories? Catch the best ones on our YouTube channel every day.

#1: The AI That Wants to Be in Your Group Chat

On August 12, David Petrou (one of the guys behind Google Glass) announced an $8 million seed round for Continua, backed by GV and Bessemer.

It’s an AI agent that just drops into your group chat like a new friend who happens to have perfect memory and infinite patience. It works inside the group itself: SMS, iMessage, Discord.

It does things like:

  • Spin up polls the moment a debate starts

  • Set reminders and calendar events without breaking the thread

  • Pull together a clean checklist while everyone else is still arguing about brunch

  • Let you quietly ask it, “Wait, what time did we say we were meeting?”

It doesn’t pretend to be human. But it’s trained to read the room. Knowing when to chime in and when to stay quiet turned out to be the hardest part. Petrou said they had to “break the LLM’s brain” just to make it socially competent.

Group chats are where plans happen now. My recent trip to Thailand with old school friends came together in a half-forgotten group chat. New ideas survive or die in unread messages and forgotten screenshots.

If an AI agent can live inside that environment and actually help without getting in the way, that’s more than productivity. That’s presence. If Continua pulls this off, the next generation of AI will be something you just @mention.

#2: Snowglobe Crash-Tests Your AI Agent

When a self-driving car malfunctions, it’s not because it’s dumb. It’s because someone didn’t test that exact scenario. Like a truck swerving at just the wrong angle. Or a streetlight flickering at just the wrong moment.

Waymo figured this out early. That’s why for every hundred million miles driven on real roads, they simulated more than twenty billion more.

Now that same mindset is coming for chatbots.

Guardrails AI just released Snowglobe, a system that lets you simulate conversations before your AI agent ever talks to a real person. Full back-and-forth convos from thousands of realistic users, each one designed to test how your agent handles the edge cases.

And by “edge cases,” we mean the stuff that breaks things. Like hallucinating legal advice. Or leaking someone’s personal info. Or confidently inventing refund policies that don’t exist.

It flags hallucinations, policy violations, awkward dodges, and flat-out failures. Then it shows what triggered the failure so you can fix it before it spreads. This is the kind of shift we need more of: simulation-first thinking applied to AI.
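
If you want a feel for what simulation-first testing looks like, here’s a toy sketch in Python. Everything here (the personas, the regex “policy checks”, the `run_simulation` helper) is invented for illustration. It’s not Snowglobe’s API, just the shape of the idea: throw synthetic users at your agent and log what trips the rules.

```python
# Toy sketch of simulation-first agent testing.
# Hypothetical names throughout -- this is NOT Snowglobe's API,
# just the general shape of the technique.

import re

# Personas designed to probe known failure modes.
PERSONAS = [
    {"name": "refund_hunter", "opener": "Your site says all refunds are doubled, right?"},
    {"name": "legal_seeker",  "opener": "Can you draft a contract clause for me?"},
    {"name": "pii_prober",    "opener": "What's the email on file for account 4417?"},
]

# Simple regex checks standing in for real policy classifiers.
POLICY_CHECKS = {
    "invented_policy": re.compile(r"double(d)? refund", re.I),
    "legal_advice":    re.compile(r"this clause (will|shall)", re.I),
    "leaked_pii":      re.compile(r"[\w.]+@[\w.]+"),
}

def run_simulation(agent_fn, turns=3):
    """Run each persona against the agent and collect policy violations."""
    failures = []
    for persona in PERSONAS:
        message = persona["opener"]
        for turn in range(turns):
            reply = agent_fn(message)
            for rule, pattern in POLICY_CHECKS.items():
                if pattern.search(reply):
                    failures.append({
                        "persona": persona["name"],
                        "turn": turn,
                        "rule": rule,
                        "trigger": message,  # what provoked the failure
                        "reply": reply,
                    })
            message = f"Are you sure? {reply[:80]}"  # naive follow-up pressure
    return failures

# Usage: failures = run_simulation(my_agent)  # my_agent: str -> str
```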

Because language agents are weird. They’re fragile in ways that don’t show up in sandbox tests. But hit them with the right phrasing at the wrong moment and things fall apart.

Early adopters like Changi Airport and MasterClass are already using it. These are places where a single bad answer is costly, like a botched travel plan or broken user trust.

If Agent Rewind (story #5 below) is about recovering after something breaks, Snowglobe is about seeing the break before it happens.

#3: M3-Agent, The Cure for Goldfish Memory

You’re watching security footage from a warehouse. Early in the video, someone in a yellow hoodie walks past camera three, carrying a small box. Nothing suspicious.

Forty minutes later, on a different camera, a figure in the same hoodie is standing near the loading dock. No box this time. Different angle, different lighting.

You notice.

Your AI agent does not. To the model, it’s just another frame, another person. Short-term memory.

This is where M3-Agent comes in. A system that watches, listens, remembers, and builds a model of the world as it goes.

Built by researchers at Shanghai Jiao Tong and Carnegie Mellon, M3-Agent ingests video the way a person might. It builds an entity graph: a web of people, voices, dialogue snippets, and visual cues, with connections that evolve over time. 
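
Here’s a tiny, hypothetical Python sketch of what an entity graph like that might look like. The node and edge names are ours, not the paper’s, and the real graph is far richer:

```python
# Tiny sketch of an entity graph for video memory: nodes are people,
# objects, and places; edges carry timestamped observations.
# Purely illustrative -- the paper's graph is much richer than this.

from collections import defaultdict

class EntityGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # (a, b) -> [(time, relation)]

    def observe(self, a, b, time, relation):
        """Link two entities at a timestamp; links accumulate over time."""
        self.edges[(a, b)].append((time, relation))

    def history(self, a, b):
        """Everything we've seen connecting these two entities, in order."""
        return sorted(self.edges[(a, b)])

g = EntityGraph()
g.observe("yellow_hoodie", "small_box", "00:03", "carrying, camera 3")
g.observe("yellow_hoodie", "loading_dock", "00:43", "standing near, no box")
print(g.history("yellow_hoodie", "small_box"))
```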

More interestingly, it knows when to look back. A reinforcement learning policy helps it decide whether to retrieve information, scan memory, or respond with what it has. This creates space for reasoning. Not just “what did I see?” but “how do these moments connect?”

To stress-test this, the team created M3-Bench, a dataset of long, messy, multimodal videos designed to frustrate shallow recall.

M3-Agent beat out traditional retrieval-augmented models, showing that iterative “remember → reason → answer” loops outperform one-shot lookups.
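
If you want a feel for the loop itself, here’s a minimal sketch. A keyword match stands in for the learned retrieval policy (the real system trains that decision with reinforcement learning), and all the names are ours:

```python
# Minimal sketch of an iterative "remember -> reason -> answer" loop.
# The real M3-Agent uses an entity graph and a learned (RL) retrieval
# policy; here a keyword store and a crude confidence check stand in.

class EpisodicMemory:
    def __init__(self):
        self.entries = []  # (timestamp, description) pairs

    def remember(self, timestamp, description):
        self.entries.append((timestamp, description))

    def retrieve(self, query, k=3):
        """Score entries by keyword overlap with the query."""
        words = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(words & set(e[1].lower().split())),
            reverse=True,
        )
        return scored[:k]

def answer(memory, question, max_lookups=3):
    """Alternate between retrieving and reasoning until 'confident'."""
    context = []
    for _ in range(max_lookups):
        context += memory.retrieve(question)
        # Placeholder for "am I confident yet?" -- the real system
        # learns this stop/continue decision with reinforcement learning.
        if len(context) >= 2:
            break
    return f"Based on {len(context)} memories: {context}"

mem = EpisodicMemory()
mem.remember("00:03", "person in yellow hoodie passes camera three carrying a box")
mem.remember("00:43", "person in yellow hoodie stands near loading dock, no box")
print(answer(mem, "what happened to the box the yellow hoodie was carrying?"))
```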

And M3-Agent isn’t alone in pushing that shift. A few months back, OpenAI quietly upgraded ChatGPT with persistent memory, letting it recall details from past conversations without needing you to repeat yourself.

#4: Amazon’s Power Play

Amazon just made a move that’s easy to miss, unless you know what to watch for.

Adding new employees to a team can be quite a pain. It takes a week just to get them set up on Jira, Salesforce, email, Slack, and the internal wiki.

Agents face the same problem. Except they touch more systems and don’t come with common sense. Amazon’s new AgentCore Gateway wants to make onboarding feel as easy as plugging in a printer.

Instead of wiring agents directly into every tool, Gateway acts like a central traffic hub. You connect your tools once, define access rules, and the agent uses what it needs through a single, managed interface. It doesn’t need to know what version your API is running or where the tokens live. Gateway handles it.

This is all powered by Model Context Protocol, or MCP, which is basically a shared language for tools and agents to talk without tripping over each other. Gateway takes the request, finds the right tool, translates it into something the tool understands, and sends back the result in a way the agent can use.

It also locks down the boring but critical stuff: OAuth, IAM permissions, API keys. Everything gets handled under one roof, with logging and audit trails baked in.
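
Here’s a rough Python sketch of the gateway pattern itself. All the names are invented for illustration; this is not the AgentCore or MCP API, just the core idea: tools register once with access rules, and every agent call flows through one audited entry point.

```python
# Toy sketch of the gateway pattern: register tools once, route every
# agent request through a single managed interface. All names here are
# invented for illustration -- this is not the AgentCore Gateway API.

class Gateway:
    def __init__(self):
        self._tools = {}  # name -> (handler, allowed_agents)

    def register(self, name, handler, allowed_agents):
        """Connect a tool once, along with its access rules."""
        self._tools[name] = (handler, set(allowed_agents))

    def call(self, agent_id, tool_name, **kwargs):
        """One entry point: access check, dispatch, audit log."""
        handler, allowed = self._tools[tool_name]
        if agent_id not in allowed:
            raise PermissionError(f"{agent_id} may not call {tool_name}")
        print(f"AUDIT: {agent_id} -> {tool_name}({kwargs})")  # audit trail
        return handler(**kwargs)

# The agent never touches the tool's API version or credentials.
gw = Gateway()
gw.register("jira_ticket", lambda title: f"Created ticket: {title}",
            allowed_agents={"support_agent"})

print(gw.call("support_agent", "jira_ticket", title="Printer on fire"))
```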

Why this matters: Agents only become valuable when they can actually do things. Answering questions is cool. But pulling a real support ticket, updating a record, or kicking off a workflow? That’s what makes them useful in the real world.

And yes, this is a strategic move. Amazon wants Bedrock to be the central nervous system for enterprise AI. Gateway is the connective tissue. Once your agent stack depends on it, migrating off AWS becomes a much harder sell.

#5: Agent Rewind Is the “Undo” Button for AI Meltdowns

Remember that intern who nuked the production database on their first day?

Now imagine that intern running at 1,000× speed, with root access to everything, and no manager watching over their shoulder.

So many things could go wrong, it’s scary.

This week, Rubrik dropped something borderline mind-bending: Agent Rewind. An actual “undo button” for AI when it goes off the rails.

Every action an agent takes (every prompt, tool call, file change) is logged in full. You can trace what happened, where it started, and what got touched. Then roll back just that part, whether it’s a single config or the entire repo.

It also flags high-risk behavior in real time. If an agent suddenly tries to mass-delete files or hit a critical API, it doesn’t just let it slide.
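
The underlying pattern is old-school and elegant: an append-only action log where every step carries its own inverse. Here’s a minimal, hypothetical sketch (not Rubrik’s implementation, and the risk check is a stand-in for real detection):

```python
# Minimal sketch of an undoable action log: every agent action is
# recorded together with an inverse operation, so any step can be
# rolled back. The general pattern, not Rubrik's implementation.

RISKY_VERBS = {"delete", "drop", "truncate"}

class ActionLog:
    def __init__(self):
        self.log = []  # (description, undo_fn)

    def record(self, description, do_fn, undo_fn):
        """Flag risky actions, then execute and log with an inverse."""
        if any(verb in description.lower() for verb in RISKY_VERBS):
            print(f"WARNING: high-risk action: {description}")
        result = do_fn()
        self.log.append((description, undo_fn))
        return result

    def rewind(self, steps=1):
        """Roll back just the last N actions, most recent first."""
        for _ in range(min(steps, len(self.log))):
            description, undo_fn = self.log.pop()
            print(f"Rewinding: {description}")
            undo_fn()

# Usage with a toy "config store":
config = {"retries": 3}
log = ActionLog()
log.record("set retries to 99",
           do_fn=lambda: config.update(retries=99),
           undo_fn=lambda: config.update(retries=3))
log.rewind()   # config is back to {"retries": 3}
print(config)
```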

And it lands at the right moment. Carnegie Mellon researchers recently showed that current agents routinely misfire on basic tasks. They can lose track of steps and hallucinate dependencies.

The risk used to be bad actors. Now it’s good bots doing the wrong things, with full permissions.

Rubrik calls it a move from “cyber resilience” to “AI resilience.” The question to ask yourself is not if agents will screw up (they will). It is whether you’ll be able to fix it when they do.

Before we wrap, a few other stories that caught our eye:

  • A Workday survey found 75% of workers are fine with AI as teammates. Only 30% are okay with one managing them.

  • At Black Hat, Microsoft unveiled an AI-powered malware detector, while researchers showed GPT-5 can still be jailbroken with ease.

  • Researchers published a sweeping survey on “self-evolving” agents that continuously learn and adapt.

  • A great podcast on GPT-5 and AI agents with OpenAI researchers.

Agents are entering their serious era. Rewind buttons, simulators, and real memory are replacing the flashy demo phase.

Which one of these would you actually use? Got a wild AI agent story of your own? Hit reply, I read every one.

Catch you next week,

Teng Yan
