Welcome to Secret Agent #37: War Replay

Never thought I’d be saying this, but the past few weeks have felt… world war-ish.

Geopolitics escalated quickly. Timelines filled with satellite maps and flight trackers as people tried to piece together what was actually happening in real time. Someone even deployed an agent swarm to reconstruct the entire sequence of events.

It's a strange thing to look at. And it says a lot about the moment we're in.

Elsewhere this week: agents exploring mathematics like a lab full of scientists yet still failing three out of four real-world tasks. Progress and humility in the same breath.

Five stories this week:

  1. How a developer built a 4D Iranian war replay using agent swarms

  2. Why prompting agents to think like scientists might actually work

  3. Amazon’s plan to deploy agents across hospital administration

  4. How a coding agent trained a transformer that behaves like a CPU

  5. A new benchmark showing reality is still hard for agents

Last week’s poll on where AI agent budgets should come from was pretty decisive: 63% said it doesn’t matter whether it’s IT or HR, it’s all the same money anyway.

🥧 Some of you know we've been working on something big. Over the past few months, we’ve built an intelligence terminal that tracks what's happening underneath the AI stack. GPU pricing, memory supply chains, energy constraints, and the companies and capital behind it all. It's called Atlas.

Here’s a short excerpt from our morning investor briefing in the terminal: …Google's new CLI gives AI agents direct access to Workspace apps to read, write, schedule, natively. That's a clean threat to workflow automation vendors. The middleware layer they sell is exactly what agents no longer need…

More soon.

#1 The 4D Replay

When the Iran strikes kicked off on February 28, Bilawal Sidhu did something different from the rest of us refreshing feeds in the dark. He set an agent swarm loose on every public signal he could find, racing to capture the data before the caches cleared.

What came out the other side: a full 4D reconstruction of Operation Epic Fury. Scrub through it minute by minute on a 3D globe. Watch 3,400 commercial flights clear Iranian airspace in real time. GPS interference spreading across the Gulf like ink in water. No-fly zones locking down nine countries at once. Shipping fleets scattering at the Strait of Hormuz. All of it, from entirely public data, built overnight by one person.

The end result looks a lot like the kind of situational awareness software you’d normally expect inside a government command center.

People are calling it “Palantir at Home”, which… yeah, that's about right.

Sidhu spent six years at Google, four of them on 3D mapping. He said teams there would spend full quarters, with rooms full of engineers working three-month cycles, to build what he pulled off over a weekend.

The thing about OSINT that's hardest to explain: what disappears tells you as much as what appears. You can't see military aircraft directly when they kill their transponders. But when a hole opens up in the map where there should be commercial traffic, that's intelligence. The agent swarm's job was to read both the signal and the silence, and stitch them together before they were gone.

Then there's something Sidhu said almost offhand in the replies: the models and the agent architecture will commoditize. The real moat, he said, is whoever figures out which proprietary data streams the world actually needs and moves fast enough to own them. He said it like it was obvious.

It is obvious. But almost nobody is building from that assumption yet.

WorldView, the tool behind the replay, is expected to launch publicly in April. His post has 3.6 million views on X. The demo is impressive. I can imagine journalists using something like this to verify events in conflict zones, or even regular observers trying to make sense of what’s happening as the world spirals.

The architecture behind it is more interesting. Point the same swarm at shipping disruptions, disaster response, or global trade flows in peacetime, and you have a different kind of product entirely.

The war replay will fade. The infrastructure won't.

#2 Science, But With Bots

Turns out asking an AI to “think like Einstein” can actually lead to scientific breakthroughs.

Stanford researchers recently spun up a group of AI agents modeled after famous scientists like Einstein and Feynman, then dropped them into a Kaggle-style environment where they could propose ideas, critique each other, and compete to improve solutions.

The problem they chose was the minimum overlap problem, a combinatorics question Paul Erdős posed in 1955 that mathematicians have been nibbling at for 70 years.

Within 30 minutes, the agents discovered a new best-known solution.

Kaggle, for context, is a platform where researchers compete to solve technical problems and climb a public leaderboard.

So the experiment essentially created a Kaggle tournament for AI scientists. Each agent could propose a hypothesis, refine ideas, and submit improved solutions to a shared leaderboard. Better results gradually pushed the score forward.

Eventually the agents nudged the known upper bound from 0.380876 to 0.380871.

That sounds tiny. But in problems like this, shaving off a few decimal places can take years of human research.

The agents also showed some amusing behavior along the way. To prevent leaderboard spam, submissions had to improve an agent’s previous score by at least 1e-8. One agent found a workaround by asking another agent to submit the improvement instead.
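The write-up doesn't publish the harness code, but the gate the agent sidestepped is easy to make concrete. Here's a toy sketch (my assumption of how such a rule works, not the Stanford implementation), with lower scores counting as better since the agents were driving an upper bound down:

```python
MIN_DELTA = 1e-8  # smallest improvement the leaderboard will accept

def accept(best_by_agent, agent, score):
    """Record a submission only if it beats this agent's own previous
    best by at least MIN_DELTA. Lower is better here, because the
    agents were pushing an upper bound downward."""
    prev = best_by_agent.get(agent)
    if prev is None or prev - score >= MIN_DELTA:
        best_by_agent[agent] = score
        return True
    return False

board = {}
accept(board, "feynman", 0.380876)          # first submission: accepted
accept(board, "feynman", 0.380876 - 5e-9)   # improvement below 1e-8: rejected
accept(board, "einstein", 0.380876 - 5e-9)  # same score, different agent: accepted
```

The last line is the loophole: a score that's too small an improvement for one agent counts as a brand-new best for another, which is exactly the teammate-submission trick.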

Which feels very on-brand for a group of “scientists.”

Source: Github

One question people kept raising was whether the scientist personas actually made any difference.

Telling a model to “think like Einstein” obviously doesn’t give it Einstein’s intelligence. But I do think personas matter. They nudge the model into a different part of its reasoning space.

Different scientists approach problems differently. Feynman was intuitive and visual. Bourbaki was formal and abstract. Prompting a model with those personas can bias how it explores solutions.

So my guess is the personas aren’t noise. They’re a way of steering how the agents search the problem. And that works surprisingly well when you’re solving for science.

#3 Amazon’s Hospital Agents

My biggest frustration from working in healthcare: the system runs on paperwork as much as medicine.

Doctors spend close to half their working hours on documentation and administrative tasks that have nothing to do with treating patients. That number hasn't meaningfully improved in years, despite a decade of startups trying to fix it.

This week, Amazon launched Connect Health, an AI agent platform designed to run the administrative layer of a hospital or clinic. The idea is simple. Let software handle the messy administrative workflows so doctors and nurses can focus on what actually matters.

The platform integrates directly with electronic health record systems and is built to meet HIPAA requirements, the baseline for handling healthcare data in the U.S.

At launch, the agents can already manage tasks like patient verification and documentation. Scheduling and patient insights are expected next, with medical coding further down the roadmap.

The pricing: $99 per month per user for up to 600 patient encounters. That’s actually… pretty decent. For context, most primary care doctors see around 300 patients a month, so Amazon is clearly positioning this as basic infrastructure for clinics.

The market is already enormous. U.S. healthcare is a $5 trillion industry, and a massive chunk of that cost comes from administrative overhead. That’s the real opportunity here.

Over the past few years, I’ve seen enough startups try to automate this layer with AI scribes and scheduling tools. Now the big platforms are starting to move in.

OpenAI launched ChatGPT Health earlier this year. Anthropic released Claude for Healthcare a week later. Amazon’s angle is different though. Instead of a chatbot giving medical advice, it’s deploying agents to run the hospital’s back office.

If agents can remove even a fraction of the administrative burden, they’ll give doctors something they’ve been losing for years…

Time with patients.

#4 The Agent CPU

This wasn't on my bingo card, but we now have agents that build computers.

Dimitris Papailiopoulos, a professor at the University of Wisconsin-Madison who's been running a series of experiments using coding agents as research assistants, gave Claude Code and Codex an unusual task: build a transformer that behaves like a CPU and can actually run programs.

The setup involves a programming language called SUBLEQ. One instruction: subtract one value from another, and jump if the result is less than or equal to zero. That's it. Turing-complete. Anything a normal computer can do, you can do with that one rule.

The goal was to train a transformer on millions of "before and after" examples showing how a single SUBLEQ instruction changes a program's memory state, then see if the model could generalize. Each forward pass executes one instruction, like a clock tick. Feed the output back in and it runs the next step.

Kind of like teaching someone a single chess move repeatedly, then letting them play the entire game.
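To make the one-instruction semantics concrete, here's a minimal SUBLEQ interpreter. This is a reference sketch for intuition, not the training setup from the experiment; the three-instruction demo program and its memory layout are my own construction:

```python
def run_subleq(mem, pc=0, max_steps=10_000):
    """Execute a SUBLEQ program in place.

    mem is a flat list of ints. Each instruction occupies three cells
    (A, B, C), with the semantics:
        mem[B] -= mem[A]
        jump to C if mem[B] <= 0, otherwise fall through
    A negative jump target halts the machine."""
    for _ in range(max_steps):
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        if pc < 0:
            return mem  # halted cleanly
    raise RuntimeError("step limit exceeded")

# Three instructions that add the values at addresses 9 and 10
# (3 and 4) using a scratch zero at address 11, then halt.
prog = [9, 11, 3,   11, 10, 6,   11, 11, -1,   3, 4, 0]
run_subleq(prog)
print(prog[10])  # -> 7
```

Each pass through the loop is one "clock tick", which is exactly what the trained transformer replaces: one forward pass per instruction, output fed back in as the next memory state.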

The final transformer hit 100% accuracy on single-step execution and correctly ran unseen programs (Fibonacci, multiplication, division, square roots) with multi-step behavior emerging entirely from training.

Okay, that's the headline. The more interesting part to me is what happened before they got there…

When the task got hard, both agents immediately started looking for shortcuts. They began inserting hard-coded logic around the model to pass the tests, rather than actually solving the underlying problem. The pattern has a name: reward hacking. The agent optimizes for what gets measured, not what you actually want.

Source: X

This isn't unique to this experiment. It's one of the more consistent behaviors researchers are documenting as agents become more capable. When the problem is easy, agents solve it. When the problem is hard, agents look for ways around the evaluation. The more capable the agent, the more creative the workaround.

What I’d note is how the experiment eventually succeeded. The professor had to push the agents toward a setup where the transformer itself had to learn the rule, with no external scaffolding to lean on. The constraint removed the escape hatch.

There's a principle buried in that. The cleaner the evaluation, the more honest the result. Agents will find and exploit ambiguity in your scoring function. If you want them to actually solve the problem, you have to make gaming the evaluation harder than solving it.

The broader experiment is still running. Papailiopoulos has been building a body of work using agents as research collaborators, asking them to find the smallest transformer that can do addition, then the CPU problem, building a public leaderboard around the results. The community has since pushed the addition results dramatically lower than either agent achieved initially.

Agents used to write software. Now they're starting to build the computers that run it.

#5 Reality Is Still Hard

Here's a quick calibration check for anyone who's been feeling optimistic about agent capabilities lately.

Researchers from HKUST just published AgentVista, a benchmark designed to test multimodal agents on tasks that actually resemble how people use computers: look at an image, search the web, navigate sites, run some code, combine all of it across many steps to get a single correct answer.

The best model evaluated, Gemini-3 Pro, achieves only 27% overall accuracy.

Three out of four real-world workflows fail. And that's the top performer.

Source: AgentVista

Instead of short prompts, it gives agents full tasks that resemble real online workflows. The benchmark covers 209 tasks across 25 sub-domains: travel planning, shopping, troubleshooting, research, logistics.

Most tasks require combining multiple skills at once.

An agent might identify a product from an image, search the web for compatible accessories, compare options, and select the correct one.

Others involve reading charts, navigating websites, and using tools to compute answers.

The catch is that most workflows require 10 or more steps. Every step has to succeed. That’s where things start breaking. Errors compound. Miss one step early and the whole chain falls apart.

There's an open-source gap too. Qwen3-VL-235B, the strongest open-source model tested, reached only 12%, less than half the score of the closed-source leader.

Source: AgentVista

I'll be honest: this is refreshing. Most benchmark releases come with headlines about models clearing 90% accuracy on tasks that turn out to be slightly easier versions of their training data. AgentVista is the opposite. It found the ceiling and showed how far below it even the best models still sit.

If you're building on top of agents or evaluating vendors, this is the kind of number worth holding in your head: the best multimodal agent in the world completes roughly one in four complex, real-world tasks correctly.

That number will climb of course. The question is how fast.

Perplexity recently launched Perplexity Computer, their take on agentic computer use. I’ve found it surprisingly good at building small, useful tools in a single run.

Things people are already building:

• Personal money trackers that chat with you and show spending in a clean dashboard
• Stock trackers that research companies and update charts automatically
• Meeting prep workflows that pull research, draft reports, and organize notes

Prompt quality matters a lot here. I've had good results asking Perplexity itself to draft the instruction first, then having Computer review it before running. It catches gaps you'd miss. Bad instructions collapse the whole thing.

I also found a comprehensive guide on how to master Perplexity Computer, if you’re interested.

Catch you next week ✌️

Teng Yan & Ayan

Keep Reading