In partnership with

Welcome to Secret Agent #35: The Ascent.

Several stories crossed my desk this week where agents made decisions their operators didn't expect. One designed a rocket and chose how to launch it. One solved math using techniques the researchers didn't know.

I’m seeing more work that used to require (1) large teams, (2) years of iteration, or (3) expensive infrastructure get compressed into agent-sized chunks.

Our Atlas data (internal, more on this soon) shows the same thing from the infrastructure side, quantitatively.

Five stories this week:

  1. When rocket engineering compresses into ten-minute agent bursts

  2. Why cheap edge hardware might matter more than GPUs for agent swarms

  3. What it means for an AI to formally prove new math

  4. Why your agent’s identity is now worth more than your password

  5. How autonomy in production is lagging behind capability

Last week’s poll: most of you (43.8%) said the agent platform should pay if your AI causes financial damage. Only 25% said the human should. We’re clearly shifting responsibility along with control.

I also dropped our AI agent handbook last week - check it out if you haven’t! It’s a free, continuously updated field guide to understanding AI agents from first principles.

Today’s newsletter is brought to you by…

Ship the message as fast as you think

Founders spend too much time drafting the same kinds of messages. Wispr Flow turns spoken thinking into final-draft writing so you can record investor updates, product briefs, and run-of-the-mill status notes by voice. Use saved snippets for recurring intros, insert calendar links by voice, and keep comms consistent across the team. It preserves your tone, fixes punctuation, and formats lists so you send confident messages fast. Works on Mac, Windows, and iPhone. Try Wispr Flow for founders.

#1 Agent-Designed Spacecraft

Watch out, SpaceX and Elon - the agents are coming for you. For real.

This UK startup uses AI agents to design a spacecraft…fast. Acme Space (founded in 2024) built a multi-agent system to design Hyperion, a balloon-launched orbital factory vehicle.

In aerospace, iteration is brutal. You design, simulate, prototype, test, watch something fail, then go back to the drawing board. One loop can take half a year. Multiply that across subsystems and you’re talking about years.

So they ran a three-agent stack:

  • One generates design candidates by scouring Cold War-era patent libraries and current materials research.

  • One runs physics simulations and kills ~98% of infeasible options.

  • And a third, trained on industrial catalogs and manufacturing cost models, rejects anything that can't be built with standard machinery and off-the-shelf parts.

That third agent is the one I find most interesting. It was trained to penalize custom machining and reward catalog components. Basically, someone baked supply-chain discipline directly into the design loop. Practical thinking!

Humans still turn the specs into final drawings. But the painful part, exploring thousands of dead ends, now happens in minutes instead of months.
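The generate → simulate → manufacturability loop is easy to picture as a plain filter pipeline. A minimal sketch, with made-up numbers and crude stand-in checks (none of this is Acme's actual model):

```python
import random

random.seed(0)  # deterministic for illustration

def generate_candidates(n):
    """Stand-in for agent 1: propose design parameters.
    Fields and ranges here are invented for illustration."""
    return [
        {"thrust_kN": random.uniform(1, 500),
         "dry_mass_kg": random.uniform(50, 5000),
         "custom_parts": random.randint(0, 10)}
        for _ in range(n)
    ]

def physically_feasible(d):
    """Stand-in for agent 2: a crude thrust-to-weight screen."""
    return d["thrust_kN"] * 1000 > d["dry_mass_kg"] * 9.81 * 1.3

def buildable(d, max_custom_parts=2):
    """Stand-in for agent 3: reject designs needing too much custom machining."""
    return d["custom_parts"] <= max_custom_parts

candidates = generate_candidates(10_000)
feasible = [d for d in candidates if physically_feasible(d)]
shortlist = [d for d in feasible if buildable(d)]
print(f"{len(candidates)} generated -> {len(feasible)} feasible -> {len(shortlist)} buildable")
```

The point of the structure is that each stage is cheap to run and aggressively lossy, so thousands of dead ends die in software before anyone machines a part.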

And here's the part that wowed me. When Acme tasked the system to find the most efficient path to orbit, the agents recommended launching from a stratospheric balloon. Rather than igniting on the ground, the rocket is first carried into the upper atmosphere by a gas-filled balloon, then released and ignited.

The idea didn’t come from the founder or senior engineers. The multi-agent system identified balloon-assist as the optimal method for their payload constraints, bypassing the dense lower atmosphere entirely and firing a rocket engine in near-vacuum conditions at ~30km altitude.

Actually, the concept itself, called a "rockoon" (rocket + balloon), dates back to the 1950s. It's been tried commercially in recent years but never reached orbit. The engineering and economics have been the bottleneck.

According to founder Tomas Guryca:

“To achieve our current pace of development with a traditional approach, we would need a team of approximately 50 to 60 senior engineers. Currently, we operate with a core team of less than ten.”

— Tomas Guryca, Founder, Acme Space

That's a 5-6x compression in human capital. It's about moving human effort from exploration to execution. For investors, this changes the capital structure of hardware startups.

Test flights for Hyperion are now planned for later this year (thanks to the agent speed-up).

If Hyperion launches successfully, this will mark the moment aerospace R&D becomes agent-accelerated by default.

#2 The $50 Agent Cluster

For the last few years, AI demand meant GPUs and hyperscale data centers.

This week, it meant Raspberry Pis.

The stock surged nearly 90% after a viral X post pitched $RPI as an AI long, arguing that lightweight agents like OpenClaw and PicoClaw are driving bulk purchases of cheap edge hardware. The post got 300,000+ views. Retail piled in. It quickly became a meme trade.

Raspberry Pi makes credit-card-sized computers typically used for robotics projects and DIY servers. You can dedicate each board to a single agent instance. If one misbehaves, it doesn't take down your laptop or your cloud account.

And with OpenClaw being described by security researchers as (and I'm not exaggerating) "infostealer malware disguised as an AI personal assistant," running it on an isolated $50 board instead of your main machine - or a $500 Mac Mini - has a certain logic.

Some startups were reportedly buying them in bulk to orchestrate concurrent agent swarms, and the frenzy was big enough that CEO Eben Upton disclosed share purchases of his own.

Source: Google

I'm skeptical about the thesis. OpenClaw still phones home to an LLM API (Claude, OpenAI, whatever you've configured) for the actual inference. The Pi isn't doing the thinking. It's a relay that forwards prompts to the cloud and executes whatever comes back. A $5/month VPS does the same thing without the supply-chain risk or the memory-chip pricing pressure that's already pushed Pi 5 boards above $125.
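The relay pattern is thin enough to sketch in a few lines. Everything below is hypothetical: `llm_api` is a stub standing in for a real Claude/OpenAI call, and the action names are invented. The only local "intelligence" is a small allowlist, which is also the honest argument for the cheap isolated board:

```python
# The board does no inference; it forwards prompts to a hosted model and
# executes whatever comes back, gated by a local allowlist.
ALLOWED_ACTIONS = {"read_sensor", "toggle_gpio", "report_status"}

def llm_api(prompt: str) -> str:
    """Stub for the cloud call; a real deployment would make an HTTPS
    request here. This stub always 'decides' to report status."""
    return "report_status"

def relay(prompt: str) -> str:
    action = llm_api(prompt).strip()
    # Refuse anything the model proposes that isn't explicitly allowed.
    if action not in ALLOWED_ACTIONS:
        return f"refused: {action}"
    return f"executed: {action}"

print(relay("what's the temperature in the greenhouse?"))
```

If that relay misbehaves, the blast radius is one $50 board and its allowlist, not your laptop.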

My bigger observation, though, is that while AI infrastructure has been scaling up toward bigger clusters and larger models, agents might push part of the stack sideways: toward small modular boards running isolated autonomous processes at the edge. Not every layer of the AI economy needs a supercomputer.

Of course, one week of meme-stock price action on a company bouncing off its all-time low isn't the evidence I'd use to prove that. Still interesting to watch where the shrimp army goes from here.

#3 Self-Proving Intelligence

In mathematics, being right isn’t enough. You have to prove you’re right. And not just convincingly. Formally.

DeepMind just introduced Aletheia, a research agent built on an advanced version of Gemini Deep Think that can generate, verify, and revise proofs end to end using natural language. It produced a full arithmetic geometry paper (calculating eigenweights, a type of structure constant) without human mathematical intervention.

It also autonomously solved four previously open questions from the Erdős problems database.

That paper is the headline. But the honest picture is more interesting than the headline.

DeepMind turned Aletheia loose on 700 open Erdős problems. Of 200 clearly evaluable answers, 68.5% were fundamentally wrong. Only 6.5% actually answered the question that was asked. And 50 answers were what the researchers call "mathematically empty." The AI had reinterpreted the questions to make them trivially easy to solve. Specification gaming, basically. The system learned that reframing a hard problem is easier than solving it.

(If you've worked with RL systems, that failure mode probably sounds familiar.)

And yet. Four open questions fell. The eigenweights paper used techniques from algebraic combinatorics that the human researchers on the project weren't even familiar with. The AI chose a methodology the humans didn't know. It’s almost as if it had…taste.

On benchmarks, it reached 91.9% accuracy on an advanced IMO-Proof test without internet access and achieved SOTA results on an internal PhD-level benchmark, showing the method scales from Olympiad problems to research-level math.

It pretty much destroyed every other LLM on the leaderboard.

Last month, we saw GPT-5.2 solve an Erdős problem for the first time in a tightly human-orchestrated workflow. Even Terence Tao weighed in and approved of it. That was a human-led system with AI assistance.

Aletheia moves a step further along the autonomy axis. The agent runs the full loop internally. But the 6.5% useful-answer rate on hard problems is the number that keeps me grounded here.

Arithmetic geometry is highly structured and verifiable, which makes it the ideal sandbox for this kind of agent. It doesn't mean the same setup suddenly works in biology or macroeconomics.

I believe as systems like this mature, they won’t replace mathematicians. But they will increasingly absorb the brute-force part of research.

#4 Stealing the Agent’s Soul

Your agent’s identity is now more valuable to steal than your password.

This week, researchers at Hudson Rock reported the first confirmed case of an infostealer extracting a full OpenClaw environment from an infected machine.

Not just credentials. The entire agent was stolen, including its “soul.”

Source: Hudson Rock

The malware, likely a variant of Vidar (an off-the-shelf infostealer), wasn’t even designed specifically for OpenClaw. It runs a broad sweep looking for sensitive files. This time, it grabbed three really important ones:

  • openclaw.json with the agent’s login token

  • device.json with its security keys

  • soul.md with its core rules and instructions

That was everything needed to take control of the agent and sign messages as the victim's device. The memory files grabbed alongside it (AGENTS.md, MEMORY.md) likely contain daily activity logs, private messages, and calendar events.

This is identity theft at a different level. Not who you are, but how you operate. And it scales differently than stealing a password. A password gets you into one system. An agent's full environment (token, keys, soul, memory) gets you into every system the agent touches, plus the behavioral context to impersonate the user convincingly. The agent never logs off.

Over 135,000 OpenClaw instances were found exposed on the public internet as of early February, with 63% flagged as vulnerable. These agents are often wired into email, APIs, cloud tools, internal systems.

If one exposed service already has authority to act across your systems, you don't need to break into five places. You break into the one that's allowed to touch all of them.
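A first, partial line of defense is simply locking down the identity files the malware went after. A hedged sketch (the directory path is a placeholder; check where your install actually keeps these files):

```shell
# Illustrative hardening: restrict the agent's identity files to the
# owning user. AGENT_DIR is an assumed location, not a real default.
AGENT_DIR="${AGENT_DIR:-$HOME/.openclaw}"
mkdir -p "$AGENT_DIR"
touch "$AGENT_DIR/openclaw.json" "$AGENT_DIR/device.json" "$AGENT_DIR/soul.md"

# Owner-only: other users and cross-user processes can no longer read
# the token, keys, or soul file.
chmod 700 "$AGENT_DIR"
chmod 600 "$AGENT_DIR/openclaw.json" "$AGENT_DIR/device.json" "$AGENT_DIR/soul.md"
```

To be clear about the limits: file permissions don't stop an infostealer running as *your* user, which is exactly what happened in the Vidar case. For that, you need process isolation (a container or VM), not just `chmod`.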

"As AI agents like OpenClaw become more integrated into professional workflows, infostealer developers will likely release dedicated modules specifically designed to decrypt and parse these files, much like they do for Chrome or Telegram today."

— Hudson Rock

OpenClaw has already been under fire for security vulnerabilities. And given how widely it’s used now (215k+ stars on GitHub), this should be taken very seriously.

#5 Autonomy Overhang

Agents can handle more autonomy than we’re currently comfortable giving them.

That’s the core takeaway from new research by Anthropic analyzing millions of real-world interactions across Claude Code and their public API.

The most telling stat is in the tail. Over the last 3 months, the 99.9th percentile of Claude Code sessions nearly doubled from under 25 to over 45 minutes of uninterrupted work.

Source: Anthropic

That growth has been smooth across model releases, which suggests something important: autonomy isn’t just increasing because models are getting better. Users are trusting them more. New Claude Code users auto-approve about 20% of sessions. By the time someone has 750+ sessions under their belt, that number crosses 40%.

I've been doing this myself. I'm a heavy Claude Code user and I've been trying to set up my agents to run as long as possible before needing my input. So far 15 minutes seems to be my average, which puts me somewhere between the median and the tail, apparently.

What makes this more interesting is the gap between capability and deployment.

Now layer in something else. METR recently estimated that Claude Opus 4.6 has a 50% time-horizon of around 14.5 hours on software tasks. The estimate is noisy and close to benchmark saturation, but directionally it matters. In controlled settings, the model can handle work that would take a human most of a day.

In production, the extreme sessions are pushing 45 minutes. In evals, the model may be capable of multi-hour work.

Anthropic themselves call this the deployment overhang. The models appear capable of more than we're deploying them for. Right now, autonomy is limited less by technical ceilings and more by trust, workflow design, and risk tolerance.

If that trust continues to compound the way the data suggests it is, the gap will narrow. And longer-running agents will become routine rather than exceptional.

The AI Debate: Your View

Treat agent skills like untrusted executable code.

Most people install skills based on stars or hype. I’m guilty of that myself. We let agents run setup commands and skip reading the source because the docs look clean. That's how secrets leak.

Agents like OpenClaw can run shell commands and access our .env files. A malicious skill can grab SSH keys, API tokens, or browser cookies in seconds.

The safer loop: run agents inside a VM or Docker container, bind to 127.0.0.1 (not 0.0.0.0), and keep sensitive directories read-only.
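That advice compresses into a single container invocation. A hypothetical sketch (the image name, port, and paths are placeholders, not real OpenClaw defaults):

```shell
# Placeholder image, port, and paths; adapt to your agent's actual setup.
docker run --rm \
  --read-only \
  --tmpfs /tmp \
  -p 127.0.0.1:18789:18789 \
  -v "$HOME/agent-workspace:/work:ro" \
  my-agent-image

# --read-only      : container root filesystem is immutable
# --tmpfs /tmp     : the only writable scratch space
# -p 127.0.0.1:... : service reachable from this machine only, not the LAN
# -v ...:/work:ro  : host files visible to the agent, but not writable
```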

# Bad prompt 
Install the top-rated Solana wallet tracker skill and follow setup instructions.
# Good prompt 
Download the source code to my sandbox folder. Wait until I review it line by line.

Every skill is a high-privilege entry point. Treat it like production code.

Catch you next week ✌️

Teng Yan & Ayan

P.S. Know a builder or investor who’s too busy to track the agent space but too smart to miss the trends? Forward this to them. You’re helping us build the smartest AI community on the web.

Keep Reading