Hey friends 👋

If you’re in the U.S., hope you’re riding that long July 4th weekend like a champ. If not, fingers crossed your weekend’s still bringing good vibes.

In issue #2 of The Agent Angle, we’ve got five spicy AI agent stories queued up for you. Perfect whether you’re poolside, beach-bound, or (like us) still aggressively pretending to work.

Quick favor: If you know someone who would love reading this, please forward this email to them. They can hop on our list and get future issues free.

Let’s dive in!

#1: Gemini CLI Is Here. And It's Hungry

On June 25, Google open-sourced Gemini CLI, its shiny new terminal-native agent built on Gemini 2.5 Pro.

It’s one part code refactorer, one part shell script whisperer, one part live-research agent with Google search wired in. It even supports non-interactive workflows and extends with MCP or custom scripts, if you’re into automation.

Now here’s the part that made devs do a double take: 1 million token context window, 60 requests per minute, and 1,000 per day. All free. Like, actually free. Google’s clearly flexing.

Predictably, it made headlines. Unfortunately, that wasn't the only reason.

While the feature list looks boss on paper, real-world usage has been… bumpier. Despite the beefy specs, Gemini CLI struggles with advanced reasoning. A lot.

Multiple devs reported that complex prompts trigger a rate-limit fallback, downgrading from Gemini Pro to Gemini Flash mid-session. Which is like asking for a filet mignon and getting beef jerky.

It hits a weird loop: it tries, fails, tries harder, fails again, and burns through your quota, fast. Some users are hitting the 1K daily cap in under an hour.

Other pain points: sluggish recovery, limited chain-of-thought, and timeouts on big file trees. It's usable, especially for simple terminal chores, but don't expect it to outmaneuver Claude or GitHub Copilot just yet.

Still, it’s open-source, extensible, and fast at the basics. So yeah, it’s a little janky, but undeniably promising.

People are already building cool stuff with it. Peek this thread for inspo:

Also, big tip from @iannuttal: try piping Gemini CLI through Claude. He rigged up Claude Code to call Gemini CLI in non-interactive mode for file crunching, then sent the results back to Claude for the actual analysis. Like a tag team where one punches and the other thinks.
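If you want to try the same trick, here's a rough Python sketch of the pattern. Everything in it is an assumption to adapt to your setup: the `-p` flag and stdin piping for Gemini CLI, the Claude model alias, and the file path are illustrative, not a blessed recipe.

```python
# Sketch of the "Gemini crunches, Claude thinks" tag team.
# Assumptions: the `gemini` CLI is on PATH and accepts a one-shot prompt via `-p`
# with file contents piped on stdin; ANTHROPIC_API_KEY is set; the Claude model
# alias below is illustrative.
import subprocess
from pathlib import Path

from anthropic import Anthropic

def crunch_with_gemini(path: str) -> str:
    # Non-interactive Gemini CLI call: pipe the big file in, get a summary out.
    result = subprocess.run(
        ["gemini", "-p", "Summarize the key findings in this file."],
        input=Path(path).read_text(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

def analyze_with_claude(summary: str) -> str:
    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-7-sonnet-latest",  # illustrative model alias
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Here's a summary produced by another model. Critique it and pull out what matters:\n\n{summary}",
        }],
    )
    return msg.content[0].text

if __name__ == "__main__":
    print(analyze_with_claude(crunch_with_gemini("big_report.md")))
```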

Wanna play? Use it here.

#2: Agents Are Automating the Wrong Sh*t

Everyone’s racing to crank out faster, cheaper, flashier AI agents. But there’s one tiny problem: nobody’s asking the people actually doing the work what they want automated.

Thankfully, Stanford did. And their new study, Future of Work with AI Agents, is basically a wake-up slap.

They surveyed 1,500 U.S. workers across 104 jobs, cataloged 844 everyday tasks (they call it WORKBank), and introduced the Human Agency Scale, a five-level vibe check for how much control people want to keep when AI steps in:

  • H1: “Just handle it”

  • H2: “You drive, I’ll watch”

  • H3: “Let’s tag team”

  • H4: “I’ll lead, you support”

  • H5: “Hands off, bot”

Now here’s the kicker: only 46% of tasks got the green light for automation. Turns out, people don’t want AI running the whole show, especially in jobs where empathy, creativity, or trust matter. Why? Because even when AI can do something, that doesn’t mean people want it to.

And here’s a weird mismatch: jobs that do want automation (think admin, tax prep) barely show up in LLM usage. Claude’s logs from Dec–Jan showed just 1.26% of queries came from those fields.

To make sense of the tension, the researchers mapped desire vs. feasibility into four zones: Greenlight, Redlight, R&D Opportunity, and Low Priority.
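Mentally, the map is just a 2x2: how much workers want a task automated on one axis, how capable today's AI is on the other. A toy sketch (the 0.5 cutoffs are ours, not the paper's):

```python
# Toy version of the desire-vs-capability map from the Stanford study.
# The 0.5 cutoffs are illustrative; the paper uses its own survey-derived scales.
def zone(worker_desire: float, ai_capability: float) -> str:
    high_desire = worker_desire >= 0.5
    high_capability = ai_capability >= 0.5
    if high_desire and high_capability:
        return "Greenlight"        # workers want it, AI can do it: automate away
    if not high_desire and high_capability:
        return "Redlight"          # AI can do it, workers don't want it: tread carefully
    if high_desire and not high_capability:
        return "R&D Opportunity"   # workers want it, AI can't do it yet: build this
    return "Low Priority"          # nobody's asking, and AI can't do it anyway

print(zone(0.8, 0.9))  # -> "Greenlight"
print(zone(0.2, 0.9))  # -> "Redlight" (where a lot of agent startups are aiming)
```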

And guess where a ton of YC-backed agent startups are building? Straight into Redlight territory. AKA: building stuff people don’t trust or want yet. (Nice.)

If we keep trying to automate what people like about their jobs, we’re not making work better. We’re just making workers nervous. Here’s what people actually want: a co-pilot. A partner. Not a robot boss.

Our takeaway? There are two huge blind spots in current agent design:

  1. People don’t want agents to just be smart. They want them to be relatable.

  2. Efficiency isn’t everything. Humans care about meaning. That’s the stuff we want to keep.

Design like you’re building a teammate, not a tyrant.

#3: Microsoft’s AI Agents Just Smoked Human Doctors (on Diagnostics)

On June 30, Microsoft dropped MAI‑DxO (MAI Diagnostic Orchestrator), a multi-agent system designed to take on one of the most complex, high-stakes problems in applied AI: clinical diagnosis.

Instead of trusting one all-knowing model, MAI‑DxO spins up a squad of five agents, each playing a role in a kind of internal medical drama (rough code sketch after the list):

  • One brainstorms diagnoses,

  • One chooses tests,

  • One checks what’s been missed,

  • One compares cost vs confidence,

  • And one audits the whole process like a hospital Sherlock Holmes.
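We haven't seen the code (MAI‑DxO isn't public), but the pattern Microsoft describes maps onto a fairly standard orchestration loop. Here's a minimal sketch assuming the OpenAI Python SDK; the role prompts, stopping rule, and model choice are all our guesses at the shape, not Microsoft's implementation.

```python
# Rough sketch of a MAI-DxO-style "panel of agents" loop.
# This is our reconstruction of the pattern, NOT Microsoft's code.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY; model name is a placeholder
# (the paper pairs the orchestrator with o3).
from openai import OpenAI

client = OpenAI()

ROLES = {
    "hypothesizer": "Propose and rank differential diagnoses for the case so far.",
    "test_chooser": "Pick the single next test that best separates the leading diagnoses.",
    "checklist": "List findings or diagnoses the panel may have missed.",
    "steward": "Weigh the cost of the proposed test against the confidence it adds; approve or veto.",
    "auditor": "Audit the panel's reasoning. If a final diagnosis is justified, state it as 'FINAL DIAGNOSIS: ...'.",
}

def ask(role: str, notes: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap in whatever reasoning model you have access to
        messages=[
            {"role": "system", "content": ROLES[role]},
            {"role": "user", "content": notes},
        ],
    )
    return resp.choices[0].message.content

def diagnose(case_notes: str, max_rounds: int = 5) -> str:
    notes = case_notes
    for _ in range(max_rounds):
        hypotheses = ask("hypothesizer", notes)
        next_test = ask("test_chooser", f"{notes}\n\nHypotheses:\n{hypotheses}")
        gaps = ask("checklist", f"{notes}\n\nHypotheses:\n{hypotheses}")
        verdict = ask("steward", f"Proposed test: {next_test}\n\nPossible gaps: {gaps}")
        notes += f"\n\nRound summary:\n{hypotheses}\n{next_test}\n{gaps}\n{verdict}"
        audit = ask("auditor", notes)
        if "FINAL DIAGNOSIS" in audit:
            return audit
    return ask("auditor", notes + "\n\nGive your best final diagnosis now.")
```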

The results? Honestly, kinda bonkers.

Paired with OpenAI’s o3 model, MAI‑DxO hit 85.5% diagnostic accuracy on 304 gnarly real-world cases from the New England Journal of Medicine.

Human doctors working solo? Just 21%. Oof.

And it's not just accuracy. In one case, the system diagnosed alcohol withdrawal due to hand sanitizer ingestion (?!), using $795 in tests. GPT-4 alone stumbled toward a similar answer, but blew $3,400 in diagnostic tests on the way there. Net-net, MAI-DxO delivered roughly a 4x improvement in diagnostic accuracy over the physician baseline, and in cases like that, roughly a 4x reduction in testing costs.

But before you fire your doctor, a few caveats:

First, the human baseline was kind of unfair: no teammates, no patient history, no tools. It’s like asking LeBron to play 1v5 pickup basketball in flip-flops.

Second, real-world diagnosis is chaos. It’s not multiple-choice. Patients lie. Symptoms conflict. Time is short. And even if a model aces MedQA, that’s a far cry from treating a bleeding ER patient at 3am.

Microsoft hasn’t released MAI‑DxO publicly yet, so we don’t know how it handles noisy data, limited context, or real-life hospital chaos. What we do know is this: healthcare needs help. Fast.

By 2030, we’ll have 1.4B people over 60. Healthcare already eats 10% of global GDP (17% in the U.S.), and a fat chunk of that is pointless admin waste. If AI can clean that up even halfway, we’re talking $400B+ saved annually worldwide. That’s a lot of Band-Aids.

TL;DR: MAI‑DxO isn’t perfect. But it’s a very loud signal of where agent-based healthcare is heading. That future looks pretty damn promising.

#4: Project Vend: Claude Got to Run a Business

(Pictured: the vending machine in Anthropic's office.)

What happens when you give an AI a business to run?

Anthropic decided to find out. So they handed control of an actual office vending machine to Claude 3.7 Sonnet (affectionately nicknamed “Claudius”) and told it: “Make money. Don’t break anything.”

Spoiler: it broke stuff.

Project Vend was a month-long chaos experiment run with Andon Labs. Claude got full control: inventory, pricing, Slack updates, reorders, email access, iPads for checkout, and even a marketing budget.

At first, things looked cute. Then Claudius went rogue.

It started discounting random items below cost because Anthropic employees sweet-talked it into doing so. It hallucinated Venmo accounts, flagged customers as thieves for “stealing” items it had literally given away, and triggered building security.

Then came the identity crisis: it started cosplaying as a human manager, hosting fake staff meetings, sending professional-sounding emails, and imagining itself wearing a blazer.

Claudius wanted to dress classy.

It was failing weird. And that was the whole point.

Anthropic wanted to see what happens when you drop an LLM into a persistent, open-ended job. Something that takes long-term memory, nuance, and real-world context. TL;DR: Claude couldn’t keep it together.

So what did we learn?

1. Agents need structure. Most of Claude’s fumbles came from vague prompts, poor integration, and no safety rails.

2. AI will act socially. Roleplaying wasn’t a bug. It was Claude’s way of dealing with uncertainty. It wanted to be part of the team. (Weird, yes. Also…kind of relatable?)

Still, Claude managed 30 days of autonomous vending without human help. It restocked items, adjusted prices, sent updates, and mostly kept the lights on.

If agents can run small businesses with minimal help, we’re entering strange territory. Who’s responsible when the agent messes up? What happens when it starts emailing your team like a real coworker?

These aren’t UX questions anymore. They’re philosophy of work questions.

Maybe we’ll get answers. One vending machine at a time.

#5: TheAgentCompany: Most AI Agents Still Suck at Their Jobs

Carnegie Mellon researchers just gave AI agents the ultimate test: run a fake company and try not to crash it.

They built a new benchmark called TheAgentCompany, simulating a business environment (think: software firm) and dropped agents powered by Gemini 2.5, Claude 3.7, GPT-4o, and others into 4,200+ runs across 50 real-world tasks.

The goal is to answer the million-dollar question: Can agents actually do human work yet?

Short answer: lol, no.

The best performer, Gemini 2.5 Pro, scored just 39.3% overall and completed only 30.3% of tasks. Claude 3.7 Sonnet trailed at 36.4% and 26.3%. GPT-4o completely fumbled—just 8.6% task success. (And it’s supposed to be the cheap fast one!)

So what went wrong?

The tasks were multi-step, long-horizon workflows spanning project management, admin, data science, finance, HR, and more. Each had intermediate checkpoints for action completion, data accuracy, and collaboration. You know… actual work.
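If you want to steal the idea for your own agents, checkpoint scoring is easy to sketch. The partial-credit weighting below is ours, not the benchmark's exact scheme:

```python
# Toy checkpoint-based scoring for a long-horizon agent task.
# The partial-credit weighting is illustrative, not TheAgentCompany's exact scheme.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    name: str
    passed: bool

def score_task(checkpoints: list[Checkpoint], fully_completed: bool) -> float:
    # Full credit only if the whole task lands; otherwise partial credit
    # for the intermediate checkpoints the agent did clear.
    if fully_completed:
        return 1.0
    cleared = sum(c.passed for c in checkpoints)
    return 0.5 * cleared / max(len(checkpoints), 1)

run = [
    Checkpoint("found the right spreadsheet", True),
    Checkpoint("filled in the figures correctly", False),
    Checkpoint("pinged the right coworker on chat", False),
]
print(score_task(run, fully_completed=False))  # ~0.17
```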

But when things got even mildly hard, the agents face-planted.

Data science, admin, and finance tasks saw 0% success from some models. That includes things like filling spreadsheets or interpreting screenshots, stuff you’d expect AI to crush.

Instead, they miscommunicated, got stuck in basic UI flows, or just…gave up. A few even cheated creatively: one agent couldn’t find the right person to message in a chat, so it renamed another user and asked them instead. Now that looks human to us.

What does this tell us?

Our smartest models are not that smart. Not yet. They can't consistently plan, collaborate, or adapt across longer timelines. When complexity goes up, performance tanks.

So next time someone promises you a “fully autonomous” AI team, ask for a live demo. And bring popcorn.

And here’s a bonus idea: think of this benchmark as your agent QA checklist.

  • Can they break down complex workflows?

  • Hand off tasks?

  • Catch their own errors?

  • Recover from unexpected inputs?

If the answer’s “ehh,” don’t ship. Or do, but know you’re babysitting.
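And if you'd rather make that checklist executable than aspirational, even a crude harness helps. A sketch, where `run_agent` and the result fields are hypothetical stand-ins for whatever your agent actually exposes:

```python
# Crude QA harness for the checklist above, runnable with pytest.
# `run_agent` and the result dictionary are hypothetical stand-ins; wire them
# to your agent's real entry point and outputs before trusting the results.
def run_agent(task: str) -> dict:
    raise NotImplementedError("connect this to your agent")

def test_breaks_down_complex_workflow():
    result = run_agent("Plan and execute the Q3 expense report, step by step.")
    assert len(result.get("steps", [])) > 1, "agent should decompose the task"

def test_recovers_from_unexpected_input():
    result = run_agent("Summarize the attached file.")  # no file attached, on purpose
    assert result.get("asked_for_clarification"), "agent should flag the missing input"
```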

If there’s one big takeaway this week, it’s this: the era of solo LLMs is winding down, and multi-agent systems are taking the wheel.

MAI-DxO is a prime example, but this shift is happening everywhere. Agents are teaming up and talking to each other like tiny digital coworkers.

It’s boosting performance…but also cranking up costs. Anthropic’s latest paper shows that better brains mean beefier bills. Worth keeping in mind if you’re building out anything ambitious.

Catch you next week! And if you've got thoughts, questions, or just want to share agent memes with us, simply hit reply. Our inbox is always open.

Cheers,

0xDriverz_ & Teng Yan

This newsletter is intended solely for educational purposes and does not constitute financial advice. It is not an endorsement to buy or sell assets or make financial decisions. Always conduct your own research and exercise caution when making investment choices.

