
AI Benchmarks Are Failing; Zuck Spends $Hundreds of Billions | Intent, 0026

The state of AI benchmarks in 2025, Zuckerberg's M&A spending spree, Grok 4, and so much more!

Intent exists to help tech talent become more informed, more fluent, and more aware of the forces shaping their careers. We welcome feedback – just hit reply.

It’s been a bit since we last emailed, but as always, we’re back!

The agenda ahead

  • How to read AI benchmarks in 2025 — at a time when Grok 4 is leading Humanity’s Last Exam and ARC-AGI-2, do the benchmarks still work?

  • Why Zuck is in blank-check mode — Meta’s $20B+ July and the strategy behind the money printing madness.

  • Plus: a quick hit on venture capital

First, two quick hits

  1. Grok 4 versus o3. Elon Musk versus OpenAI. If you haven’t caught it yet, Sherveen did a teardown comparing Grok 4 with o3 using real-world prompts. And he spent $300 to try Elon’s most premium model, Grok 4 Heavy — worth it? Watch: https://youtu.be/v4JYNhhdruA

  2. Early bird special on the AI Fluency Bootcamp! If you’ve been waiting — the latest cohort of Sherveen’s live course starts August 4. Code GROKWEEK gives you 50% off if you enroll by Friday. More details: https://aimuscle.com/fluency

Benchmarks: what are they good for?

When a research lab or AI startup tweets about beating the latest benchmarks, VCs take out their wallets, enterprises sign bigger contracts, and AI influencers tweet that “this changes EVERYTHING!” But… does it?

Elon, xAI, and Grok 4 are the latest to sit at the top of the leaderboard for AI benchmarks like Humanity’s Last Exam and ARC-AGI-2. The problem we’re running into in 2025: these benchmarks are measuring tidy, lab-designed tasks, while real users are slinging ever-more complicated and unstructured prompts.

So, are benchmarks worth paying attention to moving forward? First, let’s go through the types of evaluations we’re talking about:

Humanity’s Last Exam
2,500 questions across dozens of subjects — everything from applied math to moral philosophy, curated by more than 300 subject-matter volunteers under the oversight of the Center for AI Safety.

ARC-AGI-2
Never-before-seen puzzles that are easy for humans but hard for machines, because they require symbolic interpretation (matching visual shapes), compositional reasoning (applying multiple rules at once), and contextual rule application (resolving rule conflicts based on context).

Crowd arenas (LMArena & Artificial Analysis)
Both companies let thousands of users send in a prompt, receive two anonymous model responses, and vote blind for the better output. Leaderboards are continuously updated based on user votes, cost, speed, and more.
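
For intuition on how blind votes become a ranking: LMArena builds its leaderboard from Elo-style (Bradley-Terry) ratings over user votes, and the toy Elo update below illustrates the basic mechanic. The function, starting ratings, and model names here are illustrative only, not the arena’s actual pipeline.

```python
# Toy sketch of turning blind pairwise votes into a leaderboard.
# Simplified Elo update; real arenas fit a statistical model over
# millions of votes, but the intuition is the same.

def elo_update(rating_a: float, rating_b: float, winner: str, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one blind vote.

    winner is "a", "b", or "tie".
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two hypothetical models start level; "model_x" wins one vote.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
ratings["model_x"], ratings["model_y"] = elo_update(
    ratings["model_x"], ratings["model_y"], winner="a"
)
print(ratings)  # model_x edges up, model_y drops by the same amount
```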

So, why are these evaluations breaking down as a useful way to judge the latest AI models? Three things:

  1. Whether intentionally or via data leakage (e.g., benchmark questions ending up on GitHub), models can be trained to answer these questions better — and when a model has seen that dataset during training, the benchmark is just measuring memorization, not the core skill the questions are meant to stress-test.

  2. Every benchmark works differently when it comes to how a score is achieved — and because the companies (like OpenAI and xAI) often own the press around the moment, we get biased charts that don’t necessarily tell the full story. For example, sometimes a model only does well on a benchmark in “best-of-n” mode: they sample the model n times, then submit the best answer(s) to get a score (see the sketch after this list).

  3. As we humans have gotten good at using the models, we’re now asking very different, complex questions that involve unique context and workflows. Users in LMArena might just vote for a model response that was quick and good enough, but when we’re using ChatGPT for deep research, the “edges” of a model response really matter.
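
Point 2’s “best-of-n” trick is easy to see in code. Here’s a minimal sketch; the model call and grader are hypothetical stand-ins, not any benchmark’s real harness:

```python
import random

# Hypothetical stand-ins: a real harness would call the model's API and the
# benchmark's official grader instead of returning random values.
def sample_model(prompt: str) -> str:
    return random.choice(["answer A", "answer B", "answer C"])

def grade(prompt: str, answer: str) -> float:
    return random.random()  # benchmark-specific grading would go here

def best_of_n_score(prompt: str, n: int = 8) -> float:
    """Sample the model n times and keep only the best-scoring attempt.

    This is why a "best-of-n" headline number can look much stronger than
    what a user sees from a single response.
    """
    attempts = [sample_model(prompt) for _ in range(n)]
    return max(grade(prompt, a) for a in attempts)

question = "some benchmark question"
single_try = grade(question, sample_model(question))
best_of_8 = best_of_n_score(question, n=8)
print(f"single attempt: {single_try:.2f} vs best-of-8: {best_of_8:.2f}")
```

The gap between those two numbers is the gap between a press-release chart and your day-to-day experience with a single response.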

So, how can we still make good use of the benchmarks?

  • Watch for deltas, not for trophies. If o3-pro scores 30 points higher than o3, that difference is a lot more reliable than, say, Grok 4 leading that particular benchmark.

  • Blend lab scores with arena rankings. A model that aces ARC-AGI-2 but sits third on LMArena might be inherently smart without being super useful in conversation.

  • Evaluate the vibes. Try a new model on 10 recent prompts you’ve already sent to your favorite LLM app, and see if there’s an obvious difference. Follow an AI influencer you trust (on X, Threads, or LinkedIn) and notice when they catch something you might have missed about how a model should be used.

Bottom line: treat formal exams as a model’s floor, arena voting as a model’s ceiling, and your own vibe tests (or from people you trust) as actionable truth.

Meta’s month of money mania

What just happened?

In short: in July alone, Meta committed north of $20B to kind-of buy Scale AI and pull entire AI research teams under one payroll.

How can he justify so much spend?

It’s a binary-outcomes market — if you’re a top tech company without a top model (Meta’s latest, Llama 4, was largely a disappointment), you might lose market cap, reputation, the ability to recruit new talent…

If having a flagship model is a strategic necessity, Zuckerberg should be willing to pay whatever it takes until the probability of having that model is ≈ 1.

And by acquiring teams or ‘half-acquiring’ vendors, he’s bringing scarce talent that already knows how to work together under one payroll, faster than recruiting top people one by one over time.

What could go wrong?

  1. Team bonding drag — culture is always a problem in big acquisitions, and Meta is now digesting multiple acquisitions at once under sky-high expectations.

  2. Regulatory radar — no one’s said anything yet, but the recent trend of Meta kind-of buying Scale AI and Google kind-of buying Windsurf (but kind-of not) should raise regulatory eyebrows.

  3. Integration issues — Meta is going to want to integrate any new models into multiple products in different ways, which makes driving adoption of its AI messier than, say, ChatGPT. Meta has the distribution, but using it well is a different story — Google’s struggle to drive Gemini adoption is a good example.

A quick hit on venture capital

The National Bureau of Economic Research released a super interesting paper (PDF): the authors used cell-phone tower pings to measure how long venture capitalists actually sat with startup founders for meetings and due diligence. Assuming proximity really is a good proxy for “due-diligence intensity,” they found that shorter meetings correlate with:

  • hot and messy deal environments (2021 vibes)

  • busier partners who are juggling too many boards

  • greater geographic distance and messy travel

And, surprise surprise, less diligence → more volatile returns. But the authors do rationalize the trade-off: as they put it, investors will accept higher variance in returns (aka more extreme winners, more extreme losers) if they think it means getting into the deals that matter. It’s a power-law game, after all — the mega-winners return orders of magnitude more capital than everyone else.
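
For a feel for that power law, here’s a toy simulation with entirely made-up numbers (not the paper’s data): draw heavy-tailed return multiples for 100 hypothetical deals and see how much of the total the top handful contributes.

```python
import random

# Toy illustration of power-law venture returns (made-up numbers, not the
# NBER paper's data): most deals return little, a few return enormously.
random.seed(0)

def simulate_fund(num_deals: int = 100) -> list[float]:
    # Pareto-distributed multiples: the heavy tail means rare, huge winners.
    return [random.paretovariate(1.2) - 1.0 for _ in range(num_deals)]

multiples = sorted(simulate_fund(), reverse=True)
total = sum(multiples)
top_5_share = sum(multiples[:5]) / total
print(f"top 5 of 100 deals contribute {top_5_share:.0%} of total returns")
```

In a distribution like that, missing the one mega-winner hurts far more than tolerating a few extra volatile bets, which is exactly the trade-off the authors describe.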

We dig this sort of empirical study into venture, and hope to see more of it!

What to watch next (literally)

Later today, OpenAI is expected to launch its latest agent (rumored to be a more complex version of their Operator product, integrated with Deep Research and perhaps capable of more autonomous tasks).

Sherveen will be livestreaming starting at noon ET/9am PT, going through deep dive testing of Perplexity’s new AI browser, Comet, and he’ll be watching the OpenAI announcement live when it happens at 1pm ET/10am PT. Watch and chat on YouTube!

Alright, that’s all for now!

Think a friend could use a dose of Intent? Good friends press forward.

Sent with Intent,
Free Agency