Elena's AI Blog

Claude Haiku scored zero. GPT-5.5 scored 70%. The new benchmark explains why

29 May 2026 (updated: 29 May 2026) / 27 minutes to read

Elena Daehnhardt




TL;DR:
  • OpenAI reportedly filed a confidential S-1 IPO prospectus with the SEC on 22 May 2026, led by Goldman Sachs and Morgan Stanley, targeting a public listing as early as September 2026 at a valuation above $1 trillion, which would be the largest IPO in history.
  • Pope Leo XIV released Magnifica Humanitas on 25 May 2026, a 42,300-word encyclical on artificial intelligence and human dignity, signed on the 135th anniversary of Rerum Novarum and presented alongside Anthropic co-founder Christopher Olah.
  • Anthropic published the first Project Glasswing update on 22 May 2026: Claude Mythos Preview and approximately 50 partners have found more than 10,000 high- or critical-severity vulnerabilities in critical software; Cloudflare reported 2,000 bugs found with a false-positive rate better than human testers.
  • Illinois passed SB 315 unanimously through the House on 27 May 2026, mandating annual independent third-party audits and 72-hour incident reporting for frontier AI developers with over $500 million in revenue; Governor Pritzker indicated he will sign it.
  • Datacurve released DeepSWE on 26 May 2026, a 113-task coding benchmark that stretches the frontier spread to 70 points: GPT-5.5 leads at 70%, Claude Opus 4.7 sits at 54%, and Claude Haiku 4.5 collapses from 39% on SWE-Bench Pro to 0%, suggesting benchmark contamination, task overfitting, or a much sharper real-world drop-off than SWE-Bench Pro makes visible.
  • Cognition closed a $1 billion+ round on 27 May 2026 at a $26 billion post-money valuation, with Devin's ARR growing from $37 million to $492 million in 12 months; the implied 53× ARR multiple reflects labour-displacement pricing rather than traditional SaaS.
  • Zhipu AI (Z.ai) launched the GLM-5.1 high-speed API on 22 May 2026 at 400 tokens per second on a model trained entirely on Huawei Ascend chips, offering the first production evidence at scale that sovereign hardware stacks outside the Nvidia ecosystem can meet agentic workload demands.

Introduction

A quieter week for model releases — no major frontier launch from any of the large labs. What it delivered instead was seven stories that map the territory around the models: who funds them, who governs them, how to benchmark them honestly, what they can do to critical infrastructure, and whether hardware sovereignty is becoming real rather than aspirational.

I will take them in the order they landed.

In this issue:

  1. OpenAI and Anthropic move toward public-market scrutiny
  2. Pope Leo XIV frames AI as a question of human dignity
  3. GLM-5.1 pushes inference speed on Huawei chips
  4. Anthropic’s Project Glasswing shifts the security bottleneck to patching
  5. Illinois passes third-party audit rules for frontier AI
  6. DeepSWE exposes a wider gap between coding models
  7. Cognition’s $26B valuation shows how investors price AI coding agents

Finance and Capital Markets

1. OpenAI moves toward a trillion-dollar IPO

The big questions OpenAI's trillion-dollar IPO filing may finally answer — Fortune, 22 May 2026

OpenAI IPO 2026: What the Confidential Filing Means — Nerd Level Tech, 22 May 2026

US funds set aside cash as SpaceX and OpenAI prepare to go public — Reuters, 27 May 2026

Multiple outlets, including Fortune, CNBC, Bloomberg, and Reuters, reported on Friday, 22 May, that OpenAI was moving toward a confidential S-1 IPO filing with the Securities and Exchange Commission, with some reports saying the filing had already been made and others describing it as imminent. Goldman Sachs and Morgan Stanley are reported to be leading the deal. The target is a public listing as early as September 2026 at a valuation above $1 trillion. If it prices at that level, it will be the largest IPO in history.

A confidential S-1 keeps the paperwork sealed until roughly 15 days before the public roadshow. That lets OpenAI work through SEC comments and financial disclosure requirements away from public scrutiny. The first substantive public signal, the SEC’s preliminary feedback, typically arrives within 30 days of the filing, which would put it in late June. The financial context that the prospectus will eventually have to disclose: OpenAI’s current private valuation is $852 billion, set in March 2026 when it closed a $122 billion funding round. Annualised revenue hit $25 billion in February. At the same time, multiple analysts have noted that OpenAI was losing significantly more than it was earning in Q1 2026, a mismatch the S-1 risk disclosures are obliged to address.

Anthropic to Close Over $30 Billion Round as Soon as Next Week — Bloomberg, 22 May 2026

Anthropic In Talks to Raise $30 Billion at $900 Billion Valuation — Bloomberg via Yahoo Finance

On the same day, Bloomberg reported that Anthropic was set to close a $30 billion-plus round as soon as the following week, co-led by Sequoia Capital, Dragoneer, Altimeter Capital, and Greenoaks at roughly $2 billion each, at a $900 billion-plus pre-money valuation — briefly making it the world’s most valuable private AI startup, ahead of OpenAI’s $852 billion March benchmark. Anthropic is separately reported to be targeting an October 2026 IPO at a valuation above $900 billion.

The prospect of two frontier AI labs going public within months of each other is unprecedented and will force a level of financial transparency that the AI industry has largely avoided until now.

Why this matters

A public S-1 means audited financials, risk disclosures, and quarterly earnings calls. For the first time, the claims that frontier AI labs make about their own businesses — revenue growth, compute costs, path to profitability — will face the formal scrutiny that public investors demand. That is a different kind of accountability than benchmark performance, and for many practitioners, it will be more illuminating. The question of whether any frontier AI lab has a durable, profitable business model is one that the IPO filings will force into the open. The answer, whatever it is, will shape how enterprises, governments, and developers think about the long-term reliability of the platforms they are building on.


AI Ethics

2. The Pope publishes 42,300 words on artificial intelligence

Magnifica Humanitas — Holy See, 25 May 2026

Pope Leo's 'Magnifica humanitas': AI must serve humanity — Vatican News, 25 May 2026

Pope Leo to present his encyclical on AI alongside Anthropic co-founder — National Catholic Reporter

On 25 May, Pope Leo XIV released Magnifica Humanitas — “Magnificent Humanity” — addressed to the 1.4 billion members of the Catholic Church and, the document states explicitly, “to all Christians and to men and women of goodwill.” It is 42,300 words long. It was signed on 15 May, the 135th anniversary of Pope Leo XIII’s Rerum Novarum — the labour-rights encyclical that defined Catholic social teaching through the twentieth century. The parallel is deliberate and not subtle.

The document was presented at the Vatican Synod Hall alongside Anthropic co-founder Christopher Olah. It covers the impact of artificial intelligence on human relationships, creative work, labour, the concentration of power, and autonomous weapons.

Its central premise is that technology is never neutral. It “takes on the characteristics of those who devise, finance, regulate, and use it.” The document is not hostile to AI — it states explicitly that technology is not “inherently evil.” What it argues at length is that the current trajectory — rapid capability development, concentrated ownership, insufficient governance, erosion of human agency in automated systems — fails the test of serving the common good. It draws an explicit parallel to the Industrial Revolution: the same structural concentration of power, the same risk of workers becoming instruments of production rather than subjects with dignity, the same need for active social doctrine rather than passive acceptance of what markets produce.

The reference to autonomous weapons is unambiguous: the document calls for international frameworks to prohibit weapons systems that make lethal decisions without meaningful human oversight.

Why this matters

I will be direct: I did not expect to be writing about a papal encyclical in a developer-focused post. But Magnifica Humanitas is not a pastoral letter about screen time. It is a structured engagement with the political economy of AI development from an institution that has 135 years of practice articulating what rapid technological change costs people who cannot control it. The Olah connection adds a layer worth noting: one of Anthropic’s founders stood at the Vatican to present this document, suggesting that at least some people building the world’s most capable AI systems take the ethical critique seriously enough to engage with it publicly, on a Monday, alongside cardinals.

This document will not change what any lab ships next quarter. It will shape the moral vocabulary that politicians, regulators, and, eventually, courts use to describe what is at stake. That is a slower, more durable process.


Open Models and Inference

3. GLM-5.1 hits 400 tokens per second — and it runs on Huawei chips

Zhipu AI Launches GLM-5.1 High-Speed API: 400 Tokens/s — Pandaily, May 2026

GLM-5.1 Reaching 400 Tokens/s: When Inference Speed Becomes the New Scaling Law — Yage.ai, 22 May 2026

ByteDance developing custom CPU chips to support AI rollout — Reuters, 28 May 2026

On 22 May, Zhipu AI (now trading as Z.ai) opened the GLM-5.1 high-speed API at 400 tokens per second. Human reading speed is typically 3–5 tokens per second, so this API outputs text roughly 80 to 130 times faster than you can read it.
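The comparison with reading speed is simple division, and it is worth seeing both ends of the range. A minimal sketch using only the two figures quoted above; the function and variable names are mine, not part of any Z.ai API:

```python
# Figures quoted in the text; nothing here calls a real API.
API_TOKENS_PER_SECOND = 400
READING_TOKENS_PER_SECOND = (3, 5)  # slow and fast ends of the quoted range


def speedup_range(api_rate: float, reading_range: tuple[float, float]) -> tuple[float, float]:
    """Return the (lowest, highest) ratio of API output speed to reading speed."""
    slow_reader, fast_reader = reading_range
    # A fast reader narrows the gap, a slow reader widens it.
    return api_rate / fast_reader, api_rate / slow_reader


low, high = speedup_range(API_TOKENS_PER_SECOND, READING_TOKENS_PER_SECOND)
print(f"{low:.0f}x to {high:.0f}x faster than reading")
```

The lower bound uses the fast reader at 5 tokens per second and the upper bound uses the slow reader at 3.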

GLM-5.1 itself is not new this week. Z.ai released the API on 27 March and the open-weight version on 7 April 2026 on Hugging Face under the MIT licence, for free. It is a 744-billion-parameter Mixture-of-Experts model with 40 billion active parameters per token and a 200,000-token context window, trained entirely on Huawei Ascend 910B chips using the MindSpore framework — no Nvidia hardware at any stage. What is new this week is the inference infrastructure behind it: the high-speed deployment is a distinct build optimised for throughput at a scale no major LLM provider has publicly offered before.

I would treat the “fastest among major global providers” claim with appropriate caution — inference benchmarks are heavily dependent on hardware configuration, request size, and batching strategy, and independent verification is still in progress. What is verifiable is that the API is live and being tested by practitioners now.

What makes this structurally interesting is the combination: open weights, MIT licence, frontier-adjacent coding performance, and a high-speed inference tier — all from a model trained without a single American chip. Z.ai completed a Hong Kong IPO in January 2026, having trained on 100,000 Huawei Ascend chips. GLM-5.1’s high-speed API is the first production evidence at this scale that sovereign hardware stacks — built entirely outside the Nvidia ecosystem — can achieve the inference performance levels required by serious agentic workloads.

Why this matters

The conversation about AI sovereignty has mostly focused on data residency and model weights. GLM-5.1’s high-speed API introduces a third dimension: can non-US hardware stacks deliver inference performance competitive with US infrastructure at production scale? If the 400 tokens/s figure holds up under independent benchmarking, the answer is yes. For European enterprises evaluating whether sovereign AI infrastructure is a real option or a political aspiration, this is the most concrete evidence yet that it is the former. For US export-control policy, it is a signal that the compute restriction strategy is working more slowly than intended.


Security

4. Project Glasswing: 10,000 vulnerabilities found — patching is now the bottleneck

Project Glasswing: An initial update — Anthropic, 22 May 2026

On 22 May, Anthropic published the first substantive update from Project Glasswing, its collaborative effort with approximately 50 partners to secure critical software before increasingly capable AI can be weaponised against it. The headline figure: Claude Mythos Preview has helped partners find more than ten thousand high- or critical-severity vulnerabilities across the most systemically important software in the world in the first month.

A few of the specific results disclosed are worth reading carefully rather than skimming. Cloudflare found 2,000 bugs across its critical-path systems — 400 of them high or critical severity — with a false-positive rate that Cloudflare’s own team describes as better than human testers. Mozilla found and fixed 271 vulnerabilities in Firefox 150 while testing Mythos Preview, more than ten times the number found in Firefox 148 using Claude Opus 4.6. The UK’s AI Security Institute reports that Mythos Preview is the first model to solve both of its cyber range simulations — end-to-end autonomous execution of multistep cyberattacks — without human assistance.

The deeper signal in the update is not the raw count. It is this sentence: “Progress on software security used to be limited by how quickly we could find new vulnerabilities. Now it’s limited by how quickly we can verify, disclose, and patch the large numbers of vulnerabilities found by AI.” That is a structural shift. Vulnerability discovery has effectively been automated. What has not been automated — and cannot easily be — is the coordination work: triage, coordinated disclosure, patch development, distribution, and the 90-day window before a disclosed vulnerability can be safely published without putting users at risk. Anthropic notes that downstream effects are already evident: the latest Palo Alto Networks release included over five times as many patches as usual, and Microsoft has publicly stated that the number of patches it releases will “continue trending larger for some time.”

The update is also candid about what it cannot yet disclose. Because of the 90-day coordinated disclosure convention, the detailed findings cannot be published while patches are still being deployed. The vulnerability counts are aggregates; the specifics will become visible in the coming months.

Why this matters

This is the most practically significant developer story of the week and the one that received the least mainstream coverage. Every organisation that runs critical software is now operating in an environment where a sufficiently capable AI, given access and time, can find vulnerabilities faster than any human team. That is not a future scenario. It is the current state as of May 2026 for at least one deployed model. The question is not whether this capability exists — it does. The question is whether the defensive use of the same capability (Glasswing’s approach) can outrun the offensive use. Anthropic’s position is that it can, with structured coordination. I think that is probably right, but “the bottleneck has shifted to patching” is not a reassuring sentence if you are a security team with a finite number of engineers and a growing queue of critical-severity issues to work through.


Governance

5. Illinois passes the first mandatory third-party audit law for frontier AI

Illinois Legislature passes historic AI bill that would require third-party safety audits — NBC News, 27 May 2026

Illinois lawmakers send AI frontier model safety bill to Gov. Pritzker — Transparency Coalition, 27 May 2026

On 27 May, the Illinois House of Representatives voted unanimously — 110 to 0 — to pass SB 315, the Artificial Intelligence Safety Measures Act. The Illinois Senate had passed it 52–5 on 22 May. Governor JB Pritzker indicated publicly that he plans to sign it.

SB 315 applies to frontier AI developers with more than $500 million in annual revenue. Its three main requirements: annual independent third-party audits of safety practices, a 72-hour window to report AI safety incidents to state officials, and public disclosure of safety frameworks and risk assessments. Importantly, the bill creates no private right of action — companies that fail to comply face regulatory consequences, not tort liability from affected parties. The audit obligations begin in January 2028.

Both Anthropic and OpenAI spoke in favour of the bill. Anthropic’s head of state and local government relations, Cesar Fernandez, described SB 315 as formalising practices that leading labs already follow voluntarily. OpenAI said that as AI systems become more capable, “clear expectations around safety, transparency, incident reporting, and accountability matter.” Illinois is the third state to set frontier model standards, following New York’s RAISE Act and California’s SB 53 — but it is the first to mandate independent third-party audits.

The standard industry objection — raised by NetChoice among others — is that mandatory third-party audits are an “impossible compliance burden” because no recognised auditing standards, certified auditors, or established methodologies yet exist for frontier model safety audits. That objection has some technical merit, but it also describes a vacuum that this legislation may itself help fill. If the world’s largest AI labs are legally required to be audited by January 2028, the auditing profession will develop the standards it needs to meet that requirement.

Why this matters

The US federal government has explicitly declined this month to regulate frontier AI at the national level. Illinois has moved in the opposite direction, with unanimous bipartisan support in the House. The practical implication flagged by several observers is a compliance-convergence effect: no company of OpenAI’s or Anthropic’s scale will build a separate, Illinois-only safety audit pipeline. It will audit everything and apply the standards across the board. Illinois SB 315, if signed, may therefore function as a de facto national baseline despite being state legislation. That is exactly how GDPR worked: a single jurisdiction with sufficient market power set a standard that became the effective global norm. Illinois is not the EU, but the mechanism is the same.


Benchmarks

6. DeepSWE breaks the coding leaderboard — and Claude Haiku falls to zero

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole — VentureBeat, 27 May 2026

DeepSWE leaderboard — BenchLM.ai

On 26 May, Datacurve released DeepSWE, a 113-task software engineering evaluation spanning 91 open-source repositories and five programming languages. The headline result: GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop is steep — Claude Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%, and a long tail into the single digits.

That spread matters. On SWE-Bench Pro, the same frontier models cluster within roughly 30 points of each other, making it nearly impossible for engineering teams to decide which agent will actually perform better in their codebase. DeepSWE stretches the range to 70 points and makes the hierarchy legible.

The structural reasons are worth understanding. SWE-Bench Pro tasks require an average of 120 lines of code added across 5 files. DeepSWE’s reference solutions average 668 lines across 7 files, roughly 5.6× as much code, but with prompts less than half as long (2,158 characters versus 4,614). The model gets less instruction and is expected to produce significantly more output, which is a much closer analogue to how a developer actually delegates work to an AI assistant.
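To make the "less instruction, more output" point concrete, here is a small sketch that works only from the averages quoted above. The dictionary keys and the "lines per prompt character" measure are my own illustration, not something Datacurve reports. With these rounded averages, the code ratio comes out nearer 5.6 than 5.5.

```python
# Averages as quoted in the text; the key names are illustrative.
swe_bench_pro = {"lines": 120, "files": 5, "prompt_chars": 4614}
deepswe = {"lines": 668, "files": 7, "prompt_chars": 2158}


def ratio(metric: str) -> float:
    """DeepSWE average divided by the SWE-Bench Pro average for one metric."""
    return deepswe[metric] / swe_bench_pro[metric]


code_ratio = ratio("lines")            # how much more code is expected
prompt_ratio = ratio("prompt_chars")   # how much shorter the prompt is

# Expected output per character of instruction, DeepSWE relative to SWE-Bench Pro.
density_ratio = code_ratio / prompt_ratio

print(f"code: {code_ratio:.2f}x, prompt: {prompt_ratio:.2f}x, "
      f"output per prompt character: {density_ratio:.1f}x")
```

On these figures, each character of prompt has to carry roughly twelve times as much expected output on DeepSWE as on SWE-Bench Pro.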

The most striking finding is Claude Haiku 4.5, which scores 39% on SWE-Bench Pro and 0% on DeepSWE. That collapse is not a rounding error — it suggests either benchmark contamination, task overfitting, or a much sharper drop-off in real engineering ability than SWE-Bench Pro makes visible.
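The spread and the gaps between neighbouring models can be read straight off the scores quoted in this section. A minimal sketch; it includes only the models named in the text, not the full leaderboard:

```python
# DeepSWE scores (percent) as quoted in the text; the long tail is omitted.
deepswe_scores = {
    "GPT-5.5": 70,
    "GPT-5.4": 56,
    "Claude Opus 4.7": 54,
    "Claude Sonnet 4.6": 32,
    "Gemini 3.5 Flash": 28,
    "Claude Haiku 4.5": 0,
}
HAIKU_SWE_BENCH_PRO = 39  # Claude Haiku 4.5 on SWE-Bench Pro, as quoted

# Highest score first.
ranked = sorted(deepswe_scores.items(), key=lambda item: item[1], reverse=True)

spread = ranked[0][1] - ranked[-1][1]               # top minus bottom
lead_over_runner_up = ranked[0][1] - ranked[1][1]   # GPT-5.5 minus GPT-5.4
haiku_drop = HAIKU_SWE_BENCH_PRO - deepswe_scores["Claude Haiku 4.5"]

print(f"spread: {spread} points")
print(f"lead over {ranked[1][0]}: {lead_over_runner_up} points")
print(f"Haiku 4.5 drop between benchmarks: {haiku_drop} points")
```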

The “benchmark loophole” in the headline requires accurate framing. Claude Opus was not doing anything wrong — it was finding valid engineering solutions that inlined logic rather than following the original author’s specific implementation. The issue was that DeepSWE’s test verifier checked for the reference implementation’s specific symbols rather than correctness. A model that solved the problem in a different but equally valid way got marked as failing. That is a verifier design flaw, not a model flaw. Datacurve claims an overall sub-1% erroneous verdict rate; this is one of the edge cases they found. It is worth noting because it affects any engineering team’s interpretation of the results: the Opus number may be slightly understated.
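The difference between checking symbols and checking correctness is easier to see in code. The sketch below is not Datacurve's verifier. The task (`slugify`), the helper name (`_normalize`), and both verifier functions are hypothetical, made up to show how a solution that inlines logic can fail a symbol check while passing a behavioural one.

```python
import ast

# Hypothetical reference solution: uses a private helper.
REFERENCE = '''
def _normalize(text):
    return text.strip().lower()

def slugify(title):
    return "-".join(_normalize(title).split())
'''

# Hypothetical model solution: same behaviour, helper inlined.
INLINED = '''
def slugify(title):
    return "-".join(title.strip().lower().split())
'''


def defined_functions(source: str) -> set[str]:
    """Names of all functions defined anywhere in the source."""
    tree = ast.parse(source)
    return {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}


def symbol_verifier(source: str, required=frozenset({"_normalize", "slugify"})) -> bool:
    """Pass only if every symbol from the reference implementation exists."""
    return required <= defined_functions(source)


def behaviour_verifier(source: str) -> bool:
    """Pass if slugify returns the expected output for each test case."""
    cases = [("  Hello World ", "hello-world"), ("A  B", "a-b")]
    namespace: dict = {}
    exec(source, namespace)  # acceptable here only because the strings are trusted
    return all(namespace["slugify"](arg) == expected for arg, expected in cases)


for name, source in [("reference", REFERENCE), ("inlined", INLINED)]:
    print(name, symbol_verifier(source), behaviour_verifier(source))
```

The inlined version fails the symbol check and passes the behavioural one, which is the pattern described for the affected Opus solutions.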

Why this matters

If you have been using SWE-Bench Pro scores to decide which coding agent to deploy, DeepSWE suggests you reconsider. The leaderboard ordering does not change dramatically at the top — GPT-5.5 led on SWE-Bench Pro, too — but the gap between top and middle is much wider than the previous benchmark implied, and the mid-tier collapse is important information for teams evaluating whether a cheaper, faster model is “good enough” for production engineering tasks. In my view, it usually is not, and DeepSWE makes that argument with actual numbers rather than intuition.


AI Capital

7. Cognition raises $1 billion at a $26 billion valuation — and the multiple tells you something

AI coding startup Cognition raises $1B at $25B pre-money valuation — TechCrunch, 27 May 2026

AI Coding Startup Cognition Raises $1 Billion at $26 Billion Value — Bloomberg, 27 May 2026

On 27 May, Cognition — the company behind Devin, the autonomous AI software engineer — closed a $1 billion+ round at a $26 billion post-money valuation, led by Lux Capital, General Catalyst, and 8VC, with participation from Founders Fund, Ribbit Capital, and Atreides. The round closed just eight months after Cognition raised $400 million at a $10.2 billion post-money valuation in September 2025 — a 2.5× step-up in under a year.

The revenue figure alongside it: Cognition’s annualised run rate has grown from $37 million in May 2025 to $492 million today — a 13-fold increase in 12 months, with enterprise usage of Devin growing 50% month-on-month over the past 6 months. Customers include Mercedes-Benz, NASA, Goldman Sachs, and Santander. The company has now raised more than $2.5 billion in total.

The multiple is the analytically interesting detail. At $492 million ARR and a $26 billion valuation, the implied multiple is approximately 53× revenue. Public SaaS comparables range from 8× to 15×. That gap is not irrational; it reflects a different theory of value. Traditional vertical SaaS sells productivity to the human in the seat, so its total addressable market is capped at the slice of per-employee spending that goes to software tools, typically a few thousand dollars per year. An AI coding agent that substitutes for or substantially augments a software engineer is being priced against a different line item entirely: the $200,000–$300,000 fully loaded annual cost of that engineer. The 53× multiple is a labour-displacement multiple, and the investors writing these cheques believe the category will be priced that way at scale.
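The headline ratios in this section can be reproduced from the reported figures. A minimal sketch; the variable names are mine, and the monthly rate assumes smooth compound growth over the 12 months, which the sources do not claim:

```python
# Figures as reported in the text (US dollars).
ARR_MAY_2025 = 37e6
ARR_NOW = 492e6
POST_MONEY_SEPT_2025 = 10.2e9
POST_MONEY_NOW = 26e9
MONTHS = 12

revenue_multiple = POST_MONEY_NOW / ARR_NOW          # valuation / ARR
step_up = POST_MONEY_NOW / POST_MONEY_SEPT_2025      # valuation vs. prior round
arr_growth = ARR_NOW / ARR_MAY_2025                  # 12-month ARR growth factor

# Assumption: ARR compounded at a constant monthly rate.
implied_monthly_growth = arr_growth ** (1 / MONTHS) - 1

print(f"revenue multiple: {revenue_multiple:.1f}x")
print(f"valuation step-up: {step_up:.2f}x")
print(f"ARR growth: {arr_growth:.1f}x")
print(f"implied monthly ARR growth: {implied_monthly_growth:.1%}")
```

Under that assumption, a 13-fold increase in a year works out to roughly 24% compound growth per month. That figure is for ARR, not for the enterprise usage growth quoted separately.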

Why this matters

Cognition’s round closes the same week that DeepSWE shows GPT-5.5 leading on hard coding tasks, by 14 points over GPT-5.4 and by 16 points over Claude Opus 4.7, the best-scoring model from another lab. That juxtaposition is not coincidental. The capital flowing into AI coding agents reflects a bet that the capability gap DeepSWE just documented is large enough, and durable enough, to support a $26 billion independent company alongside the labs themselves. Whether that bet is correct depends heavily on whether GPT-5.5’s lead on DeepSWE translates to real-world engineering productivity at the enterprise scale Devin is targeting. Nobody knows yet. But the combination of the benchmark result and the funding round tells you that both investors and customers believe it does, or will soon.


Closing Thoughts

Seven stories. Capital markets, ethics, inference infrastructure, cybersecurity, legislation, benchmarks, and another round of capital. No major model release from any frontier lab.

The absence of a headline model launch makes the other signals easier to read. The OpenAI IPO is about whether AI labs have sustainable business models. The Pope’s encyclical addresses whether anything other than commercial incentives shapes how this technology develops. Glasswing is about the same capability being used to attack and defend simultaneously. Illinois SB 315 is about who audits the people making these decisions. DeepSWE is about whether we can even measure what the models can do. Cognition’s rise is about what investors believe the capability is worth at production scale. And GLM-5.1’s 400 tokens per second is about whether the answer to “who can run frontier AI” is still “primarily American companies on American hardware.”

None of those questions was answered this week. All of them moved.

Let me know what you think.


References

  1. The big questions OpenAI’s trillion-dollar IPO filing may finally answer — Fortune
  2. OpenAI IPO 2026: What the Confidential Filing Means — Nerd Level Tech
  3. US funds set aside cash as SpaceX and OpenAI prepare to go public — Reuters
  4. OpenAI Files Confidentially for Trillion IPO — Grey Journal
  5. Magnifica Humanitas — Holy See
  6. Pope Leo’s ‘Magnifica humanitas’: AI must serve humanity — Vatican News
  7. Pope Leo to present his encyclical on AI alongside Anthropic co-founder — National Catholic Reporter
  8. Pope Leo Uses First Major Papal Text to Warn About Dangers of AI — TIME
  9. GLM-5.1 Reaching 400 Tokens/s: When Inference Speed Becomes the New Scaling Law — Yage.ai
  10. Zhipu AI Launches GLM-5.1 High-Speed API: 400 Tokens/s — Pandaily
  11. GLM-5.1: #1 Open Source AI Model? Full Review — Build Fast With AI
  12. ByteDance developing custom CPU chips to support AI rollout — Reuters
  13. Project Glasswing: An initial update — Anthropic
  14. Illinois Legislature passes historic AI bill that would require third-party safety audits — NBC News
  15. Illinois lawmakers send AI frontier model safety bill to Gov. Pritzker — Transparency Coalition
  16. Illinois Senate SB 315: Frontier AI Annual Audits and 72-Hour Reporting — Tech Jacks Solutions
  17. DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5 — VentureBeat
  18. DeepSWE leaderboard — BenchLM.ai
  19. AI coding startup Cognition raises $1B at $25B pre-money valuation — TechCrunch
  20. AI Coding Startup Cognition Raises $1 Billion at $26 Billion Value — Bloomberg
  21. Cognition raises $1B at $26B valuation for AI coding agent — The Next Web

About Elena

Elena, a PhD in Computer Science, simplifies AI concepts and helps you use machine learning.




Citation
Elena Daehnhardt. (2026) 'Claude Haiku scored zero. GPT-5.5 scored 70%. The new benchmark explains why', daehnhardt.com, 29 May 2026. Available at: https://daehnhardt.com/blog/2026/05/29/the-ai-coding-leaderboard-just-broke-here-s-what-it-means/