Firefox 423-Patch Surge Makes AI Bug Hunting a Release Test

Published

May 31, 2026

Firefox vulnerabilities became a patch-factory problem for Mozilla in April: the Firefox maker fixed 423 security bugs, including 271 that it says Claude Mythos Preview found for Firefox 150. The lesson for security teams is direct: artificial intelligence can now surface old browser flaws faster than release teams can comfortably absorb them.

That sounds like a victory lap for Anthropic, the AI company behind Claude. The browser maker’s own write-up points to a messier shift: discovery has stopped being the scarce step. Verification, duplicate handling, engineer attention, and clean releases now decide how much risk comes off the table.

The Number Mozilla Had To Explain

The April surge was not a single advisory with one clean count. In the May technical account, the Firefox security team said the month closed with 423 fixed security bugs across releases, led by the Mythos batch in Firefox 150 and followed by more fixes in Firefox 149.0.2, 150.0.1, and 150.0.2.

Bucket	Count	Where It Landed	Why It Matters
Mythos findings for Firefox 150	271	Firefox 150	Main proof that agentic model testing can feed a production browser release.
External reports	41	April Firefox security releases	Shows the usual researcher channel stayed active during the AI surge.
Other internal findings	111	April Firefox security releases	Split roughly among other Mythos fixes, other models, and conventional fuzzing.

The severity mix stops the number from becoming a vanity metric. Of the announced Firefox 150 batch, 180 were rated sec-high, 80 sec-moderate, and 11 sec-low. The team’s definition of sec-high covers flaws that can be triggered through normal user behavior, such as visiting a web page, even if a full browser compromise would usually need more than one bug.
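The bucket and severity figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts reported above; the variable names are illustrative and not taken from Mozilla's write-up.

```python
# Counts as reported in the article; names are illustrative.
april_buckets = {
    "mythos_firefox_150": 271,
    "external_reports": 41,
    "other_internal": 111,
}
firefox_150_severity = {"sec-high": 180, "sec-moderate": 80, "sec-low": 11}

total_fixed = sum(april_buckets.values())
severity_total = sum(firefox_150_severity.values())
high_share = firefox_150_severity["sec-high"] / severity_total

print(total_fixed)           # 423
print(severity_total)        # 271
print(f"{high_share:.1%}")   # 66.4%
```

The three buckets account for the full 423, and the severity split accounts for the full Firefox 150 batch, with roughly two thirds of it rated sec-high.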

Firefox vulnerabilities show AI bug hunting moving into browser releases.

A Harness Replaced the One-Off Audit

The important change was not that a model read code. Security teams have been testing large language model audits for years. The break came when the model got a feedback loop: read a target file, form a bug theory, build a proof of concept, run it, and throw away the idea if the program would not break.

That fixed the old slop problem. Static-analysis trials using GPT-4 and Claude Sonnet 3.5 produced too many plausible but wrong reports to scale. The later work used agentic harnesses, meaning software wrappers that let the model run tools, edit tests, use debuggers, and check whether a crash or sanitizer report proves the issue.

Assign each model run to a narrow target, often a specific file or function.
Require a reproducible test case before a report enters the security queue.
Deduplicate against known issues before engineers spend review time.
Track triage, patch review, uplift decisions, and release timing in the same workflow.
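The first three rules above can be sketched as a small filter loop. This is a minimal illustration rather than Mozilla's or Anthropic's harness: `propose` and `reproduce` are hypothetical callables standing in for the model and the instrumented test run, and the crash signatures are invented strings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Finding:
    target: str      # file or function the run was scoped to
    signature: str   # e.g. a normalized crash or sanitizer signature
    testcase: str    # input that reproduces the failure


def run_harness(
    targets: Iterable[str],
    propose: Callable[[str], Optional[str]],    # hypothetical: model drafts a test case
    reproduce: Callable[[str], Optional[str]],  # hypothetical: returns a signature on failure
    known_signatures: set[str],
) -> list[Finding]:
    """Keep only findings that reproduce and are not already known."""
    queue: list[Finding] = []
    seen = set(known_signatures)
    for target in targets:
        testcase = propose(target)
        if testcase is None:
            continue              # no bug theory for this target
        signature = reproduce(testcase)
        if signature is None:
            continue              # the program did not break: discard the theory
        if signature in seen:
            continue              # duplicate of a known or earlier finding
        seen.add(signature)
        queue.append(Finding(target, signature, testcase))
    return queue


# Stub stand-ins so the sketch runs without a model or a browser build.
fake_cases = {"a.cpp": "case-a", "b.cpp": "case-b", "c.cpp": None, "d.cpp": "case-d"}
fake_crashes = {"case-a": "uaf@Frame", "case-b": "uaf@Frame", "case-d": "overflow@Table"}

findings = run_harness(fake_cases, fake_cases.get, fake_crashes.get, {"overflow@Table"})
print([f.target for f in findings])  # ['a.cpp']
```

Of four scoped runs, only one reaches the queue: one produced no theory, one duplicated an earlier crash, and one matched an issue already on file. That filtering, not the model's raw output, is what keeps engineer review time in check.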

The earlier Firefox collaboration with Claude Opus 4.6 gives the scale-up some context. In its account of that work, Anthropic said Opus 4.6 scanned nearly 6,000 C++ files, submitted 112 unique reports, and yielded 22 vulnerabilities over two weeks. Mythos turned that kind of experiment into a release-management event.

The Bugs Cut Across Old Browser Layers

The public examples were not neat textbook mistakes. One report that has since been opened, Bug 2024437 in Bugzilla, describes a heap use-after-free (UAF, a memory error where code keeps using storage after it has been freed) tied to the HTML <legend> element. The issue sat in code old enough to have survived years of ordinary testing and review.

Another example involved Extensible Stylesheet Language Transformations (XSLT, an XML transformation feature) and reentrant calls to key() that could make a hash table free its backing store while a raw pointer still pointed into it. A table-layout bug used rowspan=0 behavior and more than 65,535 rows to overflow a 16-bit bitfield. The point was not clever syntax. The point was reach: old browser subsystems still expose fresh combinations.
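The 16-bit limit behind the table-layout bug is easy to show in isolation. The sketch below mimics an unsigned 16-bit field with a bit mask; it is not Firefox code, and the function names are made up for illustration.

```python
FIELD_BITS = 16
FIELD_MASK = (1 << FIELD_BITS) - 1   # 0xFFFF, the largest value a 16-bit field holds


def store_in_bitfield(value: int) -> int:
    """Mimic assigning to an unsigned 16-bit bitfield: high bits are silently dropped."""
    return value & FIELD_MASK


def store_checked(value: int) -> int:
    """Defensive alternative: refuse values the field cannot represent."""
    if not 0 <= value <= FIELD_MASK:
        raise OverflowError(f"{value} does not fit in {FIELD_BITS} bits")
    return value


print(store_in_bitfield(65_535))  # 65535
print(store_in_bitfield(65_536))  # 0
print(store_in_bitfield(65_540))  # 4
```

Once a row count passes 65,535, the stored value wraps to a small number while the real table keeps growing, so any later code that trusts the stored value is working from a wrong size.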

Those examples matter because browsers are layered machines. HTML parsing, JavaScript, IndexedDB, WebAssembly, WebRTC, networking, layout, and sandbox code all touch untrusted input. Fuzzers are strong at hammering inputs until something crashes. The new harnesses add source-level reasoning about invariants, reentrancy, ownership, and cross-process trust.

Sandbox Escapes Shift the Stakes

A browser sandbox is designed to contain damage after a renderer process is compromised. Several of the disclosed examples lived near that boundary, including flaws involving inter-process communication (IPC, the message layer between separate browser processes) and parent-process state. These bugs are valuable to attackers because they can be chained with a renderer bug to move from a web page into a more privileged process.

That does not make every sec-high flaw a working exploit. The official Firefox 150 security advisory lists individual Common Vulnerabilities and Exposures records and large memory-safety rollups, while the technical account warns that many high-severity bugs still need other pieces to become a practical compromise.

One example involved a raw not-a-number (NaN, a special floating-point value) crossing an IPC boundary and being treated like a tagged JavaScript object pointer. Another involved an IndexedDB actor reference crossing a trust boundary and racing reference counts. Those are not bugs a product manager can understand from a crash count alone. They are the kind of boundary failures that make browser security hard.
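The NaN case is easier to follow at the bit level. Many JavaScript engines pack non-number values into the unused payload bits of a NaN, so a NaN that arrives from another process with attacker-chosen bits can be mistaken for a tagged pointer. The sketch below assumes a made-up tag layout (`OBJECT_TAG` in the top 16 bits) purely for illustration; it is not Firefox's actual encoding, and collapsing NaNs at the boundary is shown as one generic defense, not as the fix Mozilla shipped.

```python
import struct

EXPONENT_MASK = 0x7FF0_0000_0000_0000
MANTISSA_MASK = 0x000F_FFFF_FFFF_FFFF
CANONICAL_NAN = 0x7FF8_0000_0000_0000   # the standard quiet NaN bit pattern

# Hypothetical layout for illustration only: a value whose top 16 bits equal
# OBJECT_TAG is read as an object pointer held in the low 48 bits.
OBJECT_TAG = 0xFFFE


def is_nan_bits(bits: int) -> bool:
    """IEEE 754 double: exponent all ones and a non-zero mantissa."""
    return (bits & EXPONENT_MASK) == EXPONENT_MASK and (bits & MANTISSA_MASK) != 0


def looks_like_object(bits: int) -> bool:
    return (bits >> 48) == OBJECT_TAG


def sanitize_incoming_double(bits: int) -> int:
    """Collapse every NaN from an untrusted sender to the canonical pattern."""
    return CANONICAL_NAN if is_nan_bits(bits) else bits


# A "double" from an untrusted process whose bits happen to carry the tag.
attacker_bits = (OBJECT_TAG << 48) | 0x0000_4141_4141_4141
ordinary_bits = struct.unpack("<Q", struct.pack("<d", 1.5))[0]

print(is_nan_bits(attacker_bits))                                   # True
print(looks_like_object(attacker_bits))                             # True
print(looks_like_object(sanitize_incoming_double(attacker_bits)))   # False
print(sanitize_incoming_double(ordinary_bits) == ordinary_bits)     # True
```

The same 64 bits are a harmless NaN to the floating-point unit and a pointer to the value decoder, which is why the check has to happen where the value crosses the trust boundary.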

The Bottleneck Moved to Triage

The second-order signal comes from the wider disclosure pipeline. On May 22, Anthropic’s coordinated vulnerability disclosure dashboard said the model had generated far more candidate findings than humans had pushed through review and patching. That gap is now the security story.

23,019 findings were listed as candidates from Mythos work.
1,900 findings had been reviewed by external security firms.
97 findings were marked as patched upstream.
88 advisories had received a CVE record or GitHub Security Advisory.

The coordinated disclosure dashboard calls independent human triage and review the rate-limiting step. That is the quiet warning for every maintainer watching the Firefox case. A model can make the pile bigger before it makes the product safer.
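Expressed as shares of the candidate pool, the dashboard numbers make the gap concrete. The sketch below uses only the four figures listed above and assumes nothing about whether the stages are strictly nested; the labels are shorthand.

```python
funnel = [
    ("candidate findings", 23_019),
    ("reviewed by external firms", 1_900),
    ("patched upstream", 97),
    ("CVE or GHSA issued", 88),
]

candidates = funnel[0][1]
share = {stage: count / candidates for stage, count in funnel}

for stage, count in funnel:
    print(f"{stage:<28} {count:>6}  {share[stage]:7.2%} of candidates")
```

About 8 percent of candidates had been through external review, and well under 1 percent had reached a patch or a public advisory.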

The Firefox team absorbed that pile with unusual resources. More than 100 people contributed code to the effort, with others testing fixes, reviewing patches, scaling the harness, and managing release flow. Smaller open-source projects may get the discovery shock without the staffing cushion.

CVE Accounting Made the Story Look Smaller

The public advisory confused some readers because Common Vulnerabilities and Exposures (CVE) records, the public tracking entries for disclosed security flaws, do not map one-to-one to internal bug counts. Internally found Firefox memory-safety issues can be grouped into rollup CVEs rather than published as hundreds of separate entries.

The company answered that mismatch in its FAQ. Three internal rollups in Firefox 150, CVE-2026-6784, CVE-2026-6785, and CVE-2026-6786, contained 154, 55, and 107 bugs respectively. That adds up to 316, more than the announced Mythos figure, because the rollups also include bugs found by other internal methods.
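The rollup arithmetic also gives a floor for how many bugs came from other internal methods. The sketch below uses the counts quoted above; the lower bound is an inference from those counts, not a figure Mozilla published.

```python
rollups = {"CVE-2026-6784": 154, "CVE-2026-6785": 55, "CVE-2026-6786": 107}
mythos_firefox_150 = 271

rollup_total = sum(rollups.values())
# Even if every Mythos finding sat inside these three rollups, at least this
# many rollup bugs must have come from other internal methods.
min_from_other_methods = rollup_total - mythos_firefox_150

print(rollup_total)            # 316
print(min_from_other_methods)  # 45
```

If some Mythos findings were published outside the rollups, the number from other methods would be higher than 45, never lower.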

That accounting matters for risk reading. A low public CVE count does not mean a thin security release. A high internal bug count does not mean hundreds of individually exploitable browser takeovers. The release sat between those two bad interpretations: a major hardening sprint, with public labels that lag the engineering reality.

Continuous Scanning Becomes the Test

The next version of this workflow cannot depend on a one-month scramble. The Firefox team says its current scanning has focused on selected files and functions, chosen through a mix of human judgment and automated signals. The planned step is continuous integration (CI, automated checks that run as code changes land), with patch-based scanning instead of only file-based hunts.
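Patch-based scanning starts from knowing which lines a change touched. A minimal sketch of that first step, assuming the CI job has a unified diff as text, follows; the function name and the sample path are hypothetical, and a production parser would need to handle renames, binary files, and added lines that themselves begin with "+++ ".

```python
import re
from collections import defaultdict

# Matches hunk headers such as "@@ -10,3 +10,5 @@" and captures the new-file range.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def changed_ranges(diff_text: str) -> dict[str, list[tuple[int, int]]]:
    """Map each touched file to the (start, end) line ranges present after the patch."""
    ranges: dict[str, list[tuple[int, int]]] = defaultdict(list)
    current = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:].strip()
            # "/dev/null" marks a deleted file: nothing is left to scan.
            current = None if path == "/dev/null" else path.removeprefix("b/")
        elif current and (m := HUNK_RE.match(line)):
            start = int(m.group(1))
            length = int(m.group(2)) if m.group(2) is not None else 1
            if length > 0:
                ranges[current].append((start, start + length - 1))
    return dict(ranges)


sample = """\
--- a/layout/tables/Cell.cpp
+++ b/layout/tables/Cell.cpp
@@ -10,3 +10,5 @@
 context
+added
+added
 context
 context
@@ -40 +42,2 @@
-old
+new
+new
"""

print(changed_ranges(sample))
# {'layout/tables/Cell.cpp': [(10, 14), (42, 43)]}
```

Each range could then be handed to a narrowly scoped model run, which keeps the cost of scanning proportional to the size of the change rather than the size of the codebase.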

Project Glasswing gives that shift a wider industry shape. The initiative gives selected companies and critical software maintainers access to Mythos-class capability for defensive work, backed by up to $100 million in usage credits and $4 million in open-source security donations. The model itself is still restricted, which says plenty about the offensive risk.

For browser users, the near-term advice stays boring and important: update quickly. For software teams, the lesson is less comfortable. AI-assisted bug discovery rewards maintainers who can verify, patch, test, and ship at the same tempo that models can find flaws. Anyone missing that release muscle inherits a larger backlog, not an automatic defense.

If the CI plan catches fresh bugs before they merge, the 423-patch month becomes the messy prelude to a quieter default. If that work stays as a special surge, the next queue of machine-found bugs will land on the same human desks, only faster.