When AI Leaves the Lab: Testing Frontier Models in Government Cyber Defence - Case study

From frontier models to front-line impact

We know AI is disrupting the cyber threat landscape. Recently released frontier AI systems such as Claude Mythos and GPT-5.5 brought a step-change in cyber capabilities, and the UK AI Security Institute (AISI)’s evaluations show these models getting better at cyber tasks very quickly.

However, evaluation in synthetic environments gives a limited understanding of real-world use. A high score on a benchmark does not necessarily translate into finding and fixing real vulnerabilities.

What we did

The Government Cyber Coordination Centre (GC3) led a weekly, in-person series of hackathons that used frontier AI to scan public code repositories across government. Working closely with specialists from AISI and NCSC, we aimed to find and mitigate previously unidentified vulnerabilities before they could be exploited. Rather than mandate a single approach, we gave teams model access and let them build their own tooling, noting what worked each week and building on the best approaches.

The UK Government encourages new source code to be open by default, with specific and justified exceptions. In practice, that creates a degree of shared visibility that attackers can also exploit. However, this openness also limits duplication and leads to cleaner, more easily maintained code.

Code published in the open has also already passed extensive prepublication scrutiny, meaning it can be shared with frontier model providers with minimal additional review. This means that government departments can deploy new capabilities quickly and with confidence.

An adversarial chain that challenges itself. One team ran each public repo through a six-stage AI agent pipeline: triage, validator, auditor, tracer, judge and summary. Each stage reads and challenges the output of the last. In one case, the agent downgraded a finding once it established that a backup mechanism was in place. The pipeline was agentic, but the escalation was manual: a member of the team checked every line, re-verified exposure, and handled false positives.
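The chained-review idea can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: only the six stage names come from the description above. The `Finding` type, the `downgrade_if_mitigated` rule, and the choice to attach that rule to the auditor stage are all assumptions made for the example, and each reviewer is a plain function standing in for a model call.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple

# Stage names as listed in the case study; everything else is hypothetical.
STAGE_NAMES = ["triage", "validator", "auditor", "tracer", "judge", "summary"]
SEVERITIES = ["low", "medium", "high", "critical"]


@dataclass(frozen=True)
class Finding:
    title: str
    severity: str
    mitigations: Tuple[str, ...] = ()
    trail: Tuple[str, ...] = ()  # audit trail of what each stage concluded


Reviewer = Callable[[Finding], Finding]


def accept(finding: Finding) -> Finding:
    """Default reviewer: passes the previous stage's finding through unchanged."""
    return finding


def downgrade_if_mitigated(finding: Finding) -> Finding:
    """Stand-in for a model judgement: lower severity one step if a mitigation exists."""
    if finding.mitigations and finding.severity != "low":
        lower = SEVERITIES[SEVERITIES.index(finding.severity) - 1]
        return replace(finding, severity=lower)
    return finding


def run_pipeline(finding: Finding, reviewers: Dict[str, Reviewer]) -> Finding:
    """Run the stages in order; each one reviews the output of the one before."""
    for name in STAGE_NAMES:
        finding = reviewers.get(name, accept)(finding)
        finding = replace(finding, trail=finding.trail + (f"{name}:{finding.severity}",))
    return finding


candidate = Finding(
    title="Unauthenticated admin endpoint",
    severity="high",
    mitigations=("backup mechanism in place",),
)
result = run_pipeline(candidate, {"auditor": downgrade_if_mitigated})
print(result.severity, result.trail)
```

The trail records the severity after every stage, which gives a human reviewer the history they need when re-verifying a finding before escalation.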

Deterministic scanners feeding a model. Another team ran traditional scanning tools first (including Gitleaks, Trivy, Semgrep and Hadolint) to generate a ranked findings document. Three model stages were then layered on top: a discovery stage that treated the scanner output as leads and read the source against OWASP and CWE frameworks, a chain-investigation stage that composed individual findings into attack paths via per-chain sub-agents, and a triage stage that confirmed each finding's viability.
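As a rough sketch of the first step, the Python below merges scanner results into a single ranked list of leads that a model stage could then read. It assumes each tool's output has already been normalised into a simple dictionary with `tool`, `file`, `rule` and `severity` keys. That record shape, the rule names and the ranking order are hypothetical and do not reflect any scanner's real output format.

```python
from typing import Dict, List

# Hypothetical ranking: lower number means reviewed first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def rank_leads(findings: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop exact duplicates, then order leads by severity and file path."""
    seen = set()
    unique = []
    for finding in findings:
        key = (finding["tool"], finding["file"], finding["rule"])
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    # Unknown severities sort after all known ones.
    return sorted(
        unique,
        key=lambda f: (SEVERITY_RANK.get(f["severity"], len(SEVERITY_RANK)), f["file"]),
    )


def render_leads(ranked: List[Dict[str, str]]) -> str:
    """Render the ranked leads as a numbered plain-text document."""
    return "\n".join(
        f"{i}. [{f['severity']}] {f['tool']}: {f['rule']} in {f['file']}"
        for i, f in enumerate(ranked, start=1)
    )


raw = [
    {"tool": "semgrep", "file": "app/views.py", "rule": "sql-injection", "severity": "high"},
    {"tool": "gitleaks", "file": "config/settings.py", "rule": "hardcoded-secret", "severity": "critical"},
    {"tool": "hadolint", "file": "Dockerfile", "rule": "unpinned-base-image", "severity": "low"},
    {"tool": "semgrep", "file": "app/views.py", "rule": "sql-injection", "severity": "high"},
]
ranked = rank_leads(raw)
print(render_leads(ranked))
```

Treating the rendered document as a list of leads, rather than as confirmed findings, leaves the later model stages free to discard or combine entries.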

Codifying a multi-service audit into reusable skills. Another department developed five domain-specific Claude Skills. The Skills distil an organisation-wide audit across hundreds of services into something repeatable, giving every repository and operator the same reusable, scoped and consistent approach.

What we found

Participants identified 407 findings in total, including critical weaknesses exposing services to authentication bypass, data exposure and remote code execution. Some were already understood and mitigated by compensating controls while others were previously unknown. All critical weaknesses have been remediated, and no evidence of exploitation was identified for any finding.

AI models traced vulnerabilities across service boundaries, which traditional scanners can’t do, and linked business logic with technical detail. Departments prioritised validation and remediation through existing frameworks, patching critical and high-risk issues assessed as exploitable.

It cost us £13,000 in tokens to find these weaknesses, working across nine government organisations for the month.

Identifying critical vulnerabilities

One notable finding affected legacy GitHub Actions in a repository supporting a key government digital service. The issue allowed an external user to trigger a workflow chain by posting a specially structured comment on an open pull request. This bypassed the usual protections for pull requests from unknown contributors because the workflow was triggered by a comment, not by the pull request itself.

The impact was arbitrary remote code execution on the GitHub Actions runner. The workflow took content from the comment, passed it into deployment parameters, and used it in an environment substitution step that executed during the workflow. By placing executable content in the comment field, an external user could cause their input to run on the GitHub runner.

This created a route for malicious actors to potentially extract secrets and tokens available to the workflow, including the GitHub token used by the automation. With that level of access, the issue could support wider repository compromise, including manipulating pull requests, approving workflow activity, altering trusted contributor status, and exploiting further secrets available to the automation environment.
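The underlying pattern, untrusted text substituted into a script before a shell parses it, can be reproduced in a small Python analogue. This is not the affected workflow, which is not shown here. The function names, the `COMMENT` variable and the harmless payload are hypothetical, and the sketch assumes a POSIX `sh` is available. The safer variant follows the commonly recommended mitigation of passing untrusted input through an environment variable so that it is only ever treated as data.

```python
import os
import subprocess


def run_unsafe(comment: str) -> str:
    """Vulnerable pattern: the comment is pasted into the script text itself."""
    script = f"echo deploy target: {comment}"
    done = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return done.stdout.strip()


def run_safer(comment: str) -> str:
    """Safer pattern: the script is fixed and the comment arrives only as data."""
    script = 'echo "deploy target: $COMMENT"'
    env = dict(os.environ, COMMENT=comment)
    done = subprocess.run(["sh", "-c", script], env=env, capture_output=True, text=True)
    return done.stdout.strip()


# Harmless stand-in for attacker-controlled comment content.
payload = "staging $(echo INJECTED)"

print(run_unsafe(payload))  # the shell executes the embedded command
print(run_safer(payload))   # the embedded command is printed literally
```

In the unsafe variant the shell expands `$(...)` because it sees the payload as part of the script. In the safer variant the expansion never happens, because the shell does not re-parse the value of a variable as code.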

What we learnt

Across teams, the common thread was structure. Models were used as components, driven through Skills and run in parallel across repositories, with a human expert kept in the loop on anything that mattered. We learnt that:

Architecture matters the most. The strongest results came from using frontier models as tightly scoped components inside a structured pipeline. Breaking traditional vulnerability management workflows into discrete, task-specific harnesses let teams scale while controlling false positives and hallucination.
The model matters less than how it’s used. AISI’s research, borne out here, shows that with the right architecture and task design many near-frontier and frontier models perform comparably at scanning code. The best findings still lean heavily on human expertise in breaking the problem down and identifying wider context.
Triage is essential. Agents generate candidate findings far faster than humans can validate them. Poorly scoped runs burn tokens on low-value targets; weak review dumps the load onto stretched security teams. Careful upfront scoping and structured internal filtering of low-confidence findings kept human review focused. As in traditional vulnerability management, it’s not how many issues are found, but whether triage points limited resource where it matters.
Finding isn’t the same as fixing. Findings still had to enter the patch pipeline for remediation. AI shows promise here too, but today prioritisation, review and patch generation must all integrate without overwhelming human-centred processes.
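The triage point above can be made concrete with a short Python sketch of structured filtering ahead of human review. It is illustrative only: the `Candidate` fields, the numeric severity scale, the confidence score, the 0.6 threshold and the reviewer capacity are all assumptions, not values used in the pilot.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    title: str
    severity: int      # hypothetical scale: 1 (low) to 4 (critical)
    confidence: float  # hypothetical score between 0.0 and 1.0


def build_review_queue(
    candidates: List[Candidate],
    min_confidence: float = 0.6,
    capacity: int = 2,
) -> Tuple[List[Candidate], List[Candidate]]:
    """Filter out low-confidence candidates, then split the rest by reviewer capacity.

    Returns (queue, deferred): the queue holds what reviewers see now, most
    severe first; deferred holds what waits for the next review round.
    """
    kept = [c for c in candidates if c.confidence >= min_confidence]
    kept.sort(key=lambda c: (-c.severity, -c.confidence))
    return kept[:capacity], kept[capacity:]


candidates = [
    Candidate("Verbose error page", severity=1, confidence=0.9),
    Candidate("Possible auth bypass", severity=4, confidence=0.8),
    Candidate("Speculative race condition", severity=3, confidence=0.3),
    Candidate("Outdated dependency", severity=2, confidence=0.7),
]
queue, deferred = build_review_queue(candidates)
print([c.title for c in queue], [c.title for c in deferred])
```

The design choice is that capacity, not the number of candidates, sets how much reaches a human. Anything filtered out or deferred stays available for later runs rather than landing on a stretched security team.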

What next

GC3 will kick off a second phase of this pilot, with more departments, additional models, and an extension from public code to closed-source estates. Identifying vulnerabilities early, raising the consistency of defensive practice, and helping departments share proven techniques are how we put the Government Cyber Action Plan into practice.

AISI and NCSC’s involvement will also deepen as we continue to evaluate AI as a tool for cyber defence in applied settings, closing the gap between a theoretical benchmark and a real reduction in risk.

This pilot was a test of how government can adopt new capabilities responsibly, learn quickly, and share what works.
