A lot of AI safety work never makes headlines, but it’s quietly shaping model deployment decisions at the highest levels. SecureBio’s biosafety evaluations have assessed the frontier models of leading labs, including Anthropic, OpenAI, Google DeepMind, and xAI.
These evaluations help shape our understanding of the biorisks associated with advanced AI. While much of our work remains confidential, you can see references to our evals in public system cards, including those for Claude 3.7 Sonnet, Claude 4, GPT-4.5, o3-mini, o4-mini, and Gemini 2.5 Pro. Our work has also informed emerging AI risk management frameworks. One public-facing example is xAI’s risk management documentation.
We also developed and ran the Virology Capabilities Test (VCT), a first-of-its-kind benchmark that measures the ability of a model to provide expert-level practical assistance in work with viruses. It was covered in Time.
The recent CBRN risk assessment of the Claude 4 model family used not only VCT but also two additional SecureBio benchmarks: a DNA synthesis screening evasion task set for LLM agents and a set of creative biology scenarios that serve as proxies for novel biology abilities. Anthropic’s CBRN evaluation also included the long-form virology tasks, a suite of evals co-developed with Deloitte, Signature Science, and SecureBio that tests end-to-end pathogen acquisition.
More is coming: we are developing further targeted evaluations, testing how much AI systems uplift human abilities in dual-use biology, researching promising mitigations, and conducting holistic safety assessments. We will publish our findings; we strongly support greater transparency and accountability in AI deployment.
None of this would be possible without our team, an extraordinary group of engineers and biologists working with diligence, precision, and purpose. They conduct world-class research with minimal resources and are some of the finest minds to work in this space.
Their motivation is simple but vital: to help ensure AI contributes to a future of profound scientific and human progress, while reducing the risk of catastrophic misuse of these models.
System cards featuring SecureBio evals and benchmarks
🔗 Claude 3.7 Sonnet
🔗 Claude 4
🔗 GPT-4.5
🔗 o3-mini
🔗 o4-mini
🔗 Gemini 2.5 Pro

