<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[SecureBio: AIxBio]]></title><description><![CDATA[AI and biotechnology risks]]></description><link>https://securebio.substack.com/s/aixbio</link><image><url>https://substackcdn.com/image/fetch/$s_!jhE9!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d31dbac-a74f-4c4a-9683-348b1f4dbee5_500x500.png</url><title>SecureBio: AIxBio</title><link>https://securebio.substack.com/s/aixbio</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 15:24:08 GMT</lastBuildDate><atom:link href="https://securebio.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[SecureBio]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[securebio@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[securebio@substack.com]]></itunes:email><itunes:name><![CDATA[SecureBio]]></itunes:name></itunes:owner><itunes:author><![CDATA[SecureBio]]></itunes:author><googleplay:owner><![CDATA[securebio@substack.com]]></googleplay:owner><googleplay:email><![CDATA[securebio@substack.com]]></googleplay:email><googleplay:author><![CDATA[SecureBio]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[SecureBio AI: 2025 in Review]]></title><description><![CDATA[2025 was a significant year for the SecureBio AI team &#8211; reflected by the fact that we tripled our headcount, adding to our roster of world-class research scientists and engineers.]]></description><link>https://securebio.substack.com/p/securebio-ai-2025-in-review</link><guid 
isPermaLink="false">https://securebio.substack.com/p/securebio-ai-2025-in-review</guid><dc:creator><![CDATA[SecureBio]]></dc:creator><pubDate>Tue, 17 Feb 2026 21:23:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_1Pj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_1Pj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_1Pj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png 424w, https://substackcdn.com/image/fetch/$s_!_1Pj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png 848w, https://substackcdn.com/image/fetch/$s_!_1Pj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png 1272w, https://substackcdn.com/image/fetch/$s_!_1Pj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_1Pj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png" width="1202" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1202,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:109925,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://securebio.substack.com/i/188292982?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_1Pj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png 424w, https://substackcdn.com/image/fetch/$s_!_1Pj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png 848w, https://substackcdn.com/image/fetch/$s_!_1Pj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png 1272w, https://substackcdn.com/image/fetch/$s_!_1Pj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9beeb0f-b67f-4953-9c47-4a9f81f4ca20_1202x630.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>2025 was a significant year for the SecureBio AI team &#8211; reflected by the fact that we tripled our headcount, adding to our roster of world-class research scientists and engineers. By deepening our interdisciplinary talent pool across virology, AI/ML, software engineering, and policy, we expanded our capacity for running multiple large projects in parallel. 
This allowed us to make major strides in turning AI-bio risk evaluation from a set of bespoke projects into something closer to an ecosystem.</p><p>A central focus of our work last year was building evaluation tools that move beyond &#8220;does the model know biology,&#8221; towards &#8220;does the model meaningfully expand harmful biological capabilities?&#8221; In practice, this meant building more rigorous benchmarks, developing deeper agentic evaluations, and integrating more directly with real-world safety pipelines. We increased the reach of our technical outputs by briefing senior decision-makers on how third-party evaluations are run and interpreted. This year we will intensify our work on mitigations, push the envelope on understanding frontier capabilities across agents and biological AI models, and work to increase our understanding of how advances in AI translate to actual risk.</p><p>If you&#8217;re working on adjacent problems (especially evaluation standards, mitigation tools, or safety audit readiness), we&#8217;re always keen to compare notes and collaborate.</p><p><strong>Benchmarks and Evaluations</strong></p><p>The Virology Capabilities Test (VCT) was our flagship effort in 2025. 
We designed and executed a large-scale benchmark and published the <a href="https://www.virologytest.ai/">primary research paper</a>, helping establish VCT as the <a href="https://epoch.ai/gradient-updates/do-the-biorisk-evaluations-of-ai-labs-actually-measure-the-risk-of-developing-bioweapons#:~:text=We%20think%20that%20VCT%20is%20substantially%20more%20informative%20about%20real%2Dworld%20biorisk%20capabilities%20than%20either%20LAB%2DBench%20or%20WMDP.">leading reference</a> point for AI-bio risk discussions and enabling large-scale expert/model comparison.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> In addition we expanded coverage across different parts of the biological landscape with non-public benchmarks like the World Class Bio benchmark (WCB), the Molecular Biology Capabilities Test (MBCT), and the Human Pathogen Capabilities Test (HPCT), each aimed at capturing distinct slices of capability that matter for real-world misuse, not just textbook knowledge.</p><p>We also pushed further into agentic and long-form evaluation of biological AI models (BAIMs). The Agentic Bio-Capabilities Benchmark (<a href="https://openreview.net/pdf/efa6989a1bbafaf92bb9ce187b701c826ecffed5.pdf">ABC-Bench</a>) and Agentic BAIM-LLM Evaluation Benchmark (<a href="https://openreview.net/pdf/3fd094f3a011ca4820836bd6abf0dd01ca1e28f8.pdf">ABLE</a>) were designed to test whether agentic systems can complete key components of dual-use workflows, such as using biological AI models to redesign viral proteins. ABC-Bench shows that AI agents can increasingly undertake biosecurity-relevant tasks across both in-silico design and wet-lab experiments, while ABLE shows that agents can effectively utilize AI protein design tools, but remain inconsistent at applying their knowledge across a multi-step computational design workflow. 
Several of these efforts were presented at NeurIPS and used in multiple real assessments, helping inform discussions about how agentic systems change the risk landscape.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Our benchmarks and evaluations have been cited in model cards or risk management frameworks for major releases from all the frontier labs, including Anthropic, Google DeepMind, Meta, OpenAI, and xAI. VCT was also referenced throughout the House Energy and Commerce Committee hearing on <a href="https://www.congress.gov/event/119th-congress/house-event/118773">Examining Biosecurity at the Intersection of AI and Biology</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, and received coverage in <a href="https://time.com/7279010/ai-virus-lab-biohazard-study/">Time</a>. It remains an open question how model performance on benchmarks translates to changes in the real-world risk landscape; addressing this uncertainty is a key focus of our 2026 efforts.</p><p><strong>Mitigations, Cross-Team Collaborations, and Safety Pipeline Deliverables</strong></p><p>A major milestone for us in 2025 was moving beyond researching the capabilities of models after they are released to working directly with frontier labs to make models safer. We delivered training datasets and lists of dangerous biological topics for pretraining data filtering that fed directly into the design of several frontier models. This work sits at the interface of research and operational safety: it is empirically grounded, hard to game, and compatible with how frontier labs actually train and deploy safeguards. 
It reduces models&#8217; capabilities for catastrophic bio misuse while preserving their beneficial capabilities.</p><p>We also contributed to work on AI-bio jailbreaking mitigations, helping to characterize how safety systems fail under pressure and what kinds of mitigations appear most promising. Complementing this, we secured dedicated funding for operational security research and follow-on methods development. This reflects a broader shift in the field toward treating misuse prevention as an end-to-end systems problem rather than a single refusal metric.</p><p>The team also conducted an exciting collaboration with the Nucleic Acid Observatory (NAO), helping to build an AI tool that triages metagenomic sequences flagged by the NAO&#8217;s detection system for further investigation. The tool analyzes concerning sequences, enriches them with relevant facts and context, and surfaces the most important ones for human-expert review. We&#8217;re excited to undertake further such work that leverages each team&#8217;s strengths.</p><p><strong>Funding, Delivering, and Scaling our Work</strong></p><p>On the funding side, we secured a multi-year Coefficient Giving grant that enables longer-horizon planning, deeper technical investment, and hiring. We also received multiple grants from the Foundation Model Forum, including support for research into agentic AI, operational security work, and follow-on research from earlier pilots.</p><p>We delivered several major evaluation projects with frontier labs, including expert baselining, quality control for evals, and holistic prerelease assessments. 
We also moved toward a more sustainable model by licensing evaluations to multiple frontier labs, creating a durable pathway for our tools to be used in real decision-making contexts rather than remaining purely academic artifacts.</p><p>Finally, we broadened our government-facing portfolio with a US CAISI Bio R&amp;D contract and participation in an <a href="https://securebio.substack.com/p/securebio-selected-to-develop-bio">EU AI Office contract</a> to deliver bio-evals, both steps toward institutionalizing evaluation as part of emerging governance and standards ecosystems.</p><p><strong>Strategy and Policy Developments</strong></p><p>As our technical evaluation capacity grows, the question of &#8220;what should decision-makers do with these results&#8221; becomes more pressing. We invested substantial effort in aligning our work with the institutions that shape audit expectations and safety norms.</p><p>We delivered a national security briefing on frontier model capabilities, helping bring empirical evaluation results into senior biosecurity decision-making contexts. 
The team also presented to export-control policymakers through the BIS Technology Advisory Committee and briefed US CAISI staff working on bio-related standards, both efforts to translate technical work into governance-ready inputs.</p><p><strong>Looking Ahead</strong></p><p>The through-line of 2025 was a shift from one-off evaluations to a mature ecosystem posture: credible benchmarks, agentic evaluation methods, mitigation artifacts that plug into real safety pipelines, and growing institutional relevance with governments and standards bodies.</p><p><strong>In 2026, we plan to keep pushing in four directions:</strong></p><ol><li><p>Mitigation strategies that measurably reduce risk in deployed systems and in contexts where malicious actors can use multiple models in combination.</p></li><li><p>Deeper work on measuring and understanding the &#8220;hard cases&#8221;: agentic systems, integrated toolchains, frontier BAIM capabilities, and most challenging of all, super-expert capability uplift.</p></li><li><p>Systematic and routine evaluations that are fast, reliable, and decision-relevant.</p></li><li><p>Better understanding of how increases in AI model performance translate to changes in real-world risks.</p></li></ol><p>If you&#8217;re building adjacent infrastructure or want to pressure-test your own evaluation/mitigation approach, please reach out.</p><p>Ben Mueller, Executive Director and Seth Donoughe, Director of AI</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>VCT is used by all major model developers: <a href="https://www.anthropic.com/news/strategic-warning-for-ai-risk-progress-and-insights-from-our-frontier-red-team">Anthropic</a>, <a href="https://openai.com/index/strengthening-safety-with-external-testing/">OpenAI</a>, <a 
href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Deep-Think-Model-Card.pdf">Google DeepMind</a>, <a href="https://ai.meta.com/research/publications/code-world-model-preparedness-report/">Meta</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Note: Most of the research on this topic is not publicly available, but here is some of the public-facing work:</p><ul><li><p><em><a href="https://cdn.openai.com/gpt-5-system-card.pdf">GPT-5 System Card</a>.</em> 2025. Sections 5.1.1&#8211;5.1.1.4 (wet-lab troubleshooting and tacit knowledge benchmarks) and 5.1.1.6 (SecureBio external assessment).</p></li><li><p><em><a href="https://cdn.openai.com/pdf/6bcccca6-3b64-43cb-a66e-4647073142d7/chatgpt_agent_system_card_launch.pdf">ChatGPT Agent System Card</a>.</em> 2025. Section 5.1.1.6 (Fragment Design and World-Class Biology evaluations) and Section 5.1.1.7 (Expert Deep Dives).</p></li><li><p><em><a href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf">GPT-OSS Model Card</a>.</em> 2025. 
Section 5.2.1 (biological reasoning and troubleshooting performance).</p></li></ul></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For instance, the chairman&#8217;s opening <a href="https://www.congress.gov/119/meeting/house/118773/documents/HHRG-119-IF02-MState-J000302-20251217.pdf">remarks</a> include &#8220;Some LLMs have even been shown to outperform PhD-level virologists on advanced troubleshooting tasks&#8221;.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[SecureBio Selected to Develop Bio-Evals for the European Commission]]></title><description><![CDATA[We&#8217;re excited to share that the European Commission&#8217;s AI Office has chosen SecureBio to develop biological threat evaluations (&#8216;evals&#8217;) to support the implementation of the EU&#8217;s landmark Artificial Intelligence Act.]]></description><link>https://securebio.substack.com/p/securebio-selected-to-develop-bio</link><guid isPermaLink="false">https://securebio.substack.com/p/securebio-selected-to-develop-bio</guid><dc:creator><![CDATA[SecureBio]]></dc:creator><pubDate>Tue, 03 Feb 2026 16:55:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UziS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UziS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!UziS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!UziS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!UziS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!UziS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UziS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:498522,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://securebio.substack.com/i/186755300?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UziS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!UziS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!UziS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!UziS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf4a898-2b43-47f9-930b-f5b18f4d10c2_1600x900.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>We&#8217;re excited to share that the European Commission&#8217;s AI Office has chosen <a href="https://securebio.org/">SecureBio</a> to develop biological threat evaluations (&#8216;evals&#8217;) to support the implementation of the EU&#8217;s landmark Artificial Intelligence Act. This is SecureBio&#8217;s second major engagement by a government to build such evals, following the award of a contract by the US government&#8217;s Center for AI Standards and Innovation.</p><p>As part of a consortium led by <a href="http://far.ai">FAR.AI</a>, SecureBio and other leading AI research organizations successfully completed a bid for Lot 1 of the EU AI Act&#8217;s &#8220;Technical Assistance for AI Safety&#8221; <a href="https://digital-strategy.ec.europa.eu/en/news/forthcoming-call-tenders-artificial-intelligence-act-technical-assistance-ai-safety">tender</a>. In line with the protections set out in the EU AI Act, the consortium will monitor how AI might pose risks by expanding access to chemical, biological, radiological, and nuclear (CBRN) threats. 
Other members of the consortium include <a href="https://www.safer-ai.org/">SaferAI</a>, <a href="https://www.governance.ai/">GovAI</a>, <a href="https://www.nemesysinsights.com/">Nemesys Insights</a>, and <a href="https://www.equistamp.com/">Equistamp</a>.</p><p>Over the next three years, SecureBio will be focused on:</p><ul><li><p><strong>Delivering pre-made biological evaluations</strong>: We&#8217;ll integrate and deliver established, publicly available biological evaluations like our <a href="https://www.virologytest.ai/">Virology Capabilities Test</a> into the Commission&#8217;s assessment framework.</p></li><li><p><strong>Developing custom evaluations</strong>: We&#8217;ll design and build new AI evaluations to address gaps in current coverage of biological threat scenarios.</p></li><li><p><strong>Performing quality assurance and human baselining</strong>: We&#8217;ll establish rigorous quality standards for biological evaluations, including human baseline studies to calibrate AI performance against experts.</p></li><li><p><strong>Building evaluation infrastructure</strong>: We&#8217;ll help streamline and simplify the biological evaluation process for the EU&#8217;s AI Office, enabling consistent assessment of frontier models as they emerge.</p></li></ul><p>Ben Mueller, Executive Director of SecureBio, said: &#8220;AI is poised to bring about tremendous progress in the medical and life sciences.  At the same time, the technology generates risks that need to be better understood. We are proud that SecureBio&#8217;s AI team has been selected by the European Commission to support its efforts to understand risks posed by advanced AI. 
Our staff of scientists, researchers, and software engineers has a strong track record of producing rigorous, balanced evaluations to understand the capabilities and risks of frontier models in biology and associated fields, and we are pleased to contribute to this important undertaking.&#8221;</p>]]></content:encoded></item><item><title><![CDATA[SecureBio’s AI Team: An Overview of Our Biorisk Evaluations ]]></title><description><![CDATA[A lot of AI safety work never makes headlines, but it&#8217;s quietly shaping model deployment decisions at the highest levels.]]></description><link>https://securebio.substack.com/p/securebios-ai-team-an-overview-of</link><guid isPermaLink="false">https://securebio.substack.com/p/securebios-ai-team-an-overview-of</guid><dc:creator><![CDATA[SecureBio]]></dc:creator><pubDate>Wed, 04 Jun 2025 19:02:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jhE9!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d31dbac-a74f-4c4a-9683-348b1f4dbee5_500x500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A lot of AI safety work never makes headlines, but it&#8217;s quietly shaping model deployment decisions at the highest levels. 
SecureBio&#8217;s biosafety evaluations have assessed the frontier models of leading labs, including <a href="https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf">Anthropic</a>, <a href="https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf">OpenAI</a>, <a href="https://storage.googleapis.com/model-cards/documents/gemini-2.5-pro-preview.pdf">Google DeepMind</a>, and xAI.</p><p>These evaluations help shape our understanding of biorisks associated with advanced AI. While much of our work remains confidential, you can see references to our evals in public system cards, including Claude 3.7 Sonnet, Claude 4, GPT-4.5, o3-mini, o4-mini, and Gemini 2.5 Pro. Our work has also informed emerging AI risk management frameworks. One public-facing example is <a href="https://x.ai/documents/2025.02.20-RMF-Draft.pdf">xAI&#8217;s risk management documentation</a>.</p><p>We also developed and ran the Virology Capabilities Test (VCT), a first-of-its-kind benchmark that measures the ability of a model to provide expert-level practical assistance in work with viruses. It was covered in <em><a href="https://time.com/7279010/ai-virus-lab-biohazard-study/">Time</a></em>.</p><p>The recent CBRN risk assessment of the Claude 4 model family used not only VCT, but also two additional SecureBio benchmarks: a DNA synthesis screening evasion task set for LLM agents, and a set of creative biology scenarios that serve as proxies for novel biology abilities. 
Anthropic&#8217;s CBRN evaluation also included the long-form virology tasks, a suite of evals co-developed with Deloitte, Signature Science, and SecureBio that tests end-to-end pathogen acquisition.</p><p>More is coming: we are developing further targeted evaluations, testing how AIs uplift human abilities on dual-use biology, researching promising mitigations, and conducting holistic safety assessments. We will publish our findings; we strongly support greater transparency and accountability in AI deployment.</p><p>None of this would be possible without our team, an extraordinary group of engineers and biologists working with diligence, precision, and purpose. They conduct world-class research with minimal resources and are some of the finest minds to work in this space.</p><p>Their motivation is simple but vital: to help ensure AI contributes to a future of profound scientific and human progress, while reducing the risk of catastrophic misuse of these models.</p><p><em>System cards featuring SecureBio evals and benchmarks</em></p><p> &#128279; <a href="https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf">Claude 3.7 Sonnet<br></a> &#128279; <a href="https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf">Claude 4<br></a> &#128279; <a href="https://cdn.openai.com/gpt-4-5-system-card-2272025.pdf">GPT-4.5<br></a> &#128279; <a href="https://cdn.openai.com/o3-mini-system-card-feb10.pdf">o3-mini<br></a> &#128279; <a href="https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf">o4-mini<br></a> &#128279; <a href="https://storage.googleapis.com/model-cards/documents/gemini-2.5-pro-preview.pdf">Gemini 2.5 Pro</a></p>]]></content:encoded></item><item><title><![CDATA[AIs can provide expert-level virology assistance]]></title><description><![CDATA[When we set out to test if LLMs can match virologists on troubleshooting complex virology lab scenarios, we didn&#8217;t expect to find that they had already begun to surpass experts a year ago.]]></description><link>https://securebio.substack.com/p/ais-can-provide-expert-level-virology</link><guid isPermaLink="false">https://securebio.substack.com/p/ais-can-provide-expert-level-virology</guid><dc:creator><![CDATA[Jasper Götting]]></dc:creator><pubDate>Wed, 23 Apr 2025 17:37:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3_-P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.virologytest.ai/&quot;,&quot;text&quot;:&quot;Read the full paper&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.virologytest.ai/"><span>Read the full paper</span></a></p><h2>The Challenge of Measuring Scientific Expertise</h2><p>Large language models (LLMs) have been at the forefront of AI progress over the past few years. 
They help people write and edit text, assist with coding, and are great partners for brainstorming. They will patiently accompany you down rabbit holes, explaining and clarifying new things along the way. What has been less obvious is whether this general usefulness extends into the sciences, where success often relies on hard-to-find tacit knowledge, hard-won practical experience, or interpreting and connecting disparate pieces of information.</p><p>For our pandemic prevention work at SecureBio, we are especially keen to understand how AI progress impacts virology research. Virology is intrinsically dual-use. Powerful AI assistance could accelerate beneficial virology research on vaccines and antivirals. It might also enable malicious actors to more easily misuse viruses to cause harm. We needed a way to robustly measure an AI&#8217;s ability to assist in virology work.</p><p>Unfortunately, helpfulness for a practical science like virology is hard to measure. Many traditional AI benchmarks test knowledge retrieval using exam-style multiple-choice questions. They ask models to recall academic facts, or perhaps to write analysis code. But this approach misses a major aspect of successful laboratory work: troubleshooting experiments and protocols. Practical lab work relies on the ability to interpret ambiguous results (often visually) and then determine next steps, drawing on tacit knowledge that resides not in textbooks but in lab meetings and hallway conversations. These abilities aren't easily articulated, let alone quantified.</p><h2>Quantifying the Tacit</h2><p>We developed the Virology Capabilities Test (VCT) to attempt exactly this quantification. We created a benchmark that measures an AI&#8217;s ability to provide the contextualized, visual troubleshooting assistance that researchers require in actual labs. VCT targets virology methods with dual-use potential as well as other closely related methods. 
It excludes general molecular and cellular biology methods, as well as a small portion of virology material that we judged excessively hazardous.</p><p>VCT comprises 322 multimodal questions covering practical virology problems. Each presents an experimental scenario&#8212;often with an image&#8212;and asks what went wrong or what to do next. The questions are designed to be:</p><ul><li><p><strong>Important</strong>: testing knowledge essential for competent and successful lab work</p></li><li><p><strong>Google-proof</strong>: answers cannot be easily found through web searches</p></li><li><p><strong>Validated</strong>: answers are verified through expert peer review</p></li></ul><p>To ensure these qualities, we designed a rigorous question-creation process:</p><ol><li><p>We recruited virologists with at least one year of graduate-level research experience (the average ended up being just shy of 6 years)</p></li><li><p>Each question underwent double-blind peer review and editing by other experts</p></li><li><p>Questions that were answerable by non-experts using web search were eliminated</p></li></ol><p>During recruitment, we realized what a boon academic conferences are. Cramming a lot of project context into an interesting cold email is difficult (though not impossible: we did recruit many participants through email outreach!), but talking to dozens of virologists about the project at the American Society for Virology&#8217;s 43rd annual meeting generated a lot of interest and follow-up participation.</p><p>By the end of the submission phase, 68 experts had contributed over 500 questions drawn from their actual lab experiences. After review, editing, and non-expert testing, 322 questions remained: 221 containing an image, and 101 text-only questions. 
These questions can be answered in open-ended or multiple-choice format.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3_-P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3_-P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png 424w, https://substackcdn.com/image/fetch/$s_!3_-P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png 848w, https://substackcdn.com/image/fetch/$s_!3_-P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!3_-P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3_-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png" width="1290" height="1320" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1320,&quot;width&quot;:1290,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3_-P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png 424w, https://substackcdn.com/image/fetch/$s_!3_-P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png 848w, https://substackcdn.com/image/fetch/$s_!3_-P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!3_-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4542f4f-900b-47f0-abb4-211e9ff731d3_1290x1320.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">A VCT example question in the multiple-response format, requiring respondents to identify all true statements from a set of 4&#8211;10 options. Each question is also accompanied by a grading rubric for evaluating open-ended responses when answer statements are not provided.</figcaption></figure></div><p>Finally, to establish a human baseline, we had expert virologists answer question subsets that were specifically tailored to their self-declared areas of expertise. 
That way, we could measure how well virologists fare when asked about methods they consider among their top competencies, rather than asking virologists about <em>any</em> virology-related method.</p><h2>AIs Began Exceeding Human Virologists in February&#8230;2024</h2><p>To our surprise, the performance gap between humans and LLMs is stark:</p><ul><li><p>Our expert virologists averaged 22.1% on question subsets individually tailored to their own areas of expertise.</p></li><li><p>The leading LLM, OpenAI's o3, achieved 43.8% accuracy on the whole benchmark and outperformed 94% of virologists on the matched question subsets.</p></li><li><p>Google's Gemini 2.5 Pro scored 37.6%, placing in the 81st percentile.</p></li><li><p>Anthropic's Claude 3.5 Sonnet (Oct &#8217;24 version) reached 33.6%, ranking in the 75th percentile.</p></li><li><p>The first LLM to beat the median expert virologist, Gemini 1.5 Pro, was released in February 2024.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rYio!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rYio!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png 424w, https://substackcdn.com/image/fetch/$s_!rYio!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png 848w, 
https://substackcdn.com/image/fetch/$s_!rYio!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png 1272w, https://substackcdn.com/image/fetch/$s_!rYio!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rYio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png" width="1456" height="962" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:962,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rYio!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png 424w, https://substackcdn.com/image/fetch/$s_!rYio!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png 848w, 
https://substackcdn.com/image/fetch/$s_!rYio!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png 1272w, https://substackcdn.com/image/fetch/$s_!rYio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456ddfc-3903-48c5-9709-23ae58f85eee_1600x1057.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">In each column, a dot represents a unique set of 10&#8211;30 VCT questions, tailored to a given virologist&#8217;s 
specific areas of expertise. Only the difference between expert and model score is shown, to account for the fact that each tailored set may have a different overall &#8220;difficulty&#8221;. Each tailored set was assessed with each model. Values above 0 are question sets in which a model outperformed the human. The overall performance of the model relative to the pool of 36 experts is shown as a percentile above.</figcaption></figure></div><p>It was not the advent of reasoning models in the fall of 2024 that gave LLMs the edge over expert virologists; frontier models like Gemini 1.5 Pro or Claude 3.5 Sonnet have been able to match or exceed the ability of human experts to provide practical troubleshooting assistance for over a year, and the disparity between humans and models is widening.</p><p>Every VCT scenario represents a few virologists&#8217; consensus on the right way to solve a problem. Thus, the results indicate that individual experts are less effective than we anticipated at identifying the expert consensus, whereas leading models are surprisingly good at it. We interpret this result to mean that the training corpus of leading models has a strong representation of expert human consensus in this domain, and that we are effectively seeing &#8220;the wisdom of the crowd&#8221; at work, mediated by LLMs.</p><h2>Dealing with the Downsides of Democratized Expertise</h2><p>As a virologist myself, I find these results simultaneously impressive and familiar. During my PhD, many of my own and my fellow researchers&#8217; wet-lab excursions entailed seeking advice from multiple colleagues&#8212;sometimes still PhD candidates themselves&#8212;who had spent the last two to twenty years working with many variations of a specific technique: specialization and collaboration. 
You would show them pictures of your cells or gels, share your lab notebook sections, and tell them the next steps you&#8217;re considering (and you obviously don&#8217;t approach your PI with something that two minutes of Googling would solve!).</p><p>This is precisely the experience VCT simulates. Since LLMs have broad exposure across the whole field (or rather, <em>any</em> field), they've synthesized the equivalent of conversations with hundreds of specialists from scattered pieces of information hidden in papers, online forums, and patents&#8212;knowledge previously considered tacit for individual human experts. Combined with the extensive reasoning and web search that frontier models employ, this creates a formidable troubleshooting assistant.</p><p>It is important to point out that VCT does not measure hazardousness per se. All techniques covered are standard methods that are used daily for beneficial research. What VCT shows is that AI systems can provide the kind of specialized troubleshooting advice that typically requires years of training&#8212;and this applies equally well to methods that are benign and those that would be particularly concerning for causing harm.</p><p>How accessible do these models make virology to non-experts now? One might object that asking such detailed troubleshooting questions already requires considerable expertise and familiarity with the subject. To some degree, this is correct. But existing resources&#8212;tutorials, manuals, and the same endlessly patient AI models that also excel in expert assistance&#8212;can help you reach the point where you get stuck without expert consultation. VCT covers precisely those problems on which actual virologists think you&#8217;re most likely to fail without experienced guidance.</p><p>Follow-up studies performed by SecureBio and others will soon examine whether AI assistance improves experimental outcomes in actual labs. 
During our evaluations, we also observed a few consistent cases where AIs disagreed with expert-provided answer keys, prompting us to think about how to reliably measure AI progress on topics in which expert knowledge stops being a reliable yardstick.</p><p>What's clear is that the conversation in science labs will inevitably change. A first-year graduate or undergraduate student can now describe a failed experiment, show an image, and receive guidance comparable to consulting a senior colleague. The boundary between novice and expert&#8212;always porous in practice&#8212;is becoming even less defined.</p><p>Anecdotally, however, the diffusion of this technology into the lives of practicing researchers is still slow. After completing the benchmark, we asked some of our participants whether an LLM matching their performance on VCT-like problems would increase their productivity in the lab. Almost all of them envisioned large productivity gains and significant acceleration of their research. Yet when asked whether they currently use LLMs in their work, we received a unanimous &#8216;no&#8217;.</p><p>Experts will and should leverage AI assistance for dual-use research, but <em>we must treat the ability to provide expert assistance itself as dual-use</em>: requiring additional oversight, but accessible to legitimate researchers and institutions. We suggest that publicly available models should not offer expert guidance on methods that would be most conducive to causing mass harm, such as detailed culturing protocols for organisms falling under biosafety levels 3 or 4.</p><p>We would love to hear your thoughts and questions about VCT and the intersection of AI and biology at <a href="mailto:benchmarks@securebio.org">benchmarks@securebio.org</a>.</p>]]></content:encoded></item></channel></rss>