AIs can provide expert-level virology assistance
When we set out to test if LLMs can match virologists on troubleshooting complex virology lab scenarios, we didn’t expect to find that they had already begun to surpass experts a year ago.
The Challenge of Measuring Scientific Expertise
Large language models (LLMs) have been at the forefront of AI progress over the past few years. They help people write and edit text, assist with coding, and are great partners for brainstorming. They will patiently accompany you down rabbit holes, explaining and clarifying new things along the way. What has been less obvious is whether this general usefulness extends into the sciences, where success often relies on hard-to-find tacit knowledge, hard-won practical experience, or interpreting and connecting disparate pieces of information.
For our pandemic prevention work at SecureBio, we are especially keen to understand how AI progress impacts virology research. Virology is intrinsically dual-use. Powerful AI assistance could accelerate beneficial virology research on vaccines and antivirals. It might also enable malicious actors to more easily misuse viruses to cause harm. We needed a way to robustly measure an AI’s ability to assist in virology work.
Unfortunately, helpfulness for a practical science like virology is hard to measure. Many traditional AI benchmarks test knowledge retrieval using exam-style multiple choice questions. They ask models to recall academic facts, or perhaps write analysis code. But this approach misses a major aspect of successful laboratory work: troubleshooting experiments and protocols. Practical lab work often relies on the ability to interpret ambiguous results, often visually, and then determine next steps, drawing on tacit knowledge that resides not in textbooks but in lab meetings and hallway conversations. These abilities aren't easily articulated, let alone quantified.
Quantifying the Tacit
We developed the Virology Capabilities Test (VCT) to attempt exactly this quantification. We created a benchmark that measures an AI’s ability to provide the contextualized, visual troubleshooting assistance that researchers require in actual labs. VCT targets virology methods with dual-use potential as well as other closely related methods. It excludes general molecular and cellular biology methods, as well as a small portion of virology material that we judged excessively hazardous.
VCT comprises 322 multimodal questions covering practical virology problems. Each presents an experimental scenario—often with an image—and asks what went wrong or what to do next. The questions are designed to be:
Important: testing knowledge essential for competent and successful lab work
Google-proof: answers cannot be easily found through web searches
Validated: answers are verified through expert peer review
To ensure these qualities, we designed a rigorous question-creation process:
We recruited virologists with at least one year of graduate-level research experience (the average ended up being just shy of 6 years)
Each question underwent double-blind peer review and editing by other experts
Questions that were answerable by non-experts using web search were eliminated
For recruitment, we realized what a boon academic conferences are. Cramming a lot of project context into an engaging cold email is difficult (though not impossible; we did recruit many participants through email outreach!), but talking to dozens of virologists about the project at the American Society of Virology’s 43rd annual meeting generated a lot of interest and follow-up participation.
By the end of the submission phase, 68 experts had contributed over 500 questions drawn from their actual lab experience. After review, editing, and non-expert testing, 322 questions remained: 221 containing an image and 101 text-only. These questions can be answered in open-ended or multiple-choice format.
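To make this format concrete, here is a minimal sketch of what a single benchmark item and a simple multiple-choice scoring pass could look like. The VCTItem fields, the grade_multiple_choice helper, and the example item are illustrative assumptions rather than the actual VCT schema or evaluation harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VCTItem:
    """One hypothetical benchmark item; field names are illustrative, not the real schema."""
    scenario: str              # experimental context and what was observed
    question: str              # e.g. "what went wrong?" or "what should you do next?"
    image_path: Optional[str]  # set for image-based items, None for text-only ones
    choices: list[str]         # answer options used for multiple-choice scoring
    answer_index: int          # index of the expert-consensus answer


def grade_multiple_choice(items: list[VCTItem], predictions: list[int]) -> float:
    """Fraction of items on which the model picked the consensus answer."""
    correct = sum(pred == item.answer_index for item, pred in zip(items, predictions))
    return correct / len(items)


# Purely illustrative item and prediction, not taken from VCT itself.
example = VCTItem(
    scenario="The cell monolayer detached after the agarose overlay step of a plaque assay.",
    question="What is the most likely cause?",
    image_path=None,
    choices=["Overlay added while too hot", "Wrong MOI", "Overlay too thick", "Plates moved too early"],
    answer_index=0,
)
print(grade_multiple_choice([example], [0]))  # 1.0
```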

Finally, to establish a human baseline, we had expert virologists answer question subsets specifically tailored to their self-declared areas of expertise. That way, we could measure how well virologists fare when asked about methods they consider among their top competencies, rather than about arbitrary virology methods.
AIs Began Exceeding Human Virologists in February…2024
To our surprise, the performance gap between humans and LLMs is stark:
Our expert virologists averaged 22.1% accuracy on question subsets individually tailored to their own areas of expertise.
The leading LLM, OpenAI's o3, achieved 43.8% accuracy on the full benchmark and outperformed 94% of virologists on the matched question subsets (a rough sketch of this percentile comparison follows below the list).
Google's Gemini 2.5 Pro scored 37.6%, placing in the 81st percentile.
Anthropic's Claude 3.5 Sonnet (Oct ‘24 version) reached 33.6%, ranking in the 75th percentile.
The first LLM to beat the median expert virologist, Gemini 1.5 Pro, was released in February 2024.
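For readers curious how such percentile figures can be read, here is a rough sketch assuming a simple pairwise comparison: for each expert, compare the model's accuracy on that expert's tailored subset with the expert's own accuracy, and report the fraction of experts the model beats. The function and the toy numbers are illustrative only, not the paper's exact procedure or data.

```python
def expert_percentile(model_scores: dict[str, float],
                      expert_scores: dict[str, float]) -> float:
    """Percentage of experts whose accuracy on their own tailored question subset
    is below the model's accuracy on that same subset (illustrative definition).

    Both dicts map an expert ID to an accuracy in [0, 1]; model_scores[e] is the
    model's accuracy restricted to expert e's matched subset."""
    beaten = sum(model_scores[e] > expert_scores[e] for e in expert_scores)
    return 100 * beaten / len(expert_scores)


# Toy numbers only (not VCT data): the model beats 3 of 4 experts -> 75th percentile.
experts = {"e1": 0.20, "e2": 0.25, "e3": 0.18, "e4": 0.40}
model = {"e1": 0.35, "e2": 0.30, "e3": 0.33, "e4": 0.28}
print(expert_percentile(model, experts))  # 75.0
```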

It was not the advent of reasoning models in the fall of 2024 that gave LLMs the edge over expert virologists; frontier models like Gemini 1.5 Pro or Claude 3.5 Sonnet have been able to match or exceed the ability of human experts to provide practical troubleshooting assistance for over a year, and the disparity between humans and models is widening.
Every VCT scenario represents a few virologists’ consensus on the right way to solve a problem. Thus, the results indicate that individual experts are less effective than we anticipated at identifying that consensus, whereas leading models are surprisingly good at it. We interpret this result to mean that the training corpora of leading models strongly represent expert human consensus in this domain, and we are effectively seeing “the wisdom of the crowd” at work, mediated by LLMs.
Dealing with the Downsides of Democratized Expertise
As a virologist myself, I find these results simultaneously impressive and familiar. During my PhD, much of my own and my fellow researchers’ wet lab work entailed seeking advice from multiple colleagues—sometimes still PhD candidates themselves—who had spent anywhere from two to twenty years working with variations of a specific technique: specialization and collaboration in action. You would show them pictures of your cells or gels, share sections of your lab notebook, and describe the next steps you were considering (and you obviously didn’t approach your PI with something that two minutes of Googling could solve!).
This is precisely the experience VCT simulates. Since LLMs have broad exposure across the whole field (or rather, any field), they've synthesized the equivalent of conversations with hundreds of specialists from scattered pieces of information hidden in papers, online forums, and patents—knowledge previously considered tacit for individual human experts. Combined with the extensive reasoning and web search that frontier models employ, this creates a formidable troubleshooting assistant.
It is important to point out that VCT does not measure hazardousness per se. All techniques covered are standard methods used daily for beneficial research. What VCT shows is that AI systems can provide the kind of specialized troubleshooting advice that typically requires years of training—and this applies equally to benign methods and to those of particular concern for causing harm.
How accessible do these models make virology to non-experts now? One might object that asking such detailed troubleshooting questions already requires considerable expertise and familiarity with the subject. To some degree, this is correct. But existing resources—tutorials, manuals, and the same endlessly patient AI models that also excel at expert assistance—can carry you to the point where, without expert consultation, you would get stuck. VCT covers precisely those problems on which actual virologists think you’re most likely to fail without experienced guidance.
Follow-up studies by SecureBio and others will soon examine whether AI assistance improves experimental outcomes in actual labs. During our evaluations, we also observed a few cases where AIs consistently disagreed with the expert-provided answer keys, prompting us to think about how to reliably measure AI progress once expert knowledge stops being a reliable yardstick.
What's clear is that the conversation in science labs will inevitably change. A first-year graduate or undergraduate student can now describe a failed experiment, show an image, and receive guidance comparable to consulting a senior colleague. The boundary between novice and expert—always porous in practice—is becoming even less defined.
Anecdotally, however, the diffusion of this technology into the lives of practicing researchers is still slow. After participants had completed the benchmark, we asked some of them whether an LLM matching their performance on VCT-like problems would increase their productivity in the lab. Almost all of them envisioned large productivity gains and a significant acceleration of their research. Yet when we asked whether they currently use LLMs in their work, the answer was a unanimous ‘no’.
Experts will and should leverage AI assistance for dual-use research, but we must treat the ability to provide expert assistance as itself dual-use: requiring additional oversight, yet accessible to legitimate researchers and institutions. We suggest that publicly available models should not offer expert guidance on the methods most conducive to causing mass harm, such as detailed culturing protocols for organisms handled at biosafety levels 3 or 4.
We would love to hear your thoughts and questions about VCT and the intersection of AI and biology at benchmarks@securebio.org.