Synopsis
Claude’s performance in an experiment by a Harvard professor shows that AI models are beginning to reshape scientific workflows, dramatically cutting down research time. However, its limitations in reasoning and reliability render it, at best, a powerful assistant rather than a fully autonomous researcher.

What is vibe physics?
Anthropic titled its blog post ‘vibe physics’, documenting an experiment in which Claude Opus 4.5, its most advanced LLM, was asked to generate an entirely new research paper on a complex topic in theoretical physics.
Following the experiment, the professor concluded that AI assistants like Claude will give rise to vibe physics, just as they gave rise to vibe coding. Vibe coding is a term coined by OpenAI cofounder and former Tesla AI director Andrej Karpathy to describe the modern software development approach in which developers use natural language to tell AI agents what to build, rather than writing code line by line.
In this context, the Anthropic experiment suggests that researchers will now be able to send similar natural language commands to AI agents to generate original research.
Professor Matthew Schwartz claims that this was something that models could not have done three months ago. Schwartz works on quantum field theory, particle physics, and machine learning at the Harvard Department of Physics in the US.
What was the experiment?
Schwartz asked Claude to produce an original research paper to gauge how far LLMs are from undertaking research on their own. While Claude made errors along the way, Schwartz ended up producing a peer-reviewed paper titled ‘Resummation of the Sudakov Shoulder in the C-Parameter’ in just two weeks, after supervising the model through a full research workflow.
“Over 110 separate drafts, 36M tokens, and 40+ hours of local CPU compute, Claude proved fast, indefatigable, and eager to please. Claude is impressively capable, but also sloppy enough that I found domain expertise essential for evaluating its accuracy,” noted Schwartz.
The experiment spanned a total of 270 sessions with Claude, involving over 52,000 exchanged messages, nearly 27.5 million tokens, and around 60 hours of human oversight.
Does it mean AI can carry out research on its own?
The experiment suggests that AI is not yet capable of conducting independent scientific research. Instead, it functions as a powerful assistant that can accelerate progress when guided by domain expertise. Schwartz concluded that Claude can perform research at the level of a second-year graduate student.
“My conclusion is that current LLMs are at the G2 level. I think they reached the G1 level around August 2025, when GPT-5 could do the coursework for basically any course we offer at Harvard. By December 2025, Claude Opus 4.5 was at the G2 level,” the professor noted.
He highlighted that the model performed well on technical tasks such as algebraic manipulation, coding, and data analysis, and proved effective at collating information from multiple sources, significantly reducing the time required to complete routine and computationally intensive parts of the research.
However, he also noted that the assistant struggled with accuracy and reliability: it misapplied formulas and, in some cases, adjusted outputs to match expected results rather than identifying the underlying mistakes. The model also failed to consistently verify its own work and required continuous human oversight to ensure correctness.
What does this mean for scientists and researchers?
Schwartz argued that researchers who effectively integrate AI into their workflows will see significant gains in productivity. At the same time, expertise, judgement, and problem selection become even more central, as these remain areas where AI continues to fall short.
“What this means is that although LLMs cannot yet do original theoretical physics research autonomously, they can vastly accelerate the research done by experts. For this project (which I completed with Claude in two weeks), I’d estimate that it would have taken me and a G2 student one to two years, and me without AI around three to five months. Ultimately, it accelerated my own research tenfold,” he added.