
Claude Just Beat Chemistry Software at Its Own Game

Erdeniz Korkmaz

Something shifted in analytical chemistry this week, and it's worth paying attention to whether you work in the lab or not.

Anthropic published a white paper showing that Claude Opus 4.7 can predict NMR spectra with accuracy that rivals, and in parts beats, ChemDraw and MestReNova. These are not obscure tools. They are the industry-standard software synthetic chemists use every day to analyse molecular structure, and they've been refined over decades. Claude, a general-purpose language model with no chemistry-specific fine-tuning, is now competitive with them.

NMR spectroscopy is one of the core analytical techniques in chemistry. Every novel compound gets a spectrum. The process is roughly this: you dissolve your molecule in a solvent, place it in a strong magnetic field, probe it with radio-frequency pulses, and read the pattern of peaks it produces. Each peak corresponds to an atom or group of atoms. Matching those peaks to the proposed structure by hand is slow, careful work. Tools like ChemDraw and MestReNova were built specifically to help with this, and they're good at it.

So Anthropic set up a proper comparison. They pulled 20 compounds from ChemRxiv preprints published after the models' training cutoffs (to avoid data leakage), split them across four structural families, and asked each tool to predict where peaks would land on a 1D NMR spectrum. For hydrogen, Opus 4.7 averaged an error of ±0.079 ppm. The accepted tolerance window in chemistry is ±0.20 ppm, so that result sits comfortably inside the window. For carbon, Opus 4.7 and MestReNova effectively tied, at ±1.37 and ±1.48 ppm respectively. On one trickier measure, the fine-grained spacing between the sub-peaks within each peak (what chemists call coupling constants, measured in hertz), all three Claude models landed within half a hertz of the measured value around 80% of the time. ChemDraw and MestReNova managed between 26 and 35%.
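The two kinds of numbers quoted above, an average error and a share of predictions inside a tolerance, are simple to compute once predicted and measured peaks have been paired up. Here is a minimal sketch in Python, assuming that pairing has already been done. The function names and all the peak values are made up for illustration and are not taken from the white paper; only the two thresholds come from the figures quoted in this article.

```python
from statistics import mean

H_TOLERANCE_PPM = 0.20  # hydrogen shift tolerance window quoted in the article
J_TOLERANCE_HZ = 0.5    # "within half a hertz" threshold for sub-peak spacing


def mean_absolute_error(predicted, observed):
    """Average of |predicted - observed| over peaks that are already paired."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be paired one-to-one")
    return mean(abs(p - o) for p, o in zip(predicted, observed))


def fraction_within(predicted, observed, tolerance):
    """Share of paired values whose absolute difference is <= tolerance."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be paired one-to-one")
    hits = sum(1 for p, o in zip(predicted, observed) if abs(p - o) <= tolerance)
    return hits / len(predicted)


# Hypothetical values for illustration only; not data from the white paper.
predicted_shifts_ppm = [7.30, 3.70, 2.10, 1.20]
observed_shifts_ppm = [7.26, 3.81, 2.05, 1.25]

predicted_spacings_hz = [7.5, 7.1, 1.8, 15.2]
observed_spacings_hz = [7.4, 7.9, 2.0, 16.0]

shift_mae = mean_absolute_error(predicted_shifts_ppm, observed_shifts_ppm)
shifts_in_window = fraction_within(
    predicted_shifts_ppm, observed_shifts_ppm, H_TOLERANCE_PPM
)
spacings_in_window = fraction_within(
    predicted_spacings_hz, observed_spacings_hz, J_TOLERANCE_HZ
)

print(f"Mean absolute shift error: {shift_mae:.4f} ppm")
print(f"Shifts within tolerance: {shifts_in_window:.0%}")
print(f"Spacings within tolerance: {spacings_in_window:.0%}")
```

In a real evaluation, the hard part is the step this sketch skips: deciding which predicted peak corresponds to which measured peak before any error is computed.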

That's the forward task: given a structure, predict the spectrum. Then they tried the harder problem.

Structure elucidation is the reverse operation: given a spectrum and a molecular formula, work out what molecule produced it. Dedicated elucidation software typically needs 2D NMR data and specialist training to do this. Claude does it from a standard 1D peak list and a molecular formula. On 8 simpler targets, Opus 4.7 got the structure right on every attempt. On 7 harder targets, with the starting-material structure provided as context, it returned the correct answer on all three runs for 4 of them. No existing off-the-shelf tool does the inverse problem from 1D data alone.
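"Correct on all three runs" is a stricter bar than "correct at least once", and it is worth seeing how that scoring works. The sketch below is one plausible way to count it, not Anthropic's actual evaluation code. It assumes each answer is a structure string that has already been normalised so that identical molecules compare equal (in practice that would need a cheminformatics toolkit). The target names, answers, and function names are all hypothetical.

```python
def solved_on_every_run(run_answers, reference):
    """True only if there is at least one run and every run matches the reference."""
    return len(run_answers) > 0 and all(a == reference for a in run_answers)


def count_consistently_solved(results):
    """Count targets whose every run matched.

    `results` maps a target name to (reference_structure, [answer_per_run]).
    """
    return sum(
        1
        for reference, answers in results.values()
        if solved_on_every_run(answers, reference)
    )


# Hypothetical example: three targets, three runs each.
# Structures are written as SMILES strings and assumed pre-normalised.
example_results = {
    "target_a": ("CCO", ["CCO", "CCO", "CCO"]),              # right every time
    "target_b": ("CC(C)=O", ["CC(C)=O", "CCC=O", "CC(C)=O"]),  # one miss
    "target_c": ("c1ccccc1", ["c1ccccc1"] * 3),              # right every time
}

solved = count_consistently_solved(example_results)
print(f"{solved} of {len(example_results)} targets solved on every run")
```

Requiring agreement across runs guards against a model that lands on the right structure by luck in one sample and misses it in the next.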

The model doing all of this is a general reasoner. No fine-tuning. It brings chemical knowledge from training, reasons through the problem step by step, and shows its work. A chemist can audit the output. That last part matters more than it sounds.

Here's why this matters beyond the chemistry labs. The pattern is the thing. General-purpose frontier models are increasingly competitive with specialist software that took decades and domain-specific investment to build. It happened with code. It's happening with legal research and contract review. Now it's happening with analytical chemistry. The frontier model improves continuously, generalises across tasks, and costs a fraction of the licensing fees on incumbent tooling.

For founders and product teams, the implication is real. If your users spend time on analytical translation work, moving data between representations of the same underlying thing, a well-built LLM pipeline may already be a viable alternative. You don't need to wait for a chemistry-specific fine-tune or a purpose-built vertical AI product to start capturing value from this.

This is exactly the kind of problem Dakik works on. If you're building in biotech, pharma, materials, or any domain where researchers spend significant time on structured analysis, we can help you work out what's feasible and build the integration properly. That might mean a RAG pipeline over your compound database, a custom reasoning agent that routes different analytical tasks to the right model and tool, or a workflow that puts AI-generated candidates in front of human reviewers at the right confidence threshold. We've built this sort of thing before, and we know what separates a compelling proof of concept from something that holds up in production.
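To make the last of those options concrete, here is a minimal sketch of confidence-threshold routing in Python. Everything in it is an assumption made for illustration: the `Candidate` fields, the 0.9 threshold, and above all the confidence score itself, since how you obtain a trustworthy score is its own design problem and is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    compound_id: str
    proposed_structure: str
    confidence: float  # hypothetical score in [0, 1]; its source is an assumption


def route(candidates, review_threshold=0.9):
    """Split candidates into auto-accepted and human-review queues.

    Candidates scoring below the threshold go to reviewers,
    least confident first.
    """
    auto_accepted, needs_review = [], []
    for candidate in candidates:
        if candidate.confidence >= review_threshold:
            auto_accepted.append(candidate)
        else:
            needs_review.append(candidate)
    needs_review.sort(key=lambda c: c.confidence)
    return auto_accepted, needs_review


# Hypothetical batch of model-generated candidates.
batch = [
    Candidate("cmp-001", "CCO", 0.97),
    Candidate("cmp-002", "CC(C)=O", 0.62),
    Candidate("cmp-003", "c1ccccc1", 0.88),
]

accepted, review_queue = route(batch)
print("Auto-accepted:", [c.compound_id for c in accepted])
print("Needs review:", [c.compound_id for c in review_queue])
```

Where the threshold sits is a product decision as much as a technical one: set it too low and reviewers stop trusting the auto-accepted pile, set it too high and the pipeline saves nobody any time.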

The chemistry results are genuinely impressive, and Anthropic are clear about the limitations: 20 compounds is a small test set, 2D NMR is out of scope, and some scaffold types weren't covered. But the trajectory is obvious. This is one of the cleaner examples in recent memory of a general model stepping into specialist territory and doing real work there.

Worth watching if you build anything where domain expertise is currently the bottleneck.
