Large language models may speed drug discovery

Computational models have been a major time saver when it comes to predicting which protein molecules could make effective drugs, but many of those methods are themselves slow and demand a lot of computing power.

Now researchers at MIT and Tufts have devised an alternative approach based on an algorithm known as a large language model, which can figure out which words (or, in this case, amino acids) are most likely to appear together. The model can match target proteins and potential drug molecules without the computationally intensive step of calculating each protein’s 3D structure from its amino acid sequence. The resulting system can screen more than 100 million drug-protein pairs in a single day.

The researchers tested their model by screening a library of about 4,700 candidate drug molecules for their ability to bind to a set of 51 enzymes. From the top hits, they tested 19 drug-protein pairs; 12 of them showed strong binding affinity, whereas nearly all of the many other possible drug-protein pairs would be expected to have no affinity.

“Part of the reason why drug discovery is so expensive is because it has high failure rates,” says Rohit Singh, PhD ’12, a CSAIL research scientist and one of the lead authors of a paper on the work. “If we can reduce those failure rates by saying up front that this drug is not likely to work out, that could go a long way in lowering the cost of drug discovery.”