DNA-Encoded Libraries and AI for Compound Identification

An expert interview on the use of DNA-encoded libraries for compound identification.

By Vendela Jagdt

An expert interview on the use of DNA-encoded libraries for compound identification with Eric Sigel, Co-Founder & CEO at Citadel Discovery, and Dr. Thomas Wollmann, CTO at Merantix Momentum. Moderated by Dr. Gillian Hertlein, Strategic Project Manager at Merantix Momentum.

Gillian: Thank you both for joining us today. I would like to welcome Eric Sigel, CEO of Citadel Discovery, who has over two decades of experience in drug discovery, including pioneering work in DNA-encoded libraries and machine learning applications, and Thomas Wollmann, CTO at Merantix Momentum, who has a background in microscopy image analysis for systems biology and has worked extensively in various areas of medical informatics.

Let's dive right into the topic of DNA-encoded libraries. Some of our audience may not be familiar with the concept. Eric, could you provide a simplified explanation of how traditional methods identify small molecules in drug discovery?

Eric: Thank you, Gillian. In traditional drug discovery, scientists search for small molecules that can interact with a target protein. They often rely on biochemical or cell-based assays to identify these molecules. It's a bit like finding a needle (the active molecules) in a haystack (large collections of molecules). These methods are effective, but they can be time-consuming and costly.

Gillian: Could you explain how DNA-encoded libraries work and how they have revolutionized the process?

Eric: Traditional DNA-encoded library (DEL) methods start with a specialized molecular headpiece, which has DNA and chemistry components. This headpiece is reacted with a range of chemical building blocks and an encoding strand of DNA in successive steps. This process, known as split-and-pool, allows us to create a library of millions to billions of compounds, each one encoded by its own DNA tag. These libraries can then be incubated with a protein of interest. Non-binders are washed away, and the molecules that remain bound to the protein are isolated and identified by next-generation sequencing of their DNA tags, for example on an Illumina platform. This approach offers the advantage of testing a large number of compounds simultaneously, unlike traditional methods that test one compound at a time.
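To make the split-and-pool idea concrete, the following is a minimal Python sketch of how a combinatorial library and its DNA barcodes could be enumerated and decoded. The building-block names, the 4-base tags, and the functions `enumerate_library` and `decode` are hypothetical and chosen only for illustration; real encoding schemes and the underlying chemistry are considerably more elaborate.

```python
from itertools import product

# Hypothetical building blocks for three synthesis cycles. Each block is paired
# with a made-up 4-base DNA tag; none of these come from a real library.
CYCLES = [
    {"A1": "ACGT", "A2": "AGCT", "A3": "ATCG"},
    {"B1": "CAGT", "B2": "CTGA"},
    {"C1": "GACT", "C2": "GTCA", "C3": "GCAT", "C4": "GATC"},
]
TAG_LENGTH = 4


def enumerate_library(cycles):
    """Return {barcode: (block, block, ...)} for every combination of blocks."""
    library = {}
    for combo in product(*(cycle.items() for cycle in cycles)):
        barcode = "".join(tag for _, tag in combo)
        library[barcode] = tuple(name for name, _ in combo)
    return library


def decode(barcode, cycles, tag_length=TAG_LENGTH):
    """Recover the building blocks of one compound from its sequenced barcode."""
    blocks = []
    for i, cycle in enumerate(cycles):
        tag = barcode[i * tag_length:(i + 1) * tag_length]
        tag_to_name = {t: name for name, t in cycle.items()}
        blocks.append(tag_to_name[tag])
    return tuple(blocks)


library = enumerate_library(CYCLES)
print(len(library))                    # 3 * 2 * 4 = 24 encoded compounds
print(decode("AGCTCTGAGCAT", CYCLES))  # ('A2', 'B2', 'C3')
```

The same arithmetic explains the scale mentioned above: three cycles with 1,000 building blocks each would give 1,000 × 1,000 × 1,000 = one billion encoded compounds.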

Gillian: Could you elaborate on how machine learning is used in DNA-encoded libraries?

Eric: Machine learning plays a crucial role in analyzing the vast amounts of data generated by DNA-encoded libraries. It excels at identifying statistical trends within complex datasets. With data this extensive, human analysts can't effectively process all the information. Machine learning steps in to filter the data and identify trends, even ones subtle enough to go unnoticed by human analysts. It can recognize patterns that generalize across the entire library, allowing us to predict how likely a molecule is to interact with the protein, which is especially useful for cost-effective molecule selection and optimization.
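As an illustration of the kind of signal such models could be trained on, here is a small Python sketch that turns sequencing read counts into a per-compound enrichment score. It assumes a simple setup that is not described in the interview: one selection against the target, one no-target control, and a frequency ratio with a pseudocount. The function name `enrichment` and all counts are invented for the example; this is not presented as the method used at Citadel Discovery.

```python
def enrichment(selected, control, pseudocount=1.0):
    """Ratio of each compound's read frequency with target vs. in the control.

    `selected` and `control` map compound IDs to raw read counts. The
    pseudocount keeps compounds with zero reads from causing division by zero.
    """
    compounds = set(selected) | set(control)
    sel_total = sum(selected.values()) + pseudocount * len(compounds)
    ctl_total = sum(control.values()) + pseudocount * len(compounds)
    scores = {}
    for compound in compounds:
        sel_freq = (selected.get(compound, 0) + pseudocount) / sel_total
        ctl_freq = (control.get(compound, 0) + pseudocount) / ctl_total
        scores[compound] = sel_freq / ctl_freq
    return scores


# Invented read counts for three compounds (illustrative only).
selected_counts = {"A1-B1": 90, "A1-B2": 5, "A2-B1": 5}
control_counts = {"A1-B1": 10, "A1-B2": 10, "A2-B1": 10}

scores = enrichment(selected_counts, control_counts)
best = max(scores, key=scores.get)
print(best)  # A1-B1 is the compound most enriched by the target
```

Scores above 1 suggest a compound is over-represented after selection. A model could then be fitted to relate chemical structure to such scores, which is where the generalizable trends come in.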

Gillian: Thomas, given your expertise in cell biology, could you share insights on how this technology could be applied in cell biology environments?

Thomas: Sure, the barcoding technique used in DNA-encoded libraries is also relevant in cell biology. It enables researchers to simultaneously examine phenotypes at the cellular level and sequence single cells. This approach opens up new possibilities for studying individual cells within a population, which can provide valuable insights. It allows researchers to analyze cells individually, considering their unique characteristics, which is particularly useful in the context of heterogeneous cell populations.

Gillian: Eric, despite the advantages of DNA-encoded libraries, the practice is not yet standard in all drug compound screens. What are some reasons people still prefer traditional high-throughput screening (HTS)(1) methods?

Eric: Traditional HTS has its merits, and there are reasons why some researchers opt for it. HTS provides data that closely resembles the output of interest, such as biochemical or cell-based assay readouts. Additionally, many researchers have well-established HTS capabilities, including large compound collections and robotic systems. However, DNA-encoded libraries are gaining ground due to their cost-effectiveness, ability to test multiple conditions simultaneously, and the potential for broad applicability across different target types. It's worth noting that DNA-encoded libraries may not be suitable for all targets, especially proteins that are challenging to purify or that are intrinsically disordered(2).

Gillian: Now, let's talk about predictions. How do you make predictions about which compounds will bind to a protein, and how do you assess the accuracy of these predictions?

Eric: Predictions are a crucial aspect of the drug discovery process. When making predictions, it's essential to validate them through experimental testing. One approach is to retrospectively predict compounds that have already been identified (with or without machine learning) and then compare these predictions to the actual experimental results. This allows us to gauge the accuracy of our machine learning models. Additionally, predictions are inherently tested when we select molecules based on them and assess their binding to a protein. This iterative process of prediction and testing helps us refine and improve our models over time.
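The retrospective comparison described here can be sketched in a few lines of Python. The example assumes that each compound has a model score and a yes/no experimental binding result, and it uses two common summary measures, top-k hit rate and enrichment over random selection. The function names and the data are hypothetical; other metrics may be more appropriate in practice.

```python
def top_k_hit_rate(scores, is_binder, k):
    """Fraction of the k highest-scoring compounds that bound in the experiment."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(is_binder[c] for c in ranked[:k]) / k


def enrichment_factor(scores, is_binder, k):
    """How much better the top-k selection is than picking compounds at random."""
    baseline = sum(is_binder.values()) / len(is_binder)
    return top_k_hit_rate(scores, is_binder, k) / baseline


# Invented model scores and experimental outcomes (illustrative only).
predicted = {"c1": 0.90, "c2": 0.80, "c3": 0.30, "c4": 0.20, "c5": 0.10, "c6": 0.05}
measured = {"c1": True, "c2": False, "c3": True, "c4": False, "c5": False, "c6": False}

hit_rate = top_k_hit_rate(predicted, measured, k=2)
factor = enrichment_factor(predicted, measured, k=2)
print(hit_rate)  # 0.5: one of the two top-ranked compounds is a binder
print(factor)    # roughly 1.5 times better than random selection
```

Tracking such numbers from one round of prediction and testing to the next gives a simple way to see whether the models are actually improving.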

Gillian: A question for both of you. How do you see the future of drug identification with the integration of AI and machine learning?

Thomas: I see two major areas where AI and machine learning will continue to revolutionize compound identification. First, we'll see advancements in the design-make-test cycle, with the integration of automation and systems that enable faster experimentation and data collection. This will allow us to continuously improve our models and make predictions at a larger scale. Second, there's a growing opportunity to leverage historical data and contextual information from various sources, including literature and databases. With improved data mining techniques and ontological mapping, we can build better models for compound identification by integrating diverse datasets effectively.

Eric: I agree with Thomas, and I'd like to add a few more points. Firstly, the combination of computational chemistry and machine learning will become increasingly important. Researchers will explore the synergy between physics-based calculations and machine learning to understand molecular interactions better. Secondly, advancements in predicting protein structures, such as those seen with AlphaFold, will enable more accurate predictions of small molecule interactions with proteins. We'll also witness improvements in multi-parameter optimization, where models will consider both molecule potency and drug likeness. Lastly, logistical challenges, such as faster experiment turnaround times, will need to be addressed for machine learning and drug discovery to work seamlessly.

Gillian: How many years do you think it will take before we can rely on predictions to create compounds without extensive testing of libraries?

Eric: Predicting the exact timeline is challenging, but I believe we're making rapid progress. I wouldn't be surprised if, in a few years, we can identify compounds based on predictions for certain cases. Perhaps within a decade, it could become a regular practice. However, it's important to note that there will always be cases where testing is necessary due to the vast combinatorial complexity of molecule-protein interactions.

Thomas: I concur with Eric. We're making significant strides, and in the next few years, we might witness the ability to create compounds based on predictions in specific cases. Within a decade, it could become more routine. Nevertheless, there will always be scenarios where testing remains essential due to the complexity of molecular combinations.

Gillian: Thank you both for your responses. We appreciate your time and expertise.

(1) High-Throughput Screening (HTS): A drug discovery process that allows automated testing of large numbers of chemical and/or biological compounds against a specific biological target, for example through binding assays.

(2) Intrinsically Disordered Proteins: These proteins lack a well-defined three-dimensional structure but do exhibit some dynamical and structural ordering.

