Tx-LLM: A Large Language Model for Therapeutics (2406.06316v1)

Published 10 Jun 2024 in cs.CL, cs.AI, cs.CE, and cs.LG

Abstract: Developing therapeutics is a lengthy and expensive process that requires the satisfaction of many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist LLM fine-tuned from PaLM-2 which encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities (small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free text, allowing it to predict a broad range of associated properties, achieving performance competitive with state-of-the-art (SOTA) on 43 out of 66 tasks and exceeding SOTA on 22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class performance on average for tasks combining molecular SMILES representations with text such as cell line names or disease names, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g., tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain finetuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery and development pipeline.

Summary

  • The paper presents Tx-LLM, which fine-tunes PaLM-2 on 709 datasets across 66 therapeutic tasks to advance drug discovery and development.
  • It mixes zero-shot and few-shot prompting over SMILES strings, amino acid sequences, and free-text inputs to integrate diverse data types.
  • Tx-LLM achieves performance competitive with or exceeding state-of-the-art (SOTA) on 43 of 66 tasks, surpasses SOTA on 22, and demonstrates positive cross-domain transfer.

Tx-LLM: An LLM for Therapeutics

Introduction

The paper introduces Tx-LLM, an approach to enhancing the drug discovery and development pipeline with LLMs. Tx-LLM is a fine-tuned variant of PaLM-2 designed to encode knowledge across diverse therapeutic modalities. The model is trained on a collection of 709 datasets covering 66 tasks that span various stages of drug discovery and development, yielding a generalist model capable of addressing a wide array of tasks with a single set of weights.

Figure 1: Overview of Tx-LLM, showcasing the integration of diverse datasets and the training approach using TxT.

Methodology

Datasets

Tx-LLM leverages the Therapeutics Data Commons (TDC) to curate a comprehensive dataset collection, TxT, spanning small molecules, proteins, nucleic acids, and other entities. Each datapoint is formatted with instructions, context, a question, and an answer to facilitate fine-tuning. Tasks fall into binary classification, regression, and generation categories, using representations such as SMILES strings and amino acid sequences.
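
To make this format concrete, here is a minimal sketch of how a datapoint might be serialized into the instruction/context/question/answer layout described above. The template wording and field names are illustrative assumptions, not the paper's exact prompt format.

```python
# A minimal sketch of serializing one TxT-style datapoint into text.
# Template wording is an assumption, not the paper's exact prompts.

def format_example(instruction: str, context: str,
                   question: str, answer: str | None = None) -> str:
    """Render one datapoint as text; omit the answer at inference time."""
    prompt = (
        f"Instructions: {instruction}\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    if answer is not None:
        prompt += f" {answer}"
    return prompt


# Hypothetical binary-classification example over a SMILES string.
print(format_example(
    instruction="Answer the question about the following small molecule.",
    context="SMILES: CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin
    question="Does this molecule cross the blood-brain barrier?",
    answer="Yes",
))
```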

Modeling and Training

Training of Tx-LLM begins with fine-tuning the PaLM-2 model on TxT. The training mixture combines zero-shot and few-shot prompting, with prompts containing varying numbers of shots selected either randomly or as nearest-neighbor datapoints. Performance is evaluated against state-of-the-art (SOTA) results for each dataset to gauge the model's efficacy.
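
The sketch below illustrates one plausible instantiation of nearest-neighbor shot selection for a small-molecule task. Using Tanimoto similarity over Morgan fingerprints is an assumption for illustration; the paper's exact similarity measure and prompt layout may differ.

```python
# Few-shot prompt construction with nearest-neighbor shot selection.
# Tanimoto similarity over Morgan fingerprints is assumed here for
# illustration; it is not confirmed as the paper's similarity measure.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def fingerprint(smiles: str):
    """Morgan (ECFP4-style) bit fingerprint for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)


def select_shots(query_smiles: str, train_set: list[tuple[str, str]], k: int = 3):
    """Return the k training (smiles, answer) pairs most similar to the query."""
    query_fp = fingerprint(query_smiles)
    scored = [(DataStructs.TanimotoSimilarity(query_fp, fingerprint(s)), s, a)
              for s, a in train_set]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(s, a) for _, s, a in scored[:k]]


def build_prompt(query_smiles: str, question: str,
                 shots: list[tuple[str, str]]) -> str:
    """Prepend the selected shots to the query in the same datapoint format."""
    parts = [f"Context: SMILES: {s}\nQuestion: {question}\nAnswer: {a}\n"
             for s, a in shots]
    parts.append(f"Context: SMILES: {query_smiles}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)
```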

Results

Tx-LLM demonstrated performance competitive with or superior to SOTA on 43 out of 66 tasks, surpassing SOTA on 22. The model particularly excels on datasets where molecular SMILES representations are combined with textual information, leveraging context learned during pretraining.

Figure 2: Comparison of Tx-LLM's performance against SOTA, illustrating its capabilities across multiple datasets.
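
As a rough illustration of how per-dataset comparison against SOTA works, the sketch below scores predictions with metrics commonly used by TDC benchmarks: AUROC for binary classification and Spearman correlation for regression. Individual datasets may specify other metrics (e.g., AUPRC or MAE), so treat the metric choices as assumptions.

```python
# Per-dataset scoring with common TDC benchmark metrics. The mapping from
# task type to metric is an assumption; some datasets use other metrics.
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score


def score_dataset(task_type: str, y_true, y_pred) -> float:
    """Score one dataset's predictions with a task-appropriate metric."""
    if task_type == "binary_classification":
        # y_pred holds predicted probabilities of the positive class.
        return roc_auc_score(y_true, y_pred)
    if task_type == "regression":
        return spearmanr(y_true, y_pred).correlation
    raise ValueError(f"Unsupported task type: {task_type}")
```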

Positive transfer between datasets of diverse drug types was evident, as training on broader dataset collections yielded improvements on molecule-specific tasks. The model also benefits from scale, with larger variants outperforming smaller ones, and domain-specific fine-tuning contributes positively to the results.

Figure 3: Demonstrates evidence of positive transfer across datasets with diverse drug types.

Discussion

The implications of Tx-LLM are significant: it offers a unified model that can address a multitude of tasks across the therapeutic development pipeline. The positive transfer observed suggests that LLMs can serve as generalist models proficient in handling diverse biochemical data. Challenges remain, however, particularly in how small-molecule structures are represented and in guarding against data contamination.

Tx-LLM's design and training methodology provide valuable insights into the potential for LLMs in drug development, particularly when equipped with carefully curated and diverse datasets. The model's ability to leverage text and sequence data effectively positions it as a promising tool for end-to-end applications in therapeutic development.

Conclusion

Tx-LLM represents a significant advancement in the use of AI for therapeutic development, illustrating the potential for LLMs to streamline various stages of the drug discovery pipeline. With further development and validation, Tx-LLM could play a crucial role in reducing the time and cost associated with bringing new therapeutics to market. The model's success underscores the importance of integrating diverse data types and leveraging LLMs' capabilities in understanding and predicting complex biochemical interactions.
