- The paper demonstrates that scaling protein language models reduces perplexity and enhances sequence modeling, yet does not uniformly improve fitness prediction.
- The study employs models from 151M to 6.4B parameters, revealing that larger models better capture sequence trends but are not always optimal for function prediction.
- The research highlights that diverse training data and model-data alignment are crucial for generating novel protein sequences with structural integrity.
ProGen2: An Examination of Protein Language Model Scalability
The paper, "ProGen2: Exploring the Boundaries of Protein LLMs," presents a comprehensive paper on the scaling of protein LLMs to improve the efficacy of protein sequence understanding, generation, and fitness prediction. This research delineates the process, performance, and implications of training large-scale models, offering insights into the potential and limitations of protein LLMs.
The authors introduce a series of models, collectively named ProGen2, ranging from 151 million to 6.4 billion parameters and trained on a diverse dataset of over a billion protein sequences drawn from genomic, metagenomic, and immune repertoire sources. The suite is designed to capture the distribution of observed evolutionary sequences, generate novel yet viable protein sequences, and predict protein fitness without additional finetuning.
Key Results
- Model Scale and Performance: Larger models achieve lower perplexity on held-out test sequences, indicating a better fit to the training data distribution (see the perplexity-evaluation sketch after this list). However, increased scale does not consistently translate into better fitness prediction, consistent with prior work pointing to a mismatch between the training distribution and true fitness landscapes.
- Generative Capabilities: ProGen2 models generate novel protein sequences with structural diversity. The paper highlights generations that deviate substantially from observed proteins while often maintaining structural integrity and, in some cases, exhibiting novel folds and functional sites (see the sampling sketch after this list).
- Zero-shot Fitness Prediction: In zero-shot fitness prediction tasks, models trained on universal protein datasets generally outperform those trained specifically on immune data at predicting protein functionality, including binding affinity and other general properties (see the scoring sketch after this list). Notably, smaller models sometimes outperform larger ones at fitness prediction, suggesting that higher sequence likelihood does not map cleanly onto the actual fitness landscape.
- Training Data Distribution: The paper underscores the importance of the training data distribution, emphasizing that phylogenetic bias and sequencing artifacts can skew model predictions away from true functional properties.
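
To make the perplexity comparison concrete, here is a minimal sketch of held-out perplexity evaluation for an autoregressive protein language model. It assumes a causal-LM checkpoint loadable through the Hugging Face transformers API; the checkpoint path and example sequence are placeholders, not official ProGen2 release artifacts.

```python
# Minimal sketch: held-out perplexity of a protein sequence under a causal LM.
# The checkpoint path and sequence are placeholders, not official ProGen2 artifacts.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "path/to/progen2-like-checkpoint"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSG"  # arbitrary example

with torch.no_grad():
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    # With labels=input_ids, the returned loss is the mean next-token cross-entropy.
    loss = model(ids, labels=ids).loss

perplexity = math.exp(loss.item())
print(f"per-token perplexity: {perplexity:.2f}")
```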
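
The generative experiments can be approximated with standard sampling utilities. The sketch below draws candidate sequences with temperature and nucleus (top-p) sampling; the checkpoint path, prompt, and sampling hyperparameters are illustrative assumptions rather than the settings reported in the paper.

```python
# Minimal sketch: sampling novel sequences with temperature and nucleus (top-p) sampling.
# Checkpoint path, prompt, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "path/to/progen2-like-checkpoint"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt_ids = tokenizer("M", return_tensors="pt").input_ids  # start from a methionine

with torch.no_grad():
    samples = model.generate(
        prompt_ids,
        do_sample=True,          # stochastic decoding rather than greedy
        temperature=0.8,         # soften the next-residue distribution
        top_p=0.9,               # nucleus sampling over the most probable residues
        max_new_tokens=200,      # cap the generated sequence length
        num_return_sequences=5,  # draw several candidates per prompt
    )

for s in samples:
    print(tokenizer.decode(s, skip_special_tokens=True))
```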
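
Zero-shot fitness prediction reduces to scoring each variant by its log-likelihood under the model and checking rank agreement with experimental measurements. The sketch below assumes the same hypothetical checkpoint; the variant sequences and assay values are made-up placeholders for illustration only.

```python
# Minimal sketch: zero-shot fitness prediction by ranking variants with model
# log-likelihoods and comparing against assay data via Spearman correlation.
# Checkpoint path, variant sequences, and fitness values are placeholders.
import torch
from scipy.stats import spearmanr
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "path/to/progen2-like-checkpoint"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def log_likelihood(seq: str) -> float:
    """Total log-likelihood of a sequence under the autoregressive model."""
    ids = tokenizer(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean per-token negative log-likelihood; rescale to a sum.
        mean_nll = model(ids, labels=ids).loss.item()
    return -mean_nll * (ids.shape[1] - 1)

variants = ["MKTAYIAKQR", "MKTAYIARQR", "MKTAYLAKQR"]  # hypothetical mutants
measured_fitness = [0.92, 0.35, 0.67]                  # hypothetical assay values

scores = [log_likelihood(v) for v in variants]
rho, _ = spearmanr(scores, measured_fitness)
print(f"Spearman rho (model score vs. measured fitness): {rho:.2f}")
```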
Implications and Future Directions
The implications of these findings are significant for protein engineering. As the research illustrates, larger protein language models can expand the explored sequence space and offer novel generations that may surpass natural evolution in producing functionally diverse proteins. However, the mixed fitness-prediction results highlight the need for more nuanced data handling and model training approaches that better capture evolutionary pressures and functional fitness landscapes.
The authors suggest that as sequencing data becomes vaster and more accessible, the quality and representativeness of these datasets will be paramount. Future research could focus on refining data selection and curation techniques to align more closely with functional annotation and experimental validation. Additionally, hybrid strategies that combine large-scale models with targeted finetuning or retrieval-augmented generation could bridge the gap between model output and functional performance.
Conclusion
The ProGen2 suite represents a significant advance in protein language modeling, pushing the boundaries of scale and application. While these models achieve impressive results in capturing sequence distributions and generating novel proteins, challenges remain in aligning these capabilities with practical protein design objectives. The paper encourages continued work on model-data alignment and highlights the collaborative potential between computational and experimental disciplines to realize the full promise of AI-driven protein insights.