ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

Published 26 Feb 2024 in cs.CE and q-bio.BM | (2402.16445v2)

Abstract: LLMs have achieved remarkable performance in multiple NLP tasks. Under the premise that protein sequences constitute the protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, as of now, unlike LLMs in NLP, PLMs cannot handle the protein understanding task and the protein generation task simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations in current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework to transform any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. As for the protein understanding task, ProLLaMA achieves a 62% exact match rate in superfamily prediction. Code, model weights, and datasets are available at https://github.com/PKU-YuanGroup/ProLLaMA and https://huggingface.co/GreatCaptainNemo.
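
The abstract names Protein Vocabulary Pruning (PVP) but does not describe the procedure. The sketch below illustrates only the general idea of pruning a vocabulary: keep the tokens a protein model could plausibly need and drop the embedding rows of the rest. The selection rule (special tokens plus tokens made only of the 20 standard amino-acid letters), the function name `prune_vocabulary`, and the toy vocabulary are all assumptions for illustration, not the authors' method.

```python
# Toy vocabulary pruning. Everything here is hypothetical and is NOT taken
# from the ProLLaMA code base; it only illustrates the general technique.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard one-letter codes


def prune_vocabulary(vocab, embeddings, special_tokens):
    """Keep special tokens and tokens made only of amino-acid letters.

    vocab:          list of token strings; the list index is the old token id
    embeddings:     list of rows; embeddings[i] is the vector for vocab[i]
    special_tokens: set of tokens that must always survive pruning

    Returns (new_vocab, new_embeddings, old_to_new_id).
    """
    keep = [
        i
        for i, tok in enumerate(vocab)
        if tok in special_tokens or (tok and set(tok) <= AMINO_ACIDS)
    ]
    new_vocab = [vocab[i] for i in keep]
    new_embeddings = [embeddings[i] for i in keep]
    # Map each surviving old id to its new, compacted id.
    old_to_new_id = {old: new for new, old in enumerate(keep)}
    return new_vocab, new_embeddings, old_to_new_id


# Assumed toy vocabulary mixing natural-language and protein-like tokens.
toy_vocab = ["<s>", "</s>", "the", "M", "KV", "ing", "AL", "protein"]
toy_embeddings = [[float(i)] * 4 for i in range(len(toy_vocab))]

pruned_vocab, pruned_embeddings, id_map = prune_vocabulary(
    toy_vocab, toy_embeddings, special_tokens={"<s>", "</s>"}
)
print(pruned_vocab)
print(f"embedding rows: {len(toy_embeddings)} -> {len(pruned_embeddings)}")
```

A real tokenizer would need more care than this toy rule: subword tokens often carry whitespace markers, any instruction-following ability depends on keeping natural-language tokens, and the output projection has to be sliced with the same indices as the input embedding.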

Summary

The paper introduces a dual-stage training framework using LoRA to transform general LLMs into a state-of-the-art protein language model with multi-task capabilities.
It reports state-of-the-art unconditional protein sequence generation, with pLDDT and TM-score values above those of existing models, and a 62% exact match rate in superfamily prediction.
Leveraging natural language instructions, ProLLaMA bridges NLP and protein science, paving the way for advances in drug discovery and synthetic biology.

Overview of ProLLaMA: A Multi-Task Protein LLM

The paper introduces ProLLaMA, a protein LLM (ProLLM) designed for multi-task protein language processing. Unlike earlier ProLLMs, which focus primarily on a single task, typically de novo protein sequence generation, ProLLaMA covers a broader range of tasks through a training framework that extends a general LLM's capabilities to protein sequences. The approach transfers advances from NLP LLMs to the protein language domain and addresses two limitations of earlier models: the lack of multi-task capability and an insufficient understanding of natural language instructions.

Key Contributions and Results

The paper outlines the architectural and methodological advancements in the ProLLaMA model, which are crucial for its multi-task proficiency:

Training Framework: The authors propose a two-stage training framework to transform general LLMs into ProLLMs: continual learning on protein language data, followed by instruction tuning. The training strategy employs Low-Rank Adaptation (LoRA), which trains small low-rank adapter matrices instead of updating all model weights and so reduces the computational cost of training.
Multi-Task Capability: ProLLaMA handles multiple protein-related tasks, including unconditional and controllable protein sequence generation and protein property prediction. The abstract reports state-of-the-art results for unconditional generation; in controllable generation, the model designs novel proteins with desired functionalities based on user instructions.
Numerical Performance: For protein sequence generation, ProLLaMA scores higher than existing ProLLMs on pLDDT and TM-score, metrics that indicate structural plausibility and similarity to known protein structures. For property prediction, the abstract reports a 62% exact match rate in superfamily prediction.
Natural Language Integration: By retaining and utilizing its natural language processing capabilities, ProLLaMA effectively handles instruction-driven tasks, which are not feasible with current ProLLMs. This model provides an important bridge between NLP and protein language processing domains, leveraging natural language instructions to extend its applicability.
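
The LoRA idea mentioned under Training Framework can be sketched in a few lines. This is a minimal illustration of the general technique, not the authors' training code: a frozen weight matrix W is augmented with a trainable low-rank product A·B scaled by alpha / r. The helper names, the dimensions, and the values of `r` and `alpha` are assumptions chosen for illustration; the summary does not state the settings used for ProLLaMA.

```python
# Minimal pure-Python sketch of a LoRA-style forward pass.
# All names and sizes are hypothetical, chosen only for illustration.
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ A) @ B.

    x: (n, d_in) inputs
    w: (d_in, d_out) frozen pretrained weight
    a: (d_in, r) trainable down-projection
    b: (r, d_out) trainable up-projection
    """
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)  # never forms the full d_in x d_out update
    scale = alpha / r
    return [
        [bv + scale * uv for bv, uv in zip(base_row, upd_row)]
        for base_row, upd_row in zip(base, update)
    ]


def trainable_fraction(d_in, d_out, r):
    """Adapter parameters divided by the parameters of the full weight matrix."""
    return r * (d_in + d_out) / (d_in * d_out)


random.seed(0)
d_in, d_out, r = 8, 6, 2
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
a = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]
b = [[0.0] * d_out for _ in range(r)]  # zero init: the adapter starts as a no-op
x = [[random.gauss(0, 1) for _ in range(d_in)]]

y = lora_forward(x, w, a, b, alpha=16, r=r)
print("matches frozen model at init:", y == matmul(x, w))
print(f"trainable fraction for a 4096x4096 layer at r=8: {trainable_fraction(4096, 4096, 8):.4%}")
```

Because one of the two adapter matrices starts at zero, the adapted model initially reproduces the pretrained model exactly, and only the two small matrices receive gradient updates. In practice this is usually done with an existing adapter library rather than by hand.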

Implications and Future Directions

ProLLaMA has implications for computational biology and biotechnology, particularly for drug discovery and synthetic biology. Because it accepts natural language instructions, researchers can specify protein engineering goals more directly. The training framework is also designed to scale to additional tasks, which suggests a clear direction for further research and could ease the broader adoption of AI models in protein science.

Moreover, this research underscores the importance of interdisciplinary strategies in advancing domain-specific LLMs. The methodology sets a precedent for AI development in scientific domains, where a single model is expected to handle diverse tasks. Future work might further refine ProLLaMA's natural language understanding, enabling more complex protein engineering tasks and the integration of additional functional instructions.

In conclusion, ProLLaMA represents a significant step forward for protein LLMs, emphasizing the power of multi-tasking capabilities and efficient resource usage. This research suggests extensive potential for ProLLaMA in practical applications and highlights its contribution to bridging NLP techniques with scientific inquiries in proteomics.
