Emergent Mind

Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

(arXiv:2405.20313)
Published May 30, 2024 in cs.LG and q-bio.BM

Abstract

Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

Figure: Schematic of generating protein backbones using a generative flow model and an inverse folding model.

Overview

  • The paper introduces FoldFlow-2, a novel model for generating conditional protein backbones, combining advanced architectural components and a robust flow-matching framework.

  • FoldFlow-2 conditions on protein sequences via a pre-trained protein language model and operates within an SE(3)-equivariant flow-matching framework to generate proteins that fold correctly and exhibit desired structural and functional properties.

  • The model achieves strong performance in unconditional protein generation, improving over RFDiffusion across designability, diversity, and novelty, and makes progress on challenging conditional tasks, with significant implications for computational drug discovery and generative modeling.

Sequence-Augmented $SE(3)$-Flow Matching For Conditional Protein Backbone Generation

"Sequence-Augmented $SE(3)$-Flow Matching For Conditional Protein Backbone Generation" introduces FoldFlow-2, a method for generating conditional protein backbones that combines sequence-aware architectural components with a flow-matching framework on $SE(3)$. The paper addresses the complex problem of rational protein design, an essential aspect of contemporary computational drug discovery.

The principal contributions of FoldFlow-2 are its ability to leverage protein sequence conditioning via a pre-trained language model and its integration within an $SE(3)$-equivariant flow-matching framework. This capability is crucial for generating proteins that fold correctly and exhibit desired structural and functional properties. Below is a detailed exploration of the architecture, dataset, empirical results, and implications of this model.

Technical Framework and Methodology

Model Architecture

FoldFlow-2's architecture consists of three core components:

  1. Structure and Sequence Encoder:

    • The encoder employs the invariant point attention (IPA) transformer to process structural inputs, taking advantage of $SE(3)$-equivariance.
    • Sequence inputs are encoded using a large pre-trained protein language model (ESM2-650M), letting the model benefit from the biological inductive biases learned from a vast corpus of protein sequences.
  2. Multi-Modal Fusion Trunk:

    • This trunk combines the encoded structure and sequence representations into a joint latent space. Utilizing LayerNorm ensures stable interactions between different modalities.
  3. Geometric Decoder:

    • The decoder, based on an IPA transformer, projects the fused representations back into an $SE(3)$-equivariant space, generating the structures required for further analysis and applications.
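The three-stage flow can be sketched in a few lines. Everything below is illustrative: the module names, dimensions, and random-projection "layers" are stand-ins for the paper's IPA transformers and frozen ESM2 encoder, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_STRUCT, D_SEQ, D_JOINT = 64, 128, 96  # illustrative widths

def encode_structure(frames):
    # Stand-in for the IPA structure encoder: one embedding per residue frame.
    n = frames.shape[0]
    return frames.reshape(n, -1) @ rng.standard_normal((12, D_STRUCT))

def encode_sequence(tokens):
    # Stand-in for the frozen ESM2-650M language model: token embeddings.
    table = rng.standard_normal((21, D_SEQ))  # 20 amino acids + a mask token
    return table[tokens]

def fusion_trunk(h_struct, h_seq):
    # Combine both modalities into a joint latent space; the LayerNorm-style
    # normalization keeps the two streams on a comparable scale.
    h = np.concatenate([h_struct, h_seq], axis=-1)
    h = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-5)
    return h @ rng.standard_normal((D_STRUCT + D_SEQ, D_JOINT))

def geometric_decoder(h_joint):
    # Stand-in for the IPA decoder: predict a per-residue rotation update
    # (3x3, left unconstrained here) and a translation update.
    out = h_joint @ rng.standard_normal((D_JOINT, 12))
    return out[:, :9].reshape(-1, 3, 3), out[:, 9:]

n_res = 8
frames = rng.standard_normal((n_res, 4, 3))   # toy residue frames
tokens = rng.integers(0, 21, size=n_res)      # toy sequence tokens
rots, trans = geometric_decoder(
    fusion_trunk(encode_structure(frames), encode_sequence(tokens)))
print(rots.shape, trans.shape)  # (8, 3, 3) (8, 3)
```

The key design point this sketch preserves is that structure and sequence are encoded separately, fused once in a joint trunk, and only then decoded back into geometric updates.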

Loss Function and Flow Matching

The paper employs a flow-matching loss defined over the $SE(3)$ group, so that the generated backbones respect the rotational and translational symmetries of protein structures. The loss optimizes both the rotational and translational components of the protein frames, pushing generated samples toward the true data distribution.
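As a toy illustration, here is the conditional flow-matching objective restricted to the translational ($\mathbb{R}^3$) part. In the paper, rotations live on $SO(3)$ and use a geodesic interpolant with a Riemannian metric; this Euclidean sketch omits that entirely.

```python
import numpy as np

def cfm_loss(x0, x1, t, predict_velocity):
    """Conditional flow-matching loss on translations (Euclidean sketch).

    The interpolant x_t = (1 - t) * x0 + t * x1 has constant target
    velocity x1 - x0, and the model is regressed onto that target.
    """
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    pred = predict_velocity(xt, t)
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(1)
x0 = rng.standard_normal((16, 3))   # noise translations
x1 = rng.standard_normal((16, 3))   # data translations

# An oracle that already knows the target velocity achieves zero loss:
loss = cfm_loss(x0, x1, 0.3, lambda xt, t: x1 - x0)
print(loss)  # 0.0
```

The same regression-onto-a-conditional-velocity structure carries over to the rotational component, just with the straight-line interpolant replaced by the $SO(3)$ geodesic.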

Dataset Construction and Empirical Setup

  • Dataset Augmentation: The authors curated a dataset an order of magnitude larger than the standard PDB sets of prior work, adding filtered, high-quality synthetic structures (predicted models for SwissProt sequences). This augmentation proved essential for diversifying the training data.
  • Training Dynamics: Employing an effective mix of true and synthetic structures and sophisticated masking strategies, the training phase ensures the model can generalize to unseen sequences effectively.
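The synthetic-structure filtering can be thought of as a simple quality gate. The predicate and cutoff below are hypothetical (mean predicted confidence, e.g. pLDDT, above 80); the paper's exact filtering criteria may differ.

```python
# Hypothetical quality filter for synthetic structures: keep entries whose
# mean predicted confidence (e.g. pLDDT) clears an illustrative threshold.
def filter_synthetic(entries, min_confidence=80.0):
    return [e for e in entries if e["plddt"] >= min_confidence]

pool = [
    {"id": "A", "plddt": 91.2},
    {"id": "B", "plddt": 63.5},   # low-confidence model, dropped
    {"id": "C", "plddt": 85.0},
]
kept = filter_synthetic(pool)
print([e["id"] for e in kept])  # ['A', 'C']
```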

Experimental Results

Unconditional Generation

  • Designability: FoldFlow-2 attains a high designability fraction, improving over RFDiffusion and other baselines across protein lengths.
  • Novelty and Diversity: The model generates a significantly higher fraction of novel and diverse structures, as measured by TM-score analyses and cluster evaluations. In particular, FoldFlow-2's ability to produce a variety of secondary structures, including $\beta$-sheets and coils, highlights its practical utility.
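These two metrics are conventionally operationalized as follows: a backbone is designable if a sequence designed for it (e.g. via ProteinMPNN) refolds (e.g. via ESMFold) to within 2 Å self-consistency RMSD, and novel if its best TM-score hit against the training set is below 0.5. The sketch below assumes those standard cutoffs; the RMSD and TM-score values are made up.

```python
import numpy as np

def is_designable(sc_rmsd, cutoff=2.0):
    # Self-consistency test: design a sequence for the generated backbone,
    # refold it, and check the RMSD between backbone and refold.
    return sc_rmsd < cutoff

def novelty_fraction(max_tm_to_train, cutoff=0.5):
    # A sample is novel if its closest training-set structure scores < 0.5 TM.
    max_tm_to_train = np.asarray(max_tm_to_train)
    return float(np.mean(max_tm_to_train < cutoff))

sc_rmsds = [0.8, 1.5, 3.2]                      # hypothetical refold RMSDs (Å)
print([is_designable(r) for r in sc_rmsds])     # [True, True, False]
print(novelty_fraction([0.3, 0.45, 0.7, 0.9]))  # 0.5
```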

Conditional Tasks

  • Motif Scaffolding: The model handles complex scaffolding tasks, performing strongly on established benchmarks and making progress on new, more biologically relevant challenges such as designing scaffolds for the VHH nanobody.
  • Protein Folding: While designed as a generative model, FoldFlow-2 also performs well on sequence-to-structure prediction, approaching specialized folding models.
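One simplified way to picture motif scaffolding is as constrained sampling: motif positions are pinned to their target coordinates at every integration step while the flow updates only the scaffold. This inpainting-style scheme is a conceptual sketch, not the paper's conditioning mechanism, which trains the model itself on the conditional task.

```python
import numpy as np

def scaffold_step(coords, velocity, motif_mask, motif_coords, dt=0.1):
    # One Euler step of the flow, then re-impose the fixed motif coordinates.
    coords = coords + dt * velocity
    coords[motif_mask] = motif_coords
    return coords

rng = np.random.default_rng(2)
n = 10
coords = rng.standard_normal((n, 3))       # toy CA coordinates
motif_mask = np.zeros(n, dtype=bool)
motif_mask[3:6] = True                     # residues 3-5 form the motif
motif_coords = np.ones((3, 3))             # toy target motif geometry
out = scaffold_step(coords, rng.standard_normal((n, 3)),
                    motif_mask, motif_coords)
print(np.allclose(out[3:6], 1.0))  # True: motif stays fixed
```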

Implications and Future Directions

Practical Applications

FoldFlow-2's success in generating highly designable and novel proteins has significant ramifications for computational drug discovery. Specifically, its ability to condition generation on sequences makes it applicable to designing proteins with specific functional properties—crucial for tackling complex diseases like COVID-19 and cancer.

Theoretical Contributions

On a theoretical level, the integration of flow matching within an $SE(3)$-equivariant framework and the use of a language model-conditioned architecture represent substantial advancements in the generative modeling landscape. These innovations could spur further research into multi-modal fusion techniques and more efficient, scalable model architectures for protein generation.

Conclusion

"Sequence-Augmented $SE(3)$-Flow Matching For Conditional Protein Backbone Generation" marks a significant step forward in protein design via generative models. FoldFlow-2 not only sets new benchmarks across multiple metrics but also broadens the horizon of what can be achieved through conditional generative modeling in the biological and biochemical domains. Future work could investigate further scalability, applicability to other biological systems, and enhancements via reinforcement training methodologies, potentially leading to even more diversified and functional protein designs.
