Emergent Mind

Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

(arXiv:2405.20313)
Published May 30, 2024 in cs.LG and q-bio.BM

Abstract

Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

Figure: Schematic of generating protein backbones using a generative flow model and an inverse folding model.

Overview

  • The paper introduces FoldFlow-2, a novel model for generating conditional protein backbones, combining advanced architectural components and a robust flow-matching framework.

  • FoldFlow-2 conditions on protein sequences via a pre-trained protein language model and operates within an SE(3)-equivariant flow-matching framework to generate proteins that fold correctly and exhibit desired structural and functional properties.

  • The model achieves strong performance in unconditional protein generation, improving over RFDiffusion across designability, diversity, and novelty, and makes progress on challenging conditional tasks, with significant implications for computational drug discovery and generative modeling.

Sequence-Augmented $SE(3)$-Flow Matching For Conditional Protein Backbone Generation

"Sequence-Augmented $SE(3)$-Flow Matching For Conditional Protein Backbone Generation" introduces FoldFlow-2, a method for generating conditional protein backbones that combines sequence-aware architectural components with a flow-matching framework on $SE(3)$. The paper addresses the complex problem of rational protein design, an essential aspect of contemporary computational drug discovery.

The principal contributions of FoldFlow-2 are its ability to leverage protein sequence conditioning via a pre-trained language model and its integration within an $SE(3)$-equivariant flow-matching framework. This capability is crucial for generating proteins that fold correctly and exhibit desired structural and functional properties. Below is a detailed exploration of the architecture, dataset, empirical results, and implications of this model.

Technical Framework and Methodology

Model Architecture

FoldFlow-2's architecture consists of three core components:

  1. Structure and Sequence Encoder:

    • The encoder employs the invariant point attention (IPA) transformer to process structural inputs, taking advantage of $SE(3)$-equivariance.
    • Sequence inputs are encoded using a large pre-trained protein language model (ESM2-650M), letting the model benefit from the biological inductive biases learned from a vast corpus of protein sequences.
  2. Multi-Modal Fusion Trunk:

    • This trunk combines the encoded structure and sequence representations into a joint latent space. Utilizing LayerNorm ensures stable interactions between different modalities.
  3. Geometric Decoder:

    • The decoder, based on an IPA transformer, projects the fused representations back into an $SE(3)$-equivariant space, generating the structures required for further analysis and applications.
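The three-stage flow can be sketched in a few lines. Everything below is illustrative: the module names, dimensions, and random-projection "layers" are stand-ins for the paper's IPA transformers and frozen ESM2 encoder, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_STRUCT, D_SEQ, D_JOINT = 64, 128, 96  # illustrative widths

def encode_structure(frames):
    # Stand-in for the IPA structure encoder: one embedding per residue frame.
    n = frames.shape[0]
    return frames.reshape(n, -1) @ rng.standard_normal((12, D_STRUCT))

def encode_sequence(tokens):
    # Stand-in for the frozen ESM2-650M language model: token embeddings.
    table = rng.standard_normal((21, D_SEQ))  # 20 amino acids + a mask token
    return table[tokens]

def fusion_trunk(h_struct, h_seq):
    # Combine both modalities into a joint latent space; the LayerNorm-style
    # normalization keeps the two streams on a comparable scale.
    h = np.concatenate([h_struct, h_seq], axis=-1)
    h = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-5)
    return h @ rng.standard_normal((D_STRUCT + D_SEQ, D_JOINT))

def geometric_decoder(h_joint):
    # Stand-in for the IPA decoder: predict a per-residue rotation update
    # (3x3, left unconstrained here) and a translation update.
    out = h_joint @ rng.standard_normal((D_JOINT, 12))
    return out[:, :9].reshape(-1, 3, 3), out[:, 9:]

n_res = 8
frames = rng.standard_normal((n_res, 4, 3))   # toy residue frames
tokens = rng.integers(0, 21, size=n_res)      # toy sequence tokens
rots, trans = geometric_decoder(
    fusion_trunk(encode_structure(frames), encode_sequence(tokens)))
print(rots.shape, trans.shape)  # (8, 3, 3) (8, 3)
```

The key design point this sketch preserves is that structure and sequence are encoded separately, fused once in a joint trunk, and only then decoded back into geometric updates.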

Loss Function and Flow Matching

The paper employs a flow-matching loss defined over the $SE(3)$ group, so that the generated backbones respect the rotational and translational symmetries of protein structures. The loss optimizes both the rotational and translational components of the protein frames, pushing generated samples toward the true data distribution.
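As a toy illustration, here is the conditional flow-matching objective restricted to the translational ($\mathbb{R}^3$) part. In the paper, rotations live on $SO(3)$ and use a geodesic interpolant with a Riemannian metric; this Euclidean sketch omits that entirely.

```python
import numpy as np

def cfm_loss(x0, x1, t, predict_velocity):
    """Conditional flow-matching loss on translations (Euclidean sketch).

    The interpolant x_t = (1 - t) * x0 + t * x1 has constant target
    velocity x1 - x0, and the model is regressed onto that target.
    """
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    pred = predict_velocity(xt, t)
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(1)
x0 = rng.standard_normal((16, 3))   # noise translations
x1 = rng.standard_normal((16, 3))   # data translations

# An oracle that already knows the target velocity achieves zero loss:
loss = cfm_loss(x0, x1, 0.3, lambda xt, t: x1 - x0)
print(loss)  # 0.0
```

The same regression-onto-a-conditional-velocity structure carries over to the rotational component, just with the straight-line interpolant replaced by the $SO(3)$ geodesic.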

Dataset Construction and Empirical Setup

  • Dataset Augmentation: The authors curated a dataset an order of magnitude larger than the standard PDB sets of prior work, adding filtered, high-quality synthetic structures (predicted models for SwissProt sequences). This augmentation proved essential for diversifying the training data.
  • Training Dynamics: Employing an effective mix of true and synthetic structures and sophisticated masking strategies, the training phase ensures the model can generalize to unseen sequences effectively.
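The synthetic-structure filtering can be thought of as a simple quality gate. The predicate and cutoff below are hypothetical (mean predicted confidence, e.g. pLDDT, above 80); the paper's exact filtering criteria may differ.

```python
# Hypothetical quality filter for synthetic structures: keep entries whose
# mean predicted confidence (e.g. pLDDT) clears an illustrative threshold.
def filter_synthetic(entries, min_confidence=80.0):
    return [e for e in entries if e["plddt"] >= min_confidence]

pool = [
    {"id": "A", "plddt": 91.2},
    {"id": "B", "plddt": 63.5},   # low-confidence model, dropped
    {"id": "C", "plddt": 85.0},
]
kept = filter_synthetic(pool)
print([e["id"] for e in kept])  # ['A', 'C']
```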

Experimental Results

Unconditional Generation

  • Designability: FoldFlow-2 attains a high designability fraction, improving over RFDiffusion and other baselines across protein lengths.
  • Novelty and Diversity: The model generates a significantly higher fraction of novel and diverse structures, as measured by TM-score analyses and cluster evaluations. In particular, FoldFlow-2's ability to produce a variety of secondary structures, including $\beta$-sheets and coils, highlights its practical utility.
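These two metrics are conventionally operationalized as follows: a backbone is designable if a sequence designed for it (e.g. via ProteinMPNN) refolds (e.g. via ESMFold) to within 2 Å self-consistency RMSD, and novel if its best TM-score hit against the training set is below 0.5. The sketch below assumes those standard cutoffs; the RMSD and TM-score values are made up.

```python
import numpy as np

def is_designable(sc_rmsd, cutoff=2.0):
    # Self-consistency test: design a sequence for the generated backbone,
    # refold it, and check the RMSD between backbone and refold.
    return sc_rmsd < cutoff

def novelty_fraction(max_tm_to_train, cutoff=0.5):
    # A sample is novel if its closest training-set structure scores < 0.5 TM.
    max_tm_to_train = np.asarray(max_tm_to_train)
    return float(np.mean(max_tm_to_train < cutoff))

sc_rmsds = [0.8, 1.5, 3.2]                      # hypothetical refold RMSDs (Å)
print([is_designable(r) for r in sc_rmsds])     # [True, True, False]
print(novelty_fraction([0.3, 0.45, 0.7, 0.9]))  # 0.5
```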

Conditional Tasks

  • Motif Scaffolding: The model handles complex scaffolding tasks, performing strongly on established benchmarks and making progress on new, more biologically relevant challenges such as designing scaffolds for the VHH nanobody.
  • Protein Folding: While designed as a generative model, FoldFlow-2 also performs well on sequence-to-structure prediction, approaching specialized folding models.
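One simplified way to picture motif scaffolding is as constrained sampling: motif positions are pinned to their target coordinates at every integration step while the flow updates only the scaffold. This inpainting-style scheme is a conceptual sketch, not the paper's conditioning mechanism, which trains the model itself on the conditional task.

```python
import numpy as np

def scaffold_step(coords, velocity, motif_mask, motif_coords, dt=0.1):
    # One Euler step of the flow, then re-impose the fixed motif coordinates.
    coords = coords + dt * velocity
    coords[motif_mask] = motif_coords
    return coords

rng = np.random.default_rng(2)
n = 10
coords = rng.standard_normal((n, 3))       # toy CA coordinates
motif_mask = np.zeros(n, dtype=bool)
motif_mask[3:6] = True                     # residues 3-5 form the motif
motif_coords = np.ones((3, 3))             # toy target motif geometry
out = scaffold_step(coords, rng.standard_normal((n, 3)),
                    motif_mask, motif_coords)
print(np.allclose(out[3:6], 1.0))  # True: motif stays fixed
```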

Implications and Future Directions

Practical Applications

FoldFlow-2's success in generating highly designable and novel proteins has significant ramifications for computational drug discovery. Specifically, its ability to condition generation on sequences makes it applicable to designing proteins with specific functional properties—crucial for tackling complex diseases like COVID-19 and cancer.

Theoretical Contributions

On a theoretical level, the integration of flow matching within an $SE(3)$-equivariant framework and the use of a language model-conditioned architecture represent substantial advancements in the generative modeling landscape. These innovations could spur further research into multi-modal fusion techniques and more efficient, scalable model architectures for protein generation.

Conclusion

"Sequence-Augmented $SE(3)$-Flow Matching For Conditional Protein Backbone Generation" marks a significant step forward in protein design via generative models. FoldFlow-2 not only sets new benchmarks across multiple metrics but also broadens the horizon of what can be achieved through conditional generative modeling in the biological and biochemical domains. Future work could investigate further scalability, applicability to other biological systems, and enhancements via reinforcement training methodologies, potentially leading to even more diversified and functional protein designs.
