Diffusion Language Models Are Versatile Protein Learners (2402.18567v2)

Published 28 Feb 2024 in cs.LG and q-bio.BM

Abstract: This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess in conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance. Code is released at https://github.com/bytedance/dplm.

Summary

The paper introduces the Diffusion Protein Language Model (DPLM), which unifies generative and predictive tasks by leveraging discrete diffusion tailored for protein sequences.
The model employs an iterative mask-predict denoising process and two-stage training, leading to improved foldability, structural novelty, and sequence diversity.
DPLM outperforms existing protein language models on downstream tasks and offers versatile conditional generation for advanced protein design applications.

Diffusion Language Models for Protein Sequence Modeling: DPLM

Overview

The paper presents the Diffusion Protein Language Model (DPLM), a scalable protein language model that uses discrete diffusion probabilistic modeling for both generative and predictive tasks on protein sequences. DPLM is pre-trained on evolutionary-scale protein data and demonstrates strong performance in unconditional sequence generation, representation learning for downstream tasks, and versatile conditional generation scenarios. The approach generalizes language modeling for proteins by combining the expressiveness of transformer-based LMs with the iterative refinement and global receptive field of diffusion models, specifically tailored for discrete sequence data.

Discrete Diffusion Framework for Protein Sequences

DPLM is built upon a discrete diffusion probabilistic model, where the forward process incrementally corrupts protein sequences by masking tokens according to a noise schedule, and the reverse process iteratively denoises to reconstruct the original sequence. The model operates directly on the categorical space of amino acids, avoiding the limitations of continuous relaxations for discrete data. The training objective is a reweighted cross-entropy over masked positions, which recovers both masked language modeling (MLM) and autoregressive language modeling (AR-LM) as special cases.
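The forward masking step and the reweighted cross-entropy can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The linear schedule (each token masked independently with probability t), the 1/t loss weight, the `MASK_ID` value, and the uniform-logit stand-in for the denoiser are all assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_ID = len(AMINO_ACIDS)  # hypothetical id for the absorbing [MASK] token


def encode(seq):
    """Map an amino-acid string to integer token ids."""
    return np.array([AMINO_ACIDS.index(a) for a in seq])


def forward_mask(x0, t, rng):
    """Corrupt x0 by masking each token independently with probability t.

    Assumes a linear noise schedule with t in (0, 1].
    """
    is_masked = rng.random(x0.shape) < t
    xt = np.where(is_masked, MASK_ID, x0)
    return xt, is_masked


def reweighted_ce(logits, x0, is_masked, t):
    """Cross-entropy summed over masked positions, scaled by an assumed 1/t weight."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(x0)), x0]
    return float((token_nll * is_masked).sum() / t)


rng = np.random.default_rng(0)
x0 = encode("MKTAYIAKQR")
t = 0.5
xt, is_masked = forward_mask(x0, t, rng)
# Stand-in for the denoiser network: uniform logits over the 20 amino acids.
logits = np.zeros((len(x0), len(AMINO_ACIDS)))
loss = reweighted_ce(logits, x0, is_masked, t)
```

With uniform logits, every masked position contributes log(20) to the sum, so the loss depends only on how many tokens were masked and on t. In real training, t would be sampled per sequence, which is what gives the variable masking ratio.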

Implementation Details

Architecture: DPLM adopts transformer architectures with model sizes up to 3B parameters, mirroring ESM2 configurations for direct comparison.
Pre-training: Models are trained on UniRef50 (∼45M sequences, ∼14B tokens), with sequence truncation to 1024 tokens for long proteins. Training employs a two-stage strategy: initial MLM pre-training followed by diffusion objective adaptation, which improves convergence and generative quality.
Sampling: Generation proceeds via iterative mask-predict denoising, starting from a fully masked sequence. At each step, the top-k positions (ranked by log-probability) are unmasked and updated, with the Gumbel-Max trick applied to enhance diversity and avoid mode collapse.
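The sampling loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released DPLM sampler. `toy_denoiser` is a hypothetical stand-in that returns random logits. The linear unmasking schedule, and the choice to apply Gumbel noise when picking tokens while ranking positions by the noise-free log-probability, are also assumptions.

```python
import numpy as np

VOCAB_SIZE = 20
MASK_ID = VOCAB_SIZE  # hypothetical id for the [MASK] token


def toy_denoiser(xt, rng):
    """Hypothetical stand-in for the trained network: random logits per position."""
    return rng.normal(size=(len(xt), VOCAB_SIZE))


def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def mask_predict_sample(length, num_steps, rng, temperature=1.0):
    """Iteratively unmask a fully masked sequence, keeping the top-k positions per step."""
    xt = np.full(length, MASK_ID)
    for step in range(1, num_steps + 1):
        log_probs = log_softmax(toy_denoiser(xt, rng) / temperature)
        # Gumbel-Max trick: argmax(log_probs + Gumbel noise) samples from softmax.
        u = rng.uniform(1e-9, 1.0, size=log_probs.shape)
        gumbel = -np.log(-np.log(u))
        pred = (log_probs + gumbel).argmax(axis=-1)
        # Confidence of each position = log-probability of its sampled token.
        score = log_probs[np.arange(length), pred]
        # Assumed linear schedule: k grows to `length` at the final step.
        k = (length * step + num_steps - 1) // num_steps
        keep = np.argsort(-score)[:k]
        # Re-mask everything except the k most confident predictions.
        xt = np.full(length, MASK_ID)
        xt[keep] = pred[keep]
    return xt


sample = mask_predict_sample(length=30, num_steps=10, rng=np.random.default_rng(0))
```

Because every position is re-predicted at each step, tokens unmasked earlier can still be revised later. This is the iterative refinement that a single left-to-right pass cannot do.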

Generative Capabilities

Unconditional Generation

DPLM generates protein sequences with high foldability, as measured by ESMFold pLDDT scores (>80 across lengths), and produces structurally novel and diverse samples. The model outperforms both MLM and AR-LM baselines in foldability, novelty (lower pdb-TM scores for long sequences), and diversity (lower inner-TM scores). Scaling the model size further improves performance, especially for long proteins.

Conditional Generation

DPLM supports multiple conditioning modalities:

Partial Sequence Conditioning: Enables motif scaffolding and infilling tasks by fixing specified residues and generating the remainder, outperforming EvoDiff in success rate and number of solved problems.
Cross-modal Conditioning: Incorporates structure information via adapter-tuning with expert encoders (e.g., GVP-Transformer), enabling inverse folding and structure-aware sequence design. Exposure bias is mitigated by training on draft sequences generated by the structure encoder.
Plug-and-play Classifier Guidance: Integrates discriminative models (e.g., secondary structure predictors) for controllable generation. Guidance is implemented via first-order Taylor expansion on the probability simplex, allowing flexible steering of generation towards desired properties without retraining.
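To make the first-order guidance idea concrete, here is a small NumPy sketch under stated assumptions. If log p(y | x') is approximated by log p(y | x) + <g, x' - x>, with g the gradient of log p(y | x) with respect to the per-position token distribution x, then setting position i to token v changes the classifier score by roughly g[i, v]. Adding g to the denoiser logits therefore approximates sampling from p(x | y). The logistic `toy_log_classifier_grad` and the `scale` parameter are hypothetical. A real property predictor would supply its own gradient through automatic differentiation.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def toy_log_classifier_grad(probs, W):
    """Gradient of log p(y | x) for a hypothetical classifier.

    The classifier is log p(y | x) = log sigmoid(sum(W * x)), so the gradient
    with respect to x is (1 - sigmoid(s)) * W, where s = sum(W * x).
    """
    s = (W * probs).sum()
    return (1.0 - 1.0 / (1.0 + np.exp(-s))) * W


def guided_probs(logits, W, scale=1.0):
    """Shift denoiser logits by the classifier gradient (first-order guidance)."""
    probs = softmax(logits)
    grad = toy_log_classifier_grad(probs, W)
    return softmax(logits + scale * grad)


seq_len, vocab = 5, 4
logits = np.zeros((seq_len, vocab))  # unguided denoiser: uniform over tokens
W = np.zeros((seq_len, vocab))
W[:, 0] = 1.0                        # toy property: "contains token 0"
guided = guided_probs(logits, W, scale=5.0)
```

In this toy setup the guided distribution puts more mass on token 0 at every position than the unguided one does. The denoiser itself is untouched, which is what makes the guidance plug-and-play.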

Representation Learning and Downstream Tasks

DPLM provides superior sequence embeddings for a range of predictive tasks, including thermostability, metal ion binding, protein-protein interaction, EC/GO annotation, and localization. Fine-tuned DPLM models consistently outperform ESM2 and approach the performance of structure-aware models (e.g., SaProt), despite being trained solely on sequence data. The diffusion pre-training, with variable masking ratios, forces the model to capture deeper contextual dependencies, enhancing representation quality.

Comparative Analysis

DPLM advances over prior protein LMs (ESM2, EvoDiff) by unifying generative and predictive capabilities in a single framework. Unlike EvoDiff, which relies on order-agnostic autoregressive diffusion and MSA-based parameterization, DPLM employs a principled discrete diffusion approach, supports efficient conditioning, and achieves strong representation learning. The model also avoids the computational overhead of Monte Carlo or Gibbs sampling required for generation with MLMs.

Performance Metrics and Scaling

Foldability: pLDDT > 80 for generated sequences across lengths.
Novelty: Lower pdb-TM scores for long sequences, indicating structural novelty.
Diversity: Lower inner-TM scores, reflecting diverse structural outputs.
Downstream Tasks: DPLM (650M) achieves higher accuracy and Fmax scores than ESM2 across multiple benchmarks and is competitive with structure-aware baselines.
Conditional Tasks: Higher success rates in motif scaffolding and competitive performance in inverse folding (AAR, scTM, pLDDT).
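Of the inverse folding metrics listed above, amino acid recovery (AAR) is the simplest to state: the fraction of positions where the designed sequence matches the native one. A minimal sketch follows. It assumes the two sequences are already aligned and of equal length, and the function name is illustrative.

```python
def amino_acid_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native residue."""
    if len(designed) != len(native) or len(native) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


aar = amino_acid_recovery("MKTA", "MKSA")  # 3 of 4 positions match
```

scTM and pLDDT, by contrast, require running a structure predictor on the designed sequence, so they are not reproduced here.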

Resource Requirements and Deployment

Training: Large-scale pre-training requires substantial compute (batch sizes up to 1M tokens, 100K updates), but two-stage training mitigates convergence issues.
Inference: Iterative denoising is parallelizable and supports flexible conditioning. Adapter-tuning for cross-modal tasks is parameter-efficient, requiring only the adapter to be trained.
Deployment: DPLM can be integrated into protein design pipelines for de novo generation, motif scaffolding, and structure-aware sequence design. Plug-and-play guidance enables rapid adaptation to new property constraints.

Limitations and Future Directions

Conditional Generation: Extension to broader modalities (MSA, ligand, antigen) and more complex property guidance (symmetry, binding affinity) is warranted.
Long Contexts: Incorporation of long-context modeling techniques could enable handling of very long proteins, DNA, or RNA sequences.
Structure Integration: Joint modeling of sequence and structure, potentially via multi-modal diffusion frameworks, could further enhance performance.
Instruction Tuning and RL: Adapting instruction-following and reinforcement learning paradigms from LLMs may unlock new capabilities in protein design.

Conclusion

DPLM establishes discrete diffusion as a robust probabilistic framework for protein language modeling, achieving state-of-the-art generative and predictive performance. Its versatility in conditioning, strong representation learning, and scalable architecture position it as a foundational model for AI-driven protein research. Future work should focus on expanding conditional capabilities, integrating structural modeling, and leveraging advances from general-purpose LLMs to further enhance protein design and understanding.