
Diffusion on language model encodings for protein sequence generation (2403.03726v2)

Published 6 Mar 2024 in cs.LG, cs.AI, and q-bio.BM

Abstract: Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. We extensively evaluate existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family generation, motif scaffolding and infilling, and fold-specific sequence design. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios.


Summary

The paper introduces a diffusion-based framework that uses language model embeddings to generate realistic protein sequences.
It applies continuous diffusion to protein language model representations to capture complex protein sequence patterns.
Experimental results show strong performance against autoregressive, discrete diffusion, and flow matching baselines, highlighting its potential in protein design research.
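
To make the idea of diffusion on encodings concrete, here is a minimal sketch of a forward (noising) step applied to a matrix of per-residue encodings. It is an illustration under stated assumptions, not DiMA's implementation: the cosine schedule, the function names `cosine_alpha_bar` and `add_noise`, and the random stand-in for encoder output are all hypothetical choices made for this example.

```python
import math
import random


def cosine_alpha_bar(t: float) -> float:
    """Cumulative signal level at time t in [0, 1] (cosine schedule, assumed here)."""
    return math.cos(0.5 * math.pi * t) ** 2


def add_noise(encoding, t, rng):
    """Generic Gaussian forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    a = cosine_alpha_bar(t)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in encoding]


rng = random.Random(0)
# Stand-in for encoder output: 5 residues x 4 dimensions of random values.
# A real pipeline would obtain these from a protein language model instead.
x0 = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]

x_half = add_noise(x0, 0.5, rng)  # equal parts signal and noise
x_end = add_noise(x0, 1.0, rng)   # essentially pure Gaussian noise
```

A generative model of this kind is trained to reverse the noising, and generated encodings are then decoded back into amino-acid sequences; neither step is shown here.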

A Comprehensive Overview of Submission and Formatting Instructions for ICML 2024

Introduction

The International Conference on Machine Learning (ICML) is a major annual venue where machine learning researchers and practitioners share results and discuss where the field is heading. The 2024 submission guidelines set out the formatting requirements, submission procedures, and ethical rules that a paper must satisfy to be considered for the conference proceedings. This blog post summarizes and explains the key rules so that authors can submit efficiently and in compliance.

Electronic Submission and Preparations

ICML 2024 accepts submissions only through its electronic submission system; email and hard-copy submissions are not accepted. As a new requirement, appendices must be combined with the main manuscript and references into a single PDF file, which helps ensure nothing is overlooked during review. Key details include:

PDF Format Exclusivity: The manuscript, including appendices, must be submitted strictly in PDF format.
Page Limitations: The main body of the paper is limited to 8 pages, with additional space permitted for appendices and references. Authors of accepted papers may extend the main body by one extra page in their final submission.
Author Anonymity: Initial submissions must conceal author identities, in keeping with ICML's double-blind review process.
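
The three rules above can be sketched as a small pre-submission check. This is a hypothetical helper written for illustration, not an official ICML tool: the `Submission` fields and the `check_submission` function are assumed names, and the page count and anonymization status are supplied by the author rather than read from the PDF.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    filename: str
    main_body_pages: int        # pages excluding references and appendices
    anonymized: bool
    camera_ready: bool = False  # True for the final version of an accepted paper


def check_submission(sub: Submission) -> list[str]:
    """Return a list of rule violations; an empty list means all checks pass."""
    problems = []
    if not sub.filename.lower().endswith(".pdf"):
        problems.append("submission must be a single PDF file")
    # Accepted papers get one extra page for the main body.
    limit = 9 if sub.camera_ready else 8
    if sub.main_body_pages > limit:
        problems.append(f"main body is {sub.main_body_pages} pages; limit is {limit}")
    # Anonymity applies to the initial, double-blind submission only.
    if not sub.camera_ready and not sub.anonymized:
        problems.append("initial submission must be anonymized")
    return problems


print(check_submission(Submission("paper.pdf", 9, anonymized=True)))
```

The same 9-page paper passes once `camera_ready=True` is set, reflecting the extra page granted to accepted papers.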

Style and Formatting Nuances

Style and typography carry significant weight. Body text must be set in 10 point Times throughout, and the guidelines give exact specifications for figure captions, table placement, and references. Key points include:

Font Integrity: Type-1 fonts are mandatory, to avoid problems when the document is processed or converted.
Element Positioning: Specific directives on the placement and formatting of figures, tables, and references to maintain consistency and readability.
Reference Formatting: References are alphabetized by first-author surname, with entries by the same authors ordered chronologically, and each entry gives complete details, including page numbers where feasible, so that the bibliography is presented uniformly.

Ethical Compliance and Anonymity

ICML enforces strict policies on research ethics. Simultaneous submissions are prohibited: a paper found to be under consideration elsewhere is summarily rejected. The anonymity requirement extends to removing any author identification from the submission text, which supports an unbiased review process. Authors should also cite their own prior work in a way that preserves the blind review.

Evaluation Metrics and Ablation Studies

The guidelines emphasize rigorous empirical evaluation, and the performance comparison table in the paper serves as a template. The ablation study both benchmarks the proposed DiMA model against existing architectures and shows the incremental improvement contributed by each model variant, reflecting a methodical approach to model refinement.

Implications and Theoretical Contributions

While keeping a neutral tone, it is worth noting the paper's theoretical and practical implications for the machine learning community. The performance gains reported for DiMA point to its potential for improving protein sequence modeling. In addition, the design choices behind the model's architecture suggest a direction that may spur further research in this area.

Conclusion and Future Directions

In sum, the ICML 2024 submission and formatting instructions give authors a detailed blueprint for presenting their work in a coherent, standardized way. The guidelines are designed to support a fair and rigorous review process and to encourage high-quality submissions that advance the field of machine learning. By following them, authors contribute to the body of knowledge that ICML represents.
