End-to-End Learning on 3D Protein Structure for Interface Prediction (1807.01297v5)

Published 3 Jul 2018 in q-bio.BM, cs.LG, and stat.ML

Abstract: Despite an explosion in the number of experimentally determined, atomically detailed structures of biomolecules, many critical tasks in structural biology remain data-limited. Whether performance in such tasks can be improved by using large repositories of tangentially related structural data remains an open question. To address this question, we focused on a central problem in biology: predicting how proteins interact with one another---that is, which surfaces of one protein bind to those of another protein. We built a training dataset, the Database of Interacting Protein Structures (DIPS), that contains biases but is two orders of magnitude larger than those used previously. We found that these biases significantly degrade the performance of existing methods on gold-standard data. Hypothesizing that assumptions baked into the hand-crafted features on which these methods depend were the source of the problem, we developed the first end-to-end learning model for protein interface prediction, the Siamese Atomic Surfacelet Network (SASNet). Using only spatial coordinates and identities of atoms, SASNet outperforms state-of-the-art methods trained on gold-standard structural data, even when trained on only 3% of our new dataset. Code and data available at https://github.com/drorlab/DIPS.

Authors (4)

Raphael J. L. Townshend
Rishi Bedi
Patricia A. Suriana
Ron O. Dror

Citations (92)

Summary

The paper introduces SASNet, a siamese 3D convolutional network that predicts interaction interfaces from voxelized protein surfacelets.
Trained on the large DIPS dataset, it learns directly from atomic coordinates and atom identities, which avoids the limitations of hand-crafted features.
In the reported experiments, SASNet attains higher CAUROC scores than prior methods and generalizes well, a result relevant to protein engineering and drug development.

An Overview of End-to-End Learning on 3D Protein Structure for Interface Prediction

The paper "End-to-End Learning on 3D Protein Structure for Interface Prediction" addresses protein interface prediction by combining a much larger body of structural data with end-to-end learning. Its model, the Siamese Atomic Surfacelet Network (SASNet), predicts which parts of two proteins interact using only raw 3D coordinates and atom identities, departing from earlier methods that rely on hand-crafted features.

Key Contributions and Methodology

The authors introduce the Database of Interacting Protein Structures (DIPS), which comprises 42,826 binary protein interaction structures, roughly two orders of magnitude more than previous datasets such as Docking Benchmark 5 (DB5). Because DIPS contains biases, existing methods built on hand-crafted features perform worse on gold-standard data when trained on it. SASNet, an end-to-end learning model, is designed to avoid this degradation.

SASNet operates on voxelized representations of protein "surfacelets," the local atomic environments around individual amino acids. For a candidate pair of amino acids, one from each protein, the two surfacelets are each processed by a three-dimensional convolutional neural network (3D CNN), and the two networks are tied together in a siamese arrangement before a final prediction is made for the pair. This approach bypasses labor-intensive feature engineering and relies on the CNN's ability to capture hierarchical patterns in protein structures.
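The voxelization and weight sharing described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the element channels, grid size, resolution, and the function names `voxelize` and `siamese_score` are all assumptions made for illustration, and a single dense layer stands in for the 3D convolutional stack.

```python
import numpy as np

# Hypothetical channel layout: one occupancy channel per element type.
ELEMENTS = ("C", "N", "O", "S")

def voxelize(coords, elements, center, grid_size=8, resolution=1.0):
    """Return a (channels, G, G, G) occupancy grid centred on `center`."""
    coords = np.asarray(coords, dtype=float)
    center = np.asarray(center, dtype=float)
    grid = np.zeros((len(ELEMENTS), grid_size, grid_size, grid_size))
    half = grid_size * resolution / 2.0
    # Shift so the cube's corner is the origin, then convert to voxel indices.
    idx = np.floor((coords - center + half) / resolution).astype(int)
    for (i, j, k), elem in zip(idx, elements):
        inside = 0 <= i < grid_size and 0 <= j < grid_size and 0 <= k < grid_size
        if inside and elem in ELEMENTS:
            grid[ELEMENTS.index(elem), i, j, k] = 1.0
    return grid

def siamese_score(grid_a, grid_b, w_tower, w_out):
    """Score a surfacelet pair; both inputs pass through the SAME tower weights."""
    # Dense + ReLU is a stand-in here for the shared 3D convolutional tower.
    emb_a = np.maximum(0.0, w_tower @ grid_a.ravel())
    emb_b = np.maximum(0.0, w_tower @ grid_b.ravel())
    logit = w_out @ np.concatenate([emb_a, emb_b])
    return 1.0 / (1.0 + np.exp(-logit))  # probability-like score in [0, 1]

if __name__ == "__main__":
    # Toy atoms; the third lies outside the cube and is dropped.
    atoms = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [100.0, 0.0, 0.0]]
    g = voxelize(atoms, ["C", "N", "O"], center=(0.0, 0.0, 0.0))
    rng = np.random.default_rng(0)
    w_tower = rng.normal(size=(16, g.size))  # untrained, random weights
    w_out = rng.normal(size=32)
    print(g.shape, siamese_score(g, g, w_tower, w_out))
```

Because the two surfacelets share `w_tower`, the embedding of a given surfacelet does not depend on which side of the pair it occupies, which is the property the siamese design provides.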

Experimental Results and Analysis

In empirical evaluations, SASNet outperforms existing methods on the paired interface prediction task, as measured by CAUROC. Notably, while competing methods falter when trained on DIPS, SASNet's performance remains robust. The authors take this as evidence that the model learns more than simple shape complementarity and implicitly accounts for protein flexibility across different conformations.
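As a rough illustration of the metric, the sketch below assumes CAUROC denotes the median of per-complex AUROC values, with each complex contributing one set of labeled residue pairs. The helper names `auroc` and `cauroc` and the toy data are hypothetical, and AUROC is computed by direct pairwise comparison rather than with a library routine.

```python
from statistics import median

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a correct ranking.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cauroc(per_complex):
    """Median AUROC over complexes; `per_complex` is a list of (labels, scores)."""
    return median(auroc(labels, scores) for labels, scores in per_complex)

if __name__ == "__main__":
    # Toy predictions for three complexes (made-up numbers).
    complexes = [
        ([1, 0, 0], [0.9, 0.1, 0.2]),
        ([1, 0], [0.2, 0.8]),
        ([1, 1, 0, 0], [0.8, 0.3, 0.5, 0.1]),
    ]
    print(cauroc(complexes))
```

Taking the median across complexes, rather than pooling all residue pairs, keeps a few very large complexes from dominating the score.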

The hyperparameter analysis shows that SASNet's performance improves as the training set and the input grid grow, suggesting that further gains are attainable with more computational resources. The model also remains competitive when examples with close structural relationships to DB5 are pruned from its training data, which supports its ability to generalize.

Theoretical and Practical Implications

The successful application of SASNet to a substantially larger and more diverse dataset than previously available holds promise for advancing protein interface prediction. This approach could impact protein engineering and drug development, where understanding protein interactions is critical. The insight that proteins' hierarchical structures and local interactions align well with the design of CNNs could steer future research towards employing deep learning frameworks for other challenges in structural biology, potentially extending beyond protein interfaces to single-molecule studies or novel protein design.

Future Outlook

Going forward, it would be instructive to adapt end-to-end learning models to temporal data that reflect dynamic conformational changes in proteins, or to integrate additional data types such as cryo-electron microscopy so that lower-resolution structures can be used. Investigating the hierarchical patterns learned by SASNet could also shed light on the underlying principles of molecular interactions and inform new theoretical work in computational biology and bioinformatics.

In conclusion, the paper shows that end-to-end learning can improve protein interface prediction and motivates further work on applying deep learning techniques to complex biological systems.
