DeepSF: deep convolutional neural network for mapping protein sequences to folds (1706.01010v1)

Published 4 Jun 2017 in cs.LG and q-bio.BM

Abstract: Motivation: Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a target protein based on the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Only a few methods had been developed to classify protein sequences into a small number of folds due to methodological limitations, which are not generally useful in practice. Results: We develop a deep 1D-convolution neural network (DeepSF) to directly classify any protein sequence into one of 1195 known folds, which is useful for both fold recognition and the study of the sequence-structure relationship. Different from traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and maps it to the fold space. We train and test our method on the datasets curated from SCOP1.75, yielding a classification accuracy of 80.4%. On the independent testing dataset curated from SCOP2.06, the classification accuracy is 77.0%. We compare our method with a top profile-profile alignment method, HHSearch, on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 14.5%-29.1% higher than HHSearch on template-free modeling targets and 4.5%-16.7% higher on hard template-based modeling targets for top 1, 5, and 10 predicted folds. The hidden features extracted from sequence by our method are robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking.

Citations (187)

Summary

The paper presents DeepSF, which directly classifies protein sequences into 1195 folds using a deep 1D-CNN architecture, bypassing traditional alignment methods.
DeepSF achieved 80.4% accuracy on SCOP 1.75 and 77.0% on an independent SCOP 2.06 test set, and its fold recognition accuracy was 14.5%-29.1% higher than HHSearch on template-free CASP9-12 targets.
The hidden features it extracts are robust to sequence mutation, insertion, deletion, and truncation, and can be reused for protein clustering, comparison, and ranking.

Analyzing DeepSF: A Deep Convolutional Neural Network for Protein Fold Recognition

Computational recognition and classification of protein folds is a longstanding challenge in structural bioinformatics. Traditional methods largely depend on sequence homology comparison, predicting a target protein's fold from that of a template protein with known structure; these approaches do not explain the direct relationship between sequence and fold. The paper "DeepSF: deep convolutional neural network for mapping protein sequences to folds" presents DeepSF, a deep learning method that addresses this limitation by classifying protein sequences directly into 1195 known folds using a deep one-dimensional convolutional neural network (1D-CNN).

Methodology

DeepSF uses a deep network to extract fold-related features automatically from protein sequences of variable length. Its architecture contains 10 convolutional layers and replaces the comparison against template sequences used by alignment-based techniques. Each residue is described by a multidimensional feature vector that combines the amino acid sequence, a profile derived from a position-specific scoring matrix (PSSM), predicted secondary structure, and predicted solvent accessibility. Unlike earlier machine learning approaches, which classified sequences into only a small number of folds, DeepSF classifies directly into 1195 fold categories.
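To make the idea of mapping sequences of any length to a fixed-size representation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it uses a single convolutional layer with random weights, only a one-hot sequence encoding (the profile, secondary structure, and solvent accessibility channels would be concatenated to each residue vector in the same way), and global max pooling as an assumed way of collapsing the length axis. The function names, filter count, and window width are all illustrative choices.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode each residue as a 20-dimensional one-hot vector.

    Unknown residue letters become all-zero vectors.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1.0
        rows.append(row)
    return rows

def conv1d_relu(features, filters, bias):
    """Valid 1D convolution along the sequence axis, followed by ReLU.

    features: L x C list of per-residue vectors
    filters:  F x W x C weights (F filters, window width W)
    returns:  (L - W + 1) x F activations
    """
    width = len(filters[0])
    out = []
    for start in range(len(features) - width + 1):
        row = []
        for f, filt in enumerate(filters):
            total = bias[f]
            for w in range(width):
                residue = features[start + w]
                total += sum(x * k for x, k in zip(residue, filt[w]))
            row.append(max(0.0, total))
        out.append(row)
    return out

def global_max_pool(activations):
    """Keep the maximum of each filter over all positions (assumed pooling)."""
    return [max(column) for column in zip(*activations)]

def random_filters(n_filters, width, channels, rng):
    """Illustrative random weights; a real model would learn these."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(channels)]
             for _ in range(width)]
            for _ in range(n_filters)]

rng = random.Random(0)
filters = random_filters(n_filters=8, width=3, channels=20, rng=rng)
bias = [0.0] * 8

def embed(sequence):
    """Map a sequence of any length >= 3 to an 8-dimensional vector."""
    return global_max_pool(conv1d_relu(one_hot(sequence), filters, bias))

shorter = embed("MKTAYIAKQR")
longer = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(shorter), len(longer))  # both vectors have length 8
```

Because the pooling step takes a maximum over positions, the output length depends only on the number of filters, not on the sequence length. A fixed-size vector like this is what a fully connected classification layer over fold classes would consume.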

Results and Comparative Performance

DeepSF was trained and tested on datasets curated from SCOP 1.75, then evaluated on an independent test set curated from SCOP 2.06 and on hard template-based and template-free modeling targets from CASP9-12. It achieved a classification accuracy of 80.4% on SCOP 1.75 and 77.0% on the independent SCOP 2.06 dataset. Compared with HHSearch, a leading profile-profile alignment method, DeepSF's fold recognition accuracy was 14.5%-29.1% higher on template-free targets and 4.5%-16.7% higher on hard template-based targets, measured over the top 1, 5, and 10 predicted folds.
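The comparison over the top 1, 5, and 10 predicted folds is a top-k accuracy measurement: a prediction counts as correct if the true fold appears among the k highest-scoring classes. The sketch below shows that calculation on made-up scores for four hypothetical fold classes; the function name and the toy data are illustrative and are not taken from the paper.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of examples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Toy scores: 3 proteins, 4 hypothetical fold classes.
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true fold 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # true fold 1 is ranked second
    [0.05, 0.15, 0.30, 0.50],  # true fold 0 is ranked last
]
labels = [1, 1, 0]

for k in (1, 2, 4):
    print(k, top_k_accuracy(scores, labels, k))
```

On this toy data one of three proteins is correct at k = 1, two of three at k = 2, and all three at k = 4, which illustrates why reported accuracy rises as k grows from 1 to 10.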

The authors also report that the hidden features DeepSF extracts are robust to sequence mutations, insertions, deletions, and truncations. This robustness suggests the features could serve other protein pattern recognition tasks, such as clustering, comparison, and ranking.
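One way such fixed-length hidden features could be used for comparison and ranking is to score pairs of feature vectors with a similarity measure. The sketch below uses cosine similarity, which is an assumption made here for illustration; the summary does not say which measure the authors use. The three-dimensional vectors and their names are invented stand-ins for real hidden features.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in candidates.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored]

# Hypothetical feature vectors; real ones would come from a trained network.
query = [0.9, 0.1, 0.4]
candidates = {
    "variant": [0.8, 0.2, 0.4],    # stands in for a mutated copy of the query
    "truncated": [0.9, 0.1, 0.3],  # stands in for a truncated copy
    "unrelated": [0.0, 1.0, 0.1],  # stands in for a protein with another fold
}

ranking = rank_by_similarity(query, candidates)
print(ranking)
```

If the features are robust in the sense the authors describe, perturbed copies of a sequence should stay close to the original in this space, so they rank ahead of unrelated proteins, as the two perturbed stand-ins do here.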

Implications and Future Directions

The implications of DeepSF are both practical and theoretical. Practically, classifying sequences directly into known folds without alignment-based comparison could improve computational protein structure prediction, particularly for template-free targets, where the reported gain over HHSearch is largest. Theoretically, the features DeepSF learns may offer insight into the relationship between sequence and structure.

The paper illustrates the potential of deep learning in bioinformatics and offers a template for similar work on other biological problems. Future work could enlarge and refine the training datasets or incorporate additional structural information to improve performance and broaden applicability. The approach may also inspire analogous techniques in other domains where comparison-based methods face similar limitations.

In short, DeepSF is a step toward more direct and accurate protein fold recognition, and it reflects a broader trend of applying deep learning to complex biological problems.