Automated design of collective variables using supervised machine learning

Published 28 Feb 2018 in stat.ML, cs.CE, and q-bio.BM | (1802.10510v2)

Abstract: Selection of appropriate collective variables for enhancing sampling of molecular simulations remains an unsolved problem in computational biophysics. In particular, picking initial collective variables (CVs) is particularly challenging in higher dimensions. Which atomic coordinates or transforms thereof from a list of thousands should one pick for enhanced sampling runs? How does a modeler even begin to pick starting coordinates for investigation? This remains true even in the case of simple two state systems and only increases in difficulty for multi-state systems. In this work, we solve the initial CV problem using a data-driven approach inspired by the field of supervised machine learning. In particular, we show how the decision functions in supervised machine learning (SML) algorithms can be used as initial CVs (SML_cv) for accelerated sampling. Using solvated alanine dipeptide and Chignolin mini-protein as our test cases, we illustrate how the distance to the Support Vector Machines' decision hyperplane, the output probability estimates from Logistic Regression, the outputs from deep neural network classifiers, and other classifiers may be used to reversibly sample slow structural transitions. We discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.


Authors (2)

Citations (112)


Summary

The paper repurposes supervised machine learning algorithms to automate collective variable selection, improving sampling efficiency in molecular simulations.
It shows that using SVM, LR, and DNN models enables reversible sampling of slow conformational transitions in systems like alanine dipeptide and Chignolin.
The study paves the way for integrating multiclass classification and online optimization methods to advance automated CV design in computational biophysics.

Overview: Automated Design of Collective Variables Using Supervised Machine Learning

The paper by Sultan and Pande addresses the challenge of selecting appropriate collective variables (CVs) for enhanced sampling in molecular simulations, a problem the authors describe as unsolved in computational biophysics. They propose using supervised machine learning (SML) to solve the "initial" CV problem and demonstrate the approach on two test systems: solvated alanine dipeptide and the Chignolin mini-protein.

Supervised Machine Learning as a Strategy for CV Selection

One of the paper's key contributions is recasting CV selection as a supervised machine learning problem. The study shows that the decision functions of SML algorithms, such as Support Vector Machines (SVMs), Logistic Regression (LR), and Deep Neural Networks (DNNs), can be repurposed as initial collective variables (SML_cv) for molecular simulations. For an SVM the CV is the distance to the decision hyperplane, for LR it is the output probability estimate, and for a DNN it is the classifier output. These CVs are shown to reversibly sample slow structural transitions, offering a potential advance over manually chosen CVs.
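As a minimal sketch of this idea (not the authors' implementation), the following pure-Python example trains a logistic regression classifier on synthetic samples from two hypothetical states and uses its output probability as a one-dimensional CV. The state centers, the sin/cos featurization of two backbone dihedrals, the training settings, and all function names are assumptions made for illustration.

```python
import math
import random


def featurize(phi, psi):
    """Map two dihedral angles (radians) to periodic features."""
    return [math.sin(phi), math.cos(phi), math.sin(psi), math.cos(psi)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(features, labels, learning_rate=0.5, epochs=500):
    """Binary logistic regression fitted by batch gradient descent."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xj for w, xj in zip(weights, x)) + bias
            err = sigmoid(z) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def sml_cv(phi, psi, weights, bias):
    """CV value: predicted probability that a frame belongs to state 1."""
    x = featurize(phi, psi)
    return sigmoid(sum(w * xj for w, xj in zip(weights, x)) + bias)


def sample_state(rng, center, n, spread=0.3):
    """Draw synthetic (phi, psi) pairs around an assumed state center."""
    return [(rng.gauss(center[0], spread), rng.gauss(center[1], spread))
            for _ in range(n)]


# Illustrative state centers in radians (assumed, not taken from the paper).
BETA_CENTER = (-2.5, 2.5)
ALPHA_L_CENTER = (1.0, 0.8)

rng = random.Random(0)
frames = sample_state(rng, BETA_CENTER, 200) + sample_state(rng, ALPHA_L_CENTER, 200)
X = [featurize(phi, psi) for phi, psi in frames]
y = [0] * 200 + [1] * 200  # label 0 = first state, label 1 = second state

weights, bias = fit_logistic(X, y)
print("CV at first state center: ", sml_cv(*BETA_CENTER, weights, bias))
print("CV at second state center:", sml_cv(*ALPHA_L_CENTER, weights, bias))
```

Because the CV is a smooth function of the input features, an enhanced sampling method could in principle bias it directly, pushing the system from values near 0 toward values near 1 and back.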

Application and Results

The SML-based framework yielded encouraging results across the test cases. For alanine dipeptide, CVs built from SVM and LR models sampled the slow $\beta$ to $\alpha_L$ transition multiple times, and the resulting trajectories were reweighted to estimate the associated free energy surfaces. DNNs, used for non-linear separation, performed similarly, achieving 15 transitions along alanine dipeptide's slower dihedral coordinate within 45 ns of sampling. These outcomes support the viability of SML-derived decision functions as CVs in molecular simulations.
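The reweighting step can be illustrated with a short sketch. This summary does not state which reweighting scheme the paper used, so the example below assumes the simplest case of a static (time-independent) bias $V(s)$, where each frame receives the weight $e^{\beta V(s)}$ and the free energy is $F(s) = -k_B T \ln p(s)$. The function name, bin count, and CV range are hypothetical.

```python
import math

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def reweighted_free_energy(cv_values, bias_values, temperature=300.0,
                           n_bins=20, cv_range=(0.0, 1.0)):
    """Estimate F(s) on a grid from frames sampled under a static bias.

    Each frame is weighted by exp(beta * V), which undoes the effect of the
    bias on the sampled distribution. Returns free energies in kJ/mol,
    shifted so the minimum is zero. Empty bins are reported as infinity.
    """
    beta = 1.0 / (KB * temperature)
    lo, hi = cv_range
    width = (hi - lo) / n_bins
    v_max = max(bias_values)  # shift exponents for numerical stability
    hist = [0.0] * n_bins
    for s, v in zip(cv_values, bias_values):
        if not lo <= s <= hi:
            continue
        idx = min(int((s - lo) / width), n_bins - 1)
        hist[idx] += math.exp(beta * (v - v_max))
    total = sum(hist)
    free = [-math.log(h / total) / beta if h > 0 else float("inf")
            for h in hist]
    f_min = min(free)
    return [f - f_min for f in free]


# Toy check: if the bias exactly cancels F(s), biased sampling is uniform
# in s, and reweighting should recover F(s) up to an additive constant.
def toy_free_energy(s):
    return 20.0 * (s - 0.5) ** 2  # kJ/mol, an assumed test profile


centers = [0.05 * (i + 0.5) for i in range(20)]
bias = [-toy_free_energy(s) for s in centers]
profile = reweighted_free_energy(centers, bias)
print([round(f, 3) for f in profile])
```

For a time-dependent bias such as the one deposited during metadynamics, a dedicated estimator would be needed in place of the simple exponential weight used here.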

The paper also extends these methods to multi-state systems using multiclass classification. The multiclass SVM approach provides a systematic way to generate CVs for systems with multiple metastable states, which makes multidimensional enhanced sampling possible.
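One way to picture the multiclass case is sketched below, assuming a one-vs-rest arrangement of linear classifiers in which each state contributes one signed hyperplane distance as a CV. Whether the paper uses this exact arrangement is not stated in this summary, and the weights, biases, and function name are hypothetical placeholders rather than fitted values.

```python
import math


def one_vs_rest_cvs(features, weight_rows, biases):
    """Return one signed hyperplane distance per state-vs-rest classifier.

    For a linear classifier with weights w and offset b, the signed distance
    of a point x to the decision hyperplane is (w . x + b) / ||w||.
    """
    cvs = []
    for w, b in zip(weight_rows, biases):
        norm = math.sqrt(sum(wj * wj for wj in w))
        score = sum(wj * xj for wj, xj in zip(w, features)) + b
        cvs.append(score / norm)
    return cvs


# Hypothetical parameters for a three-state system with two input features.
HYPOTHETICAL_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
HYPOTHETICAL_BIASES = [0.0, 0.0, 0.0]

frame_features = [3.0, 4.0]
cvs = one_vs_rest_cvs(frame_features, HYPOTHETICAL_WEIGHTS, HYPOTHETICAL_BIASES)
print(cvs)
```

A system with K states then yields K CVs, and a multidimensional enhanced sampling method could bias some or all of them together.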

Broader Implications

The paper suggests that the SML approach can streamline CV determination and reduce the up-front effort a study requires. An SML-derived CV might also serve as a starting point for further optimization via methods such as SGOOP or VAC, potentially moving CV selection into an online learning setting. This adaptability indicates potential applications in areas such as drug binding kinetics, mutational studies, and force field assessments.

The authors describe the resulting CV as a preliminary estimate that might inadvertently include orthogonal modes, and they note that this limitation is a general issue in the field. Their discussion of transfer learning and its limits points to an area for future research.

Conclusion

Sultan and Pande's study presents an approach to automating CV selection with machine learning. Combining SML with enhanced sampling may inform future automated CV optimization protocols and improve free energy simulations. Researchers working in this area may benefit from using machine learning frameworks to choose collective variables, which reduces manual subjectivity and may improve computational efficiency.
