
Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints (2403.17954v1)

Published 10 Mar 2024 in cs.LG, physics.chem-ph, and q-bio.BM

Abstract: Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods; in contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure-selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the $L$ most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, $L$. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated methods across prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
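As a rough illustration of the two pooling schemes contrasted in the abstract, the following minimal sketch assumes each compound has already been reduced to a set of integer substructure identifiers. The function names, the toy identifiers, the tie-breaking rule (smaller identifier first) and the use of a plain modulo as a stand-in for hash-based folding are all assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter

def fit_sort_and_slice(train_substructures, L):
    """Rank substructures by how many training compounds contain them
    and keep the L most frequent ones.

    Returns a dict mapping each kept substructure id to its bit position.
    Ties are broken by the smaller identifier (an assumption of this sketch).
    """
    counts = Counter()
    for subs in train_substructures:
        counts.update(set(subs))  # count each substructure once per compound
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return {s: i for i, s in enumerate(ranked[:L])}

def sort_and_slice_vector(subs, index, L):
    """Binary fingerprint of length L; substructures outside the kept
    set are ignored, so no two substructures share a bit."""
    vec = [0] * L
    for s in subs:
        i = index.get(s)
        if i is not None:
            vec[i] = 1
    return vec

def hash_fold_vector(subs, L):
    """Simplified stand-in for hash-based folding: each integer identifier
    is mapped to position (id mod L), so distinct substructures can collide."""
    vec = [0] * L
    for s in subs:
        vec[s % L] = 1
    return vec

# Toy training set: three compounds described by hypothetical substructure ids.
train = [{10, 20, 30}, {10, 20, 14}, {10, 99}]
L = 4
index = fit_sort_and_slice(train, L)

query = {10, 14}
sns = sort_and_slice_vector(query, index, L)
folded = hash_fold_vector(query, L)
print(sns, folded)
```

In this toy case the identifiers 10 and 14 both fold onto position 2 under the modulo scheme, whereas Sort & Slice assigns them separate positions because each kept substructure receives its own bit.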

Authors (4)
  1. Markus Dablander
  2. Thierry Hanser
  3. Renaud Lambiotte
  4. Garrett M. Morris
