
Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints (2403.17954v1)

Published 10 Mar 2024 in cs.LG, physics.chem-ph, and q-bio.BM

Abstract: Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods; in contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure-selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the $L$ most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, $L$. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated methods across prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
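The Sort & Slice procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each compound is represented as a set of integer substructure identifiers (hypothetical stand-ins for ECFP identifiers, which in practice would come from a cheminformatics toolkit), ties in frequency are broken by identifier value (an assumption; the abstract does not specify a tie-breaking rule), and hash-based folding is approximated by a simple modulo operation for comparison. All function names are invented for this example.

```python
from collections import Counter
from typing import Iterable, List, Set


def fit_sort_and_slice(train_substructs: Iterable[Set[int]], length: int) -> List[int]:
    """Return the `length` substructures found in the most training compounds."""
    counts: Counter = Counter()
    for subs in train_substructs:
        # Count each substructure once per compound (prevalence, not multiplicity).
        counts.update(set(subs))
    # Sort by descending prevalence; break ties by identifier (assumption).
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    # Slice away everything except the `length` most frequent substructures.
    return ranked[:length]


def sort_and_slice_fingerprint(substructs: Set[int], vocabulary: List[int], length: int) -> List[int]:
    """Binary vector: bit i is set iff the i-th retained substructure is present."""
    bits = [0] * length
    for i, s in enumerate(vocabulary):
        if s in substructs:
            bits[i] = 1
    return bits


def folded_fingerprint(substructs: Set[int], length: int) -> List[int]:
    """Hash-style folding via modulo (illustrative); distinct substructures may collide."""
    bits = [0] * length
    for s in substructs:
        bits[s % length] = 1
    return bits


# Toy training set: three compounds with hypothetical substructure identifiers.
train = [{11, 22, 33}, {11, 22}, {11, 44}]
L = 2

vocab = fit_sort_and_slice(train, L)                    # [11, 22]
fp = sort_and_slice_fingerprint({22, 44, 99}, vocab, L)  # [0, 1]
folded = folded_fingerprint({11, 33}, L)                 # [0, 1]: 11 and 33 share bit 1
print(vocab, fp, folded)
```

Because every retained substructure gets its own position, two different substructures never share a bit, whereas in the folded vector identifiers 11 and 33 both land on bit 1. Substructures outside the retained vocabulary (44 and 99 in the query above) are simply ignored.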

