SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

Published 12 Nov 2019 in cs.LG and stat.ML | (1911.04738v1)

Abstract: In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly for shallow prediction models or small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of the sequence-to-sequence language model using a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithms in small-data settings where pre-training facilitated good generalization. Moreover, we define a novel metric to concurrently measure model accuracy and data efficiency.

Summary

The paper introduces the SMILES Transformer, which generates continuous molecular fingerprints for low-data drug discovery.
It pre-trains an encoder-decoder Transformer, without labels, on 861,000 SMILES strings and reaches a decoding perplexity of 1.0, meaning the input is reconstructed perfectly.
A novel Data Efficiency Metric (DEM) is proposed, and benchmarks on MoleculeNet show improved performance when training data is scarce.

Overview of SMILES Transformer for Low Data Drug Discovery

The paper introduces the SMILES Transformer, a data-driven approach to generating molecular fingerprints for low-data drug discovery. It applies the Transformer architecture, prominent in NLP, to SMILES notation, a text-based system for encoding molecular structures, and learns molecular representations directly from those strings.

Problem Context

Traditional molecular fingerprint algorithms rely on rule-based mappings that place molecules in a sparse, discrete space. These fingerprints can underperform when paired with shallow predictors or small datasets. Graph-based approaches are effective in quantitative structure-property relationship (QSPR) tasks but require large labeled datasets, which are often unavailable because experimentally validated molecular data is scarce.

Methodology

The SMILES Transformer is inspired by recent advances in pre-trained language models such as BERT and XLNet. It employs an encoder-decoder network with 4 Transformer blocks in each of the encoder and the decoder. The model is pre-trained without labels on a large corpus of SMILES drawn from ChEMBL24, learning a sequence-to-sequence mapping by minimizing cross-entropy between the decoded and target SMILES.
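
To make the training objective concrete, here is a minimal sketch in plain Python of how a perplexity figure relates to cross-entropy. It assumes the usual definition of perplexity as the exponential of the mean per-token cross-entropy; the function names and the toy probabilities are illustrative and are not taken from the paper's code.

```python
import math

def mean_cross_entropy(target_probs):
    """Mean negative log-likelihood over the probabilities the model
    assigned to each correct target token."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(target_probs):
    """Perplexity = exp(mean cross-entropy); 1.0 is the lower bound."""
    return math.exp(mean_cross_entropy(target_probs))

# Toy values: probability given to each correct token of a decoded SMILES.
confident = [1.0, 1.0, 1.0, 1.0]   # every token predicted with certainty
uncertain = [0.5, 0.5, 0.5, 0.5]   # a coin flip at every position

print(perplexity(confident))
print(perplexity(uncertain))
```

Under this definition, a perplexity of 1.0 is reached only when the cross-entropy is zero, that is, when the decoder assigns probability 1 to every correct token.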

Key steps include:

Pre-training: Uses 861,000 unlabeled SMILES strings as input; each canonical SMILES is randomly rewritten into an equivalent form to diversify the training data. The model reaches a decoding perplexity of 1.0, which corresponds to perfect reconstruction.
Fingerprint Extraction: Pools the model's outputs into a 1024-dimensional vector per molecule, yielding a continuous, data-driven fingerprint.
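
The fingerprint-extraction step can be sketched as follows in plain Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes the fingerprint concatenates four 256-dimensional views of the per-token outputs (mean pool and max pool of the last block, plus the first-token outputs of the last and second-to-last blocks), which gives 4 x 256 = 1024 dimensions. The function and variable names are hypothetical.

```python
def mean_pool(token_vectors):
    """Average each hidden dimension over all tokens."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the maximum of each hidden dimension over all tokens."""
    return [max(column) for column in zip(*token_vectors)]

def extract_fingerprint(last_block, penultimate_block):
    """Concatenate four pooled views into one fixed-length vector.

    Both arguments are lists of per-token vectors (tokens x hidden size).
    The choice of these four views is an assumption for illustration.
    """
    return (
        mean_pool(last_block)
        + max_pool(last_block)
        + list(last_block[0])          # first-token output, last block
        + list(penultimate_block[0])   # first-token output, previous block
    )

# Toy input: 3 tokens with a hidden size of 2 instead of 256.
last = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
penultimate = [[0.5, 0.5], [9.0, 9.0], [9.0, 9.0]]

fingerprint = extract_fingerprint(last, penultimate)
print(len(fingerprint), fingerprint)
```

Because pooling runs over the token axis, the output length depends only on the hidden size, so molecules with SMILES strings of different lengths map to vectors of the same dimension.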

Novel Contributions

Data Efficiency Metric (DEM): A new scalar metric that summarizes model performance across varying training set sizes, so that accuracy and data efficiency can be compared with a single number.
Benchmarking: Evaluations on 10 MoleculeNet datasets show that the SMILES Transformer outperforms existing methods on 5 of them, with the largest gains in small-data settings.
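
A metric of this kind can be sketched as below. This is a minimal illustration that assumes DEM is the plain average of a test-set score over models trained on progressively larger fractions of the training data; the fraction schedule (1.25% doubling up to 80%) and the `train_and_score` callback are assumptions made for the example, not details confirmed by this summary.

```python
# Assumed schedule: training fractions doubling from 1.25% to 80%.
TRAIN_FRACTIONS = [0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8]

def data_efficiency_metric(train_and_score, fractions=TRAIN_FRACTIONS):
    """Average a test-set score over models trained on each fraction.

    train_and_score(fraction) is a hypothetical callback that trains a
    model on that share of the training data and returns its test score.
    """
    scores = [train_and_score(fraction) for fraction in fractions]
    return sum(scores) / len(scores)

# Toy learners with made-up scores: one needs a lot of data to do well,
# the other is already strong with very little.
def data_hungry(fraction):
    return 0.5 + 0.5 * fraction

def data_efficient(fraction):
    return 0.85 + 0.05 * fraction

print(data_efficiency_metric(data_hungry))
print(data_efficiency_metric(data_efficient))
```

Averaging over the whole schedule rewards models that stay accurate at the small fractions, which is the regime the paper targets.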

Numerical Results and Implications

The SMILES Transformer achieved the best DEM score on 5 of the 10 datasets, with substantial improvements on ESOL, FreeSolv, BBBP, and ClinTox. These results indicate that pre-training helps when labeled data is limited, which makes the approach a candidate for early-stage drug discovery pipelines.

Theoretical and Practical Implications

Theoretically, this work connects NLP methods with cheminformatics, illustrating how large-scale unsupervised learning can improve molecular representations without extensive labeled data. Practically, the SMILES Transformer could reduce reliance on large labeled datasets, which may streamline drug discovery and lower its cost.

Future Directions

The paper suggests several avenues for future research:

Advanced Architectures: Incorporating models like Transformer-XL to handle longer sequences.
Multi-task Learning: Expanding the training objective to predict molecular properties alongside sequence decoding, to improve the learned chemical representation.
SMILES Enumeration: Using multiple equivalent SMILES encodings of the same molecule to improve the representation.

The source code is publicly available, promoting reproducibility and further exploration of the approach.

In conclusion, the SMILES Transformer offers a pre-trained molecular fingerprinting model that performs well in low-data drug discovery, with implications for both the theory and practice of cheminformatics and computational drug design.