Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion (2312.03475v1)
Abstract: Artificial intelligence for drug discovery has recently attracted increasing interest in both the machine learning and chemistry communities. The fundamental building block of drug discovery is molecule geometry, so a molecule's geometric representation is the main bottleneck in applying machine learning techniques to drug discovery. In this work, we propose a pretraining method for molecule joint auto-encoding (MoleculeJAE). MoleculeJAE learns both 2D bond (topology) and 3D conformation (geometry) information: a diffusion process models the augmented trajectories of the two modalities, from which MoleculeJAE learns the inherent chemical structure in a self-supervised manner. The pretrained geometric representation is thus expected to benefit downstream geometry-related tasks. Empirically, MoleculeJAE demonstrates its effectiveness by reaching state-of-the-art performance on 15 out of 20 tasks against 12 competitive baselines.
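The core idea in the abstract is that a forward diffusion process applied jointly to a molecule's 2D bond matrix and 3D coordinates produces an "augmented trajectory" of paired noisy views, which a shared encoder can then be trained to denoise or reconstruct. The following is a minimal sketch of that trajectory construction only, not the paper's actual model: the variance-preserving noise schedule, the toy molecule, and the identity-style loss stand-in are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x, t, betas):
    """Sample x_t ~ q(x_t | x_0) under a variance-preserving diffusion."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    noise = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * noise, noise

# Toy molecule: 5 atoms with 3D coordinates and a symmetric 2D bond matrix.
coords = rng.standard_normal((5, 3))             # 3D conformation (geometry)
adj = (rng.random((5, 5)) < 0.3).astype(float)   # 2D topology, then symmetrize
adj = np.triu(adj, 1)
adj = adj + adj.T

betas = np.linspace(1e-4, 0.02, 100)             # assumed linear noise schedule

# One point on the joint trajectory: both modalities are noised at the same
# timestep t, yielding a paired (2D, 3D) augmented view of the molecule.
t = 50
coords_t, eps_c = forward_diffuse(coords, t, betas)
adj_t, eps_a = forward_diffuse(adj, t, betas)

# Denoising-style self-supervised target: a trained network would predict the
# injected noise from (coords_t, adj_t); here the loss is computed directly on
# the true noise purely to illustrate the objective's shape.
loss = float(np.mean(eps_c ** 2) + np.mean(eps_a ** 2))
print(coords_t.shape, adj_t.shape, loss > 0)
```

In a real pretraining loop, sampling many timesteps `t` per molecule turns each (topology, geometry) pair into a whole trajectory of correlated augmented views, which is what lets the encoder learn structure common to both modalities.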