45 research outputs found

    Designing algorithms to aid discovery by chemical robots

    Recently, automated robotic systems have become very efficient, thanks to improved coupling between sensor systems and algorithms; the latter have been gaining significance as computing power has increased over the past few decades. However, intelligent automated chemistry platforms for discovery-oriented tasks need to be able to cope with the unknown, which is a profoundly hard problem. In this Outlook, we describe how recent advances in the design and application of algorithms, coupled with the increased amount of chemical data available and with automation and control systems, may allow more productive chemical research and the development of chemical robots able to target discovery. This is shown through examples of workflow and data processing with automation and control, and through the use of both well-established and cutting-edge algorithms illustrated using recent studies in chemistry. Finally, several algorithms are presented in relation to chemical robots and chemical intelligence for knowledge discovery.

    Deeper Connections between Neural Networks and Gaussian Processes Speed-up Active Learning

    Active learning methods for neural networks are usually based on greedy criteria, which ultimately give a single new design point for evaluation. Such an approach requires either heuristics to sample a batch of design points in one active learning iteration, or retraining the neural network after adding each data point, which is computationally inefficient. Moreover, uncertainty estimates for neural networks are sometimes overconfident for points lying far from the training sample. In this work we propose to approximate Bayesian neural networks (BNN) by Gaussian processes, which allows us to update the uncertainty estimates of predictions efficiently without retraining the neural network, while avoiding overconfident uncertainty prediction for out-of-sample points. In a series of experiments on real-world data, including large-scale problems of chemical and physical modeling, we show the superiority of the proposed approach over state-of-the-art methods.
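    The batch-selection idea in this abstract can be illustrated with a minimal sketch. It assumes a plain Gaussian process with an RBF kernel, not the neural-network-derived approximation the paper proposes; the function names, the kernel, and the toy data are all hypothetical. It relies on one standard property: the Gaussian process posterior variance depends only on input locations, so it can be refreshed after each selected point without waiting for labels or retraining anything.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def posterior_variance(x_train, x_pool, noise=1e-6):
    """GP posterior variance at each pool point (prior variance is 1)."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_pt = rbf_kernel(x_pool, x_train)
    solved = np.linalg.solve(k_tt, k_pt.T)  # shape (n_train, n_pool)
    return 1.0 - np.sum(k_pt * solved.T, axis=1)

def select_batch(x_train, x_pool, batch_size):
    """Greedily pick pool indices with the largest posterior variance.

    After each pick the chosen input is added to the conditioning set,
    so the variance is updated without any labels or model retraining.
    """
    chosen = []
    x_current = x_train.copy()
    for _ in range(batch_size):
        var = posterior_variance(x_current, x_pool)
        if chosen:
            var[chosen] = -np.inf  # never pick the same point twice
        idx = int(np.argmax(var))
        chosen.append(idx)
        x_current = np.vstack([x_current, x_pool[idx]])
    return chosen

# Hypothetical 1-D toy data: one labeled point and four candidates.
x_train = np.array([[0.0]])
x_pool = np.array([[0.1], [5.0], [5.1], [10.0]])
chosen = select_batch(x_train, x_pool, batch_size=2)
print(chosen)
```

    In this toy setting the candidate next to the labeled point is skipped, and the two picks do not land on the near-duplicate pair at 5.0 and 5.1, which is the behavior a greedy single-point criterion cannot provide without retraining between picks.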

    Convolutional architectures for virtual screening

    Background: A virtual screening algorithm has to adapt to the different stages of the screening process. Early screening needs to ensure that all bioactive compounds are ranked in the first positions regardless of the number of false positives, while a second screening round is aimed at increasing the prediction accuracy. Results: A novel CNN architecture is presented to this aim, which predicts the bioactivity of candidate compounds on CDK1 using a combination of molecular fingerprints as their vector representation, and has been trained to achieve good results in terms of both enrichment factor and accuracy in different screening modes (98.55% accuracy in active-only selection, and 98.88% in high-precision discrimination). Conclusion: The proposed architecture outperforms state-of-the-art ML approaches, and some interesting insights on molecular fingerprints are derived.
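    The enrichment factor mentioned above is a standard virtual-screening metric: the rate of actives among the top-ranked fraction divided by the rate of actives in the whole library. The sketch below assumes that common definition; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def enrichment_factor(scores, labels, top_fraction):
    """Enrichment factor of a ranking at the given top fraction.

    scores: predicted activity scores (higher means more likely active).
    labels: 1 for active compounds, 0 for inactive ones.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    total_actives = labels.sum()
    if total_actives == 0:
        raise ValueError("enrichment factor is undefined without actives")
    n_top = max(1, int(np.ceil(top_fraction * len(scores))))
    order = np.argsort(-scores, kind="stable")  # best-scored compounds first
    hits_top = labels[order[:n_top]].sum()
    return (hits_top / n_top) / (total_actives / len(labels))

# Hypothetical library of 10 compounds, 2 of them active and both ranked first.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef = enrichment_factor(scores, labels, top_fraction=0.2)
print(ef)
```

    A value of 1 corresponds to a random ranking, so early-stage screening as described in the abstract aims for values well above 1 at small top fractions.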

    Effect of missing data on multitask prediction methods

    There has been a growing interest in multitask prediction in chemoinformatics, helped by the increasing use of deep neural networks in this field. This technique is applied to multitarget data sets, where compounds have been tested against different targets, with the aim of developing models to predict a profile of biological activities for a given compound. However, multitarget data sets tend to be sparse; i.e., not all compound-target combinations have experimental values. There has been little research on the effect of missing data on the performance of multitask methods. We have used two complete data sets to simulate sparseness by removing data from the training set. Different models to remove the data were compared. These sparse sets were used to train two different multitask methods: deep neural networks and Macau, a Bayesian probabilistic matrix factorization technique. Results from both methods were remarkably similar and showed that the performance decrease caused by missing data is small at first and accelerates once large amounts of data have been removed. This work provides a first approximation for assessing how much data is required to produce good performance in multitask prediction exercises.
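    Simulating sparseness on a complete compound-by-target matrix can be sketched as follows. The abstract says several removal models were compared but does not specify them here, so this example assumes the simplest one, removal uniformly at random; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def mask_training_labels(y, fraction_removed, seed=0):
    """Return a copy of y with a random fraction of entries set to NaN.

    y: complete (compounds x targets) activity matrix.
    fraction_removed: share of entries to hide, between 0 and 1.
    """
    if not 0.0 <= fraction_removed <= 1.0:
        raise ValueError("fraction_removed must lie in [0, 1]")
    rng = np.random.default_rng(seed)
    y_sparse = np.array(y, dtype=float, copy=True)
    n_remove = int(round(fraction_removed * y_sparse.size))
    # Assumption: entries are removed uniformly at random, without replacement.
    flat_idx = rng.choice(y_sparse.size, size=n_remove, replace=False)
    y_sparse.flat[flat_idx] = np.nan
    return y_sparse

# Hypothetical complete matrix: 5 compounds tested against 4 targets.
y_full = np.arange(20, dtype=float).reshape(5, 4)
y_sparse = mask_training_labels(y_full, fraction_removed=0.5)
print(int(np.isnan(y_sparse).sum()), "of", y_sparse.size, "labels hidden")
```

    Repeating the masking at increasing fractions and retraining a multitask model on each sparse matrix gives the kind of performance-versus-sparsity curve the abstract describes.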

    Computational Experimentation

    Experimentation conjures images of laboratories and equipment in biotechnology, chemistry, materials science, and pharmaceuticals. Yet modern-day experimentation is not limited to chemical synthesis, but is increasingly computational. Researchers in the unpredictable arts can experiment upon the functions, properties, reactions, and structures of chemical compounds with highly accurate computational techniques. These computational capabilities challenge the enablement and utility patentability requirements. The patent statute requires that the inventor explain how to make and use the invention without undue experimentation and that the invention have at least substantial and specific utility. These patentability requirements do not align with computational research capabilities, which allow inventors to file earlier patent applications, develop prophetic examples, and provide supporting disclosure in the patent specification without necessarily conducting traditional, laboratory-based experiments. This Article explores the contours and applications of computational capabilities on patentability, proposes reforms to the utility doctrine and to patent examination, responds to potential critiques of the proposed reforms, and analyzes innovation policy in the unpredictable arts. In light of increasing computational experimentation, this Article recommends strengthening the utility requirement in order to prevent a state of patent law in which enablement is subsumed into utility.

    Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

    Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated and hence typically small, the lack of datasets with labeled features, and of codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when the model is also trained on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.

    A Study on Deep Learning-Based Molecular Property Prediction

    Doctoral dissertation, Seoul National University Graduate School, College of Natural Sciences, Interdisciplinary Program in Bioinformatics, August 2021. Sungroh Yoon.
    Deep learning (DL) has advanced in various fields, such as vision tasks, language processing, and the natural sciences. Recently, several remarkable results in computational chemistry have been achieved with DL-based methods. However, a chemical system consists of diverse elements and their interactions, so it is not trivial to predict chemical properties that are determined by intrinsically complicated factors. Consequently, conventional approaches usually depend on heavy calculations for chemical simulations or predictions, which are cost-intensive and time-consuming. To address these issues, we studied deep learning for computational chemistry, focusing on chemical property prediction from molecular structure representations. A molecular structure is a complex of atoms and their arrangements, and a molecular property is determined by the interactions among all of these components. Therefore, molecular structural representations are the key factor in chemical property prediction tasks. In particular, we explored public property prediction tasks in pharmacology, organic chemistry, and quantum chemistry. Molecular structures can be described as categorical sequences or as geometric graphs. We used both representational formats for prediction tasks and achieved competitive model performance. Our studies verified that the molecular representation is essential for various tasks in chemistry, and that choosing a type of neural network appropriate to the representation type matters significantly for predictive performance.
    Contents:
    1 Introduction: 1.1 Motivation; 1.2 Contents of dissertation
    2 Background: 2.1 Deep learning in chemistry; 2.2 Deep learning for molecular property prediction; 2.3 Approaches for molecular property prediction (2.3.1 Sequential modeling for molecular string; 2.3.2 Structural modeling for molecular graph); 2.4 Tasks on molecular properties (2.4.1 Pharmacological tasks; 2.4.2 Biophysical and physiological tasks; 2.4.3 Quantum-mechanical tasks)
    3 Application I. Drug class classification: 3.1 Introduction; 3.2 Proposed method (3.2.1 Preprocessing; 3.2.2 Model architecture; 3.2.3 Training and evaluation); 3.3 Experimental results; 3.4 Discussion
    4 Application II. Biophysical property prediction: 4.1 Introduction; 4.2 Proposed method (4.2.1 Preprocessing; 4.2.2 Model architecture; 4.2.3 Training and evaluation); 4.3 Experimental results; 4.4 Discussion
    5 Application III. Quantum-mechanical property prediction: 5.1 Introduction; 5.2 Proposed method (5.2.1 Preprocessing; 5.2.2 Model architecture; 5.2.3 Training and evaluation); 5.3 Experimental results; 5.4 Discussion
    6 Conclusion
    Bibliography
    Abstract (in Korean)