45 research outputs found
Designing algorithms to aid discovery by chemical robots
Automated robotic systems have recently become highly efficient, thanks to improved coupling between sensor systems and algorithms; the latter have gained significance with the increase in computing power over the past few decades. However, intelligent automated chemistry platforms for discovery-oriented tasks must be able to cope with the unknown, which is a profoundly hard problem. In this Outlook, we describe how recent advances in the design and application of algorithms, together with the growing amount of available chemical data and with automation and control systems, may enable more productive chemical research and the development of chemical robots capable of targeting discovery. We illustrate this through examples of workflow and data processing with automation and control, and through both well-established and cutting-edge algorithms drawn from recent studies in chemistry. Finally, several algorithms are discussed in relation to chemical robots and chemical intelligence for knowledge discovery.
Deeper Connections between Neural Networks and Gaussian Processes Speed-up Active Learning
Active learning methods for neural networks are usually based on greedy criteria that ultimately yield a single new design point per evaluation. Such an approach requires either heuristics to sample a batch of design points in one active learning iteration, or retraining the neural network after adding each data point, which is computationally inefficient. Moreover, uncertainty estimates for neural networks are sometimes overconfident for points lying far from the training sample. In this work we propose to approximate Bayesian neural networks (BNNs) by Gaussian processes, which allows us to update the uncertainty estimates of predictions efficiently without retraining the neural network, while avoiding overconfident uncertainty predictions for out-of-sample points. In a series of experiments on real-world data, including large-scale problems of chemical and physical modeling, we show the superiority of the proposed approach over state-of-the-art methods.
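The convenience this abstract relies on — updating uncertainty estimates without retraining — follows from the fact that Gaussian-process predictive variance depends only on input locations, never on labels. Below is a minimal NumPy sketch of greedy variance-based batch selection; it is not the paper's BNN-to-GP approximation, and the kernel, length scale, and noise level are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel between the row vectors of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def posterior_variance(X_train, X_pool, noise=1e-3, length_scale=1.0):
    # GP predictive variance at pool points; note that no labels appear.
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_pool, X_train, length_scale)
    Kss = rbf_kernel(X_pool, X_pool, length_scale)
    solve = np.linalg.solve(K, Ks.T)
    return np.diag(Kss) - np.einsum("ij,ji->i", Ks, solve)

def select_batch(X_train, X_pool, batch_size=3):
    # Greedy batch selection: repeatedly pick the pool point with the
    # largest predictive variance, then fold it into the conditioning set.
    # Because the variance is label-free, a whole batch can be chosen
    # in one iteration without retraining any model.
    train = X_train.copy()
    chosen = []
    pool_idx = list(range(len(X_pool)))
    for _ in range(batch_size):
        var = posterior_variance(train, X_pool[pool_idx])
        best = pool_idx[int(np.argmax(var))]
        chosen.append(best)
        train = np.vstack([train, X_pool[best]])
        pool_idx.remove(best)
    return chosen
```

The design point is the conditioning step: after each pick, only the kernel matrix grows, so the uncertainty estimates are refreshed at the cost of a linear solve rather than a retraining run.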
Convolutional architectures for virtual screening
Background: A virtual screening algorithm has to adapt to the different stages of the screening process. Early screening needs to ensure that all bioactive compounds are ranked in the first positions regardless of the number of false positives, while a second screening round aims to increase prediction accuracy. Results: A novel CNN architecture is presented to this aim. It predicts the bioactivity of candidate compounds against CDK1, using a combination of molecular fingerprints as their vector representation, and has been trained to achieve good results in both enrichment factor and accuracy across the different screening modes (98.55% accuracy in active-only selection, and 98.88% in high-precision discrimination). Conclusion: The proposed architecture outperforms state-of-the-art ML approaches, and some interesting insights on molecular fingerprints are derived.
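The enrichment factor cited above measures how much richer in actives the top-ranked fraction of a screened library is than the library as a whole. A minimal, dependency-free sketch (the function name and signature are illustrative, not from the paper):

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Ratio of the active rate in the top-scoring `fraction` of the
    library to the active rate in the whole library. `labels` holds
    1 for actives and 0 for inactives."""
    n = len(scores)
    n_top = max(1, int(round(n * fraction)))
    # Rank compounds by predicted score, best first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = order[:n_top]
    hit_rate_top = sum(labels[i] for i in top) / n_top
    hit_rate_all = sum(labels) / n
    return hit_rate_top / hit_rate_all
```

An EF of 1.0 means the model ranks no better than random selection; the theoretical maximum is 1 / (active rate), reached when every top-ranked compound is active.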
Effect of missing data on multitask prediction methods
There has been growing interest in multitask prediction in chemoinformatics, helped by the increasing use of deep neural networks in this field. The technique is applied to multitarget data sets, in which compounds have been tested against different targets, with the aim of developing models that predict a profile of biological activities for a given compound. However, multitarget data sets tend to be sparse; i.e., not all compound-target combinations have experimental values. There has been little research on the effect of missing data on the performance of multitask methods. We used two complete data sets to simulate sparseness by removing data from the training set, comparing different schemes for removing the data. These sparse sets were used to train two different multitask methods: deep neural networks and Macau, a Bayesian probabilistic matrix factorization technique. Results from both methods were remarkably similar and showed that the performance decrease caused by missing data is at first small, but accelerates once large amounts of data have been removed. This work provides a first approximation for assessing how much data is required to produce good performance in multitask prediction exercises.
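The simulation protocol described — start from a complete compound × target matrix, delete training entries, and evaluate only where labels exist — can be sketched in a few lines of NumPy. Uniform random removal shown here is only one of several removal schemes a study like this would compare, and the function names are illustrative:

```python
import numpy as np

def sparsify(Y, fraction_missing, seed=0):
    # Simulate a sparse multitarget matrix by masking a random fraction
    # of the (compound, target) entries with NaN.
    rng = np.random.default_rng(seed)
    Y_sparse = Y.astype(float).copy()
    mask = rng.random(Y.shape) < fraction_missing
    Y_sparse[mask] = np.nan
    return Y_sparse

def masked_rmse(Y_true, Y_pred):
    # Evaluate only where labels are present; NaN marks a missing entry.
    observed = ~np.isnan(Y_true)
    diff = Y_pred[observed] - Y_true[observed]
    return float(np.sqrt(np.mean(diff**2)))
```

Sweeping `fraction_missing` from 0 toward 1 and retraining at each level is what lets such a study trace the degradation curve — small losses at first, accelerating as the matrix empties out.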
Computational Experimentation
Experimentation conjures images of laboratories and equipment in biotechnology, chemistry, materials science, and pharmaceuticals. Yet modern-day experimentation is not limited to chemical synthesis; it is increasingly computational. Researchers in the unpredictable arts can experiment on the functions, properties, reactions, and structures of chemical compounds with highly accurate computational techniques. These computational capabilities challenge the enablement and utility patentability requirements. The patent statute requires that the inventor explain how to make and use the invention without undue experimentation, and that the invention have at least substantial and specific utility. These patentability requirements do not align with computational research capabilities, which allow inventors to file earlier patent applications, develop prophetic examples, and provide supporting disclosure in the patent specification without necessarily conducting traditional, laboratory-based experiments. This Article explores the contours and applications of computational capabilities on patentability, proposes reforms to the utility doctrine and to patent examination, responds to potential critiques of the proposed reforms, and analyzes innovation policy in the unpredictable arts. In light of increasing computational experimentation, this Article recommends strengthening the utility requirement in order to prevent a state of patent law in which enablement is subsumed into utility.
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated and hence typically small, the lack of datasets with labeled features, and of codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundation models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when models are also trained on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.
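With over 3000 sparsely defined tasks, most (molecule, task) pairs carry no label, so a multi-task training objective must skip missing entries rather than impute them. A common pattern is a per-task masked loss averaged over the tasks that have at least one label, so densely labeled tasks do not drown out sparse ones. The NumPy sketch below is an illustrative version of that pattern, not Graphium's actual implementation:

```python
import numpy as np

def multitask_masked_loss(Y_true, Y_pred):
    # Mean over tasks of the per-task MSE, skipping (molecule, task)
    # entries marked missing with NaN, and skipping tasks that have no
    # labels at all in this batch.
    per_task = []
    for t in range(Y_true.shape[1]):
        observed = ~np.isnan(Y_true[:, t])
        if observed.any():
            err = Y_pred[observed, t] - Y_true[observed, t]
            per_task.append(np.mean(err**2))
    return float(np.mean(per_task))
```

Averaging per task rather than per label is a deliberate choice: with label counts varying by orders of magnitude across tasks, a flat per-label average would let the largest tasks dominate the gradient.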
Deep Learning-Based Molecular Property Prediction
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, August 2021.
Deep learning (DL) has advanced various fields, such as vision tasks, language processing, and the natural sciences. Recently, several remarkable results in computational chemistry have been achieved by DL-based methods. However, chemical systems consist of diverse elements and their interactions. As a result, it is not trivial to predict chemical properties, which are determined by intrinsically complicated factors. Consequently, conventional approaches usually depend on tremendous amounts of calculation for chemical simulations or predictions, which is cost-intensive and time-consuming.
To address these issues, we studied deep learning for computational chemistry, focusing on chemical property prediction from molecular structure representations. A molecular structure is a complex of atoms and their arrangements, and a molecular property is determined by the interactions among all of these components. Therefore, molecular structural representations are the key factor in chemical property prediction tasks. In particular, we explored public property prediction tasks in pharmacology, organic chemistry, and quantum chemistry. Molecular structures can be described as categorical sequences or as geometric graphs; we utilized both representational formats for prediction tasks and achieved competitive model performance. Our studies verified that the molecular representation is essential for various tasks in chemistry, and that using the appropriate type of neural network for each representation type is significant for model predictability.
1 Introduction 1
1.1 Motivation 1
1.2 Contents of dissertation 3
2 Background 8
2.1 Deep learning in Chemistry 8
2.2 Deep Learning for molecular property prediction 9
2.3 Approaches for molecular property prediction 12
2.3.1 Sequential modeling for molecular string 12
2.3.2 Structural modeling for molecular graph 15
2.4 Tasks on molecular properties 20
2.4.1 Pharmacological tasks 20
2.4.2 Biophysical and physiological tasks 21
2.4.3 Quantum-mechanical tasks 21
3 Application I. Drug class classification 23
3.1 Introduction 23
3.2 Proposed method 26
3.2.1 Preprocessing 27
3.2.2 Model architecture 27
3.2.3 Training and evaluation 30
3.3 Experimental results 31
3.4 Discussion 37
4 Application II. Biophysical property prediction 39
4.1 Introduction 39
4.2 Proposed method 41
4.2.1 Preprocessing 41
4.2.2 Model architecture 42
4.2.3 Training and evaluation 45
4.3 Experimental results 47
4.4 Discussion 53
5 Application III. Quantum-mechanical property prediction 55
5.1 Introduction 55
5.2 Proposed method 57
5.2.1 Preprocessing 59
5.2.2 Model architecture 62
5.2.3 Training and evaluation 67
5.3 Experimental results 69
5.4 Discussion 70
6 Conclusion 74
Bibliography 76
Abstract (in Korean) 93
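The dissertation abstract above notes that molecular structures can be described either as categorical sequences or as geometric graphs. For the sequence view, the usual starting point is tokenizing a SMILES string into its categorical symbols. A minimal sketch follows; the token patterns are a simplified assumption, not a complete SMILES grammar:

```python
import re

# Simplified SMILES token patterns, tried in order: bracket atoms,
# common two-letter elements, single-letter organic-subset atoms
# (upper case) and aromatic atoms (lower case), bond/branch symbols,
# and ring-closure digits (two-digit form %NN, then single digits).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|[BCNOPSFIbcnops]|[=#/\\+\-()]|%\d{2}|\d"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against silently dropping characters the pattern misses.
    assert "".join(tokens) == smiles, "unrecognized characters in input"
    return tokens
```

The resulting token sequence is what a sequential model (e.g. an RNN or Transformer encoder) would embed and consume, whereas the graph view would instead parse the same string into atoms and bonds.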