471 research outputs found
Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design
Generative molecular design has moved from proof-of-concept to real-world
applicability, as marked by the surge in very recent papers reporting
experimental validation. Key challenges in explainability and sample efficiency
present opportunities to enhance generative design to directly optimize
expensive high-fidelity oracles and provide actionable insights to domain
experts. Here, we propose Beam Enumeration to exhaustively enumerate the most
probable sub-sequences from language-based molecular generative models and show
that molecular substructures can be extracted. When coupled with reinforcement
learning, extracted substructures become meaningful, providing a source of
explainability and improving sample efficiency through self-conditioned
generation. Beam Enumeration is generally applicable to any language-based
molecular generative model and notably further improves the performance of the
recently reported Augmented Memory algorithm, which achieved the new
state-of-the-art on the Practical Molecular Optimization benchmark for sample
efficiency. The combined algorithm generates more high reward molecules and
faster, given a fixed oracle budget. Beam Enumeration is the first method to
jointly address explainability and sample efficiency for molecular design
CLP-based protein fragment assembly
The paper investigates a novel approach, based on Constraint Logic
Programming (CLP), to predict the 3D conformation of a protein via fragments
assembly. The fragments are extracted by a preprocessor-also developed for this
work- from a database of known protein structures that clusters and classifies
the fragments according to similarity and frequency. The problem of assembling
fragments into a complete conformation is mapped to a constraint solving
problem and solved using CLP. The constraint-based model uses a medium
discretization degree Ca-side chain centroid protein model that offers
efficiency and a good approximation for space filling. The approach adapts
existing energy models to the protein representation used and applies a large
neighboring search strategy. The results shows the feasibility and efficiency
of the method. The declarative nature of the solution allows to include future
extensions, e.g., different size fragments for better accuracy.Comment: special issue dedicated to ICLP 201
Inductive queries for a drug designing robot scientist
It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it, and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments
Mining Brain Networks using Multiple Side Views for Neurological Disorder Identification
Mining discriminative subgraph patterns from graph data has attracted great
interest in recent years. It has a wide variety of applications in disease
diagnosis, neuroimaging, etc. Most research on subgraph mining focuses on the
graph representation alone. However, in many real-world applications, the side
information is available along with the graph data. For example, for
neurological disorder identification, in addition to the brain networks derived
from neuroimaging data, hundreds of clinical, immunologic, serologic and
cognitive measures may also be documented for each subject. These measures
compose multiple side views encoding a tremendous amount of supplemental
information for diagnostic purposes, yet are often ignored. In this paper, we
study the problem of discriminative subgraph selection using multiple side
views and propose a novel solution to find an optimal set of subgraph features
for graph classification by exploring a plurality of side views. We derive a
feature evaluation criterion, named gSide, to estimate the usefulness of
subgraph patterns based upon side views. Then we develop a branch-and-bound
algorithm, called gMSV, to efficiently search for optimal subgraph features by
integrating the subgraph mining process and the procedure of discriminative
feature selection. Empirical studies on graph classification tasks for
neurological disorders using brain networks demonstrate that subgraph patterns
selected by the multi-side-view guided subgraph selection approach can
effectively boost graph classification performances and are relevant to disease
diagnosis.Comment: in Proceedings of IEEE International Conference on Data Mining (ICDM)
201
An Efficient Algorithm for Enumerating Chordless Cycles and Chordless Paths
A chordless cycle (induced cycle) of a graph is a cycle without any
chord, meaning that there is no edge outside the cycle connecting two vertices
of the cycle. A chordless path is defined similarly. In this paper, we consider
the problems of enumerating chordless cycles/paths of a given graph
and propose algorithms taking time for each chordless cycle/path. In
the existing studies, the problems had not been deeply studied in the
theoretical computer science area, and no output polynomial time algorithm has
been proposed. Our experiments showed that the computation time of our
algorithms is constant per chordless cycle/path for non-dense random graphs and
real-world graphs. They also show that the number of chordless cycles is much
smaller than the number of cycles. We applied the algorithm to prediction of
NMR (Nuclear Magnetic Resonance) spectra, and increased the accuracy of the
prediction
ChEMBL-Likeness Score and Database GDBChEMBL
The generated database GDB17 enumerates 166.4 billion molecules up to 17 atoms of C, N, O, S and halogens following simple rules of chemical stability and synthetic feasibility. However, most molecules in GDB17 are too complex to be considered for chemical synthesis. To address this limitation, we report GDBChEMBL as a subset of GDB17 featuring 10 million molecules selected according to a ChEMBL-likeness score (CLscore) calculated from the frequency of occurrence of circular substructures in ChEMBL, followed by uniform sampling across molecular size, stereocenters and heteroatoms. Compared to the previously reported subsets FDB17 and GDBMedChem selected from GDB17 by fragment-likeness, respectively, medicinal chemistry criteria, our new subset features molecules with higher synthetic accessibility and possibly bioactivity yet retains a broad and continuous coverage of chemical space typical of the entire GDB17. GDBChEMBL is accessible at http://gdb.unibe.ch for download and for browsing using an interactive chemical space map at http://faerun.gdb.tools
- …