Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening
This work introduces a number of algebraic topology approaches, such as
multicomponent persistent homology, multi-level persistent homology and
electrostatic persistence for the representation, characterization, and
description of small molecules and biomolecular complexes. Multicomponent
persistent homology retains critical chemical and biological information during
the topological simplification of biomolecular geometric complexity.
Multi-level persistent homology enables a tailored topological description of
inter- and/or intra-molecular interactions of interest. Electrostatic
persistence incorporates partial charge information into topological
invariants. These topological methods are paired with Wasserstein distance to
characterize similarities between molecules and are further integrated with a
variety of machine learning algorithms, including k-nearest neighbors, ensemble
of trees, and deep convolutional neural networks, to manifest their descriptive
and predictive powers for chemical and biological problems. Extensive numerical
experiments involving more than 4,000 protein-ligand complexes from the PDBBind
database and nearly 100,000 ligands and decoys in the DUD database are performed
to test respectively the scoring power and the virtual screening power of the
proposed topological approaches. It is demonstrated that the present approaches
outperform the modern machine learning based methods in protein-ligand binding
affinity predictions and ligand-decoy discrimination.
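The abstract above pairs topological descriptors with the Wasserstein distance to compare molecules. As a rough illustration only, the sketch below computes a q-Wasserstein distance between two tiny persistence diagrams by brute-force matching. The function names, the L-infinity ground metric, and the exhaustive search over permutations are assumptions made for this example; they are not taken from the paper, and the approach is only practical for a handful of points.

```python
from itertools import permutations


def diagonal_distance(point):
    """L-infinity distance from a (birth, death) point to the diagonal."""
    birth, death = point
    return (death - birth) / 2.0


def wasserstein_distance(diagram_a, diagram_b, q=1):
    """Brute-force q-Wasserstein distance between two small persistence diagrams.

    Each diagram is a list of (birth, death) pairs. Both diagrams are padded
    with "diagonal slots" so that any point may be matched to the diagonal.
    Hypothetical helper for illustration; cost grows factorially with size.
    """
    n, m = len(diagram_a), len(diagram_b)
    size = n + m  # n points + m diagonal slots vs. m points + n diagonal slots

    def cost(i, j):
        if i < n and j < m:  # point matched to point
            (b1, d1), (b2, d2) = diagram_a[i], diagram_b[j]
            return max(abs(b1 - b2), abs(d1 - d2)) ** q
        if i < n:  # point of diagram_a matched to the diagonal
            return diagonal_distance(diagram_a[i]) ** q
        if j < m:  # point of diagram_b matched to the diagonal
            return diagonal_distance(diagram_b[j]) ** q
        return 0.0  # diagonal matched to diagonal

    best = min(
        sum(cost(i, perm[i]) for i in range(size))
        for perm in permutations(range(size))
    )
    return best ** (1.0 / q)
```

For diagrams of realistic size, an assignment solver or a dedicated topological data analysis library would replace the exhaustive search; the matching cost structure stays the same.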
A Multi-Armed Bandit to Smartly Select a Training Set from Big Medical Data
With the availability of big medical image data, the selection of an adequate
training set is becoming more important to address the heterogeneity of
different datasets. Simply including all the data not only incurs high
processing costs but can even harm the prediction. We formulate the smart and
efficient selection of a training dataset from big medical image data as a
multi-armed bandit problem, solved by Thompson sampling. Our method assumes
that image features are not available at the time of the selection of the
samples, and therefore relies only on meta information associated with the
images. Our strategy simultaneously exploits data sources with high chances of
yielding useful samples and explores new data regions. For our evaluation, we
focus on the application of estimating the age from a brain MRI. Our results on
7,250 subjects from 10 datasets show that our approach leads to higher accuracy
while only requiring a fraction of the training data.
Comment: MICCAI 2017 Proceedings
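The selection strategy described above, a multi-armed bandit solved by Thompson sampling, can be sketched in a few lines. This is a minimal sketch under stated assumptions: each arm is a hypothetical data source, rewards are binary ("the drawn sample was useful" or not), and a Beta(1, 1) prior is placed on each arm. The names `thompson_select` and `reward_fn` are invented for this example, and the paper's actual arms, reward definition, and priors may differ.

```python
import random


def thompson_select(sources, n_rounds, reward_fn, seed=0):
    """Beta-Bernoulli Thompson sampling over a list of data sources.

    reward_fn(source, rng) is a hypothetical callback returning True when a
    sample drawn from `source` turned out to be useful, False otherwise.
    """
    rng = random.Random(seed)
    # Beta(1, 1) prior per arm: alpha counts successes, beta counts failures.
    alpha = {s: 1.0 for s in sources}
    beta = {s: 1.0 for s in sources}
    picks = []
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its posterior.
        draws = {s: rng.betavariate(alpha[s], beta[s]) for s in sources}
        # Play the arm whose sampled rate is highest; uncertain arms still
        # get picked occasionally, which provides the exploration.
        chosen = max(draws, key=draws.get)
        if reward_fn(chosen, rng):
            alpha[chosen] += 1.0
        else:
            beta[chosen] += 1.0
        picks.append(chosen)
    return picks, alpha, beta
```

Sampling from the posterior rather than using its mean is what lets the procedure exploit promising sources while still exploring the others, matching the exploit/explore behaviour the abstract describes.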
A convolutional autoencoder approach for mining features in cellular electron cryo-tomograms and weakly supervised coarse segmentation
Cellular electron cryo-tomography enables the 3D visualization of cellular
organization in the near-native state and at submolecular resolution. However,
the contents of cellular tomograms are often complex, making it difficult to
automatically isolate different in situ cellular components. In this paper, we
propose a convolutional autoencoder-based unsupervised approach to provide a
coarse grouping of 3D small subvolumes extracted from tomograms. We demonstrate
that the autoencoder can be used for efficient and coarse characterization of
features of macromolecular complexes and surfaces, such as membranes. In
addition, the autoencoder can be used to detect non-cellular features related
to sample preparation and data collection, such as carbon edges from the grid
and tomogram boundaries. The autoencoder is also able to detect patterns that
may indicate spatial interactions between cellular components. Furthermore, we
demonstrate that our autoencoder can be used for weakly supervised semantic
segmentation of cellular components, requiring a very small amount of manual
annotation.
Comment: Accepted by Journal of Structural Biology
Methods for Analysing Endothelial Cell Shape and Behaviour in Relation to the Focal Nature of Atherosclerosis
The aim of this thesis is to develop automated methods for the analysis of the
spatial patterns, and the functional behaviour of endothelial cells, viewed under
microscopy, with applications to the understanding of atherosclerosis.
Initially, a radial search approach to segmentation was attempted in order to
trace the cell and nuclei boundaries using a maximum likelihood algorithm; it
was found inadequate to detect the weak cell boundaries present in the available
data. A parametric cell shape model was then introduced to fit an equivalent
ellipse to the cell boundary by matching phase-invariant orientation fields of the
image and a candidate cell shape. This approach succeeded on good quality
images, but failed on images with weak cell boundaries. Finally, a support
vector machine-based method, relying on a rich set of visual features and a
small but high quality training dataset, was found to work well on large numbers
of cells even in the presence of strong intensity variations and imaging noise.
Using the segmentation results, several standard shear-stress dependent parameters
of cell morphology were studied, and evidence for similar behaviour
in some cell shape parameters was obtained in in-vivo cells and their nuclei.
Nuclear and cell orientations around immature and mature aortas were broadly
similar, suggesting that the pattern of flow direction near the wall stayed approximately
constant with age. The relation was less strong for the cell and
nuclear length-to-width ratios.
Two novel shape analysis approaches were attempted to find other properties
of cell shape which could be used to annotate or characterise patterns, since a
wide variability in cell and nuclear shapes was observed which did not appear
to fit the standard parameterisations. Although no firm conclusions can yet be
drawn, the work lays the foundation for future studies of cell morphology.
To draw inferences about patterns in the functional response of cells to flow,
which may play a role in the progression of disease, single-cell analysis was performed
using calcium-sensitive fluorescence probes. Calcium transient rates were
found to change with flow, but more importantly, local patterns of synchronisation
in multi-cellular groups were discernible and appear to change with flow.
The patterns suggest a new functional mechanism in flow-mediation of cell-cell
calcium signalling.
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed.
Comment: 13 figures, 35 references