Tree Edit Distance Learning via Adaptive Symbol Embeddings
Metric learning aims to improve classification accuracy by learning a
distance measure that brings data points from the same class closer together
and pushes data points from different classes further apart. Recent research
has demonstrated that metric learning approaches can also be applied to trees,
such as molecular structures, abstract syntax trees of computer programs, or
syntax trees of natural language, by learning the cost function of an edit
distance, i.e. the costs of replacing, deleting, or inserting nodes in a tree.
However, learning such costs directly may yield an edit distance which violates
metric axioms, is challenging to interpret, and may not generalize well. In
this contribution, we propose a novel metric learning approach for trees which
we call embedding edit distance learning (BEDL) and which learns an edit
distance indirectly by embedding the tree nodes as vectors, such that the
Euclidean distance between those vectors supports class discrimination. We
learn such embeddings by reducing the distance to prototypical trees from the
same class and increasing the distance to prototypical trees from different
classes. In our experiments, we show that BEDL improves upon the
state-of-the-art in metric learning for trees on six benchmark data sets,
ranging from computer science over biomedical data to a natural-language
processing data set containing over 300,000 nodes.
Comment: Paper at the International Conference on Machine Learning (2018),
2018-07-10 to 2018-07-15 in Stockholm, Sweden
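The core idea of deriving edit costs from node embeddings can be illustrated with a minimal sketch. This is not the paper's BEDL method: it uses a simple sequence edit distance over node labels rather than a full tree edit distance, and the two-dimensional embeddings are made up for illustration (BEDL learns them from prototypical trees).

```python
import numpy as np

# Hypothetical fixed embeddings for three node symbols; in BEDL these
# vectors would be learned so that distances support class discrimination.
emb = {"a": np.array([1.0, 0.0]),
       "b": np.array([0.0, 1.0]),
       "c": np.array([2.0, 0.0])}

def rep_cost(x, y):
    # Cost of replacing symbol x with y: Euclidean embedding distance.
    return float(np.linalg.norm(emb[x] - emb[y]))

def indel_cost(x):
    # Cost of deleting or inserting x: distance to the origin,
    # treated here as the embedding of the "empty" symbol.
    return float(np.linalg.norm(emb[x]))

def edit_distance(s, t):
    # Standard dynamic program over two sequences of node symbols.
    m, n = len(s), len(t)
    D = np.zeros((m + 1, n + 1))
    for i in range(1, m + 1):
        D[i, 0] = D[i - 1, 0] + indel_cost(s[i - 1])
    for j in range(1, n + 1):
        D[0, j] = D[0, j - 1] + indel_cost(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = min(D[i - 1, j] + indel_cost(s[i - 1]),      # delete
                          D[i, j - 1] + indel_cost(t[j - 1]),      # insert
                          D[i - 1, j - 1] + rep_cost(s[i - 1], t[j - 1]))
    return float(D[m, n])
```

Because replacement costs are Euclidean distances, the resulting edit distance automatically satisfies symmetry and the triangle inequality, which is exactly the metric-axiom issue the abstract raises about learning costs directly.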
Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics
Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC on 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc
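The normal-gamma conjugacy the abstract relies on has a standard closed-form posterior update, sketched below. This is only an illustration of that conjugate update, not the GBHC implementation; the hyperparameter names (mu0, kappa0, alpha0, beta0) follow common textbook notation and the default values are arbitrary.

```python
import numpy as np

def normal_gamma_posterior(x, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Conjugate posterior update for a Gaussian with unknown mean and
    precision under a normal-gamma prior NG(mu0, kappa0, alpha0, beta0)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    ss = ((x - xbar) ** 2).sum()          # within-sample sum of squares
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * ss
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return mu_n, kappa_n, alpha_n, beta_n
```

Because the posterior stays in the normal-gamma family, marginal likelihoods of candidate cluster merges can be evaluated analytically, which is what makes the Bayesian model selection in BHC-style algorithms tractable.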
The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.
Comment: 41 pages, 4 figures
Projected t-SNE for batch correction
Biomedical research often produces high-dimensional data confounded by batch
effects such as systematic experimental variations, different protocols and
subject identifiers. Without proper correction, low-dimensional representation
of high-dimensional data might encode and reproduce the same systematic
variations observed in the original data, and compromise the interpretation of
the results. In this article, we propose a novel procedure to remove batch
effects from low-dimensional embeddings obtained with t-SNE dimensionality
reduction. The proposed methods are based on linear algebra and constrained
optimization, leading to efficient algorithms and fast computation in many
high-dimensional settings. Results on artificial single-cell transcription
profiling data show that the proposed procedure successfully removes multiple
batch effects from t-SNE embeddings, while retaining fundamental information on
cell types. When applied to single-cell gene expression data to investigate
mouse medulloblastoma, the proposed method successfully removes batch effects
related to mouse identifiers and the date of the experiment, while preserving
clusters of oligodendrocytes, astrocytes, endothelial cells, and microglia,
which are expected to lie in the stroma within or adjacent to the tumors.
Comment: 16 pages, 3 figures
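The linear-algebra intuition behind this kind of correction can be sketched in a few lines. This is a deliberately simplified stand-in for the paper's constrained-optimization procedure: projecting the embedding onto the orthogonal complement of the batch-indicator subspace, which for a one-hot batch design amounts to subtracting each batch's mean from its points.

```python
import numpy as np

def remove_batch_means(Y, batches):
    """Project a low-dimensional embedding Y (n_points x n_dims) onto the
    orthogonal complement of the batch-indicator subspace, i.e. center
    each batch at the origin so no linear batch signal remains."""
    Y = np.asarray(Y, dtype=float)
    batches = np.asarray(batches)
    out = Y.copy()
    for b in np.unique(batches):
        mask = (batches == b)
        out[mask] -= out[mask].mean(axis=0)  # subtract this batch's centroid
    return out
```

After the projection, a linear classifier can no longer separate batches by their means, while within-batch geometry (e.g. cell-type clusters) is left untouched up to a per-batch translation.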
Augment to Interpret: Unsupervised and Inherently Interpretable Graph Embeddings
Unsupervised learning allows us to leverage unlabelled data, which has become
abundantly available, and to create embeddings that are usable on a variety of
downstream tasks. However, the typical lack of interpretability of unsupervised
representation learning has become a limiting factor with regard to recent
transparent-AI regulations. In this paper, we study graph representation
learning and we show that data augmentation that preserves semantics can be
learned and used to produce interpretations. Our framework, which we named
INGENIOUS, creates inherently interpretable embeddings and eliminates the need
for costly additional post-hoc analysis. We also introduce additional metrics
addressing the lack of formalism and metrics in the understudied area of
unsupervised-representation learning interpretability. Our results are
supported by an experimental study applied to both graph-level and node-level
tasks and show that interpretable embeddings provide state-of-the-art
performance on downstream tasks.