Tree Edit Distance Learning via Adaptive Symbol Embeddings
Metric learning aims to improve classification accuracy by learning a
distance measure that brings data points from the same class closer together
and pushes data points from different classes further apart. Recent research
has demonstrated that metric learning approaches can also be applied to trees,
such as molecular structures, abstract syntax trees of computer programs, or
syntax trees of natural language, by learning the cost function of an edit
distance, i.e. the costs of replacing, deleting, or inserting nodes in a tree.
However, learning such costs directly may yield an edit distance which violates
metric axioms, is challenging to interpret, and may not generalize well. In
this contribution, we propose a novel metric learning approach for trees which
we call embedding edit distance learning (BEDL) and which learns an edit
distance indirectly by embedding the tree nodes as vectors, such that the
Euclidean distance between those vectors supports class discrimination. We
learn such embeddings by reducing the distance to prototypical trees from the
same class and increasing the distance to prototypical trees from different
classes. In our experiments, we show that BEDL improves upon the
state-of-the-art in metric learning for trees on six benchmark data sets,
ranging from computer science over biomedical data to a natural-language
processing data set containing over 300,000 nodes.
Comment: Paper at the International Conference on Machine Learning (2018),
2018-07-10 to 2018-07-15 in Stockholm, Sweden
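The core idea of deriving edit costs from node embeddings can be illustrated with a minimal sketch. This is not the paper's BEDL method: it uses a simple sequence edit distance over node labels rather than a full tree edit distance, and the two-dimensional embeddings are made up for illustration (BEDL learns them from prototypical trees).

```python
import numpy as np

# Hypothetical fixed embeddings for three node symbols; in BEDL these
# vectors would be learned so that distances support class discrimination.
emb = {"a": np.array([1.0, 0.0]),
       "b": np.array([0.0, 1.0]),
       "c": np.array([2.0, 0.0])}

def rep_cost(x, y):
    # Cost of replacing symbol x with y: Euclidean embedding distance.
    return float(np.linalg.norm(emb[x] - emb[y]))

def indel_cost(x):
    # Cost of deleting or inserting x: distance to the origin,
    # treated here as the embedding of the "empty" symbol.
    return float(np.linalg.norm(emb[x]))

def edit_distance(s, t):
    # Standard dynamic program over two sequences of node symbols.
    m, n = len(s), len(t)
    D = np.zeros((m + 1, n + 1))
    for i in range(1, m + 1):
        D[i, 0] = D[i - 1, 0] + indel_cost(s[i - 1])
    for j in range(1, n + 1):
        D[0, j] = D[0, j - 1] + indel_cost(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = min(D[i - 1, j] + indel_cost(s[i - 1]),      # delete
                          D[i, j - 1] + indel_cost(t[j - 1]),      # insert
                          D[i - 1, j - 1] + rep_cost(s[i - 1], t[j - 1]))
    return float(D[m, n])
```

Because replacement costs are Euclidean distances, the resulting edit distance automatically satisfies symmetry and the triangle inequality, which is exactly the metric-axiom issue the abstract raises about learning costs directly.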
Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics
Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC on 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc
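The normal-gamma conjugacy the abstract relies on has a standard closed-form posterior update, sketched below. This is only an illustration of that conjugate update, not the GBHC implementation; the hyperparameter names (mu0, kappa0, alpha0, beta0) follow common textbook notation and the default values are arbitrary.

```python
import numpy as np

def normal_gamma_posterior(x, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Conjugate posterior update for a Gaussian with unknown mean and
    precision under a normal-gamma prior NG(mu0, kappa0, alpha0, beta0)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    ss = ((x - xbar) ** 2).sum()          # within-sample sum of squares
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * ss
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return mu_n, kappa_n, alpha_n, beta_n
```

Because the posterior stays in the normal-gamma family, marginal likelihoods of candidate cluster merges can be evaluated analytically, which is what makes the Bayesian model selection in BHC-style algorithms tractable.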
The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.
Comment: 41 pages, 4 figures
Projected t-SNE for batch correction
Biomedical research often produces high-dimensional data confounded by batch
effects such as systematic experimental variations, different protocols and
subject identifiers. Without proper correction, low-dimensional representation
of high-dimensional data might encode and reproduce the same systematic
variations observed in the original data, and compromise the interpretation of
the results. In this article, we propose a novel procedure to remove batch
effects from low-dimensional embeddings obtained with t-SNE dimensionality
reduction. The proposed methods are based on linear algebra and constrained
optimization, leading to efficient algorithms and fast computation in many
high-dimensional settings. Results on artificial single-cell transcription
profiling data show that the proposed procedure successfully removes multiple
batch effects from t-SNE embeddings, while retaining fundamental information on
cell types. When applied to single-cell gene expression data to investigate
mouse medulloblastoma, the proposed method successfully removes batch effects
related to mouse identifiers and the date of the experiment, while preserving
clusters of oligodendrocytes, astrocytes, endothelial cells, and microglia,
which are expected to lie in the stroma within or adjacent to the tumors.
Comment: 16 pages, 3 figures
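The linear-algebra intuition behind this kind of correction can be sketched in a few lines. This is a deliberately simplified stand-in for the paper's constrained-optimization procedure: projecting the embedding onto the orthogonal complement of the batch-indicator subspace, which for a one-hot batch design amounts to subtracting each batch's mean from its points.

```python
import numpy as np

def remove_batch_means(Y, batches):
    """Project a low-dimensional embedding Y (n_points x n_dims) onto the
    orthogonal complement of the batch-indicator subspace, i.e. center
    each batch at the origin so no linear batch signal remains."""
    Y = np.asarray(Y, dtype=float)
    batches = np.asarray(batches)
    out = Y.copy()
    for b in np.unique(batches):
        mask = (batches == b)
        out[mask] -= out[mask].mean(axis=0)  # subtract this batch's centroid
    return out
```

After the projection, a linear classifier can no longer separate batches by their means, while within-batch geometry (e.g. cell-type clusters) is left untouched up to a per-batch translation.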
Augment to Interpret: Unsupervised and Inherently Interpretable Graph Embeddings
Unsupervised learning allows us to leverage unlabelled data, which has become
abundantly available, and to create embeddings that are usable on a variety of
downstream tasks. However, the typical lack of interpretability of unsupervised
representation learning has become a limiting factor with regard to recent
transparent-AI regulations. In this paper, we study graph representation
learning and we show that data augmentation that preserves semantics can be
learned and used to produce interpretations. Our framework, which we named
INGENIOUS, creates inherently interpretable embeddings and eliminates the need
for costly additional post-hoc analysis. We also introduce additional metrics
addressing the lack of formalism and metrics in the understudied area of
unsupervised-representation learning interpretability. Our results are
supported by an experimental study applied to both graph-level and node-level
tasks and show that interpretable embeddings provide state-of-the-art
performance on downstream tasks.