Learning the optimal scale for GWAS through hierarchical SNP aggregation
Motivation: Genome-Wide Association Studies (GWAS) seek to identify causal
genomic variants associated with rare human diseases. The classical statistical
approach for detecting these variants is based on univariate hypothesis
testing, with healthy individuals being tested against affected individuals at
each locus. Given that an individual's genotype is characterized by up to one
million SNPs, this approach lacks precision, since it may yield a large number
of false positives that can lead to erroneous conclusions about genetic
associations with the disease. One way to improve the detection of true genetic
associations is to reduce the number of hypotheses to be tested by grouping
SNPs. Results: We propose a dimension-reduction approach which can be applied
in the context of GWAS by making use of the haplotype structure of the human
genome. We compare our method with standard univariate and multivariate
approaches on both synthetic and real GWAS data, and we show that reducing the
dimension of the predictor matrix by aggregating SNPs gives a greater precision
in the detection of associations between the phenotype and genomic regions.
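The abstract does not detail the aggregation step; as a rough illustration only, the sketch below hierarchically groups correlated SNP columns and runs one test per aggregated group. The 0/1/2 genotype coding, the correlation-based distance, the dendrogram cut height, and the per-group t-test are illustrative assumptions, not the authors' procedure.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ttest_ind

def aggregated_gwas_tests(X, y, cut_height=0.5):
    """Group correlated SNPs hierarchically, then run one test per group.

    X : (n_individuals, n_snps) genotype matrix coded 0/1/2 (assumed)
    y : (n_individuals,) binary phenotype, 0 = control, 1 = case
    cut_height : dendrogram cut level (illustrative choice)
    """
    # Crude LD-style distance between SNP columns: 1 - |correlation|.
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    # Condensed upper-triangle distances in the order scipy expects.
    iu = np.triu_indices_from(dist, k=1)
    Z = linkage(dist[iu], method="average")
    groups = fcluster(Z, t=cut_height, criterion="distance")

    pvalues = {}
    for g in np.unique(groups):
        # Aggregate each group of SNPs into a single feature (here: mean genotype).
        feature = X[:, groups == g].mean(axis=1)
        _, p = ttest_ind(feature[y == 1], feature[y == 0])
        pvalues[g] = p
    return groups, pvalues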
Simplicial similarity and its application to hierarchical clustering
In the present document, we introduce an extension of the notion of statistical depth with the aim of measuring proximities between pairs of points. In particular, we extend the simplicial depth function, which measures how central a point is by using random simplices (triangles in the two-dimensional case). The paper is structured as follows. First, we give a brief introduction to statistical depth functions. Next, we define the simplicial similarity function and study its properties. Finally, we present a few graphical examples to illustrate its behavior under symmetric and asymmetric distributions, and we apply the function to hierarchical clustering.
Keywords: Statistical depth, similarity measures, hierarchical clustering
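Neither the similarity formula nor its estimator is given in the abstract; for orientation, the sketch below computes the empirical simplicial depth in two dimensions (the fraction of sample triangles containing a point) together with one plausible pairwise extension (the fraction of triangles containing both points). The pairwise version is an assumption made for illustration, not necessarily the function proposed in the paper.

import numpy as np
from itertools import combinations

def _in_triangle(p, a, b, c):
    # Sign tests: p lies inside (or on the boundary of) triangle (a, b, c)
    # if the three edge cross-products do not have mixed signs.
    def cross(u, v, w):
        return (u[0] - w[0]) * (v[1] - w[1]) - (v[0] - w[0]) * (u[1] - w[1])
    d1, d2, d3 = cross(p, a, b), cross(p, b, c), cross(p, c, a)
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    return not (has_neg and has_pos)

def simplicial_depth(x, sample):
    # Empirical simplicial depth of x: fraction of sample triangles containing x.
    triples = list(combinations(range(len(sample)), 3))
    hits = sum(_in_triangle(x, sample[i], sample[j], sample[k]) for i, j, k in triples)
    return hits / len(triples)

def simplicial_similarity(x, y, sample):
    # Assumed pairwise extension: fraction of sample triangles containing both x and y.
    triples = list(combinations(range(len(sample)), 3))
    hits = sum(
        _in_triangle(x, sample[i], sample[j], sample[k])
        and _in_triangle(y, sample[i], sample[j], sample[k])
        for i, j, k in triples
    )
    return hits / len(triples)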
Incremental Clustering: The Case for Extra Clusters
The explosion in the amount of data available for analysis often necessitates
a transition from batch to incremental clustering methods, which process one
element at a time and typically store only a small subset of the data. In this
paper, we initiate the formal analysis of incremental clustering methods
focusing on the types of cluster structure that they are able to detect. We
find that the incremental setting is strictly weaker than the batch model,
proving that a fundamental class of cluster structures that can readily be
detected in the batch setting is impossible to identify using any incremental
method. Furthermore, we show how the limitations of incremental clustering can
be overcome by allowing additional clusters.
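As a concrete picture of the incremental setting (one element at a time, only a small summary stored), the sketch below implements a generic sequential procedure in which each point either joins its nearest center or opens a new cluster, with a cluster budget larger than the target number of clusters. This is an illustrative scheme under assumed parameters (radius, max_clusters), not one of the constructions analyzed in the paper.

import numpy as np

def sequential_clustering(stream, radius, max_clusters):
    """Generic one-pass clustering that stores only cluster centers.

    Each arriving point joins the nearest existing center if it lies within
    `radius`; otherwise a new cluster is opened, up to `max_clusters`.
    Letting `max_clusters` exceed the target number of clusters is the
    "extra clusters" idea; a final merging step could reduce them to k.
    """
    centers, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if not centers:
            centers.append(x.copy())
            counts.append(1)
            continue
        dists = [np.linalg.norm(x - c) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] <= radius or len(centers) >= max_clusters:
            # Assign to the nearest center and update it as a running mean.
            counts[j] += 1
            centers[j] = centers[j] + (x - centers[j]) / counts[j]
        else:
            centers.append(x.copy())
            counts.append(1)
    return centers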
On Sequence Clustering and Supervised Dimensionality Reduction
This dissertation studies two machine learning problems: 1) clustering of independent and identically generated random sequences, and 2) dimensionality reduction for classification problems.
For sequence clustering, the focus is on the large-sample performance of classical clustering algorithms, including the k-medoids algorithm and hierarchical agglomerative clustering (HAC) algorithms. Data sequences are generated from unknown continuous distributions that are assumed to form clusters according to some well-defined distance metrics. The goal is to group data sequences according to their underlying distributions with little or no prior knowledge of either the underlying distributions or the number of clusters. Upper bounds on the clustering error probability are derived for the k-medoids algorithm and a class of HAC algorithms under mild assumptions on the distribution clusters and distance metrics. In both cases, the error probabilities are shown to decay exponentially fast as the number of samples in each data sequence goes to infinity. The resulting error-exponent bound has a simple form when either the Kolmogorov-Smirnov distance or the maximum mean discrepancy is used as the distance metric. A tighter upper bound on the error probability of the single-linkage HAC algorithm is derived by taking advantage of its simplified metric-updating scheme. Numerical results are provided to validate the analysis.
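As a minimal sketch of this setting (not of the analysis), the code below compares data sequences through the Kolmogorov-Smirnov distance between their empirical distributions and groups them with single-linkage HAC; the use of scipy utilities and the maxclust cut are illustrative choices.

from scipy.stats import ks_2samp
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_sequences(sequences, n_clusters):
    """Group i.i.d. data sequences by the distance between their empirical distributions.

    sequences : list of 1-D arrays, each an i.i.d. sample from an unknown distribution
    n_clusters : number of clusters to extract from the dendrogram
    """
    m = len(sequences)
    # Pairwise Kolmogorov-Smirnov distances, in scipy's condensed order.
    condensed = [
        ks_2samp(sequences[i], sequences[j]).statistic
        for i in range(m)
        for j in range(i + 1, m)
    ]
    # Single-linkage hierarchical agglomerative clustering on those distances.
    Z = linkage(condensed, method="single")
    return fcluster(Z, t=n_clusters, criterion="maxclust")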
For dimensionality reduction, the focus is on classification problems where label information in the training data can be leveraged for improved learning performance. A supervised dimensionality reduction method is proposed that maximizes the difference of the average projection energy between samples with different labels. Both synthetic data and WiFi sensing data are used to validate the effectiveness of the proposed method. The numerical results show that the proposed method outperforms existing supervised dimensionality reduction approaches based on Fisher discriminant analysis (FDA) and the Hilbert-Schmidt independence criterion (HSIC). When the kernel trick is applied to all three approaches, the performance of the proposed dimensionality reduction method is comparable to that of FDA and HSIC and superior to unsupervised principal component analysis.
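The abstract does not spell out the exact objective; the sketch below shows one plausible two-class reading, in which the projection directions maximize the absolute difference of the per-class average projection energies, i.e. the top eigenvectors (by absolute eigenvalue) of the difference of the class second-moment matrices. This is an assumed formulation for illustration, not the dissertation's method.

import numpy as np

def energy_difference_projection(X, y, n_components):
    """Assumed two-class reading of 'maximize the difference of average projection energy'.

    Per-class second-moment matrices S_c = mean_{y_i = c} x_i x_i^T are formed,
    and the directions maximizing |w^T (S_1 - S_0) w| are the eigenvectors of
    S_1 - S_0 with the largest absolute eigenvalues.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    assert len(classes) == 2, "this sketch covers the two-class case only"
    S = [X[y == c].T @ X[y == c] / np.sum(y == c) for c in classes]
    eigvals, eigvecs = np.linalg.eigh(S[1] - S[0])   # difference matrix is symmetric
    order = np.argsort(-np.abs(eigvals))             # largest |eigenvalue| first
    W = eigvecs[:, order[:n_components]]             # (n_features, n_components)
    return W, X @ W                                  # projection matrix and projected data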