1,366 research outputs found

Selective inference after convex clustering with $\ell_1$ penalization

Author: Bachoc François
Maugis-Rabusseau Cathy
Neuvial Pierre
Publication date: 04/09/2023

Classical inference methods notoriously fail when applied to data-driven test hypotheses or inference targets. Instead, dedicated methodologies are required to obtain statistical guarantees for these selective inference problems. Selective inference is particularly relevant post-clustering, typically when testing a difference in mean between two clusters. In this paper, we address convex clustering with $\ell_1$ penalization, by leveraging related selective inference tools for regression, based on Gaussian vectors conditioned to polyhedral sets. In the one-dimensional case, we prove a polyhedral characterization of obtaining given clusters, that enables us to suggest a test procedure with statistical guarantees. This characterization also allows us to provide a computationally efficient regularization path algorithm. Then, we extend the above test procedure and guarantees to multi-dimensional clustering with $\ell_1$ penalization, and also to more general multi-dimensional clusterings that aggregate one-dimensional ones. With various numerical experiments, we validate our statistical guarantees and we demonstrate the power of our methods to detect differences in mean between clusters. Our methods are implemented in the R package poclin. Comment: 40 pages, 8 figures
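
For orientation, convex clustering with an $\ell_1$ fusion penalty is commonly posed as the optimization problem below. The notation ($X$ for the $n \times p$ data matrix, $B$ for the fitted matrix, $\lambda \ge 0$ for the regularization parameter) is an assumption made for illustration and may differ from the paper's own:

$$\hat{B}(\lambda) \in \arg\min_{B \in \mathbb{R}^{n \times p}} \; \frac{1}{2}\,\|B - X\|_F^2 \;+\; \lambda \sum_{1 \le i < i' \le n} \|B_{i\cdot} - B_{i'\cdot}\|_1$$

Under this formulation, observations $i$ and $i'$ fall in the same cluster when the fitted rows coincide, $\hat{B}_{i\cdot}(\lambda) = \hat{B}_{i'\cdot}(\lambda)$. Both the squared Frobenius norm and the $\ell_1$ penalty are sums over the $p$ coordinates, so the problem splits into $p$ independent one-dimensional problems, which is consistent with the abstract treating the one-dimensional case first.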

arXiv.org e-Print Archive

Data Representation for Learning and Information Fusion in Bioinformatics

Author: Rajapakse Vinodh Nalin
Publication date: 01/01/2013

This thesis deals with the rigorous application of nonlinear dimension reduction and data organization techniques to biomedical data analysis. The Laplacian Eigenmaps algorithm is representative of these methods and has been widely applied in manifold learning and related areas. While their asymptotic manifold recovery behavior has been well-characterized, the clustering properties of Laplacian embeddings with finite data are largely motivated by heuristic arguments. We develop a precise bound, characterizing cluster structure preservation under Laplacian embeddings. From this foundation, we introduce flexible and mathematically well-founded approaches for information fusion and feature representation. These methods are applied to three substantial case studies in bioinformatics, illustrating their capacity to extract scientifically valuable information from complex data
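
As a rough illustration of the Laplacian Eigenmaps idea mentioned in this abstract, the following is a minimal sketch and not the thesis's implementation. It assumes a dense Gaussian affinity graph with a hypothetical bandwidth parameter `sigma`; the function name `laplacian_eigenmaps` and the toy data are introduced only for this example.

```python
import numpy as np

def laplacian_eigenmaps(X, n_components=2, sigma=1.0):
    """Embed the rows of X with the lowest nontrivial Laplacian eigenvectors."""
    # Pairwise squared Euclidean distances between all rows.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Assumption: fully connected graph with Gaussian affinities, no self-loops.
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # Drop the trivial eigenvector and rescale to solve L y = lambda D y.
    Y = d_inv_sqrt[:, None] * vecs[:, 1:n_components + 1]
    return vals[1:n_components + 1], Y

# Toy data: two tight groups of 10 points, centered 3 units apart.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, (10, 2)),
    rng.normal(0.0, 0.1, (10, 2)) + np.array([3.0, 0.0]),
])
vals, Y = laplacian_eigenmaps(X, n_components=1)
labels = Y[:, 0] > 0  # the sign of the first coordinate splits the two groups
```

In this toy setting the sign of the first embedding coordinate recovers the two groups, which is the kind of finite-sample cluster preservation the abstract says is usually justified only heuristically.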

Digital Repository at the University of Maryland

A New Measure for Analyzing and Fusing Sequences of Objects

Author: Goulermas JY
Kostopoulos A
Mu T
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 20/08/2015

This work is related to the combinatorial data analysis problem of seriation used for data visualization and exploratory analysis. Seriation re-sequences the data, so that more similar samples or objects appear closer together, whereas dissimilar ones are further apart. Despite the large number of current algorithms to realize such re-sequencing, there has not been a systematic way for analyzing the resulting sequences, comparing them, or fusing them to obtain a single unifying one. We propose a new positional proximity measure that evaluates the similarity of two arbitrary sequences based on their agreement on pairwise positional information of the sequenced objects. Furthermore, we present various statistical properties of this measure as well as its normalized version modeled as an instance of the generalized correlation coefficient. Based on this measure, we define a new procedure for consensus seriation that fuses multiple arbitrary sequences based on a quadratic assignment problem formulation and an efficient way of approximating its solution. We also derive theoretical links with other permutation distance functions and present their associated combinatorial optimization forms for consensus tasks. The utility of the proposed contributions is demonstrated through the comparison and fusion of multiple seriation algorithms we have implemented, using many real-world datasets from different application domains

University of Liverpool Repository

Entanglement thresholds for random induced states

Author: Anderson
Arveson
Aubrun
Bai
Banaszczyk
Bengtsson
Bhatia
Bourgain
Buchleitner
Collins
Davidson
Einstein
Figiel
Galambos
Gurvits
Haagerup
Hastings
Hayden
Horodecki
Kendon
Latała
Ledoux
Li
Lévy
Marčhenko
Mehta
Milman
Nielsen
Peres
Pisier
Rogers
Rudelson
Ruskai
Serre
Shor
Silverstein
Stormer
Szarek
Walgate
Werner
Woronowicz
Ye
Życzkowski
Publication venue: 'Wiley'
Publication date: 15/10/2012

For a random quantum state on $H=C^d \otimes C^d$ obtained by partial tracing a random pure state on $H \otimes C^s$, we consider the question of whether it is typically separable or typically entangled. For this problem, we show the existence of a sharp threshold $s_0=s_0(d)$ of order roughly $d^3$. More precisely, for any $a > 0$ and for $d$ large enough, such a random state is entangled with very large probability when $s < (1-a)s_0$, and separable with very large probability when $s > (1+a)s_0$. One consequence of this result is as follows: for a system of $N$ identical particles in a random pure state, there is a threshold $k_0 = k_0(N) \sim N/5$ such that two subsystems of $k$ particles each typically share entanglement if $k > k_0$, and typically do not share entanglement if $k < k_0$. Our methods work also for multipartite systems and for "unbalanced" systems such as $C^{d} \otimes C^{d'}$, $d \neq d'$. The arguments rely on random matrices, classical convexity, high-dimensional probability and geometry of Banach spaces; some of the auxiliary results may be of reference value. A high-level non-technical overview of the results of this paper and of a related article arXiv:1011.0275 can be found in arXiv:1112.4582. Comment: 34 pages; v.3: reorganized proof, new results only in section 7.1, references added; v.2: main result strengthened (much stronger threshold property) allowing the sharp "N-particle" interpretation of the results stated in the abstract, new appendix on majorization and \infty-Wasserstein distance, references added
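
To make the construction in this abstract concrete, the sketch below samples a random induced state numerically. It is an illustration under stated assumptions, not code from the paper: the function names `random_induced_state` and `partial_transpose` are hypothetical, and the entanglement check uses the standard positive-partial-transpose (PPT) criterion, which can certify entanglement but cannot by itself certify separability.

```python
import numpy as np

def random_induced_state(d, s, rng):
    """Density matrix on C^d (x) C^d induced by a random pure state on (C^d (x) C^d) (x) C^s."""
    n = d * d
    # An n-by-s complex Gaussian matrix, normalized to unit Frobenius norm,
    # represents a uniformly distributed unit vector in C^n (x) C^s.
    G = rng.standard_normal((n, s)) + 1j * rng.standard_normal((n, s))
    G /= np.linalg.norm(G)
    # Tracing out the C^s factor gives G G^*.
    return G @ G.conj().T

def partial_transpose(rho, d):
    """Transpose the second tensor factor of a (d*d)-by-(d*d) matrix."""
    r = rho.reshape(d, d, d, d)  # r[i, j, k, l] = rho[(i, j), (k, l)]
    return r.transpose(0, 3, 2, 1).reshape(d * d, d * d)  # swap j and l

rng = np.random.default_rng(1)
d = 2
rho = random_induced_state(d, s=1, rng=rng)  # s = 1 yields a random pure state
min_pt_eig = np.linalg.eigvalsh(partial_transpose(rho, d)).min()
is_certified_entangled = min_pt_eig < 0  # a negative eigenvalue implies entanglement
```

Increasing `s` mixes the state more strongly, so a negative partial-transpose eigenvalue becomes less likely; the sharp separability threshold of order $d^3$ described in the abstract is a statement about separability itself and is not reproduced by this PPT check.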

arXiv.org e-Print Archive

HAL-UJM

Crossref

Hal-Diderot

K-means based clustering and context quantization

Author: Xu Mantao
Publication venue: University of Joensuu

UEF Electronic Publications

Networked Data Analytics: Network Comparison And Applied Graph Signal Processing

Author: Huang Weiyu
Publication venue: ScholarlyCommons
Publication date: 01/01/2018

Networked data structures have become large, ubiquitous, and pervasive. As our day-to-day activities become more incorporated with and influenced by the digital world, we rely more on our intuition to provide us a high-level idea and subconscious understanding of the encountered data. This thesis aims at translating the qualitative intuitions we have about networked data into quantitative and formal tools by designing rigorous yet reasonable algorithms. In a nutshell, this thesis constructs models to compare and cluster networked data, to simplify a complicated networked structure, and to formalize the notion of smoothness and variation for domain-specific signals on a network. This thesis consists of two interrelated thrusts which explore both the scenarios where networks have intrinsic value and are themselves the object of study, and where the interest is in signals defined on top of the networks, so we leverage the information in the network to analyze the signals. Our results suggest that the intuition we have in analyzing huge data can be transformed into rigorous algorithms, and often the intuition results in superior performance, new observations, better complexity, and/or bridging two commonly implemented methods. Even though they differ in the principles they investigate, both thrusts are built on what we think of as a contemporary alteration in data analytics: from building an algorithm then understanding it to having an intuition then building an algorithm around it. We show that in order to formalize the intuitive idea of measuring the difference between a pair of networks of arbitrary sizes, we could design two algorithms based on the intuition of finding mappings between the node sets or of mapping one network into a subset of another network. Such methods also lead to a clustering algorithm to categorize networked data structures.
Besides, we could define the notion of frequencies of a given network by ordering features in the network according to how important they are to the overall information conveyed by the network. These proposed algorithms succeed in comparing collaboration histories of researchers, clustering research communities via their publication patterns, categorizing moving objects from uncertain measurements, and separating networks constructed from different processes. In the context of data analytics on top of networks, we design domain-specific tools by leveraging the recent advances in graph signal processing, which formalizes the intuitive notion of smoothness and variation of signals defined on top of networked structures, and generalizes conventional Fourier analysis to the graph domain. Specifically, we show how these tools can be used to better classify cancer subtypes by considering genetic profiles as signals on top of gene-to-gene interaction networks, to gain new insights that explain the differences between human beings in learning new tasks and switching attention by considering brain activities as signals on top of brain connectivity networks, as well as to demonstrate that common methods in rating prediction are special graph filters and to use this observation to design novel recommendation system algorithms
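
The graph signal processing notions mentioned in this abstract (graph frequencies, smoothness and variation of a signal) can be illustrated with a small sketch. This is a generic textbook-style construction under stated assumptions, not the thesis's code: it assumes an undirected graph, uses the combinatorial Laplacian, and the helper names `graph_fourier_basis`, `gft` and `igft` are hypothetical.

```python
import numpy as np

def graph_fourier_basis(A):
    """Return the Laplacian, its eigenvalues (graph frequencies) and eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A  # combinatorial Laplacian D - A
    freqs, U = np.linalg.eigh(L)    # ascending: low frequencies first
    return L, freqs, U

def gft(U, x):
    """Graph Fourier transform: coordinates of x in the eigenvector basis."""
    return U.T @ x

def igft(U, x_hat):
    """Inverse graph Fourier transform."""
    return U @ x_hat

# Path graph on 4 nodes: 0 - 1 - 2 - 3.
A = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
L, freqs, U = graph_fourier_basis(A)
x = np.array([1.0, 2.0, 0.0, -1.0])  # a signal with one value per node
x_hat = gft(U, x)
# Total variation: sum over edges of squared differences, equal to x^T L x.
variation = x @ L @ x
```

The variation `x @ L @ x` equals the sum of `freqs * x_hat**2`, so a signal whose energy sits at low graph frequencies is smooth with respect to the network.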

ScholarlyCommons@Penn

Efficient Data Driven Multi Source Fusion

Author: Islam Muhammad Aminul
Publication venue: Scholars Junction
Publication date: 10/08/2018

Data/information fusion is an integral component of many existing and emerging applications; e.g., remote sensing, smart cars, Internet of Things (IoT), and Big Data, to name a few. While fusion aims to achieve better results than what any one individual input can provide, often the challenge is to determine the underlying mathematics for aggregation suitable for an application. In this dissertation, I focus on the following three aspects of aggregation: (i) efficient data-driven learning and optimization, (ii) extensions and new aggregation methods, and (iii) feature and decision level fusion for machine learning with applications to signal and image processing. The Choquet integral (ChI), a powerful nonlinear aggregation operator, is a parametric way (with respect to the fuzzy measure (FM)) to generate a wealth of aggregation operators. The FM has $2^N$ variables and $N(2^{N-1} - 1)$ constraints for $N$ inputs. As a result, learning the ChI parameters from data quickly becomes impractical for most applications. Herein, I propose a scalable learning procedure (which is linear with respect to training sample size) for the ChI that identifies and optimizes only data-supported variables. As such, the computational complexity of the learning algorithm is proportional to the complexity of the solver used. This method also includes an imputation framework to obtain scalar values for data-unsupported (aka missing) variables and a compression algorithm (lossy or lossless) of the learned variables. I also propose a genetic algorithm (GA) to optimize the ChI for non-convex, multi-modal, and/or analytical objective functions. This algorithm introduces two operators that automatically preserve the constraints; therefore there is no need to explicitly enforce the constraints as is required by traditional GAs. In addition, this algorithm provides an efficient representation of the search space with the minimal set of vertices.
Furthermore, I study different strategies for extending the fuzzy integral for missing data and I propose a GOAL programming framework to aggregate inputs from heterogeneous sources for the ChI learning. Last, my work in remote sensing involves visual clustering based band group selection and Lp-norm multiple kernel learning based feature level fusion in hyperspectral image processing to enhance pixel level classification
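
To illustrate how the Choquet integral aggregates inputs and why the fuzzy measure grows exponentially with the number of inputs, here is a minimal sketch. It is not the dissertation's learning procedure: the function names and the dictionary representation of the measure (keyed by `frozenset` of input indices) are assumptions made for this example, and the constraint count enumerates the monotonicity conditions $g(A) \le g(A \cup \{i\})$ over nonempty subsets $A$.

```python
from itertools import combinations

def make_measure(n, g):
    """Tabulate a fuzzy measure: every subset of {0..n-1} mapped to g(subset)."""
    return {
        frozenset(c): g(frozenset(c))
        for k in range(n + 1)
        for c in combinations(range(n), k)
    }

def choquet_integral(values, measure):
    """Discrete Choquet integral of `values` with respect to `measure`."""
    # Visit inputs from largest to smallest value, growing the subset as we go.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    total, prev, chosen = 0.0, 0.0, set()
    for i in order:
        chosen.add(i)
        g = measure[frozenset(chosen)]
        total += values[i] * (g - prev)  # weight is the increase in measure
        prev = g
    return total

def count_monotonicity_constraints(n):
    """Count pairs (A, A + {i}) with A nonempty and i not in A."""
    return sum(
        n - k
        for k in range(1, n)
        for _ in combinations(range(n), k)
    )

n = 3
values = [0.2, 0.9, 0.5]
mean_measure = make_measure(n, lambda a: len(a) / n)             # additive, uniform
max_measure = make_measure(n, lambda a: 1.0 if a else 0.0)       # any input suffices
min_measure = make_measure(n, lambda a: 1.0 if len(a) == n else 0.0)

as_mean = choquet_integral(values, mean_measure)
as_max = choquet_integral(values, max_measure)
as_min = choquet_integral(values, min_measure)
```

Different measures turn the same operator into the mean, the maximum or the minimum, which is the sense in which the ChI generates a family of aggregation operators. The table built by `make_measure` has $2^N$ entries, which is what makes learning it from data expensive as $N$ grows.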

Mississippi State University Libraries ETD database

Scholars Junction - Mississippi State University Institutional Repository

Weakly monotonic averaging with application to image processing

Author: Wilkin Timothy
Publication venue: Deakin University, Faculty of Science, Engineering and Built Environment, School of Information Technology
Publication date: 01/05/2014

Deakin Research Online