4 research outputs found
Recovering Data Permutations from Noisy Observations: The Linear Regime
This paper considers a noisy data structure recovery problem. The goal is to
investigate the following question: Given a noisy observation of a permuted
data set, according to which permutation was the original data sorted? The
focus is on scenarios where data is generated according to an isotropic
Gaussian distribution, and the noise is additive Gaussian with an arbitrary
covariance matrix. This problem is posed within a hypothesis testing framework.
The objective is to study the linear regime in which the optimal decoder has a
polynomial complexity in the data size, and it declares the permutation by
simply computing a permutation-independent linear function of the noisy
observations. The main result of the paper is a complete characterization of
the linear regime in terms of the noise covariance matrix. Specifically, it is
shown that this matrix must have a very flat spectrum with at most three
distinct eigenvalues to induce the linear regime. Several practically relevant
implications of this result are discussed, and the error probability incurred
by the decision criterion in the linear regime is also characterized. A core
technical component consists of using linear algebraic and geometric tools,
such as Steiner symmetrization.
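The decision rule described in this abstract, declaring the permutation by sorting a fixed linear function of the noisy observations, can be illustrated with a small simulation. This is a minimal sketch and not the paper's decoder: the identity matrix as the linear map, the isotropic noise with standard deviation `sigma`, and the names `sorting_permutation` and `linear_decoder` are all assumptions made for illustration.

```python
import numpy as np

def sorting_permutation(v):
    """Return the index array that sorts v in ascending order."""
    return np.argsort(v, kind="stable")

def linear_decoder(y, A=None):
    """Declare the permutation by sorting a fixed linear function A @ y.

    A is permutation-independent; the identity default is an assumption
    chosen for illustration, not a choice taken from the paper.
    """
    if A is None:
        A = np.eye(len(y))
    return sorting_permutation(A @ y)

# Monte Carlo estimate of how often the decoder recovers the true ordering.
rng = np.random.default_rng(0)
n, trials, sigma = 5, 2000, 0.05   # hypothetical problem size and noise level
hits = 0
for _ in range(trials):
    x = rng.standard_normal(n)                # isotropic Gaussian data
    y = x + sigma * rng.standard_normal(n)    # additive Gaussian noise (isotropic here)
    hits += np.array_equal(linear_decoder(y), sorting_permutation(x))
success_rate = hits / trials
print(f"empirical success rate: {success_rate:.3f}")
```

The cost is dominated by the sort, so the rule runs in polynomial time in the data size, in line with the complexity claim in the abstract.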
A Hypergradient Approach to Robust Regression without Correspondence
We consider a variant of the regression problem, where the correspondence between
input and output data is not available. Such shuffled data is commonly observed
in many real world problems. Taking flow cytometry as an example, the measuring
instruments may not be able to maintain the correspondence between the samples
and the measurements. Due to the combinatorial nature of the problem, most
existing methods are only applicable when the sample size is small, and limited
to linear regression models. To overcome such bottlenecks, we propose a new
computational framework -- ROBOT -- for the shuffled regression problem, which
is applicable to large data and complex nonlinear models. Specifically, we
reformulate the regression without correspondence as a continuous optimization
problem. Then by exploiting the interaction between the regression model and
the data correspondence, we develop a hypergradient approach based on
differentiable programming techniques. Such a hypergradient approach
essentially views the data correspondence as an operator of the regression, and
therefore allows us to find a better descent direction for the model parameter
by differentiating through the data correspondence. ROBOT can be further
extended to the inexact correspondence setting, where there may not be an exact
alignment between the input and output data. Thorough numerical experiments
show that ROBOT achieves better performance than existing methods in both
linear and nonlinear regression tasks, including real-world applications such
as flow cytometry and multi-object tracking.
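The idea of treating the correspondence as a function of the model parameter can be sketched with a soft assignment. The example below assumes an entropy-regularized (Sinkhorn) relaxation of the permutation, which is one common choice and not necessarily the construction used in ROBOT; the functions `sinkhorn` and `relaxed_loss`, the scalar model `w * x`, and the value of `eps` are hypothetical. It only evaluates the relaxed objective and computes no hypergradient.

```python
import numpy as np

def sinkhorn(cost, eps=0.5, n_iters=500):
    """Soft assignment matrix from alternating row/column scaling.

    Columns sum to one after the last update; rows sum to one
    approximately once the iteration has converged.
    """
    K = np.exp(-cost / eps)
    u = np.ones(cost.shape[0])
    v = np.ones(cost.shape[1])
    for _ in range(n_iters):
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u[:, None] * K * v[None, :]

def relaxed_loss(w, x, y_shuffled, eps=0.5):
    """Continuous surrogate: squared error averaged under the soft matching P(w)."""
    pred = w * x
    cost = (pred[:, None] - y_shuffled[None, :]) ** 2
    P = sinkhorn(cost, eps)          # the correspondence depends on w through the cost
    return np.sum(P * cost) / len(x), P

rng = np.random.default_rng(0)
n = 20
x = rng.standard_normal(n)
# Responses generated with slope 2.0, then shuffled so the pairing is lost.
y_shuffled = rng.permutation(2.0 * x + 0.1 * rng.standard_normal(n))

loss_true, P_true = relaxed_loss(2.0, x, y_shuffled)
loss_zero, _ = relaxed_loss(0.0, x, y_shuffled)
print(loss_true, loss_zero)
```

Because `P` is a smooth function of `w`, an automatic differentiation framework could propagate gradients through the matching step, which is the kind of dependence the abstract refers to when it speaks of differentiating through the data correspondence.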
A Pseudo-Likelihood Approach to Linear Regression with Partially Shuffled Data
Recently, there has been significant interest in linear regression in the
situation where predictors and responses are not observed in matching pairs
corresponding to the same statistical unit as a consequence of separate data
collection and uncertainty in data integration. Mismatched pairs can
considerably impact the model fit and disrupt the estimation of regression
parameters. In this paper, we present a method to adjust for such mismatches
under "partial shuffling" in which a sufficiently large fraction of
(predictors, response)-pairs are observed in their correct correspondence. The
proposed approach is based on a pseudo-likelihood in which each term takes the
form of a two-component mixture density. Expectation-Maximization schemes are
proposed for optimization, which (i) scale favorably in the number of samples,
and (ii) achieve excellent statistical performance relative to an oracle that
has access to the correct pairings as certified by simulations and case
studies. In particular, the proposed approach can tolerate a considerably larger
fraction of mismatches than existing approaches, and enables estimation of the
noise level as well as the fraction of mismatches. Inference for the resulting
estimator (standard errors, confidence intervals) can be based on established
theory for composite likelihood estimation. Along the way, we also propose a
statistical test for the presence of mismatches and establish its consistency
under suitable conditions.
Comment: 31 pages
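A two-component mixture of the kind described in this abstract can be fitted with a short EM loop. The sketch below rests on assumptions that are not taken from the paper: a correctly matched response follows a Gaussian regression density, a mismatched response follows a Gaussian fitted to the marginal of `y`, and the function name `em_partial_shuffle` and the simulated data are hypothetical.

```python
import numpy as np

def normal_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_partial_shuffle(X, y, n_iters=200):
    """EM for the mixture (1 - alpha) * N(x'beta, sigma2) + alpha * N(mu_y, tau2)."""
    mu_y, tau2 = np.mean(y), np.var(y)             # mismatch component (assumed Gaussian)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # naive least-squares start
    sigma2 = np.mean((y - X @ beta) ** 2)
    alpha = 0.5
    for _ in range(n_iters):
        # E-step: posterior probability that each pair is mismatched.
        f_match = (1.0 - alpha) * normal_pdf(y, X @ beta, sigma2)
        f_mis = alpha * normal_pdf(y, mu_y, tau2)
        r = f_mis / (f_match + f_mis)
        # M-step: weighted least squares with weights 1 - r, then variance and alpha.
        w = 1.0 - r
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        sigma2 = np.sum(w * (y - X @ beta) ** 2) / np.sum(w)
        alpha = np.mean(r)
    return beta, sigma2, alpha

# Simulated data: 30% of the responses are shuffled among themselves.
rng = np.random.default_rng(0)
n, d = 1000, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, d))
y = X @ beta_true + 0.1 * rng.standard_normal(n)
idx = rng.choice(n, size=300, replace=False)
y[idx] = y[rng.permutation(idx)]

beta_hat, sigma2_hat, alpha_hat = em_partial_shuffle(X, y)
print(beta_hat, sigma2_hat, alpha_hat)
```

Each iteration costs one weighted least-squares solve, so the loop scales linearly in the number of samples, and it returns estimates of both the noise level and the mismatch fraction alongside the regression coefficients.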
Estimation in exponential family Regression based on linked data contaminated by mismatch error
Identification of matching records in multiple files can be a challenging and
error-prone task. Linkage error can considerably affect subsequent statistical
analysis based on the resulting linked file. Several recent papers have studied
post-linkage linear regression analysis with the response variable in one file
and the covariates in a second file from the perspective of the "Broken Sample
Problem" and "Permuted Data". In this paper, we present an extension of this
line of research to exponential family response given the assumption of a small
to moderate number of mismatches. A method based on observation-specific
offsets to account for potential mismatches and $\ell_1$-penalization is
proposed, and its statistical properties are discussed. We also present
sufficient conditions for the recovery of the correct correspondence between
covariates and responses if the regression parameter is known. The proposed
approach is compared to established baselines, namely the methods by
Lahiri-Larsen and Chambers, both theoretically and empirically based on
synthetic and real data. The results indicate that substantial improvements
over those methods can be achieved even if only limited information about the
linkage process is available.
Comment: 51 pages, 7 figures
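The offset idea in this abstract can be illustrated in the simplest exponential-family case, a Gaussian response. The sketch below is an assumption-laden illustration and not the paper's estimator: it assumes an l1 penalty on the observation-specific offsets `gamma`, solves the resulting convex problem by alternating least squares and soft thresholding, and uses a hypothetical function `fit_with_offsets` and a hand-picked penalty level `lam`.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fit_with_offsets(X, y, lam, n_iters=200):
    """Minimize 0.5 * ||y - X beta - gamma||^2 + lam * ||gamma||_1 by alternation."""
    gamma = np.zeros(len(y))
    for _ in range(n_iters):
        # beta-step: ordinary least squares on the offset-corrected responses.
        beta = np.linalg.lstsq(X, y - gamma, rcond=None)[0]
        # gamma-step: soft-threshold the residuals; large residuals get an offset.
        gamma = soft_threshold(y - X @ beta, lam)
    return beta, gamma

# Simulated linked file with a small number of mismatched responses (5%).
rng = np.random.default_rng(0)
n, d = 400, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, d))
y = X @ beta_true + 0.1 * rng.standard_normal(n)
idx = rng.choice(n, size=20, replace=False)
y[idx] = y[rng.permutation(idx)]

beta_naive = np.linalg.lstsq(X, y, rcond=None)[0]    # ignores mismatches
beta_hat, gamma_hat = fit_with_offsets(X, y, lam=0.3)
print(beta_naive, beta_hat, np.count_nonzero(gamma_hat))
```

Observations whose offset stays at zero are treated as correctly linked, while nonzero offsets absorb the residuals of suspected mismatches, so the number of nonzero entries in `gamma_hat` gives a rough count of flagged records.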