4 research outputs found

    Recovering Data Permutations from Noisy Observations: The Linear Regime

    This paper considers a noisy data structure recovery problem. The goal is to investigate the following question: given a noisy observation of a permuted data set, according to which permutation was the original data sorted? The focus is on scenarios where the data is generated according to an isotropic Gaussian distribution and the noise is additive Gaussian with an arbitrary covariance matrix. The problem is posed within a hypothesis testing framework. The objective is to study the linear regime, in which the optimal decoder has polynomial complexity in the data size and declares the permutation by simply computing a permutation-independent linear function of the noisy observations. The main result of the paper is a complete characterization of the linear regime in terms of the noise covariance matrix. Specifically, it is shown that this matrix must have a very flat spectrum with at most three distinct eigenvalues to induce the linear regime. Several practically relevant implications of this result are discussed, and the error probability incurred by the decision criterion in the linear regime is also characterized. A core technical component consists of using linear algebraic and geometric tools, such as Steiner symmetrization.
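    As a toy illustration of the decision rule in the linear regime, the following Python sketch (not taken from the paper) decodes the permutation of a sorted Gaussian vector from an observation corrupted by isotropic noise, by ranking a permutation-independent linear statistic of the observation. The identity map used for A, the noise level, and the data size are illustrative assumptions; the paper's result concerns exactly which noise covariances admit such a decoder.

        import numpy as np

        rng = np.random.default_rng(0)

        n = 8
        x = np.sort(rng.standard_normal(n))      # original data, sorted increasingly
        true_perm = rng.permutation(n)           # unknown permutation applied to the data
        sigma = 0.3                              # illustrative noise level
        y = x[true_perm] + sigma * rng.standard_normal(n)   # noisy permuted observation

        # Linear-regime decoder: apply a permutation-independent linear map A to y
        # and declare the permutation that sorts the result. A = I corresponds to
        # the isotropic-noise special case.
        A = np.eye(n)
        decoded = np.argsort(A @ y)              # permutation that sorts the statistic
        target = np.argsort(x[true_perm])        # permutation that sorts the clean data
        print("exact recovery:", np.array_equal(decoded, target))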

    A Hypergradient Approach to Robust Regression without Correspondence

    We consider a variant of the regression problem in which the correspondence between input and output data is not available. Such shuffled data is commonly observed in many real-world problems. In flow cytometry, for example, the measuring instruments may not be able to maintain the correspondence between the samples and the measurements. Due to the combinatorial nature of the problem, most existing methods are only applicable when the sample size is small, and they are limited to linear regression models. To overcome these bottlenecks, we propose a new computational framework -- ROBOT -- for the shuffled regression problem, which is applicable to large data and complex nonlinear models. Specifically, we reformulate regression without correspondence as a continuous optimization problem. Then, by exploiting the interaction between the regression model and the data correspondence, we develop a hypergradient approach based on differentiable programming techniques. This hypergradient approach essentially views the data correspondence as an operator of the regression and therefore allows us to find a better descent direction for the model parameters by differentiating through the data correspondence. ROBOT can be further extended to the inexact correspondence setting, where there may not be an exact alignment between the input and output data. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking.
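    The abstract's hypergradient idea differentiates through the data correspondence, which requires automatic differentiation; the Python sketch below is therefore only a simplified alternating baseline for the same shuffled linear regression problem, not the ROBOT algorithm itself. It alternates an assignment step (matching responses to fitted values) with a refitting step; the simulated data, noise level, and iteration count are illustrative assumptions.

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        rng = np.random.default_rng(1)

        # Shuffled linear regression: the responses are observed in permuted order.
        n, d = 200, 3
        X = rng.standard_normal((n, d))
        beta_true = np.array([1.0, -2.0, 0.5])
        y = (X @ beta_true + 0.1 * rng.standard_normal(n))[rng.permutation(n)]

        beta = np.linalg.lstsq(X, y, rcond=None)[0]   # initialize ignoring the shuffle
        for _ in range(20):
            # Correspondence step: match each response to a covariate row by squared residual.
            cost = (y[:, None] - (X @ beta)[None, :]) ** 2
            rows, cols = linear_sum_assignment(cost)  # y[rows[i]] is paired with X[cols[i]]
            # Regression step: refit the coefficients under the current matching.
            beta = np.linalg.lstsq(X[cols], y[rows], rcond=None)[0]

        print("estimated coefficients:", beta)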

    A Pseudo-Likelihood Approach to Linear Regression with Partially Shuffled Data

    Recently, there has been significant interest in linear regression in the situation where predictors and responses are not observed in matching pairs corresponding to the same statistical unit, as a consequence of separate data collection and uncertainty in data integration. Mismatched pairs can considerably impact the model fit and disrupt the estimation of regression parameters. In this paper, we present a method to adjust for such mismatches under "partial shuffling," in which a sufficiently large fraction of (predictor, response)-pairs are observed in their correct correspondence. The proposed approach is based on a pseudo-likelihood in which each term takes the form of a two-component mixture density. Expectation-Maximization schemes are proposed for optimization, which (i) scale favorably in the number of samples and (ii) achieve excellent statistical performance relative to an oracle that has access to the correct pairings, as confirmed by simulations and case studies. In particular, the proposed approach can tolerate a considerably larger fraction of mismatches than existing approaches, and it enables estimation of the noise level as well as of the fraction of mismatches. Inference for the resulting estimator (standard errors, confidence intervals) can be based on established theory for composite likelihood estimation. Along the way, we also propose a statistical test for the presence of mismatches and establish its consistency under suitable conditions.
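    A minimal Python sketch of the two-component-mixture idea follows (not the paper's exact estimator): each pseudo-likelihood term mixes a "correct pair" Gaussian centered at x_i'beta with a "mismatch" component, here crudely taken to be the marginal distribution of the responses, and an EM loop alternates responsibilities with weighted least squares while also updating the noise level and the mismatch fraction. The simulation settings and the choice of mismatch component are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(2)

        # Partially shuffled data: a fraction of the responses are mismatched.
        n, d = 500, 3
        X = rng.standard_normal((n, d))
        beta_true = np.array([1.0, -1.0, 2.0])
        y = X @ beta_true + 0.2 * rng.standard_normal(n)
        idx = rng.choice(n, size=int(0.2 * n), replace=False)
        y[idx] = y[rng.permutation(idx)]              # shuffle 20% of the responses

        def normal_pdf(r, var):
            return np.exp(-0.5 * r ** 2 / var) / np.sqrt(2 * np.pi * var)

        # EM for a pseudo-likelihood whose terms are two-component mixtures:
        # "correct pair" ~ N(x_i' beta, sigma^2), "mismatch" ~ N(mean(y), var(y)).
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        sigma2, alpha = np.var(y - X @ beta), 0.1     # noise level, mismatch fraction
        g = normal_pdf(y - y.mean(), y.var())         # mismatch component (held fixed)
        for _ in range(100):
            f = normal_pdf(y - X @ beta, sigma2)
            w = (1 - alpha) * f / ((1 - alpha) * f + alpha * g)  # P(pair i is correct)
            beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))  # weighted LS
            sigma2 = np.sum(w * (y - X @ beta) ** 2) / np.sum(w)
            alpha = 1.0 - w.mean()

        print("beta:", beta, "sigma2:", sigma2, "mismatch fraction:", alpha)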

    Estimation in Exponential Family Regression Based on Linked Data Contaminated by Mismatch Error

    Identification of matching records in multiple files can be a challenging and error-prone task. Linkage error can considerably affect subsequent statistical analysis based on the resulting linked file. Several recent papers have studied post-linkage linear regression analysis, with the response variable in one file and the covariates in a second file, from the perspective of the "Broken Sample Problem" and "Permuted Data". In this paper, we present an extension of this line of research to exponential family responses under the assumption of a small to moderate number of mismatches. A method based on observation-specific offsets to account for potential mismatches and on ℓ1-penalization is proposed, and its statistical properties are discussed. We also present sufficient conditions for the recovery of the correct correspondence between covariates and responses if the regression parameter is known. The proposed approach is compared to established baselines, namely the methods by Lahiri-Larsen and Chambers, both theoretically and empirically, based on synthetic and real data. The results indicate that substantial improvements over those methods can be achieved even if only limited information about the linkage process is available.
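    As a rough illustration of the offset idea (again not the paper's exact procedure), the Python sketch below fits a logistic regression, a member of the exponential family, with one offset per observation and an ℓ1 penalty on the offsets, using alternating gradient and proximal (soft-thresholding) steps; mismatched records are expected to absorb nonzero offsets so that the regression coefficients stay close to the truth. The penalty level, step sizes, and simulation settings are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(3)

        # Logistic regression (an exponential family model) in which some responses
        # were attached to the wrong covariate rows by the linkage step.
        n, d = 400, 3
        X = rng.standard_normal((n, d))
        beta_true = np.array([1.5, -1.0, 0.5])
        y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)
        bad = rng.choice(n, size=20, replace=False)
        y[bad] = y[rng.permutation(bad)]              # simulate mismatch error

        # One offset per observation, penalized by lam * ||gamma||_1: mismatched
        # records should absorb a nonzero offset, protecting the estimate of beta.
        lam = 0.3                                     # tuning parameter (illustrative)
        beta, gamma = np.zeros(d), np.zeros(n)
        for _ in range(500):
            mu = 1 / (1 + np.exp(-(X @ beta + gamma)))
            beta -= X.T @ (mu - y) / n                # gradient step on the mean log-loss
            mu = 1 / (1 + np.exp(-(X @ beta + gamma)))
            gamma -= mu - y                           # per-observation gradient step
            gamma = np.sign(gamma) * np.maximum(np.abs(gamma) - lam, 0.0)  # l1 prox

        print("beta:", beta)
        print("records flagged by a nonzero offset:", np.count_nonzero(gamma))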