827 research outputs found

    Towards a Theoretical Analysis of PCA for Heteroscedastic Data

    Principal Component Analysis (PCA) is a method for estimating a subspace given noisy samples. It is useful in a variety of problems ranging from dimensionality reduction to anomaly detection and the visualization of high dimensional data. PCA performs well in the presence of moderate noise and even with missing data, but is also sensitive to outliers. PCA is also known to have a phase transition when noise is independent and identically distributed; recovery of the subspace sharply declines at a threshold noise variance. Effective use of PCA requires a rigorous understanding of these behaviors. This paper provides a step towards an analysis of PCA for samples with heteroscedastic noise, that is, samples that have non-uniform noise variances and so are no longer identically distributed. In particular, we provide a simple asymptotic prediction of the recovery of a one-dimensional subspace from noisy heteroscedastic samples. The prediction enables: a) easy and efficient calculation of the asymptotic performance, and b) qualitative reasoning to understand how PCA is impacted by heteroscedasticity (such as outliers). Comment: Presented at 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton)

    Optimally Weighted PCA for High-Dimensional Heteroscedastic Data

    Modern applications increasingly involve high-dimensional and heterogeneous data, e.g., datasets formed by combining numerous measurements from myriad sources. Principal Component Analysis (PCA) is a classical method for reducing dimensionality by projecting such data onto a low-dimensional subspace capturing most of their variation, but PCA does not robustly recover underlying subspaces in the presence of heteroscedastic noise. Specifically, PCA suffers from treating all data samples as if they are equally informative. This paper analyzes a weighted variant of PCA that accounts for heteroscedasticity by giving samples with larger noise variance less influence. The analysis provides expressions for the asymptotic recovery of underlying low-dimensional components from samples with heteroscedastic noise in the high-dimensional regime, i.e., for sample dimension on the order of the number of samples. Surprisingly, it turns out that whitening the noise by using inverse noise variance weights is suboptimal. We derive optimal weights, characterize the performance of weighted PCA, and consider the problem of optimally collecting samples under budget constraints. Comment: 52 pages, 13 figures
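The weighting idea described in the abstract above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's method: the function names (`weighted_pca`, `simulate_heteroscedastic`, `subspace_recovery`) and all simulation parameters are assumptions made for this example, and the inverse noise variance weights are used purely because they are simple to state. The abstract itself notes that these weights are suboptimal; the optimal weights derived in the paper are not reproduced here.

```python
import numpy as np


def weighted_pca(Y, weights, k):
    """Return the top-k eigenvectors of the weighted sample covariance.

    Y is a (d, n) array with one sample per column; weights has length n.
    """
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = Y.shape[1]
    # Scale column i by w[i], then form (1/n) * sum_i w_i * y_i * y_i^T.
    C = (Y * w) @ Y.T / n
    # eigh returns eigenvalues in ascending order, so reverse the columns.
    _, eigvecs = np.linalg.eigh(C)
    return eigvecs[:, ::-1][:, :k]


def simulate_heteroscedastic(d=50, n=2000, theta=3.0, seed=0):
    """Rank-one signal plus noise with two variance levels (assumed setup)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    # First half of the samples is clean (variance 0.1), second half noisy (10).
    noise_var = np.where(np.arange(n) < n // 2, 0.1, 10.0)
    z = rng.standard_normal(n)
    noise = rng.standard_normal((d, n)) * np.sqrt(noise_var)
    Y = theta * np.outer(u, z) + noise
    return Y, u, noise_var


def subspace_recovery(U_hat, u):
    """Squared inner product between the estimated and true directions."""
    return float((U_hat[:, 0] @ u) ** 2)


if __name__ == "__main__":
    Y, u, noise_var = simulate_heteroscedastic()
    uniform = weighted_pca(Y, np.ones(Y.shape[1]), k=1)
    inverse_var = weighted_pca(Y, 1.0 / noise_var, k=1)
    print("unweighted PCA recovery:     ", subspace_recovery(uniform, u))
    print("inverse-variance recovery:   ", subspace_recovery(inverse_var, u))
```

Passing uniform weights reduces the sketch to ordinary (uncentered) PCA, which makes it easy to compare the two on the same simulated data.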

    Xylem plasticity in Pinus pinaster and Quercus ilex growing at sites with different water availability in the Mediterranean region: relations between Intra-Annual Density Fluctuations and environmental conditions.

    Fluctuations in climatic conditions during the growing season are recorded in Mediterranean tree-rings and often result in intra-annual density fluctuations (IADFs). Dendroecology and quantitative wood anatomy analyses were used to characterize the relations between the variability of IADF traits and climatic drivers in Pinus pinaster Aiton and Quercus ilex L. growing at sites with different water availability on the Elba island in Central Italy. Our results showed that both species present high xylem plasticity resulting in the formation of L-type IADFs (L-IADFs), consisting of earlywood-like cells in latewood. The occurrence of such IADFs was linked to rain events following periods of summer drought. The formation of L-IADFs in both species increased the hydraulic conductivity late in the growing season, due to their larger lumen area in comparison to "true latewood". The two species expressed greater similarity under arid conditions, as unfavorable climates constrained trait variation. Wood density, measured as the percentage of cell walls over total xylem area, IADF frequency, as well as conduit lumen area and vessel frequency, specifically in the hardwood species, proved to be efficient proxies to encode climate signals recorded in the xylem. The response of these anatomical traits to climatic variations was found to be species- and site-specific.

    HeMPPCAT: Mixtures of Probabilistic Principal Component Analysers for Data with Heteroscedastic Noise

    Mixtures of probabilistic principal component analysis (MPPCA) is a well-known mixture model extension of principal component analysis (PCA). Similar to PCA, MPPCA assumes the data samples in each mixture contain homoscedastic noise. However, datasets with heterogeneous noise across samples are becoming increasingly common, as larger datasets are generated by collecting samples from several sources with varying noise profiles. The performance of MPPCA is suboptimal for data with heteroscedastic noise across samples. This paper proposes a heteroscedastic mixtures of probabilistic PCA technique (HeMPPCAT) that uses a generalized expectation-maximization (GEM) algorithm to jointly estimate the unknown underlying factors, means, and noise variances under a heteroscedastic noise setting. Simulation results illustrate the improved factor estimates and clustering accuracies of HeMPPCAT compared to MPPCA.

    ALPCAH: Sample-wise Heteroscedastic PCA with Tail Singular Value Regularization

    Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction that is useful for various data science problems. However, many applications involve heterogeneous data that varies in quality due to noise characteristics associated with different sources of the data. Methods that deal with this mixed dataset are known as heteroscedastic methods. Current methods like HePPCAT make Gaussian assumptions about the basis coefficients that may not hold in practice. Other methods such as Weighted PCA (WPCA) assume the noise variances are known, which may be difficult to know in practice. This paper develops a PCA method that can estimate the sample-wise noise variances and use this information in the model to improve the estimate of the subspace basis associated with the low-rank structure of the data. This is done without distributional assumptions of the low-rank component and without assuming the noise variances are known. Simulations show the effectiveness of accounting for such heteroscedasticity in the data, the benefits of using such a method with all of the data versus retaining only good data, and comparisons are made against other PCA methods established in the literature like PCA, Robust PCA (RPCA), and HePPCAT. Code available at https://github.com/javiersc1/ALPCAH Comment: This article has been accepted for publication in the Fourteenth International Conference on Sampling Theory and Applications, accessible via IEEE XPlore. See DOI section

    Streaming Probabilistic PCA for Missing Data with Heteroscedastic Noise

    Streaming principal component analysis (PCA) is an integral tool in large-scale machine learning for rapidly estimating low-dimensional subspaces of very high dimensional and high arrival-rate data with missing entries and corrupting noise. However, modern trends increasingly combine data from a variety of sources, meaning they may exhibit heterogeneous quality across samples. Since standard streaming PCA algorithms do not account for non-uniform noise, their subspace estimates can quickly degrade. On the other hand, the recently proposed Heteroscedastic Probabilistic PCA Technique (HePPCAT) addresses this heterogeneity, but it was not designed to handle missing entries and streaming data, nor does it adapt to non-stationary behavior in time series data. This paper proposes the Streaming HeteroscedASTic Algorithm for PCA (SHASTA-PCA) to bridge this divide. SHASTA-PCA employs a stochastic alternating expectation maximization approach that jointly learns the low-rank latent factors and the unknown noise variances from streaming data that may have missing entries and heteroscedastic noise, all while maintaining a low memory and computational footprint. Numerical experiments validate the superior subspace estimation of our method compared to state-of-the-art streaming PCA algorithms in the heteroscedastic setting. Finally, we illustrate SHASTA-PCA applied to highly-heterogeneous real data from astronomy. Comment: 19 pages, 6 figures

    Thiourea Derivative of 2-[(1R)-1-Aminoethyl]phenol: A Flexible Pocket-like Chiral Solvating Agent (CSA) for the Enantiodifferentiation of Amino Acid Derivatives by NMR Spectroscopy

    Thiourea derivatives of 2-[(1R)-1-aminoethyl]phenol, (1S,2R)-1-amino-2,3-dihydro-1H-inden-2-ol, (1R,2R)-(1S,2R)-1-amino-2,3-dihydro-1H-inden-2-ol, and (R)-1-phenylethanamine have been compared as chiral solvating agents (CSAs) for the enantiodiscrimination of derivatized amino acids using nuclear magnetic resonance (NMR) spectroscopy. The thiourea derivative prepared by reacting 2-[(1R)-1-aminoethyl]phenol with benzoyl isothiocyanate constitutes an effective CSA for the enantiodiscrimination of N-3,5-dinitrobenzoyl (DNB) derivatives of amino acids with free or derivatized carboxyl functions. A base additive (1,4-diazabicyclo[2.2.2]octane (DABCO)/N,N-dimethylpyridin-4-amine (DMAP)/NBu4OH) is required both to solubilize amino acid derivatives with free carboxyl groups in CDCl3 and to mediate their interaction with the chiral auxiliary to attain efficient differentiation of the NMR signals of enantiomeric substrates. For ternary systems CSA/substrate/DABCO, the chiral discrimination mechanism has been ascertained through the NMR determination of complexation stoichiometry, association constants, and stereochemical features of the diastereomeric solvates.

    Renewable Resources for Enantiodiscrimination: Chiral Solvating Agents for NMR Spectroscopy from Isomannide and Isosorbide

    A new family of chiral selectors was synthesized in a single synthetic step with yields up to 84% starting from isomannide and isosorbide. Mono- or disubstituted carbamate derivatives were obtained by reacting the isohexides with electron-donating arylisocyanate (3,5-dimethylphenyl- or 3,5-dimethoxyphenyl-) and electron-withdrawing arylisocyanate (3,5-bis(trifluoromethyl)phenyl-) groups to test opposite electronic effects on enantiodifferentiation. Deeper chiral pockets and derivatives with more acidic protons were obtained by derivatization with 1-naphthylisocyanate and p-toluenesulfonylisocyanate, respectively. All compounds were tested as chiral solvating agents (CSAs) in 1H NMR experiments with rac-N-3,5-dinitrobenzoylphenylglycine methyl ester in order to determine the influence of different structural features on the enantiodiscrimination capabilities. Some selected compounds were tested with other racemic analytes, still leading to enantiodiscrimination. The enantiodiscrimination conditions were then optimized for the best CSA/analyte couple. Finally, a 2D- and 1D-NMR study was performed employing the best performing CSA with the two enantiomers of the selected analyte, aiming to determine the enantiodiscrimination mechanism, the stoichiometry of interaction, and the complexation constant.

    2-Methyl-β-cyclodextrin grafted ammonium chitosan: synergistic effects of cyclodextrin host and polymer backbone in the interaction with amphiphilic prednisolone phosphate salt as revealed by NMR spectroscopy

    Reduced molecular weight chitosan was quaternized with 2-chloro-N,N-diethylethylamine to obtain a water soluble derivative (N+-rCh). Methylated-β-cyclodextrin (MCD), with 0.5 molar substitution, was covalently linked to N+-rCh through a 1,6-hexamethylene diisocyanate spacer to give the derivatized ammonium chitosan N+-rCh-MCD. To shed light on the role of the cyclodextrin pendant in guiding binding interactions with amphiphilic active ingredients, corticosteroid prednisolone phosphate salt (PN) was considered. The deep inclusion of PN into cyclodextrin in the PN/MCD model system was pointed out by analysis of 1H NMR complexation shifts, 1D ROESY spectra, and diffusion measurements (DOSY). By using proton selective relaxation rate measurements as an investigation tool, the superior affinity of N+-rCh-MCD towards PN was demonstrated in comparison with the parent ammonium chitosan N+-rCh.