Optimally Weighted PCA for High-Dimensional Heteroscedastic Data
Modern applications increasingly involve high-dimensional and heterogeneous
data, e.g., datasets formed by combining numerous measurements from myriad
sources. Principal Component Analysis (PCA) is a classical method for reducing
dimensionality by projecting such data onto a low-dimensional subspace
capturing most of their variation, but PCA does not robustly recover underlying
subspaces in the presence of heteroscedastic noise. Specifically, PCA suffers
from treating all data samples as if they are equally informative. This paper
analyzes a weighted variant of PCA that accounts for heteroscedasticity by
giving samples with larger noise variance less influence. The analysis provides
expressions for the asymptotic recovery of underlying low-dimensional
components from samples with heteroscedastic noise in the high-dimensional
regime, i.e., for sample dimension on the order of the number of samples.
Surprisingly, it turns out that whitening the noise by using inverse noise
variance weights is suboptimal. We derive optimal weights, characterize the
performance of weighted PCA, and consider the problem of optimally collecting
samples under budget constraints. Comment: 52 pages, 13 figures
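The weighting idea in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not state the optimal weights, so the example uses inverse noise variance weights, which the abstract itself describes as suboptimal. The function names `weighted_pca` and `subspace_recovery`, the synthetic data, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def weighted_pca(Y, weights, k):
    """Top-k eigenvectors of the weighted sample covariance.

    Y       : (d, n) array with one sample per column
    weights : (n,) nonnegative per-sample weights
    k       : number of components to keep
    """
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Weighted covariance (1/n) * sum_i w_i * y_i y_i^T; data assumed zero-mean.
    C = (Y * w) @ Y.T / Y.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    return eigvecs[:, order]

def subspace_recovery(U_hat, U):
    """Normalized squared overlap between two orthonormal bases (1 = perfect)."""
    return float(np.linalg.norm(U_hat.T @ U) ** 2 / U.shape[1])

# Synthetic example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
d, n, k = 50, 400, 1
u = rng.standard_normal((d, 1))
u /= np.linalg.norm(u)
# Half the samples are clean (variance 0.1), half are noisy (variance 5.0).
noise_var = np.where(np.arange(n) < n // 2, 0.1, 5.0)
Y = u @ rng.standard_normal((1, n)) + rng.standard_normal((d, n)) * np.sqrt(noise_var)

U_plain = weighted_pca(Y, np.ones(n), k)          # ordinary (unweighted) PCA
U_w = weighted_pca(Y, 1.0 / noise_var, k)         # inverse noise variance weights
rec_plain = subspace_recovery(U_plain, u)
rec_w = subspace_recovery(U_w, u)
print(f"unweighted recovery: {rec_plain:.3f}, weighted recovery: {rec_w:.3f}")
```

Setting all weights to one recovers ordinary PCA, so the same function serves as its own baseline.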
Towards a Theoretical Analysis of PCA for Heteroscedastic Data
Principal Component Analysis (PCA) is a method for estimating a subspace
given noisy samples. It is useful in a variety of problems ranging from
dimensionality reduction to anomaly detection and the visualization of high
dimensional data. PCA performs well in the presence of moderate noise and even
with missing data, but is also sensitive to outliers. PCA is also known to have
a phase transition when noise is independent and identically distributed;
recovery of the subspace sharply declines at a threshold noise variance.
Effective use of PCA requires a rigorous understanding of these behaviors. This
paper provides a step towards an analysis of PCA for samples with
heteroscedastic noise, that is, samples that have non-uniform noise variances
and so are no longer identically distributed. In particular, we provide a
simple asymptotic prediction of the recovery of a one-dimensional subspace from
noisy heteroscedastic samples. The prediction enables: a) easy and efficient
calculation of the asymptotic performance, and b) qualitative reasoning to
understand how PCA is impacted by heteroscedasticity (such as outliers). Comment: Presented at the 54th Annual Allerton Conference on Communication,
Control, and Computing (Allerton
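The abstract does not give the asymptotic prediction itself, so it is not reproduced here. As a hedged companion, the following Monte Carlo sketch estimates empirically how well unweighted PCA recovers a one-dimensional subspace for a given set of per-sample noise variances. The function name `empirical_recovery`, the Gaussian data model, and all default parameter values are assumptions for illustration.

```python
import numpy as np

def empirical_recovery(noise_vars, d=100, theta=1.0, trials=20, seed=0):
    """Average squared inner product between the true direction u and the
    first principal component, over random draws.

    noise_vars : (n,) per-sample noise variances (heteroscedastic if unequal)
    d          : sample dimension
    theta      : signal amplitude
    """
    rng = np.random.default_rng(seed)
    noise_std = np.sqrt(np.asarray(noise_vars, dtype=float))
    n = noise_std.size
    total = 0.0
    for _ in range(trials):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        z = rng.standard_normal(n)                      # per-sample coefficients
        # Each column is one sample: signal plus noise scaled by that sample's std.
        Y = theta * np.outer(u, z) + rng.standard_normal((d, n)) * noise_std
        # Leading left singular vector = first principal component (no centering).
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        total += float(u @ U[:, 0]) ** 2
    return total / trials

n = 500
low = empirical_recovery(np.full(n, 0.1))     # uniformly low noise
high = empirical_recovery(np.full(n, 10.0))   # uniformly high noise
# 80% clean samples plus 20% very noisy samples, a crude stand-in for outliers.
mixed = empirical_recovery(np.r_[np.full(400, 0.1), np.full(100, 10.0)])
print(f"low: {low:.3f}, mixed: {mixed:.3f}, high: {high:.3f}")
```

Sweeping a uniform noise variance between the low and high settings is one way to observe the sharp decline in recovery that the abstract refers to as a phase transition.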
Xylem plasticity in Pinus pinaster and Quercus ilex growing at sites with different water availability in the Mediterranean region: relations between Intra-Annual Density Fluctuations and environmental conditions.
Fluctuations in climatic conditions during the growing season are recorded in Mediterranean tree-rings and often result in intra-annual density fluctuations (IADFs). Dendroecology and quantitative wood anatomy analyses were used to characterize the relations between the variability of IADF traits and climatic drivers in Pinus pinaster Aiton and Quercus ilex L. growing at sites with different water availability on Elba Island in Central Italy. Our results showed that both species present high xylem plasticity resulting in the formation of L-type IADFs (L-IADFs), consisting of earlywood-like cells in latewood. The occurrence of such IADFs was linked to rain events following periods of summer drought. The formation of L-IADFs in both species increased the hydraulic conductivity late in the growing season, due to their larger lumen area in comparison to "true latewood". The two species expressed greater similarity under arid conditions, as unfavorable climates constrained trait variation. Wood density, measured as the percentage of cell walls over total xylem area, IADF frequency, as well as conduit lumen area and vessel frequency, specifically in the hardwood species, proved to be efficient proxies to encode climate signals recorded in the xylem. The response of these anatomical traits to climatic variations was found to be species- and site-specific.
HeMPPCAT: Mixtures of Probabilistic Principal Component Analysers for Data with Heteroscedastic Noise
Mixtures of probabilistic principal component analysis (MPPCA) is a
well-known mixture model extension of principal component analysis (PCA).
Similar to PCA, MPPCA assumes the data samples in each mixture contain
homoscedastic noise. However, datasets with heterogeneous noise across samples
are becoming increasingly common, as larger datasets are generated by
collecting samples from several sources with varying noise profiles. The
performance of MPPCA is suboptimal for data with heteroscedastic noise across
samples. This paper proposes a heteroscedastic mixtures of probabilistic PCA
technique (HeMPPCAT) that uses a generalized expectation-maximization (GEM)
algorithm to jointly estimate the unknown underlying factors, means, and noise
variances under a heteroscedastic noise setting. Simulation results illustrate
the improved factor estimates and clustering accuracies of HeMPPCAT compared to
MPPCA.
ALPCAH: Sample-wise Heteroscedastic PCA with Tail Singular Value Regularization
Principal component analysis (PCA) is a key tool in the field of data
dimensionality reduction that is useful for various data science problems.
However, many applications involve heterogeneous data that varies in quality
due to noise characteristics associated with different sources of the data.
Methods that deal with this mixed dataset are known as heteroscedastic methods.
Current methods like HePPCAT make Gaussian assumptions of the basis
coefficients that may not hold in practice. Other methods such as Weighted PCA
(WPCA) assume the noise variances are known, which may be difficult to know in
practice. This paper develops a PCA method that can estimate the sample-wise
noise variances and use this information in the model to improve the estimate
of the subspace basis associated with the low-rank structure of the data. This
is done without distributional assumptions of the low-rank component and
without assuming the noise variances are known. Simulations show the
effectiveness of accounting for such heteroscedasticity in the data and the
benefits of using such a method with all of the data versus retaining only good
data. Comparisons are made against other PCA methods established in the
literature, such as PCA, Robust PCA (RPCA), and HePPCAT. Code is available at
https://github.com/javiersc1/ALPCAH. Comment: This article has been accepted for publication in the Fourteenth
International Conference on Sampling Theory and Applications, accessible via
IEEE XPlore. See DOI section
Streaming Probabilistic PCA for Missing Data with Heteroscedastic Noise
Streaming principal component analysis (PCA) is an integral tool in
large-scale machine learning for rapidly estimating low-dimensional subspaces
of very high dimensional and high arrival-rate data with missing entries and
corrupting noise. However, modern trends increasingly combine data from a
variety of sources, meaning they may exhibit heterogeneous quality across
samples. Since standard streaming PCA algorithms do not account for non-uniform
noise, their subspace estimates can quickly degrade. On the other hand, the
recently proposed Heteroscedastic Probabilistic PCA Technique (HePPCAT)
addresses this heterogeneity, but it was not designed to handle missing entries
and streaming data, nor does it adapt to non-stationary behavior in time series
data. This paper proposes the Streaming HeteroscedASTic Algorithm for PCA
(SHASTA-PCA) to bridge this divide. SHASTA-PCA employs a stochastic alternating
expectation maximization approach that jointly learns the low-rank latent
factors and the unknown noise variances from streaming data that may have
missing entries and heteroscedastic noise, all while maintaining a low memory
and computational footprint. Numerical experiments validate the superior
subspace estimation of our method compared to state-of-the-art streaming PCA
algorithms in the heteroscedastic setting. Finally, we illustrate SHASTA-PCA
applied to highly heterogeneous real data from astronomy. Comment: 19 pages, 6 figures
Thiourea Derivative of 2-[(1R)-1-Aminoethyl]phenol: A Flexible Pocket-like Chiral Solvating Agent (CSA) for the Enantiodifferentiation of Amino Acid Derivatives by NMR Spectroscopy
Thiourea derivatives of 2-[(1R)-1-aminoethyl]phenol, (1S,2R)-1-amino-2,3-dihydro-1H-inden-2-ol, (1R,2R)-(1S,2R)-1-amino-2,3-dihydro-1H-inden-2-ol, and (R)-1-phenylethanamine have been compared as chiral solvating agents (CSAs) for the enantiodiscrimination of derivatized amino acids using nuclear magnetic resonance (NMR) spectroscopy. The thiourea derivative prepared by reacting 2-[(1R)-1-aminoethyl]phenol with benzoyl isothiocyanate constitutes an effective CSA for the enantiodiscrimination of N-3,5-dinitrobenzoyl (DNB) derivatives of amino acids with free or derivatized carboxyl functions. A base additive (1,4-diazabicyclo[2.2.2]octane (DABCO), N,N-dimethylpyridin-4-amine (DMAP), or NBu4OH) is required both to solubilize amino acid derivatives with free carboxyl groups in CDCl3 and to mediate their interaction with the chiral auxiliary to attain efficient differentiation of the NMR signals of enantiomeric substrates. For ternary systems CSA/substrate/DABCO, the chiral discrimination mechanism has been ascertained through the NMR determination of complexation stoichiometry, association constants, and stereochemical features of the diastereomeric solvates.
Renewable Resources for Enantiodiscrimination: Chiral Solvating Agents for NMR Spectroscopy from Isomannide and Isosorbide
A new family of chiral selectors was synthesized in a single synthetic step with yields up to 84% starting from isomannide and isosorbide. Mono- or disubstituted carbamate derivatives were obtained by reacting the isohexides with electron-donating arylisocyanate (3,5-dimethylphenyl- or 3,5-dimethoxyphenyl-) and electron-withdrawing arylisocyanate (3,5-bis(trifluoromethyl)phenyl-) groups to test opposite electronic effects on enantiodifferentiation. Deeper chiral pockets and derivatives with more acidic protons were obtained by derivatization with 1-naphthylisocyanate and p-toluenesulfonylisocyanate, respectively. All compounds were tested as chiral solvating agents (CSAs) in 1H NMR experiments with rac-N-3,5-dinitrobenzoylphenylglycine methyl ester in order to determine the influence of different structural features on the enantiodiscrimination capabilities. Some selected compounds were tested with other racemic analytes, still leading to enantiodiscrimination. The enantiodiscrimination conditions were then optimized for the best CSA/analyte couple. Finally, a 2D- and 1D-NMR study was performed employing the best performing CSA with the two enantiomers of the selected analyte, aiming to determine the enantiodiscrimination mechanism, the stoichiometry of interaction, and the complexation constant.
2-Methyl-β-cyclodextrin grafted ammonium chitosan: synergistic effects of cyclodextrin host and polymer backbone in the interaction with amphiphilic prednisolone phosphate salt as revealed by NMR spectroscopy
Reduced molecular weight chitosan was quaternized with 2-chloro-N,N-diethylethylamine to obtain a water soluble derivative (N+-rCh). Methylated-β-cyclodextrin (MCD), with 0.5 molar substitution, was covalently linked to N+-rCh through a 1,6-hexamethylene diisocyanate spacer to give the derivatized ammonium chitosan N+-rCh-MCD. To shed light on the role of the cyclodextrin pendant in guiding binding interactions with amphiphilic active ingredients, the corticosteroid prednisolone phosphate salt (PN) was considered. The deep inclusion of PN into cyclodextrin in the PN/MCD model system was shown by analysis of 1H NMR complexation shifts, 1D ROESY spectra, and diffusion measurements (DOSY). Using proton selective relaxation rate measurements as an investigation tool, the superior affinity of N+-rCh-MCD towards PN was demonstrated in comparison with the parent ammonium chitosan N+-rCh.