Taming Wild High Dimensional Text Data with a Fuzzy Lash
The bag of words (BOW) represents a corpus in a matrix whose elements are the
frequency of words. However, each row in the matrix is a very high-dimensional
sparse vector. Dimension reduction (DR) is a popular method to address sparsity
and high-dimensionality issues. Among the different strategies for developing DR
methods, Unsupervised Feature Transformation (UFT) is a popular strategy that maps
all words onto a new basis to represent the BOW. The recent growth of text data and
its challenges imply that the DR area still needs new perspectives. Although a wide
range of methods based on the UFT strategy has been developed, the fuzzy
approach has not been considered for DR based on this strategy. This research
investigates the application of fuzzy clustering as a DR method based on the
UFT strategy to collapse the BOW matrix and provide a lower-dimensional
representation of documents instead of the words in a corpus. The quantitative
evaluation shows that fuzzy clustering produces superior performance and
features compared with Principal Component Analysis (PCA) and Singular Value
Decomposition (SVD), two popular DR methods based on the UFT strategy.
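To make the idea in the abstract above concrete, here is a minimal sketch of fuzzy c-means applied to a toy bag-of-words matrix, where each document's cluster membership degrees serve as its low-dimensional representation. The function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the toy counts, and the choice of two clusters are all assumptions made for illustration; the paper's actual algorithm variant and settings are not given in the abstract.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means (illustrative sketch, not the paper's exact method).

    Returns the membership matrix U (n_samples x n_clusters, rows sum to 1)
    and the cluster centers.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # each row is a membership distribution
    for _ in range(n_iter):
        W = U ** m                             # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every document to every center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)         # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers

# Hypothetical toy corpus: 4 documents x 6 vocabulary terms (raw counts).
bow = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
], dtype=float)

# Each document is now described by 2 membership degrees instead of 6 counts.
memberships, centers = fuzzy_c_means(bow, n_clusters=2)
```

The rows of `memberships` play the role that projected coordinates play in PCA or SVD: they are the reduced document features that a downstream model would consume.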
Singular Value Decomposition for High Dimensional Data
Singular value decomposition is a widely used tool for dimension reduction in multivariate analysis. However, when used for statistical estimation in high-dimensional low-rank matrix models, the singular vectors of the noise-corrupted matrix are inconsistent estimators of their counterparts in the true mean matrix. We suppose the true singular vectors have sparse representations in a certain basis. We propose an iterative thresholding algorithm that estimates the subspaces spanned by the leading left and right singular vectors, as well as the true mean matrix, optimally under a Gaussian assumption. We further turn the algorithm into a practical methodology that is fast, data-driven and robust to heavy-tailed noise. Simulations and a real data example further show its competitive performance. The dissertation contains two chapters. For ease of exposition, Chapter 1 is dedicated to the description and study of the practical methodology, and Chapter 2 states and proves the theoretical properties of the algorithm under Gaussian noise.
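As a rough illustration of what "iterative thresholding" can look like for a sparse SVD, the sketch below alternates power-iteration steps with soft thresholding to estimate a single sparse singular pair. It is a generic textbook-style scheme, not the dissertation's algorithm: the function names, the fixed thresholds `lam_u` and `lam_v`, the rank-one restriction, and the assumption that sparsity holds in the standard basis are all choices made here for the example.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink each entry toward zero by lam, setting small entries exactly to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_rank1_svd(X, lam_u, lam_v, n_iter=50):
    """Alternating thresholded power iteration for one sparse singular pair (sketch)."""
    # Start from the ordinary leading right singular vector.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v = Vt[0]
    for _ in range(n_iter):
        u = soft_threshold(X @ v, lam_u)
        if not u.any():
            raise ValueError("lam_u is too large: every entry was thresholded to zero")
        u /= np.linalg.norm(u)
        v = soft_threshold(X.T @ u, lam_v)
        if not v.any():
            raise ValueError("lam_v is too large: every entry was thresholded to zero")
        v /= np.linalg.norm(v)
    sigma = u @ X @ v
    return u, sigma, v

# Hypothetical test matrix: rank-one sparse signal plus small Gaussian noise.
rng = np.random.default_rng(0)
u0 = np.zeros(30)
u0[:3] = 1 / np.sqrt(3)
v0 = np.zeros(20)
v0[:3] = 1 / np.sqrt(3)
X = 20.0 * np.outer(u0, v0) + 0.1 * rng.standard_normal((30, 20))

u_hat, sigma_hat, v_hat = sparse_rank1_svd(X, lam_u=1.0, lam_v=1.0)
```

In this toy setting the thresholds sit well above the noise level and well below the signal, so the estimated vectors are supported on the first three coordinates only. A data-driven choice of thresholds, as the abstract describes, would replace the hand-picked values.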
SOFARI: High-Dimensional Manifold-Based Inference
Multi-task learning is a widely used technique for harnessing information
from various tasks. Recently, the sparse orthogonal factor regression (SOFAR)
framework, based on the sparse singular value decomposition (SVD) within the
coefficient matrix, was introduced for interpretable multi-task learning,
enabling the discovery of meaningful latent feature-response association
networks across different layers. However, conducting precise inference on the
latent factor matrices has remained challenging due to orthogonality
constraints inherited from the sparse SVD constraint. In this paper, we suggest
a novel approach called high-dimensional manifold-based SOFAR inference
(SOFARI), drawing on the Neyman near-orthogonality inference while
incorporating the Stiefel manifold structure imposed by the SVD constraints. By
leveraging the underlying Stiefel manifold structure, SOFARI provides
bias-corrected estimators for both latent left factor vectors and singular
values, which we show to enjoy asymptotic mean-zero normal
distributions with estimable variances. We introduce two SOFARI variants to
handle strongly and weakly orthogonal latent factors, where the latter covers a
broader range of applications. We illustrate the effectiveness of SOFARI and
justify our theoretical results through simulation examples and a real data
application in economic forecasting. Comment: 114 pages, 2 figures
CenetBiplot: a new proposal of sparse and orthogonal biplots methods by means of elastic net CSVD
[EN] In this work, a new mathematical algorithm for sparse and orthogonal constrained biplots, called CenetBiplots, is proposed. Biplots provide a joint representation of the observations and variables of a multidimensional matrix in the same reference system. In this subspace the relationships between them can be interpreted in terms of geometric elements. CenetBiplots projects a matrix onto a low-dimensional space generated simultaneously by sparse and orthogonal principal components. Sparsity is desired to select variables automatically, and orthogonality is necessary to keep the geometrical properties that ensure the graphical interpretation of the biplots. To this end, the present study focuses on two objectives: 1) the extension of constrained singular value decomposition to incorporate an elastic net sparse constraint (CenetSVD), and 2) the implementation of CenetBiplots using CenetSVD. The usefulness of the proposed methodologies for analysing high-dimensional and low-dimensional matrices is shown. Our method is implemented in R software and available for download from
https://github.com/ananieto/SparseCenetMA. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was not supported by any grant. Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), charged to Operational Programme 2014ES16RFOP009 FEDER 2014-2020 DE CASTILLA Y LEÓN, Action: 20007-CL - Apoyo Consorcio BUCL
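For readers unfamiliar with biplots, the joint representation of observations and variables described in the CenetBiplot abstract can be sketched with an ordinary (dense, unconstrained) SVD. This is the classical biplot only, not CenetSVD: no elastic net penalty or sparsity constraint is applied, and the function name `classical_biplot`, the scaling parameter `alpha`, and the random data are assumptions made for illustration. The authors' own implementation is the R package at the URL above.

```python
import numpy as np

def classical_biplot(X, k=2, alpha=0.5):
    """Row and column markers of a classical SVD biplot (illustrative sketch).

    The column-centered matrix is factored as U diag(s) V^T; observations are
    plotted at U_k diag(s_k)^alpha and variables at V_k diag(s_k)^(1 - alpha),
    so inner products of markers approximate the centered data.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    row_markers = U[:, :k] * s[:k] ** alpha
    col_markers = Vt[:k].T * s[:k] ** (1.0 - alpha)
    return row_markers, col_markers

# Hypothetical data: 6 observations on 3 variables.
rng = np.random.default_rng(0)
data = rng.standard_normal((6, 3))

# With k equal to the number of variables the reconstruction is exact;
# a plotted biplot would use k=2 and keep only the best rank-2 approximation.
rows_full, cols_full = classical_biplot(data, k=3)
rows_2d, cols_2d = classical_biplot(data, k=2)
```

A sparse variant would replace the dense right singular vectors with ones in which many loadings are exactly zero, so that only a subset of variables appears in the plot.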