CenetBiplot: a new proposal of sparse and orthogonal biplots methods by means of elastic net CSVD
In this work, a new mathematical algorithm for sparse and orthogonal constrained biplots, called CenetBiplots, is proposed. Biplots provide a joint representation of the observations and variables of a multidimensional matrix in the same reference system; in this subspace, the relationships between them can be interpreted in terms of geometric elements. CenetBiplots projects a matrix onto a low-dimensional space generated simultaneously by sparse and orthogonal principal components. Sparsity is desired in order to select variables automatically, and orthogonality is necessary to preserve the geometrical properties that ensure the biplots' graphical interpretation. To this end, the present study focuses on two objectives: 1) the extension of constrained singular value decomposition to incorporate an elastic net sparsity constraint (CenetSVD), and 2) the implementation of CenetBiplots using CenetSVD. The usefulness of the proposed methodologies for analysing high-dimensional and low-dimensional matrices is shown. Our method is implemented in R and available for download from https://github.com/ananieto/SparseCenetMA.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was not supported by any grant. Open-access publication funded by the Consortium of University Libraries of Castilla y León (BUCLE), under Operational Programme 2014ES16RFOP009 FEDER 2014-2020 of Castilla y León, Action 20007-CL - Apoyo Consorcio BUCL
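The core idea, a singular value decomposition whose right singular vectors are shrunk towards sparsity while unit norm is restored at each step, can be sketched in a few lines. This is a generic soft-thresholded power iteration, not the authors' CenetSVD; the function names, the penalty value `lam`, and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
X -= X.mean(axis=0)  # column-centre, as in PCA-based biplots

def soft_threshold(v, lam):
    # Lasso-style shrinkage: pulls small entries exactly to zero
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_rank1_svd(X, lam=0.2, n_iter=50):
    """Alternating power iterations with an l1 shrinkage on the right
    singular vector; the renormalisation step plays the role of the
    l2 part of an elastic-net-type penalty (a sketch, not CenetSVD)."""
    v = np.linalg.svd(X, full_matrices=False)[2][0]
    for _ in range(n_iter):
        u = X @ v
        u /= np.linalg.norm(u)
        v = soft_threshold(X.T @ u, lam)
        if np.linalg.norm(v) > 0:
            v /= np.linalg.norm(v)
    d = u @ X @ v  # corresponding singular value
    return u, v, d

u, v, d = sparse_rank1_svd(X)
# Biplot coordinates: rows (observations) and columns (variables)
row_coords = d * u
col_coords = v
print("nonzero loadings:", np.count_nonzero(v), "of", v.size)
```

Repeating this on the deflated matrix `X - d * np.outer(u, v)` would give further sparse components; keeping them orthogonal is exactly the constraint the abstract emphasises.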
Interpretable and fast dimension reduction of multivariate data
The main objective of this thesis is to propose new techniques that simplify the interpretation of newly formed 'variables' or components while reducing the dimensionality of multivariate data. Most attention is given to the interpretation of principal components, although one chapter is devoted to that of factors in factor analysis. Sparse principal components are proposed, in which some of the component loadings are made exactly zero. One approach makes use of the idea of correlation biplots, where an orthogonal matrix of sparse loadings is obtained by computing the biplot factors of the product of the principal component loading matrix and functions of the component variances. Other approaches involve clustering of variables as a pre-processing step, so that sparse components are computed from the data or correlation matrix of each cluster. New clustering techniques are proposed for this purpose. In addition, a penalized varimax approach is proposed for simplifying the interpretation of factors in factor analysis, especially for factor solutions with considerably different sums of squares. This is done by adding a penalty term to the ordinary varimax criterion. Data sets of varying sizes, both synthetic and real, are used to illustrate the proposed methods, and the results are compared with those of existing ones. In the case of principal component analysis, the resulting sparse components are found to be more interpretable (sparser) and to explain a higher cumulative percentage of adjusted variance than their counterparts from other techniques. The penalized varimax approach helps find factor solutions with simple structures that are not revealed by the standard varimax solution. The proposed methods are simple to understand and involve fast algorithms compared to some of the existing methods. They contribute much to the interpretation of components in a reduced dimension while dealing with dimensionality reduction of multivariate data.
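A minimal sketch of the cluster-then-sparse-components idea described above: cluster the variables as a pre-processing step, then compute one principal component per cluster, so loadings are exactly zero outside each cluster. The greedy correlation-threshold clustering and all names below are illustrative stand-ins, not the thesis's actual algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two latent factors driving two disjoint blocks of three variables each
f = rng.standard_normal((200, 2))
X = np.hstack([f[:, [0]] @ np.ones((1, 3)), f[:, [1]] @ np.ones((1, 3))])
X += 0.3 * rng.standard_normal(X.shape)
X -= X.mean(axis=0)

# Step 1: greedy clustering of variables on |correlation| (a crude
# stand-in for the clustering pre-processing step)
R = np.abs(np.corrcoef(X, rowvar=False))
unassigned = list(range(X.shape[1]))
clusters = []
while unassigned:
    seed = unassigned.pop(0)
    members = [seed] + [j for j in unassigned if R[seed, j] > 0.5]
    unassigned = [j for j in unassigned if j not in members]
    clusters.append(members)

# Step 2: one principal component per cluster; loadings are exactly
# zero outside the cluster, so the loading matrix is sparse
loadings = np.zeros((X.shape[1], len(clusters)))
for k, members in enumerate(clusters):
    _, _, Vt = np.linalg.svd(X[:, members], full_matrices=False)
    loadings[members, k] = Vt[0]

print("clusters:", clusters)
print("zero entries in loadings:", int((loadings == 0).sum()))
```

Because the clusters are disjoint, the resulting components are trivially orthogonal in their loading pattern, which is what makes this pre-processing route attractive for interpretation.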
An exploratory data analysis method to reveal modular latent structures in high-throughput data
Background: Modular structures are ubiquitous across various types of biological networks. The study of network modularity can help reveal regulatory mechanisms in systems biology, evolutionary biology and developmental biology. Identifying putative modular latent structures from high-throughput data using exploratory analysis can help better interpret the data and generate new hypotheses. Unsupervised learning methods designed for global dimension reduction or clustering fall short of identifying modules with factors acting in linear combinations.
Results: We present an exploratory data analysis method named MLSA (Modular Latent Structure Analysis) to estimate modular latent structures, which can find co-regulative modules that involve non-coexpressive genes.
Conclusions: Through simulations and real-data analyses, we show that the method can recover modular latent structures effectively. In addition, the method also performed very well on data generated from sparse global latent factor models. The R code is available at http://userwww.service.emory.edu/~tyu8/MLSA/.
Contributions to Functional Data Analysis with Applications to Modeling Time Series and Panel Data
In Chapter 1 we propose a new perspective on modeling and forecasting electricity spot prices. Our approach is motivated by the data-generating process of electricity spot prices, which is well described by what is called the merit order model. The merit order model is a microeconomic model based on the assumption that spot prices on electricity exchanges are determined by the marginal generation costs of the last power plant that is required to cover demand. The resulting merit order curve reflects the increasing generation costs of the installed power plants. Correspondingly, we suggest interpreting hourly electricity spot prices as noisy discretization points of smooth price functions.
These price functions are modeled by a functional factor model (FFM) for which we discuss a two-step estimation procedure. The first step is a classical pre-smoothing step in order to estimate the single price functions from the noisy discretization points. The second step then aims for a robust estimation of a finite set of common basis functions from the pre-smoothed price functions. In doing this, we carefully consider the issue of finding an optimal smoothing parameter.
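The two-step procedure can be illustrated on synthetic daily curves: pre-smooth each day's noisy discretization points, then extract a small set of common basis functions from the smoothed curves. The polynomial pre-smoother and the plain SVD second step below are simplified stand-ins for the spline smoothing and robust estimation used in the FFM; the data are simulated, not electricity prices.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(24) / 23.0  # 24 hourly grid points per day
# Two "true" basis functions generating the daily curves
b1, b2 = np.sin(np.pi * t), np.cos(2 * np.pi * t)
scores = rng.standard_normal((60, 2))  # 60 days of random scores
prices = scores @ np.vstack([b1, b2]) + 0.2 * rng.standard_normal((60, 24))

# Step 1: pre-smoothing -- fit each day's 24 noisy points with a
# low-order polynomial (a crude stand-in for spline smoothing)
smooth = np.array([np.polyval(np.polyfit(t, day, 5), t) for day in prices])

# Step 2: estimate common basis functions from the pre-smoothed
# curves via SVD (functional principal components)
smooth -= smooth.mean(axis=0)
_, s, Vt = np.linalg.svd(smooth, full_matrices=False)
explained = s**2 / (s**2).sum()
print("variance explained by first two components:", explained[:2].sum())
```

With a rank-2 generating model, the first two estimated components should absorb nearly all variation, mirroring the idea of a finite set of common basis functions underlying the price curves.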
The presentation of our functional factor model concludes with an extensive forecast study which compares our FFM with alternative time series models that have been successfully applied in the literature on electricity spot prices. The forecast study clearly confirms the superior power of our functional factor model and the use of price functions as underlying structures of electricity spot prices in general.
A slightly modified version of Chapter 1 is forthcoming as a single-authored article in "The Annals of Applied Statistics"; see Liebl (2013).
Chapter 2 further discusses the problem of modeling electricity spot prices. On the one hand, we extend the concept of price functions introduced in Chapter 1 by using covariates. On the other hand, we give a deeper theoretical treatment of the multivariate nonparametric regression model involved, which is used as a tool for functional principal component analysis (FPCA).
We extend existing theoretical results with respect to FPCA for sparse functional data by considering the asymptotic bias and variance of the multivariate local linear estimator of the mean and the covariance functions. Here, we carefully consider the effects of between-correlations, which are caused by the time series context, and the effects of within-correlations, which are caused by the functional nature of the data.
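For intuition, a univariate version of the local linear estimator of a mean function might look as follows; the asymptotics discussed above concern its multivariate analogue. The Gaussian kernel, bandwidth, and test function are illustrative choices, not those of the chapter.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate of E[y | x = x0]: a weighted least
    squares fit of a line around x0, with Gaussian kernel weights
    of bandwidth h. The intercept is the fitted value at x0."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    Xd = np.vstack([np.ones_like(x), x - x0]).T
    W = np.diag(w)
    beta = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ y)
    return beta[0]

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(300)

grid = np.linspace(0.1, 0.9, 9)  # interior points, away from boundaries
fit = np.array([local_linear(g, x, y, h=0.05) for g in grid])
err = np.max(np.abs(fit - np.sin(2 * np.pi * grid)))
print("max abs error on interior grid:", err)
```

The bias-variance trade-off mentioned in the text is visible here: a larger `h` smooths away noise (variance) at the cost of flattening the sine curve (bias), and dependent observations would inflate the variance term relative to this i.i.d. sketch.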
In order to demonstrate the usefulness of our model, we analyze the effects of Germany's nuclear moratorium of March 14, 2011. This event constitutes a natural experiment, since in the course of the moratorium eight nuclear power plants were phased out [Nestle (2012)]. The data set analyzed in Chapter 2 covers exactly one year before and one year after Germany's nuclear power phase-out. We apply our model separately to these two time spans in order to contrast the different market situations.
In Chapter 3 we pick up the successful application of FDA within the literature on panel data models. Recent panel data models allow us to control for complex unobserved heterogeneity effects through the incorporation of latent factor models. This new class of panel data models extends the classical concept of individual random (scalar) effects to random processes or random functions [see, e.g., Bai, Kao and Ng (2009), Bai (2009), and Kneip, Sickles and Sond (2012)].
Even though this class of panel models is of high relevance for practical problems such as stochastic frontier analysis, these models are still rarely applied in the empirical literature. Our implementation of these methods in the statistical software package phtt provides a first step towards facilitating their application.
As the estimation procedure of Kneip, Sickles and Sond (2012) involves nonparametric smoothing methods, the choice of a reliable procedure for finding an optimal smoothing parameter is most important for implementing the estimation procedure in a statistical software package. We consider this problem and suggest using the technique of "parameter cascading" in order to approximate an upper bound for the optimal smoothing parameter [see also Cao and Ramsay (2010)].
The final optimal smoothing parameter lies somewhere between this approximated upper bound and zero. Knowledge of this interval allows for a robust implementation of the computationally costly cross validation criterion.
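The practical recipe, searching a cross-validation or GCV criterion only on the interval between zero and an approximated upper bound, can be sketched with a simple penalized smoother. The polynomial basis, the second-difference penalty, and the upper bound `lam_upper` below are illustrative assumptions, not the phtt implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 80)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(80)

# Polynomial basis with a second-difference roughness penalty on the
# fitted values (a discretised smoothing-spline-type penalty)
B = np.vander(x, 8, increasing=True)
D = np.diff(np.eye(80), n=2, axis=0)  # second-difference operator
P = B.T @ D.T @ D @ B

def gcv(lam):
    # Hat matrix of the penalised fit; GCV trades off residual fit
    # against effective degrees of freedom tr(H)
    H = B @ np.linalg.solve(B.T @ B + lam * P, B.T)
    resid = y - H @ y
    edf = np.trace(H)
    return (resid @ resid / 80) / (1 - edf / 80) ** 2

# Search only inside (0, lam_upper]: an approximated upper bound
# keeps the costly criterion evaluations to a short grid
lam_upper = 10.0
grid = lam_upper * 10.0 ** np.arange(-6, 1, dtype=float)
best = min(grid, key=gcv)
print("GCV-optimal lambda on the bounded grid:", best)
```

Knowing the upper bound shrinks the search interval, which is exactly why it makes a full cross-validation sweep computationally tolerable.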
A slightly modified version of Chapter 3 has been accepted as a co-authored article in the "Journal of Statistical Software"; see Bada and Liebl (2013).
Multiproduct Pricing in Major League Baseball: A Principal Components Analysis
The empirical analysis of multiproduct pricing suffers from a lack of clear theoretical guidance and appropriate data, limitations which often render traditional regression-based analyses impractical. This paper analyzes ticket, parking, and concession pricing in Major League Baseball for the period 1991-2003 using a new methodology based on principal components, which allows inferences to be formed about the factors underlying price variation without strong theoretical guidance or abundant information about costs and demand. While general demand shifts are the most important factor, they explain only half of overall price variation. Also important are price interactions that derive from demand interrelationships between goods and the desire to maximize the capture of consumer surplus in the presence of heterogeneous demand.
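As a toy illustration of the methodology (not the paper's data), principal components of a standardized teams-by-prices panel split overall price variation into shares attributable to common and idiosyncratic factors. The panel below is simulated with a single dominant demand factor; all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical panel: 30 teams x 3 prices (ticket, parking, concession),
# driven mostly by one common demand factor plus idiosyncratic noise
demand = rng.standard_normal(30)
prices = np.outer(demand, [1.0, 0.6, 0.8]) + 0.3 * rng.standard_normal((30, 3))

# Standardize, then eigendecompose the correlation matrix
Z = (prices - prices.mean(axis=0)) / prices.std(axis=0)
evals, evecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
share = evals[::-1] / evals.sum()  # variance share per component, descending
print("share of price variation by component:", np.round(share, 2))
```

The first component's share plays the role of the "general demand shifts" factor in the paper's decomposition; the remaining components capture price interactions across the goods.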
Sparsity in partial least squares regression models
Data sets with multiple responses and multiple predictor variables are increasingly common. Such data sets often exhibit near multicollinearity, and the traditional ordinary least squares (OLS) regression method does not perform well in this setting because the mean squared error of the OLS regression coefficients will be large and prediction performance will be poor. This drawback of OLS is often handled by using well-known dimension reduction methods; the focus in this thesis is Partial Least Squares (PLS).
The following contributions are made in the thesis: (a) Introduce relevant components (RC) models characterized by restrictions on the joint covariance matrix of the response and predictor variables, and show that the univariate (single-response) version of the RC model can be represented as a Krylov model. These representations shed more light on the understanding of PLS. Also, PLS algorithms are reviewed and presented as estimators of the RC models. (b) Unify various multiple-response regression models under the framework of the RC models, and review some multiple-response PLS methods. In addition, simulation studies are carried out to compare the prediction performance of multivariate PLS (PLS2) methods. (c) Propose novel sparse multivariate PLS (SPLS2) methods for parameter estimation and variable selection, which offer more flexibility compared to known SPLS2 methods, and compare the novel methods against methods from the literature in terms of prediction performance and accuracy of variable selection. (d) Apply the PLS regression methods to a proteomics data set to predict the severity of systemic sclerosis and identify candidate markers. Furthermore, compare the PLS, SPLS and OLS methods with regard to predictive ability using the proteomics data.
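A compact sketch of single-response PLS in the NIPALS style, the building block that the sparse (SPLS2) proposals extend. The deflation scheme shown is the textbook algorithm; the simulated near-collinear design and all names are illustrative assumptions, not the thesis's data or code.

```python
import numpy as np

def pls1(X, y, n_comp=2):
    """Single-response PLS (NIPALS-style deflation): each component
    direction maximises covariance with the response."""
    X, y = X - X.mean(axis=0), y - y.mean()
    Xd = X.copy()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xd.T @ y              # covariance-maximising direction
        w = w / np.linalg.norm(w)
        t = Xd @ w                # score vector
        p = Xd.T @ t / (t @ t)    # X loadings
        q.append(y @ t / (t @ t))  # y loading
        W.append(w)
        P.append(p)
        Xd = Xd - np.outer(t, p)  # deflate X
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)  # regression coefficients

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(100)  # near-collinear pair
true_beta = np.zeros(10)
true_beta[0], true_beta[2] = 1.0, -0.5
y = X @ true_beta + 0.1 * rng.standard_normal(100)

beta = pls1(X, y, n_comp=3)
Xc = X - X.mean(axis=0)
pred_err = float(np.mean((Xc @ (beta - true_beta)) ** 2))
print("MSE of PLS predictions against the noise-free signal:", pred_err)
```

Unlike OLS, the fit stays stable despite the near-collinear predictor pair, because the components are built from a few covariance-dominant directions rather than from inverting an ill-conditioned covariance matrix. A sparse variant would additionally shrink small entries of each `w` to exact zeros to select variables.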