11,041 research outputs found
A two-step regression method with connections to partial least squares and the growth curve model
Prediction of a continuous response variable from background data is considered. The independent prediction variable data may have a collinear structure and comprise group effects. A new two-step regression method inspired by PLS (partial least squares regression) is proposed. The proposed method couples a novel application of the Cayley-Hamilton theorem with a two-step estimation procedure. In the two-step approach, the first step summarizes the information in the predictors via a bilinear model. The bilinear model has a Krylov-structured within-individuals design matrix, which is closely linked to PLS, and a between-individuals design matrix, which allows the model to handle complex structures, e.g. group effects. The second step is the prediction step, where conditional expectation is used. The close relation between the two-step method and PLS gives new insight into PLS; i.e. PLS can be considered as an algorithm for generating a Krylov-structured sequence to approximate the inverse of the covariance matrix of the predictors. Compared with classical PLS, the new two-step method is a non-algorithmic approach. The bilinear model used in the first step gives greater modelling flexibility than classical PLS. The proposed two-step method has been extended to handle grouped data, especially data with different mean levels and with nested mean structures. Correspondingly, the new two-step method uses bilinear models with a structure similar to that of the classical growth curve model and the extended growth curve model, but with design matrices which are unknown. Given that the covariance between the predictors and the response is known, the explicit maximum likelihood estimators (MLEs) for the dispersion and mean of the predictors have all been derived. Real silage spectra data have been used to justify and illustrate the two-step method.
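The abstract's reinterpretation of PLS, as a Krylov-structured approximation of the inverse predictor covariance applied to the predictor-response covariance, can be sketched numerically. The following minimal NumPy illustration is ours, not the authors' two-step estimator: a k-term Krylov space span{s, Ss, ..., S^(k-1)s} stands in for S^(-1)s, and a full-rank Krylov basis recovers the ordinary least-squares solution exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 5, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()

S = Xc.T @ Xc / n                 # covariance of the predictors
s = Xc.T @ yc / n                 # covariance between predictors and response

def krylov_beta(S, s, k):
    """Regression coefficient restricted to the Krylov space
    span{s, S s, ..., S^(k-1) s}; PLS with k components lives here."""
    K = np.column_stack([np.linalg.matrix_power(S, j) @ s for j in range(k)])
    return K @ np.linalg.solve(K.T @ S @ K, K.T @ s)

beta_ols = np.linalg.solve(S, s)          # the target S^{-1} s
beta_k = krylov_beta(S, s, k)             # k-term Krylov approximation

# A generically full-rank Krylov basis (k = p) recovers S^{-1} s exactly
assert np.allclose(krylov_beta(S, s, p), beta_ols, atol=1e-6)
# Already with k = 3 of 5 dimensions the approximation is close
assert np.linalg.norm(beta_k - beta_ols) < 0.5 * np.linalg.norm(beta_ols)
```

The first assertion mirrors the known result that PLS with the maximal number of components coincides with ordinary least squares.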
Functional Regression
Functional data analysis (FDA) involves the analysis of data whose ideal
units of observation are functions defined on some continuous domain, and the
observed data consist of a sample of functions taken from some population,
sampled on a discrete grid. Ramsay and Silverman's 1997 textbook sparked the
development of this field, which has accelerated in the past 10 years to become
one of the fastest growing areas of statistics, fueled by the growing number of
applications yielding this type of data. One unique characteristic of FDA is
the need to combine information both across and within functions, which Ramsay
and Silverman called replication and regularization, respectively. This article
will focus on functional regression, the area of FDA that has received the most
attention in applications and methodological development. First will be an
introduction to basis functions, key building blocks for regularization in
functional regression methods, followed by an overview of functional regression
methods, split into three types: [1] functional predictor regression
(scalar-on-function), [2] functional response regression (function-on-scalar)
and [3] function-on-function regression. For each, the role of replication and
regularization will be discussed and the methodological development described
in a roughly chronological manner, at times deviating from the historical
timeline to group together similar methods. The primary focus is on modeling
and methodology, highlighting the modeling structures that have been developed
and the various regularization approaches employed. The article closes with a
brief discussion of potential areas of future development in this field.
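Scalar-on-function regression, type [1] above, reduces after a basis expansion of the coefficient function to ordinary regression on basis scores. A minimal sketch on simulated data (our own hypothetical Fourier basis; truncating the basis is the crudest form of regularization, standing in for the richer penalties this article surveys):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 150, 50                        # n curves observed on an m-point grid
t = np.linspace(0.0, 1.0, m)
basis = np.array([np.sin((j + 1) * np.pi * t) for j in range(6)])  # (6, m)
X = rng.normal(size=(n, 6)) @ basis   # smooth functional predictors

beta_true = np.sin(2 * np.pi * t)     # true coefficient function
# scalar response: y_i = integral of X_i(t) * beta(t) dt + noise
y = X @ beta_true / m + 0.05 * rng.normal(size=n)

# Represent beta(t) in a small basis; the fit becomes ordinary least squares
B = basis.T                           # (m, 6) design of basis functions
Z = X @ B / m                         # basis scores: integrals of X_i * B_j
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
beta_hat = B @ coef                   # reconstructed coefficient function

assert np.mean((beta_hat - beta_true) ** 2) < 0.05
```

The replication across the n curves is what identifies coef; the basis truncation regularizes the otherwise ill-posed functional coefficient.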
Most Likely Transformations
We propose and study properties of maximum likelihood estimators in the class
of conditional transformation models. Based on a suitable explicit
parameterisation of the unconditional or conditional transformation function,
we establish a cascade of increasingly complex transformation models that can
be estimated, compared and analysed in the maximum likelihood framework. Models
for the unconditional or conditional distribution function of any univariate
response variable can be set-up and estimated in the same theoretical and
computational framework simply by choosing an appropriate transformation
function and parameterisation thereof. The ability to evaluate the distribution
function directly allows us to estimate models based on the exact likelihood,
especially in the presence of random censoring or truncation. For discrete and
continuous responses, we establish the asymptotic normality of the proposed
estimators. A reference software implementation of maximum likelihood-based
estimation for conditional transformation models allowing the same flexibility
as the theory developed here was employed to illustrate the wide range of
possible applications.Comment: Accepted for publication by the Scandinavian Journal of Statistics
2017-06-1
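For a continuous response without censoring, the exact log-likelihood of a transformation model F_Y(y) = Phi(h(y)) contains the Jacobian term log h'(y). A toy unconditional illustration with a linear transformation h(y) = a + b*y (our own minimal sketch; the paper's parameterisations are far more flexible monotone expansions, and its reference software implementation is, to our knowledge, in R):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, scale=2.0, size=500)

# Model: F_Y(y) = Phi(h(y)) with h(y) = a + b*y, b > 0.
# Log-likelihood of a continuous observation: log phi(h(y)) + log h'(y).
def negloglik(theta):
    a, log_b = theta
    b = np.exp(log_b)                 # enforce monotonicity h' = b > 0
    h = a + b * y
    return -np.sum(norm.logpdf(h) + log_b)

res = minimize(negloglik, x0=[0.0, 0.0])
a_hat, b_hat = res.x[0], np.exp(res.x[1])

# For Gaussian data, h(y) = (y - mu)/sigma, so b ~ 1/sigma and a ~ -mu/sigma
assert abs(1.0 / b_hat - y.std()) < 0.1
assert abs(-a_hat / b_hat - y.mean()) < 0.1
```

Because the distribution function is modelled directly, censored observations would enter this likelihood through differences of Phi(h(.)) rather than densities, which is the "exact likelihood" point made in the abstract.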
Sparse reduced-rank regression for imaging genetics studies: models and applications
We present a novel statistical technique, the sparse reduced-rank regression (sRRR)
model, a strategy for multivariate modelling of high-dimensional imaging responses and
genetic predictors. By adopting penalisation techniques, the model is able to enforce sparsity
in the regression coefficients, identifying subsets of genetic markers that best explain
the variability observed in subsets of the phenotypes. To properly exploit the rich structure
present in each of the imaging and genetics domains, we additionally propose the use of
several structured penalties within the sRRR model. Using simulation procedures that accurately
reflect realistic imaging genetics data, we present detailed evaluations of the sRRR
method in comparison with the more traditional univariate linear modelling approach. In
all settings considered, we show that sRRR possesses better power to detect the deleterious
genetic variants. Moreover, using a simple genetic model, we demonstrate the potential
benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to
extracting averages over regions of interest in the brain. Since this entails the use of phenotypic
vectors of enormous dimensionality, we suggest the use of a sparse classification
model as a de-noising step, prior to the imaging genetics study. Finally, we present the
application of a data re-sampling technique within the sRRR model for model selection.
Using this approach we are able to rank the genetic markers in order of importance of association
to the phenotypes, and similarly rank the phenotypes in order of importance to
the genetic markers. In the very end, we illustrate the application perspective of the proposed
statistical models in three real imaging genetics datasets and highlight some potential
associations
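The core mechanics, a low-rank coefficient matrix with sparsity enforced on the predictor weights, can be sketched as a rank-1 alternating least squares loop that soft-thresholds the genetic weights. This is a heuristic illustration on simulated data, not the paper's penalised estimators; all names and the thresholding rule are our assumptions.

```python
import numpy as np

def sparse_rrr_rank1(X, Y, lam=0.05, n_iter=50):
    """Rank-1 sparse reduced-rank regression sketch, B ~ u v',
    with soft-thresholding (lasso-style) on the predictor weights u."""
    v = np.linalg.svd(Y, full_matrices=False)[2][0]   # init response loading
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # regress Y v on X, then soft-threshold for sparsity in u
        u = np.linalg.lstsq(X, Y @ v, rcond=None)[0]
        u = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
        f = X @ u                                     # latent imaging factor
        if np.allclose(f, 0.0):
            break
        v = Y.T @ f
        v /= np.linalg.norm(v)                        # unit-norm response loading
    return u, v

rng = np.random.default_rng(3)
n, p, q = 300, 40, 10
X = rng.normal(size=(n, p))                 # "genetic markers"
u_true = np.zeros(p)
u_true[:3] = [2.0, -1.5, 1.0]               # only 3 markers truly active
v_true = rng.normal(size=q)
v_true /= np.linalg.norm(v_true)
Y = np.outer(X @ u_true, v_true) + 0.1 * rng.normal(size=(n, q))

u, v = sparse_rrr_rank1(X, Y)
assert np.all(np.abs(u[:3]) > 0.5)          # active markers retained
assert np.sum(np.abs(u[3:]) > 0) <= 2       # inactive markers zeroed out
```

Ranking markers by how often they survive thresholding across re-sampled fits mirrors, in spirit, the model-selection procedure described above.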
GAS DISTRICT COOLING LOAD DEMAND MODELLING
Gas District Cooling (GDC) is a co-generation plant owned by Universiti
Teknologi PETRONAS (UTP). The plant supplies electricity and chilled water to the
UTP campus. At present, no mathematical model is available for the GDC
application. As the sole customer of the plant, UTP's 2011 load demand data are used
to develop load demand models using exponential smoothing methods. The
methods produce several mathematical models that replicate the UTP 2011 load demand
pattern. The results obtained in the analysis address the variation of electricity
demand in the university, which is beneficial for the utility company and for
forecasting purposes. Winters' method is selected to characterize the mathematical
load demand model for UTP since it produced the lowest MAPE compared to
the Simple, Holt's and Holt-Winters exponential smoothing methods.
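As a rough illustration of the MAPE comparison across exponential smoothing variants, the sketch below contrasts simple exponential smoothing with Holt's linear-trend method on hypothetical trending load data (not the UTP data; a full Holt-Winters model would add a seasonal component on top of Holt's updates):

```python
import numpy as np

def simple_es(y, alpha=0.3):
    """One-step-ahead forecasts from simple exponential smoothing."""
    f = np.empty(len(y))
    f[0] = y[0]
    for t in range(1, len(y)):
        f[t] = alpha * y[t - 1] + (1 - alpha) * f[t - 1]
    return f

def holt(y, alpha=0.3, beta=0.1):
    """Holt's linear-trend method, one-step-ahead forecasts."""
    level, trend = y[0], 0.0
    f = np.empty(len(y))
    f[0] = y[0]
    for t in range(1, len(y)):
        f[t] = level + trend
        new_level = alpha * y[t] + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return f

def mape(y, f):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y - f) / y))

# Hypothetical load series with an upward trend (illustrative only)
rng = np.random.default_rng(4)
t = np.arange(200)
y = 50.0 + 0.5 * t + rng.normal(scale=1.0, size=200)

m_simple = mape(y[1:], simple_es(y)[1:])
m_holt = mape(y[1:], holt(y)[1:])
assert m_holt < m_simple      # the trend-aware method wins on trending data
```

Selecting the variant with the lowest MAPE, as done in the abstract, amounts to exactly this kind of head-to-head comparison on the observed series.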
Twenty years of P-splines
P-splines first appeared in the limelight twenty years ago. Since then they have become popular in applications and in theoretical work. The combination of a rich B-spline basis and a simple difference penalty lends itself well to a variety of generalizations, because it is based on regression. In effect, P-splines allow the building of a “backbone” for the “mixing and matching” of a variety of additive smooth structure components, while inviting all sorts of extensions: varying-coefficient effects, signal (functional) regressors, two-dimensional surfaces, non-normal responses, quantile (expectile) modelling, among others. Strong connections with mixed models and Bayesian analysis have been established. We give an overview of many of the central developments during the first two decades of P-splines.
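The combination described above, a rich B-spline basis plus a simple difference penalty on adjacent coefficients, fits in a few lines. A minimal sketch on simulated data (the knot count and smoothing parameter below are arbitrary choices of ours):

```python
import numpy as np

def bspline_basis(x, degree=3, nseg=10):
    """B-spline design matrix on [0, 1) over equally spaced knots,
    built with the Cox-de Boor recursion (the P-spline setting)."""
    h = 1.0 / nseg
    knots = np.arange(-degree, nseg + degree + 1) * h
    B = np.array([(knots[i] <= x) & (x < knots[i + 1])
                  for i in range(len(knots) - 1)], dtype=float).T
    for d in range(1, degree + 1):
        Bn = np.zeros((len(x), B.shape[1] - 1))
        for i in range(B.shape[1] - 1):
            left = (x - knots[i]) / (knots[i + d] - knots[i])
            right = (knots[i + d + 1] - x) / (knots[i + d + 1] - knots[i + 1])
            Bn[:, i] = left * B[:, i] + right * B[:, i + 1]
        B = Bn
    return B

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=150)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=150)

B = bspline_basis(x)                             # rich B-spline basis
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)     # second-order difference penalty
lam = 0.1                                        # smoothing parameter
coef = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)

assert np.allclose(B.sum(axis=1), 1.0)           # B-splines sum to one
assert np.mean((B @ coef - np.sin(2 * np.pi * x)) ** 2) < 0.02
```

Because the penalty acts on coefficient differences rather than derivatives, the whole fit stays a penalised regression, which is exactly why P-splines generalize so readily.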
Univariate and multivariate statistical approaches for the analyses of omics data: sample classification and two-block integration.
The wealth of information generated by high-throughput omics technologies in the
context of large-scale epidemiological studies has made a significant contribution to the identification of factors influencing the onset and progression of common diseases. Advanced computational and statistical modelling techniques are required to manipulate and extract meaningful biological information from these omics data, as several layers of complexity are associated with them. Recent research efforts have concentrated on the development of novel statistical and bioinformatic tools; however, studies thoroughly investigating the applicability and suitability of these novel methods in real data have often fallen behind. This thesis focuses on the analyses of proteomics and transcriptomics data from the EnviroGenoMarker project with the purpose of addressing two main research objectives:
i) to critically appraise established and recently developed statistical approaches
in their ability to appropriately accommodate the inherently complex nature
of real-world omics data and ii) to improve the current understanding of a prevalent
condition by identifying biological markers predictive of disease as well as possible
biological mechanisms leading to its onset. The specific disease endpoint of interest
corresponds to B-cell Lymphoma, a common haematological malignancy for which
many questions related to its aetiology remain unanswered.
The seven chapters comprising this thesis are structured in the following manner:
the first two correspond to introductory chapters where I describe the main omics
technologies and statistical methods employed for their analyses. The third chapter provides a description of the epidemiological project giving rise to the study population and the disease outcome of interest. These are followed by three results chapters that address the research aims described above by applying univariate and multivariate statistical approaches for sample classification and data integration purposes. A summary of findings, concluding general remarks and discussion of open problems offering potential avenues for future research are presented in the final chapter.