Recent advances in directional statistics
Mainstream statistical methodology is generally applicable to data observed
in Euclidean space. There are, however, numerous contexts of considerable
scientific interest in which the natural supports for the data under
consideration are Riemannian manifolds like the unit circle, torus, sphere and
their extensions. Typically, such data can be represented using one or more
directions, and directional statistics is the branch of statistics that deals
with their analysis. In this paper we provide a review of the many recent
developments in the field since the publication of Mardia and Jupp (1999),
still the most comprehensive text on directional statistics. Many of those
developments have been stimulated by interesting applications in fields as
diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics,
image analysis, text mining, environmetrics, and machine learning. We begin by
considering developments for the exploratory analysis of directional data
before progressing to distributional models, general approaches to inference,
hypothesis testing, regression, nonparametric curve estimation, methods for
dimension reduction, classification and clustering, and the modelling of time
series, spatial and spatio-temporal data. An overview of currently available
software for analysing directional data is also provided, and potential future
developments discussed.
Bayesian Semiparametric Generalizations of Linear Models Using Polya Trees
In a Bayesian framework, prior distributions on a space of nonparametric continuous distributions may be defined using Polya trees. This dissertation addresses statistical problems for which the Polya tree idea can be utilized to provide efficient and practical methodological solutions.
One problem considered is the estimation of risks, odds ratios, or other similar measures that are derived by specifying a threshold for an observed continuous variable. It has previously been shown that fitting a linear model to the continuous outcome under the assumption of a logistic error distribution leads to more efficient odds ratio estimates. We will show that deviations from the assumption of logistic errors can result in substantial bias in odds ratio estimates. A one-step approximation to the Savage-Dickey ratio will be presented as a Bayesian test of the distributional assumptions in the traditional logistic regression model. The approximation uses least-squares estimates in place of a full Bayesian Markov chain Monte Carlo simulation, and the equivalence of inferences based on the two implementations will be shown. A framework for flexible, semiparametric estimation of risks when the assumption of logistic errors is rejected will be proposed.
A second application deals with regression scenarios in which residuals are correlated and their distribution evolves over an ordinal covariate such as time. In the context of prediction, such complex error distributions need to be modeled carefully and flexibly. The proposed model introduces dependent but separate Polya tree priors for each time point, pooling information across time points to model gradual changes in distributional shape. Theoretical properties of the proposed model will be outlined, and its potential predictive advantages will be demonstrated in simulated scenarios and on real data.
Testing Statistical Hypotheses for Latent Variable Models and Some Computational Issues
In this dissertation, I address unorthodox statistical problems concerning goodness-of-fit tests
in the latent variable context and efficient statistical computations.
In epidemiological and biomedical studies observations with measurement errors are quite
common, especially when it is difficult to calibrate true signals accurately. In the first problem,
I develop a statistical test of the equality of two distributions when the observed contaminated
data follow the classical additive measurement error model. Standard two-sample homogeneity tests,
such as the Kolmogorov-Smirnov, Anderson-Darling, and von Mises tests, are not
consistent when observations are subject to measurement error. To develop a consistent test, first
the characteristic functions of unobservable true random variables are estimated from the contaminated
data, and then the test statistic is defined as the integrated difference between the two
estimated characteristic functions. It is shown that when the sample size is large and the null hypothesis
holds, the test statistic converges to an integral of a squared Gaussian process. However,
direct evaluation of this distribution to obtain the rejection region is not simple. Therefore, I propose a
bootstrap approach to compute the p-value of the test statistic. The operating characteristics of the
proposed test are assessed and compared with those of other approaches via extensive simulation studies.
The proposed method is then applied to analyze the National Health and Nutrition Examination
Survey (NHANES) dataset. Although estimation of regression parameters in the presence of
exposure measurement error has been studied before, this testing problem has not previously
been addressed.
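The deconvolution-based test described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes the measurement error is Gaussian with a known standard deviation sigma_u, restricts the integral to a finite grid, and uses a pooled-resampling bootstrap; the grid width t_max and all function names are choices made here.

```python
import numpy as np

def empirical_cf(w, t):
    # Empirical characteristic function of sample w, evaluated on grid t.
    return np.exp(1j * np.outer(t, w)).mean(axis=1)

def deconvolved_cf(w, t, sigma_u):
    # Estimate the CF of the latent true variable X from W = X + U under the
    # classical additive error model, assuming U ~ N(0, sigma_u^2) with known
    # sigma_u: phi_X(t) = phi_W(t) / phi_U(t), phi_U(t) = exp(-sigma_u^2 t^2 / 2).
    return empirical_cf(w, t) / np.exp(-0.5 * sigma_u**2 * t**2)

def cf_test_statistic(w1, w2, sigma_u, t_max=2.0, n_grid=201):
    # Integrated squared difference between the two deconvolved CFs,
    # approximated by a Riemann sum over a finite grid.
    t = np.linspace(-t_max, t_max, n_grid)
    diff = deconvolved_cf(w1, t, sigma_u) - deconvolved_cf(w2, t, sigma_u)
    return np.sum(np.abs(diff) ** 2) * (t[1] - t[0])

def bootstrap_pvalue(w1, w2, sigma_u, n_boot=500, rng=None):
    # Pooled-resampling bootstrap under H0: F1 = F2.
    rng = np.random.default_rng(rng)
    t_obs = cf_test_statistic(w1, w2, sigma_u)
    pooled = np.concatenate([w1, w2])
    stats = np.empty(n_boot)
    for b in range(n_boot):
        b1 = rng.choice(pooled, size=len(w1), replace=True)
        b2 = rng.choice(pooled, size=len(w2), replace=True)
        stats[b] = cf_test_statistic(b1, b2, sigma_u)
    return (1 + np.sum(stats >= t_obs)) / (1 + n_boot)
```

Dividing the empirical characteristic function of the contaminated sample by the known error characteristic function recovers an estimate of the latent variable's characteristic function, which is what makes the resulting test consistent where Kolmogorov-Smirnov-type tests applied directly to the contaminated data are not.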
In the next problem, I consider the stochastic frontier model (SFM) which is a widely used
model for measuring firms’ efficiency. In productivity or cost studies in the field of econometrics,
there is often a discrepancy between the theoretically optimal output and the actual output for a
given amount of inputs; this gap is called technical inefficiency. To assess this inefficiency,
the stochastic frontier model includes the gap as a latent variable in addition to the
usual statistical noise. Since the gap cannot be observed directly, estimation and inference depend on the
distributional assumption placed on the technical inefficiency term, for which an exponential or half-normal
distribution is typically assumed. I therefore develop a Bayesian
test of whether this parametric assumption is correct. I construct a broad semiparametric
family that approximates or contains the true distribution as an alternative, and then define a Bayes
factor. I show consistency of the Bayes factor under certain conditions and present its finite sample
performance via Monte Carlo simulations.
The second part of my dissertation is about statistical computational problems. Frequentist
standard errors are used to evaluate the uncertainty of an estimator and underpin many statistical
inference procedures. In this dissertation, I consider standard error calculation for Bayes
estimators. Except in a few idealized scenarios, estimating the frequentist variability of an estimator
typically requires bootstrapping to approximate its sampling distribution. When Bayesian
modeling via Markov chain Monte Carlo (MCMC) is combined with the bootstrap, however,
computing the standard error of a Bayes estimator becomes computationally expensive and often
impractical, because the MCMC must be rerun on each bootstrapped dataset.
To overcome this difficulty, I propose a careful use of
importance sampling to reduce the computational burden. I apply this proposed technique
to several examples including logistic regression, linear measurement error model, Weibull
regression model and vector autoregressive model.
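The importance-sampling idea for bootstrap standard errors can be illustrated on a toy model. As a hedged sketch (not the dissertation's implementation), the example below uses a conjugate normal model so the single "MCMC run" can be replaced by exact posterior draws; the key step is reusing those draws for every bootstrap dataset by reweighting them with a likelihood ratio instead of rerunning the sampler.

```python
import numpy as np

def log_lik(y, theta):
    # Gaussian log-likelihood with known unit variance, up to a constant,
    # evaluated for every posterior draw in theta at once.
    return -0.5 * np.sum((y[:, None] - theta[None, :]) ** 2, axis=0)

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=50)

# One posterior sample on the original data; with prior theta ~ N(0, 10^2)
# the posterior is conjugate, so we draw from it directly for simplicity
# (standing in for a single MCMC run).
n, s2_prior = len(y), 100.0
post_var = 1.0 / (n + 1.0 / s2_prior)
post_mean = post_var * n * y.mean()
theta = rng.normal(post_mean, np.sqrt(post_var), size=5000)

# Bootstrap SE of the posterior-mean estimator via importance reweighting:
# weights are proportional to L(y_boot | theta) / L(y | theta).
base = log_lik(y, theta)
estimates = []
for _ in range(200):
    y_boot = rng.choice(y, size=n, replace=True)
    logw = log_lik(y_boot, theta) - base
    w = np.exp(logw - logw.max())      # stabilise before normalising
    estimates.append(np.sum(w * theta) / np.sum(w))
se = np.std(estimates, ddof=1)         # frequentist SE of the Bayes estimator
```

Each bootstrap replicate costs only a reweighting of the existing draws, which is the computational saving the abstract describes; here the resulting standard error should be near the familiar 1/sqrt(n) rate for a sample mean.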
In the second computational problem, I explore the binary regression with flexible skew-probit
link function, which contains the traditional probit link function as a special case. The skew-probit
model is useful for modelling the success probability of binary or count responses when that
probability is not a symmetric function of the continuous regressors. I first investigate the
parameter identifiability of the skew-probit model. I then demonstrate that the maximum likelihood
estimator (MLE) of the skewness parameter is highly biased. I develop a penalized likelihood
approach based on three penalty functions to reduce the finite sample bias of the MLE of the
skew-probit model. The performances of each penalized MLE are compared through extensive
simulations, and I analyze the heart-disease data using the proposed approaches.
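For concreteness, a skew-probit link can be sketched with SciPy's skew-normal distribution; note that the dissertation's exact parameterization of the skewness parameter may differ from the one assumed here.

```python
import numpy as np
from scipy.stats import norm, skewnorm

def skew_probit(eta, lam):
    # Skew-probit link: the success probability is the skew-normal CDF with
    # shape parameter lam evaluated at the linear predictor eta.
    # lam = 0 recovers the ordinary probit link, Phi(eta).
    return skewnorm.cdf(eta, a=lam)

eta = np.linspace(-3.0, 3.0, 7)
p_probit = skew_probit(eta, 0.0)   # symmetric: coincides with norm.cdf(eta)
p_skewed = skew_probit(eta, 3.0)   # asymmetric around eta = 0
```

The asymmetry for lam != 0 is exactly what makes the extra parameter weakly identified near the probit case, the situation in which the penalized likelihood approaches above are intended to reduce the bias of the MLE.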
Model-based clustering based on sparse finite Gaussian mixtures
In the framework of Bayesian model-based clustering based on a finite mixture of Gaussian distributions, we present a joint approach to estimate the number of mixture components and identify cluster-relevant variables simultaneously, as well as to obtain an identified model. Our approach consists of specifying sparse hierarchical priors on the mixture weights and component means. In a deliberately overfitting mixture model, the sparse prior on the weights empties superfluous components during MCMC. A straightforward estimator for the true number of components is given by the most frequent number of non-empty components visited during MCMC sampling. Specifying a shrinkage prior, namely the normal gamma prior, on the component means leads to improved parameter estimates as well as identification of cluster-relevant variables. After estimating the mixture model using MCMC methods based on data augmentation and Gibbs sampling, an identified model is obtained by relabeling the MCMC output in the point process representation of the draws. This is performed using K-centroids cluster analysis based on the Mahalanobis distance. We evaluate our proposed strategy in a simulation setup with artificial data and by applying it to benchmark data sets.
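The emptying behaviour of the sparse prior, and the resulting estimator of the number of components, can be illustrated with a deliberately simplified Gibbs sampler: a univariate mixture with known unit variances, an overfitted K = 10, and a symmetric Dirichlet(e0) prior with small e0. All tuning values below (e0 = 0.01, the N(0, 9) prior on component means, the sweep counts) are illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated unit-variance Gaussian clusters.
y = np.concatenate([rng.normal(-4, 1, 100), rng.normal(4, 1, 100)])
n = len(y)

K, e0 = 10, 0.01        # deliberately overfitted K; sparse Dirichlet(e0) prior
prior_var = 9.0         # N(0, prior_var) prior on component means (assumed here)
mu = rng.normal(0, 3, K)
w = np.full(K, 1.0 / K)
nonempty = []

for sweep in range(800):
    # 1. Sample allocations z_i given weights and means (unit variance).
    logp = np.log(np.maximum(w, 1e-300))[None, :] \
        - 0.5 * (y[:, None] - mu[None, :]) ** 2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = (p.cumsum(axis=1) > rng.random(n)[:, None]).argmax(axis=1)
    nk = np.bincount(z, minlength=K)
    # 2. Sample weights: the sparse prior empties superfluous components.
    w = rng.dirichlet(e0 + nk)
    # 3. Sample means from their conjugate normal posteriors.
    v = 1.0 / (nk + 1.0 / prior_var)
    m = v * np.array([y[z == k].sum() for k in range(K)])
    mu = rng.normal(m, np.sqrt(v))
    if sweep >= 200:                 # discard burn-in
        nonempty.append(int((nk > 0).sum()))

# Estimator of the number of components: the most frequently visited
# number of non-empty components after burn-in.
k_hat = np.bincount(nonempty).argmax()
```

With e0 this small, empty components receive negligible weight and rarely attract observations, so the sampler spends most sweeps with exactly as many non-empty components as there are clusters in the data.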
Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain
The present paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units located in Portugal is established using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions concerning efficiency improvement are offered for each hotel studied.
Copula Based Hierarchical Bayesian Models
The main objective of our study is to employ copula methodology to develop Bayesian
hierarchical models to study the dependencies exhibited by temporal, spatial and
spatio-temporal processes. We develop hierarchical models for both discrete and
continuous outcomes. In doing so, we expect to address the dearth of copula-based
Bayesian hierarchical models to study hydro-meteorological events and other physical
processes yielding discrete responses.
First, we present Bayesian methods of analysis for longitudinal binary outcomes using
Generalized Linear Mixed models (GLMM). We allow flexible marginal association
among the repeated outcomes from different time-points. A unique property of this
copula-based GLMM is that if the marginal link function is integrated over the distribution
of the random effects, its form remains the same as that of the conditional link
function. This property enables us to retain the physical interpretation of the
fixed effects under both the conditional and marginal models and yields a proper posterior distribution.
We illustrate the performance of the posited model using real life AIDS data
and demonstrate its superiority over the traditional Gaussian random effects model.
We develop a semiparametric extension of our GLMM and re-analyze the data from
the AIDS study.
Next, we propose a general class of models to handle non-Gaussian spatial data. The proposed model can deal with geostatistical data exhibiting skewness,
heavy tails, and multimodality. We fix the distribution of the marginal processes and
induce dependence via copulas. We illustrate the superior predictive performance
of our approach in modeling precipitation data as compared to other kriging variants.
Thereafter, we employ mixture kernels as the copula function to accommodate
non-stationary data. We demonstrate the adequacy of this non-stationary model by
analyzing permeability data. In both cases we perform extensive simulation studies
to investigate the performances of the posited models under misspecification.
Finally, we take up the important problem of modeling multivariate extreme values
with copulas. We describe, in detail, how dependences can be induced in the
block maxima approach and peak over threshold approach by an extreme value copula.
We prove the ability of the posited model to handle both strong and weak extremal
dependence and derive the conditions for posterior propriety. We analyze the extreme
precipitation events in the continental United States over the past 98 years and
produce a suite of predictive maps.
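As a minimal illustration of an extreme-value copula of the kind used in the block-maxima construction, here is the Gumbel-Hougaard family (an assumption for illustration; the dissertation's copula choice may differ) together with its upper-tail dependence coefficient, which is positive whenever theta > 1:

```python
import numpy as np

def gumbel_copula(u, v, theta):
    # Gumbel-Hougaard copula, a classic extreme-value copula.
    # theta = 1 gives independence C(u, v) = u * v; larger theta gives
    # stronger upper-tail (extremal) dependence.
    s = (-np.log(u)) ** theta + (-np.log(v)) ** theta
    return np.exp(-s ** (1.0 / theta))

def upper_tail_dependence(theta):
    # Coefficient of upper-tail dependence: lambda_U = 2 - 2**(1/theta).
    return 2.0 - 2.0 ** (1.0 / theta)
```

The tail-dependence coefficient is the quantity that distinguishes strong extremal dependence (lambda_U > 0) from weak or asymptotic independence (lambda_U = 0), the distinction the abstract refers to.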