Fixed Effect Estimation of Large T Panel Data Models
This article reviews recent advances in fixed effect estimation of panel data
models for long panels, where the number of time periods is relatively large.
We focus on semiparametric models with unobserved individual and time effects,
where the distribution of the outcome variable conditional on covariates and
unobserved effects is specified parametrically, while the distribution of the
unobserved effects is left unrestricted. Compared to existing reviews on long
panels (Arellano and Hahn 2007; a section in Arellano and Bonhomme 2011), we
discuss models with both individual and time effects, split-panel jackknife
bias corrections, unbalanced panels, distribution and quantile effects, and
other extensions. Understanding and correcting the incidental parameter bias
caused by the estimation of many fixed effects is our main focus, and the
unifying theme is that the order of this bias is given by the simple formula
p/n for all models discussed, with p the number of estimated parameters and n
the total sample size. Comment: 40 pages, 1 table
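As a hedged illustration of the p/n formula (the symbols N, T, and β below are assumed notation, not taken from the abstract): in a balanced panel with N individuals and T time periods, estimating an individual effect for each unit and a time effect for each period gives

% illustrative derivation under assumed notation:
% N individual effects, T time effects, a fixed-dimensional beta
\begin{align*}
p &= N + T + \dim(\beta) \approx N + T, \\
n &= NT, \\
\frac{p}{n} &\approx \frac{N + T}{NT} = \frac{1}{T} + \frac{1}{N},
\end{align*}

so the incidental parameter bias shrinks only as both N and T grow, which is consistent with the large-T asymptotics the review focuses on.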
Bayesian Approach of Joint Models of Longitudinal Outcomes and Informative Time
Longitudinal studies are commonly encountered in a variety of research areas in which the scientific interest is in the pattern of change in a response variable over time. A number of methods have been proposed for longitudinal data analysis. Most traditional longitudinal methods assume that the independent variables are the same across all subjects, and that the time intervals for collecting outcomes are predetermined and carry no information about the measured variables. In practice, however, researchers may encounter irregular time intervals and informative time, which violate these assumptions; if traditional statistical methods are used in this situation, the results will be biased. Joint models of longitudinal outcomes and informative time address these violations by using joint probability distributions that incorporate the relationships between outcomes and time. The joint models are designed to handle outcomes following a normal distribution with informative time following an exponential distribution. Several studies used maximum likelihood estimates of the joint model parameters. This study, however, presents an alternative method for parameter estimation, based on a Bayesian approach to joint models of longitudinal outcomes and informative time. A Bayesian approach permits the inclusion of prior knowledge in the analysis through the prior distribution of the unknown parameters. In this dissertation, the prior distribution follows three scenarios: (1) the prior distributions of all unknown parameters are noninformative, set to a vague but proper prior, Normal(0, 1e6); (2) the prior distributions of all unknown parameters are informative, with normal priors for unrestricted parameters and inverse gamma (IG) priors for positive parameters such as the variance σ²; (3) a combination of the two scenarios above, so that the prior distributions of some unknown parameters are noninformative and the others are informative. The procedure for estimating the model parameters was developed via a Markov chain Monte Carlo method using the Metropolis-Hastings algorithm. The key idea is to construct the likelihood function, specify the prior information, and then compute the posterior distribution; simulated observations are generated from the posterior distribution by the MCMC technique. Thus, the primary purpose of this study was to find Bayesian estimates for the unknown parameters in the joint model, under the assumptions of a normal distribution for the outcome process and an exponential distribution for informative time. The properties and merits of the proposed procedure are illustrated through a simulation study implemented in R and OpenBUGS.
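A minimal sketch of the Metropolis-Hastings step described above, under simplifying assumptions not taken from the dissertation: outcomes are Normal(mu, sigma²) with sigma known, gaps between measurement times are Exponential(lam), the two are treated as independent given the parameters (the actual joint model links them), mu gets the vague Normal(0, 1e6) prior mentioned in the abstract, and a flat prior on lam > 0 stands in for the IG prior. All variable names are illustrative.

```python
import math
import random

random.seed(1)

# --- simulated data under the simplified model (values are illustrative) ---
mu_true, sigma, lam_true = 2.0, 1.0, 0.5
n = 200
y = [random.gauss(mu_true, sigma) for _ in range(n)]       # outcomes
gaps = [random.expovariate(lam_true) for _ in range(n)]    # informative time

def log_posterior(mu, lam):
    """Log posterior: Normal outcome likelihood x Exponential time
    likelihood x vague Normal(0, 1e6) prior on mu (flat prior on lam > 0)."""
    if lam <= 0:
        return -math.inf
    ll = sum(-0.5 * ((yi - mu) / sigma) ** 2 for yi in y)  # outcomes
    ll += n * math.log(lam) - lam * sum(gaps)              # time gaps
    ll += -0.5 * mu ** 2 / 1e6                             # prior on mu
    return ll

def metropolis_hastings(n_iter=6000, step=0.1):
    """Random-walk Metropolis-Hastings over (mu, lam)."""
    mu, lam = 0.0, 1.0
    lp = log_posterior(mu, lam)
    draws = []
    for _ in range(n_iter):
        mu_prop = mu + random.gauss(0.0, step)
        lam_prop = lam + random.gauss(0.0, step)
        lp_prop = log_posterior(mu_prop, lam_prop)
        # accept with probability min(1, posterior ratio)
        if math.log(random.random()) < lp_prop - lp:
            mu, lam, lp = mu_prop, lam_prop, lp_prop
        draws.append((mu, lam))
    return draws

draws = metropolis_hastings()
post = draws[2000:]                      # discard burn-in
mu_hat = sum(d[0] for d in post) / len(post)
lam_hat = sum(d[1] for d in post) / len(post)
```

With enough iterations the posterior means recover the generating values; a production analysis (as in the dissertation) would run this in OpenBUGS with the full joint likelihood and the IG prior on the variance.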
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involves massive data, but it often also includes online data and data
heterogeneity. Recently, some statistical methods have been adapted to process
Big Data, such as linear regression models, clustering methods, and
bootstrapping schemes. Based on decision trees combined with aggregation and
bootstrap ideas, random forests were introduced by Breiman in 2001. They are a
powerful nonparametric statistical method that handles, within a single and
versatile framework, regression problems as well as two-class and multi-class
classification problems. Focusing on classification problems, this paper
proposes a selective review of available proposals that deal with scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. Then, we formulate various remarks on random
forests in the Big Data context. Finally, we experiment with five variants on
two massive datasets (15 and 120 million observations), one simulated and one
from real-world data. One variant relies on subsampling, while three others are
parallel implementations of random forests involving either adaptations of the
bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant
relies on online learning of random forests. These numerical experiments
highlight the relative performance of the different variants, as well as some
of their limitations.
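A minimal sketch of the "divide-and-conquer" idea the abstract mentions, under heavy simplifying assumptions: the data are split into chunks, a few trees are grown per chunk on bootstrap resamples, and all trees are pooled into one voting ensemble. Here one-level decision stumps stand in for the full trees a real random forest would grow, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Fit a one-level tree: try a few quantile thresholds per feature,
    keep the (feature, threshold, left_label, right_label) with lowest error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue
            ll = int(y[left].mean() > 0.5)      # majority label, left side
            rl = int(y[~left].mean() > 0.5)     # majority label, right side
            err = (np.where(left, ll, rl) != y).mean()
            if err < best_err:
                best_err, best = err, (j, t, ll, rl)
    return best

def stump_predict(stump, X):
    j, t, ll, rl = stump
    return np.where(X[:, j] <= t, ll, rl)

def divide_and_conquer_forest(X, y, n_chunks=4, trees_per_chunk=5):
    """Split the data into chunks; on each chunk grow a few stumps on
    bootstrap resamples; pool every stump into a single ensemble."""
    forest = []
    for chunk_idx in np.array_split(rng.permutation(len(X)), n_chunks):
        Xc, yc = X[chunk_idx], y[chunk_idx]
        for _ in range(trees_per_chunk):
            boot = rng.integers(0, len(Xc), len(Xc))  # bootstrap of the chunk
            forest.append(fit_stump(Xc[boot], yc[boot]))
    return forest

def forest_predict(forest, X):
    votes = np.mean([stump_predict(s, X) for s in forest], axis=0)
    return (votes >= 0.5).astype(int)       # majority vote

# toy two-class data: the label depends only on the first feature
X = rng.normal(size=(2000, 3))
y = (X[:, 0] > 0).astype(int)
forest = divide_and_conquer_forest(X, y)
acc = (forest_predict(forest, X) == y).mean()
```

Because every chunk sees only part of the data, the pooled ensemble is cheap to train in parallel; the paper's experiments compare this against subsampling, Big Data adaptations of the bootstrap, and online variants.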