Cluster-Robust Bootstrap Inference in Quantile Regression Models
In this paper I develop a wild bootstrap procedure for cluster-robust
inference in linear quantile regression models. I show that the bootstrap leads
to asymptotically valid inference on the entire quantile regression process in
a setting with a large number of small, heterogeneous clusters and provides
consistent estimates of the asymptotic covariance function of that process. The
proposed bootstrap procedure is easy to implement and performs well even when
the number of clusters is much smaller than the sample size. An application to
Project STAR data is provided. Comment: 46 pages, 4 figures
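As a toy illustration of the idea (not the paper's actual procedure, and for the simplest possible quantile model with only an intercept), the following Python sketch applies Rademacher wild-bootstrap weights at the cluster level:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: G clusters, correlated within cluster via a shared shock.
G, n_g = 50, 8
cluster = np.repeat(np.arange(G), n_g)
y = rng.normal(size=G * n_g) + rng.normal(size=G)[cluster]

tau = 0.5                      # quantile of interest (the median)
q_hat = np.quantile(y, tau)    # sample estimate
resid = y - q_hat

# Wild cluster bootstrap: one Rademacher weight per cluster,
# applied to all residuals of that cluster.
B = 999
boot = np.empty(B)
for b in range(B):
    w = rng.choice([-1.0, 1.0], size=G)[cluster]
    y_star = q_hat + w * resid
    boot[b] = np.quantile(y_star, tau)

se_boot = boot.std(ddof=1)     # cluster-robust bootstrap std. error
ci = np.quantile(boot, [0.025, 0.975])
print(f"q_hat={q_hat:.3f}, bootstrap SE={se_boot:.3f}")
```

Perturbing whole clusters at once, rather than individual observations, is what preserves the within-cluster dependence in the bootstrap distribution.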
Techniques for handling clustered binary data
Over the past few decades there has been increasing interest in clustered studies, and hence much research has gone into the analysis of data arising from them. It is erroneous to treat clustered data, where observations within a cluster are correlated with each other, as one would treat independent data. Point estimates are not as greatly affected by clustering as are the standard deviations of the estimates; as a consequence, however, confidence intervals and hypothesis tests are severely affected, and the analysis of clustered data must be approached with caution. Methods that specifically deal with correlated data have been developed. The analysis may be further complicated when the outcome variable of interest is binary rather than continuous. Methods for estimating proportions and their variances, calculating confidence intervals, and a variety of techniques for testing the homogeneity of proportions have been developed over the years (Donner and Klar, 1993; Donner, 1989; Rao and Scott, 1992). The methods developed within the context of experimental design generally involve incorporating the effect of clustering in the analysis. This cluster effect is quantified by the intracluster correlation and needs to be taken into account when estimating proportions, comparing proportions, and in sample size calculations. In the context of observational studies, the effect of clustering is expressed by the design effect: the inflation in the variance of an estimate that is due to selecting a cluster sample rather than an independent sample. Another important and often neglected aspect of the analysis of complex sample data is sampling weights. Each individual may not have the same probability of being selected, and these weights adjust for that fact (Little et al., 1997). Methods for modelling correlated binary data have also been discussed quite extensively.
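The design effect mentioned above has a simple closed form, DEFF = 1 + (m − 1)ρ for clusters of size m with intracluster correlation ρ. A minimal Python sketch (function names are illustrative) shows how it widens a confidence interval for a proportion:

```python
import numpy as np

def design_effect(m, icc):
    """Variance inflation from sampling clusters of size m
    with intracluster correlation icc: DEFF = 1 + (m-1)*icc."""
    return 1.0 + (m - 1) * icc

def prop_ci(p_hat, n, m, icc, z=1.96):
    """95% CI for a proportion, widened by the design effect."""
    deff = design_effect(m, icc)
    se = np.sqrt(deff * p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Example: 40 clusters of 25 pupils (n = 1000), p_hat = 0.30, icc = 0.05.
lo_srs, hi_srs = prop_ci(0.30, 1000, 1, 0.0)   # ignores clustering
lo, hi = prop_ci(0.30, 1000, 25, 0.05)         # accounts for it
print(round(hi - lo, 4), round(hi_srs - lo_srs, 4))
```

Here DEFF = 1 + 24 × 0.05 = 2.2, so the cluster-adjusted interval is √2.2 ≈ 1.48 times wider than the naive one, illustrating why ignoring clustering makes tests anti-conservative.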
Among the many models proposed for analyzing binary clustered data are two approaches that have been studied and compared: the population-averaged and the cluster-specific approach. The population-averaged model focuses on estimating the effect of a set of covariates on the marginal expectation of the response. One example of the population-averaged approach to parameter estimation is generalized estimating equations, proposed by Liang and Zeger (1986). It involves assuming that elements within a cluster are independent and then imposing a correlation structure on the set of responses. This is useful in longitudinal studies, where a subject is regarded as a cluster; the parameters then describe how the population-averaged response, rather than a specific subject's response, depends on the covariates of interest. Cluster-specific models, on the other hand, introduce cluster-to-cluster variability by including random-effects terms, specific to each cluster, as linear predictors in the regression model (Neuhaus et al., 1991). Unlike the special case of correlated Gaussian responses, the parameters of the cluster-specific model for binary data describe different effects on the responses than those obtained from the population-averaged model. For longitudinal data, the parameters of a cluster-specific model describe how a specific individual's probability of a response depends on the covariates. The decision to use either of these modelling approaches depends on the questions of interest. Cluster-specific models are useful for studying the effects of cluster-varying covariates and when an individual's response, rather than the population-averaged response, is the focus. The population-averaged model is useful when interest lies in how the average response across clusters changes with covariates. A criticism of this approach is that there may be no individual with the characteristics described by the population-averaged model.
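The difference between the two sets of parameters for binary data can be made concrete. Under a logistic random-intercept (cluster-specific) model, integrating the random effect out attenuates the slope, so the population-averaged coefficient is smaller in magnitude than the cluster-specific one. A Python sketch (parameter values are illustrative) verifies this numerically:

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Cluster-specific (conditional) logistic model:
#   logit P(y=1 | x, u) = b0 + b1*x + u,  u ~ N(0, s2)
b0, b1, s2 = -1.0, 1.0, 4.0

# Marginal (population-averaged) probability: integrate u out
# with Gauss-Hermite quadrature for the N(0, s2) random intercept.
nodes, weights = np.polynomial.hermite_e.hermegauss(50)

def p_marginal(x):
    u = np.sqrt(s2) * nodes
    return np.sum(weights * expit(b0 + b1 * x + u)) / np.sqrt(2 * np.pi)

# Implied population-averaged slope on the logit scale
# (finite difference around x = 0).
logit = lambda p: np.log(p / (1.0 - p))
b1_pa = logit(p_marginal(0.5)) - logit(p_marginal(-0.5))

# Well-known approximation: b1_pa ~ b1 / sqrt(1 + 0.346 * s2)
print(b1, round(b1_pa, 3), round(b1 / np.sqrt(1 + 0.346 * s2), 3))
```

The attenuation grows with the random-intercept variance, which is why the two modelling approaches answer genuinely different questions for binary outcomes.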
Beyond Standard Assumptions - Semiparametric Models, A Dyadic Item Response Theory Model, and Cluster-Endogenous Random Intercept Models
In most statistical analyses, quantitative education researchers make simplifying assumptions about the manner in which their data were generated in order to answer their research questions. These assumptions help reduce the complexity of the problem and allow the researcher to describe the data using a simpler, and often more interpretable, statistical model. However, making such assumptions when they do not hold can lead to biased estimates and misleading answers. While the standard sets of assumptions associated with commonly used statistical models are usually sufficient in a wide range of contexts, it is always beneficial for education researchers to understand what they are, when they are reasonable, and how to modify them if necessary. This dissertation focuses on three of the most common models used in quantitative education research (viz. parametric models like Linear Models (LMs), Item Response Theory (IRT) models, and Random-Intercept Models (RIMs)), discusses the standard sets of assumptions that accompany these models, and then describes related models with less stringent sets of assumptions. In each of the following three chapters, we either explicitly unpack existing models that are useful but currently still uncommon in the field of education research, or propose novel models and/or estimation strategies for them. We begin in Chapter 1 with a common parametric model known as the Gaussian LM, and use it as a scaffold to better understand semiparametric models and their estimation. We first review how the coefficients of the Gaussian LM are usually estimated using Maximum Likelihood (ML) or Least-Squares (LS). We then introduce the notion of an M-estimator as well as that of a Regular Asymptotically Linear estimator, and show how they relate to the ML estimator.
In particular, we introduce the notion of influence functions/curves and discuss their geometry together with concepts such as Hilbert spaces and tangent spaces. We then demonstrate, concretely, how to derive the so-called efficient influence function under the Gaussian LM, and show that it is precisely the influence function of the ML and (Ordinary) LS estimators. This shows that the ML estimator (at least under the Gaussian LM) is efficient. Using the foundation built, we move on from the Gaussian LM by relaxing both the assumption that the residuals are normally distributed, as well as the assumption that they have a constant variance, and define this as the Heteroskedastic Linear Model. Unlike the Gaussian LM, this is a semiparametric model. Where possible, we make use of intuition and analogous results from the parametric setting to help describe the workflow for obtaining an efficient estimator for the coefficients of the Heteroskedastic Linear Model. In particular, we derive the nuisance tangent space for this semiparametric model, and use it to obtain the efficient influence function for our model. We then show how to use the efficient influence function to obtain an efficient estimator (which happens to be the Weighted LS estimator) from the (Ordinary) LS estimator via a one-step approach as well as an estimating equations approach. We then conclude by directing readers to more advanced material, including references on more modern approaches to estimating more general semiparametric models such as Targeted Maximum Likelihood Estimation. In Chapter 2, we focus on a class of measurement models known as Item Response Theory models which are useful for measuring latent traits of a subject based on the subject's response to items. 
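For the special case of an assumed-known variance function, the one-step construction described above can be sketched in a few lines; here the one-step update from the (Ordinary) LS estimator lands exactly on the Weighted LS solution, because the estimating equation is linear in the coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)

# Heteroskedastic linear model: y = X b + e, Var(e | x) = 1 + x^2
n = 5000
x = rng.uniform(-2, 2, n)
X = np.column_stack([np.ones(n), x])
sigma2 = 1.0 + x**2                      # variance function, taken as known here
y = X @ np.array([0.5, 2.0]) + rng.normal(size=n) * np.sqrt(sigma2)

# (Ordinary) LS estimator: consistent but inefficient under heteroskedasticity.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Efficient (Weighted LS) estimator: weight each observation by 1/sigma2.
W = 1.0 / sigma2
b_wls = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))

# One-step update of b_ols in the direction of the efficient
# influence function reproduces the WLS solution.
r = y - X @ b_ols
b_onestep = b_ols + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * r))
print(np.round(b_wls, 3), np.round(b_onestep, 3))
```

In more general semiparametric models the estimating equation is nonlinear and the one-step estimator only agrees with the efficient one asymptotically; the exact coincidence here is a convenient feature of the linear model.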
We relax the condition that the responses are only a result of the individual's latent trait (and possibly an external rater), and propose a dyadic Item Response Theory (dIRT) model for measuring interactions of pairs of individuals when the responses to items represent the actions (or behaviors, perceptions, etc.) of each individual (actor) made within the context of a dyad formed with another individual (partner). Examples of its use in education include the assessment of collaborative problem solving among students, or the evaluation of intra-departmental dynamics among teachers. The dIRT model generalizes both Item Response Theory models for measurement and the Social Relations Model for dyadic data. Here, the responses of an actor when paired with a partner are modeled as a function of not only the actor's inclination to act and the partner's tendency to elicit that action, but also the unique relationship of the pair, represented by two directional, possibly correlated, interaction latent variables. We discuss generalizations such as accommodating triads or larger groups, but focus on demonstrating the key idea in the dyadic case. We show that estimation may be performed using Markov chain Monte Carlo implemented in Stan, making it straightforward to extend the dIRT model in various ways. Specifically, we show how the basic dIRT model can be extended to accommodate latent regressions, random effects, and distal outcomes. We perform a simulation study that demonstrates that our estimation approach performs well.
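A minimal sketch of the structure of such a response model for a binary item (the parameter names here are our own illustration, not the dissertation's notation):

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def dirt_prob(theta_a, psi_p, gamma_ap, delta_i):
    """Illustrative dyadic-IRT response probability for a binary item:
    actor a's inclination theta_a, partner p's tendency to elicit the
    action psi_p, the directional pair effect gamma_ap, and item
    difficulty delta_i. All names are hypothetical placeholders."""
    return expit(theta_a + psi_p + gamma_ap - delta_i)

# A pair with a strong positive directional interaction responds
# more often than individual traits alone would predict:
p_with = dirt_prob(0.2, 0.1, 1.5, 0.0)
p_without = dirt_prob(0.2, 0.1, 0.0, 0.0)
print(round(p_with, 3), round(p_without, 3))
```

The pair-specific term gamma_ap is what distinguishes the dyadic model from a standard IRT model: it captures variation attributable to the particular actor-partner combination, over and above each individual's latent trait.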
In the absence of educational data of this form, we demonstrate the usefulness of our proposed approach using speed-dating data instead, and find new evidence of pairwise interactions between participants, describing a mutual attraction that is inadequately characterized by individual properties alone.

Finally, in Chapter 3, we consider the often implicit assumption made when estimating the coefficients of structural Random Intercept Models (RIMs) that covariates at all levels do not co-vary with the random intercepts. A violation of this assumption (called cluster-level endogeneity) leads to inconsistent estimates when using standard estimation procedures. For two-level RIMs with such endogeneity, Hausman and Taylor (HT) devised a consistent multi-step instrumental variable estimator using only internal instruments. We, instead, approach this problem by explicitly modeling the endogeneity using a Structural Equation Model (SEM). In this chapter, we compare, through simulation, the HT and SEM estimators, and evaluate their asymptotic and finite sample properties. We show that the SEM approach is also flexible enough to deal with different exchangeability assumptions for the covariates (e.g., whether the correlations between pairs of all units in a cluster are the same) and investigate how these exchangeability assumptions affect finite sample properties of the HT estimator. For the simulations, we propose a new procedure for generating cluster- and unit-level covariates and random intercepts with a fully flexible covariance structure. We also compare our approach to another common approach known as Multilevel Matching using data from the High School and Beyond survey.
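The consequence of cluster-level endogeneity for standard estimators can be seen in a small simulation. In this sketch a pooled estimator that ignores the correlation between the covariate and the random intercept is biased, while the within (fixed-effects) transformation, the idea underlying Hausman-Taylor's internal instruments, recovers the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-level data: unit-level covariate x correlated with the
# cluster random intercept alpha (cluster-level endogeneity).
G, n_g = 200, 10
g = np.repeat(np.arange(G), n_g)
alpha = rng.normal(size=G)
x = 0.8 * alpha[g] + rng.normal(size=G * n_g)   # endogenous covariate
y = 1.0 * x + alpha[g] + rng.normal(size=G * n_g)

# Pooled estimator ignoring the correlation: biased upward.
b_pooled = np.sum(x * y) / np.sum(x * x)

# Within transformation: subtracting cluster means removes alpha,
# restoring consistency for the coefficient on x.
xm = x - np.bincount(g, x)[g] / n_g
ym = y - np.bincount(g, y)[g] / n_g
b_within = np.sum(xm * ym) / np.sum(xm * xm)
print(round(b_pooled, 3), round(b_within, 3))
```

The within estimator, however, cannot identify coefficients on cluster-constant covariates, which is precisely the gap the Hausman-Taylor construction (and the SEM formulation above) is designed to fill.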
EEG connectivity in infants at risk for autism spectrum disorder
Autism Spectrum Disorder (ASD) is characterized by social and communication difficulties and by restricted and repetitive behaviours, and is typically diagnosed during toddlerhood. Electroencephalographic (EEG) connectivity during infancy may predict later diagnostic outcome and dimensional traits, although results vary with differences in methods. The aim of this thesis is to examine how infant EEG connectivity relates to familial risk and to later categorical and dimensional outcomes of ASD. A previous study found alpha-band hyperconnectivity in 14-month-old infants who developed ASD compared to infants who did not develop ASD at 36 months. Chapter 3 shows that the methods used in this previous study indeed provide reliable results. Chapter 4 describes a replication study using methods identical to the previous study. Although the group difference was not replicated, the association between alpha connectivity and restricted and repetitive behaviours during toddlerhood was replicated. Chapter 5 tested the hypothesis that social and communication difficulties relate to theta connectivity in response to social and non-social stimuli. Theta connectivity was increased during social compared to non-social stimuli. Network topologies differed between groups with high and low familial risk, but not between categorical outcome groups. Theta connectivity was not associated with dimensional traits in toddlerhood. Chapter 6 showed that graph organisation was not related to familial risk, or to diagnostic or dimensional outcomes, in toddlerhood. Finally, Chapter 7 combined measures from the previous chapters and examined how these relate to dimensional outcomes in childhood. Graph organisation in infancy showed a stronger association with dimensional outcomes in childhood than other connectivity measures. Overall, the results in this thesis illustrate the variability in developmental trajectories in ASD, while emphasizing the complexity of the disorder and the value of a dimensional approach to ASD. Chapter 8 further discusses the contributions and implications of this research on EEG connectivity as an early predictive marker for ASD.
High-dimensional asymptotic expansion of LR statistic for testing intraclass correlation structure and its error bound
This paper deals with the null distribution of a likelihood ratio (LR) statistic for testing the intraclass correlation structure. We derive an asymptotic expansion of the null distribution of the LR statistic when the number of variables p and the sample size N approach infinity together, while the ratio p/N converges to a finite nonzero limit c ∈ (0, 1). Numerical simulations reveal that our approximation is more accurate than the classical χ²-type and F-type approximations as p increases. Furthermore, we derive a computable error bound for the asymptotic expansion.

Keywords: asymptotic expansion; error bound; high-dimensional approximation; intraclass correlation structure; likelihood ratio statistic
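For the intraclass (compound-symmetry) null hypothesis, the LR statistic has a closed form in the sample covariance matrix, since the restricted MLEs are available explicitly. A minimal Python sketch of the statistic being approximated (this illustrates the classical LR statistic, not the paper's high-dimensional expansion):

```python
import numpy as np

rng = np.random.default_rng(3)

def lr_intraclass(X):
    """-2 log LR for H0: Sigma = s2 * [(1 - rho) I + rho J]
    (intraclass correlation structure). X is an N x p data matrix."""
    N, p = X.shape
    S = np.cov(X, rowvar=False, bias=True)      # unrestricted MLE of Sigma
    s2 = np.trace(S) / p                        # MLE of the variance under H0
    off = (S.sum() - np.trace(S)) / (p * (p - 1))
    rho = off / s2                              # MLE of the intraclass corr.
    # Determinant of the structured MLE has a closed form:
    # (p-1) eigenvalues equal s2*(1-rho), one equals s2*(1+(p-1)*rho).
    logdet0 = (p - 1) * np.log(s2 * (1 - rho)) + np.log(s2 * (1 + (p - 1) * rho))
    _, logdetS = np.linalg.slogdet(S)
    return N * (logdet0 - logdetS)

# Under H0 the classical approximation treats this as chi^2 with
# p*(p+1)/2 - 2 degrees of freedom; the paper refines that when p
# grows with N. Simulate one draw under H0 (s2 = 1, rho = 0.3):
p, N = 5, 400
Sigma0 = 0.3 * np.ones((p, p)) + 0.7 * np.eye(p)
L = np.linalg.cholesky(Sigma0)
X = rng.normal(size=(N, p)) @ L.T
stat = lr_intraclass(X)
print(round(stat, 3))
```

Because the restricted MLE can never exceed the unrestricted likelihood, the statistic is nonnegative by construction; it is in the regime where p is comparable to N that the χ² calibration degrades and the expansion in the paper becomes useful.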