
    Efficient Semiparametric Marginal Estimation for Longitudinal/Clustered Data

    We consider marginal generalized semiparametric partially linear models for clustered data. Lin and Carroll (2001a) derived the semiparametric efficient score function for this problem in the multivariate Gaussian case, but they were unable to construct a semiparametric efficient estimator that actually achieved the semiparametric information bound. We propose such an estimator here and generalize the work to marginal generalized partially linear models. Asymptotic relative efficiencies of the estimators are investigated throughout. The finite sample performance of these estimators is evaluated through simulations and illustrated using a longitudinal CD4 count data set. Both theoretical and numerical results indicate that properly taking into account the within-subject correlation among the responses can substantially improve efficiency.

    Nonparametric estimation of correlation functions in longitudinal and spatial data, with application to colon carcinogenesis experiments

    In longitudinal and spatial studies, observations often exhibit strong correlations that are stationary in time or distance lags, and the times or locations at which the data are sampled may not be homogeneous. We propose a nonparametric estimator of the correlation function in such data, using kernel methods. We develop a pointwise asymptotic normal distribution for the proposed estimator when the number of subjects is fixed and the number of vectors or functions within each subject goes to infinity. Based on the asymptotic theory, we propose a weighted block bootstrapping method for making inferences about the correlation function, where the weights account for the inhomogeneity of the distribution of the times or locations. The method is applied to a data set from a colon carcinogenesis study, in which colonic crypts were sampled from a segment of colon from each of the 12 rats in the experiment and the expression level of p27, an important cell cycle protein, was then measured for each cell within the sampled crypts. A simulation study is also provided to illustrate the numerical performance of the proposed method. Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org); http://dx.doi.org/10.1214/009053607000000082.
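The kernel estimator described above can be sketched as a smoothed average of products of standardized within-subject observations over pairwise distance lags. The following is a minimal sketch, assuming a Gaussian kernel and pooled standardization; the function name and weighting details are our choices, not the paper's:

```python
import numpy as np

def kernel_correlation(locs, values, lags, bandwidth):
    """Nadaraya-Watson-style kernel estimate of a stationary correlation
    function from irregularly spaced within-subject data.

    locs, values: lists (one entry per subject) of 1-D arrays.
    Standardize against the pooled sample, then smooth products of
    standardized values over within-subject pairwise distance lags.
    """
    pooled = np.concatenate(values)
    mu, sd = pooled.mean(), pooled.std()
    num = np.zeros_like(lags, dtype=float)
    den = np.zeros_like(lags, dtype=float)
    for s, v in zip(locs, values):
        z = (v - mu) / sd
        i, j = np.triu_indices(len(v), k=1)   # all within-subject pairs
        d = np.abs(s[i] - s[j])
        prod = z[i] * z[j]
        for k, lag in enumerate(lags):
            w = np.exp(-0.5 * ((d - lag) / bandwidth) ** 2)  # Gaussian kernel
            num[k] += np.sum(w * prod)
            den[k] += np.sum(w)
    return num / np.maximum(den, 1e-12)
```

For data with a true exponential correlation function, the estimate at short lags should exceed the estimate at long lags, reflecting the decay of the correlation.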

    Transformations of Additivity in Measurement Error Models

    In many problems one wants to model the relationship between a response Y and a covariate X. Sometimes it is difficult, expensive, or even impossible to observe X directly, but one can instead observe a substitute variable W which is easier to obtain. By far the most common model for the relationship between the actual covariate of interest X and the substitute W is W = X + U, where the variable U represents measurement error. This assumption of additive measurement error may be unreasonable for certain data sets. We propose a new model, namely h(W) = h(X) + U, where h(.) is a monotone transformation function selected from some family H of monotone functions. The idea of the new model is that, in the correct scale, measurement error is additive. We propose two possible transformation families H. One is based on selecting a transformation which makes the within-sample mean and standard deviation of replicated W's uncorrelated. The second is based on selecting the transformation so that the errors (U's) fit a prespecified distribution. The transformation families used are parametric power transformations and a cubic spline family. Several data examples are presented to illustrate the methods.
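The first selection criterion, choosing h so that the within-sample mean and standard deviation of replicated W's are uncorrelated, can be illustrated for the power family. This is a minimal sketch; the exponent grid and the function name are our assumptions, not the paper's:

```python
import numpy as np

def choose_power_transform(W, lambdas=np.linspace(-1, 2, 61)):
    """Pick the power transform h(w) = w**lam (log when lam == 0) that
    minimizes the absolute correlation between the within-sample mean
    and standard deviation of replicated measurements W
    (rows = samples, columns = replicates).
    """
    best_lam, best_abs_corr = None, np.inf
    for lam in lambdas:
        H = np.log(W) if np.isclose(lam, 0) else W ** lam
        m = H.mean(axis=1)                 # within-sample means
        s = H.std(axis=1, ddof=1)          # within-sample SDs
        r = abs(np.corrcoef(m, s)[0, 1])
        if r < best_abs_corr:
            best_lam, best_abs_corr = lam, r
    return best_lam, best_abs_corr
```

With multiplicative (lognormal-type) measurement error, this criterion should select an exponent near zero, i.e., the log transform, because only on the log scale is the replicate spread unrelated to the replicate mean.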

    Bayesian estimation of associations between identified longitudinal hormone subgroups and age at final menstrual period

    Background Although follicle stimulating hormone (FSH) is known to be predictive of age at final menstrual period (FMP), previous methods use FSH levels measured at time points that are defined relative to the age at FMP, and hence are not useful for prospective prediction in clinical settings where age at FMP is an unknown outcome. This study assesses whether FSH trajectory feature subgroups identified relative to chronological age can be used to improve the prediction of age at FMP. Methods We develop a Bayesian model to identify latent subgroups in longitudinal FSH trajectories and study the relationship between subgroup membership and age at FMP. Data for our study are taken from the Penn Ovarian Aging study, 1996–2010. The proposed model uses mixture modeling and nonparametric smoothing methods to capture hypothesized latent subgroup features of the FSH longitudinal trajectory, and simultaneously studies the prognostic value of these latent subgroup features for predicting age at FMP. Results The analysis identified two FSH trajectory subgroups that were significantly associated with FMP age: 1) an early FSH class (15 %), which displayed initial increases in FSH shortly after age 40; and 2) a late FSH class (85 %), which did not show a rise in FSH until after age 45. Using FSH subgroup membership, along with class-specific characteristics, i.e., the level and rate of FSH change at class-specific pre-specified ages, improved prediction of FMP age by 20–22 % compared with prediction based on previously identified risk factors (BMI, smoking, and pre-menopausal levels of anti-mullerian hormone (AMH)). Conclusions To the best of our knowledge, this work is the first in the area to demonstrate the existence of subgroups in FSH trajectory patterns relative to chronological age, and to show that such subgroup membership has predictive power for age at FMP. Earlier ages at FMP were found in a subgroup of women with a rise in FSH levels commencing shortly after age 40, in comparison to women who did not exhibit an increase in FSH until after 45 years of age. Periodic evaluations of FSH in these age ranges are potentially useful for predicting age at FMP.

    Equivalent Kernels of Smoothing Splines in Nonparametric Regression for Clustered/Longitudinal Data

    We compare spline and kernel methods for clustered/longitudinal data. For independent data, it is well known that kernel methods and spline methods are essentially asymptotically equivalent (Silverman, 1984). However, the recent work of Welsh et al. (2002) shows that the same is not true for clustered/longitudinal data. First, conventional kernel methods fail to account for the within-cluster correlation, while spline methods are able to account for it. Second, kernel methods and spline methods were found to have different local behavior, with conventional kernels being local and splines being non-local. To resolve these differences, we show that a smoothing spline estimator is asymptotically equivalent to a recently proposed seemingly unrelated kernel estimator of Wang (2003) for any working covariance matrix. To gain insight into this asymptotic equivalence, we show that both the seemingly unrelated kernel estimator and the smoothing spline estimator using any working covariance matrix can be obtained iteratively by applying conventional kernel or spline smoothing to pseudo-observations. This result allows us to study the asymptotic properties of the smoothing spline estimator by deriving its asymptotic bias and variance. We show that smoothing splines are asymptotically consistent for an arbitrary working covariance and have the smallest variance when the true covariance is assumed. We further show that both the seemingly unrelated kernel estimator and the smoothing spline estimator are non-local (unless working independence is assumed) but have asymptotically negligible bias. Their finite sample performance is compared through simulations. Our results justify the use of efficient, non-local estimators such as smoothing splines for clustered/longitudinal data.
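The idea of obtaining the estimator by applying a conventional smoother to pseudo-observations can be sketched as follows. We substitute a Nadaraya-Watson smoother for the spline, and the particular update rule (scaling the covariance-weighted residual by the diagonal of the inverse working covariance) is our simplified reading of the iteration, not the exact estimator of the paper:

```python
import numpy as np

def kernel_smooth(x_grid, x, y, h):
    """Conventional Nadaraya-Watson smoother with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

def iterative_pseudo_fit(X, Y, Sigma_inv, h, n_iter=20):
    """Fit f by repeatedly smoothing pseudo-observations that fold in a
    working covariance.  X, Y: (m, n) arrays, one row per cluster;
    Sigma_inv: (n, n) inverse working covariance shared by clusters.
    """
    x_flat = X.ravel()
    f = np.zeros_like(Y)
    D = np.diag(Sigma_inv)                     # diagonal of Sigma^{-1}
    for _ in range(n_iter):
        # pseudo-observations: current fit plus scaled weighted residual
        resid = (Y - f) @ Sigma_inv
        pseudo = f + resid / D
        f = kernel_smooth(x_flat, x_flat, pseudo.ravel(), h).reshape(Y.shape)
    return x_flat, f.ravel()
```

Under working independence (Sigma_inv = I), the iteration collapses to ordinary kernel smoothing in one step; with a correlated working covariance, each point's pseudo-observation borrows information from its cluster mates, which is what makes the resulting estimator non-local.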

    Structured Mixture of Linear Mappings in High Dimension

    When analyzing data with complex structures such as high dimensionality and non-linearity, one often needs sophisticated models to capture the intrinsic complexity, but practical implementation of such models can be difficult. Striking a balance between parsimony and model flexibility is essential to tackle data complexity while maintaining feasibility and satisfactory prediction performance. In this work, we propose the Structured Mixture of Gaussian Locally Linear Mappings (SMoGLLiM) for settings where high-dimensional predictors are used to predict low-dimensional responses and the underlying associations may be heterogeneous or non-linear. Besides using mixtures of linear associations to approximate non-linear patterns locally and using inverse regression to mitigate the complications of high-dimensional predictors, SMoGLLiM aims at achieving robustness by adopting cluster-size constraints and trimming abnormal samples. Its hierarchical structure enables covariance matrices and latent factors to be shared across smaller clusters, which effectively reduces the number of parameters. An Expectation-Maximization (EM) algorithm is devised for parameter estimation and, with analytical solutions, the estimation process is computationally efficient. Numerical results obtained from three real-world datasets demonstrate the flexibility and ability of SMoGLLiM to accommodate complex data structures: using high-dimensional face images to predict the parameters under which the images were taken, predicting sucrose levels from high-dimensional hyperspectral measurements of different types of orange juice, and a magnetic resonance vascular fingerprinting (MRvF) study in which researchers are interested in using so-called MRv fingerprints at the voxel level to predict microvascular properties in the brain.
The three datasets bear different features and present different types of challenges. For example, the size of the MRv fingerprint dataset demands special consideration to reduce the computational burden. With the hierarchical structure of SMoGLLiM, we are able to adopt parallel computing techniques to reduce the model-building time by 97%. These examples illustrate the wide applicability of SMoGLLiM in handling different kinds of complex data structure.
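At its core, the mixture-of-linear-mappings idea can be illustrated with a plain EM algorithm for a mixture of simple linear regressions. This sketch omits the inverse regression, shared covariance structure, cluster-size constraints, and trimming that distinguish SMoGLLiM, and the initialization scheme is our choice:

```python
import numpy as np

def em_mixture_regression(x, y, K=2, n_iter=50):
    """EM for a K-component mixture of simple linear regressions:
    each component k has its own intercept/slope and noise variance."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    # deterministic start: split on residuals from a pooled fit
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    r0 = y - X @ beta0
    edges = np.quantile(r0, np.linspace(0, 1, K + 1)[1:-1])
    resp = np.eye(K)[np.searchsorted(edges, r0)]   # hard responsibilities
    pi = np.full(K, 1.0 / K)
    betas, sig2 = np.zeros((K, 2)), np.ones(K)
    for _ in range(n_iter):
        for k in range(K):                         # M-step: weighted LS
            w = resp[:, k]
            Xw = X * w[:, None]
            betas[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(2), Xw.T @ y)
            r = y - X @ betas[k]
            sig2[k] = max(float((w * r**2).sum() / max(w.sum(), 1e-12)), 1e-8)
            pi[k] = w.mean()
        dens = np.stack([pi[k] / np.sqrt(2 * np.pi * sig2[k])  # E-step
                         * np.exp(-(y - X @ betas[k])**2 / (2 * sig2[k]))
                         for k in range(K)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
    return pi, betas, sig2
```

On data generated from two parallel lines, the fitted components should recover the two intercepts and the common slope; SMoGLLiM layers structure on top of exactly this kind of local linear fit.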

    Evaluation of fecal mRNA reproducibility via a marginal transformed mixture modeling approach

    Background Developing and evaluating new technology that enables researchers to recover gene-expression levels of colonic cells from fecal samples could be key to a non-invasive screening tool for early detection of colon cancer. The current study, to the best of our knowledge, is the first to investigate and report the reproducibility of fecal microarray data. Using the intraclass correlation coefficient (ICC) as a measure of reproducibility and a preliminary analysis of fecal and mucosal data, we assessed the reliability of mixture density estimation and the reproducibility of fecal microarray data. Using Monte Carlo-based methods, we explored whether ICC values should be modeled as a beta mixture or transformed first and fitted with a normal mixture. We used outcomes from bootstrapped goodness-of-fit tests to determine which approach is less sensitive to potential violation of distributional assumptions. Results Graphical examination of the distributions of both ICC and probit-transformed ICC (PT-ICC) clearly shows that there are two components in the distributions. For ICC measurements, which lie between 0 and 1, the practice in the literature has been to assume that the data points come from a beta-mixture distribution. Nevertheless, in our study we show that a normal-mixture modeling approach on PT-ICC can provide superior performance. Conclusions When modeling ICC values of gene expression levels, using a mixture of normals on the probit-transformed (PT) scale is less sensitive to model mis-specification than using a mixture of betas. We show that a biased conclusion could be reached if we follow the traditional approach and model the two sets of ICC values using a mixture of betas directly. The problematic estimation arises from the sensitivity of beta mixtures to model mis-specification, particularly when there are observations in the neighborhood of the boundary points, 0 or 1. Since beta-mixture modeling is commonly used to approximate the distribution of measurements between 0 and 1, our findings have important implications beyond the current study. Using the normal-mixture approach on PT-ICC, we observed the quality of reproducible genes in fecal array data to be comparable to that in mucosal arrays.
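The PT-ICC approach, probit-transforming ICC values and then fitting a two-component normal mixture, can be sketched with a small EM routine. The initialization, iteration count, and clipping near the boundary are our choices:

```python
import numpy as np
from scipy.stats import norm

def fit_pt_icc_mixture(icc, n_iter=200):
    """Probit-transform ICC values in (0, 1) and fit a two-component
    normal mixture by EM, returning weights, means, and SDs on the
    probit scale."""
    z = norm.ppf(np.clip(icc, 1e-6, 1 - 1e-6))    # probit transform
    # initialize by splitting at the median
    lo, hi = z[z <= np.median(z)], z[z > np.median(z)]
    mu = np.array([lo.mean(), hi.mean()])
    sd = np.array([lo.std() + 1e-3, hi.std() + 1e-3])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        dens = w * norm.pdf(z[:, None], mu, sd)          # (n, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)    # E-step
        nk = resp.sum(axis=0)                            # M-step
        w = nk / len(z)
        mu = (resp * z[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (z[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return w, mu, sd
```

Because the normal mixture lives on the unbounded probit scale, observations near the ICC boundaries 0 and 1 map to extreme but finite z-values rather than destabilizing the fit, which is the advantage the abstract describes over fitting betas directly.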