2 research outputs found

    Dimension reduction and efficient recommender system for large-scale complex data

    Large-scale complex data, which play an important role in information technology and biomedical research, have drawn great attention in recent years. In this thesis, we address three challenging issues: sufficient dimension reduction for longitudinal data, nonignorable missing data with refreshment samples, and large-scale recommender systems. In the first part of this thesis, we incorporate correlation structure in sufficient dimension reduction for longitudinal data. Existing sufficient dimension reduction approaches that assume independence may lead to substantial loss of efficiency. We apply the quadratic inference function to incorporate the correlation information and apply the transformation method to recover the central subspace. The proposed estimators are shown to be consistent and more efficient than the ones assuming independence. In addition, the estimated central subspace is also efficient when the correlation information is taken into account. We compare the proposed method with other dimension reduction approaches through simulation studies, and apply this new approach to an environmental health study. In the second part of this thesis, we address nonignorable missing data, which occur frequently in longitudinal studies and can cause biased estimates. Refreshment samples, which recruit new subjects in subsequent waves from the original population, could mitigate the bias. We introduce a mixed-effects estimating equation approach that enables one to incorporate refreshment samples and recover missing information. We show that the proposed method achieves consistency and asymptotic normality for fixed-effect estimation under shared-parameter models, and we extend it to a more general nonignorable-missing framework. Our finite-sample simulation studies show the effectiveness and robustness of the proposed method under different missing mechanisms. In addition, we apply our method to the 2007-2008 Associated Press–Yahoo! News longitudinal election poll survey data with refreshment samples. In the third part of this thesis, we develop a novel recommender system that tracks users' preferences and recommends items of interest effectively. We propose a group-specific method to utilize dependency information from users and items which share similar characteristics under the singular value decomposition framework. The new approach is effective for the "cold-start" problem, where information on new users and new items is not available from the existing data collection. One advantage of the proposed model is that we are able to incorporate information from the missing mechanism and group-specific features through clustering based on variables associated with missing patterns. In addition, we propose a new algorithm that embeds a back-fitting algorithm into alternating least squares, which avoids large matrix operations and heavy memory use, and therefore makes scalable computing feasible. Our simulation studies and MovieLens data analysis both indicate that the proposed group-specific method improves prediction accuracy significantly compared to existing competitive recommender system approaches.
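As a rough illustration of the alternating least squares idea mentioned in the abstract above, the following sketch fits a plain regularized low-rank factorization to a partially observed rating matrix. It is not the thesis's group-specific method and does not include the back-fitting step. The function name `als_factorize`, its parameters, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def als_factorize(ratings, mask, rank=2, reg=0.1, n_iters=30, seed=0):
    """Plain regularized ALS on the observed entries only (hypothetical helper).

    ratings: (n_users, n_items) array; entries where mask is False are ignored.
    mask:    boolean array of the same shape, True where a rating is observed.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    ridge = reg * np.eye(rank)

    for _ in range(n_iters):
        # Fix item factors; each user row is a small ridge regression.
        for u in range(n_users):
            seen = mask[u]
            if seen.any():
                Qs = Q[seen]
                P[u] = np.linalg.solve(Qs.T @ Qs + ridge, Qs.T @ ratings[u, seen])
        # Fix user factors; each item row is a small ridge regression.
        for i in range(n_items):
            seen = mask[:, i]
            if seen.any():
                Ps = P[seen]
                Q[i] = np.linalg.solve(Ps.T @ Ps + ridge, Ps.T @ ratings[seen, i])
    return P, Q


# Synthetic rank-2 ratings with about half of the entries observed (assumed data).
rng = np.random.default_rng(1)
true_P = rng.normal(size=(60, 2))
true_Q = rng.normal(size=(40, 2))
ratings = true_P @ true_Q.T
mask = rng.random(ratings.shape) < 0.5

P_hat, Q_hat = als_factorize(ratings, mask, rank=2, reg=0.01, n_iters=50)
pred = P_hat @ Q_hat.T

# Compare held-out error against predicting the mean observed rating.
held_out = ~mask
rmse_als = np.sqrt(np.mean((pred[held_out] - ratings[held_out]) ** 2))
rmse_baseline = np.sqrt(np.mean((ratings[mask].mean() - ratings[held_out]) ** 2))
print(f"held-out RMSE: ALS={rmse_als:.3f}, mean baseline={rmse_baseline:.3f}")
```

Each update in this sketch solves only a `rank` by `rank` linear system per user or item, which hints at why alternating schemes can avoid operations on the full rating matrix.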

    Inference in structural equation models with missing data

    Missing data can lead to bias and inefficiency in estimating the quantities of interest in scientific studies. This can be especially problematic in longitudinal studies, which measure the same subjects at several different points in time. It is not uncommon for individuals to be unavailable at one or more points in time, and even when available, an individual may fail to respond to one or more items. To give an indication of the nature of the problem, we consider here an example in which only 295 out of 451 cases would be considered complete. A common approach for dealing with missing values in current practice is to restrict attention to those individuals or cases for which the data are completely observed. At best, this procedure is inefficient since some observed information (belonging to incomplete cases) is being ignored, but in some situations it can also be badly biased. Rubin (1976) defines three mechanisms by which values may become missing: values are missing completely at random (MCAR) if the fact that they are missing is completely unrelated to the problem at hand (e.g., a typographical error); values are missing at random (MAR) if the fact that they are missing does not affect our ability to draw conclusions (e.g., people may not answer a question about their date of birth while supplying correlated information like graduation dates); finally, values are nonignorably missing (NI) if the very fact that the values are missing contains important information about the values that belong there (e.g., people with high incomes tend not to answer questions about income). The naive approach of relying only on complete observations will be unbiased only in the nicest of these cases (missing data are MCAR) and may be substantially biased in the other cases. The last case, nonignorably missing data, cannot be addressed easily using statistical methods since, according to the definition, we are missing extremely important information in these cases. This thesis explores approaches that are valid when the missing data mechanism is MAR, in which case it is possible to use the observed values to learn about what the missing values might be. We consider three approaches to drawing inferences in structural equation models (SEM) with missing data: a likelihood-based approach, Bayesian inference, and inference based on multiple imputation (or fill-ins) of the missing values. Many of the sociology and psychology sample survey problems that rely on SEM have two different kinds of variables: item responses and composites, which are obtained as linear combinations of items. If one of the item responses that define a composite variable is missing, then the composite variable might be declared missing. We also consider ways of taking advantage of item responses in missing data problems. These approaches are described, followed by an example and a simulation study in which each approach is used to analyze data about the psychological development of adolescents using structural equation models. The data are from the Iowa Youth and Family Project (IYFP) being carried out at the Center for Rural Health at Iowa State University.
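The three missingness mechanisms described in the abstract above can be illustrated with a small simulation. This is a sketch under assumed settings (a single always-observed covariate `x`, an outcome `y` with true mean 0, and logistic missingness probabilities chosen for illustration), not an analysis from the thesis. It compares the complete-case mean of `y` with a simple regression-adjusted mean that uses the observed `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# x is always observed; y may be missing. The true mean of y is 0.
x = rng.normal(size=n)
y = x + rng.normal(size=n)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Probability that y is missing under each mechanism (illustrative choices).
p_missing = {
    "MCAR": np.full(n, 0.35),       # unrelated to x or y
    "MAR": sigmoid(1.5 * x - 0.8),  # depends only on the observed x
    "NI": sigmoid(1.5 * y - 0.8),   # depends on the possibly missing y itself
}


def regression_adjusted_mean(observed):
    """Fit y ~ x on complete cases, then average the predictions over all x."""
    slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
    return float(np.mean(intercept + slope * x))


complete_case_mean = {}
adjusted_mean = {}
for name, p in p_missing.items():
    observed = rng.random(n) >= p
    complete_case_mean[name] = float(y[observed].mean())
    adjusted_mean[name] = regression_adjusted_mean(observed)
    print(f"{name}: complete-case={complete_case_mean[name]:+.3f}, "
          f"regression-adjusted={adjusted_mean[name]:+.3f}")
```

In this setup the complete-case mean stays close to 0 only under MCAR. Under MAR the observed `x` is enough to correct the estimate, while under NI the adjustment still leaves a bias, which mirrors the distinction drawn in the abstract.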