
    Statistical Challenges in Combining Information from Big and Small Data Sources

    Full text link
    Social media, electronic health records, credit card transactions, administrative data, web scraping, and numerous other ways of collecting information have changed the landscape for those interested in addressing policy-relevant research questions. Over the same period, the traditional sources of data, such as large-scale surveys, that have been a stable basis for policy-relevant research have suffered setbacks due to high nonresponse and increasing data collection costs. The non-survey data usually contain detailed information on certain behaviors for a large number of individuals (such as all credit card transactions) but very little background information on those individuals (such as important covariates needed to address the policy-relevant question). The survey data, on the other hand, contain detailed information on covariates but less detailed information on the behaviors. Neither data source may be ideal for the target population of interest. This paper develops and evaluates a framework for linking information from multiple imperfect data sources, along with Census data, to draw statistical inference. An explicit modeling framework involving selection into the big data source, the sampling and nonresponse mechanisms in the survey data, the distribution of the key variables of interest, and certain marginal distributions from the Census data is used as the set of building blocks for drawing inference about the population quantity of interest.
    http://deepblue.lib.umich.edu/bitstream/2027.42/120417/1/NAS-Paper.pdf
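
    The abstract names the building blocks only at a high level. As one hypothetical illustration of the selection-adjustment ingredient (not the paper's actual estimator), the Python sketch below stacks a self-selected "big data" sample with a small simple-random reference survey, fits a logistic model for membership in the big-data source, and uses the inverse odds as pseudo-weights; every name and setting is invented for the toy simulation.

```python
# Toy sketch of pseudo-weighting a non-probability "big data" sample against a
# reference probability survey. Hypothetical data; not the paper's estimator.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated population: covariate x drives both selection and the outcome y.
N = 100_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)

# The "big data" source over-selects high-x units (self-selection, no design).
p_select = 1.0 / (1.0 + np.exp(-(x - 1.0)))
in_big = rng.random(N) < p_select

# A small reference survey: a simple random sample (equal design weights).
survey_idx = rng.choice(N, size=2_000, replace=False)

# Stack big-data and survey records and model membership in the big-data set.
x_stack = np.concatenate([x[in_big], x[survey_idx]])[:, None]
z_stack = np.concatenate([np.ones(in_big.sum()), np.zeros(2_000)])
clf = LogisticRegression(max_iter=1_000).fit(x_stack, z_stack)

# Pseudo-weights: the inverse odds of big-data membership are proportional to
# the inverse selection propensity when the reference survey is an SRS.
p_big = clf.predict_proba(x[in_big][:, None])[:, 1]
w = (1.0 - p_big) / p_big

print("naive big-data mean  :", round(y[in_big].mean(), 3))
print("pseudo-weighted mean :", round(np.average(y[in_big], weights=w), 3))
print("true population mean :", round(y.mean(), 3))
```

    In this simulation the naive big-data mean is biased upward (the source over-represents high-x units), while the pseudo-weighted estimate recovers the population mean; the framework in the paper goes further by also modeling survey nonresponse and calibrating to Census margins.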

    Bayesian sensitivity analysis of incomplete data: bridging pattern‐mixture and selection models

    Full text link
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/109600/1/sim6302.pdf

    An Approximate Test for Homogeneity of Correlated Correlation Coefficients

    Full text link
    This paper develops and evaluates an approximate procedure for testing homogeneity of an arbitrary subset of correlation coefficients among variables measured on the same set of individuals. The sample may have some missing data. The simple test statistic is a multiple of the variance of the Fisher r-to-z transformed correlation coefficients relevant to the null hypothesis being tested, and is referred to a chi-square distribution. The use of this test is illustrated through several examples. Given the approximate nature of the test statistic, the procedure was evaluated in a simulation study that assessed the agreement between the nominal and actual significance levels of the test for several null hypotheses of interest.
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/43560/1/11135_2004_Article_394854.pdf
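
    For concreteness, the Python sketch below implements the simple independent-samples version of this construction: Fisher r-to-z transform each correlation, form the precision-weighted variance of the transformed values, and refer it to a chi-square distribution. It omits the paper's actual contribution, the adjustment for correlations measured on the same individuals and for missing data, so it is a simplified analogue rather than the proposed procedure.

```python
# Homogeneity test for correlations from INDEPENDENT samples via Fisher's z.
# The paper generalizes this to correlated correlations with missing data.
import numpy as np
from scipy import stats

def homogeneity_test(rs, ns):
    """Test H0: all correlations are equal (independent samples).

    rs : sample correlation coefficients
    ns : sample sizes; Var[arctanh(r_i)] is approximately 1/(n_i - 3)
    """
    rs, ns = np.asarray(rs, float), np.asarray(ns, float)
    z = np.arctanh(rs)                 # Fisher r-to-z transform
    w = ns - 3.0                       # approximate inverse variances
    z_bar = np.sum(w * z) / np.sum(w)  # precision-weighted mean
    q = np.sum(w * (z - z_bar) ** 2)   # ~ chi-square with k-1 df under H0
    df = len(rs) - 1
    return q, df, stats.chi2.sf(q, df)

q, df, p = homogeneity_test([0.42, 0.55, 0.48], [100, 120, 90])
print(f"Q = {q:.3f}, df = {df}, p-value = {p:.3f}")
```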

    Improving on analyses of self-reported data in a large-scale health survey by using information from an examination-based survey

    Full text link
    Common data sources for assessing the health of a population of interest include large-scale surveys based on interviews that often pose questions requiring a self-report, such as, ‘Has a doctor or other health professional ever told you that you have ⟨health condition of interest⟩?’ or ‘What is your ⟨height/weight⟩?’ Answers to such questions might not always reflect the true prevalences of health conditions (for example, if a respondent misreports height/weight or does not have access to a doctor or other health professional). Such ‘measurement error’ in health data could affect inferences about measures of health and health disparities. Drawing on two surveys conducted by the National Center for Health Statistics, this paper describes an imputation-based strategy for using clinical information from an examination-based health survey to improve on analyses of self-reported data in a larger interview-based health survey. Models predicting clinical values from self-reported values and covariates are fitted to data from the National Health and Nutrition Examination Survey (NHANES), which asks self-report questions during an interview component and also obtains clinical measurements during a physical examination component. The fitted models are used to multiply impute clinical values for the National Health Interview Survey (NHIS), a larger survey that obtains data solely via interviews. Illustrations involving hypertension, diabetes, and obesity suggest that estimates of health measures based on the multiply imputed clinical values are different from those based on the NHIS self-reported data alone and have smaller estimated standard errors than those based solely on the NHANES clinical data. The paper discusses the relationship of the methods used in the study to two-phase/two-stage/validation sampling and estimation, along with limitations, practical considerations, and areas for future research. Published in 2009 by John Wiley & Sons, Ltd.
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/65032/1/3809_ftp.pdf
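
    A hypothetical Python sketch of the general strategy follows (not the paper's exact models or data): fit a model predicting the clinical value from the self-report and covariates in an examination survey, multiply impute clinical values for an interview-only survey, and combine the completed-data estimates with Rubin's rules. All variables and parameter values below are simulated stand-ins.

```python
# Toy multiple-imputation sketch: impute "clinical" BMI for an interview-only
# survey using a model fitted in an examination survey. Simulated data only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# --- toy examination survey: self-reported AND clinical BMI both observed ---
n_exam = 1_000
age = rng.uniform(20, 80, n_exam)
bmi_clin = rng.normal(27, 5, n_exam)
bmi_self = bmi_clin - 0.5 + rng.normal(0, 1.5, n_exam)  # under-report + noise

X_exam = sm.add_constant(np.column_stack([bmi_self, age]))
fit = sm.OLS(bmi_clin, X_exam).fit()

# --- toy interview-only survey: only self-reports and covariates available ---
n_int = 5_000
age_i = rng.uniform(20, 80, n_int)
bmi_self_i = rng.normal(26.5, 5.2, n_int)
X_int = sm.add_constant(np.column_stack([bmi_self_i, age_i]))

# Multiply impute clinical BMI, propagating parameter and residual uncertainty
# (fully proper MI would also draw the residual variance from its posterior).
M = 20
est, var_within = [], []
for _ in range(M):
    beta_m = rng.multivariate_normal(fit.params, fit.cov_params())
    imp = X_int @ beta_m + rng.normal(0, np.sqrt(fit.scale), n_int)
    est.append(imp.mean())
    var_within.append(imp.var(ddof=1) / n_int)

# Rubin's rules: total variance = within + (1 + 1/M) * between.
q_bar = np.mean(est)
t_var = np.mean(var_within) + (1 + 1 / M) * np.var(est, ddof=1)
print(f"MI estimate of mean clinical BMI: {q_bar:.2f} (SE {np.sqrt(t_var):.3f})")
```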

    Bayesian Variable Selection with Joint Modeling of Categorical and Survival Outcomes: An Application to Individualizing Chemotherapy Treatment in Advanced Colorectal Cancer

    Full text link
    Colorectal cancer is the second leading cause of cancer-related deaths in the United States, with more than 130,000 new cases of colorectal cancer diagnosed each year. Clinical studies have shown that genetic alterations lead to different responses to the same treatment, despite the morphologic similarities of tumors. A molecular test prior to treatment could help in determining an optimal treatment for a patient with regard to both toxicity and efficacy. This article introduces a statistical method appropriate for predicting and comparing multiple endpoints given different treatment options and the molecular profile of an individual. A latent-variable-based multivariate regression model with a structured variance-covariance matrix is considered here. The latent variables account for the correlated nature of multiple endpoints and accommodate the fact that some clinical endpoints are categorical variables and others are censored variables. The mixture normal hierarchical structure admits a natural variable selection rule. Inference was conducted by sampling from the posterior distribution using Markov chain Monte Carlo methods. We analyzed the finite-sample properties of the proposed method using simulation studies. The application to the advanced colorectal cancer study revealed associations between multiple endpoints and particular biomarkers, demonstrating the potential of individualizing treatment based on genetic profiles.
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/66395/1/j.1541-0420.2008.01181.x.pdf
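
    The Python sketch below isolates one ingredient named in the abstract, the mixture normal ("spike-and-slab") prior that yields a natural variable selection rule, in a deliberately simplified setting: a plain linear regression with known error variance, sampled by a two-step Gibbs sampler. The paper's full model (latent variables linking categorical and censored survival endpoints, structured covariance) is not reproduced here.

```python
# Minimal spike-and-slab (mixture normal) variable selection for linear
# regression with known error variance. A simplified sketch, not the paper's
# multivariate latent-variable model.
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: 10 candidate predictors, only the first two truly active.
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5] + [0.0] * (p - 2))
sigma2 = 1.0
y = X @ beta_true + rng.normal(0, np.sqrt(sigma2), n)

tau0, tau1, pi = 0.01, 2.0, 0.2   # spike sd, slab sd, prior inclusion prob.
gamma = np.ones(p, dtype=bool)    # inclusion indicators
XtX, Xty = X.T @ X, X.T @ y
incl_count = np.zeros(p)

def normpdf(b, s):
    # Normal density up to the 1/sqrt(2*pi) constant, which cancels in ratios.
    return np.exp(-0.5 * (b / s) ** 2) / s

n_iter, burn = 2_000, 500
for it in range(n_iter):
    # 1. beta | gamma: conjugate normal update, prior sd tau1 (slab) or tau0.
    prior_prec = np.where(gamma, 1 / tau1**2, 1 / tau0**2)
    post_cov = np.linalg.inv(XtX / sigma2 + np.diag(prior_prec))
    beta = rng.multivariate_normal(post_cov @ Xty / sigma2, post_cov)

    # 2. gamma_j | beta_j: Bernoulli, comparing slab vs spike densities.
    p_slab = pi * normpdf(beta, tau1)
    p_spike = (1 - pi) * normpdf(beta, tau0)
    gamma = rng.random(p) < p_slab / (p_slab + p_spike)

    if it >= burn:
        incl_count += gamma

print("posterior inclusion probabilities:",
      np.round(incl_count / (n_iter - burn), 2))
```

    The two truly active predictors should receive posterior inclusion probabilities near one and the rest near the prior; this is the "natural variable selection rule" the mixture structure admits.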

    A Bayesian model for longitudinal count data with non-ignorable dropout

    Full text link
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/73907/1/j.1467-9876.2008.00628.x.pdf