
    Statistical approaches for handling longitudinal and cross-sectional discrete data with missing values, focusing on multiple imputation and probability weighting.

    Doctor of Philosophy in Science. University of KwaZulu-Natal, Pietermaritzburg, 2018. Abstract available in PDF file.

    Multiple Imputation to Correct for Measurement Error: Application to Chronic Disease Case Ascertainment in Administrative Health Databases

    Diagnosis codes in administrative health databases (AHDs) are commonly used to ascertain chronic disease cases for research and surveillance. Low sensitivity of diagnosis codes has been demonstrated in many studies that validate AHDs against a gold-standard data source in which the true disease status is known. This results in misclassification of disease status, which can lead to biased prevalence estimates and loss of power to detect associations between disease status and health outcomes. Model-based case detection algorithms combined with multiple imputation (MI) methods in validation-dataset/main-dataset designs can be used to correct for misclassification of chronic disease status in AHDs. Under this approach, a predictive model of disease status (e.g., a logistic model) is constructed in the validation dataset, the model parameters are estimated, and MI methods are used to impute the true disease status in the main dataset. This research considered scenarios in which the misclassification of the observed disease status is independent of the disease predictors and scenarios in which it depends on them. In the first case, MI methods based on a frequentist logistic model (with and without bias correction) and a Bayesian logistic model were compared; in the second, MI methods based on frequentist logistic models with different sets of covariates were compared. Monte Carlo techniques were used to investigate the effects of the following data and model characteristics on bias and error in chronic disease prevalence estimates from AHDs: sensitivity of the observed disease status based on diagnosis codes, size of the validation dataset, number of imputations, and the magnitude of measurement error in covariates of the predictive model. Relative bias, root mean squared error (RMSE), and coverage of the 95% confidence interval were used to measure performance. Without bias correction, the Bayesian MI model had lower RMSE than the frequentist MI model, and the frequentist MI model with bias correction was demonstrated via a simulation study to outperform both the Bayesian MI model and the frequentist MI model without bias correction. The results indicate that MI works well for measurement error correction provided the missing true values are not missing not at random, regardless of whether the observed disease diagnosis depends on other disease predictors. Increasing the size of the validation dataset improves the performance of MI more than increasing the number of imputations.
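    To make the validation-dataset/main-dataset idea concrete, the following is a minimal Python sketch of the frequentist-logistic MI approach without bias correction; the dataset objects and column names (validation, main, true_status, dx_code, age, sex) are hypothetical placeholders, not the implementation used in this research.

```python
import numpy as np
import statsmodels.api as sm

def impute_prevalence(validation, main, n_imputations=20, seed=0):
    """Fit P(true status | diagnosis code, covariates) in the validation data,
    impute true disease status in the main data n_imputations times, and pool."""
    rng = np.random.default_rng(seed)
    predictors = ["dx_code", "age", "sex"]              # hypothetical covariates
    X_val = sm.add_constant(validation[predictors])
    fit = sm.Logit(validation["true_status"], X_val).fit(disp=False)
    beta_hat = np.asarray(fit.params)
    beta_cov = np.asarray(fit.cov_params())

    X_main = sm.add_constant(main[predictors]).to_numpy()
    prevalences = []
    for _ in range(n_imputations):
        # "Proper" MI: draw parameters from their approximate sampling distribution.
        beta = rng.multivariate_normal(beta_hat, beta_cov)
        p = 1.0 / (1.0 + np.exp(-X_main @ beta))
        prevalences.append(rng.binomial(1, p).mean())   # prevalence in this imputation
    # Pooled point estimate; Rubin's rules would also combine within- and
    # between-imputation variances for interval estimation.
    return float(np.mean(prevalences))
```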

    Vol. 16, No. 2 (Full Issue)


    Improved Methods for Modeling High Dimensional Binary Features Data with Applications for Assessing Disease Burden from Diagnostic History and for Dealing with Missing Covariates in Administrative Health Records

    Healthcare outcomes research based on administrative data is frequently hindered by two important challenges: (1) accurate adjustment for disease burden and (2) effective management of missing data in key variables. Standard approaches exist for both problems, but these may contribute to biased results. For example, several well-established summary measures are used to adjust for disease burden, often without consideration of whether other methods could perform this task more accurately. Similarly, observations with missing values are often arbitrarily excluded, or the values are imputed without regard for the assumptions involved. Despite recent substantial gains in computing power, statistical approaches, and machine learning methods, no comprehensive effort has been made to develop an improved comorbidity index based on predictive performance comparisons of competing approaches. Similarly, recently developed machine learning approaches have shown promise in addressing missing data problems, but these have not been compared with parametric methods via a rigorous simulation study using large-dimensional data with the complete range of missingness types. This makes it difficult to assess the relative merits of each procedure. This work accomplished three broad aims: (1) improved models for summarizing disease burden were developed by comparing the predictive performance of a wide variety of statistical and machine learning methods; (2) a new comorbidity summary score for predicting five-year mortality was developed; and (3) a comprehensive comparison of machine learning and model-based multiple imputation methods was completed, both in simulations and through an application to real data. Several sensitivity analyses were also examined for variables with missing not at random (MNAR) missingness. This work successfully demonstrated several new approaches for summarizing disease burden. Each of the competing disease burden models in the first aim and the summary score from the second aim had superior predictive performance compared to the Elixhauser index, a commonly used summary measure. This research also led to new applications of machine learning methods within the multiple imputation by chained equations (MICE) framework. Additionally, several MNAR sensitivity methods were adapted and applied to demonstrate that unbiased inference under MNAR may not be possible in some situations, even when the missingness mechanism is fully understood.
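    As an illustration of how a machine learning learner can be plugged into a chained-equations-style imputation loop, the sketch below uses scikit-learn's IterativeImputer with a random forest; this is a generic example under an assumed data matrix X, not the methods or comparisons carried out in this work.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def mice_rf(X, n_imputations=5):
    """Return a list of completed copies of X; the random seed is varied to
    obtain multiple imputations when the learner cannot sample from a posterior."""
    completed = []
    for m in range(n_imputations):
        imputer = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=100, random_state=m),
            max_iter=10,
            sample_posterior=False,   # posterior sampling requires a Bayesian estimator
            random_state=m,
        )
        completed.append(imputer.fit_transform(X))
    return completed
```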

    Exact Approaches for Bias Detection and Avoidance with Small, Sparse, or Correlated Categorical Data

    Every day, traditional statistical methodologies are used worldwide to study a variety of topics and provide insight regarding countless subjects. Each technique is based on a distinct set of assumptions to ensure valid results. Additionally, many statistical approaches rely on large-sample behavior and may collapse or degenerate in the presence of small, sparse, or correlated data. This dissertation details several advancements to detect these conditions, avoid their consequences, and analyze data in a different way to yield trustworthy results. One of the most commonly used modeling techniques for outcomes with only two possible categorical values (e.g., live/die, pass/fail, better/worse) is logistic regression. While some potential complications with this approach are widely known, many investigators are unaware that their particular data do not meet the foundational assumptions, since these are not easy to verify. We have developed a routine for determining whether a researcher should be concerned about potential bias in logistic regression results, so that they can take steps to mitigate the bias or use a different procedure altogether to model the data. Correlated data may arise from common situations such as multi-site medical studies, research on family units, or investigations of student achievement within classrooms. In these circumstances the associations between cluster members must be included in any statistical analysis testing the hypothesis of a connection between two variables in order for results to be valid. Previously, investigators had to choose between using a method intended for small or sparse data while assuming independence between observations, or a method that allowed for correlation between observations while requiring large samples to be reliable. We present a new method that allows small, clustered samples to be assessed for a relationship between a two-level predictor (e.g., treatment/control) and a categorical outcome (e.g., low/medium/high).
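    One simple, generic way to probe whether small-sample bias in logistic regression is a concern is a parametric bootstrap of the fitted model, sketched below; this is only an illustration of the kind of diagnostic discussed here, not the routine developed in this dissertation.

```python
import numpy as np
import statsmodels.api as sm

def bootstrap_bias_check(X, y, n_sim=500, seed=0):
    """Simulate from the fitted logistic model and compare the average refitted
    coefficients to the original estimates; large relative bias (or frequent
    separation failures) suggests the usual asymptotic results are unreliable."""
    rng = np.random.default_rng(seed)
    Xc = sm.add_constant(np.asarray(X))
    beta_hat = sm.Logit(np.asarray(y), Xc).fit(disp=False).params
    estimates = []
    for _ in range(n_sim):
        p = 1.0 / (1.0 + np.exp(-Xc @ beta_hat))
        y_sim = rng.binomial(1, p)
        try:
            estimates.append(sm.Logit(y_sim, Xc).fit(disp=False).params)
        except Exception:
            continue  # e.g. perfect separation: itself a warning sign
    estimates = np.asarray(estimates)
    return (estimates.mean(axis=0) - beta_hat) / np.abs(beta_hat)
```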

    Classification of clinical outcomes using high-throughput and clinical informatics.

    It is widely recognized that many cancer therapies are effective only for a subset of patients. However, clinical studies are most often powered to detect an overall treatment effect. To address this issue, classification methods are increasingly being used to predict a subset of patients who respond differently to treatment. This study begins with a brief history of classification methods, with an emphasis on applications involving melanoma. Nonparametric methods suitable for predicting subsets of patients responding differently to treatment are then reviewed. Each method has different ways of incorporating continuous, categorical, clinical, and high-throughput covariates. For nonparametric and parametric methods, distance measures specific to each method are used to make classification decisions. Approaches are outlined that employ these distances to measure treatment interactions and predict which patients are more sensitive to treatment. Simulations are also carried out to examine the empirical power of some of these classification methods in an adaptive signature design, and the results are compared with logistic regression models. Both parametric and nonparametric methods were found to perform reasonably well, with relative performance depending on the simulation scenario. Finally, a method was developed to evaluate the power and sample size needed for an adaptive signature design in order to predict the subset of patients sensitive to treatment. It is hoped that this study will stimulate further development of nonparametric and parametric methods to predict subsets of patients responding differently to treatment.
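    The following is a minimal sketch of estimating empirical power by simulation for a logistic-regression test of a treatment-by-biomarker interaction; the data-generating values are arbitrary, and the full adaptive signature design involves additional steps (signature development and testing in a sensitive subset) not shown here.

```python
import numpy as np
import statsmodels.api as sm

def empirical_power(n=400, beta_int=1.0, n_sim=1000, alpha=0.05, seed=1):
    """Fraction of simulated trials in which the treatment-by-marker interaction
    is detected by a Wald test in a logistic regression model."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        trt = rng.integers(0, 2, n)                 # randomized treatment arm
        marker = rng.normal(size=n)                 # continuous biomarker
        lin = -0.5 + 0.2 * trt + 0.3 * marker + beta_int * trt * marker
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))
        X = sm.add_constant(np.column_stack([trt, marker, trt * marker]))
        pval = sm.Logit(y, X).fit(disp=False).pvalues[-1]
        rejections += pval < alpha
    return rejections / n_sim
```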

    Statistical modeling of longitudinal survey data with binary outcomes

    Data obtained from longitudinal surveys using complex multi-stage sampling designs contain cross-sectional dependencies among units caused by inherent hierarchies in the data, as well as within-subject correlation arising from repeated measurements. The statistical methods used for analyzing such data should account for stratification, clustering, and unequal probability of selection, as well as within-subject correlations due to repeated measurements. The complex multi-stage design approach has been used in the longitudinal National Population Health Survey (NPHS). This ongoing survey collects information on health determinants and outcomes in a sample of the general Canadian population. This dissertation compares the model-based and design-based approaches used to determine the risk factors of asthma prevalence in the Canadian female population of the NPHS (marginal model). Weighted, unweighted, and robust statistical methods were used to examine the risk factors of the incidence of asthma (event history analysis) and of recurrent asthma episodes (recurrent survival analysis). Missing data analysis was used to study the bias associated with incomplete data. To determine the risk factors of asthma prevalence, the Generalized Estimating Equations (GEE) approach was used for marginal modeling (model-based approach), followed by Taylor linearization and bootstrap estimation of standard errors (design-based approach). The incidence of asthma (event history analysis) was estimated using weighted, unweighted, and robust methods. Recurrent event history analysis was conducted using the Andersen and Gill; Wei, Lin, and Weissfeld (WLW); and Prentice, Williams, and Peterson (PWP) approaches. To assess the presence of bias associated with missing data, weighted GEE and pattern-mixture models were used. The prevalence of asthma in the Canadian female population was 6.9% (6.1-7.7) at the end of Cycle 5. When comparing model-based and design-based approaches for asthma prevalence, the design-based method provided unbiased estimates of standard errors. The overall incidence of asthma in this population, excluding those with asthma at baseline, was 10.5/1000/year (9.2-12.1). For the event history analysis, the robust method provided the most stable estimates and standard errors. For recurrent event history, the WLW method provided stable standard error estimates. Finally, for the missing data approach, the pattern-mixture model produced the most stable standard errors. In conclusion, design-based approaches should be preferred over model-based approaches for analyzing complex survey data, as they provide the least biased parameter estimates and standard errors.
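    A minimal sketch of the marginal (GEE) model for a binary outcome with exchangeable within-subject correlation, using statsmodels on simulated stand-in data; the variable names are placeholders, and the real analysis would additionally incorporate survey weights and bootstrap replicate weights for design-based standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated stand-in for the longitudinal survey data (5 cycles per subject).
rng = np.random.default_rng(0)
n_subjects, n_cycles = 500, 5
nphs = pd.DataFrame({
    "subject_id": np.repeat(np.arange(n_subjects), n_cycles),
    "age": np.repeat(rng.integers(20, 80, n_subjects), n_cycles),
    "smoker": rng.integers(0, 2, n_subjects * n_cycles),
})
nphs["asthma"] = rng.binomial(1, 0.07, len(nphs))

# Marginal logistic model with exchangeable within-subject correlation.
model = smf.gee(
    "asthma ~ age + smoker",
    groups="subject_id",
    data=nphs,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
print(model.fit().summary())
```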

    Population Attributable Fraction (PAF) in epidemiologic follow-up studies

    Quantification of the impact of exposure to different risk factors on mortality or morbidity at the population level is a fundamental issue in epidemiologic research. The Population Attributable Fraction (PAF) is a statistical concept that can be used to quantify this impact: it assesses the proportion of the outcome that could be avoided if the current exposure distribution were replaced by a hypothetical, presumably preferable exposure distribution. So far, methods for the estimation of PAF have mostly been developed for and applied in case-control and cross-sectional studies. The development of methods for the estimation of PAF from cohort studies, which properly take the time perspective into account, has started only recently, and the type of outcome of interest (mortality vs. morbidity) has not previously been taken into account when estimating PAF for a given follow-up interval. In this study, the statistical methodology for the estimation of PAF in cohort studies is extended to cover the estimation of PAF both for total mortality and for disease incidence. The PAF for total mortality or disease incidence was defined as the proportion of mortality or disease incidence, respectively, that could be avoided during a follow-up interval (0, t] if the risk factors were modified. A parametric proportional hazards model, with a piecewise constant baseline hazard function for death and disease occurrences, was assumed. Potential confounding factors were adjusted for and potential effect-modifying factors accounted for in the model. The estimation of PAF and its asymptotic variance based on the delta method was demonstrated, and the complementary logarithmic transformation was used in the calculation of the confidence interval of PAF. In the estimation of PAF for total mortality, censoring due to loss to follow-up was taken into account, whereas in the estimation of PAF for disease incidence, censoring due to death was also considered. Furthermore, the meta-analysis techniques developed for pooling relative risks were extended to the pooling of PAF estimates. In the data examples of this study, the PAF estimates for total mortality and disease incidence were demonstrated to decrease as the follow-up time increased. In the simulated data sets, taking censoring due to death into account in the estimation of PAF for disease incidence was shown to decrease the point estimates of PAF significantly in comparison to ignoring censoring due to death. Ignoring censoring due to death increased the overestimation of PAF, especially when the impact of the risk factors on mortality was strong and the follow-up time long. A new program for the estimation of PAF both for total mortality and for disease incidence, implementing the new methods, was developed in the SAS/IML language and was shown to be flexible and fast. An application of PAF to evaluate the relative importance of the risk factors of type 2 diabetes, and the potential effect-modifying role of the metabolic syndrome or its components, in a meta-analysis of two representative Finnish cohorts (the Mini-Suomi and Terveys 2000 data sets) was carried out using this program. As a result, the use of PAF provided further evidence of weight control being the primary diabetes prevention method.
    The pooling of the PAF estimates increased the power to detect associations in smaller subpopulations defined by the metabolic syndrome or its components, establishing new evidence on the importance of early lifestyle changes in the prevention of type 2 diabetes. In conclusion, it is essential to take the time perspective into account in the estimation of PAF, and different estimators of PAF for a given time interval, taking into account different sources of censoring, are needed depending on the outcome of interest. PAF is a useful measure in cohort studies for providing population-level information on the effects of predictor modifications on the outcome over time, and it has wide applications in many different fields of research.
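    As a rough illustration of the PAF definition for a follow-up interval (0, t], the sketch below compares model-based cumulative incidence under the observed exposure distribution with incidence under a modified exposure; it uses a semiparametric Cox model from lifelines rather than the piecewise-constant parametric model of this thesis, and it ignores censoring due to death, so it is not the method or the SAS/IML program described above. Column and argument names are placeholders.

```python
import numpy as np
from lifelines import CoxPHFitter

def paf_at_t(data, t, exposure, reference_value=0,
             duration_col="time", event_col="event"):
    """PAF over (0, t]: proportion of cumulative incidence that would be avoided
    if `exposure` were set to `reference_value` for everyone."""
    cph = CoxPHFitter().fit(data, duration_col=duration_col, event_col=event_col)
    covars = data.drop(columns=[duration_col, event_col])
    modified = covars.copy()
    modified[exposure] = reference_value               # hypothetical exposure removal
    surv_obs = cph.predict_survival_function(covars, times=[t]).to_numpy().ravel()
    surv_mod = cph.predict_survival_function(modified, times=[t]).to_numpy().ravel()
    F_obs, F_mod = np.mean(1.0 - surv_obs), np.mean(1.0 - surv_mod)
    return (F_obs - F_mod) / F_obs
```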

    Vol. 16, No. 1 (Full Issue)
