15 research outputs found

    Finding the Cutpoint of a Continuous Covariate in a Parametric Survival Analysis Model

    In many clinical studies, continuous variables such as age, blood pressure, and cholesterol are measured and analyzed. Clinicians often prefer to categorize these continuous variables into groups, such as low- and high-risk groups. The goal of this work is to find the cutpoint of a continuous variable at which the transition from the low- to the high-risk group occurs. Different methods for finding such a cutpoint have been published in the literature. We extended the method of Contal and O’Quigley (1999), which is based on the log-rank test, and the method of Klein and Wu (2004), which is based on the score test, to find the cutpoint of a continuous covariate. Since the log-rank test is nonparametric and the score test is parametric, we are interested in whether an extension of the parametric procedure performs better when the distribution of the population is known. We have developed a method that uses parametric score residuals to find the cutpoint. The performance of the proposed method will be compared with the existing methods of Contal and O’Quigley and of Klein and Wu by estimating the bias and mean square error of the estimated cutpoints under different scenarios in simulated data.
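    The cutpoint-search idea shared by these approaches can be illustrated with a small sketch: scan candidate cutpoints, dichotomize the covariate at each one, and keep the cutpoint with the largest two-group test statistic. The sketch below uses the log-rank statistic (in the spirit of Contal and O’Quigley) on synthetic data; the variable names, candidate grid, and simulation are illustrative, not the paper's implementation.

```python
# Hypothetical sketch: pick the cutpoint of a continuous covariate that
# maximizes the absolute two-group log-rank statistic.
import numpy as np

def logrank_z(time, event, group):
    """Standardized log-rank statistic comparing group == 1 vs group == 0."""
    o = e = v = 0.0
    for t in np.unique(time[event == 1]):          # distinct event times
        at_risk = time >= t
        n = at_risk.sum()                          # total at risk
        n1 = (at_risk & (group == 1)).sum()        # at risk in group 1
        d = ((time == t) & (event == 1)).sum()     # events at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        o += d1
        e += d * n1 / n
        if n > 1:
            v += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (o - e) / np.sqrt(v)

def find_cutpoint(time, event, x):
    """Return the candidate cutpoint with the largest |log-rank| statistic."""
    candidates = np.quantile(x, np.linspace(0.1, 0.9, 41))  # avoid extreme tails
    stats = [abs(logrank_z(time, event, (x > c).astype(int))) for c in candidates]
    return candidates[int(np.argmax(stats))]

# toy usage on simulated data with a true cutpoint at 6
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
t = rng.exponential(1 / np.where(x > 6, 2.0, 1.0))   # failure times
c = rng.exponential(2, 300)                          # censoring times
time, event = np.minimum(t, c), (t <= c).astype(int)
print(find_cutpoint(time, event, x))
```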

    Examining the U-shaped relationship of sleep duration and systolic blood pressure with risk of cardiovascular events using a novel recursive gradient scanning model

    Background: Observational studies have suggested U-shaped relationships of sleep duration and systolic blood pressure (SBP) with the risks of many cardiovascular diseases (CVDs), but the cut-points that separate high-risk and low-risk groups have not been confirmed. We aimed to examine the U-shaped relationships between sleep duration, SBP, and risks of CVDs and to confirm the optimal cut-points for sleep duration and SBP. Methods: A retrospective analysis was conducted on NHANES 2007–2016 data, which included a nationally representative sample of participants. The maximum equal-odds ratio (OR) method was implemented to obtain optimal cut-points for each continuous independent variable. Then, a novel “recursive gradient scanning method” was introduced for discretizing multiple non-monotonic U-shaped independent variables. Finally, a multivariable logistic regression model was constructed to predict critical risk factors associated with CVDs after adjusting for potential confounders. Results: A total of 26,691 participants (48.66% male) were eligible for the current study, with an average age of 49.43 ± 17.69 years. After adjusting for covariates, compared with an intermediate range of sleep duration (6.5–8.0 h per day) and SBP (95–120 mmHg), higher or lower values were associated with a higher risk of CVDs [adjusted OR (95% confidence interval): 1.20 (1.04–1.40) for sleep duration and 1.17 (1.01–1.36) for SBP]. Conclusions: This study indicates U-shaped relationships of SBP and sleep duration with risks of CVDs. Both short and long sleep duration, as well as lower and higher SBP, are predictors of cardiovascular outcomes. An estimated total sleep duration of 6.5–8.0 h per day and an SBP of 95–120 mmHg are associated with a lower risk of CVDs.
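    As a rough illustration of choosing cut-points for a U-shaped covariate (a brute-force sketch, not the paper's maximum equal-OR or recursive gradient scanning methods), one can scan pairs of cut-points defining an intermediate reference band and keep the pair that maximizes the odds ratio of the outcome for values outside the band. All data and names below are synthetic.

```python
# Hedged illustration: brute-force search for a pair of cut-points (lo, hi)
# on a U-shaped covariate so that values outside [lo, hi] carry the largest
# odds ratio for the outcome.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
sleep = rng.normal(7.2, 1.5, n)                       # hours per day
risk = 0.05 + 0.04 * (np.abs(sleep - 7.2) > 1.2)      # U-shaped excess risk
cvd = rng.binomial(1, risk)

def odds_ratio_outside(lo, hi):
    """Logistic model of CVD on the indicator 'outside the band [lo, hi]'."""
    outside = ((sleep < lo) | (sleep > hi)).astype(float)
    X = sm.add_constant(outside)
    fit = sm.Logit(cvd, X).fit(disp=0)
    return np.exp(fit.params[1])                      # OR for the indicator

grid = np.arange(5.5, 9.01, 0.25)
best = max(((lo, hi) for lo in grid for hi in grid if hi > lo + 0.5),
           key=lambda p: odds_ratio_outside(*p))
print("selected reference band:", best)
```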

    Design and Analysis of Life History Studies Involving Incomplete Data

    Incomplete life history data can arise from study design features, coarsened observations, missing covariates, and unobserved latent processes. This thesis consists of three projects developing statistical models and methods to address problems involving such features.

    Statistical models that facilitate the exploration of spatial dependence can advance scientific understanding of chronic disease processes affecting several organ systems or body sites. Motivated by the need to investigate the spatial nature of joint damage in patients with psoriatic arthritis, Chapter 2 develops a multivariate mixture model to characterize latent susceptibility and the progression of joint damage in different locations. In addition to the large number of joints under consideration and the heterogeneity in risk, the times to joint damage are subject to interval censoring because damage status is only observed at intermittent radiological examination times. We address the computational and inferential challenges through composite likelihood and two-stage estimation procedures. The key contribution of this chapter is a convenient and general framework for regression modeling to study risk factors for susceptibility to joint damage and the time to damage, as well as the spatial dependence of these features.

    The design and analysis of two-phase studies are investigated for biomarker studies involving lifetime data. Two-phase designs aim to guide the efficient selection of a sub-sample of individuals from a phase I cohort in which to measure some "expensive" markers under budgetary constraints. In the phase I sample, information on the response and inexpensive covariates is available for a large cohort; in phase II, a subsample is selected in which to assay the marker of interest through examination of a biospecimen. Design efficiency is measured by the precision in estimating the effect of the biomarker on some event process of interest (e.g. disease progression). Chapter 3 considers two-phase designs involving current status observation of the failure process: individuals are monitored at a single assessment time to determine whether or not they have experienced the failure event of interest. This observation scheme is sometimes desirable in practice because it is more efficient and cost-effective than carrying out multiple assessments. We examine efficient two-phase designs under two analysis methods, namely maximum likelihood and inverse probability weighting. The former tends to be more efficient but requires additional assumptions about the nuisance covariate model, while the latter is more robust but yields less efficient estimators since it only analyzes data from the phase II subsample. Optimal designs are derived by minimizing the asymptotic variance of the coefficient estimators for the expensive marker. To circumvent the computational challenge of evaluating asymptotic variances at the design stage, we consider designs involving sub-sampling based on extreme score statistics, extreme observations, or stratified sub-sampling schemes. The role of the assessment time is highlighted.

    Research involving progressive chronic disease processes can be conducted by synthesizing data from different disease registries with different enrolment conditions. In inception cohorts, for example, individuals may be required not to have entered an advanced stage of the disease, while disease registries may focus on individuals who have progressed to a more advanced stage. The former yields left-truncated progression times while the latter yields right-truncated progression times. Chapter 4 considers the development of two-phase designs when the phase I sample contains data pooled from different registries launched to recruit individuals from a common population with different disease-dependent selection criteria. We frame the complex data structure with multistate models and construct partial likelihoods restricted to the parameters of interest using intensity-based models under some model assumptions. Both recruitment (phase I) and sub-selection (phase II) biases are accounted for to ensure valid inference. An inverse probability weighting method is also developed to relax the assumptions needed for the likelihood approach. We investigate and compare the performance of various two-phase sampling schemes under each analysis method and provide practical guidance for phase II selection given budgetary constraints. The contributions of this thesis are reviewed in Chapter 5, where we also mention topics for future research.
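    A minimal sketch of the inverse probability weighting analysis mentioned above for a two-phase design with current status data: phase II subjects are analyzed with weights equal to the inverse of their (known) selection probabilities, here under an assumed working logistic model for the probability of failure by the assessment time. The data, selection rule, and model form are illustrative assumptions, not the thesis's actual design or estimator.

```python
# Hedged sketch of an IPW analysis for a two-phase design with current status data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 4000
z = rng.normal(size=n)                         # inexpensive phase I covariate
x = 0.5 * z + rng.normal(size=n)               # expensive marker (measured in phase II only)
t = rng.exponential(np.exp(-0.7 * x))          # latent failure time
c = rng.uniform(0.2, 3.0, n)                   # single assessment time
y = (t <= c).astype(float)                     # current status indicator

# phase II selection: oversample extreme z (a simple stratified-type rule)
pi = np.where(np.abs(z) > 1.0, 0.8, 0.2)       # known selection probabilities
s = rng.binomial(1, pi).astype(bool)           # phase II indicator

# weighted "working" logistic analysis restricted to the phase II subsample
design = sm.add_constant(np.column_stack([np.log(c), x, z])[s])
fit = sm.GLM(y[s], design, family=sm.families.Binomial(),
             var_weights=1 / pi[s]).fit(cov_type="HC0")  # IPW weights, robust SEs
print(fit.params)   # the coefficient on x estimates the marker effect
```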

    Comprehensive immune profiling and immune-monitoring using body fluid of patients with metastatic gastric cancer

    BACKGROUND: The aim of this study is to profile the cytokines and immune cells in body fluid from patients with metastatic gastric cancer (mGC) and to evaluate their potential role as prognostic factors and their feasibility as predictive biomarkers or as a monitoring source for immune checkpoint inhibitors. METHODS: Body fluids, including ascites and pleural fluid, were obtained from 55 mGC patients, together with 24 matched blood samples. VEGF-A, IL-10, and TGF-β1 were measured, and immune cells were profiled by fluorescence-activated cell sorting (FACS). RESULTS: VEGF-A and IL-10 were significantly higher in body fluid than in plasma of mGC patients. The proportions of T lymphocytes expressing CD69 or PD-1 and of CD45RO+ memory T cells, as well as the number of Foxp3+ regulatory T cells (Tregs), were significantly higher in body fluid than in blood of mGC patients. The proportion of CD8 T lymphocytes with a memory marker (CD45RO) and an activation marker (HLA-DR), the proportion of CD3 T lymphocytes with PD-1, and the number of FoxP3+ Tregs were identified as independent prognostic factors. When patients were classified by the molecular subgroup of the primary tumor, VEGF-A was significantly higher in the genomically stable (GS)-like group than in the chromosomal instability (CIN)-like group, while PD-L1 positive tumor cells (%) showed the opposite pattern. Monitoring immune dynamics using body fluid was also feasible: early activated T cells marked by CD25 were significantly increased in the chemotherapy-treated group. CONCLUSIONS: By analyzing cytokines and the proportions of immune cells in body fluid, the prognosis of patients with mGC can be predicted. Immune monitoring using body fluid may provide more effective treatment for patients with mGC.
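    The abstract does not state how the independent prognostic factors were identified; a multivariable Cox proportional hazards model is a common choice for this kind of analysis. The sketch below is a generic, hypothetical example on synthetic data, not the study's actual analysis; all column names are illustrative.

```python
# Hypothetical sketch: multivariable Cox regression to screen candidate
# prognostic factors, reported as hazard ratios with confidence intervals.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 55
df = pd.DataFrame({
    "cd8_cd45ro_pct": rng.uniform(5, 60, n),     # memory CD8 T cells (%)
    "cd3_pd1_pct": rng.uniform(1, 40, n),        # PD-1+ CD3 T cells (%)
    "foxp3_treg_count": rng.poisson(30, n),      # Treg count
})
hazard = 0.05 * np.exp(0.02 * df["cd3_pd1_pct"])  # one truly prognostic variable
df["time"] = rng.exponential(1 / hazard)          # survival times
df["event"] = rng.binomial(1, 0.8, n)             # some censoring

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()                               # hazard ratios with 95% CIs
```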

    Some Recent Advances in Non- and Semiparametric Bayesian Modeling with Copulas, Mixtures, and Latent Variables

    This thesis develops flexible non- and semiparametric Bayesian models for mixed continuous, ordered and unordered categorical data. These methods have a range of possible applications; the applications considered in this thesis are drawn primarily from the social sciences, where multivariate, heterogeneous datasets with complex dependence and missing observations are the norm.

    The first contribution is an extension of the Gaussian factor model to Gaussian copula factor models, which accommodate continuous and ordinal data with unspecified marginal distributions. I describe how this model is the most natural extension of the Gaussian factor model, preserving its essential dependence structure and the interpretability of factor loadings and the latent variables. I adopt an approximate likelihood for posterior inference and prove that, if the Gaussian copula model is true, the approximate posterior distribution of the copula correlation matrix asymptotically converges to the correct parameter under nearly any marginal distributions. I demonstrate with simulations that this method is both robust and efficient, and illustrate its use in an application from political science.

    The second contribution is a novel nonparametric hierarchical mixture model for continuous, ordered and unordered categorical data. The model includes a hierarchical prior used to couple the component indices of two separate models, which are also linked by local multivariate regressions. This structure effectively overcomes the limitations of existing mixture models for mixed data, namely their overly strong local independence assumptions. In the proposed model, local independence is replaced by local conditional independence, so that the induced model is able to more readily adapt to structure in the data. I demonstrate the utility of this model as a default engine for multiple imputation of mixed data in a large repeated-sampling study using data from the Survey of Income and Program Participation. I show that it improves substantially on its most popular competitor, multiple imputation by chained equations (MICE), while enjoying certain theoretical properties that MICE lacks.

    The third contribution is a latent variable model for density regression. Most existing density regression models are quite flexible but somewhat cumbersome to specify and fit, particularly when the regressors are a combination of continuous and categorical variables. The majority of these methods rely on extensions of infinite discrete mixture models to incorporate covariate dependence in the mixture weights, the atoms, or both. I take a fundamentally different approach, introducing a continuous latent variable which depends on covariates through a parametric regression. In turn, the observed response depends on the latent variable through an unknown function. I demonstrate that a spline prior for the unknown function is quite effective relative to Dirichlet process mixture models in density estimation settings (i.e., without covariates), even though these Dirichlet process mixtures have better asymptotic theoretical properties. The spline formulation also enjoys a number of computational advantages over more flexible priors on functions. Finally, I demonstrate the utility of this model in regression applications using a dataset on U.S. wages from the Census Bureau, where I estimate the return to schooling as a smooth function of the quantile index.
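    The central device of the Gaussian copula factor model (the first contribution above) is to work with the data on a latent normal-score scale so that arbitrary margins can be accommodated. The sketch below shows a simple rank-based, plug-in version of that idea on synthetic mixed data; it only illustrates the copula-correlation target and is not the thesis's Bayesian estimator.

```python
# Hedged sketch: map each margin to normal scores via its empirical CDF and
# estimate the copula correlation on that latent scale.
import numpy as np
from scipy.stats import norm, rankdata

def normal_scores(col):
    """Rank-based transform of one margin to the standard normal scale."""
    n = len(col)
    u = rankdata(col, method="average") / (n + 1)   # empirical CDF in (0, 1)
    return norm.ppf(u)

rng = np.random.default_rng(4)
n = 1000
latent = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)
cont = np.exp(latent[:, 0])                          # continuous, skewed margin
ordinal = np.digitize(latent[:, 1], [-0.5, 0.5])     # 3-level ordinal margin

scores = np.column_stack([normal_scores(cont), normal_scores(ordinal)])
print(np.corrcoef(scores, rowvar=False))             # estimate of the copula correlation
```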

    Cutpoint selection for discretizing a continuous covariate for generalized estimating equations

    We consider the problem of dichotomizing a continuous covariate when performing a regression analysis based on a generalized estimating equation approach. The problem involves estimating the cutpoint for the covariate and testing the hypothesis that the binary covariate constructed from the continuous covariate has a significant impact on the outcome. Because of the multiple testing used to find the optimal cutpoint, an adjustment to the usual significance test is needed to preserve the type-I error rate. We illustrate the techniques on a data set of patients given unrelated hematopoietic stem cell transplantation. Here the question is whether the CD34 cell dose given to the patient affects the outcome of the transplant and what the smallest cell dose needed for good outcomes is. Keywords: dichotomized outcomes; generalized estimating equations; generalized linear model; pseudo-values; survival analysis.
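    A rough sketch of the workflow this abstract describes: scan candidate cutpoints of the continuous covariate, fit a GEE for the dichotomized covariate at each one, take the maximal Wald statistic, and adjust its p-value for the maximal selection. The adjustment below uses a simple permutation reference distribution rather than the paper's correction, and the data, cluster structure, and cutpoint grid are illustrative only.

```python
# Hedged sketch: cutpoint scan with GEE fits and a permutation adjustment
# for the maximally selected Wald statistic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n_pat = 200
dose = rng.lognormal(1.0, 0.5, n_pat)                   # CD34 dose per patient
groups = np.arange(n_pat)                               # one cluster per patient here
y = rng.binomial(1, np.where(dose > 3.0, 0.7, 0.4))     # outcome; true cutpoint near 3

def max_wald(outcome, covariate):
    """Largest |Wald z| for the dichotomized covariate over a cutpoint grid."""
    best = 0.0
    for c in np.quantile(covariate, np.linspace(0.1, 0.9, 25)):
        X = sm.add_constant((covariate > c).astype(float))
        fit = sm.GEE(outcome, X, groups=groups,
                     family=sm.families.Binomial()).fit()
        best = max(best, abs(fit.tvalues[1]))
    return best

observed = max_wald(y, dose)
# permutation reference distribution (small number of replicates for illustration)
null = [max_wald(y, rng.permutation(dose)) for _ in range(100)]
print("adjusted p-value:", np.mean(np.array(null) >= observed))
```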

    Cutpoint selection for discretizing a continuous covariate for generalized estimating equations

    We consider the problem of dichotomizing a continuous covariate when performing a regression analysis based on a generalized estimating equation approach. The problem involves estimating the cutpoint for the covariate and testing the hypothesis that the binary covariate constructed from the continuous covariate has a significant impact on the outcome. Because of the multiple testing used to find the optimal cutpoint, an adjustment to the usual significance test is needed to preserve the type-I error rate. We illustrate the techniques on a data set of patients given unrelated hematopoietic stem cell transplantation. Here the question is whether the CD34 cell dose given to the patient affects the outcome of the transplant and what the smallest cell dose needed for good outcomes is. © 2010 Elsevier B.V. All rights reserved. Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Brazil [2007/02823-3]; National Cancer Institute (NCI/NIH) [R01 CA54706-14].

    Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

    Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
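    Multiple testing is one of the subtopics named above. Below is a minimal, generic example of large-scale testing with false discovery rate control (Benjamini-Hochberg) on a simulated omics-style matrix; the data and thresholds are purely illustrative.

```python
# Generic illustration: many simultaneous two-sample tests with FDR control.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(6)
n_features, n_per_group = 5000, 20
group_a = rng.normal(0, 1, (n_features, n_per_group))
group_b = rng.normal(0, 1, (n_features, n_per_group))
group_b[:100] += 1.0                                  # 100 truly differential features

pvals = ttest_ind(group_a, group_b, axis=1).pvalue    # one p-value per feature
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("features declared significant at 5% FDR:", reject.sum())
```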

    Formalisation et étude de problématiques de scoring en risque de crédit: Inférence de rejet, discrétisation de variables et interactions, arbres de régression logistique

    This manuscript deals with model-based statistical learning in the binary classification setting. As an application, credit scoring is examined in detail, with special attention to its specificities. Proposed and existing approaches are illustrated on real data from Crédit Agricole Consumer Finance, a financial institution specialized in consumer loans which financed this PhD through CIFRE funding.

    First, we consider the so-called reject inference problem, which aims at taking advantage of the information collected on rejected credit applicants for whom no repayment performance can be observed (i.e. unlabelled observations). This industrial problem led to a research one by reinterpreting unlabelled observations as an information loss that can be compensated by modelling missing data. This interpretation sheds light on existing reject inference methods and allows us to conclude that none of them should be recommended, since they lack the proper modelling assumptions that would make them amenable to classical statistical model selection tools.

    Next, another industrial problem was tackled: the discretization of continuous features and the grouping of levels of categorical features before any modelling step. This is motivated by both practical (interpretability) and theoretical (predictive power) reasons. To perform these quantizations, ad hoc heuristics are often used, which are empirical and time-consuming for practitioners. They are seen here as a latent variable problem, bringing us back to a model selection problem. The high combinatorics of this model space necessitated a new cost-effective and automatic exploration strategy, which involves either a particular neural network architecture or a Stochastic-EM algorithm and gives precise statistical guarantees.

    Third, as an extension of the preceding problem, interactions between covariates may be introduced in order to improve predictive performance. This task, until now also processed manually by practitioners and highly combinatorial, carries an increased risk of selecting a poor model. It is performed here with a Metropolis-Hastings sampling procedure which finds the best interactions automatically while retaining its standard convergence properties, so that good predictive performance is guaranteed.

    Finally, in contrast to the preceding problems, which each tackled a particular scorecard, we look at the scoring system as a whole. It generally consists of a tree-like structure composed of many scorecards (each relative to a particular population segment), which is often not optimized but rather imposed by the company's culture and/or history. Again, ad hoc industrial procedures are used, which lead to suboptimal performance. We propose some lines of approach to optimize this logistic regression tree, which yield good empirical performance and new research directions illustrating the predictive strength and interpretability of a mix of parametric and non-parametric models.

    This manuscript concludes with a discussion of potential scientific obstacles, among which is high dimensionality (in the number of features). The financial industry is indeed investing massively in unstructured data storage, which to this day remains largely unused for credit scoring applications. Exploiting it will require statistical guarantees to achieve the additional predictive performance that is hoped for.
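    The discretization step discussed above can be illustrated with a plain baseline: quantile-bin a continuous credit feature and fit a logistic regression scorecard on the bin indicators. This is only a hedged sketch of the general practice the thesis formalizes, not its latent-variable, Stochastic-EM, or neural-network exploration strategies; all names and data are synthetic.

```python
# Baseline sketch: quantile discretization of one continuous feature followed
# by a logistic regression scorecard on the resulting bin indicators.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 10000
income = rng.lognormal(10, 0.6, n)                       # continuous credit feature
default_prob = 1 / (1 + np.exp(0.8 * (np.log(income) - 10)))
y = rng.binomial(1, default_prob)                        # 1 = default

bins = pd.qcut(income, q=5, labels=False)                # 5 quantile bins
X = sm.add_constant(pd.get_dummies(bins, prefix="bin", drop_first=True)
                      .astype(float))
scorecard = sm.Logit(y, X).fit(disp=0)
print(scorecard.params)                                  # per-bin log-odds vs. bin 0
```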