
    Topics in Goodness-of-fit Test for Logistic Regression Models with Continuous Covariates

    There is no widely accepted method that practitioners can use as an appropriate tool for model validation when sparse data are present in multiple logistic regression models. The characteristic of sparsity, i.e. very few observations falling in either grouped or individual covariate patterns, invalidates the asymptotic chi-square distribution, which requires large expected frequencies in each group or bin. Among the available tests, the Hosmer-Lemeshow (HL) test is the best known and has been widely used as the standard test for assessing logistic regression models since its introduction. The deficiencies of the Hosmer-Lemeshow method have been pointed out for years, yet no dominant alternative has emerged so far, and research on assessing logistic regression model fit when sparse data are present is still very active. Two common methods among the handful of proposed alternatives, namely Copas's unweighted residual sum of squares (RSS) and Su and Wei's and Lin's cumulative sums of residuals (CUMSUM), seem to perform better than the HL test in some scenarios; however, the limitations of the studies that introduced those alternatives are obvious: (1) the sample size of the simulation is small (up to 500 observations), (2) the design matrix is relatively simple (usually one continuous and one categorical predictor variable), (3) the number of scenarios considered is limited, and (4) the simulation setups are quite subjective. For these reasons, there are no well-established guidelines on model validation available for statistical practitioners' daily use when fitting a multiple logistic regression model with sparse data; a common suggestion is to check model validity by investigating all the existing goodness-of-fit tests to see whether they provide similar evidence of lack of fit. It is therefore crucial to assess the performance of each method through a comprehensive comparative study. We designed the comparison differently in at least the four directions mentioned above: varied and expanded sample sizes, a relatively complicated design matrix, more scenarios, including adding (over-fitting) continuous/categorical predictor variables and omitting (under-fitting) main effect and/or interaction terms, and a more flexible and robust simulation setting in which many randomly sampled models, rather than a very few pre-specified models, were investigated. Furthermore, we proposed a goodness-of-fit test that introduces a new method to partition the fitted values based on the commonly known conditions for the limiting distribution of chi-square-type statistics for grouped data, which to some extent overcomes the disadvantage of the HL test when the expected counts in some bins are small (the cut-off is usually set at less than five). We also conducted the comparative study including our proposed method. We summarize the goodness-of-fit results in terms of empirical level of significance and power, and offer recommendations based on our more generalized simulation studies.
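    For reference, a minimal sketch of the standard Hosmer-Lemeshow statistic discussed above, grouping observations into deciles of risk (this illustrates the classical test, not the thesis's proposed partitioning; the function name and the choice of g = 10 are illustrative):

        import numpy as np
        from scipy import stats

        def hosmer_lemeshow(y, p_hat, g=10):
            """Classical HL test: bin observations by sorted fitted probability,
            then compare observed and expected event counts per bin with a
            chi-square statistic referred to g - 2 degrees of freedom."""
            order = np.argsort(p_hat)
            y, p_hat = np.asarray(y)[order], np.asarray(p_hat)[order]
            chi2 = 0.0
            for idx in np.array_split(np.arange(len(y)), g):  # near-equal-size risk groups
                obs = y[idx].sum()          # observed events in the bin
                exp = p_hat[idx].sum()      # expected events in the bin
                n_k, p_bar = len(idx), p_hat[idx].mean()
                chi2 += (obs - exp) ** 2 / (n_k * p_bar * (1 - p_bar))
            return chi2, stats.chi2.sf(chi2, df=g - 2)

    The sparsity problem described above surfaces here directly: when a bin's expected count n_k * p_bar is small, the chi-square reference distribution becomes unreliable.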

    Testing Lack-of-Fit of Generalized Linear Models via Laplace Approximation

    In this study we develop a new method for testing the null hypothesis that the predictor function in a canonical link regression model has a prescribed linear form. This class of models, which we refer to as canonical link regression models, constitutes arguably the most important subclass of generalized linear models and includes several of the most popular generalized linear models. In addition to the primary contribution of this study, we revisit several other tests in the existing literature. The common feature among the proposed test and the existing tests is that they are all based on orthogonal series estimators and are used to detect departures from a null model. Our proposal for a new lack-of-fit test is inspired by the recent contribution of Hart and is based on a Laplace approximation to the posterior probability of the null hypothesis. Despite having a Bayesian construction, the resulting statistic is implemented in a frequentist fashion. The formulation of the statistic is based on characterizing departures from the predictor function in terms of Fourier coefficients and subsequently testing that all of these coefficients are 0. The resulting test statistic can be characterized as a weighted sum of exponentiated squared Fourier coefficient estimators, where the weights depend on user-specified prior probabilities. The prior probabilities give the investigator the flexibility to examine specific departures from the prescribed model. Alternatively, the use of noninformative priors produces a new omnibus lack-of-fit statistic. We present a thorough numerical study of the proposed test and the various existing orthogonal series-based tests in the context of the logistic regression model. Simulation studies demonstrate that the test statistics under consideration possess desirable power properties against alternatives that have been identified in the existing literature as important.
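    Schematically, and with illustrative symbols only (the exact weights and scaling are specified in the thesis), a statistic of the kind described above can be written in LaTeX as

        T_n = \sum_{j=1}^{J} w_j \exp\!\big( \tfrac{n}{2}\, \hat{\phi}_j^{\,2} \big),
        \qquad H_0\colon\ \phi_1 = \phi_2 = \cdots = \phi_J = 0,

    where the \hat{\phi}_j are estimators of the Fourier coefficients characterizing departures from the prescribed predictor function, and the weights w_j are derived from the user-specified prior probabilities; with noninformative priors the w_j are equal, giving the omnibus version of the test.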

    Prostate Cancer Relapse Prediction with Biomarkers and Logistic Regression

    Prostate cancer is the second most common cancer among men, and risk evaluation of the cancer prior to treatment can be critical. Risk evaluation of prostate cancer is based on multiple factors, such as clinical assessment. Biomarkers are being studied as they could also be beneficial in risk evaluation. In this thesis we assess the predictive abilities of biomarkers regarding prostate cancer relapse. The statistical method we utilize is the logistic regression model, which models the probability of a dichotomous outcome variable; in this case the outcome variable indicates whether the cancer of the observed patient has relapsed. The four biomarkers AR, ERG, PTEN and Ki67 form the explanatory variables; they are the most studied biomarkers in prostate cancer tissue. The biomarkers are usually detected by visual assessment of the expression status or abundance of staining. Artificial intelligence image analysis is not yet in common clinical use, but it is being studied as a potential diagnostic aid. For each biomarker, the data contain a visually obtained variable and a variable obtained by artificial intelligence. In the analysis we compare the predictive power of these two differently obtained sets of variables. Due to the large number of explanatory variables, we seek the best-fitting model, using the glmulti algorithm to select the explanatory variables. The predictive power of the models is measured by the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The data contain two classifications of prostate cancer according to whether the cancer was visible in magnetic resonance imaging (MRI). The classification is not exclusive, since a patient could have had both an MRI-visible and an MRI-invisible cancer. The data were split into three datasets: MRI-visible cancers, MRI-invisible cancers, and the two combined. By splitting the data we could further analyze whether MRI-visible cancers differ in relapse prediction from MRI-invisible cancers. In the analysis we find that none of the variables from MRI-invisible cancers are significant in prostate cancer relapse prediction. In addition, all the variables regarding the biomarker AR have no predictive power. The best biomarker for predicting prostate cancer relapse is Ki67, where a high staining percentage indicates a greater probability of relapse. The variables of the biomarker Ki67 were significant in multiple models, whereas the biomarkers ERG and PTEN had significant variables only in a few models. The artificial intelligence variables show more accurate predictions compared to the visually obtained variables, but we could not conclude that the artificial intelligence variables are simply better. We learn instead that the visual and the artificial intelligence variables complement each other in predicting cancer relapse.
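    As a minimal sketch of the workflow described above (the file and column names such as "biomarkers.csv", "ki67_ai" and "relapse" are hypothetical, and scikit-learn's logistic regression stands in for the glmulti-based model selection actually used in the thesis):

        import pandas as pd
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score

        # Hypothetical data: one row per patient, AI-derived biomarker scores
        # plus a binary relapse indicator (1 = cancer relapsed).
        df = pd.read_csv("biomarkers.csv")
        X = df[["ar_ai", "erg_ai", "pten_ai", "ki67_ai"]]
        y = df["relapse"]

        model = LogisticRegression(max_iter=1000).fit(X, y)
        p_hat = model.predict_proba(X)[:, 1]   # predicted relapse probabilities

        # Predictive power summarized by the area under the ROC curve,
        # the comparison criterion used across models in the thesis.
        print("AUC:", roc_auc_score(y, p_hat))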

    The predictability of New Mexico’s summative reading assessment by two commonly used early literacy assessments, the Dynamic Indicators of Basic Early Literacy Skills Next (DIBELS Next) and the Developmental Reading Assessment – Second Edition (DRA2)

    This study determines the degree to which first grade literacy tests predict third grade reading performance, in order to judge their value as early warning systems for reading skills. Reading skills are fundamental to many academic outcomes, so having an early sense of how students are reading is critical for schools. The first grade reading tests being compared are the Developmental Reading Assessment-Second Edition (DRA2) and the Dynamic Indicators of Basic Early Literacy Skills – Next (DIBELS Next). The study employs two datasets, one with DIBELS Next scores (N=5,456) and one with DRA2 scores (N=2,209). Logistic regression is used to judge predictability, and all logistic regression models are generated with the Statistical Package for the Social Sciences (SPSS) version 20. The dependent variable is operationalized as scoring proficient or not proficient on the New Mexico third grade English language arts/reading Standards Based Assessment (SBA). The independent variable is the composite score on the early literacy assessment. Covariates are demographic characteristics (i.e., gender, racial group, English language learner status and economically disadvantaged status). For both models, the beginning-of-year composite score had a significant overall effect in predicting student proficiency on the SBA. The DRA2 model had higher sensitivity and higher positive and negative predictive values than the DIBELS Next model. Conversely, the DIBELS Next model had higher false positive and false negative rates than the DRA2.
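    The accuracy measures compared above follow from the 2x2 table of predicted versus actual proficiency; a minimal sketch, assuming a 0.5 classification cutoff on the predicted probabilities (the cutoff is an assumption, not taken from the study):

        import numpy as np

        def classification_summary(y_true, p_hat, cutoff=0.5):
            """Sensitivity, predictive values and error rates computed from
            predicted probabilities and true binary outcomes."""
            y_true = np.asarray(y_true)
            pred = (np.asarray(p_hat) >= cutoff).astype(int)
            tp = np.sum((pred == 1) & (y_true == 1))
            tn = np.sum((pred == 0) & (y_true == 0))
            fp = np.sum((pred == 1) & (y_true == 0))
            fn = np.sum((pred == 0) & (y_true == 1))
            return {
                "sensitivity": tp / (tp + fn),
                "positive predictive value": tp / (tp + fp),
                "negative predictive value": tn / (tn + fn),
                "false positive rate": fp / (fp + tn),
                "false negative rate": fn / (fn + tp),
            }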

    A Quantitative Study Exploring the Perceptions of Body-Worn Camera Use in the Texas Department of Juvenile Justice

    Social justice issues led to the implementation of body-worn cameras (BWCs) in police departments throughout the United States. This widespread implementation provided research results to assist other police agencies in considering implementation; however, no similar criminal justice solution has been implemented with the same level of practicality in adult and juvenile corrections. BWCs have the potential to protect inmates in accordance with the Prison Rape Elimination Act's (PREA) requirements and to address the most critical social justice issue in corrections: advocating for civil rights. The former Texas Youth Commission (TYC) was re-branded as the Texas Juvenile Justice Department (TJJD) due to a history of sexual assault and civil rights abuse (Cate, 2016; Donnelly, 2018). Applying the findings on BWC implementation by law enforcement agencies, together with the few existing studies in adult prisons, suggests that implementing BWCs in juvenile justice provides an opportunity to counter perceptions of a lack of legitimacy and procedural justice. Yet little research exists on implementing BWCs in a corrections environment. This study examines TJJD facility staff perceptions of BWCs using pre-existing surveys, following a non-experimental repeated cross-sectional research design. Recommendations for further research include examining how BWC implementation procedures differ in corrections, where differing usage and compliance procedures require different decision criteria for corrections environments. Keywords: Body-Worn Cameras, Juvenile Justice, Procedural Justice

    Isotonic Distributional Regression

    Distributional regression estimates the probability distribution of a response variable conditional on covariates. The estimated conditional distribution comprehensively summarizes the available information on the response variable and allows one to derive all statistical quantities of interest, such as the conditional mean, threshold exceedance probabilities, or quantiles. This thesis develops isotonic distributional regression, a method for estimating conditional distributions under the assumption of a monotone relationship between covariates and a response variable. The response variable is univariate and real-valued, and the covariates lie in a partially ordered set. The monotone relationship is formulated in terms of stochastic order constraints, that is, the response variable increases in a stochastic sense as the covariates increase in the partial order. This assumption alone yields a shape-constrained non-parametric estimator, which does not involve any tuning parameters. The estimation of distributions under stochastic order restrictions has already been studied for various stochastic orders, but so far only with totally ordered covariates. Apart from considering more general partially ordered covariates, the first main contribution of this thesis lies in a shift of focus from estimation to prediction. Distributional regression is the backbone of probabilistic forecasting, which aims at comprehensively quantifying the uncertainty about a future quantity of interest in the form of probability distributions. When analyzed with respect to the predominant criteria for probabilistic forecast quality, isotonic distributional regression is shown to have desirable properties. In addition, this thesis develops an efficient algorithm for the computation of isotonic distributional regression, and proposes an estimator under a weaker, previously not thoroughly studied stochastic order constraint. A main application of isotonic distributional regression is uncertainty quantification for point forecasts. Such point forecasts sometimes stem from external sources, like physical models or expert surveys, but often they are generated with statistical models. The second contribution of this thesis is the extension of isotonic distributional regression to allow covariates that are point predictions from a regression model, which may be trained on the same data to which isotonic distributional regression is to be applied. This combination yields a so-called distributional index model. Asymptotic consistency is proved under suitable assumptions, and real data applications demonstrate the usefulness of the method. Isotonic distributional regression provides a benchmark in forecasting problems, as it allows one to quantify the merit of a specific, tailored model for the application at hand over a generic method which relies only on monotonicity. In such comparisons it is vital to assess the significance of forecast superiority or of forecast misspecification. The third contribution of this thesis is the development of new, safe methods for forecast evaluation, which require no or minimal assumptions on the data generating processes.
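    For the special case of a single real-valued covariate (the thesis treats general partially ordered covariates), the core idea can be sketched with off-the-shelf isotonic regression: under the stochastic order assumption, P(Y <= t | X = x) is non-increasing in x for every threshold t, so the conditional CDF can be estimated thresholdwise. The function below is a sketch under that simplification, not the thesis's general algorithm.

        import numpy as np
        from sklearn.isotonic import IsotonicRegression

        def idr_cdf_1d(x, y, thresholds):
            """Sketch of isotonic distributional regression with one real covariate:
            for each threshold t, fit an antitonic (decreasing) regression of the
            indicator 1{y <= t} on x, yielding a monotone estimate of P(Y <= t | x)."""
            x, y = np.asarray(x), np.asarray(y)
            cdf = {}
            for t in thresholds:
                iso = IsotonicRegression(increasing=False, y_min=0.0, y_max=1.0)
                cdf[t] = iso.fit(x, (y <= t).astype(float))
            return cdf   # cdf[t].predict(x_new) estimates P(Y <= t | X = x_new)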

    PRE-CRASH EXTRACTION OF THE CONSTELLATION OF A FRONTAL COLLISION BETWEEN TWO MOTOR VEHICLES

    One of the strategic objectives of the European Commission is to halve the number of road traffic fatalities by 2020. In addition, in 2010, the United Nations General Assembly initiated the "Decade of Action for Road Safety 2011-2020" to reduce the number of fatalities and decrease the number of road traffic injuries. To address the scourge of road traffic accidents, this thesis presents a research study which has devised and evaluated a novel algorithm for extracting the constellation of an unavoidable frontal vehicle-to-vehicle accident. The primary research questions addressed in this work are:
    • What are the most significant collision parameters which influence the injury severity for a frontal collision between two motor vehicles?
    • How can the constellation of a crash be extracted before the accident occurs?
    In addition, the following secondary research questions were addressed:
    • How can physical constraints, imposed on the rate of acceleration of a real vehicle, be integrated, together with data from vehicle-to-vehicle (V2V) communication, into the crash constellation extraction algorithm?
    • How can uncertainties, associated with the data captured by sensors of a real vehicle, be integrated into a simulation model devised for assessing the performance of crash constellation extraction algorithms?
    Statistical analysis, conducted to determine significant collision parameters, identified three significant crash constellation parameters: the point of collision on the vehicle body, the relative velocity between the vehicles, and the vehicle alignment offset (or vehicle overlap). The research reported in this thesis has also produced a novel algorithm for analysing the data captured by vehicle sensors to extract the constellation of an unavoidable vehicle-to-vehicle frontal accident. The algorithm includes a model of physical constraints on the acceleration of a vehicle, cast as a gradual rise and eventual saturation of vehicle acceleration, together with the acceleration lag relative to the timing of information received from V2V communication. In addition, the research has delivered a simulation model to support the evaluation of the performance of crash constellation extraction algorithms, including a technique for integrating into the simulation model the uncertainties associated with the data captured by the sensors of a real vehicle, so that the simulation can approach real-world behaviour. The results of the assessment of the soundness of the simulation model show that the model produces the expected level of estimation errors, whether the simulation data is considered on its own or compared to data from tests performed with a real vehicle. Simulation experiments for the performance evaluation of the crash constellation extraction algorithm show that the uncertainty associated with the estimated time-to-collision decreases as vehicle velocity increases or as the actual time-to-collision decreases. The results also show that a decreasing time-to-collision leads to a decreasing uncertainty associated with the estimated position of the tracked vehicle, the estimated collision point on the ego vehicle, and the estimated relative velocity between the two vehicles about to collide.
    The results of the performance assessment of the crash constellation extraction algorithm also show that V2V information has a beneficial influence on the precision of the constellation extraction, with regard to the predicted time-to-collision, the predicted position and velocity of the oncoming vehicle against which a collision is possible, the predicted relative velocity between the two vehicles about to collide, and the predicted point of collision on the body of the ego vehicle. It is envisaged that the techniques developed in the research reported in this thesis will be used in future integrated safety systems for motor vehicles. They could then strongly impact passenger safety by enabling optimal activation of safety measures to protect the vehicle occupants, as determined from the estimated constellation of the impending crash.
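    The physical-constraint model mentioned above (a gradual rise and eventual saturation of acceleration, lagged relative to incoming V2V information) might be sketched as a first-order lag; the time constant, lag and saturation limit below are illustrative assumptions, not the thesis's calibration.

        import numpy as np

        def constrained_acceleration(t, a_cmd, a_max=8.0, tau=0.3, lag=0.1):
            """Illustrative constraint model: the commanded acceleration a_cmd (m/s^2)
            takes effect only after a lag, builds up as a first-order response with
            time constant tau, and saturates at +/- a_max."""
            t_eff = np.maximum(np.asarray(t) - lag, 0.0)   # no response before the lag elapses
            rise = 1.0 - np.exp(-t_eff / tau)              # gradual build-up toward a_cmd
            return np.clip(a_cmd * rise, -a_max, a_max)    # saturation at the physical limit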

    Contributions to the prediction of university dropout through data mining

    This work identifies a limited body of scientific literature analyzing dropout factors from the perspective of the student, who is the main actor in dropout, as well as limited work on the construction of hybrid prediction models that would allow a better understanding of the dropout problem in universities. The objective is to contribute to the prediction of university student dropout through a comprehensive study of the factors, techniques and data mining tools used for this purpose. It is concluded that dropout prediction in universities can vary, since it depends on the entry factors, the educational context studied, the educational environment applied, and the background of the studies for which the models were used. Furthermore, it is considered important to determine whether predicting dropout is sufficient, or whether studies establishing strategies to mitigate dropout in higher education institutions must also be incorporated.