
    Global and local distance-based generalized linear models

    This paper introduces local distance-based generalized linear models. These models extend (weighted) distance-based linear models, first to the generalized linear model framework; a nonparametric version of these models is then proposed by means of local fitting. Distances between individuals are the only predictor information needed to fit these models, so they are applicable, among other settings, to mixed (qualitative and quantitative) explanatory variables or to regressors of functional type. An implementation is provided by the R package dbstats, which also implements other distance-based prediction methods. Supplementary material, available online, reproduces all the results of the article.
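    The paper's implementation is the R package dbstats; as a rough, language-neutral illustration of the core idea (local prediction using only a distance matrix between individuals), here is a minimal Python sketch. The function name, the Gaussian kernel, and the bandwidth choice are ours for illustration, not details taken from the paper.

```python
import numpy as np

def local_distance_predict(D, y, i, bandwidth=1.0):
    """Kernel-weighted local prediction for observation i, using only the
    inter-individual distance matrix D and the responses y."""
    w = np.exp(-(D[i] / bandwidth) ** 2)  # Gaussian kernel on distances
    w[i] = 0.0                            # leave-one-out: exclude the point itself
    return np.sum(w * y) / np.sum(w)

# toy data: distances derived from a single feature, smooth noise-free response
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
D = np.abs(x[:, None] - x[None, :])       # the distance matrix is the only input
y = np.sin(2 * np.pi * x)
preds = np.array([local_distance_predict(D, y, i, bandwidth=0.1)
                  for i in range(50)])
```

    Because the predictor enters only through D, the same code would work for distances computed from mixed or functional covariates.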

    Em Approach on Influence Measures in Competing Risks Via Proportional Hazard Regression Model

    In a conventional competing risks model, the time to failure of a particular experimental unit might be censored and the cause of failure can be known or unknown. In this thesis the analysis of this model was based on the cause-specific hazard of the Cox model. Expectation Maximization (EM) was used to estimate the parameters, and these estimates were compared with those from the Newton-Raphson iteration method. Generated data with exponentially distributed failure times were used to compare the two estimation methods further. From the simulation study for this particular case, we conclude that the EM algorithm is superior in terms of the mean value of the parameter estimates, the bias and the root mean square error. To detect irregularities and peculiarities in the data set, the residuals, Cook's distance and the likelihood distance were computed. Unlike the residuals, the perturbation method of Cook's distance and the likelihood distance were effective in detecting observations that influenced the parameter estimates. We considered both the EM approach and the ordinary maximum likelihood estimation (MLE) approach in computing the influence measures, using one-step approximations for the final results. The EM one-step and maximum likelihood (ML) one-step gave conclusions analogous to the full-iteration distance measures. In comparison, the EM one-step gave better results than the ML one-step with respect to the values of Cook's distance and the likelihood distance, and Cook's distance was better than the likelihood distance with respect to the number of observations detected.
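    The thesis's diagnostics are built on case-deletion distances. As a minimal sketch of the full-iteration likelihood distance, here is the simplest tractable case: an uncensored exponential failure-time model, where the MLE is available in closed form and LD_i = 2[l(theta_hat) - l(theta_hat_(i))] can be computed by leaving each observation out. Function names and the toy data are ours; the thesis works with the far richer cause-specific Cox setting.

```python
import numpy as np

def exp_loglik(lam, times, events):
    # exponential log-likelihood with censoring indicators:
    # sum(events) * log(lam) - lam * sum(times)
    return events.sum() * np.log(lam) - lam * times.sum()

def likelihood_distances(times, events):
    """Case-deletion likelihood distance LD_i = 2[l(full MLE) - l(leave-one-out MLE)],
    both evaluated on the full data."""
    lam_full = events.sum() / times.sum()   # closed-form MLE for the exponential model
    l_full = exp_loglik(lam_full, times, events)
    n = len(times)
    ld = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        lam_i = events[mask].sum() / times[mask].sum()
        ld[i] = 2.0 * (l_full - exp_loglik(lam_i, times, events))
    return ld

# the last observation is an extreme survivor and should dominate the diagnostics
times = np.array([1.0, 0.8, 1.2, 0.9, 1.1, 25.0])
events = np.ones(6)
ld = likelihood_distances(times, events)
```

    The observation whose deletion moves the MLE the most receives the largest distance, which is exactly what the influence measures in the abstract are designed to flag.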

    The Standardized Influence Matrix and Its Applications to Generalized Linear Models

    The standardized influence matrix is a generalization of the standardized influence function and of Cook's approach to local influence. It provides a general and unified approach to judging the suitability of statistical inference based on parametric models, characterizing the local influence of data deviations from parametric models on various estimators, including those of generalized linear models. Its use for both robustness measures and diagnostic procedures has been studied. With globally robust estimators, diagnostic statistics are proposed and shown to be useful in detecting influential points for linear regression and logistic regression models. The robustness of various estimators is compared via the standardized influence matrix, and a new robust estimator for logistic regression models is presented.

    influence.ME: tools for detecting influential data in mixed effects models

    influence.ME provides tools for detecting influential data in mixed effects models. The application of these models has become common practice, but the development of diagnostic tools has lagged behind. influence.ME calculates standardized measures of influential data for the point estimates of generalized mixed effects models, such as DFBETAS and Cook's distance, as well as percentile change and a test for changing levels of significance. It calculates these measures while accounting for the nesting structure of the data. The package and measures of influential data are introduced, a practical example is given, and strategies for dealing with influential data are suggested.

    Prostate Cancer Relapse Prediction with Biomarkers and Logistic Regression

    Prostate cancer is the second most common cancer among men, and risk evaluation prior to treatment can be critical. Risk evaluation of prostate cancer is based on multiple factors, such as clinical assessment; biomarkers are studied because they could also be beneficial in risk evaluation. In this thesis we assess the predictive abilities of biomarkers regarding prostate cancer relapse. The statistical method we utilize is the logistic regression model, which models the probability of a dichotomous outcome variable; here the outcome variable indicates whether the observed patient's cancer has relapsed. The four biomarkers AR, ERG, PTEN and Ki67 form the explanatory variables; they are the most studied biomarkers in prostate cancer tissue. The biomarkers are usually detected by visual assessment of the expression status or abundance of staining. Artificial intelligence image analysis is not yet in common clinical use, but it is studied as a potential diagnostic aid. The data contain, for each biomarker, a visually obtained variable and a variable obtained by artificial intelligence, and in the analysis we compare the predictive power of these two differently obtained sets of variables. Due to the large number of explanatory variables, we use the glmulti algorithm to select the explanatory variables for the best-fitting model. The predictive power of the models is measured by the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The data contain two classifications of prostate cancer according to whether the cancer was visible in magnetic resonance imaging (MRI). The classification is not exclusive, since a patient could have had both an MRI-visible and an MRI-invisible cancer. The data were split into three datasets: MRI-visible cancers, MRI-invisible cancers, and the two combined. By splitting the data we could further analyze whether MRI-visible cancers differ from MRI-invisible cancers in relapse prediction. In the analysis we find that none of the variables from MRI-invisible cancers are significant in predicting prostate cancer relapse. In addition, all the variables regarding the biomarker AR have no predictive power. The best biomarker for predicting prostate cancer relapse is Ki67, where a high staining percentage indicates a greater probability of relapse. The variables of the biomarker Ki67 were significant in multiple models, whereas the biomarkers ERG and PTEN had significant variables only in a few models. The artificial intelligence variables show more accurate predictions than the visually obtained variables, but we could not conclude that they are purely better; instead, the visual and artificial intelligence variables complement each other in predicting cancer relapse.
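    The modelling pipeline described (logistic regression on biomarker scores, evaluated by ROC AUC) is straightforward to sketch. The real data are not public, so the example below simulates four staining-percentage variables with a relapse signal carried only by the Ki67 stand-in (column 3); all names and effect sizes are illustrative assumptions, and the thesis's glmulti-based variable selection is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# hypothetical stand-ins for the four biomarkers (AR, ERG, PTEN, Ki67)
rng = np.random.default_rng(2)
X = rng.uniform(0, 100, size=(200, 4))   # staining percentages, 0-100
logit = 0.05 * (X[:, 3] - 50)            # only the Ki67 column carries signal
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # simulated relapse indicator

model = LogisticRegression(max_iter=1000).fit(X, y)
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])   # in-sample AUC
```

    In practice the AUC should be estimated out of sample (cross-validation or a held-out set); the in-sample value above is optimistic by construction.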

    Diagnostics for joint models for longitudinal and survival data

    Joint models for longitudinal and survival data are a class of models that jointly analyse an outcome repeatedly observed over time, such as a biomarker, together with associated event times. These models are useful in two practical applications: first, focusing on the survival outcome while accounting for time-varying covariates measured with error; and second, focusing on the longitudinal outcome while controlling for informative censoring. Interest in the estimation of these joint models has grown over the past two and a half decades. However, minimal effort has been directed towards developing diagnostic assessment tools for them. The available diagnostic tools have mainly been based on separate analyses of residuals for the longitudinal and survival sub-models, which can be sub-optimal. In this thesis we make four contributions to the body of knowledge. We first developed influence diagnostics for the shared-parameter joint model for longitudinal and survival data based on Cook's statistics, evaluated their performance using simulation studies under different scenarios, and illustrated them using a real data set from a multi-center clinical trial on TB pericarditis (IMPI). The second contribution was to implement a variance shift outlier model (VSOM) in the two-stage joint survival model, achieved by identifying outlying subjects in the longitudinal sub-model and down-weighting them before the second stage of the joint model. The third contribution was to develop influence diagnostics for the multivariate joint model for longitudinal and survival data. In this setting we considered two longitudinal outcomes: square-root CD4 cell count, which was Gaussian in nature, and antiretroviral therapy (ART) uptake, which was binary. We achieved this by extending the univariate case based on Cook's statistics for all parameters. The fourth contribution was to implement influence diagnostics in joint models for longitudinal and survival data with multiple failure types (competing risks). Using the IMPI data set we considered two competing events in the joint model: death and constrictive pericarditis. In both simulation studies and the IMPI data set, the developed diagnostics identified influential subjects as well as influential observations, with performance above 98% in the simulations. We further conducted sensitivity analyses to check the impact of influential subjects and/or observations on parameter estimates by excluding them and re-fitting the joint model. We observed only subtle differences in the parameter estimates overall, which gives confidence that the initial inferences are credible and can be relied on. We illustrated case-deletion diagnostics using the IMPI trial setting; these diagnostics can also be applied to clinical trials with similar settings. We therefore strongly recommend that analysts conduct influence diagnostics in joint models for longitudinal and survival data to ascertain the reliability of parameter estimates, and that VSOM be implemented in the longitudinal part of the two-stage joint model before the second stage.

    "Influence Sketching": Finding Influential Samples In Large-Scale Regressions

    There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples, and find that influence sketching pointed us to new, previously unidentified pieces of malware.
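    The key idea, computing influence on a randomly projected version of the design rather than the full high-dimensional one, can be sketched in a few lines. The code below is a deliberate simplification of the paper's method: it applies a Gaussian feature projection to plain OLS Cook's distance, rather than the paper's GLM pseudo-dataset construction, and all names and data are ours.

```python
import numpy as np

def cooks_distance(X, y):
    """Exact Cook's distance for OLS (the expensive baseline)."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
    h = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - p)
    return resid**2 * h / (p * s2 * (1 - h) ** 2)

def sketched_cooks_distance(X, y, k, seed=0):
    """Randomly project the p features down to k before computing influence,
    in the spirit of influence sketching (simplified: OLS, not a GLM)."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return cooks_distance(X @ G, y)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))
y = X[:, 0] + rng.normal(size=200)
y[5] += 8.0                                   # one strongly influential sample
exact = cooks_distance(X, y)
approx = sketched_cooks_distance(X, y, k=20)  # influence computed in 20 dims
```

    The sketched scores are cheaper (the linear algebra is in k dimensions instead of p) while still surfacing the same strongly influential sample; at the paper's scale (roughly 100,000 features) the projection is what makes the computation feasible at all.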

    Generalized Linear Mixed Modeling to Examine the Relationship Between Self Efficacy and Smoking Cessation

    The relationship between self-efficacy and smoking cessation is unclear. Self-efficacy is often viewed as a causal antecedent of future abstinence from smoking, a primary outcome of cessation studies. However, recent research has questioned whether a participant's report of self-efficacy is a reflection on previous abstinence success or failure rather than a precursor. To elucidate the dynamic relationship between self-efficacy and abstinence status, two generalized linear mixed models were developed. The first examined the ability of self-efficacy to predict next-day abstinence, while the second examined the ability of abstinence to predict self-efficacy ratings taken later the same day. All data came from a 2 x 2 crossover trial examining how interest in quitting smoking and monetary reinforcement for abstinence affect the short-term effects of medication on abstinence from smoking. Participants received both medication and placebo conditions in consecutive phases in a counter-balanced order, with an ad lib smoking washout period in between. Abstinence from smoking and self-efficacy were recorded daily during both medication phases. Participants were 124 smokers, mean age 31.1 (SE: 1.0), who smoked on average 16.3 (SE: 0.5) cigarettes per day and had a mean FTND score of 4.6 (SE: 0.1); 56.5% were female. Results indicate that self-efficacy is both a predictor of, and a reflection on, abstinence status. Models were validated using bootstrapping procedures, which revealed only a small amount of bias. The effects observed in this study may be constrained by the timing of assessments as well as the duration of the cessation attempt. Public Health Importance: Tobacco use accounts for 443,000 deaths each year. Therefore, the development of successful clinical assessments to monitor smoking cessation efforts is of the utmost importance. Self-efficacy is a measure of confidence in the ability to quit smoking. This study shows that the relationship between self-efficacy and smoking cessation is bi-directional and may be influenced by the timing of assessments. Understanding this relationship may lead to more successful use of self-efficacy as a clinical tool during smoking cessation attempts.

    Model Building on Selectivity of Gas Antisolvent Fractionation Method Using the Solubility Parameter

    Solubility parameters are widely used in the polymer industry and are often applied in the high-pressure field as well, because they make it possible to combine the effects of all operational parameters on solubility in a single term. We demonstrate a statistical methodology for applying solubility parameters to construct a model describing antisolvent-fractionation-based chiral resolution, a complex process comprising a chemical equilibrium, precipitation and extraction. The solubility parameter used in this article is the Hansen parameter. Experimental results for the resolution and crystallization of ibuprofen with (R)-phenylethylamine, based on diastereomeric salt formation by the gas antisolvent fractionation (GASF) method, were evaluated. Two sets of experiments were performed: one with methanol as the organic solvent in an undesigned experiment, and one with ethanol in a designed experiment. The use of a D-optimal design to decrease the necessary number of experiments and to overcome the problem of a constrained design space was demonstrated. Linear models including dependence on pressure, temperature and the solubility parameter were appropriate to describe the selectivity of the GASF optical resolution method in both sets of experiments.