Global and local distance-based generalized linear models
This paper introduces local distance-based generalized linear models. These models first extend (weighted) distance-based linear models to the generalized linear model framework; a nonparametric version is then proposed by means of local fitting. Distances between individuals are the only predictor information needed to fit these models, so they are applicable, among other settings, to mixed (qualitative and quantitative) explanatory variables or to regressors of functional type. An implementation is provided by the R package dbstats, which also implements other distance-based prediction methods. Supplementary material reproducing all the results of the article is available online.
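The core idea, that a distance matrix alone can drive a regression, can be sketched with classical multidimensional scaling: double-centre the squared distances to recover latent coordinates, then regress on them. This is a minimal illustration of the principle, not the dbstats implementation; all names and data below are invented.

```python
import numpy as np

def distance_based_fit(D, y, k):
    """Fit a linear model using only an n x n distance matrix D.

    Classical MDS: double-centre the squared distances to recover
    latent coordinates, then regress y on the top-k coordinates.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # inner-product matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]           # top-k eigenvalues
    X = vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ beta                             # fitted values

# Toy check: with Euclidean distances the fit matches OLS on the raw X,
# because MDS recovers the centred X up to rotation.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
yhat_db = distance_based_fit(D, y, k=2)
X1 = np.column_stack([np.ones(30), X])
yhat_ols = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
print(np.allclose(yhat_db, yhat_ols, atol=1e-6))
```

The rotation invariance of least-squares fitted values is what makes the distance-only formulation equivalent to ordinary regression in the Euclidean case; the interest of the method lies in distances where no raw X exists.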
EM Approach on Influence Measures in Competing Risks via Proportional Hazard Regression Model
In a conventional competing risks model, the time to failure of a particular experimental unit may be censored and the cause of failure may be known or unknown. In this thesis the analysis of this model was based on the cause-specific hazard of the Cox model. The Expectation-Maximization (EM) algorithm was used to estimate the parameters, and these estimates were compared with those from the Newton-Raphson iteration. Generated data with exponentially distributed failure times were used to compare the two estimation methods further. The simulation study for this case shows that the EM algorithm is superior in terms of the mean of the parameter estimates, bias, and root mean square error. To detect irregularities and peculiarities in the data set, residuals, Cook's distance, and the likelihood distance were computed. Unlike the residuals, the perturbation method of Cook's distance and the likelihood distance were effective in detecting observations that influenced the parameter estimates. Both the EM approach and the ordinary maximum likelihood estimation (MLE) approach were considered in computing the influence measures, and the final influence measures were obtained with a one-step approach. The EM one-step and the maximum likelihood (ML) one-step gave conclusions analogous to the full-iteration distance measures. In comparison, the EM one-step gave better results than the ML one-step with respect to the values of Cook's distance and the likelihood distance, and Cook's distance was better than the likelihood distance with respect to the number of observations detected.
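The contrast between full-deletion and one-step Cook's distances can be illustrated on a deliberately simple stand-in: exponential failure times with a single rate parameter, where both quantities have closed forms. This is not the thesis's cause-specific Cox model; the data and planted outlier are invented.

```python
import random

def cooks_distance_exponential(times):
    """Full-deletion and one-step Cook's distances for the rate of an
    exponential failure-time model (a one-parameter toy, p = 1)."""
    n = len(times)
    total = sum(times)
    lam = n / total                      # MLE of the rate
    info = n / lam ** 2                  # observed information at the MLE
    full, one_step = [], []
    for t in times:
        lam_del = (n - 1) / (total - t)  # exact leave-one-out MLE
        # one Newton step on the reduced-data likelihood, starting at lam
        score = (n - 1) / lam - (total - t)
        lam_one = lam + score * lam ** 2 / (n - 1)
        full.append(info * (lam - lam_del) ** 2)
        one_step.append(info * (lam - lam_one) ** 2)
    return full, one_step

random.seed(1)
times = [random.expovariate(0.5) for _ in range(50)]
times[0] = 40.0                          # plant one unusually long failure time
full, one = cooks_distance_exponential(times)
print(max(range(50), key=lambda i: full[i]))   # index of most influential case
```

Both versions flag the planted case; the one-step variant avoids refitting from scratch for each deletion, which is the computational point the thesis exploits in the EM setting.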
The Standardized Influence Matrix and Its Applications to Generalized Linear Models
The standardized influence matrix is a generalization of the standardized influence function and of Cook's approach to local influence. It provides a general and unified approach for judging the suitability of statistical inference based on parametric models, characterizing the local influence of data deviations from parametric models on various estimators, including those of generalized linear models. Its use for both robustness measures and diagnostic procedures has been studied. With global robust estimators, diagnostic statistics are proposed and shown to be useful in detecting influential points for linear regression and logistic regression models. The robustness of various estimators is compared via the standardized influence matrix, and a new robust estimator for logistic regression models is presented.
influence.ME: tools for detecting influential data in mixed effects models
influence.ME provides tools for detecting influential data in mixed effects models. The application of these models has become common practice, but the development of diagnostic tools has lagged behind. influence.ME calculates standardized measures of influence on the point estimates of generalized mixed effects models, such as DFBETAS and Cook's distance, as well as percentile change and a test for changing levels of significance, while accounting for the nesting structure of the data. The package and measures of influential data are introduced, a practical example is given, and strategies for dealing with influential data are suggested.
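The package's central move, deleting one higher-level unit at a time rather than one observation, can be sketched outside R. The toy below applies group-wise deletion to a plain least-squares fit, ignoring the random effects that influence.ME handles; groups, data, and the planted anomaly are all invented.

```python
import numpy as np

def group_cooks_distance(X, y, groups):
    """Cook's-distance-style measure per *group*: refit with each group
    deleted and measure the standardized shift in the coefficients.
    A simplification of the influence.ME idea (no random effects)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)         # residual variance estimate
    XtX = X.T @ X
    out = {}
    for g in np.unique(groups):
        keep = groups != g
        beta_g, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        d = beta - beta_g
        out[g] = float(d @ XtX @ d / (p * s2))
    return out

rng = np.random.default_rng(2)
groups = np.repeat(np.arange(5), 20)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 2.0 + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)
y[groups == 3] += 3.0                    # make group 3 anomalous
D = group_cooks_distance(X, y, groups)
print(max(D, key=D.get))                 # group whose deletion shifts the fit most
```

Deleting the shifted group moves the pooled intercept far more than deleting any other group, which is exactly the kind of nesting-aware signal an observation-level diagnostic would dilute.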
Prostate Cancer Relapse Prediction with Biomarkers and Logistic Regression
Prostate cancer is the second most common cancer among men, and risk evaluation prior to treatment can be critical. Risk evaluation of prostate cancer is based on multiple factors, such as clinical assessment, and biomarkers are being studied because they could also be beneficial in the risk evaluation. In this thesis we assess the predictive ability of biomarkers regarding prostate cancer relapse.
The statistical method we use is the logistic regression model, which models the probability of a dichotomous outcome variable; here the outcome indicates whether the observed patient's cancer has relapsed. The four biomarkers AR, ERG, PTEN and Ki67, the most studied biomarkers in prostate cancer tissue, form the explanatory variables.
The biomarkers are usually detected by visual assessment of the expression status or abundance of staining. Artificial intelligence image analysis is not yet in common clinical use, but it is being studied as a potential diagnostic aid. For each biomarker the data contain one visually obtained variable and one obtained by artificial intelligence, and in the analysis we compare the predictive power of these two differently obtained sets of variables. Because of the large number of explanatory variables, we use the algorithm glmulti to select the explanatory variables of the best-fitting model. The predictive power of the models is measured by the receiver operating characteristic (ROC) curve and the area under the curve (AUC).
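The AUC used as the yardstick here has a simple rank interpretation: it is the probability that a randomly chosen relapsed case receives a higher predicted risk than a randomly chosen non-relapsed one. A minimal sketch with made-up scores (not the thesis data):

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic:
    the fraction of (positive, negative) pairs the scores order
    correctly, counting ties as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted relapse probabilities (NOT the thesis data).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0,   1,   0]
print(auc(scores, labels))  # 0.75: 12 of 16 pairs correctly ordered
```

An AUC of 0.5 corresponds to ordering by coin flip and 1.0 to perfect separation, which is why the thesis can compare visually obtained and AI-obtained variable sets on a single scale.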
The data also classify the prostate cancers by whether the cancer was visible in magnetic resonance imaging (MRI). The classification is not exclusive, since a patient could have had both an MRI-visible and an MRI-invisible cancer. The data were split into three datasets: MRI-visible cancers, MRI-invisible cancers, and the two combined. Splitting the data let us further analyze whether MRI-visible cancers differ from MRI-invisible cancers in relapse prediction.
In the analysis we find that none of the variables from the MRI-invisible cancers are significant in predicting prostate cancer relapse, and that none of the variables for the biomarker AR have predictive power. The best biomarker for predicting relapse is Ki67, for which a high staining percentage indicates a greater probability of relapse. The Ki67 variables were significant in multiple models, whereas the biomarkers ERG and PTEN had significant variables in only a few models. The artificial intelligence variables give more accurate predictions than the visually obtained variables, but we could not conclude that they are uniformly better; rather, the visual and artificial intelligence variables complement each other in predicting cancer relapse.
Diagnostics for joint models for longitudinal and survival data
Joint models for longitudinal and survival data are a class of models that jointly analyse an outcome repeatedly observed over time, such as a biomarker, and associated event times. These models are useful in two practical applications: first, focusing on the survival outcome while accounting for time-varying covariates measured with error, and second, focusing on the longitudinal outcome while controlling for informative censoring. Interest in the estimation of these joint models has grown over the past two and a half decades, but minimal effort has been directed towards developing diagnostic assessment tools for them. The available diagnostic tools have mainly been based on separate analyses of residuals for the longitudinal and survival sub-models, which can be sub-optimal. In this thesis we make four contributions to the body of knowledge. We first developed influence diagnostics for the shared-parameter joint model for longitudinal and survival data based on Cook's statistics, evaluated their performance in simulation studies under different scenarios, and illustrated them using a real data set from a multi-centre clinical trial on TB pericarditis (IMPI). The second contribution was to implement a variance shift outlier model (VSOM) in the two-stage joint survival model, by identifying outlying subjects in the longitudinal sub-model and down-weighting them before the second stage of the joint model. The third contribution was to develop influence diagnostics for the multivariate joint model for longitudinal and survival data; here we considered two longitudinal outcomes, square-root CD4 cell count, which was Gaussian, and antiretroviral therapy (ART) uptake, which was binary, extending the univariate case based on Cook's statistics for all parameters.
The fourth contribution was to implement influence diagnostics in joint models for longitudinal and survival data with multiple failure types (competing risks). Using the IMPI data set we considered two competing events in the joint model: death and constrictive pericarditis. In simulation studies and on the IMPI data set the developed diagnostics identified influential subjects as well as observations, with performance above 98% in the simulations. We further conducted sensitivity analyses to check the impact of influential subjects and/or observations on the parameter estimates by excluding them and re-fitting the joint model; we observed only subtle differences overall, which gives confidence that the initial inferences are credible. We illustrated case-deletion diagnostics in the IMPI trial setting, but these diagnostics can also be applied to clinical trials with similar settings. We therefore strongly recommend that analysts conduct influence diagnostics in joint models for longitudinal and survival data to ascertain the reliability of the parameter estimates, and that VSOM be implemented in the longitudinal part of the two-stage joint model before the second stage.
"Influence Sketching": Finding Influential Samples In Large-Scale Regressions
There is an especially strong need in modern large-scale data analysis to
prioritize samples for manual inspection. For example, the inspection could
target important mislabeled samples or key vulnerabilities exploitable by an
adversarial attack. In order to solve the "needle in the haystack" problem of
which samples to inspect, we develop a new scalable version of Cook's distance,
a classical statistical technique for identifying samples which unusually
strongly impact the fit of a regression model (and its downstream predictions).
In order to scale this technique up to very large and high-dimensional
datasets, we introduce a new algorithm which we call "influence sketching."
Influence sketching embeds random projections within the influence computation;
in particular, the influence score is calculated using the randomly projected
pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We
validate that influence sketching can reliably and successfully discover
influential samples by applying the technique to a malware detection dataset of
over 2 million executable files, each represented with almost 100,000 features.
For example, we find that randomly deleting approximately 10% of training
samples reduces predictive accuracy only slightly from 99.47% to 99.45%,
whereas deleting the same number of samples with high influence sketch scores
reduces predictive accuracy all the way down to 90.24%. Moreover, we find that
influential samples are especially likely to be mislabeled. In the case study,
we manually inspect the most influential samples, and find that influence
sketching pointed us to new, previously unidentified pieces of malware.
Generalized Linear Mixed Modeling to Examine the Relationship Between Self Efficacy and Smoking Cessation
The relationship between self efficacy and smoking cessation is unclear. Self efficacy is often viewed as a causal antecedent of future abstinence from smoking, a primary outcome of cessation studies. However, recent research has questioned whether a participant's report of self efficacy is a reflection of previous abstinence success or failure rather than a precursor. To elucidate the dynamic relationship between self efficacy and abstinence status, two generalized linear mixed models were developed. The first examined the ability of self efficacy to predict the next day's abstinence, while the second examined the ability of abstinence to predict self efficacy ratings taken later the same day. All data came from a 2 x 2 crossover trial examining how interest in quitting smoking and monetary reinforcement for abstinence affect the short-term effects of medication on abstinence from smoking. Participants received both medication and placebo conditions in consecutive phases in a counter-balanced order, with an ad lib smoking washout period in between. Abstinence from smoking and self efficacy were recorded daily during both medication phases. Participants were 124 smokers, mean age 31.1 (SE: 1.0), who smoked on average 16.3 (SE: 0.5) cigarettes per day and had a mean FTND score of 4.6 (SE: 0.1); 56.5% were female. Results indicate that self efficacy is both a predictor of, and a reflection of, abstinence status. The models were validated using bootstrapping procedures, which revealed only a small amount of bias. The effects observed in this study may be constrained by the timing of assessments as well as the duration of the cessation attempt. Public Health Importance: Tobacco use accounts for 443,000 deaths each year, so the development of successful clinical assessments to monitor smoking cessation efforts is of the utmost importance. Self efficacy is a measure of confidence in one's ability to quit smoking.
This study shows that the relationship between self efficacy and smoking cessation is bi-directional and may be influenced by the timing of assessments. Understanding this relationship may lead to more successful use of self efficacy as a clinical tool during smoking cessation attempts.
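The bootstrap validation reported above follows a generic recipe that can be sketched in a few lines: refit the statistic on resamples and compare the resample average to the original estimate. This is an illustration of the bias-check idea only, with invented data, not the study's mixed models.

```python
import random
import statistics

def bootstrap_bias(data, statistic, n_boot=2000, seed=0):
    """Estimate the bias of `statistic` as the mean of its bootstrap
    replicates minus its value on the original sample."""
    rng = random.Random(seed)
    theta = statistic(data)
    reps = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]  # sample with replacement
        reps.append(statistic(resample))
    return statistics.fmean(reps) - theta

# Hypothetical daily ratings (NOT the study's data).
data = [12.1, 9.8, 11.4, 10.6, 13.0, 9.5, 10.9, 11.7]
bias = bootstrap_bias(data, statistics.fmean)
print(abs(bias) < 0.1)   # the sample mean is (nearly) unbiased
```

A statistic whose bootstrap replicates drift away from the original estimate is flagged as biased; the study reports that its model estimates showed only small drift of this kind.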
Model Building on Selectivity of Gas Antisolvent Fractionation Method Using the Solubility Parameter
Solubility parameters are widely used in the polymer industry and are often applied in the high-pressure field as well, since they make it possible to combine the effects of all operational parameters on solubility in a single term. We demonstrate a statistical methodology for applying solubility parameters, here the Hansen parameter, to construct a model describing antisolvent-fractionation-based chiral resolution, a complex process comprising a chemical equilibrium, precipitation, and extraction. Experimental results for the resolution and crystallization of ibuprofen with (R)-phenylethylamine, based on diastereomeric salt formation by the gas antisolvent fractionation (GASF) method, were evaluated. Two sets of experiments were performed, one with methanol as the organic solvent in an undesigned experiment and one with ethanol in a designed experiment. The use of a D-optimal design to decrease the necessary number of experiments and to overcome the problem of a constrained design space was demonstrated. Linear models in pressure, temperature, and the solubility parameter were appropriate for describing the selectivity of the GASF optical resolution method in both sets of experiments.
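The linear model structure described, selectivity as a function of pressure, temperature, and the Hansen solubility parameter, can be sketched with ordinary least squares. Every number below (design ranges, coefficients) is invented for illustration and has no connection to the paper's measurements.

```python
import numpy as np

# Hypothetical design: pressure (bar), temperature (deg C), Hansen
# parameter (MPa^0.5). Values and coefficients are invented.
rng = np.random.default_rng(3)
P = rng.uniform(80, 200, size=12)
T = rng.uniform(35, 55, size=12)
delta = rng.uniform(14, 20, size=12)
selectivity = 0.4 + 0.002 * P - 0.005 * T + 0.03 * delta   # noise-free toy

# Ordinary least squares on [1, P, T, delta] recovers the coefficients.
X = np.column_stack([np.ones(12), P, T, delta])
coef, *_ = np.linalg.lstsq(X, selectivity, rcond=None)
print(np.allclose(coef, [0.4, 0.002, -0.005, 0.03], atol=1e-6))
```

In practice a D-optimal design chooses the twelve (P, T, delta) settings to make this fit as informative as possible within a constrained region, which is the role the design plays in the paper.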