A special case of reduced rank models for identification and modelling of time varying effects in survival analysis
Flexible survival models are needed when modelling data from long-term follow-up studies. In many cases, the proportionality assumption imposed by a Cox model will not be valid. Instead, a model that can identify time-varying effects of fixed covariates can be used. Although several approaches deal with this problem, it is not always straightforward to choose which covariates should be modelled with time-varying effects and which should not. At the same time, it is up to the researcher to define appropriate time functions that describe the dynamic pattern of the effects. In this work, we suggest a model that can deal with both fixed and time-varying effects and uses simple hypothesis tests to distinguish which covariates have dynamic effects. The model is an extension of the parsimonious reduced-rank model of rank 1. As such, the number of parameters is kept low, and thus a flexible set of time functions, such as B-splines, can be used. The basic theory is illustrated along with an efficient fitting algorithm. The proposed method is applied to a dataset of breast cancer patients and compared with a multivariate fractional polynomials approach for modelling time-varying effects. Copyright © 2016 John Wiley & Sons, Ltd.
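As a minimal sketch of the general idea, a fixed covariate's effect can be allowed to vary with time in a Cox model using the standard survival package and its built-in veteran data; this illustrates the time-varying-effect setting only, not the reduced-rank machinery of the paper, and the log(t) time function is an arbitrary choice.

```r
## Sketch (not the reduced-rank model itself): testing and modelling a
## time-varying effect with survival::coxph and its tt() mechanism.
library(survival)

## Proportional-hazards fit and a check of the proportionality assumption
fit_ph <- coxph(Surv(time, status) ~ karno + trt, data = veteran)
cox.zph(fit_ph)   # small p-values flag covariates with non-proportional effects

## Let the effect of karno change with log(time); a significant tt() term
## supports a time-varying effect for that covariate.
fit_tv <- coxph(Surv(time, status) ~ karno + tt(karno) + trt,
                data = veteran,
                tt = function(x, t, ...) x * log(t))
summary(fit_tv)
```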
Two dimensional smoothing via an optimised Whittaker smoother
Background: In many applications where moderate to large datasets are used, plotting relationships between pairs of variables can be problematic. A large number of observations will produce a scatter plot that is difficult to investigate due to a high concentration of points on a simple graph. In this article we review the Whittaker smoother for enhancing scatter plots and smoothing data in two dimensions. To optimise the behaviour of the smoother, an algorithm is introduced that is easy to programme and computationally efficient. Results: The methods are illustrated using a simple dataset and simulations in two dimensions. Additionally, a noisy mammogram is analysed. When smoothing scatter plots, the Whittaker smoother is a valuable tool that produces enhanced images that are not distorted by the large number of points. The method is also useful for sharpening patterns or removing noise in distorted images. Conclusion: The Whittaker smoother can be a valuable tool for producing better visualisations of big data or filtering distorted images. The suggested optimisation method is easy to programme and can be applied at low computational cost.
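For intuition, a one-dimensional Whittaker smoother can be written in a few lines of base R. This is the generic penalised-least-squares version with a difference penalty, not the two-dimensional, optimised smoother of the article (the 2D case applies the same penalty along the rows and columns of a grid), and the value of lambda below is arbitrary rather than optimised.

```r
## Minimise sum (y - z)^2 + lambda * sum (D^d z)^2, i.e. solve (I + lambda D'D) z = y.
whittaker <- function(y, lambda = 100, d = 2) {
  n <- length(y)
  D <- diff(diag(n), differences = d)   # d-th order difference matrix
  solve(diag(n) + lambda * crossprod(D), y)
}

## Example: smooth a noisy signal
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
z <- whittaker(y, lambda = 50)
plot(x, y, col = "grey"); lines(x, z, lwd = 2)
```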
An Ensemble of Optimal Trees for Classification and Regression (OTE)
The predictive performance of a random forest ensemble is highly associated with the strength of the individual trees and their diversity. An ensemble of a small number of accurate and diverse trees, provided prediction accuracy is not compromised, will also reduce the computational burden. We investigate the idea of integrating trees that are accurate and diverse. For this purpose, we utilize out-of-bag observations as a validation sample from the training bootstrap samples to choose the best trees based on their individual performance, and then assess these trees for diversity using the Brier score. Starting from the first best tree, a tree is selected for the final ensemble if its addition to the forest reduces the error of the trees that have already been added. A total of 35 benchmark problems on classification and regression are used to assess the performance of the proposed method and compare it with kNN, tree, random forest, node harvest and support vector machine. We compute unexplained variances and classification error rates for all the methods on the corresponding data sets. Our experiments reveal that the size of the ensemble is reduced significantly and better results are obtained in most cases. For further verification, a simulation study is also given in which four tree-style scenarios are considered to generate datasets with several structures.
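A simplified sketch of the selection idea, not the authors' implementation: trees of an ordinary random forest are ranked by their individual out-of-bag error and then added greedily only if the cumulative out-of-bag error of the growing ensemble improves. The diversity assessment via the Brier score is omitted here for brevity.

```r
library(randomForest)

data(iris)
X <- iris[, 1:4]; y <- iris$Species

rf <- randomForest(X, y, ntree = 500, keep.inbag = TRUE)
pred_all <- predict(rf, newdata = X, predict.all = TRUE)$individual  # n x ntree class labels
oob <- rf$inbag == 0                                                 # TRUE where a case is out-of-bag for a tree

## Rank trees by individual out-of-bag error
tree_err <- sapply(seq_len(rf$ntree), function(t)
  mean(pred_all[oob[, t], t] != y[oob[, t]]))
order_trees <- order(tree_err)

## Out-of-bag error of a majority-vote ensemble built from a set of trees
majority_err <- function(trees) {
  votes <- pred_all[, trees, drop = FALSE]
  votes[!oob[, trees, drop = FALSE]] <- NA        # keep out-of-bag votes only
  pred <- apply(votes, 1, function(v) {
    v <- v[!is.na(v)]
    if (!length(v)) return(NA_character_)         # never out-of-bag for these trees
    names(which.max(table(v)))
  })
  mean(pred != y, na.rm = TRUE)
}

## Greedy selection: keep a tree only if it lowers the ensemble error
selected <- order_trees[1]
best <- majority_err(selected)
for (t in order_trees[-1]) {
  e <- majority_err(c(selected, t))
  if (e < best) { selected <- c(selected, t); best <- e }
}
length(selected)   # size of the reduced ensemble
```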
Modeling retail browsing sessions and wearables data
The advent of wearable non-invasive sensors for the consumer market has made it cost-effective to conduct studies that integrate physiological measures such as heart rate into data analysis research. In this paper we investigate the predictive value of heart rate measurements from a commercial wrist-worn wearable device in the context of e-commerce. We look into a dataset comprising browser logs and wearables data from 28 individuals in a field experiment over a period of ten days. We are particularly interested in finding predictors of starting a retail session, such as the heart rate at the beginning of a web browsing session. We describe the preprocessing tasks applied to the dataset, and logistic regression and survival analysis models used to estimate the probability of starting a retail browsing session. Preliminary results show that heart rate has significant predictive value for starting a retail session when individual increases and decreases in heart rate and the time of day are taken into account.
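As an illustration of the kind of logistic model described, a fit on simulated session-level data is sketched below; the data frame and its columns (retail, hr_change, hour) are hypothetical placeholders, not the study's variables.

```r
## Hypothetical session-level data: one row per browsing session
sessions <- data.frame(
  retail    = rbinom(500, 1, 0.3),                 # 1 = retail session started
  hr_change = rnorm(500, 0, 5),                    # deviation from baseline heart rate (bpm)
  hour      = sample(0:23, 500, replace = TRUE)    # hour of day at session start
)

## Logistic regression for the probability of starting a retail session
fit <- glm(retail ~ hr_change + I(hour >= 18), data = sessions, family = binomial)
summary(fit)
exp(coef(fit))   # odds ratios
```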
Modelling long term survival with non-proportional hazards
In this work I consider models for survival data when the assumption of proportionality does not hold. The thesis consists of an Introduction, five papers, a Discussion and an Appendix. The Introduction presents technical information about the Cox model and introduces the ideas behind the extensions of the model proposed later on. In Chapter 2, reduced-rank methods for modelling non-proportional hazards are presented, while Chapter 3 presents an algorithm for estimating Cox models with time-varying effects of the covariates. The next chapter deals with the gamma frailty (Burr) model and discusses alternative models with time-dependent frailties. In Chapter 5, models with time-varying effects of the covariates, frailty models and cure rate models are considered; the usefulness of each of these models is discussed and their results are compared. The sixth chapter of the thesis discusses ways of dealing with overdispersion when using generalized linear models. The Discussion concerns future directions of the research presented in this thesis. Finally, there is an Appendix about the use of coxvc, a package written in R for fitting Cox models with time-varying effects of the covariates. ZonMW project (ZON 912.02.015).
Ensemble of a subset of kNN classifiers
Combining multiple classifiers, known as ensemble methods, can give substantial improvements in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for the classification task, built in two steps. First, we choose classifiers based upon their individual performance using out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets, with their original and some added non-informative features, for the evaluation of our method. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
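A condensed sketch of the two-step construction on a toy dataset, not the authors' ESkNN code: base kNN learners built on random feature subsets are ranked by out-of-sample accuracy on a tuning set, then added to the ensemble only while the majority-vote accuracy on a separate validation set improves.

```r
library(class)   # knn()

set.seed(1)
data(iris)
idx <- sample(nrow(iris))
train <- iris[idx[1:90], ]; tune <- iris[idx[91:120], ]; valid <- iris[idx[121:150], ]

m <- 50
subsets <- replicate(m, sample(1:4, 2), simplify = FALSE)   # random feature subsets

## Step 1: rank individual kNN models by accuracy on the tuning set
acc <- sapply(subsets, function(f)
  mean(knn(train[, f], tune[, f], train$Species, k = 5) == tune$Species))
ranked <- subsets[order(acc, decreasing = TRUE)]

## Step 2: greedy sequential combination assessed on the validation set
vote_acc <- function(models) {
  votes <- sapply(models, function(f)
    as.character(knn(train[, f], valid[, f], train$Species, k = 5)))
  pred <- apply(as.matrix(votes), 1, function(v) names(which.max(table(v))))
  mean(pred == valid$Species)
}
ensemble <- ranked[1]
best <- vote_acc(ensemble)
for (mod in ranked[-1]) {
  a <- vote_acc(c(ensemble, list(mod)))
  if (a > best) { ensemble <- c(ensemble, list(mod)); best <- a }
}
length(ensemble)   # number of kNN models kept
```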
Delays in Leniency Application: Is There Really a Race to the Enforcer's Door?
This paper studies cartels’ strategic behavior in delaying leniency applications, a take-up decision that has been ignored in the previous literature. Using European Commission decisions issued over a 16-year span, we show, contrary to common beliefs and the existing literature, that conspirators often apply for leniency long after a cartel collapses. We estimate hazard and probit models to study the determinants of leniency-application delays. Statistical tests find that delays are symmetrically affected by antitrust policies and macroeconomic fluctuations. Our results shed light on the design of enforcement programs against cartels and other forms of conspiracy.
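For readers unfamiliar with the two model families mentioned, the sketch below fits a hazard (Cox) model for the length of the delay and a probit model for late application on simulated data; all variable names are hypothetical and the paper's cartel data and covariates are not reproduced.

```r
library(survival)

set.seed(1)
cartels <- data.frame(
  delay_days = rexp(200, 1 / 300),     # time from cartel collapse to leniency application
  applied    = rbinom(200, 1, 0.8),    # 1 = application observed (else censored)
  fine_cap   = rbinom(200, 1, 0.5),    # hypothetical policy indicator
  gdp_growth = rnorm(200, 1, 2)        # hypothetical macroeconomic covariate
)

## Hazard model for the length of the delay
coxph(Surv(delay_days, applied) ~ fine_cap + gdp_growth, data = cartels)

## Probit model for whether the application is delayed beyond one year
cartels$late <- as.integer(cartels$delay_days > 365)
glm(late ~ fine_cap + gdp_growth, data = cartels, family = binomial(link = "probit"))
```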
A feature selection method for classification within functional genomics experiments based on the proportional overlapping score
Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and the interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on an overlapping analysis of expression data across classes. This method results in a novel measure, called the proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.
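A much-simplified overlap-based score in the spirit of POS, not the published definition (which uses robust per-gene masks and a more refined treatment of overlapping proportions), can be sketched as follows: class-wise expression intervals are intersected and the proportion of samples falling inside the overlap region is reported, with smaller values suggesting better class separation.

```r
overlap_score <- function(x, cls) {
  r <- tapply(x, cls, range)                       # per-class (min, max)
  lo <- max(sapply(r, `[`, 1)); hi <- min(sapply(r, `[`, 2))
  if (lo >= hi) return(0)                          # disjoint intervals: perfect separation
  mean(x >= lo & x <= hi)                          # proportion of overlapping samples
}

## Example with a hypothetical expression matrix (genes in columns)
set.seed(1)
expr <- cbind(gene1 = c(rnorm(20, 0), rnorm(20, 3)),   # well separated between classes
              gene2 = rnorm(40))                       # uninformative
cls  <- rep(c("A", "B"), each = 20)
scores <- apply(expr, 2, overlap_score, cls = cls)
sort(scores)   # candidate genes ranked, smallest overlap first
```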
A review of spline function procedures in R
Background: With progress on both the theoretical and the computational fronts, the use of spline modelling has become an established tool in statistical regression analysis. An important issue in spline modelling is the availability of user-friendly, well documented software packages. Following the idea of the STRengthening Analytical Thinking for Observational Studies initiative to provide users with guidance documents on the application of statistical methods in observational research, the aim of this article is to provide an overview of the most widely used spline-based techniques and their implementation in R. Methods: In this work, we focus on the R Language for Statistical Computing, which has become a hugely popular statistics software. We identified a set of packages that include functions for spline modelling within a regression framework. Using simulated and real data, we provide an introduction to spline modelling and an overview of the most popular spline functions. Results: We present a series of simple scenarios of univariate data, where different basis functions are used to identify the correct functional form of an independent variable. Even in simple data, using routines from different packages would lead to different results. Conclusions: This work illustrates challenges that an analyst faces when working with data. Most differences can be attributed to the choice of hyper-parameters rather than the basis used. In fact, an experienced user will know how to obtain a reasonable outcome, regardless of the type of spline used. However, many analysts do not have sufficient knowledge to use these powerful tools adequately and will need more guidance.
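A brief illustration of the kind of comparison the review discusses: the same noisy univariate relationship fitted with B-spline, natural-spline and penalised-spline routines. The packages splines and mgcv are used here as examples, and the hyper-parameters (df = 5, mgcv's defaults) are chosen for illustration only.

```r
library(splines)
library(mgcv)

set.seed(1)
x <- runif(300); y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)

fit_bs  <- lm(y ~ bs(x, df = 5))        # cubic B-spline basis
fit_ns  <- lm(y ~ ns(x, df = 5))        # natural cubic splines
fit_gam <- gam(y ~ s(x))                # penalised thin-plate spline (mgcv)

xs <- seq(0, 1, length.out = 200)
plot(x, y, col = "grey")
lines(xs, predict(fit_bs,  data.frame(x = xs)), col = "red")
lines(xs, predict(fit_ns,  data.frame(x = xs)), col = "blue")
lines(xs, predict(fit_gam, data.frame(x = xs)), col = "darkgreen")
```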
Atrial fibrillation in embolic stroke of undetermined source: role of advanced imaging of left atrial function
© The Author(s) 2023. Published by Oxford University Press on behalf of the European Society of Cardiology. AIMS: Atrial fibrillation (AF) is detected in over 30% of patients following an embolic stroke of undetermined source (ESUS) when monitored with an implantable loop recorder (ILR). Identifying AF in ESUS survivors has significant therapeutic implications, and assessment of AF risk is essential to guide screening with long-term monitoring. The present study aimed to establish the role of left atrial (LA) function in subsequent AF identification and to develop a risk model for AF in ESUS. METHODS AND RESULTS: We conducted a single-centre retrospective case-control study including all patients with ESUS referred to our institution for ILR implantation from December 2009 to September 2019. We recorded clinical variables at baseline and analysed transthoracic echocardiograms in sinus rhythm. Univariate and multivariable analyses were performed to identify variables associated with AF. Lasso regression analysis was used to develop a risk prediction model for AF. The risk model was internally validated using bootstrapping. Three hundred and twenty-three patients with ESUS underwent ILR implantation. In the ESUS population, 293 had a stroke, whereas 30 had suffered a transient ischaemic attack as adjudicated by a senior stroke physician. Atrial fibrillation of any duration was detected in 47.1%. The mean follow-up was 710 days. Following lasso regression with backwards elimination, we combined increasing lateral PA (the time interval from the beginning of the P wave on the surface electrocardiogram to the beginning of the A′ wave on pulsed-wave tissue Doppler of the lateral mitral annulus) [odds ratio (OR) 1.011], increasing Age (OR 1.035), higher Diastolic blood pressure (OR 1.027), and abnormal LA reservoir Strain (OR 0.973) into a new PADS score. The probability of identifying AF can be estimated using the resulting formula. Model discrimination was good [area under the curve (AUC) 0.72]. The PADS score was internally validated using bootstrapping with 1000 samples of 150 patients, showing consistent results with an AUC of 0.73. CONCLUSION: The novel PADS score can identify the risk of AF on prolonged monitoring with ILR following ESUS and should be considered a dedicated risk stratification tool for decision-making regarding the screening strategy for AF in stroke.

One-third of patients with a type of stroke called embolic stroke of undetermined source (ESUS) also have a heart condition called atrial fibrillation (AF), which increases their risk of having another stroke. However, we do not know why some patients with ESUS develop AF. To figure this out, we studied 323 patients with ESUS and used a special device, an implantable loop recorder, to monitor their heart rhythm continuously for up to 3 years. We also looked at their medical history, performed a heart ultrasound, and identified some factors that increase the risk of identifying AF in the future. Factors associated with future AF include older age, higher diastolic blood pressure, and problems with the co-ordination and function of the upper left chamber of the heart, called the left atrium. Based on these factors, we created a new scoring system, the PADS score, that can identify patients at higher risk of developing AF better than the current scoring systems. This can potentially help doctors provide more targeted and effective treatment to these patients, ultimately aiming to reduce their risk of having another stroke.
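A hedged sketch of the modelling workflow the abstract describes (lasso selection of a logistic risk model, followed by bootstrap assessment of discrimination). The data are simulated; the slopes loosely follow the direction of the reported odds ratios, but the intercept and distributions are invented, so this is not the published PADS formula or cohort.

```r
library(glmnet)
library(pROC)

set.seed(1)
n <- 300
lateral_pa <- rnorm(n, 130, 25)    # lateral PA interval (ms)
age        <- rnorm(n, 70, 10)
dbp        <- rnorm(n, 80, 10)     # diastolic blood pressure (mmHg)
la_strain  <- rnorm(n, 30, 8)      # LA reservoir strain (%)
## hypothetical association; intercept chosen only to give a plausible AF prevalence
p_af <- plogis(-5.4 + 0.011 * lateral_pa + 0.035 * age + 0.027 * dbp - 0.027 * la_strain)
af   <- rbinom(n, 1, p_af)         # AF detected on the loop recorder
x <- cbind(lateral_pa, age, dbp, la_strain)

cvfit <- cv.glmnet(x, af, family = "binomial")   # lasso with CV-chosen penalty
coef(cvfit, s = "lambda.min")                    # retained predictors

## Apparent discrimination and a crude bootstrap of the AUC
## (not a full optimism-corrected validation)
pr <- as.numeric(predict(cvfit, newx = x, s = "lambda.min", type = "response"))
auc(roc(af, pr))
boot_auc <- replicate(200, { i <- sample(n, replace = TRUE); auc(roc(af[i], pr[i])) })
mean(boot_auc)
```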
