36 research outputs found
A review of spline function procedures in R
Background: With progress on both the theoretical and the computational fronts, the use of spline modelling has become an established tool in statistical regression analysis. An important issue in spline modelling is the availability of user-friendly, well-documented software packages. Following the idea of the STRengthening Analytical Thinking for Observational Studies initiative to provide users with guidance documents on the application of statistical methods in observational research, the aim of this article is to provide an overview of the most widely used spline-based techniques and their implementation in R. Methods: In this work, we focus on the R Language for Statistical Computing, which has become a hugely popular statistics software. We identified a set of packages that include functions for spline modelling within a regression framework. Using simulated and real data, we provide an introduction to spline modelling and an overview of the most popular spline functions. Results: We present a series of simple scenarios of univariate data, where different basis functions are used to identify the correct functional form of an independent variable. Even in simple data, using routines from different packages would lead to different results. Conclusions: This work illustrates the challenges that an analyst faces when working with data. Most differences can be attributed to the choice of hyper-parameters rather than the basis used. In fact, an experienced user will know how to obtain a reasonable outcome, regardless of the type of spline used. However, many analysts do not have sufficient knowledge to use these powerful tools adequately and will need more guidance
Modeling retail browsing sessions and wearables data
The advent of wearable non-invasive sensors for the consumer market has made it cost-effective to conduct studies that integrate physiological measures such as heart rate into data analysis research. In this paper we investigate the predictive value of heart rate measurements from a commercial wrist wearable device in the context of e-commerce. We look into a dataset comprised of browser-logs and wearables data from 28 individuals in a field experiment over a period of ten days. We are particularly interested in finding predictors for starting a retail session, such as the heart rate at the beginning of a web browsing session. We describe preprocessing tasks applied to the dataset and logistic regression and survival analysis models to retrieve the probability of starting a retail browsing session. Preliminary results show that heart rate has a significant predictive value on starting a retail session if we consider increased and decreased heart rate individual values and the time of day
Ensemble of Optimal Trees, Random Forest and Random Projection Ensemble Classification
The predictive performance of a random forest ensemble is highly associated with the strength of individual trees and their diversity. An ensemble of a small number of accurate and diverse trees will also reduce the computational burden, provided prediction accuracy is not compromised. We investigate the idea of integrating trees that are accurate and diverse. For this purpose, we utilize out-of-bag observations as a validation sample from the training bootstrap samples to choose the best trees based on their individual performance, and then assess these trees for diversity using the Brier score on an independent validation sample. Starting from the first best tree, a tree is selected for the final ensemble if its addition to the forest reduces the error of the trees that have already been added. Unlike random projection ensemble classification, our approach does not use an implicit dimension reduction for each tree. A total of 35 benchmark problems on classification and regression are used to assess the performance of the proposed method and compare it with random forest, random projection ensemble, node harvest, support vector machine, kNN and classification and regression tree (CART). We compute unexplained variances or classification error rates for all the methods on the corresponding data sets. Our experiments reveal that the size of the ensemble is reduced significantly and better results are obtained in most of the cases. Results of a simulation study are also given where four tree style scenarios are considered to generate data sets with several structures
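The selection step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes binary classification, assumes the trees have already been ranked by their individual out-of-bag performance, and assumes the ensemble prediction is the plain average of the members' predicted probabilities. The names `brier_score`, `ensemble_probs` and `greedy_select` are hypothetical.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def ensemble_probs(members: List[Sequence[float]]) -> List[float]:
    """Average the members' predicted probabilities, observation by observation."""
    return [sum(column) / len(members) for column in zip(*members)]


def greedy_select(
    ranked_tree_probs: List[Sequence[float]], labels: Sequence[int]
) -> Tuple[List[int], float]:
    """Forward selection over trees ranked from best to worst (assumed ranking).

    Each entry of ranked_tree_probs holds one tree's predicted probabilities
    on the validation sample. A tree is kept only if adding it lowers the
    Brier score of the ensemble built so far.
    """
    selected = [0]  # always start from the best-ranked tree
    best = brier_score(ranked_tree_probs[0], labels)
    for i in range(1, len(ranked_tree_probs)):
        candidate = [ranked_tree_probs[j] for j in selected] + [ranked_tree_probs[i]]
        score = brier_score(ensemble_probs(candidate), labels)
        if score < best:  # keep the tree only when the ensemble improves
            selected.append(i)
            best = score
    return selected, best


if __name__ == "__main__":
    # Toy validation sample with made-up probabilities for three ranked trees.
    labels = [1, 0, 1, 0]
    ranked = [
        [0.8, 0.2, 0.6, 0.4],
        [0.6, 0.4, 0.9, 0.1],
        [0.3, 0.7, 0.4, 0.6],  # a poor tree that raises the ensemble's Brier score
    ]
    print(greedy_select(ranked, labels))
```

In this toy run the second tree is kept because averaging it in lowers the Brier score, while the third is rejected. A regression variant would swap the Brier score for a squared-error criterion, which is again an assumption rather than something the abstract specifies.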
Long-term safety of paclitaxel drug-coated balloon-only angioplasty for de novo coronary artery disease: the SPARTAN DCB study
Objectives: We aimed to investigate long-term survival after paclitaxel drug-coated balloon (DCB) angioplasty for percutaneous coronary intervention (PCI). Background: Safety concerns have recently been raised over the use of paclitaxel devices for peripheral artery disease, following a meta-analysis suggesting increased late mortality. With regard to DCB angioplasty for coronary artery intervention, however, there are limited data to date regarding possible late mortality relating to paclitaxel. Methods: We compared all-cause mortality of patients treated with paclitaxel DCB to those with non-paclitaxel second-generation drug-eluting stents (DES) for stable, de novo coronary artery disease from 1st January 2011 to 31st December 2018. To have homogeneous groups allowing data on safety to be interpreted accurately, we excluded patients with previous PCI and patients treated with a combination of both DCB and DES in subsequent PCIs. Data were analysed with Kaplan–Meier curves and Cox regression statistical models. Results: We present 1517 patients; 429 treated with paclitaxel DCB and 1088 treated with DES. On univariate analysis, age, hypercholesterolaemia, hypertension, peripheral vascular disease, prior myocardial infarction, heart failure, smoking, atrial fibrillation, decreasing estimated glomerular filtration rate (eGFR) [and renal failure (eGFR < 45)] were associated with worse survival. DCB intervention showed a non-significant trend towards better prognosis compared to DES (p = 0.08). On multivariable analysis, age, decreasing eGFR and smoking were associated with worse prognosis. Conclusion: We found no evidence of late mortality associated with DCB angioplasty compared with non-paclitaxel second-generation DES in up to 5 years of follow-up. DCB is a safe option for the treatment of de novo coronary artery disease
Relationship of cell proliferation (Ki-67) to (99m)Tc-(V)DMSA uptake in breast cancer
INTRODUCTION: The aim of the present study was to identify the relationships between the uptake of radiotracers – namely pentavalent dimercaptosuccinic acid [(V)DMSA] and sestamibi (MIBI) – and the following parameters in primary breast cancer: steroid receptor concentrations (i.e. estrogen receptor [ER] and progesterone receptor [PR]), Ki-67 expression, tumor size, tumor grade, age, and levels of expression of p53 and c-erbB-2. In addition, by multivariate regression analysis, we further isolated those factors with independent associations with (V)DMSA and/or MIBI uptake in primary breast cancer. METHODS: Thirty-four patients with histologically confirmed breast carcinoma underwent preoperative scintimammography with technetium-99m ((99m)Tc)-(V)DMSA and/or (99m)Tc-MIBI in consecutive sessions 10 and 60 min after administration of 925–1110 MBq of each radiotracer. The tumor-to-background ratio was calculated and correlated with the presence of ER, PR, Ki-67, tumor size, tumor grade, p53, and c-erbB-2. ER, PR, p53, and c-erbB-2 were determined immunohistochemically. The analysis included tumor-to-background ratio of (V)DMSA and MIBI uptake as dependent and all of the other parameters as independent variables. RESULTS: Correlation was positive between Ki-67 and (V)DMSA (r = 0.37 at 10 min, P = 0.038; r = 0.42 at 60 min, P = 0.018) and inverse between PR and (V)DMSA uptake (r = -0.46 at 10 min, P = 0.010; r = -0.51 at 60 min, P = 0.003). Multivariate regression analysis demonstrated a positive correlation between Ki-67 and (V)DMSA at 60 min (P = 0.045). Ki-67 was not significantly correlated with MIBI uptake, whereas tumor size was positively correlated with MIBI uptake at 60 min both in univariate (r = 0.45, P = 0.027) and multivariate analysis (P = 0.024). Negative correlations were observed between (V)DMSA uptake and ER, as well as between ER/PR and MIBI uptake, but these were not significant. 
CONCLUSION: Ki-67 appears to represent the major independent factor affecting (V)DMSA uptake in breast cancer. Tumor size was the only independent parameter influencing MIBI uptake in breast cancer. (V)DMSA appears to have an advantage over MIBI in that it can be used to visualize tumors with intense proliferative activity, and thus it can identify those tumors that are more aggressive
Atrial fibrillation in embolic stroke of undetermined source: Role of advanced imaging of left atrial function
Background: Atrial fibrillation (AF) is detected in over 30% of patients following an embolic stroke of undetermined source (ESUS) when monitored with an implantable loop recorder (ILR). Identifying AF in ESUS survivors has significant therapeutic implications and AF risk is essential to guide screening with long-term monitoring. The present study aimed to establish the role of Left Atrial (LA) function in subsequent AF identification and develop a risk model for AF in ESUS. Methods: We conducted a single-centre retrospective case-control study including all patients with ESUS referred to our institution for ILR implantation from December 2009 to September 2019. We recorded clinical variables at baseline and analyzed transthoracic echocardiograms in sinus rhythm. Univariate and multivariable analyses were performed to inform variables associated with AF. Lasso regression analysis was used to develop a risk prediction model for AF. The risk model was internally validated using bootstrapping. Results: Three hundred and twenty-three patients with ESUS underwent ILR implantation. In the ESUS population, 293 had a stroke, whereas 30 had suffered a TIA as adjudicated by a senior stroke physician. AF of any duration was detected in 47.1%. Mean follow-up was 710 days. Following lasso regression with backward elimination, we combined increasing lateral PA (the time interval from the beginning of p wave on surface electrocardiogram to the beginning of A’ wave on pulsed wave tissue Doppler of the lateral mitral annulus) (OR 1.011), increasing Age (OR 1.035), higher diastolic blood pressure (DBP) (OR 1.027) and abnormal LA reservoir Strain (OR 0.973) into a new PADS score. The probability of identifying AF can be estimated using the formula: Model discrimination was good (AUC 0.72). The PADS score was internally validated using bootstrapping with 1000 samples of 150 patients showing consistent results with an AUC of 0.73. 
Conclusions: The novel PADS score can identify the risk of AF on prolonged monitoring with ILR following ESUS and should be considered a dedicated risk-stratification tool for decision-making regarding the screening strategy for AF in stroke
Delays in Leniency Application: Is There Really a Race to the Enforcer's Door?
This paper studies cartels’ strategic behavior in delaying leniency applications, a take-up decision that has been ignored in the previous literature. Using European Commission decisions issued over a 16-year span, we show, contrary to common beliefs and the existing literature, that conspirators often apply for leniency long after a cartel collapses. We estimate hazard and probit models to study the determinants of leniency-application delays. Statistical tests find that delays are symmetrically affected by antitrust policies and macroeconomic fluctuations. Our results shed light on the design of enforcement programs against cartels and other forms of conspiracy
Ensemble of a subset of kNN classifiers
Combining multiple classifiers, known as ensemble methods, can give substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. Firstly, we choose classifiers based upon their individual performance using the out-of-sample accuracy. The selected classifiers are then combined sequentially starting from the best model and assessed for collective performance on a validation data set. We use benchmark data sets with their original and some added non-informative features for the evaluation of our method. The results are compared with usual kNN, bagged kNN, random kNN, multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines
A feature selection method for classification within functional genomics experiments based on the proportional overlapping score
Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurements of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes