6,746 research outputs found

    Evolution of statistical analysis in empirical software engineering research: Current state and steps forward

    Software engineering research is evolving and papers are increasingly based on empirical data from a multitude of sources, using statistical tests to determine if and to what degree empirical evidence supports their hypotheses. To investigate the practices and trends of statistical analysis in empirical software engineering (ESE), this paper presents a review of a large pool of papers from top-ranked software engineering journals. First, we manually reviewed 161 papers; in the second phase of our method, we conducted a more extensive semi-automatic classification of 5,196 papers spanning the years 2001-2015. Results from both review steps were used to: i) identify and analyze the predominant practices in ESE (e.g., using the t-test or ANOVA), as well as relevant trends in the usage of specific statistical methods (e.g., nonparametric tests and effect size measures), and ii) develop a conceptual model for a statistical analysis workflow, with suggestions on how to apply different statistical methods as well as guidelines to avoid pitfalls. Lastly, we confirm existing claims that current ESE practices lack a standard for reporting the practical significance of results. We illustrate how practical significance can be discussed in terms of both the statistical analysis and the practitioner's context.
    Comment: journal submission, 34 pages, 8 figures
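    As a concrete illustration of the reporting pattern the review advocates (a nonparametric test paired with an effect size, so statistical and practical significance can both be discussed), here is a minimal Python sketch; the two "technique" samples are invented for illustration and do not come from the paper:

        # Nonparametric comparison of two hypothetical techniques plus a
        # Vargha-Delaney A12 effect size. Data are made up for illustration.
        import numpy as np
        from scipy.stats import mannwhitneyu

        rng = np.random.default_rng(0)
        technique_a = rng.normal(10.0, 2.0, size=40)   # e.g. defect counts
        technique_b = rng.normal(8.5, 2.0, size=40)

        stat, p = mannwhitneyu(technique_a, technique_b, alternative="two-sided")

        # A12: probability that a random draw from A exceeds one from B,
        # counting ties as half.
        greater = (technique_a[:, None] > technique_b[None, :]).mean()
        ties = (technique_a[:, None] == technique_b[None, :]).mean()
        a12 = greater + 0.5 * ties

        print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}, A12 = {a12:.2f}")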

    Empirical Likelihood for Regression Discontinuity Design

    This paper proposes empirical likelihood based inference methods for causal effects identified from regression discontinuity designs. We consider both the sharp and fuzzy regression discontinuity designs and treat the regression functions as nonparametric. The proposed inference procedures do not require asymptotic variance estimation and the confidence sets have natural shapes, unlike the conventional Wald-type method. These features are illustrated by simulations and an empirical example which evaluates the effect of class size on pupils' scholastic achievements. Bandwidth selection methods, higher-order properties, and extensions to incorporate additional covariates and parametric functional forms are also discussed.
    Keywords: empirical likelihood, nonparametric methods, regression discontinuity design, treatment effect
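    The paper's empirical likelihood procedure is not reproduced here, but the conventional Wald-type baseline it is contrasted with can be sketched: a local linear fit with a triangular kernel on each side of the cutoff, with simulated data, bandwidth, and jump size all chosen purely for illustration:

        # Conventional sharp-RD point estimate via one-sided local linear fits.
        import numpy as np

        def local_linear_at_cutoff(x, y, cutoff, h, side):
            """Triangular-kernel local linear fit; intercept at the cutoff."""
            mask = (x >= cutoff) if side == "right" else (x < cutoff)
            xc, yc = x[mask] - cutoff, y[mask]
            w = np.maximum(1 - np.abs(xc) / h, 0.0)
            X = np.column_stack([np.ones_like(xc), xc])
            XtW = X.T * w
            beta = np.linalg.solve(XtW @ X, XtW @ yc)
            return beta[0]

        rng = np.random.default_rng(1)
        x = rng.uniform(-1, 1, 500)          # running variable
        tau = 0.4                            # true jump at the cutoff
        y = 1 + 0.8 * x + tau * (x >= 0) + rng.normal(0, 0.3, 500)

        h = 0.25                             # illustrative bandwidth
        effect = (local_linear_at_cutoff(x, y, 0.0, h, "right")
                  - local_linear_at_cutoff(x, y, 0.0, h, "left"))
        print(f"estimated discontinuity: {effect:.3f}")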

    Short-term load forecasting based on a semi-parametric additive model

    Short-term load forecasting is an essential instrument in power system planning, operation and control. Many operating decisions are based on load forecasts, such as dispatch scheduling of generating capacity, reliability analysis, and maintenance planning for the generators. Overestimation of electricity demand will cause a conservative operation, which leads to the start-up of too many units or excessive energy purchases, thereby supplying an unnecessary level of reserve. Conversely, underestimation may result in a risky operation, with insufficient spinning reserve, leaving the system vulnerable to disturbances. In this paper, semi-parametric additive models are proposed to estimate the relationships between demand and the driver variables. Specifically, the inputs for these models are calendar variables, lagged actual demand observations, and historical and forecast temperature traces for one or more sites in the target power system. In addition to point forecasts, prediction intervals are also estimated using a modified bootstrap method suitable for the complex seasonality seen in electricity demand data. The proposed methodology has been used to forecast the half-hourly electricity demand for up to seven days ahead for power systems in the Australian National Electricity Market. The performance of the methodology is validated via out-of-sample experiments with real data from the power system, as well as through on-site implementation by the system operator.
    Keywords: short-term load forecasting, additive model, time series, forecast distribution
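    A heavily simplified flavor of this kind of additive specification can be sketched in Python: calendar dummies, lagged demand, and a nonlinear temperature effect (here a crude piecewise-linear hinge basis rather than the authors' smoothers), fit jointly by least squares. The data, knots, and coefficients are invented for illustration:

        # Toy additive demand model: calendar + lag + temperature hinges.
        import numpy as np

        rng = np.random.default_rng(2)
        n = 48 * 60                               # ~60 days of half-hourly data
        tod = np.tile(np.arange(48), n // 48)     # time-of-day index
        temp = 18 + 8 * np.sin(np.arange(n) / 48 * 2 * np.pi) + rng.normal(0, 1, n)
        demand = (50 + 5 * np.sin(tod / 48 * 2 * np.pi)
                  + 0.3 * np.maximum(temp - 22, 0) ** 2 + rng.normal(0, 1, n))

        lag1 = np.roll(demand, 1)                 # previous half-hour's demand
        knots = [15, 20, 25]                      # illustrative temperature knots
        hinges = [np.maximum(temp - k, 0) for k in knots]
        tod_dummies = (tod[:, None] == np.arange(48)).astype(float)[:, 1:]

        X = np.column_stack([np.ones(n), lag1, temp, *hinges, tod_dummies])[1:]
        y = demand[1:]                            # drop first row (invalid lag)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        print("in-sample RMSE:", np.sqrt(np.mean((X @ beta - y) ** 2)).round(3))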

    Regression Discontinuity Designs Using Covariates

    We study regression discontinuity designs when covariates are included in the estimation. We examine local polynomial estimators that include discrete or continuous covariates in an additive separable way, but without imposing any parametric restrictions on the underlying population regression functions. We recommend a covariate-adjustment approach that retains consistency under intuitive conditions, and characterize the potential for estimation and inference improvements. We also present new covariate-adjusted mean squared error expansions and robust bias-corrected inference procedures, with heteroskedasticity-consistent and cluster-robust standard errors. An empirical illustration and an extensive simulation study are presented. All methods are implemented in R and Stata software packages.
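    The authors' R/Stata packages implement the full method; purely as a bare-bones illustration of the additive-separable covariate adjustment described above, one can run a single pooled local linear regression with a treatment dummy, its interaction with the running variable, and the covariate entering linearly (simulated data and bandwidth are assumptions):

        # Covariate-adjusted sharp-RD sketch via pooled weighted least squares.
        import numpy as np

        rng = np.random.default_rng(3)
        n = 800
        x = rng.uniform(-1, 1, n)                 # running variable
        z = rng.normal(0, 1, n)                   # pre-treatment covariate
        d = (x >= 0).astype(float)                # sharp treatment assignment
        y = 1 + 0.8 * x + 0.5 * d + 0.6 * z + rng.normal(0, 0.3, n)

        h = 0.3
        w = np.maximum(1 - np.abs(x) / h, 0.0)    # triangular kernel weights
        X = np.column_stack([np.ones(n), x, d, d * x, z])
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        print(f"covariate-adjusted RD estimate: {beta[2]:.3f}")  # coeff. on d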

    Hybrid model using logit and nonparametric methods for predicting micro-entity failure

    Following calls from the literature on bankruptcy, a parsimonious hybrid bankruptcy model is developed in this paper by combining parametric and non-parametric approaches. To this end, the variables with the highest predictive power to detect bankruptcy are selected using logistic regression (LR). Subsequently, alternative non-parametric methods (Multilayer Perceptron, Rough Set, and Classification-Regression Trees) are applied, in turn, to firms classified as either “bankrupt” or “not bankrupt”. Our findings show that hybrid models, particularly those combining LR and Multilayer Perceptron, offer better accuracy and interpretability, and converge faster than each method implemented in isolation. Moreover, the authors demonstrate that the introduction of non-financial and macroeconomic variables complements financial ratios for bankruptcy prediction.
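    A hedged sketch of the hybrid idea, using scikit-learn rather than the paper's exact pipeline: logistic regression screens the predictors, then a multilayer perceptron is trained on the retained ones. The synthetic features stand in for the financial, non-financial, and macroeconomic variables:

        # LR-based feature selection feeding an MLP classifier.
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectFromModel
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = make_classification(n_samples=2000, n_features=30,
                                   n_informative=8, random_state=0)  # 1 = "bankrupt"
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        selector = SelectFromModel(
            LogisticRegression(penalty="l1", solver="liblinear", C=0.5))
        model = make_pipeline(StandardScaler(), selector,
                              MLPClassifier(hidden_layer_sizes=(16,),
                                            max_iter=500, random_state=0))
        model.fit(X_tr, y_tr)
        print("selected features:", selector.get_support().sum(),
              "| test accuracy:", round(model.score(X_te, y_te), 3))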

    Recent advances in directional statistics

    Mainstream statistical methodology is generally applicable to data observed in Euclidean space. There are, however, numerous contexts of considerable scientific interest in which the natural supports for the data under consideration are Riemannian manifolds like the unit circle, torus, sphere and their extensions. Typically, such data can be represented using one or more directions, and directional statistics is the branch of statistics that deals with their analysis. In this paper we provide a review of the many recent developments in the field since the publication of Mardia and Jupp (1999), still the most comprehensive text on directional statistics. Many of those developments have been stimulated by interesting applications in fields as diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics, image analysis, text mining, environmetrics, and machine learning. We begin by considering developments for the exploratory analysis of directional data before progressing to distributional models, general approaches to inference, hypothesis testing, regression, nonparametric curve estimation, methods for dimension reduction, classification and clustering, and the modelling of time series, spatial and spatio-temporal data. An overview of currently available software for analysing directional data is also provided, and potential future developments are discussed.
    Comment: 61 pages
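    Some elementary directional-statistics computations of the kind surveyed above can be done in a few lines of Python: the circular mean direction and mean resultant length of a sample of angles, plus the standard moment-based approximation of the von Mises concentration parameter (the angles are simulated; this is an illustration, not code from the paper):

        # Circular summary statistics and a von Mises concentration estimate.
        import numpy as np

        rng = np.random.default_rng(4)
        theta = rng.vonmises(mu=np.pi / 3, kappa=4.0, size=500)  # radians

        C, S = np.cos(theta).mean(), np.sin(theta).mean()
        mean_dir = np.arctan2(S, C)          # circular mean direction
        R = np.hypot(C, S)                   # mean resultant length in [0, 1]

        # Standard piecewise approximation for kappa given R (as in common
        # directional-statistics texts).
        if R < 0.53:
            kappa = 2 * R + R**3 + 5 * R**5 / 6
        elif R < 0.85:
            kappa = -0.4 + 1.39 * R + 0.43 / (1 - R)
        else:
            kappa = 1 / (R**3 - 4 * R**2 + 3 * R)

        print(f"mean direction = {mean_dir:.3f} rad, R = {R:.3f}, "
              f"kappa ~ {kappa:.2f}")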

    Comparison of different classification algorithms for fault detection and fault isolation in complex systems

    Due to the lack of sufficient results in the literature, feature extraction and classification for hydraulic systems appear to be somewhat challenging. This paper compares the performance of three classifiers (namely linear support vector machine (SVM), distance-weighted k-nearest neighbor (WKNN), and decision tree (DT)) using data from optimized and non-optimized sensor set solutions. The algorithms are trained with known data and then tested with unknown data for different scenarios characterizing faults with different degrees of severity. This investigation is based solely on a data-driven approach and relies on data sets taken from experiments on the fuel system. The system used throughout this study is a typical fuel delivery system consisting of standard components such as a filter, pump, valve, nozzle, pipes, and two tanks. Running representative tests on a fuel system is problematic because of the time, cost, and reproduction constraints involved in capturing any significant degradation: simulating significant degradation requires running the system over a considerable period, which cannot be reproduced quickly and is costly.
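    A minimal version of the comparison described above can be sketched with scikit-learn: a linear SVM, a distance-weighted kNN, and a decision tree scored on a held-out split. Synthetic multi-class features stand in for the fuel-system sensor data and fault classes:

        # Compare three classifiers on a synthetic fault-classification task.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=1500, n_features=12, n_classes=4,
                                   n_informative=6, random_state=0)  # 4 fault classes
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                                  random_state=0)

        classifiers = {
            "linear SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
            "weighted kNN": make_pipeline(StandardScaler(),
                                          KNeighborsClassifier(weights="distance")),
            "decision tree": DecisionTreeClassifier(random_state=0),
        }
        for name, clf in classifiers.items():
            clf.fit(X_tr, y_tr)
            print(f"{name}: test accuracy = {clf.score(X_te, y_te):.3f}")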