82 research outputs found

    Measuring the stability of histogram appearance when the anchor position is changed

    Although the histogram is the most widely used density estimator, it is well-known that the appearance of a constructed histogram for a given bin width can change markedly for different choices of anchor position. In this paper we construct a stability index G that assesses the potential changes in the appearance of histograms for a given data set and bin width as the anchor position changes. If a particular bin width choice leads to an unstable appearance, the arbitrary choice of any one anchor position is dangerous, and a different bin width should be considered. The index is based on the statistical roughness of the histogram estimate. We show via Monte Carlo simulation that densities with more structure are more likely to lead to histograms with unstable appearance. In addition, ignoring the precision to which the data values are provided when choosing the bin width leads to instability. We provide several real data examples to illustrate the properties of G. Applications to other binned density estimators are also discussed.
    Keywords: bin width, frequency polygon, Gini index, linear binning, Lorenz curve, Monte Carlo simulation
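The anchor-position effect described in this abstract can be illustrated with a minimal sketch. It is not the paper's index G, whose definition is not given here; it only shows that the same data and bin width can produce differently shaped histograms when the anchor moves. The function name `histogram_counts` and the sample data are hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def histogram_counts(data, bin_width, anchor):
    """Return bin counts for bins [anchor + k*h, anchor + (k+1)*h).

    Hypothetical helper for illustration only. Counts are listed from the
    lowest occupied bin to the highest, including empty bins in between.
    """
    # Map each observation to the integer index of the bin that contains it.
    counts = Counter(math.floor((x - anchor) / bin_width) for x in data)
    lowest, highest = min(counts), max(counts)
    return [counts.get(k, 0) for k in range(lowest, highest + 1)]


# Made-up sample values, not taken from the paper.
data = [1.0, 1.5, 1.9, 2.1, 2.5, 3.2, 3.8, 4.1]

# Same data, same bin width, two different anchor positions.
counts_anchor_0 = histogram_counts(data, bin_width=1.0, anchor=0.0)
counts_anchor_half = histogram_counts(data, bin_width=1.0, anchor=0.5)

print(counts_anchor_0)     # [3, 2, 2, 1]
print(counts_anchor_half)  # [1, 3, 2, 2]
```

With the anchor at 0.0 the counts fall steadily from the first bin, while with the anchor at 0.5 the tallest bin is the second one, so the visual impression of the data changes even though nothing else does.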

    Three Sides of Smoothing: Categorical Data Smoothing, Nonparametric Regression, and Density Estimation

    The past forty years have seen a great deal of research into the construction and properties of nonparametric estimates of smooth functions. This research has focused primarily on two sides of the smoothing problem: nonparametric regression and density estimation. Theoretical results for these two situations are similar, and multivariate density estimation was an early justification for the Nadaraya-Watson kernel regression estimator. A third, less well-explored, strand of applications of smoothing is to the estimation of probabilities in categorical data. In this paper the position of categorical data smoothing as a bridge between nonparametric regression and density estimation is explored. Nonparametric regression provides a paradigm for the construction of effective categorical smoothing estimates, and use of an appropriate likelihood function yields cell probability estimates with many desirable properties. Such estimates can be used to construct regression estimates when one or more of the categorical variables are viewed as response variables. They also lead naturally to the construction of well-behaved density estimates using local or penalized likelihood estimation, which can then be used in a regression context. Several real data sets are used to illustrate these points.
    Statistics Working Papers Series

    An Empirical Study of Factors Relating to the Success of Broadway Shows

    This article uses the Cox proportional hazards model to analyze recent Broadway show data to investigate the factors that relate to the longevity of shows. The type of show, whether a show is a revival, and first-week attendance for the show are predictive for longevity. Favorable critic reviews in the New York Daily News are related to greater success, but reviews in the New York Times are not. Winning major Tony Awards is associated with a longer run for a show, but being nominated for Tonys and then losing is associated with a shorter postaward run.
    Statistics Working Papers Series

    An Investigation of Missing Data Methods for Classification Trees

    There are many different missing data methods used by classification tree algorithms, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees. We show that in the context of classification trees, the relationship between the missingness and the dependent variable, rather than the standard missingness classification approach of Little and Rubin (2002) (missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR)), is the most helpful criterion to distinguish different missing data methods. We make recommendations as to the best method to use in various situations. The paper concludes with discussion of a real data set related to predicting bankruptcy of a firm.
    Statistics Working Papers Series
