82 research outputs found
Measuring the stability of histogram appearance when the anchor position is changed
Although the histogram is the most widely used density estimator, it is well known that the appearance of a constructed histogram for a given bin width can change markedly for different choices of anchor position. In this paper we construct a stability index that assesses the potential changes in the appearance of histograms for a given data set and bin width as the anchor position changes. If a particular bin width choice leads to an unstable appearance, the arbitrary choice of any one anchor position is dangerous, and a different bin width should be considered. The index is based on the statistical roughness of the histogram estimate. We show via Monte Carlo simulation that densities with more structure are more likely to lead to histograms with unstable appearance. In addition, ignoring the precision to which the data values are provided when choosing the bin width leads to instability. We provide several real data examples to illustrate the properties of the index. Applications to other binned density estimators are also discussed.
Keywords: Bin width, frequency polygon, Gini index, linear binning, Lorenz curve, Monte Carlo simulation
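The dependence on anchor position described in this abstract can be illustrated with a minimal sketch. It is not the paper's stability index: the roughness measure used here (sum of squared jumps between adjacent bin heights, divided by the bin width) is an assumed, commonly used approximation, and the function names `histogram_heights`, `roughness`, and `roughness_by_anchor` are hypothetical.

```python
import math
import random
from collections import Counter


def histogram_heights(data, bin_width, anchor):
    """Density-scale heights for bins [anchor + k*h, anchor + (k+1)*h)."""
    # Bin index of each observation relative to the anchor.
    idx = [math.floor((x - anchor) / bin_width) for x in data]
    counts = Counter(idx)
    n = len(data)
    # Include empty bins between the lowest and highest occupied bin.
    return [counts[k] / (n * bin_width) for k in range(min(idx), max(idx) + 1)]


def roughness(heights, bin_width):
    """Assumed roughness measure: squared jumps between adjacent bins,
    with a zero-height bin added at each end, divided by the bin width."""
    padded = [0.0] + list(heights) + [0.0]
    return sum((b - a) ** 2 for a, b in zip(padded, padded[1:])) / bin_width


def roughness_by_anchor(data, bin_width, n_shifts=10):
    """Roughness of the histogram for n_shifts anchors spread over one bin width."""
    shifts = [bin_width * j / n_shifts for j in range(n_shifts)]
    return [roughness(histogram_heights(data, bin_width, s), bin_width)
            for s in shifts]


if __name__ == "__main__":
    # Simulated bimodal sample (illustrative data, not from the paper).
    random.seed(0)
    data = ([random.gauss(0.0, 1.0) for _ in range(100)]
            + [random.gauss(3.0, 0.5) for _ in range(100)])
    values = roughness_by_anchor(data, bin_width=0.5)
    print("min roughness:", min(values), "max roughness:", max(values))
```

A wide spread between the smallest and largest roughness values suggests that the histogram's appearance at this bin width is sensitive to the anchor; the paper's index formalizes this idea in its own way.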
Three Sides of Smoothing: Categorical Data Smoothing, Nonparametric Regression, and Density Estimation
The past forty years have seen a great deal of research into the construction and properties of nonparametric estimates of smooth functions. This research has focused primarily on two sides of the smoothing problem: nonparametric regression and density estimation. Theoretical results for these two situations are similar, and multivariate density estimation was an early justification for the Nadaraya-Watson kernel regression estimator.
A third, less well-explored, strand of applications of smoothing is to the estimation of probabilities in categorical data. In this paper the position of categorical data smoothing as a bridge between nonparametric regression and density estimation is explored. Nonparametric regression provides a paradigm for the construction of effective categorical smoothing estimates, and use of an appropriate likelihood function yields cell probability estimates with many desirable properties. Such estimates can be used to construct regression estimates when one or more of the categorical variables are viewed as response variables. They also lead naturally to the construction of well-behaved density estimates using local or penalized likelihood estimation, which can then be used in a regression context. Several real data sets are used to illustrate these points.
Statistics Working Papers Series
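The connection between multivariate density estimation and the Nadaraya-Watson estimator mentioned in this abstract can be sketched with a standard derivation. The notation is assumed for illustration and is not taken from the paper: $K$ is a symmetric kernel that integrates to one, $h_x$ and $h_y$ are bandwidths, and $(x_i, y_i)$, $i = 1, \dots, n$, are the observations.

```latex
% Regression function written as a ratio of densities
m(x) = E(Y \mid X = x) = \frac{\int y \, f(x, y) \, dy}{f(x)}

% Product-kernel estimate of the joint density
\hat f(x, y) = \frac{1}{n h_x h_y} \sum_{i=1}^{n}
    K\!\left(\frac{x - x_i}{h_x}\right) K\!\left(\frac{y - y_i}{h_y}\right)

% Because K is symmetric and integrates to one,
% \int y \, h_y^{-1} K((y - y_i)/h_y) \, dy = y_i, so
\int y \, \hat f(x, y) \, dy = \frac{1}{n h_x} \sum_{i=1}^{n}
    K\!\left(\frac{x - x_i}{h_x}\right) y_i ,
\qquad
\hat f(x) = \frac{1}{n h_x} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_x}\right)

% Taking the ratio gives the Nadaraya-Watson estimator
\hat m(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_x}\right) y_i}
                 {\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_x}\right)}
```

The bandwidth $h_y$ drops out of the final expression, which is why the estimator needs only a single smoothing parameter in $x$.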
An Empirical Study of Factors Relating to the Success of Broadway Shows
This article uses the Cox proportional hazards model to analyze recent Broadway show data to investigate the factors that relate to the longevity of shows. The type of show, whether a show is a revival, and first-week attendance for the show are predictive for longevity. Favorable critic reviews in the New York Daily News are related to greater success, but reviews in the New York Times are not. Winning major Tony Awards is associated with a longer run for a show, but being nominated for Tonys and then losing is associated with a shorter postaward run.
Statistics Working Papers Series
An Investigation of Missing Data Methods for Classification Trees
There are many different missing data methods used by classification tree algorithms, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees. We show that in the context of classification trees, the relationship between the missingness and the dependent variable, rather than the standard missingness classification approach of Little and Rubin (2002) (missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR)), is the most helpful criterion to distinguish different missing data methods. We make recommendations as to the best method to use in various situations. The paper concludes with discussion of a real data set related to predicting bankruptcy of a firm.
Statistics Working Papers Series