
    Three Sides of Smoothing: Categorical Data Smoothing, Nonparametric Regression, and Density Estimation

    The past forty years have seen a great deal of research into the construction and properties of nonparametric estimates of smooth functions. This research has focused primarily on two sides of the smoothing problem: nonparametric regression and density estimation. Theoretical results for these two situations are similar, and multivariate density estimation was an early justification for the Nadaraya-Watson kernel regression estimator. A third, less well-explored, strand of applications of smoothing is to the estimation of probabilities in categorical data. In this paper the position of categorical data smoothing as a bridge between nonparametric regression and density estimation is explored. Nonparametric regression provides a paradigm for the construction of effective categorical smoothing estimates, and use of an appropriate likelihood function yields cell probability estimates with many desirable properties. Such estimates can be used to construct regression estimates when one or more of the categorical variables are viewed as response variables. They also lead naturally to the construction of well-behaved density estimates using local or penalized likelihood estimation, which can then be used in a regression context. Several real data sets are used to illustrate these points.
    Statistics Working Papers Series
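
    The paper develops likelihood-based cell probability estimates; as a loose illustration of what categorical data smoothing does in the simplest setting, the Python sketch below (not taken from the paper) shrinks raw cell frequencies of an ordered categorical variable toward their neighbours with a geometric discrete kernel controlled by a single bandwidth-like parameter lam.

```python
import numpy as np

def smooth_cell_probabilities(counts, lam):
    """Kernel-smoothed cell probabilities for an ordered categorical variable.

    counts : observed cell counts for the K ordered categories
    lam    : smoothing parameter in [0, 1); lam = 0 returns the raw relative
             frequencies, larger values borrow more strength from nearby cells.
    """
    counts = np.asarray(counts, dtype=float)
    K = counts.size
    p_raw = counts / counts.sum()                        # unsmoothed frequency estimates
    cells = np.arange(K)
    W = lam ** np.abs(cells[:, None] - cells[None, :])   # geometric discrete kernel
    W /= W.sum(axis=1, keepdims=True)                    # each cell's weights sum to one
    p = W @ p_raw
    return p / p.sum()                                   # renormalise so the cells sum to one

# Sparse counts over 8 ordered cells: empty cells receive positive probability
print(smooth_cell_probabilities([0, 2, 5, 9, 7, 1, 0, 1], lam=0.3))
```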

    Crowdsourcing Without a Crowd: Reliable Online Species Identification Using Bayesian Models to Minimize Crowd Size

    We present an incremental Bayesian model that resolves key issues of crowd size and data quality for consensus labeling. We evaluate our method using data collected from a real-world citizen science program, BeeWatch, which invites members of the public in the United Kingdom to classify (label) photographs of bumblebees as one of 22 possible species. The biological recording domain poses two key and hitherto unaddressed challenges for consensus models of crowdsourcing: (1) the large number of potential species makes classification difficult, and (2) this is compounded by limited crowd availability, stemming from both the inherent difficulty of the task and the lack of relevant skills among the general public. We demonstrate that consensus labels can be reliably found in such circumstances with very small crowd sizes of around three to five users (i.e., through group sourcing). Our incremental Bayesian model, which minimizes crowd size by re-evaluating the quality of the consensus label following each species identification solicited from the crowd, is competitive with a Bayesian approach that uses a larger but fixed crowd size and outperforms majority voting. These results have important ecological applicability: biological recording programs such as BeeWatch can sustain themselves when resources such as taxonomic experts to confirm identifications by photo submitters are scarce (as is typically the case), and feedback can be provided to submitters in a timely fashion. More generally, our model provides benefits to any crowdsourced consensus labeling task where there is a cost (financial or otherwise) associated with soliciting a label.
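
    The abstract does not spell out the update rule; the sketch below is a deliberately simplified stand-in for the incremental idea, assuming every crowd member is correct with the same fixed probability and that errors spread uniformly over the remaining species (the paper instead models per-user behaviour). It shows the stopping logic: labels are consumed one at a time and solicitation stops as soon as one species carries enough posterior mass. In this toy model two agreeing labels from moderately accurate users already push the posterior above 0.9, which illustrates why very small crowds can suffice.

```python
import numpy as np

def incremental_consensus(labels, n_species=22, accuracy=0.7,
                          threshold=0.9, max_labels=5):
    """Toy incremental consensus labeller for a photo with n_species candidates.

    Assumes each crowd member is correct with probability `accuracy`, with
    errors spread uniformly over the other species -- a crude stand-in for
    the per-user behaviour the paper's model learns.  Labels are consumed
    one at a time and we stop soliciting as soon as the posterior mass on
    some species reaches `threshold`, or after `max_labels` labels.
    """
    post = np.full(n_species, 1.0 / n_species)       # uniform prior over species
    used = 0
    for lab in labels[:max_labels]:
        like = np.full(n_species, (1.0 - accuracy) / (n_species - 1))
        like[lab] = accuracy                         # likelihood of this label for each true species
        post *= like
        post /= post.sum()                           # Bayes update
        used += 1
        if post.max() >= threshold:                  # confident enough: stop asking the crowd
            break
    return int(post.argmax()), float(post.max()), used

# Three crowd members label the photo as species 4, 4 and 11:
# the two agreeing labels already settle it, so only two are consumed.
print(incremental_consensus([4, 4, 11]))
```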

    Nonparametric Econometrics: The np Package

    We describe the R np package via a series of applications that may be of interest to applied econometricians. The np package implements a variety of nonparametric and semiparametric kernel-based estimators that are popular among econometricians. There are also procedures for nonparametric tests of significance and consistent model specification tests for parametric mean regression models and parametric quantile regression models, among others. The np package focuses on kernel methods appropriate for the mix of continuous, discrete, and categorical data often found in applied settings. Data-driven methods of bandwidth selection are emphasized throughout, though we caution the user that data-driven bandwidth selection methods can be computationally demanding.
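
    The np package itself is R (its workhorse functions npregbw and npreg handle bandwidth selection and estimation); the Python sketch below is not a port of the package but an illustration of the underlying idea for mixed data: a product kernel that is Gaussian in a continuous regressor and Aitchison-Aitken in an unordered categorical one, with both bandwidths chosen by leave-one-out least-squares cross-validation. The O(n^2) search per grid point is a concrete instance of the computational warning above.

```python
import numpy as np

def nw_predict(xc, xd, y, h, lam, n_cat, xc0, xd0):
    """Nadaraya-Watson estimate at (xc0, xd0) with a product kernel:
    Gaussian in the continuous regressor xc, Aitchison-Aitken in the
    unordered categorical regressor xd with n_cat levels."""
    kc = np.exp(-0.5 * ((xc - xc0) / h) ** 2)
    kd = np.where(xd == xd0, 1.0 - lam, lam / (n_cat - 1))
    w = kc * kd
    return np.sum(w * y) / np.sum(w)

def cv_bandwidths(xc, xd, y, n_cat, h_grid, lam_grid):
    """Leave-one-out least-squares cross-validation over (h, lam); the
    O(n^2) work per grid point is why data-driven bandwidth selection
    can become computationally demanding."""
    n = len(y)
    best, best_score = None, np.inf
    for h in h_grid:
        for lam in lam_grid:
            sq_err = 0.0
            for i in range(n):
                keep = np.arange(n) != i
                fit = nw_predict(xc[keep], xd[keep], y[keep],
                                 h, lam, n_cat, xc[i], xd[i])
                sq_err += (y[i] - fit) ** 2
            if sq_err < best_score:
                best, best_score = (h, lam), sq_err
    return best

# Simulated data with one continuous and one binary regressor
rng = np.random.default_rng(0)
n = 100
xc = rng.uniform(-2, 2, n)
xd = rng.integers(0, 2, n)
y = np.sin(xc) + 0.5 * xd + rng.normal(0, 0.2, n)
print(cv_bandwidths(xc, xd, y, n_cat=2,
                    h_grid=[0.1, 0.2, 0.4, 0.8],
                    lam_grid=[0.0, 0.1, 0.3, 0.5]))
```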

    Imposing Economic Constraints in Nonparametric Regression: Survey, Implementation and Extension

    Economic conditions such as convexity, homogeneity, homotheticity, and monotonicity are all important assumptions or consequences of assumptions of economic functionals to be estimated. Recent research has seen a renewed interest in imposing constraints in nonparametric regression. We survey the available methods in the literature, discuss the challenges that present themselves when empirically implementing these methods, and extend an existing method to handle general nonlinear constraints. A heuristic discussion on the empirical implementation for methods that use sequential quadratic programming is provided for the reader, and simulated and empirical evidence on the distinction between constrained and unconstrained nonparametric regression surfaces is covered.
    Keywords: identification, concavity, Hessian, constraint weighted bootstrapping, earnings function
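
    One of the surveyed approaches, constraint weighted bootstrapping, replaces the uniform observation weights 1/n in a linear smoother with weights chosen as close to uniform as possible subject to the constraint holding on a grid; for linear constraints this is a quadratic program. The sketch below is an illustration under that reading, not the paper's implementation: it imposes monotonicity on a Nadaraya-Watson fit and solves the program with SciPy's SLSQP routine, a sequential quadratic programming method.

```python
import numpy as np
from scipy.optimize import minimize

def nw_weights(x, x0, h):
    """Nadaraya-Watson kernel weights of each observation at the point x0."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return k / k.sum()

def monotone_fit(x, y, h, grid):
    """Impose a non-decreasing fit on `grid` by reweighting observations,
    in the spirit of constraint weighted bootstrapping: find observation
    weights p as close as possible to the uniform weights 1/n such that
    the reweighted kernel fit is non-decreasing.  The quadratic program
    is solved with SLSQP, a sequential quadratic programming routine."""
    n = len(y)
    # A[g, i] is observation i's contribution to the fit at grid point g,
    # scaled so that uniform weights reproduce the ordinary kernel fit.
    A = np.vstack([nw_weights(x, g, h) * y for g in grid]) * n
    u = np.full(n, 1.0 / n)
    cons = [{'type': 'eq',   'fun': lambda p: p.sum() - 1.0},
            {'type': 'ineq', 'fun': lambda p: np.diff(A @ p)}]   # fit must be non-decreasing
    res = minimize(lambda p: np.sum((p - u) ** 2), u, method='SLSQP',
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    return A @ res.x        # constrained fit evaluated on the grid

# Toy data whose unconstrained kernel fit dips; the reweighted fit cannot.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 80))
y = x + 0.3 * np.sin(6 * x) + rng.normal(0, 0.05, 80)
grid = np.linspace(0.05, 0.95, 25)
print(monotone_fit(x, y, h=0.08, grid=grid))
```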

    A Non-Asymptotic Bandwidth Selection Method for Kernel Density Estimation of Discrete Data

    In this paper we explore a method for modeling categorical data derived from the principles of the Generalized Cross Entropy method. The method builds on standard kernel density estimation techniques by providing a novel non-asymptotic data-driven bandwidth selection rule. In addition to this, the entropic approach provides model sparsity not present in the standard kernel approach. Numerical experiments with 10-dimensional binary medical data are conducted. The experiments suggest that the Generalized Cross Entropy approach is a viable method for density estimation, discriminant analysis, and classification.
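
    The paper's non-asymptotic bandwidth rule comes from the Generalized Cross Entropy framework and is not reproduced here; the sketch below only shows the estimator such a rule plugs into, an Aitchison-Aitken product-kernel density estimate for multivariate binary data, with the bandwidth chosen instead by ordinary leave-one-out likelihood cross-validation and with a simulated 10-dimensional binary sample standing in for the medical data.

```python
import numpy as np

def aa_density(X, x0, lam):
    """Aitchison-Aitken product-kernel density estimate at the binary point x0.
    For binary data each coordinate kernel puts weight (1 - lam) on a match and
    lam on a mismatch; lam = 0 gives the raw relative frequency of the cell x0,
    lam = 0.5 gives the uniform distribution over all 2^d cells."""
    k = np.where(X == x0, 1.0 - lam, lam)
    return np.mean(np.prod(k, axis=1))

def loo_log_likelihood(X, lam):
    """Leave-one-out log-likelihood of the sample, used here as a simple
    data-driven bandwidth criterion (the paper's non-asymptotic Generalized
    Cross Entropy rule replaces this step)."""
    n = X.shape[0]
    ll = 0.0
    for i in range(n):
        Xi = np.delete(X, i, axis=0)
        ll += np.log(aa_density(Xi, X[i], lam) + 1e-300)
    return ll

# Simulated 10-dimensional binary sample standing in for the medical data
rng = np.random.default_rng(2)
X = (rng.random((200, 10)) < 0.3).astype(int)
lams = np.linspace(0.01, 0.49, 25)
best = max(lams, key=lambda lam: loo_log_likelihood(X, lam))
print("cross-validated bandwidth:", round(best, 3))
```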

    Semiparametric and Additive Model Selection Using an Improved Akaike Information Criterion

    An improved AIC-based criterion is derived for model selection in general smoothing-based modeling, including semiparametric models and additive models. Examples are provided of applications to goodness-of-fit, smoothing parameter and variable selection in an additive model and semiparametric models, and variable selection in a model with a nonlinear function of linear terms.
    Statistics Working Papers Series
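
    The abstract does not restate the criterion. For a single smoothing parameter and a linear smoother y_hat = H y, the improved AIC of Hurvich, Simonoff and Tsai (1998) takes the form AICc = log(sigma_hat^2) + (1 + tr(H)/n) / (1 - (tr(H) + 2)/n); the sketch below uses that form to pick the bandwidth of a simple kernel smoother, as an illustration of the kind of criterion involved rather than the paper's criterion for semiparametric and additive models.

```python
import numpy as np

def smoother_matrix(x, h):
    """Hat matrix H of a Nadaraya-Watson (local constant) kernel smoother,
    so that the fitted values are y_hat = H @ y."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def improved_aic(y, H):
    """Improved AIC of Hurvich, Simonoff and Tsai (1998) for a linear smoother:
        AICc = log(sigma_hat^2) + (1 + tr(H)/n) / (1 - (tr(H) + 2)/n),
    with sigma_hat^2 the average squared residual."""
    n = len(y)
    resid = y - H @ y
    sigma2 = np.mean(resid ** 2)
    tr = np.trace(H)
    return np.log(sigma2) + (1.0 + tr / n) / (1.0 - (tr + 2.0) / n)

# Pick the bandwidth of a kernel smoother by minimising the improved AIC
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 120))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 120)
grid = [0.02, 0.05, 0.1, 0.2, 0.4]
print("AICc-selected bandwidth:",
      min(grid, key=lambda h: improved_aic(y, smoother_matrix(x, h))))
```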

    Processing thermogravimetric analysis data for isoconversional kinetic analysis of lignocellulosic biomass pyrolysis: Case study of corn stalk

    Modeling of lignocellulosic biomass pyrolysis processes can be used to determine their key operating and design parameters. This requires a significant amount of information about pyrolysis kinetic parameters, in particular the activation energy. Thermogravimetric analysis (TGA) is the most commonly used tool to obtain experimental kinetic data, and isoconversional kinetic analysis is the most effective way of processing TGA data to calculate effective activation energies for lignocellulosic biomass pyrolysis. This paper reviews the overall procedure of processing TGA data for isoconversional kinetic analysis of lignocellulosic biomass pyrolysis using the Friedman isoconversional method. This includes the removal of "error" data points and the dehydration stage from the original TGA data, transformation of TGA data to conversion data, differentiation of conversion data and smoothing of the derivative conversion data, interpolation of conversion and derivative conversion data, the isoconversional calculations, and reconstruction of the kinetic process. A detailed isoconversional kinetic analysis of TGA data obtained from the pyrolysis of corn stalk at five heating rates is presented. The results show that the effective activation energies of corn stalk pyrolysis vary from 148 to 473 kJ mol−1 as the conversion ranges from 0.05 to 0.85.
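
    As a minimal sketch of the final steps in that pipeline, assuming the cleaning, smoothing of the derivative conversion data and unit conversions have already been done, the Python code below interpolates each heating-rate run at fixed conversions and applies the Friedman relation ln(dalpha/dt) = ln[A f(alpha)] - E_alpha/(R T), so that the slope of ln(dalpha/dt) against 1/T across heating rates gives E_alpha. The data layout and the synthetic first-order runs used to exercise it are illustrative, not the corn stalk data from the paper.

```python
import numpy as np

R = 8.314  # gas constant, J mol^-1 K^-1

def friedman_activation_energy(runs, alphas):
    """Friedman isoconversional analysis.

    `runs` is a list of heating-rate experiments, each a dict with arrays
    't' (s), 'T' (K) and 'alpha' (conversion, increasing in time), assumed
    to be already cleaned and smoothed as described above.  For each target
    conversion we interpolate 1/T and ln(dalpha/dt) in every run and use
        ln(dalpha/dt) = ln[A f(alpha)] - E_alpha / (R T),
    so the slope of ln(dalpha/dt) against 1/T gives -E_alpha / R.
    Returns the activation energies in kJ/mol.
    """
    energies = []
    for a in alphas:
        inv_T, log_rate = [], []
        for run in runs:
            t, T, alpha = run['t'], run['T'], run['alpha']
            dadt = np.gradient(alpha, t)              # derivative conversion data
            inv_T.append(1.0 / np.interp(a, alpha, T))
            log_rate.append(np.log(np.interp(a, alpha, dadt)))
        slope = np.polyfit(inv_T, log_rate, 1)[0]     # slope = -E_alpha / R
        energies.append(-slope * R / 1000.0)          # J/mol -> kJ/mol
    return np.array(energies)

# Synthetic check: a first-order decomposition with E = 180 kJ/mol,
# "measured" at three heating rates; the estimates should land near that value.
def simulate(beta_K_per_min, E=180e3, A=1e13, T0=450.0, dt=0.5, steps=6000):
    t = np.arange(steps) * dt
    T = T0 + (beta_K_per_min / 60.0) * t
    alpha = np.zeros(steps)
    for i in range(1, steps):
        rate = A * np.exp(-E / (R * T[i - 1])) * (1.0 - alpha[i - 1])
        alpha[i] = min(alpha[i - 1] + rate * dt, 0.999999)
    return {'t': t, 'T': T, 'alpha': alpha}

runs = [simulate(beta) for beta in (5, 10, 20)]
print(friedman_activation_energy(runs, alphas=[0.1, 0.3, 0.5, 0.7]))
```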