4 research outputs found

    Bandwidth choice for nonparametric classification

    It is shown that, for kernel-based classification with univariate distributions and two populations, optimal bandwidth choice has a dichotomous character. If the two densities cross at just one point, where their curvatures have the same signs, then minimum Bayes risk is achieved using bandwidths which are an order of magnitude larger than those which minimize pointwise estimation error. On the other hand, if the curvature signs are different, or if there are multiple crossing points, then bandwidths of conventional size are generally appropriate. The range of different modes of behavior is narrower in multivariate settings. There, the optimal size of bandwidth is generally the same as that which is appropriate for pointwise density estimation. These properties motivate empirical rules for bandwidth choice. Comment: Published at http://dx.doi.org/10.1214/009053604000000959 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
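For readers unfamiliar with the setting, the following is a minimal sketch of a two-population, univariate kernel classifier in which each population has its own bandwidth. It is not the paper's procedure and does not implement its bandwidth rules; the Gaussian kernel, the function names `gaussian_kde_1d` and `kernel_classify`, the equal priors, and the bandwidth value 0.4 are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_1d(x, sample, h):
    """Gaussian kernel density estimate at points x from a 1-D sample, bandwidth h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    # Scaled distance from every evaluation point to every sample point.
    z = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (sample.size * h * np.sqrt(2 * np.pi))

def kernel_classify(x, sample_a, sample_b, h_a, h_b, prior_a=0.5):
    """Return 0 where population A's prior-weighted density estimate is at least B's, else 1."""
    fa = prior_a * gaussian_kde_1d(x, sample_a, h_a)
    fb = (1.0 - prior_a) * gaussian_kde_1d(x, sample_b, h_b)
    return np.where(fa >= fb, 0, 1)

# Illustrative data (assumed): two unit-variance normal populations.
rng = np.random.default_rng(0)
sample_a = rng.normal(-1.0, 1.0, size=200)
sample_b = rng.normal(1.0, 1.0, size=200)

# Classify one point deep in each population's territory.
labels = kernel_classify([-3.0, 3.0], sample_a, sample_b, h_a=0.4, h_b=0.4)
```

Varying `h_a` and `h_b` and measuring the misclassification rate on held-out draws is one way to explore how bandwidth size affects classification error, as opposed to pointwise density estimation error.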

    A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets

    Missing data handling is an important preparation step for most data discrimination or mining tasks. Inappropriate treatment of missing data may cause large errors or false results. In this paper, we study the effect of a missing data recovery method, namely the pseudo-nearest-neighbor substitution approach, on Gaussian-distributed data sets that represent typical cases in data discrimination and data mining applications. The error rate of the proposed recovery method is evaluated by comparing the clustering results on the recovered data sets with those obtained on the originally complete data sets. The results are also compared with those obtained by two other missing data handling methods: constant default value substitution and ignoring the missing data (non-substitution). The experimental results provide valuable insight into improving the accuracy of data discrimination and knowledge discovery on large data sets containing missing values.
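The abstract does not define the pseudo-nearest-neighbor method, so the sketch below shows only a generic nearest-neighbor substitution next to the constant default value baseline it is compared against. Everything here is an assumption for illustration: the names `nn_substitute` and `constant_substitute`, the use of `NaN` to mark missing entries, the restriction of donors to complete rows, and the squared Euclidean distance over observed columns.

```python
import numpy as np

def nn_substitute(data):
    """Fill NaNs in each row from the nearest complete row.

    Distance is squared Euclidean over the columns the incomplete row observed.
    This is a generic nearest-neighbor fill, not the paper's exact method.
    """
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]
    if complete.shape[0] == 0:
        raise ValueError("need at least one complete row to act as a donor")
    for i, row in enumerate(data):
        missing = np.isnan(row)
        if not missing.any():
            continue
        diffs = complete[:, ~missing] - row[~missing]
        donor = complete[np.argmin((diffs**2).sum(axis=1))]
        filled[i, missing] = donor[missing]
    return filled

def constant_substitute(data, default=0.0):
    """Baseline: replace every NaN with a fixed default value."""
    data = np.asarray(data, dtype=float)
    return np.where(np.isnan(data), default, data)

# Tiny illustrative data set (assumed): the last row is missing its second value.
data = np.array([[0.0, 0.1], [5.0, 5.2], [4.9, np.nan]])
recovered = nn_substitute(data)
baseline = constant_substitute(data)
```

In this toy case the incomplete row sits next to the second row, so the nearest-neighbor fill borrows that row's value, while the constant baseline inserts the default regardless of where the row lies.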

    Missing data treatment and data fusion toward travel time estimation for ATIS

    This study develops a travel time estimation process that integrates missing data treatment with data-fusion-based approaches. For missing data treatment, it develops a grey time-series model and a grey-theory-based pseudo-nearest-neighbor method to recover temporal and spatial missing values, respectively, in traffic detector data sets. Travel time data fusion also considers both the spatial and temporal patterns of traffic data: the study presents a speed-based link travel time extrapolation model for analytical travel time estimation and further develops a recurrent neural network (RNN) integrated with grey models for real-time travel time estimation. Field data from National Freeway No. 1 in Taiwan are used in a case study to test the proposed models. The results show that the grey-theory-based missing data treatment models recovered missing values accurately and that the grey-based RNN models predicted travel times accurately. These results indicate that the proposed missing data treatment and data fusion approaches can ensure the accuracy of travel time estimation with incomplete data sets and are therefore suited to implementation in ATIS.