
    Extended Methods to Handle Classification Biases

    Classifiers can provide counts of items per class, but systematic classification errors yield biases (e.g., if a class is often misclassified as another, its size may be underestimated). To handle classification biases, the statistics and epidemiology domains have devised methods for estimating unbiased class sizes (or class probabilities) without identifying which individual items are misclassified. These bias correction methods are applicable to machine learning classifiers, but in some cases yield high result variance or increased bias. We present the applicability and drawbacks of existing methods and extend them with three novel methods. Our Sample-to-Sample method provides accurate confidence intervals for the bias correction results. Our Maximum Determinant method predicts which classifier yields the least result variance. Our Ratio-to-TP method details the error decomposition in classifier outputs (i.e., how many items classified as class Cy truly belong to Cx, for all possible classes) and has properties of interest for applying the Maximum Determinant method. Our methods are demonstrated empirically, and we discuss the need for establishing theory and guidelines for choosing the methods and classifier to apply.
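    As an illustration of the kind of correction these methods build on (not the paper's novel Sample-to-Sample, Maximum Determinant, or Ratio-to-TP methods), here is a minimal sketch of the standard misclassification-matrix inversion, with made-up numbers:

```python
import numpy as np

# Hypothetical 2-class example: rows = true class, cols = predicted class.
# P[i, j] = P(classified as j | truly i), estimated from a labeled validation set.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Raw classifier counts over an unlabeled set (biased by misclassification).
observed = np.array([600.0, 400.0])

# Bias correction: the observed counts satisfy P^T @ true = observed,
# so solve the linear system to recover unbiased class-size estimates.
corrected = np.linalg.solve(P.T, observed)
print(corrected)  # class sizes after removing the systematic error
```

    Because each row of P sums to 1, the corrected counts sum to the same total as the observed counts; the correction only redistributes items between classes.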

    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data, but they also often include online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, like linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems as well as two-class and multi-class classification problems in a single, versatile framework. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities -- such as out-of-bag error and variable importance -- are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), a simulated one as well as real-world data. One variant relies on subsampling, while three others relate to parallel implementations of random forests and involve either adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
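    The subsampling idea can be illustrated independently of random forests; below is a minimal m-out-of-n subsampled bootstrap sketch on made-up data (not the paper's experiments), showing that many cheap fits on small subsamples still recover the target quantity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "massive" sample; in practice n would be tens of millions.
n = 100_000
data = rng.normal(loc=5.0, scale=2.0, size=n)

# m-out-of-n subsampled bootstrap: each replicate is computed on a small
# subsample (m << n), mimicking the subsampling adaptations reviewed for
# Big Data random forests.
m = 1_000
reps = 200
means = np.array([rng.choice(data, size=m, replace=True).mean()
                  for _ in range(reps)])

print(means.mean())  # aggregate of the replicates, close to the true mean 5.0
```

    The same pattern underlies divide-and-conquer variants: fit on manageable pieces, then aggregate the per-piece results.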

    Challenges of Big Data Analysis

    Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This article gives an overview of the salient features of Big Data and how these features drive paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
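    Spurious correlation is easy to reproduce numerically. The sketch below (synthetic data, illustrative only) shows that when the number of predictors far exceeds the number of observations, some predictor appears strongly correlated with a response that is in truth independent of all of them:

```python
import numpy as np

rng = np.random.default_rng(42)

# y is independent of every column of X, yet with p >> n the best-correlated
# predictor looks strongly related to y purely by chance.
n, p = 60, 5_000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Sample correlation of y with each column of X.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corrs = Xc.T @ yc / n

# The maximum absolute correlation is large despite zero true correlation.
print(np.abs(corrs).max())
```

    This is the noise-accumulation effect the abstract warns about: naive variable screening on Big Data will reliably "find" such phantom relationships.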

    A particle filtering approach for joint detection/estimation of multipath effects on GPS measurements

    Multipath propagation causes major impairments to Global Positioning System (GPS) based navigation. Multipath results in biased GPS measurements, hence inaccurate position estimates. In this work, multipath effects are considered as abrupt changes affecting the navigation system. A multiple-model formulation is proposed whereby the changes are represented by a discrete-valued process. The detection of the errors induced by multipath is handled by a Rao-Blackwellized particle filter (RBPF). The RBPF estimates the indicator process jointly with the navigation states and multipath biases. The interest of this approach is its ability to integrate a priori constraints about the propagation environment. Detection is improved by using information from near-future GPS measurements at the particle filter (PF) sampling step. A computationally modest delayed sampling scheme is developed, based on a minimal-duration assumption for multipath effects. Finally, the standard PF resampling stage is modified to include a hypothesis-test-based decision step.
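    A plain bootstrap particle filter with a per-particle bias indicator conveys the multiple-model idea. The sketch below uses made-up scalar dynamics and a known bias magnitude; it is not the paper's Rao-Blackwellized filter or its delayed-sampling scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar random-walk state, measurements hit by an abrupt bias
# (standing in for a multipath error) from step 50 onward.
T, N = 100, 500  # time steps, particles
true_x = np.cumsum(rng.normal(0, 0.1, T))
bias = np.where(np.arange(T) >= 50, 3.0, 0.0)
z = true_x + bias + rng.normal(0, 0.5, T)

# Each particle carries a state and a 0/1 multipath indicator.
x = np.zeros(N)
m = np.zeros(N)  # indicator: 1 = bias hypothesized present
est = np.empty(T)
for t in range(T):
    # Propagate: random-walk state; the indicator flips with small probability.
    x += rng.normal(0, 0.1, N)
    flip = rng.random(N) < 0.05
    m = np.where(flip, 1 - m, m)
    # Weight by measurement likelihood under each particle's bias hypothesis
    # (bias magnitude assumed known here, a deliberate simplification).
    pred = x + 3.0 * m
    w = np.exp(-0.5 * ((z[t] - pred) / 0.5) ** 2)
    w /= w.sum()
    est[t] = np.sum(w * x)
    # Multinomial resampling.
    idx = rng.choice(N, size=N, p=w)
    x, m = x[idx], m[idx]

print(abs(est[-1] - true_x[-1]))  # tracking error after the bias onset
```

    Particles whose indicator matches the true multipath regime receive high weights and survive resampling, so the filter both detects the change and keeps the state estimate unbiased.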

    Automated Classification of Periodic Variable Stars detected by the Wide-field Infrared Survey Explorer

    We describe a methodology to classify periodic variable stars identified using photometric time-series measurements constructed from the Wide-field Infrared Survey Explorer (WISE) full-mission single-exposure Source Databases. This will assist in the future construction of a WISE Variable Source Database that assigns variables to specific science classes as constrained by the WISE observing cadence, with statistically meaningful classification probabilities. We have analyzed the WISE light curves of 8273 variable stars identified in previous optical variability surveys (MACHO, GCVS, and ASAS) and show that Fourier decomposition techniques can be extended into the mid-IR to assist with their classification. Combined with other periodic light-curve features, this sample is then used to train a machine-learned classifier based on the random forest (RF) method. Consistent with previous classification studies of variable stars in general, the RF machine-learned classifier is superior to other methods in terms of accuracy, robustness against outliers, and relative immunity to features that carry little or redundant class information. For the three most common classes identified by WISE (Algols, RR Lyrae, and W Ursae Majoris type variables), we obtain classification efficiencies of 80.7%, 82.7%, and 84.5%, respectively, using cross-validation analyses, with 95% confidence intervals of approximately +/-2%. These accuracies are achieved at purity (or reliability) levels of 88.5%, 96.2%, and 87.8%, respectively, similar to those achieved in previous automated classification studies of periodic variable stars.
    Comment: 48 pages, 17 figures, 1 table, accepted by A
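    Fourier decomposition of a folded light curve can be sketched as a least-squares fit of a low-order harmonic series; the period, amplitudes and noise level below are invented, and this is not the WISE pipeline:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate a periodic light curve: a two-harmonic signal plus noise,
# sampled at irregular times as in a survey.
period = 0.6  # days (made-up)
t = np.sort(rng.uniform(0, 20, 300))
phase = (t / period) % 1.0
mag = (12.0 + 0.4 * np.sin(2 * np.pi * phase)
       + 0.1 * np.sin(4 * np.pi * phase)
       + rng.normal(0, 0.02, t.size))

# Design matrix for a 3-harmonic Fourier series in phase.
H = 3
cols = [np.ones_like(phase)]
for k in range(1, H + 1):
    cols += [np.sin(2 * np.pi * k * phase), np.cos(2 * np.pi * k * phase)]
A = np.column_stack(cols)
coef, *_ = np.linalg.lstsq(A, mag, rcond=None)

# Harmonic amplitudes A_k = sqrt(a_k^2 + b_k^2); ratios such as A2/A1 are
# typical machine-learning features for variable-star classification.
amps = np.hypot(coef[1::2], coef[2::2])
print(amps[1] / amps[0])  # recovers roughly 0.1 / 0.4
```

    Features of this kind, together with the period itself, form the input vector for a random forest classifier.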

    Chemical and biological reactions of solidification of peat using ordinary portland cement (OPC) and coal ashes

    Construction over peat areas has often posed a challenge to geotechnical engineers. After decades of study on peat stabilisation techniques, no absolute formulation or guideline has been established to handle this issue. Some researchers have proposed solidification of peat, but a few have also discovered that solidified peat seems to lose strength after a certain period of time. Therefore, understanding the chemical and biological reactions behind peat solidification is vital to understanding the limitations of this treatment technique. In this study, all three types of peat (fibric, hemic and sapric) were mixed using the Mixing 1 and Mixing 2 formulations, which consisted of ordinary Portland cement, fly ash and bottom ash at various ratios. The peat-binder-filler mixtures were subjected to the unconfined compressive strength (UCS) test, a bacterial count test and chemical elemental analysis using XRF, XRD, FTIR and EDS. Two patterns of strength development over the curing period were observed. Mixing 1 samples showed a steady increase in strength up to Day 56, while Mixing 2 samples showed a decrease in strength at Day 28 and Day 56. Samples whose strength increased steadily had lower bacterial counts and enzymatic activity, with an increased quantity of crystallites. Samples with lower strength recorded increased bacterial counts and enzymatic activity, with fewer crystallites. Analysis using XRD showed that pargasite (NaCa2[Mg4Al](Si6Al2)O22(OH)2) was formed in the higher-strength samples, while in the lower-strength samples pargasite was predicted to be converted into monosodium phosphate and Mg(OH)2 as the bacterial consortium was re-activated. The Michaelis-Menten coefficient Km of the bio-chemical reaction in solidified peat was calculated as 303.60, showing that the reaction occurring during solidification was inefficient. The kinetics of crystallite formation under enzymatic effects is modelled as 135.42(1/[S] + 0.44605), which means that when less pargasite is formed, more enzyme is secreted.
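    For reference, a minimal Michaelis-Menten rate sketch using the Km reported above; the Vmax value is a placeholder for illustration, not a value from the study:

```python
# Michaelis-Menten kinetics: v = Vmax * [S] / (Km + [S]).
def mm_rate(s, vmax=1.0, km=303.60):
    """Reaction rate at substrate concentration s (Km from the abstract,
    vmax a made-up placeholder)."""
    return vmax * s / (km + s)

# At [S] = Km the rate is exactly half of Vmax, the defining property of Km;
# a large Km means a high substrate concentration is needed to approach Vmax,
# consistent with the abstract's reading of the reaction as inefficient.
print(mm_rate(303.60))
```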