Extended Methods to Handle Classification Biases
Classifiers can provide counts of items per class, but systematic classification errors introduce biases (e.g., if a class is often misclassified as another, its size may be underestimated). To handle classification biases, the statistics and epidemiology communities have devised methods for estimating unbiased class sizes (or class probabilities) without identifying which individual items are misclassified. These bias-correction methods are applicable to machine learning classifiers, but in some cases yield high result variance and increased bias. We present the applicability and drawbacks of existing methods and extend them with three novel methods. Our Sample-to-Sample method provides accurate confidence intervals for the bias-correction results. Our Maximum Determinant method predicts which classifier yields the least result variance. Our Ratio-to-TP method details the error decomposition in classifier outputs (i.e., how many items classified as class Cy truly belong to Cx, for all possible classes) and has properties of interest for applying the Maximum Determinant method. Our methods are demonstrated empirically, and we discuss the need to establish theory and guidelines for choosing which method and classifier to apply.
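The classical correction the abstract builds on can be sketched in a few lines: if the misclassification matrix is known (estimated from a labeled validation set), the observed class counts are a linear mixture of the true counts, and inverting that mixture recovers unbiased estimates. A minimal sketch, with an illustrative misclassification matrix (not the paper's methods):

```python
import numpy as np

# Misclassification matrix M[i, j] = P(classified as class j | true class i),
# estimated on a labeled validation set. Values here are illustrative only.
M = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.1, 0.9],
])

def correct_counts(observed, M):
    """Recover unbiased class counts from biased classifier counts.

    Each observed count mixes contributions from all true classes:
    observed = M.T @ true, so we solve that linear system for true.
    """
    return np.linalg.solve(M.T, observed)

# Example: true class sizes distorted by the classifier, then recovered.
true = np.array([500.0, 300.0, 200.0])
observed = M.T @ true           # what the classifier would report on average
recovered = correct_counts(observed, M)
print(recovered)                # ≈ [500. 300. 200.]
```

The variance and bias issues mentioned in the abstract arise when M is itself estimated from a small sample, or is nearly singular, which is exactly what the proposed Maximum Determinant method is designed to diagnose.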
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data, but they also often include online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods, and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities -- such as out-of-bag error and variable importance -- are addressed in these methods. Then, we formulate various remarks about random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one from real-world data. One variant relies on subsampling, while three others are parallel implementations of random forests involving either adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
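Two of the scaling levers surveyed above, per-tree subsampling instead of a full bootstrap and parallel tree growing, are directly available in off-the-shelf libraries. A minimal sketch using scikit-learn (not the implementations benchmarked in the paper), with the out-of-bag error also enabled since the review discusses how it is handled at scale:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a massive dataset.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=50,
    max_samples=0.1,   # subsampling variant: each tree sees only 10% of rows
    oob_score=True,    # out-of-bag error, one of the "related quantities" above
    n_jobs=-1,         # parallel variant: grow trees across all cores
    random_state=0,
)
rf.fit(X, y)
print(round(rf.oob_score_, 3))  # OOB accuracy estimate
```

Subsampling trades a little accuracy for much cheaper trees; on truly massive data the paper's "divide-and-conquer" and online variants go further by never holding the full dataset in memory at once.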
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This article gives an overview of the salient features of Big Data and how these features drive a paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions.
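The spurious-correlation challenge mentioned above is easy to demonstrate: with many features, some will correlate strongly with the response purely by chance, even when everything is generated independently. A small simulation sketch (illustrative, not from the article):

```python
import numpy as np

def max_abs_corrs(n=50, p=1000, seed=0):
    """Max |sample correlation| between y and the first 10 vs all p features.

    All variables are independent, so any correlation found is spurious;
    the maximum grows with the number of features scanned.
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n)
    X = rng.standard_normal((n, p))
    yc = (y - y.mean()) / y.std()
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    corrs = np.abs(Xc.T @ yc / n)
    return corrs[:10].max(), corrs.max()

small, large = max_abs_corrs()
print(small, large)  # scanning 1000 features finds a much larger "signal"
```

This is why naive feature screening on high-dimensional data produces false discoveries, and why the article argues for new inferential paradigms rather than classical fixed-dimension asymptotics.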
A particle filtering approach for joint detection/estimation of multipath effects on GPS measurements
Multipath propagation causes major impairments to Global
Positioning System (GPS) based navigation. Multipath results in biased GPS measurements and hence inaccurate position estimates. In this work, multipath effects are treated as abrupt changes affecting the navigation system. A multiple-model formulation is proposed whereby the changes are represented by a discrete-valued process. The detection of the errors induced by multipath is handled by a Rao-Blackwellized particle filter (RBPF). The RBPF estimates the indicator process jointly with the navigation states and multipath biases. A key advantage of this approach is its ability to integrate a priori constraints about the propagation environment. Detection is improved by using information from near-future GPS measurements at the particle filter (PF) sampling step. A computationally modest delayed-sampling scheme is developed, based on a minimum-duration assumption for multipath effects. Finally, the standard PF resampling stage is modified to include a hypothesis-test-based decision step.
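The resampling stage the abstract modifies is a standard particle-filter building block. A minimal sketch of systematic resampling (a common low-variance scheme; the hypothesis-test decision step described above would be layered on top of this, and is not shown here):

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: a single uniform draw selects all indices.

    Particles with large weights are duplicated, those with negligible
    weights are dropped, using one stratified sweep over [0, 1).
    """
    n = len(weights)
    positions = (rng.uniform() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), positions)

rng = np.random.default_rng(0)
w = np.array([0.0, 0.0, 1.0, 0.0])   # degenerate weights: all mass on particle 2
print(systematic_resample(w, rng))   # → [2 2 2 2]
```

With uniform weights every particle survives exactly once; with degenerate weights all offspring come from the single surviving particle, which is the degeneracy the modified resampling stage in the paper aims to manage.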
Automated Classification of Periodic Variable Stars detected by the Wide-field Infrared Survey Explorer
We describe a methodology to classify periodic variable stars identified
using photometric time-series measurements constructed from the Wide-field
Infrared Survey Explorer (WISE) full-mission single-exposure Source Databases.
This will assist in the future construction of a WISE Variable Source Database
that assigns variables to specific science classes as constrained by the WISE
observing cadence with statistically meaningful classification probabilities.
We have analyzed the WISE light curves of 8273 variable stars identified in
previous optical variability surveys (MACHO, GCVS, and ASAS) and show that
Fourier decomposition techniques can be extended into the mid-IR to assist with
their classification. Combined with other periodic light-curve features, this
sample is then used to train a machine-learned classifier based on the random
forest (RF) method. Consistent with previous classification studies of variable
stars in general, the RF machine-learned classifier is superior to other
methods in terms of accuracy, robustness against outliers, and relative
immunity to features that carry little or redundant class information. For the
three most common classes identified by WISE: Algols, RR Lyrae, and W Ursae
Majoris type variables, we obtain classification efficiencies of 80.7%, 82.7%,
and 84.5% respectively using cross-validation analyses, with 95% confidence
intervals of approximately +/-2%. These accuracies are achieved at purity (or
reliability) levels of 88.5%, 96.2%, and 87.8% respectively, similar to that
achieved in previous automated classification studies of periodic variable
stars.
Comment: 48 pages, 17 figures, 1 table, accepted by A
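The Fourier decomposition the abstract extends into the mid-IR amounts to a linear least-squares fit of harmonics to the phase-folded light curve; coefficient amplitudes and their ratios then serve as classification features. A minimal sketch on synthetic data (illustrative, not the WISE pipeline):

```python
import numpy as np

def fourier_fit(phase, mag, order=3):
    """Least-squares Fourier decomposition of a phase-folded light curve.

    Fits mag ~ a0 + sum_k [a_k cos(2*pi*k*phase) + b_k sin(2*pi*k*phase)]
    and returns the coefficient vector [a0, a1, b1, a2, b2, ...].
    """
    cols = [np.ones_like(phase)]
    for k in range(1, order + 1):
        cols.append(np.cos(2 * np.pi * k * phase))
        cols.append(np.sin(2 * np.pi * k * phase))
    A = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(A, mag, rcond=None)
    return coeffs

# Synthetic light curve: mean 12.0 mag with a 0.4 mag first-harmonic term.
phase = np.linspace(0, 1, 200, endpoint=False)
mag = 12.0 + 0.4 * np.cos(2 * np.pi * phase)
coeffs = fourier_fit(phase, mag)
print(np.round(coeffs[:2], 3))  # recovers mean ≈ 12.0 and amplitude ≈ 0.4
```

Features such as the amplitude ratio a2/a1 or relative phases, combined with the period, are what the random-forest classifier consumes.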
Chemical and biological reactions of solidification of peat using ordinary portland cement (OPC) and coal ashes
Construction over peat areas has often posed a challenge to geotechnical engineers. After decades of study on peat stabilisation techniques, no definitive formulation or guideline has yet been established to handle this issue. Some researchers have proposed solidification of peat, but a few have also discovered that solidified peat seems to lose strength after a certain period of time. Therefore, understanding the chemical and biological reactions behind peat solidification is vital to understanding the limitations of this treatment technique. In this study, all three types of peat (fibric, hemic and sapric) were mixed using the Mixing 1 and Mixing 2 formulations, which consisted of ordinary Portland cement, fly ash and bottom ash at various ratios. The peat-binder-filler mixtures were subjected to the unconfined compressive strength (UCS) test, a bacterial count test and chemical elemental analysis using XRF, XRD, FTIR and EDS. Two patterns of strength over the curing period were observed. Mixing 1 samples showed a steady increase in strength over the curing period until Day 56, while Mixing 2 samples showed a decrease in strength at Day 28 and Day 56. Samples whose strength increased steadily had lower bacterial counts and enzymatic activity, with an increased quantity of crystallites. Samples with lower strength recorded higher bacterial counts and enzymatic activity, with fewer crystallites. XRD analysis showed that pargasite (NaCa2[Mg4Al](Si6Al2)O22(OH)2) formed in the higher-strength samples, while in the lower-strength samples pargasite was predicted to be converted into monosodium phosphate and Mg(OH)2 as the bacterial consortium was re-activated. The Michaelis-Menten coefficient, Km, of the biochemical reaction in solidified peat was calculated as 303.60. This showed that the reaction occurring during the solidification work was inefficient. The kinetics of crystallite formation under enzymatic effects is modelled as 135.42(1/[S] + 0.44605), which means that when less pargasite is formed, more enzyme is secreted.
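The interpretation of the large Km reported above follows from the Michaelis-Menten rate law. A minimal sketch using the Km from the abstract (303.60); Vmax here is a hypothetical value chosen only for demonstration:

```python
def mm_velocity(s, km=303.60, vmax=1.0):
    """Michaelis-Menten reaction velocity: v = Vmax * [S] / (Km + [S]).

    km is taken from the study above; vmax is an illustrative assumption.
    """
    return vmax * s / (km + s)

# By definition the reaction runs at half its maximum rate when [S] = Km,
# so a large Km means a large substrate concentration is needed before the
# reaction approaches saturation -- the inefficiency noted above.
print(round(mm_velocity(303.60), 3))  # → 0.5, i.e. Vmax/2 at [S] = Km
```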