724,788 research outputs found
Distributed Online Big Data Classification Using Context Information
Distributed, online data mining systems have emerged as a result of
applications requiring analysis of large amounts of correlated and
high-dimensional data produced by multiple distributed data sources. We propose
a distributed online data classification framework where data is gathered by
distributed data sources and processed by a heterogeneous set of distributed
learners which learn online, at run-time, how to classify the different data
streams either by using their locally available classification functions or by
helping each other by classifying each other's data. Importantly, since the
data is gathered at different locations, sending the data to another learner to
process incurs additional costs such as delays, and hence this will be only
beneficial if the benefits obtained from a better classification will exceed
the costs. We model the problem of joint classification by the distributed and
heterogeneous learners from multiple data sources as a distributed contextual
bandit problem where each data is characterized by a specific context. We
develop a distributed online learning algorithm for which we can prove
sublinear regret. Compared to prior work in distributed online data mining, our
work is the first to provide analytic regret results characterizing the
performance of the proposed algorithm
Semantic HMC for Big Data Analysis
Analyzing Big Data can help corporations to im-prove their efficiency. In
this work we present a new vision to derive Value from Big Data using a
Semantic Hierarchical Multi-label Classification called Semantic HMC based in a
non-supervised Ontology learning process. We also proposea Semantic HMC
process, using scalable Machine-Learning techniques and Rule-based reasoning
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involve massive data but they also often include online data and data
heterogeneity. Recently some statistical methods have been adapted to process
Big Data, like linear regression models, clustering methods and bootstrapping
schemes. Based on decision trees combined with aggregation and bootstrap ideas,
random forests were introduced by Breiman in 2001. They are a powerful
nonparametric statistical method allowing to consider in a single and versatile
framework regression problems, as well as two-class and multi-class
classification problems. Focusing on classification problems, this paper
proposes a selective review of available proposals that deal with scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. Then, we formulate various remarks for random
forests in the Big Data context. Finally, we experiment five variants on two
massive datasets (15 and 120 millions of observations), a simulated one as well
as real world data. One variant relies on subsampling while three others are
related to parallel implementations of random forests and involve either
various adaptations of bootstrap to Big Data or to "divide-and-conquer"
approaches. The fifth variant relates on online learning of random forests.
These numerical experiments lead to highlight the relative performance of the
different variants, as well as some of their limitations
- …