
11 research outputs found

Incremental Decision Tree based on order statistics

Author: Lemaire Vincent
Salperwyck Christophe
Publication venue: HAL CCSD
Publication date: 01/01/2012

New application domains generate data that are no longer persistent but volatile: network management, web profile modeling, and so on. These data arrive quickly and massively and are visible only once, so they must be learned in their order of arrival. For classification problems, online decision trees are known to perform well and are widely used on streaming data. In this paper, we propose a new decision tree method based on order statistics. The construction of an online tree usually requires summaries in the leaves. Our solution uses bounded-error quantile summaries. A robust and effective discretization or grouping method uses these summaries to provide, at the same time, a criterion to find the best split and better density estimates. These estimates are then used to build a naïve Bayes classifier in the leaves to improve prediction in the early learning stage.

HAL - Lille 3

CiteSeerX

INRIA a CCSD electronic archive server

Stumping along a Summary for Exploration & Exploitation Challenge 2011

Author: Salperwyck Christophe
Urvoy Tanguy
Publication venue: HAL CCSD
Publication date: 02/07/2011

The Pascal Exploration & Exploitation challenge 2011 seeks to evaluate algorithms for the online website content selection problem. This article presents the solution we used to achieve second place in this challenge and some side experiments we performed. The methods we evaluated are all structured in three layers. The first layer provides an online summary of the data stream for continuous and nominal data. Continuous data are handled using an online quantile summary. Nominal data are summarized with a hash-based counting structure. With these techniques, we managed to build an accurate stream summary with a small memory footprint. The second layer uses the summary to build predictors. We exploited several kinds of trees, from simple decision stumps to deep multivariate ones. For the last layer, we explored several combination strategies: online bagging, exponential weighting, linear ranker, and simple averaging.

HAL - Lille 3

INRIA a CCSD electronic archive server
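
The abstract of "Stumping along a Summary for Exploration & Exploitation Challenge 2011" mentions a hash-based counting structure for nominal data but does not name it. As an illustration only, the sketch below uses a count-min sketch, one well-known structure of this kind; it is an assumption, not necessarily the authors' choice. The class name `CountMinSketch`, its `width` and `depth` parameters, and the sample page names are all hypothetical.

```python
import hashlib


class CountMinSketch:
    """Approximate counter for nominal values in fixed memory (illustrative only)."""

    def __init__(self, width=256, depth=4):
        self.width = width    # counters per row (hypothetical default)
        self.depth = depth    # number of hash rows (hypothetical default)
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # Derive one hash per row by salting the value with the row number.
        data = f"{row}:{value}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, value, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, value)] += count

    def estimate(self, value):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(row, value)]
            for row in range(self.depth)
        )


sketch = CountMinSketch()
for page in ["home", "news", "home", "sports", "home"]:
    sketch.add(page)

print(sketch.estimate("home"))
```

Memory use depends only on `width` and `depth`, not on the number of distinct values seen, which matches the small memory footprint the abstract describes. The price is that estimates may be too high when values collide.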

Apprentissage incrémental en ligne sur flux de données

Author: Salperwyck Christophe
Publication venue: HAL CCSD
Publication date: 30/11/2012

Statistical learning provides numerous algorithms to build predictive models from past observations. These techniques have proved their ability to deal with large-scale realistic problems. However, new domains generate more and more data which are visible only once and need to be processed sequentially. These volatile data, known as data streams, come from telecommunication network management, social networks, and web mining. The challenge is to build new algorithms able to learn under these constraints. We propose to build new summaries for supervised classification. Our summaries are based on two levels. The first level is an online incremental summary which requires little processing and addresses the precision/memory tradeoff. The second level uses the first-level summary to build the final summary with an efficient offline method. Building these summaries is a pre-processing stage for developing new classifiers for data streams. We propose new versions of the naive Bayes and decision tree classifiers using our summaries. As data streams might contain concept drifts, we also propose a new technique to detect these drifts and update the classifiers accordingly.

Statistical learning offers a wide range of techniques capable of building predictive models from past observations. These techniques have shown their ability to handle large volumes of data on real-world problems. However, new applications generate more and more data that are visible only as a stream and must be processed sequentially. Such applications include telecommunication network management, user modeling within a social network, and web mining. One of the technical challenges is to design algorithms that can learn under the new constraints imposed by data streams. We first address this problem by proposing new techniques for summarizing data streams in the context of supervised learning. Our method consists of two levels. The first level uses incremental online summarization techniques for streams that take memory and processor resources into account and come with error guarantees. The second level uses the small summaries produced by the first level to build the final summary with an effective offline supervised method. These summaries serve as a preprocessing step that allows us to propose new versions of the naive Bayes classifier and of decision trees that operate online on data streams. Data streams may not be stationary and may contain concept changes. We also propose a new technique to detect these changes and update our classifiers.

INRIA a CCSD electronic archive server

Incremental online learning on data streams

Author: Salperwyck Christophe
Publication date: 30/11/2012

Statistical learning offers a wide range of techniques capable of building predictive models from past observations. These techniques have shown their ability to handle large volumes of data on real-world problems. However, new applications generate more and more data that are visible only as a stream and must be processed sequentially. Such applications include telecommunication network management, user modeling within a social network, and web mining. One of the technical challenges is to design algorithms that can learn under the new constraints imposed by data streams. We first address this problem by proposing new techniques for summarizing data streams in the context of supervised learning. Our method consists of two levels. The first level uses incremental online summarization techniques for streams that take memory and processor resources into account and come with error guarantees. The second level uses the small summaries produced by the first level to build the final summary with an effective offline supervised method. These summaries serve as a preprocessing step that allows us to propose new versions of the naive Bayes classifier and of decision trees that operate online on data streams. Data streams may not be stationary and may contain concept changes. We also propose a new technique to detect these changes and update our classifiers.

Statistical learning provides numerous algorithms to build predictive models from past observations. These techniques have proved their ability to deal with large-scale realistic problems. However, new domains generate more and more data which are visible only once and need to be processed sequentially. These volatile data, known as data streams, come from telecommunication network management, social networks, and web mining. The challenge is to build new algorithms able to learn under these constraints. We propose to build new summaries for supervised classification. Our summaries are based on two levels. The first level is an online incremental summary which requires little processing and addresses the precision/memory tradeoff. The second level uses the first-level summary to build the final summary with an efficient offline method. Building these summaries is a pre-processing stage for developing new classifiers for data streams. We propose new versions of the naive Bayes and decision tree classifiers using our summaries. As data streams might contain concept drifts, we also propose a new technique to detect these drifts and update the classifiers accordingly.

Theses.fr

Apprentissage incrémental en ligne sur flux de données

Author: Salperwyck Christophe
Publication venue: HAL CCSD
Publication date: 30/11/2012

HAL - Lille 3

Thèses en Ligne

INRIA a CCSD electronic archive server

GhostDB: Hiding Data from Prying Eyes

Author: Christophe Salperwyck
Dennis Shasha
Luc Bouganim
Mehdi Benzine
Nicolas Anciaux

Imagine that you have been entrusted with private data, such as corporate product information, sensitive government information, or symptom and treatment information about hospital patients. You may want to issue queries whose result will combine private and public data, but private data must not be revealed, say, to the prying eyes of some insurance fraudster. GhostDB is an architecture and system to achieve this. You carry private data in a smart USB device (a large Flash persistent store combined with a tamper and snoop-resistant CPU and small RAM). When the key is plugged in, you can issue queries that link private and public data and be sure that the only information revealed to a potential spy is which queries you pose and the public data you access. Queries linking public and private data entail novel distributed processing techniques on extremely unequal devices (standard computer and smart USB device) in which data flows in only one direction: from public to private. This demonstration shows GhostDB’s query processing in action.

CiteSeerX

GhostDB: Hiding Data from Prying Eyes

Author: Anciaux Nicolas
Benzine Mehdi
Bouganim Luc
Pucheral Philippe
Salperwyck Christophe
Shasha Dennis
Publication venue: HAL CCSD
Publication date: 01/09/2007

Imagine that you have been entrusted with private data, such as corporate product information, sensitive government information, or symptom and treatment information about hospital patients. You may want to issue queries whose result will combine private and public data, but private data must not be revealed, say, to the prying eyes of some insurance fraudster. GhostDB is an architecture and system to achieve this. You carry private data in a smart USB device (a large Flash persistent store combined with a tamper and snoop-resistant CPU and small RAM). When the key is plugged in, you can issue queries that link private and public data and be sure that the only information revealed to a potential spy is which queries you pose and the public data you access. Queries linking public and private data entail novel distributed processing techniques on extremely unequal devices (standard computer and smart USB device) in which data flows in only one direction: from public to private. This demonstration shows GhostDB's query processing in action.

INRIA a CCSD electronic archive server

HAL UVSQ