VIPE: A NEW INTERACTIVE CLASSIFICATION FRAMEWORK FOR LARGE SETS OF SHORT TEXTS - APPLICATION TO OPINION MINING
This paper presents a new interactive opinion mining tool that helps users classify large sets of short texts originating from Web opinion polls, technical forums, or Twitter. Starting from a manual multi-label pre-classification of a very limited subset of the texts, a learning algorithm predicts the labels of the remaining texts in the corpus and identifies the texts most likely to be associated with a selected label. By relying on a fast matrix factorization, the algorithm can handle large corpora and is well suited to interactive use, integrating the corrections proposed by users on the fly. Experimental results on classical datasets of various sizes, together with feedback from users in the marketing services of the telecommunications company Orange, confirm the quality of the results.
Supervised Feature Space Reduction for Multi-Label Nearest Neighbors
Because it can address many real-world problems, multi-label classification has received considerable attention in recent years, and the instance-based ML-kNN classifier is today considered one of the most efficient. However, it is sensitive to noisy and redundant features, and its performance decreases as data dimensionality increases. Dimensionality reduction is one way to overcome these problems, but current methods optimize reduction objectives that ignore the impact on ML-kNN classification. We propose ML-ARP, a novel dimensionality reduction algorithm that uses a variable neighborhood search metaheuristic to learn a linear projection of the feature space that specifically optimizes the ML-kNN classification loss. Numerical comparisons confirm that ML-ARP outperforms both ML-kNN without data processing and four standard multi-label dimensionality reduction algorithms.
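The idea of scoring a linear projection by the loss of a nearest-neighbour multi-label classifier can be illustrated with a small sketch. It is not ML-ARP itself: the projection matrix P is fixed by hand rather than learned by variable neighborhood search, the classifier is a plain majority vote among neighbours rather than the full ML-kNN rule, and the toy data and all function names are assumptions made for illustration.

```python
from math import dist

def project(x, P):
    """Apply the linear map P (a list of rows) to the feature vector x."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]

def knn_multilabel_predict(train_X, train_Y, x, k, P):
    """Predict a label vector by majority vote among the k nearest projected neighbours.

    Simplified stand-in for ML-kNN (assumption): a label is predicted when
    more than half of the k neighbours carry it.
    """
    z = project(x, P)
    order = sorted(range(len(train_X)), key=lambda i: dist(project(train_X[i], P), z))
    neighbours = order[:k]
    n_labels = len(train_Y[0])
    votes = [sum(train_Y[i][j] for i in neighbours) for j in range(n_labels)]
    return [1 if 2 * v > k else 0 for v in votes]

def hamming_loss(y_true, y_pred):
    """Fraction of label positions on which the two label vectors disagree."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy data (hypothetical): feature 0 is informative, feature 1 is noise.
train_X = [[0.0, 5.5], [0.1, -4.0], [1.0, 4.8], [0.9, -4.2]]
train_Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
query, query_labels = [0.4, 4.8], [1, 0]

identity = [[1.0, 0.0], [0.0, 1.0]]   # no reduction
keep_first = [[1.0, 0.0]]             # hand-picked projection dropping the noisy feature

pred_full = knn_multilabel_predict(train_X, train_Y, query, 1, identity)
pred_reduced = knn_multilabel_predict(train_X, train_Y, query, 1, keep_first)
print(hamming_loss(query_labels, pred_full), hamming_loss(query_labels, pred_reduced))
```

In this toy setting the noisy feature dominates the distance in the original space, so a search procedure that compares candidate projections by the classifier's loss would prefer the reduced one.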
Incremental learning strategies for credit cards fraud detection.
Every second, thousands of credit or debit card transactions are processed in financial institutions. This extensive amount of data and its sequential nature make the problem of fraud detection particularly challenging. Most analytical strategies used in production are still based on batch learning, which is inadequate for two reasons: models quickly become outdated, and they require the storage of sensitive data. The evolving nature of bank fraud underscores the importance of having up-to-date models, and sensitive data retention makes companies vulnerable to infringements of the European General Data Protection Regulation. For these reasons, evaluating incremental learning strategies is recommended. This paper designs and evaluates incremental learning solutions for real-world fraud detection systems. The aim is to demonstrate the competitiveness of incremental learning over conventional batch approaches and, consequently, to improve its accuracy by employing ensemble learning, diversity, and transfer learning. An experimental analysis is conducted on a full-scale case study including five months of e-commerce transactions, made available by our industry partner, Worldline.
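The contrast with batch learning can be made concrete with a minimal sketch of an incremental learner that updates on one transaction at a time and then discards it. This is a generic online logistic regression, not the solutions evaluated in the paper; the class name, the single scaled feature, the learning rate, and the synthetic stream are all assumptions, and the ensemble, diversity, and transfer learning components are not shown.

```python
import math

class OnlineLogisticRegression:
    """Logistic regression trained by stochastic gradient descent, one example at a time."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Estimated probability that the transaction x is fraudulent."""
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Update the model with a single labelled transaction; nothing is stored."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * v
                        for w, v in zip(self.weights, x)]
        self.bias -= self.learning_rate * error

# Synthetic stream (hypothetical): one scaled feature, high values are fraudulent.
stream = [([0.0], 0), ([1.0], 1)] * 200

model = OnlineLogisticRegression(n_features=1)
for features, label in stream:
    model.learn_one(features, label)   # the example can be discarded after this call

print(model.predict_proba([0.0]), model.predict_proba([1.0]))
```

Because each update needs only the current example, the model stays current as the stream evolves and no transaction history has to be retained.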
Transfer learning for credit card fraud detection : A journey from research to production.
The dark side of the spread of digital commerce is the increase in fraud attempts. To prevent any type of attack, state-of-the-art fraud detection systems now embed Machine Learning (ML) modules. The design of such modules is communicated only at the research level, and papers mostly focus on results for isolated benchmark datasets and metrics. But research is only one part of the journey: it is preceded by the right formulation of the business problem and the collection of data, and followed by practical integration. In this paper, we give a wider view of the process through a case study of transfer learning for fraud detection, from business to research and back to business.
Apprentissage multi-label extrême : Comparaisons d'approches et nouvelles propositions
Stimulated by applications such as document or image annotation, multi-label learning has attracted strong interest over the last decade. However, standard algorithms cannot cope with the volumes of recent extreme multi-label (XML) data, where the number of labels can reach millions. This thesis explores three directions for addressing the time and memory complexity of the problem: multi-label dimension reduction, optimization and implementation tricks, and tree-based methods. It proposes to unify the reduction approaches through a typology and two generic formulations, and to identify the most efficient ones through an original meta-analysis of results from the literature. A new approach is developed to analyze the benefit of coupling the reduction problem with the classification problem. To reduce the memory complexity of a classical one-vs-rest regression model while maintaining its predictive performance, we also propose an algorithm for estimating the largest useful parameters, which follows a strategy inspired by data stream analysis. Finally, we present a new algorithm called CRAFTML that learns an ensemble of diversified decision trees. Each tree performs a joint random reduction of the feature and label spaces and implements a very fast recursive partitioning strategy. CRAFTML performs better than other XML tree-based methods and is competitive with the most accurate methods that require supercomputers. The contributions of the thesis are completed by the presentation of a software tool called VIPE, developed with Orange Labs for multi-label opinion analysis.
Extreme multi-label learning : comparisons of approaches and new proposals
Stimulated by applications such as document or image annotation, multi-label learning has attracted strong interest over the last decade. However, standard algorithms cannot cope with the volumes of recent extreme multi-label (XML) data, where the number of labels can reach millions. This thesis explores three directions for addressing the time and memory complexity of the problem: multi-label dimension reduction, optimization and implementation tricks, and tree-based methods. It proposes to unify the reduction approaches through a typology and two generic formulations, and to identify the most efficient ones through an original meta-analysis of results from the literature. A new approach is developed to analyze the benefit of coupling the reduction problem with the classification problem. To reduce the memory complexity of a classical one-vs-rest regression model while maintaining its predictive performance, we also propose an algorithm for estimating the largest useful parameters, which follows a strategy inspired by data stream analysis. Finally, we present a new algorithm called CRAFTML that learns an ensemble of diversified decision trees. Each tree performs a joint random reduction of the feature and label spaces and implements a very fast recursive partitioning strategy. CRAFTML performs better than other XML tree-based methods and is competitive with the most accurate methods that require supercomputers. The contributions of the thesis are completed by the presentation of a software tool called VIPE, developed with Orange Labs for multi-label opinion analysis.
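As a rough illustration of what a joint random reduction of the feature and label spaces followed by one partitioning step might look like, the sketch below builds a single tree node. It is not the CRAFTML implementation: the sign-hashing projection, the use of 2-means on the reduced labels, the reduced dimensions, the toy data, and every function name are assumptions chosen to keep the example short.

```python
import random

def random_sign_projection(n_in, n_out, rng):
    """Send each input coordinate to one random output coordinate with a random sign."""
    return [(rng.randrange(n_out), rng.choice((-1.0, 1.0))) for _ in range(n_in)]

def apply_projection(projection, vector, n_out):
    """Reduce a vector to n_out dimensions using the projection above."""
    reduced = [0.0] * n_out
    for value, (index, sign) in zip(vector, projection):
        reduced[index] += sign * value
    return reduced

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def split_node(X, Y, d_x, d_y, rng, n_iter=5):
    """One partitioning step: project both spaces, then run 2-means on the reduced labels."""
    proj_x = random_sign_projection(len(X[0]), d_x, rng)
    proj_y = random_sign_projection(len(Y[0]), d_y, rng)
    Xp = [apply_projection(proj_x, x, d_x) for x in X]
    Yp = [apply_projection(proj_y, y, d_y) for y in Y]
    centres = rng.sample(Yp, 2)            # initial centres drawn from the data
    sides = [0] * len(Yp)
    for _ in range(n_iter):
        sides = [0 if squared_distance(y, centres[0]) <= squared_distance(y, centres[1]) else 1
                 for y in Yp]
        for c in (0, 1):
            members = [y for y, s in zip(Yp, sides) if s == c]
            if members:                    # keep the old centre if a side is empty
                centres[c] = centroid(members)
    # Centroids in the reduced feature space could later route unseen instances.
    feature_centroids = {}
    for c in (0, 1):
        members = [x for x, s in zip(Xp, sides) if s == c]
        if members:
            feature_centroids[c] = centroid(members)
    return proj_x, feature_centroids, sides

# Toy data (hypothetical): 6 features, 4 labels, two groups of instances.
X = [[1, 0, 0, 1, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 1, 1]]
Y = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

proj_x, feature_centroids, sides = split_node(X, Y, d_x=3, d_y=2, rng=random.Random(0))
print(sides)
```

A full tree would apply this step recursively to each side until a stopping criterion is met, and an ensemble would repeat the construction with different random projections. Those parts are omitted here.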