19 research outputs found

Automated data pre-processing via meta-learning

Author: A Guazzelli
A Kalousis
D Pyle
F Serban
J Vanschoren
J-U Kietz
M Hall
MA Munson
SF Crone
T Dasu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

The final publication is available at link.springer.com.

A data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. As a matter of fact, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives, and non-experienced users become overwhelmed. We show that this problem can be addressed by an automated approach, leveraging ideas from meta-learning. Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result of the algorithm on the respective dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results.

Peer Reviewed. Postprint (published version).

Crossref

UPCommons. Portal del coneixement obert de la UPC

PRESISTANT: Learning based assistant for data pre-processing

Author: Abelló Alberto
Aluja-Banet Tomàs
Bilalli Besim
Wrembel Robert
Publication date: 02/03/2018

Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators and find it challenging to identify operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of 5 different classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
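The idea of ranking pre-processing operators by their predicted impact can be illustrated with a minimal sketch. This is not the PRESISTANT implementation: the abstract names Random Forests as the learner, whereas the sketch below swaps in a simple k-nearest-neighbour average so that it runs with the Python standard library alone. The meta-features, operator names, and accuracy deltas are all invented for illustration.

```python
from math import dist

# Hypothetical meta-dataset: (dataset meta-features, operator, observed change in accuracy).
# Meta-features here are (fraction of numeric attributes, fraction of missing values).
# All values are made up for illustration.
META_EXAMPLES = [
    ((0.9, 0.00), "discretize", -0.02),
    ((0.9, 0.00), "normalize", 0.03),
    ((0.2, 0.30), "impute_missing", 0.05),
    ((0.2, 0.30), "normalize", 0.01),
    ((0.8, 0.10), "discretize", -0.01),
    ((0.8, 0.10), "impute_missing", 0.02),
    ((0.1, 0.00), "discretize", 0.04),
    ((0.1, 0.00), "impute_missing", 0.00),
]


def predict_impact(meta_features, operator, k=2):
    """Average the observed impact of `operator` on the k most similar known datasets.

    Stand-in for a learned meta-model (the abstract uses Random Forests).
    """
    candidates = [(dist(meta_features, features), delta)
                  for features, op, delta in META_EXAMPLES if op == operator]
    nearest = sorted(candidates)[:k]
    return sum(delta for _, delta in nearest) / len(nearest)


def rank_operators(meta_features, k=2):
    """Return (operator, predicted impact) pairs, most beneficial first."""
    operators = sorted({op for _, op, _ in META_EXAMPLES})
    scored = [(op, predict_impact(meta_features, op, k)) for op in operators]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Rank operators for a new, mostly categorical dataset with many missing values.
ranking = rank_operators((0.15, 0.25))
for operator, impact in ranking:
    print(f"{operator}: {impact:+.3f}")
```

The design choice worth noting is that operators are ranked by predicted effect on the final analysis rather than filtered by whether they are merely applicable to the dataset, which is the distinction the abstract draws.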

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

ASlib: A Benchmark Library for Algorithm Selection

Author: Bischl Bernd
Frechette Alexandre
Hoos Holger
Hutter Frank
Kerschke Pascal
Kotthoff Lars
Leyton-Brown Kevin
Lindauer Marius
Malitsky Yuri
Tierney Kevin
Vanschoren Joaquin
Publication date: 01/01/2016

The task of algorithm selection involves choosing an algorithm from a set of algorithms on a per-instance basis in order to exploit the varying performance of algorithms over a set of instances. The algorithm selection problem is attracting increasing attention from researchers and practitioners in AI. Years of fruitful applications in a number of domains have resulted in a large amount of data, but the community lacks a standard format or repository for this data. This situation makes it difficult to share and compare different approaches effectively, as is done in other, more established fields. It also unnecessarily hinders new researchers who want to work in this area. To address this problem, we introduce a standardized format for representing algorithm selection scenarios and a repository that contains a growing number of data sets from the literature. Our format has been designed to be able to express a wide variety of different scenarios. Demonstrating the breadth and power of our platform, we describe a set of example experiments that build and evaluate algorithm selection models through a common interface. The results display the potential of algorithm selection to achieve significant performance improvements across a broad range of problems and algorithms.

Comment: Accepted to be published in Artificial Intelligence Journal
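Per-instance algorithm selection, as described in the abstract above, can be contrasted with always running the single best algorithm. The sketch below is only an illustration under stated assumptions: the data layout is not the ASlib scenario format, the solver names and runtimes are invented, and the selector is a plain nearest-neighbour lookup rather than any model from the paper.

```python
from math import dist

# Hypothetical scenario: each training instance has a feature vector and the
# runtime (seconds) of every candidate algorithm on it. Values are invented.
TRAINING = [
    ((1.0, 0.1), {"solver_a": 2.0, "solver_b": 9.0}),
    ((0.9, 0.2), {"solver_a": 3.0, "solver_b": 8.0}),
    ((0.1, 0.9), {"solver_a": 10.0, "solver_b": 1.0}),
    ((0.2, 1.0), {"solver_a": 12.0, "solver_b": 2.0}),
]


def single_best(training):
    """Algorithm with the lowest total runtime over all training instances."""
    algorithms = training[0][1].keys()
    return min(algorithms,
               key=lambda name: sum(runtimes[name] for _, runtimes in training))


def select_per_instance(features, training):
    """Pick the algorithm that was fastest on the most similar training instance."""
    _, runtimes = min(training, key=lambda item: dist(item[0], features))
    return min(runtimes, key=runtimes.get)


baseline = single_best(TRAINING)
choice_first = select_per_instance((0.95, 0.1), TRAINING)
choice_second = select_per_instance((0.15, 0.9), TRAINING)
print(baseline, choice_first, choice_second)
```

In this toy data, the single best solver is slower on half of the instances, while the per-instance selector switches solvers depending on the instance features. That gap is the varying performance that algorithm selection aims to exploit.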

arXiv.org e-Print Archive

Repository TU/e

Crossref

Pure OAI Repository

Publications at Bielefeld University

Intelligent assistance for data pre-processing

Author: Abelló Gamazo Alberto
Aluja Banet Tomàs
Bilalli Besim
Wrembel Robert
Publication venue: Elsevier BV
Publication date: 01/01/2017

A data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. Typically, a dataset needs to be pre-processed before being mined. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives. As a consequence, non-experienced users become overwhelmed with pre-processing alternatives. In this paper, we show that the problem can be addressed by automating the pre-processing with the support of meta-learning. To this end, we analyzed a wide range of data pre-processing techniques and a set of classification algorithms. For each classification algorithm that we consider and a given dataset, we are able to automatically suggest the transformations that improve the quality of the results of the algorithm on the dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results.

Postprint (author's final draft)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Data Analytics on Online Labor Markets: Opportunities and Challenges

Author: Bernstein Abraham
Feldman Michael
Juldaschewa Frida
Publication date: 06/07/2017

The data-driven economy has led to a significant shortage of data scientists. To address this shortage, this study explores the prospects of outsourcing data analysis tasks to freelancers available on online labor markets (OLMs) by identifying the essential factors for this endeavor. Specifically, we explore the skills required from freelancers, collect information about the skills present on major OLMs, and identify the main hurdles for out-/crowd-sourcing data analysis. Adopting a sequential mixed-method approach, we interviewed 20 data scientists and subsequently surveyed 80 respondents from OLMs. Besides confirming the need for expected skills such as technical/mathematical capabilities, the study also identifies lesser-known ones such as domain understanding, an eye for aesthetic data visualization, good communication skills, and a natural understanding of the possibilities/limitations of data analysis in general. Finally, it elucidates obstacles for crowdsourcing, such as communication overhead, knowledge gaps, quality assurance, and data confidentiality, which need to be mitigated.

ZORA

arXiv.org e-Print Archive

An intelligent decision support system for machine learning algorithms recommendation

Author: Mihai Andrei
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2017

Machine learning is a central topic in Artificial Intelligence and in computer science in general, and its use in Big Data problems is well known. However, Big Data and machine learning problems are varied and need different kinds of solutions, and there are likewise many different machine learning methods that can be used. In this work, we propose an application that might help a user decide which machine learning methods are needed for a specified problem. The application is an Intelligent Decision Support System for Machine Learning Algorithm Recommendation. We present its design, which is centered on the combined use of Case-Based Reasoning and Rule-Based Reasoning for the recommending process, while also trying to make the system easy to use and manage. We present a prototype of such a system and the implementation details of the two recommender algorithms. Preliminary testing of the prototype shows it to be a promising tool.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC