
    PRESISTANT: Learning based assistant for data pre-processing

    Data pre-processing is one of the most time-consuming and consequential steps in a data analysis process (e.g., a classification task). A given pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of available operators, and it is challenging for them to find those that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
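
    The following minimal sketch illustrates the meta-learning idea described in this abstract, not the authors' actual implementation: a Random Forest is trained to predict how much a pre-processing operator changes a classifier's accuracy from dataset meta-features, and candidate operators are then ranked by that prediction. The names and the synthetic data are assumptions made purely for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in for real meta-learning data: each row holds dataset
    # meta-features (e.g., #instances, #attributes, skewness, ...) plus an
    # operator identifier; the target is the observed change in accuracy.
    rng = np.random.default_rng(0)
    n_meta_features = 10
    meta_X = rng.normal(size=(500, n_meta_features + 1))
    meta_y = rng.normal(size=500)

    impact_model = RandomForestRegressor(n_estimators=200, random_state=0)
    impact_model.fit(meta_X, meta_y)

    def rank_operators(dataset_meta_features, candidate_operator_ids):
        """Rank candidate pre-processing operators by predicted accuracy impact."""
        rows = np.array([np.append(dataset_meta_features, op) for op in candidate_operator_ids])
        gains = impact_model.predict(rows)
        order = np.argsort(gains)[::-1]  # highest predicted gain first
        return [(candidate_operator_ids[i], gains[i]) for i in order]

    print(rank_operators(rng.normal(size=n_meta_features), [0, 1, 2, 3]))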

    Intelligent assistance for data pre-processing

    A data mining algorithm may perform differently on datasets with different characteristics; for example, it might perform better on a dataset with continuous attributes than on one with categorical attributes, or the other way around. Typically, a dataset needs to be pre-processed before being mined. Taking into account all the possible pre-processing operators, there is a staggeringly large number of alternatives, and inexperienced users quickly become overwhelmed by them. In this paper, we show that the problem can be addressed by automating the pre-processing with the support of meta-learning. To this end, we analyzed a wide range of data pre-processing techniques and a set of classification algorithms. For each classification algorithm that we consider and a given dataset, we are able to automatically suggest the transformations that improve the quality of the algorithm's results on that dataset. Our approach helps non-expert users identify the transformations appropriate to their applications more effectively, and hence achieve improved results.
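
    A hedged sketch of how the per-transformation impact such a recommender relies on could be measured for one classifier/dataset pair: each candidate transformation is scored by the cross-validated accuracy change it yields. The dataset, classifier, and candidate transformations below are illustrative choices, not the paper's experimental setup.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PowerTransformer, QuantileTransformer, StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    baseline = cross_val_score(GaussianNB(), X, y, cv=5).mean()

    # Candidate pre-processing operators to evaluate for this classifier/dataset pair.
    candidates = {
        "standardize": StandardScaler(),
        "power": PowerTransformer(),
        "quantile": QuantileTransformer(n_quantiles=100),
    }
    for name, transform in candidates.items():
        score = cross_val_score(make_pipeline(transform, GaussianNB()), X, y, cv=5).mean()
        print(f"{name}: accuracy change vs. baseline {score - baseline:+.4f}")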

    An alternative view on data processing pipelines from the DOLAP 2019 perspective

    Data science requires constructing data processing pipelines (DPPs), which span diverse phases such as data integration, cleaning, pre-processing, and analysis. However, current solutions lack a strong data engineering perspective. As a consequence, DPPs are error-prone and inefficient with respect to both human effort and execution time. We claim that DPP design, development, testing, deployment, and execution should benefit from a standardized DPP architecture and from well-known data engineering solutions. This claim is supported by our experience in real projects and by trends in the field, and it opens new paths for research and technology. In this spirit, we outline five research opportunities that represent novel trends towards building DPPs. Finally, we note that the best DOLAP 2019 papers selected for the DOLAP 2019 Information Systems Special Issue fall into this category, which highlights the relevance of advanced data engineering for data science.

    Towards a Hybrid Imputation Approach Using Web Tables

    Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further reinforced. While traditionally the process of filling in missing values is addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on that, chooses the best imputation approach, i.e., either a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus of 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system that is robust, accurate, and has high coverage at the same time.
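
    The sketch below illustrates the hybrid idea in miniature: numeric columns are filled with a regression-style (iterative) imputer, while remaining missing cells fall back to an external lookup. The lookup function, the example table, and the column names are hypothetical placeholders; the paper's actual system retrieves and matches Web tables.

    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to use IterativeImputer)
    from sklearn.impute import IterativeImputer

    def lookup_from_web_tables(entity, attribute):
        """Hypothetical stand-in for the paper's Web-table retrieval and matching system."""
        web_knowledge = {"berlin": {"country": "Germany"}}
        return web_knowledge.get(entity, {}).get(attribute)

    def hybrid_impute(df, key_column):
        numeric = df.select_dtypes(include="number").columns
        if len(numeric):
            # Statistical branch: regression-style imputation over numeric columns.
            df[numeric] = IterativeImputer(random_state=0).fit_transform(df[numeric])
        for col in df.columns.difference(numeric):
            # Lookup branch: consult an external source for the remaining missing cells.
            for idx in df.index[df[col].isna()]:
                value = lookup_from_web_tables(df.at[idx, key_column], col)
                if value is not None:
                    df.at[idx, col] = value
        return df

    cities = pd.DataFrame({
        "city": ["oslo", "berlin", "paris"],
        "country": ["Norway", None, "France"],
        "population_m": [0.7, 3.6, np.nan],
        "area_km2": [454.0, 891.0, 105.0],
    })
    print(hybrid_impute(cities, key_column="city"))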

    Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval

    Tables contain valuable knowledge in a structured form. We employ neural language modeling approaches to embed tabular data into vector spaces. Specifically, we consider different table elements, such as captions, column headings, and cells, for training word and entity embeddings. These embeddings are then utilized in three table-related tasks (row population, column population, and table retrieval) by incorporating them into existing retrieval models as additional semantic similarity signals. Evaluation results show that table embeddings can significantly improve upon the performance of state-of-the-art baselines. Published in the Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19), 2019.
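
    As a rough, assumption-laden sketch of the general technique (not the paper's training code), a word2vec-style model can be trained on tables serialized into sequences of caption, heading, and cell tokens; the tiny corpus and token naming scheme below are invented for illustration.

    from gensim.models import Word2Vec

    # Each "sentence" is one table serialized as caption, heading, and cell tokens.
    tables_corpus = [
        ["caption:nobel_laureates", "heading:year", "heading:laureate", "1921", "albert_einstein"],
        ["caption:capitals", "heading:country", "heading:capital", "norway", "oslo"],
    ]

    model = Word2Vec(sentences=tables_corpus, vector_size=100, window=5, min_count=1, sg=1)

    # The resulting vectors can serve as an extra semantic similarity signal,
    # e.g. when scoring candidate entities for row population.
    print(model.wv.most_similar("albert_einstein", topn=3))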

    Creation and management of versions in multiversion data warehouse

    A data warehouse (DW) provides information for analytical processing, decision making, and data mining tools. On the one hand, the structure and content of a data warehouse reflect the real world, i.e., data stored in a DW come from real production systems. On the other hand, a DW and its tools may be used for predicting trends and simulating virtual business scenarios. This activity is often called what-if analysis. Traditional DW systems have a static structure of their schemas and of the relationships between data, and therefore they are not able to support any dynamics in their structure and content. For these purposes, multiversion data warehouses seem very promising. In this paper we present a concept and an ongoing implementation of a multiversion data warehouse that is capable of handling changes in the structure of its schema as well as simulating alternative business scenarios.
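
    A toy sketch of the multiversion idea, under assumptions that go beyond the abstract (the class, the schema representation, and the version identifiers are invented for illustration): each warehouse version records its schema and can be branched into an alternative version for what-if analysis.

    from copy import deepcopy

    class MultiversionWarehouse:
        """Keeps every schema version together with the version it was derived from."""

        def __init__(self, base_schema):
            self.versions = {"R1": {"schema": deepcopy(base_schema), "parent": None}}

        def derive_version(self, parent_id, new_id, schema_changes):
            # Branch an alternative (what-if) version by altering a copy of the parent schema.
            schema = deepcopy(self.versions[parent_id]["schema"])
            schema.update(schema_changes)
            self.versions[new_id] = {"schema": schema, "parent": parent_id}

    dw = MultiversionWarehouse({"sales": ["date", "product", "amount"]})
    dw.derive_version("R1", "A1", {"sales": ["date", "product", "amount", "discount"]})
    print(dw.versions["A1"])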