646,412 research outputs found

Exploratory Data Mining and Data Cleaning

Author: Nicholas Cox

Data Cleaning

Author: UiT The Arctic University of Norway
Publication date: 25/09/2023

Course material for the webinar “Data Cleaning”, part of a webinar series on research data management (RDM) organized by UiT The Arctic University of Norway. For more information, please visit: site.uit.no/rdmtrainin

ZENODO

CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

Authors: Blase Jennifer; Chu Xu; Li Peng; Rao Xi; Zhang Ce; Zhang Yue
Publication date: 01/01/2020

Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning in isolation, without considering how the data is consumed by downstream ML analytics. We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we also control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers. Comment: published in ICDE 202
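The abstract names the Benjamini-Yekutieli (BY) procedure for false discovery rate control but does not spell it out. As a rough illustration only, the standard textbook form of the BY step-up rule can be sketched as follows. This is not the CleanML authors' code; the function name `benjamini_yekutieli` and the default `alpha=0.05` are assumptions made for this example.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard BY step-up rule: with sorted p-values p_(1) <= ... <= p_(m)
    and c(m) = 1 + 1/2 + ... + 1/m, find the largest rank k such that
    p_(k) <= k * alpha / (m * c(m)) and reject the k smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Harmonic correction that makes the rule valid under arbitrary dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))

    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank whose p-value falls under its threshold (step-up).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value at a higher rank passes; for example, `benjamini_yekutieli([0.02, 0.03])` rejects both hypotheses at `alpha=0.05`.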

arXiv.org e-Print Archive

Repository for Publications and Research Data

Data Cleaning Methods for Client and Proxy Logs

Authors: Herder E.; Obendorf H.; Weinreich H.
Publication venue: Dalhousie University
Publication date: 01/01/2006

In this paper we present our experiences with the cleaning of Web client and proxy usage logs, based on a long-term browsing study with 25 participants. A detailed clickstream log, recorded using a Web intermediary, was combined with a second log of user interface actions, which was captured by a modified Firefox browser for a subset of the participants. The consolidated data from both records revealed many page requests that were not directly related to user actions. For participants who had no ad-filtering system installed, these artifacts made up one third of all transferred Web pages. Three major causes could be identified: HTML Frames and iFrames, advertisements, and automatic page reloads. The experience gained during the data cleaning process might help other researchers choose adequate filtering methods for their data.

CiteSeerX

University of Twente Research Information

Cheetah Experimental Platform Web 1.0: Cleaning Pupillary Data

Authors: Maran Thomas; Neurauter Manuel; Pinggera Jakob; Weber Barbara; Zugal Stefan
Publication date: 01/01/2017

Recently, researchers have started using cognitive load in various settings, e.g., educational psychology, cognitive load theory, or human-computer interaction. Cognitive load characterizes a task's demand on the limited information processing capacity of the brain. The widespread adoption of eye-tracking devices has led to increased attention to objectively measuring cognitive load via pupil dilation. However, this approach requires a standardized data processing routine to measure cognitive load reliably. This technical report presents CEP-Web, an open-source platform providing state-of-the-art data processing routines for cleaning pupillary data, combined with a graphical user interface that enables the management of studies and subjects. Future developments will include support for analyzing the cleaned data as well as support for Task-Evoked Pupillary Response (TEPR) studies.

arXiv.org e-Print Archive

Online Research Database In Technology