CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there has been no rigorous study of how
cleaning affects ML: the ML community usually focuses on developing
algorithms that are robust to particular noise types of certain
distributions, while the database (DB) community has mostly studied the
problem of data cleaning alone, without considering how data is consumed by
downstream ML analytics. We propose CleanML, a study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both algorithms
commonly used in practice and state-of-the-art solutions from the academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we control the false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results
systematically to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.
Comment: published in ICDE 202
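The false discovery rate control mentioned in this abstract can be illustrated with a minimal sketch of the Benjamini-Yekutieli step-up procedure. This is not the CleanML code; the function name `benjamini_yekutieli`, the `alpha` default, and the example p-values are assumptions made for illustration only.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Illustrative sketch of the BY step-up procedure, which controls the
    false discovery rate under arbitrary dependence between tests.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction factor c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, not taken from the study.
example = benjamini_yekutieli([0.001, 0.008, 0.039, 0.041, 0.6])
```

With these made-up inputs only the two smallest p-values pass the BY threshold, which shows how much stricter the procedure is than testing each hypothesis at 0.05 on its own.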
The Relationship Between Ear Cleaning Behavior Patterns and the Incidence of Otitis Externa Among Factory Workers at PT Wijaya Karya Beton Pasuruan
Ear cleaning behavior patterns are the procedures people follow to remove dirt or foreign objects from the ear.
Cleaning the ears too often, or with tools such as cotton buds, can damage the lining of the ear canal
and trigger otitis externa. Otitis externa is an inflammation of the ear canal caused by bacterial,
fungal, or viral infection. This research used an analytical observational method with a cross-sectional
approach and a sample of 42 workers. Data were collected with a questionnaire and
analyzed in SPSS using the chi-square test. The frequency of ear cleaning was associated
with the incidence of otitis externa (P = 0.046), as were the symptoms that appeared
after ear cleaning (P = 0.000). No association with the incidence of otitis externa was found
for the location of ear cleaning (P = 0.214), the tools used for ear cleaning (P = 0.387),
or the reason for ear cleaning (P = 1.000).
A revival of integrity constraints for data cleaning
Integrity constraints, a.k.a. data dependencies, are being widely used for improving the quality of schema. Recently constraints have enjoyed a revival for improving the quality of data. The tutorial aims to provide an overview of recent advances in constraint-based data cleaning.
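As a rough illustration of constraint-based data cleaning, the sketch below detects violations of a functional dependency, one common kind of integrity constraint. It is not drawn from the tutorial; the function name `fd_violations`, the column names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Find violations of the functional dependency lhs -> rhs.

    rows: list of dicts (one per record); lhs: list of column names;
    rhs: a single column name. Returns a dict mapping each lhs value
    tuple to the set of conflicting rhs values seen for it.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].add(row[rhs])
    # A dependency is violated when one lhs value maps to several rhs values.
    return {key: values for key, values in groups.items() if len(values) > 1}


# Hypothetical records: the constraint zip -> city should hold.
records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
]
violations = fd_violations(records, ["zip"], "city")
```

Here the two records sharing zip "10001" disagree on the city, so that group is reported as a candidate for repair; deciding which value to keep is a separate step that this sketch does not attempt.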
Data Cleaning Methods for Client and Proxy Logs
In this paper we present our experiences with cleaning Web client and proxy usage logs, based on a long-term browsing study with 25 participants. A detailed clickstream log, recorded using a Web intermediary, was combined with a second log of user interface actions, which was captured by a modified Firefox browser for a subset of the participants. The consolidated data from both logs revealed many page requests that were not directly related to user actions. For participants who had no ad-filtering system installed, these artifacts made up one third of all transferred Web pages. Three major causes were identified: HTML frames and iframes, advertisements, and automatic page reloads. The experience gained during the data cleaning process may help other researchers choose adequate filtering methods for their data.
