Network Model Selection Using Task-Focused Minimum Description Length
Networks are fundamental models for data used in practically every
application domain. In most instances, several implicit or explicit choices
about the network definition impact the translation of underlying data to a
network representation, and the subsequent question(s) about the underlying
system being represented. Users of downstream network data may not even be
aware of these choices or their impacts. We propose a task-focused network
model selection methodology which addresses several key challenges. Our
approach constructs network models from underlying data and uses minimum
description length (MDL) criteria for selection. Our methodology measures
efficiency, a general and comparable measure of the network's performance on a
local (i.e., node-level) predictive task of interest. Selection on efficiency
favors parsimonious (e.g. sparse) models to avoid overfitting and can be
applied across arbitrary tasks and representations. We demonstrate stability,
sensitivity, and significance testing for our methodology.
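
To make the selection idea concrete, here is a toy two-part MDL sketch. It is not the paper's efficiency measure, which the abstract does not define in detail. Everything in it is an assumption chosen for illustration: the crude edge code, the neighbour-majority label predictor, and the helper names (`build`, `edge_bits`, `correction_bits`, `description_length`). Each candidate network is charged the bits needed to list its edges plus the bits needed to correct every node whose label it predicts wrongly, and the network with the shortest total is selected.

```python
import math
from collections import Counter

def build(n, edges):
    """Build an undirected adjacency map over nodes 0..n-1 (hypothetical helper)."""
    adjacency = {u: set() for u in range(n)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency

def edge_bits(adjacency):
    """Model cost: list each undirected edge as two node ids (an assumed, crude code)."""
    n = len(adjacency)
    n_edges = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    return n_edges * 2 * math.log2(n)

def predict(adjacency, labels, node):
    """Local task: majority label among neighbours, or None for an isolated node."""
    votes = Counter(labels[v] for v in adjacency[node])
    return votes.most_common(1)[0][0] if votes else None

def correction_bits(adjacency, labels):
    """Data cost: for each wrong prediction, name the node and its true label."""
    n = len(adjacency)
    n_labels = len(set(labels.values()))
    errors = sum(predict(adjacency, labels, u) != labels[u] for u in adjacency)
    return errors * (math.log2(n) + math.log2(n_labels))

def description_length(adjacency, labels):
    """Two-part code length: model bits plus correction bits."""
    return edge_bits(adjacency) + correction_bits(adjacency, labels)

# Toy data: six nodes in three label groups, and three candidate networks.
labels = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}
candidates = {
    "empty": build(6, []),
    "pairs": build(6, [(0, 1), (2, 3), (4, 5)]),
    "complete": build(6, [(u, v) for u in range(6) for v in range(u + 1, 6)]),
}
lengths = {name: description_length(g, labels) for name, g in candidates.items()}
best = min(lengths, key=lengths.get)
```

On this toy data the sparse `pairs` network is selected. The empty network pays for six corrections, while the complete network pays for fifteen edges and still mispredicts every node.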
Machine learning based data mining for Milky Way filamentary structures reconstruction
We present an innovative method called FilExSeC (Filaments Extraction,
Selection and Classification), a data mining tool developed to investigate
whether the shape reconstruction of filamentary structures can be refined and
optimized. The structures are detected with a consolidated method based on
flux derivative analysis of the column-density maps computed from Herschel
infrared Galactic Plane Survey (Hi-GAL) observations of the Galactic plane. The
present methodology is based on a feature extraction module followed by a
machine learning model (Random Forest) that selects features and classifies the
pixels of the input images. From tests on both simulations and real
observations the method appears reliable and robust with respect to the
variability of shape and distribution of filaments. In the cases of highly
defined filament structures, the presented method is able to bridge the gaps
among the detected fragments, thus improving their shape reconstruction. From a
preliminary "a posteriori" analysis of derived filament physical parameters,
the method appears potentially able to add a sufficient contribution to
complete and refine the filament reconstruction.
Comment: Proceeding of WIRN 2015 Conference, May 20-22, Vietri sul Mare,
Salerno, Italy. Published in Smart Innovation, Systems and Technology,
Springer, ISSN 2190-3018, 9 pages, 4 figures
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there is no rigorous study of how exactly cleaning
affects ML. The ML community usually focuses on developing ML algorithms that
are robust to particular noise types of certain distributions, while the
database (DB) community has mostly studied the problem of data cleaning alone,
without considering how data is consumed by downstream ML analytics. We
propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.
Comment: published in ICDE 202
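
The CleanML abstract mentions controlling the false discovery rate with the Benjamini-Yekutieli (BY) procedure. As a minimal sketch of that standard step-up procedure, and not the study's own code, the function below sorts the p-values, compares the k-th smallest against k * alpha / (m * c(m)) with c(m) = 1 + 1/2 + ... + 1/m, and rejects every hypothesis up to the largest k that passes. The function name, the default `alpha`, and the example p-values are illustrative assumptions.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected under BY."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction that makes the procedure valid under arbitrary dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k * alpha / (m * c_m).
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected

# Made-up p-values from five hypothetical cleaning-vs-no-cleaning comparisons.
example = [0.9, 0.001, 0.041, 0.008, 0.039]
decisions = benjamini_yekutieli(example, alpha=0.05)
```

With these made-up values only the two smallest p-values (0.001 and 0.008) are rejected. The values 0.039 and 0.041 would each pass an uncorrected 0.05 threshold but fail the BY thresholds.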