Towards Certain Fixes with Editing Rules and Master Data
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
Multi-Path Bound for DAG Tasks
This paper studies the response time bound of a DAG (directed acyclic graph)
task. Recently, the idea of using multiple paths to bound the response time of
a DAG task, instead of the single longest path used in previous results, was
proposed, leading to the so-called multi-path bound. Multi-path bounds can
greatly reduce the response time bound and significantly improve the
schedulability of DAG tasks. This paper derives a new multi-path bound and
proposes an optimal algorithm to compute this bound. We further present a
systematic analysis on the dominance and the sustainability of three existing
multi-path bounds and the proposed multi-path bound. Our bound theoretically
dominates and empirically outperforms all existing multi-path bounds. What's
more, the proposed bound is the only multi-path bound that is proved to be
self-sustainable.
Constructing Multilingual Code Search Dataset Using Neural Machine Translation
Code search is the task of finding source code that semantically matches the
given natural language queries. Even though some of the existing datasets for
this task are multilingual on the programming language side, their query data
are only in English. In this research, we create a multilingual code search
dataset in four natural and four programming languages using a neural machine
translation model. Using our dataset, we pre-train and fine-tune the
Transformer-based models and then evaluate them on multiple code search test
sets. Our results show that the model pre-trained with all natural and
programming language data has performed best in most cases. By applying
back-translation data filtering to our dataset, we demonstrate that the
translation quality affects the model's performance to a certain extent, but
the data size matters more.
Comment: To appear in the Proceedings of the ACL 2023 Student Research Workshop (SRW).
Calliope-Net: Automatic Generation of Graph Data Facts via Annotated Node-link Diagrams
Graph or network data are widely studied in both data mining and
visualization communities to review the relationship among different entities
and groups. The data facts derived from graph visual analysis are important to
help understand the social structures of complex data, especially for data
journalism. However, it is challenging for data journalists to discover graph
data facts and manually organize correlated facts around a meaningful topic due
to the complexity of graph data and the difficulty of interpreting graph
narratives. Therefore, we present an automatic graph facts generation system,
Calliope-Net, which consists of a fact discovery module, a fact organization
module, and a visualization module. It creates annotated node-link diagrams
with facts automatically discovered and organized from network data. A novel
layout algorithm is designed to present meaningful and visually appealing
annotated graphs. We evaluate the proposed system with two case studies and an
in-lab user study. The results show that Calliope-Net can benefit users in
discovering and understanding graph data facts with visually pleasing annotated
visualizations.
Beyond Numbers: Creating Analogies to Enhance Data Comprehension and Communication with Generative AI
Unfamiliar measurements usually hinder readers from grasping the scale of the
numerical data, understanding the content, and feeling engaged with the
context. To enhance data comprehension and communication, we leverage analogies
to bridge the gap between abstract data and familiar measurements. In this
work, we first conduct semi-structured interviews with design experts to
identify design problems and summarize design considerations. Then, we collect
an analogy dataset of 138 cases from various online sources. Based on the
collected dataset, we characterize a design space for creating data analogies.
Next, we build a prototype system, AnalogyMate, that automatically suggests
data analogies, their corresponding design solutions, and generated visual
representations powered by generative AI. The study results show the usefulness
of AnalogyMate in aiding the creation process of data analogies and the
effectiveness of data analogy in enhancing data comprehension and
communication.