1,007 research outputs found

Towards Certain Fixes with Editing Rules and Master Data

Authors: Fan Wenfei, Li Jianzhong, Ma Shuai, Tang Nan, Yu Wenyuan
Publication date: 01/01/2010

A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
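As a rough illustration of the idea in this abstract, the sketch below applies one editing rule against a small master table. It is not the paper's algorithm: the attribute names, the `MASTER` records, the `RULE` layout, and the `apply_rule` helper are all hypothetical, chosen only to show how a certain region and master data together determine which attributes get overwritten.

```python
# Hypothetical master data: trusted reference records (an assumption for illustration).
MASTER = [
    {"zip": "EH8 9AB", "city": "Edinburgh", "area_code": "131"},
    {"zip": "NW1 6XE", "city": "London", "area_code": "20"},
]

# Hypothetical editing rule: if the input tuple agrees with a master record on
# the `match` attributes, copy the master values of the `fix` attributes.
RULE = {"match": ["zip"], "fix": ["city", "area_code"]}


def apply_rule(record, rule, master, certain):
    """Return a fixed copy of `record`, or None when no unique fix applies."""
    # Fire the rule only if every matched attribute is asserted correct,
    # i.e. lies inside the user-provided certain region.
    if not set(rule["match"]) <= certain:
        return None
    candidates = [
        m for m in master if all(m[a] == record[a] for a in rule["match"])
    ]
    # Require that all matching master records agree on the values to copy.
    fixes = {tuple(m[a] for a in rule["fix"]) for m in candidates}
    if len(fixes) != 1:
        return None
    fixed = dict(record)
    fixed.update(zip(rule["fix"], fixes.pop()))
    return fixed


dirty = {"zip": "EH8 9AB", "city": "Edinburg", "area_code": "020"}
print(apply_rule(dirty, RULE, MASTER, certain={"zip"}))
```

When `zip` is not in the certain region, the helper returns `None` instead of guessing, which mirrors the abstract's point that a fix is only warranted relative to attributes the user has assured correct.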

Crossref

Edinburgh Research Explorer

CerFix: A System for Cleaning Data with Certain Fixes

Authors: Fan Wenfei, Li Jianzhong, Ma Shuai, Tang Nan, Yu Wenyuan
Publication date: 01/01/2011

Edinburgh Research Explorer

Graph Pattern Matching: From Intractable to Polynomial Time

Authors: Fan Wenfei, Li Jianzhong, Ma Shuai, Tang Nan, Wu Yinghui, Wu Yunpeng
Publication date: 01/01/2010

Edinburgh Research Explorer

Multi-Path Bound for DAG Tasks

Authors: Guan Nan, He Qingqiang, Lv Mingsong, Zhao Shuai
Publication date: 23/10/2023

This paper studies the response time bound of a DAG (directed acyclic graph) task. Recently, the idea of using multiple paths to bound the response time of a DAG task, instead of using a single longest path in previous results, was proposed and leads to the so-called multi-path bound. Multi-path bounds can greatly reduce the response time bound and significantly improve the schedulability of DAG tasks. This paper derives a new multi-path bound and proposes an optimal algorithm to compute this bound. We further present a systematic analysis on the dominance and the sustainability of three existing multi-path bounds and the proposed multi-path bound. Our bound theoretically dominates and empirically outperforms all existing multi-path bounds. What's more, the proposed bound is the only multi-path bound that is proved to be self-sustainable.
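For context, the single-longest-path baseline that this abstract contrasts with is commonly stated as R <= len + (vol - len) / m, where len is the longest path length, vol is the total workload, and m is the number of cores. The sketch below computes that classic bound only; it does not implement the paper's multi-path bound, and the function name and input layout are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def single_path_bound(wcet, preds, cores):
    """Classic single-path response time bound for one DAG task.

    wcet:  dict mapping each node to its worst-case execution time
    preds: dict mapping a node to the set of its predecessors
    cores: number of identical cores (m)
    """
    graph = {v: preds.get(v, set()) for v in wcet}
    finish = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(graph).static_order():
        finish[v] = wcet[v] + max((finish[p] for p in graph[v]), default=0)
    length = max(finish.values())   # longest path (len)
    volume = sum(wcet.values())     # total workload (vol)
    return length + (volume - length) / cores


# Diamond-shaped example: a -> {b, c} -> d
wcet = {"a": 1, "b": 3, "c": 2, "d": 1}
preds = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(single_path_bound(wcet, preds, cores=2))  # 5 + (7 - 5) / 2 = 6.0
```

The bound charges all work off the longest path as potential interference, which is the pessimism that multi-path bounds aim to reduce.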

arXiv.org e-Print Archive

Constructing Multilingual Code Search Dataset Using Neural Machine Translation

Authors: Duan Nan, Lu Shuai, Sekizawa Ryo, Yanaka Hitomi
Publication date: 27/06/2023

Code search is a task to find programming codes that semantically match the given natural language queries. Even though some of the existing datasets for this task are multilingual on the programming language side, their query data are only in English. In this research, we create a multilingual code search dataset in four natural and four programming languages using a neural machine translation model. Using our dataset, we pre-train and fine-tune the Transformer-based models and then evaluate them on multiple code search test sets. Our results show that the model pre-trained with all natural and programming language data has performed best in most cases. By applying back-translation data filtering to our dataset, we demonstrate that the translation quality affects the model's performance to a certain extent, but the data size matters more. Comment: To appear in the Proceedings of the ACL2023 Student Research Workshop (SRW)

arXiv.org e-Print Archive

Calliope-Net: Automatic Generation of Graph Data Facts via Annotated Node-link Diagrams

Authors: Cao Nan, Chen Nan, Chen Qing, Shuai Wei, Tong Hanghang, Wu Guande, Xu Zhe
Publication date: 11/08/2023

Graph or network data are widely studied in both data mining and visualization communities to review the relationship among different entities and groups. The data facts derived from graph visual analysis are important to help understand the social structures of complex data, especially for data journalism. However, it is challenging for data journalists to discover graph data facts and manually organize correlated facts around a meaningful topic due to the complexity of graph data and the difficulty of interpreting graph narratives. Therefore, we present an automatic graph facts generation system, Calliope-Net, which consists of a fact discovery module, a fact organization module, and a visualization module. It creates annotated node-link diagrams with facts automatically discovered and organized from network data. A novel layout algorithm is designed to present meaningful and visually appealing annotated graphs. We evaluate the proposed system with two case studies and an in-lab user study. The results show that Calliope-Net can benefit users in discovering and understanding graph data facts with visually pleasing annotated visualizations.

arXiv.org e-Print Archive

Beyond Numbers: Creating Analogies to Enhance Data Comprehension and Communication with Generative AI

Authors: Cao Nan, Chen Qing, Shuai Wei, Sun Zhida, Zhang Jiyao
Publication date: 31/01/2024

Unfamiliar measurements usually hinder readers from grasping the scale of the numerical data, understanding the content, and feeling engaged with the context. To enhance data comprehension and communication, we leverage analogies to bridge the gap between abstract data and familiar measurements. In this work, we first conduct semi-structured interviews with design experts to identify design problems and summarize design considerations. Then, we collect an analogy dataset of 138 cases from various online sources. Based on the collected dataset, we characterize a design space for creating data analogies. Next, we build a prototype system, AnalogyMate, that automatically suggests data analogies, their corresponding design solutions, and generated visual representations powered by generative AI. The study results show the usefulness of AnalogyMate in aiding the creation process of data analogies and the effectiveness of data analogy in enhancing data comprehension and communication.

arXiv.org e-Print Archive