3 research outputs found

Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery

Author: Castro Fernandez Raul
Elmagarmid Ahmed
Ilyas Ihab
Madden Samuel
Mansour Essam
Ouzzani Mourad
Qahtan Abdulhakim A.
Stonebraker Michael
Tang Nan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 18/06/2019

© 2018 IEEE. Employees who spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas, aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher which leverages word embeddings to find objects that are semantically related. We introduce coherent groups, a technique to combine word embeddings that works better than other state-of-the-art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments, and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups in finding those links.

DSpace@MIT

Crossref

Building Data Civilizer Pipelines with an Advanced Workflow Engine

Author: Abedjan Ziawasch
Castro Fernandez Raul
Deng Dong
Elmagarmid Ahmed
Ilyas Ihab F.
Madden Samuel R
Mansour Essam
Ouzzani Mourad
Qahtan Abdulhakim A.
Stonebraker Michael
Tang Nan
Tao Wenbo
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 18/06/2019

© 2018 IEEE. For an enterprise to gain insight into its internal business and the changing outside environment, it is essential to provide the relevant data for in-depth analysis. Enterprise data is usually scattered across departments and geographic regions and is often inconsistent. Data scientists spend the majority of their time finding, preparing, integrating, and cleaning relevant data sets. Data Civilizer is an end-to-end data preparation system. In this paper, we present the complete system, focusing on our new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools. Our workflow engine allows data scientists to author, execute, and retrofit data preparation pipelines of different data discovery and cleaning services. Our end-to-end demo scenario is based on data from the MIT data warehouse and e-commerce data sets.

DSpace@MIT

Crossref

A Demo of the Data Civilizer System

Author: Abedjan Ziawasch
Castro Fernandez Raul
Deng Dong
Elmagarmid Ahmed
Ilyas Ihab F.
Madden Samuel R
Mansour Essam
Ouzzani Mourad
Qahtan Abdulhakim A.
Stonebraker Michael
Tang Nan
Tao Wenbo
Publication venue: Association for Computing Machinery (ACM)
Publication date: 18/06/2019

Finding relevant data for a specific task from the numerous data sources available in any organization is daunting. This is due not only to the number of possible data sources where the data of interest resides, but also to the data being scattered all over the enterprise and typically being dirty and inconsistent. In practice, data scientists routinely report that the majority (more than 80%) of their effort is spent finding, cleaning, integrating, and accessing data of interest to a task at hand. We propose to demonstrate Data Civilizer to ease the pain faced in analyzing data "in the wild". Data Civilizer is an end-to-end big data management system with components for data discovery, data integration and stitching, data cleaning, and querying data from a large variety of storage engines, running in large enterprises.

DSpace@MIT