Learning Semantic Annotations for Tabular Data
The usefulness of tabular data such as web tables critically depends on understanding their semantics. This study focuses on column type prediction for tables without any metadata. Unlike traditional lexical matching-based methods, we propose a deep prediction model that can fully exploit a table's contextual semantics, including table locality features learned by a Hybrid Neural Network (HNN) and inter-column semantic features learned by a knowledge base (KB) lookup and query answering algorithm. It exhibits good performance not only on individual table sets, but also when transferring from one table set to another.
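The KB-lookup side of this idea can be illustrated with a minimal sketch: each cell value is looked up in a knowledge base and the column type is chosen by majority vote over the per-cell candidate types. The tiny KB, its type labels, and the example column below are illustrative assumptions, not the paper's model or data.

```python
# Toy sketch of KB-lookup column type prediction. A real system would
# query a large KB (and combine this with learned features); the KB
# below is a hypothetical stand-in for illustration only.
from collections import Counter

TOY_KB = {
    "London": "City",
    "Paris": "City",
    "Berlin": "City",
    "Amazon": "Company",
}

def predict_column_type(cells):
    # Vote over the types of the cells that the KB recognizes.
    votes = Counter(TOY_KB[c] for c in cells if c in TOY_KB)
    return votes.most_common(1)[0][0] if votes else "Unknown"

column = ["London", "Paris", "Smallville", "Berlin"]
print(predict_column_type(column))  # majority of known cells are cities
```

Unknown cells simply cast no vote, so a few out-of-KB values do not flip the prediction.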
Integrating and querying similar tables from PDF documents using deep learning
A large amount of the public data produced by enterprises is in
semi-structured PDF form. Tabular data extraction from reports and other
published data in PDF format is of interest for various data consolidation
purposes, such as analysing and aggregating the financial reports of a
company. Queries into structured tabular data in PDF format are normally
processed in an unstructured manner, through means like text match. This is
mainly because the binary format of PDF documents is optimized for layout
and rendering and offers little support for automated parsing of data.
Moreover, even the same table type varies across PDF files in schema and
row or column headers, which makes it difficult for a query plan to cover
all relevant tables. This paper proposes a deep learning based method to
enable SQL-like query and analysis of financial tables from annual reports
in PDF format. This is achieved through table type classification and
nearest-row search. We demonstrate that using word embeddings trained on
Google News for header matching clearly outperforms the text-match based
approach of a traditional database. We also introduce a practical system
that uses this technology to query and analyse financial tables in PDF
documents from various sources.
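The advantage of embedding-based header matching over exact text match can be sketched as follows. A real system would load pretrained vectors (e.g. word2vec trained on Google News); the tiny 3-dimensional vectors below are illustrative assumptions chosen only so that synonymous headers land close together.

```python
# Sketch: exact text match vs. cosine similarity over word embeddings
# for matching table headers. The embeddings are hypothetical toy
# vectors, not real pretrained ones.
import math

EMB = {
    "revenue":   (0.90, 0.10, 0.00),
    "turnover":  (0.85, 0.15, 0.05),
    "employees": (0.00, 0.90, 0.20),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def text_match(a, b):
    return 1.0 if a == b else 0.0

# "revenue" vs "turnover": exact match fails, embeddings capture synonymy.
print(text_match("revenue", "turnover"))
print(cosine(EMB["revenue"], EMB["turnover"]))
print(cosine(EMB["revenue"], EMB["employees"]))
```

With real Google News vectors the same pattern holds: synonymous financial headers score high while unrelated ones score low, which is what lets a single query plan cover tables with differing header wording.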
Extracting novel facts from tables for Knowledge Graph completion
We propose a new end-to-end method for extending a Knowledge Graph (KG) from tables. Existing techniques tend to interpret tables by focusing on information that is already in the KG, and therefore tend to extract many redundant facts. Our method aims to find more novel facts. We introduce a new technique for table interpretation based on a scalable graphical model using entity similarities. Our method further disambiguates cell values using KG embeddings as an additional ranking method. Other distinctive features are the lack of assumptions about the underlying KG and the ability to tune the precision/recall trade-off of extracted facts at a fine granularity. Our experiments show that our approach has higher recall during the interpretation process than the state-of-the-art, and is more resistant to the bias towards extracting mostly redundant facts, since it produces more novel extractions.
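The KG-embedding re-ranking step can be sketched in miniature: lexical candidate entities for a cell are re-scored by their embedding similarity to the entities already linked in the same row. The entity names, 2-dimensional vectors, and scoring below are illustrative assumptions, not the paper's actual model.

```python
# Toy sketch of disambiguating a table cell with KG embeddings:
# candidates are re-ranked by similarity (dot product) to the average
# embedding of the row's context entities. All vectors are hypothetical.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

KG_EMB = {
    "Paris_France": (0.9, 0.1),
    "Paris_Texas":  (0.1, 0.9),
    "France":       (0.8, 0.2),
}

def rerank(candidates, context_entities):
    # Average the embeddings of entities already linked in the row.
    ctx = [sum(KG_EMB[e][i] for e in context_entities) / len(context_entities)
           for i in range(2)]
    return max(candidates, key=lambda c: dot(KG_EMB[c], ctx))

best = rerank(["Paris_France", "Paris_Texas"], ["France"])
print(best)  # the row context pulls the ranking toward Paris_France
```

Both "Paris" candidates match lexically; only the embedding signal from the co-occurring entity "France" breaks the tie, which is the intuition behind using KG embeddings as an additional ranking method.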
Extracting Novel Facts from Tables for Knowledge Graph Completion (Extended version)
We propose a new end-to-end method for extending a Knowledge Graph (KG) from tables. Existing techniques tend to interpret tables by focusing on information that is already in the KG, and therefore tend to extract many redundant facts. Our method aims to find more novel facts. We introduce a new technique for table interpretation based on a scalable graphical model using entity similarities. Our method further disambiguates cell values using KG embeddings as an additional ranking method. Other distinctive features are the lack of assumptions about the underlying KG and the ability to tune the precision/recall trade-off of extracted facts at a fine granularity. Our experiments show that our approach has higher recall during the interpretation process than the state-of-the-art, and is more resistant to the bias towards extracting mostly redundant facts, since it produces more novel extractions.
A Longitudinal Analysis of Job Skills for Entry-Level Data Analysts
The explosive growth of the data analytics field has continued over the past decade with no signs of slowing down. Given the fast pace of technology change and the need for IT professionals to constantly keep up with the field, it is important to analyze the job skills and knowledge required in the data analyst and business intelligence (BI) analyst job market. In this research, we examine over 9,000 job postings for entry-level data analytics jobs over five years (2014-2018). Using a text mining approach and a custom text mining dictionary, we identify a preliminary set of analytic competencies sought in practice. Further, the longitudinal data also demonstrate how these key skills have evolved over time. We find that the three biggest trends include proficiency with Python, Tableau, and R. We also find that an increasing number of jobs emphasize data visualization. Some skills, like Microsoft Access, SAP, and Cognos, declined in popularity over the time frame studied. Using the results of the study, universities can make informed curriculum decisions, and instructors can decide what skills to teach based on industry needs. Our custom text mining dictionary adds to the growing literature and can assist other researchers in this space.
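The dictionary-based text mining approach can be sketched as a lookup of normalized tokens against a skill dictionary, aggregated across postings. The dictionary entries and sample postings below are illustrative assumptions; the study's actual dictionary is far larger.

```python
# Sketch of dictionary-based skill extraction from job postings, in the
# spirit of a custom text-mining dictionary. Dictionary and postings
# here are hypothetical examples.
import re
from collections import Counter

SKILL_DICT = {
    "python":  "Python",
    "tableau": "Tableau",
    "r":       "R",
    "sql":     "SQL",
}

def extract_skills(posting):
    # Lowercase tokens; keep + and # so tokens like "c++" survive intact.
    tokens = re.findall(r"[a-zA-Z+#]+", posting.lower())
    return Counter(SKILL_DICT[t] for t in tokens if t in SKILL_DICT)

postings = [
    "Entry-level analyst: SQL and Python required, Tableau a plus",
    "BI analyst: Tableau dashboards, R or Python for statistics",
]
totals = sum((extract_skills(p) for p in postings), Counter())
print(totals.most_common())
```

Counting per posting and summing the counters is what makes the longitudinal view possible: grouping postings by year and comparing the yearly counters shows how demand for each skill shifts over time.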
Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective
Data-centric AI is at the center of a fundamental shift in software
engineering where machine learning becomes the new software, powered by big
data and computing infrastructure. Here, software engineering needs to be
rethought so that data becomes a first-class citizen on par with code. One
striking observation is that a significant portion of the machine learning
process is spent on data preparation. Without good data, even the best machine
learning algorithms cannot perform well. As a result, data-centric AI practices
are now becoming mainstream. Unfortunately, many datasets in the real world are
small, dirty, biased, and even poisoned. In this survey, we study the research
landscape for data collection and data quality primarily for deep learning
applications. Data collection is important because recent deep learning
approaches need less feature engineering but larger amounts of data. For
data quality, we study data validation,
cleaning, and integration techniques. Even if the data cannot be fully cleaned,
we can still cope with imperfect data during model training using robust model
training techniques. In addition, while bias and fairness have been less
studied in traditional data management research, these issues become essential
topics in modern machine learning applications. We thus study fairness measures
and unfairness mitigation techniques that can be applied before, during, or
after model training. We believe that the data management community is well
poised to solve these problems.
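One of the fairness measures such a survey typically covers can be sketched concretely: demographic parity difference, the gap in positive-prediction rates between two groups. The group labels and predictions below are illustrative assumptions, not data from the survey.

```python
# Sketch of demographic parity difference, a common group fairness
# measure: |P(pred=1 | group=A) - P(pred=1 | group=B)|. The data is a
# made-up example for illustration.
def positive_rate(predictions, groups, group):
    picked = [p for p, g in zip(predictions, groups) if g == group]
    return sum(picked) / len(picked)

def demographic_parity_diff(predictions, groups):
    return abs(positive_rate(predictions, groups, "A")
               - positive_rate(predictions, groups, "B"))

preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_diff(preds, groups))  # |0.75 - 0.25| = 0.5
```

A value of 0 means both groups receive positive predictions at the same rate; mitigation techniques applied before, during, or after training aim to drive this gap down without destroying accuracy.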