Learning Semantic Annotations for Tabular Data

Author: Chen J.
Horrocks I.
Jimenez-Ruiz E.
Sutton C.
Publication venue: International Joint Conferences on Artificial Intelligence (IJCAI)

The usefulness of tabular data such as web tables critically depends on understanding their semantics. This study focuses on column type prediction for tables without any metadata. Unlike traditional lexical matching-based methods, we propose a deep prediction model that can fully exploit a table's contextual semantics, including table locality features learned by a Hybrid Neural Network (HNN), and inter-column semantics features learned by a knowledge base (KB) lookup and query answering algorithm. It exhibits good performance not only on individual table sets, but also when transferring from one table set to another.

City Research Online

Integrating and querying similar tables from PDF documents using deep learning

Author: Anand Rahul
Paik Hye-Young
Wang Cheng
Publication date: 15/01/2019

A large amount of public data produced by enterprises is in semi-structured PDF form. Tabular data extraction from reports and other published data in PDF format is of interest for various data consolidation purposes such as analysing and aggregating financial reports of a company. Queries into the structured tabular data in PDF format are normally processed in an unstructured manner through means like text-match. This is mainly because the binary format of PDF documents is optimized for layout and rendering and does not have great support for automated parsing of data. Moreover, even the same table type in PDF files varies in schema and in row or column headers, which makes it difficult for a query plan to cover all relevant tables. This paper proposes a deep learning based method to enable SQL-like query and analysis of financial tables from annual reports in PDF format. This is achieved through table type classification and nearest row search. We demonstrate that using word embeddings trained on Google News for header matching clearly outperforms the text-match based approach of traditional databases. We also introduce a practical system that uses this technology to query and analyse finance tables in PDF documents from various sources.

arXiv.org e-Print Archive

UNSWorks

Extracting novel facts from tables for Knowledge Graph completion

Author: CS Bhagavatula
G Limaye
I Ermilov
J Pearl
J Wang
M Cafarella
M Nickel
M Pham
P Venetis
S Auer
S Neumaier
V Efthymiou
V Mulwad
Z Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

We propose a new end-to-end method for extending a Knowledge Graph (KG) from tables. Existing techniques tend to interpret tables by focusing on information that is already in the KG, and therefore tend to extract many redundant facts. Our method aims to find more novel facts. We introduce a new technique for table interpretation based on a scalable graphical model using entity similarities. Our method further disambiguates cell values using KG embeddings as an additional ranking method. Other distinctive features are the lack of assumptions about the underlying KG and the enabling of fine-grained tuning of the precision/recall trade-off of extracted facts. Our experiments show that our approach has a higher recall during the interpretation process than the state of the art, and is more resistant to the bias observed in extracting mostly redundant facts, since it produces more novel extractions.

VU Research Portal

Crossref

CWI's Institutional Repository

Extracting Novel Facts from Tables for Knowledge Graph Completion (Extended version)

Author: Boncz P.A. (Peter)
Kruit B.B. (Benno)
Urbani J. (Jacopo)
Publication date: 15/07/2019

arXiv.org e-Print Archive

CWI's Institutional Repository

Extracting Contextualized Quantity Facts from Web Tables

Author: Berberich K.
Ho V.
Pal K.
Razniewski S.
Weikum G.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2021

MPG.PuRe

A Longitudinal Analysis of Job Skills for Entry-Level Data Analysts

Author: Dong Tianxi
Triche J.
Publication venue: Digital Commons @ Trinity
Publication date: 01/10/2020

The explosive growth of the data analytics field has continued over the past decade with no signs of slowing down. Given the fast pace of technology changes and the need for IT professionals to constantly keep up with the field, it is important to analyze the job skills and knowledge required in the data analyst and business intelligence (BI) analyst job market. In this research, we examine over 9,000 job postings for entry-level data analytics jobs over five years (2014-2018). Using a text mining approach and a custom text mining dictionary, we identify a preliminary set of analytic competencies sought in practice. Further, the longitudinal data also demonstrate how these key skills have evolved over time. We find that the three biggest trends include proficiency with Python, Tableau, and R. We also find that an increasing number of jobs emphasize data visualization. Some skills, like Microsoft Access, SAP, and Cognos, declined in popularity over the time frame studied. Using the results of the study, universities can make informed curriculum decisions, and instructors can decide what skills to teach based on industry needs. Our custom text mining dictionary can be added to the growing literature and assist other researchers in this space.

Trinity University

AIS Electronic Library (AISeL)

Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Author: Lee Jae-Gil
Roh Yuji
Song Hwanjun
Whang Steven Euijong
Publication date: 04/08/2022

Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be rethought so that data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because recent deep learning approaches have less need for feature engineering, but instead more need for large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.

arXiv.org e-Print Archive