7 research outputs found

XLIndy: interactive recognition and information extraction in spreadsheets

Author: Gonsior Julius
Koci Elvis
Kuban Dana
Lehner Wolfgang
Luetting Nico
Olwig Dominik
Romero Moral Óscar
Thiele Maik
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019

Over the years, spreadsheets have established their presence in many domains, including business, government, and science. However, challenges arise because spreadsheets are partially structured and carry implicit (visual and textual) information. This creates a bottleneck for automatic analysis and extraction of information. Therefore, we present XLIndy, a Microsoft Excel add-in with a machine learning back-end, written in Python. It showcases our novel methods for layout inference and table recognition in spreadsheets. For a selected task and method, users can visually inspect the results, change configurations, and compare different runs. This enables iterative fine-tuning. Additionally, users can manually revise the predicted layout and tables, and subsequently save them as annotations. These annotations are used to measure performance and (re-)train classifiers. Finally, data in the recognized tables can be extracted for further processing. XLIndy supports several standard formats, such as CSV and JSON.
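The final extraction step mentioned in the abstract (exporting a recognized table to CSV or JSON) can be illustrated with a minimal sketch. This is not XLIndy's code: the sheet grid, the `region` bounding box, and the helper names `crop`, `table_to_csv`, and `table_to_json` are all hypothetical, and the table region is assumed to have been found already by some recognizer.

```python
import csv
import io
import json

# Hypothetical sheet contents; None marks an empty cell.
grid = [
    ["Sales report", None, None],
    [None, None, None],
    ["Region", "Q1", "Q2"],
    ["North", 10, 12],
    ["South", 7, 9],
]

# Assumed output of a table recognizer: inclusive bounding box
# (top row, bottom row, left column, right column).
region = (2, 4, 0, 2)


def crop(grid, region):
    """Cut the recognized table out of the sheet grid."""
    top, bottom, left, right = region
    return [row[left:right + 1] for row in grid[top:bottom + 1]]


def table_to_csv(header, rows):
    """Serialize a header row and data rows to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def table_to_json(header, rows):
    """Serialize the table as a JSON list of records keyed by header."""
    return json.dumps([dict(zip(header, row)) for row in rows])


table = crop(grid, region)
header, rows = table[0], table[1:]
csv_text = table_to_csv(header, rows)
json_text = table_to_json(header, rows)
```

The sketch treats the first row of the cropped region as the header, which is an assumption that only holds for simple tables.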

Crossref

UPCommons. Portal del coneixement obert de la UPC

A Novel Approach to Data Extraction on Hyperlinked Webpages

Author: Khushi Matloob
Masood Nayyer
Shaukat Kamran
Publication venue: 'MDPI AG'
Publication date: 01/01/2019

The World Wide Web has an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further detail on certain attribute values. Extracting the schema of such relational tables is a challenge due to the absence of a standard format and a lack of published algorithms. We downloaded 15,000 web pages from various web sites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used for the classification of table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema was extracted for non-linked tables, and for linked tables a relational schema in the form of primary and foreign keys (PK and FK) was developed. Child tables were concatenated with the parent table's attribute value (PK), serving as foreign keys (FKs). As a result, these tables could support better and stronger queries using the join operation. A manual check of the linked web table results revealed 99% precision and 68% recall. Our 15,000-strong downloadable corpus and a novel algorithm will provide the basis for further research in this field.
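The PK/FK idea in the abstract can be illustrated with a minimal sketch: each child table is reached through a hyperlink on a parent attribute value, that value is attached to the child rows as a foreign key, and the two tables can then be joined. This is not the authors' algorithm; the data and the helper names `attach_foreign_key` and `join` are hypothetical.

```python
# Hypothetical parent table; "team" plays the role of the primary key.
parent = [
    {"team": "Alpha", "city": "Oslo"},
    {"team": "Beta", "city": "Bergen"},
]

# Hypothetical child tables, keyed by the parent attribute value
# whose hyperlink led to them.
children = {
    "Alpha": [{"player": "Ann"}, {"player": "Bo"}],
    "Beta": [{"player": "Cy"}],
}


def attach_foreign_key(children, fk_name):
    """Flatten child tables, adding the parent's key value to each row."""
    return [
        {fk_name: pk, **row}
        for pk, rows in children.items()
        for row in rows
    ]


def join(parent, child, key):
    """Inner join of child rows with their parent row on `key`."""
    index = {p[key]: p for p in parent}
    return [{**index[c[key]], **c} for c in child if c[key] in index]


child_rows = attach_foreign_key(children, "team")
joined = join(parent, child_rows, "team")
```

Here `joined` holds one record per child row, enriched with the parent's attributes, which is the kind of query the join operation makes possible.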

Multidisciplinary Digital Publishing Institute

NORA - Norwegian Open Research Archives

UiS Brage

A clustering approach to extract data from HTML tables

Author: Corchuelo Gil Rafael
Jiménez Aguirre Patricia
Roldán Salvador Juan Carlos
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021

HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial due to the many different layouts, encodings, and formats available. In this article, we introduce Melva, an unsupervised, domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3,000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency.

Funding: Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-106

idUS. Depósito de Investigación Universidad de Sevilla

TOMATE: A heuristic-based approach to extract data from HTML tables

Author: Corchuelo Gil Rafael
Jiménez Aguirre Patricia
Roldán Salvador Juan Carlos
Szekely Pedro
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021

Extracting data from user-friendly HTML tables is difficult because of their different layouts, formats, and encoding problems. In this article, we present a new proposal that first applies several pre-processing heuristics to clean the tables, then performs functional analysis, and finally applies some post-processing heuristics to produce the output. Our most important contribution concerns functional analysis, which we address by projecting the cells onto a high-dimensional feature space in which a standard clustering technique is used to tell the meta-data cells apart from the data cells. We experimented with two large repositories of real-world HTML tables, and our results confirm that our proposal can extract data from them with an F1 score of 89.50% in just 0.09 CPU seconds per table. We compared our proposal with several competitors, and the statistical analysis confirmed its superiority in terms of effectiveness, while it remains very competitive in terms of efficiency.

Funding: Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-1060; Ministerio de Ciencia e Innovación PID2020-112540RB-C4
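The functional-analysis idea (project cells onto a feature space, then cluster them into meta-data and data cells) can be illustrated with a toy sketch. It is not TOMATE itself: the two features (share of digit characters, relative row position), the plain 2-means loop, and the deterministic initialisation are all assumptions chosen to keep the example small, whereas the abstract speaks of a high-dimensional space.

```python
def cell_features(text, row, n_rows):
    """Toy 2-D features: digit ratio and relative row position."""
    digits = sum(ch.isdigit() for ch in text)
    return (digits / max(len(text), 1), row / max(n_rows - 1, 1))


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iters=20):
    """Plain k-means with k=2, seeded with the first and last points."""
    centroids = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid (ties go to cluster 0).
        labels = [
            min((0, 1), key=lambda k: dist2(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return labels


# Hypothetical table, already cleaned by pre-processing.
table = [
    ["Country", "Population", "Area"],
    ["Spain", "47000000", "505990"],
    ["Norway", "5400000", "385207"],
]

points = [
    cell_features(text, r, len(table))
    for r, row in enumerate(table)
    for text in row
]
labels = two_means(points)  # one label per cell, in row-major order
```

On this toy table, cluster 0 collects the header row and the country names, and cluster 1 collects the numeric cells. Real tables would need richer features than these two.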

idUS. Depósito de Investigación Universidad de Sevilla

A hybrid quantum approach to leveraging data from HTML tables

Author: Corchuelo Gil Rafael
Jiménez Aguirre Patricia
Roldán Salvador Juan Carlos
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022

The Web provides a great deal of data encoded using HTML tables. This facilitates rendering them, but obfuscates their structure and makes it difficult for automated business processes to leverage them. This has motivated many authors to work on proposals to extract them as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach in which a standard computer is used to perform pre- and post-processing tasks and a quantum computer is used to perform the core task: guessing whether the cells have labels or values. The problem is addressed using a clustering approach that is known to be NP using standard computers, but our proposal can solve it in polynomial time, which implies a significant performance improvement. It is novel in that it relies on an entropy-preservation metaphor that has proven to work very well on two large collections of real-world tables from Wikipedia and the Dresden Web Table Corpus. Our experiments prove that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.

Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-106

idUS. Depósito de Investigación Universidad de Sevilla

Clustering header categories extracted from web tables

Author: Abu-Tarif
Adelfio
Astrakhantsev
Ball
Bing
Chandran
Costa e Silva
Dalvi
Embley
Gatterbauer
Gonzalez
Handley
Hirayama
Hu
Hurst
Hurst
Itonori
Kieninger
Kieninger
Laurentini
Long
McQueen
Pinto
Pyreddy
Shamalian
Venetis
Wang
Zuyev
Publication venue: 'SPIE-Intl Soc Optical Eng'

Crossref

Transforming web tables to a relational database

Author: Embley David W.
Nagy George
Seth Sharad C.
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2014

HTML tables represent a significant fraction of web data. The often complex headers of such tables are determined accurately using their indexing property. Isolated headers are factored to extract category hierarchies. Web tables are then transformed into a canonical form and imported into a relational database. The proposed processing allows for the formulation of arbitrary SQL queries over the collection of induced relational tables.
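The pipeline described in the abstract (canonical form, import into a relational database, SQL queries) can be illustrated with a minimal sketch using Python's standard `sqlite3` module. It is not the authors' implementation: the example table, the `facts` schema, and the helper `to_canonical` are hypothetical, and the canonical form is assumed here to be one tuple per data cell, carrying its row category and column categories.

```python
import sqlite3

# Hypothetical web table with a two-level column header (year, quarter)
# and a single row header (city); headers are assumed already segmented.
col_headers = [("2013", "Q1"), ("2013", "Q2"), ("2014", "Q1")]
row_headers = ["Lincoln", "Omaha"]
values = [
    [10, 12, 11],
    [20, 19, 23],
]


def to_canonical(row_headers, col_headers, values):
    """Emit one (city, year, quarter, value) tuple per data cell."""
    return [
        (city, year, quarter, values[i][j])
        for i, city in enumerate(row_headers)
        for j, (year, quarter) in enumerate(col_headers)
    ]


canonical = to_canonical(row_headers, col_headers, values)

# Import the canonical tuples into an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (city TEXT, year TEXT, quarter TEXT, value INTEGER)"
)
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", canonical)

# An ordinary SQL query over the induced relational table.
totals = conn.execute(
    "SELECT city, SUM(value) FROM facts "
    "WHERE year = '2013' GROUP BY city ORDER BY city"
).fetchall()
```

Once every data cell is a row keyed by its header categories, aggregations over any header level become plain `GROUP BY` queries.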

CiteSeerX

Crossref

DigitalCommons@University of Nebraska