7 research outputs found

    XLIndy: interactive recognition and information extraction in spreadsheets

    Get PDF
    Over the years, spreadsheets have established their presence in many domains, including business, government, and science. However, challenges arise due to spreadsheets being partially-structured and carrying implicit (visual and textual) information. This translates into a bottleneck, when it comes to automatic analysis and extraction of information. Therefore, we present XLIndy, a Microsoft Excel add-in with a machine learning back-end, written in Python. It showcases our novel methods for layout inference and table recognition in spreadsheets. For a selected task and method, users can visually inspect the results, change configurations, and compare different runs. This enables iterative fine-tuning. Additionally, users can manually revise the predicted layout and tables, and subsequently save them as annotations. The latter is used to measure performance and (re-)train classifiers. Finally, data in the recognized tables can be extracted for further processing. XLIndy supports several standard formats, such as CSV and JSON.Peer ReviewedPostprint (author's final draft

    A Novel Approach to Data Extraction on Hyperlinked Webpages

    Get PDF
    The World Wide Web has an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages, providing further detailed information to certain attribute values. Extracting schema of such relational tables is a challenge due to the non-existence of a standard format and a lack of published algorithms. We downloaded 15,000 web pages using our in-house developed web-crawler, from various web sites. Tables from the HTML code were extracted and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used for the classification of table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema for non-linked tables was extracted and for the linked-tables, relational schema in the form of primary and foreign-keys (PK and FK) were developed. Child tables were concatenated with the parent table’s attribute value (PK), serving as foreign keys (FKs). Resultantly, these tables could assist with performing better and stronger queries using the join operation. A manual checking of the linked web table results revealed a 99% precision and 68% recall values. Our 15,000-strong downloadable corpus and a novel algorithm will provide the basis for further research in this field.publishedVersio

    A clustering approach to extract data from HTML tables

    Get PDF
    HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial due to the many different layouts, encodings, and formats available. In this article, we introduce Melva, which is an unsupervised domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps make label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3 000 HTML tables from the Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiencyMinisterio de Ciencia e Innovación PID2020-112540RB-C44Ministerio de Economía y Competitividad TIN2016-75394-RJunta de Andalucía P18-RT-106

    TOMATE: A heuristic-based approach to extract data from HTML tables

    Get PDF
    Extracting data from user-friendly HTML tables is difficult because of their different lay outs, formats, and encoding problems. In this article, we present a new proposal that first applies several pre-processing heuristics to clean the tables, then performs functional anal ysis, and finally applies some post-processing heuristics to produce the output. Our most important contribution is regarding functional analysis, which we address by projecting the cells onto a high-dimensional feature space in which a standard clustering technique is used to make the meta-data cells apart from the data cells. We experimented with two large repositories of real-world HTML tables and our results confirm that our proposal can extract data from them with an F1 score of 89:50% in just 0:09 CPU seconds per table. We confronted our proposal with several competitors and the statistical analysis confirmed its superiority in terms of effectiveness, while it keeps very competitive in terms of efficiency.Ministerio de Economía y Competitividad TIN2013-40848-RMinisterio de Economía y Competitividad TIN2016-75394-RJunta de Andalucía P18-RT-1060Ministerio de Ciencia e Innovación PID2020-112540RB-C4

    A hybrid quantum approach to leveraging data from HTML tables

    Get PDF
    The Web provides many data that are encoded using HTML tables. This facilitates rendering them, but obfuscates their structure and makes it difficult for automated business processes to leverage them. This has motivated many authors to work on proposals to extract them as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach in which a standard computer is used to perform pre and post-processing tasks and a quantum computer is used to perform the core task: guessing whether the cells have labels or values. The problem is addressed using a clustering approach that is known to be NP using standard computers, but our proposal can solve it in polynomial time, which implies a significant performance improvement. It is novel in that it relies on an entropy-preservation metaphor that has proven to work very well on two large collections of real-world tables from the Wikipedia and the Dresden Web Table Corpus. Our experiments prove that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.Ministerio de Economía y Competitividad TIN2016-75394-RMinisterio de Ciencia e Innovación PID2020-112540RB-C44Junta de Andalucía P18-RT-106

    Transforming web tables to a relational database

    Get PDF
    HTML tables represent a significant fraction of web data. The often complex headers of such tables are determined accurately using their indexing property. Isolated headers are factored to extract category hierarchies. Web tables are then transformed into a canonical form and imported into a relational database. The proposed processing allows for the formulation of arbitrary SQL queries over the collection of induced relational tables
    corecore