1,047 research outputs found

HTML Tables

Author: Suselo Joko
Publication venue: Joko Suselo

Htab2RDF: Mapping HTML Tables to RDF Triples

Author: Alghamdi Abdullah
Alnafjan Khalid
Bouchiha Djelloul
Malki Mimoun
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 09/02/2018

The Web has become a huge data source hidden under linked documents. A significant number of Web documents include HTML tables generated dynamically from relational databases. Often, there is no direct public access to the databases themselves. On the other hand, RDF (Resource Description Framework) gives an efficient mechanism to represent data directly on the Web, based on a Web-scalable architecture for identification and interpretation of terms. This leads to the concept of Linked Data on the Web. To allow direct access to data on the Web as Linked Data, we propose in this paper an approach to transform HTML tables into RDF triples. It consists of three main phases: refining, pre-treatment and mapping. The whole process is assisted by a domain ontology and the WordNet lexical database. A tool called Htab2RDF has been implemented. Experiments have been carried out to evaluate and show the efficiency of the proposed approach.
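
To make the mapping idea in this abstract concrete, here is a minimal sketch, not the Htab2RDF tool itself. It assumes a simple table whose first row holds the column headers, treats each following row as one subject, and builds predicates from header text under a placeholder namespace (`http://example.org/` is an assumption). The refining and pre-treatment phases, the domain ontology, and WordNet are all left out; the names `TableParser` and `table_to_triples` are hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects cell text row by row from <tr>/<th>/<td> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_triples(html, base="http://example.org/"):
    """Return (subject, predicate, object) tuples, one per body cell.

    Assumption: the first row is the header row; `base` is a placeholder.
    """
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    triples = []
    for i, row in enumerate(body, start=1):
        subject = f"{base}row/{i}"
        for name, value in zip(header, row):
            predicate = base + name.lower().replace(" ", "_")
            triples.append((subject, predicate, value))
    return triples


html = (
    "<table><tr><th>Name</th><th>Birth Year</th></tr>"
    "<tr><td>Ada</td><td>1815</td></tr></table>"
)
for triple in table_to_triples(html):
    print(triple)
```

A real mapping would replace the generated predicate names with terms from an ontology, which is the role the abstract assigns to the domain ontology and WordNet.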

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

A clustering approach to extract data from HTML tables

Author: Corchuelo Gil Rafael
Jiménez Aguirre Patricia
Roldán Salvador Juan Carlos
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021

HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial due to the many different layouts, encodings, and formats available. In this article, we introduce Melva, which is an unsupervised domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3,000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency.

Funding: Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-106

idUS. Depósito de Investigación Universidad de Sevilla

A hybrid quantum approach to leveraging data from HTML tables

Author: Corchuelo Gil Rafael
Jiménez Aguirre Patricia
Roldán Salvador Juan Carlos
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022

The Web provides a great deal of data encoded using HTML tables. This facilitates rendering them, but obfuscates their structure and makes it difficult for automated business processes to leverage them. This has motivated many authors to work on proposals to extract them as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach in which a standard computer is used to perform pre- and post-processing tasks and a quantum computer is used to perform the core task: guessing whether the cells have labels or values. The problem is addressed using a clustering approach that is known to be NP using standard computers, but our proposal can solve it in polynomial time, which implies a significant performance improvement. It is novel in that it relies on an entropy-preservation metaphor that has proven to work very well on two large collections of real-world tables from Wikipedia and the Dresden Web Table Corpus. Our experiments prove that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.

Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-106

idUS. Depósito de Investigación Universidad de Sevilla

DeExcelerator: A Framework for Extracting Relational Data From Partially Structured Documents

Author: Braunschweig Katrin
Dannecker Lars
Eberius Julian
Lehner Wolfgang
Thiele Maik
Werner Christopher
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 09/06/2021

Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead, the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, which is a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Technische Universität Dresden: Qucosa

TOMATE: A heuristic-based approach to extract data from HTML tables

Author: Corchuelo Gil Rafael
Jiménez Aguirre Patricia
Roldán Salvador Juan Carlos
Szekely Pedro
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021

Extracting data from user-friendly HTML tables is difficult because of their different layouts, formats, and encoding problems. In this article, we present a new proposal that first applies several pre-processing heuristics to clean the tables, then performs functional analysis, and finally applies some post-processing heuristics to produce the output. Our most important contribution is regarding functional analysis, which we address by projecting the cells onto a high-dimensional feature space in which a standard clustering technique is used to tell the meta-data cells apart from the data cells. We experimented with two large repositories of real-world HTML tables and our results confirm that our proposal can extract data from them with an F1 score of 89.50% in just 0.09 CPU seconds per table. We compared our proposal with several competitors and the statistical analysis confirmed its superiority in terms of effectiveness, while it remains very competitive in terms of efficiency.

Funding: Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-1060; Ministerio de Ciencia e Innovación PID2020-112540RB-C4
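
The functional-analysis step described in this abstract (project cells onto a feature space, then cluster) can be illustrated with a deliberately small sketch. It is not TOMATE: it assumes just two hand-picked features per cell (share of digit characters, and whether the cell sits in the first row) instead of a high-dimensional space, uses a plain 2-means loop with a deterministic start, and assumes the top-left cell is meta-data. All function names are hypothetical.

```python
def cell_features(text, row):
    """Two illustrative features: digit ratio and a first-row flag."""
    digits = sum(ch.isdigit() for ch in text)
    return (digits / max(len(text), 1), 1.0 if row == 0 else 0.0)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=20):
    """Plain 2-means; starts from the first point and the point farthest from it."""
    first = points[0]
    second = max(points, key=lambda p: sq_dist(p, first))
    centroids = [first, second]
    labels = []
    for _ in range(iterations):
        labels = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in points
        ]
        new_centroids = []
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def split_cells(table):
    """Return (meta_cells, data_cells) for a table given as a list of rows.

    Assumption: the cluster containing the top-left cell is the meta-data one.
    """
    cells = [(text, r) for r, row in enumerate(table) for text in row]
    labels = two_means([cell_features(text, r) for text, r in cells])
    meta_label = labels[0]
    meta = [text for (text, _), label in zip(cells, labels) if label == meta_label]
    data = [text for (text, _), label in zip(cells, labels) if label != meta_label]
    return meta, data


table = [
    ["Year", "Population", "Area"],
    ["2020", "684234", "140.8"],
    ["2021", "685000", "140.8"],
]
meta, data = split_cells(table)
print("meta-data cells:", meta)
print("data cells:", data)
```

With only two features, textual data cells (for example, a column of city names) would be hard to separate from headers, which suggests why a richer feature space is useful for this task.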

idUS. Depósito de Investigación Universidad de Sevilla

texreg: conversion of statistical model output in R to LaTeX and HTML tables

Author: Leifeld Philip
Publication venue: 'Foundation for Open Access Statistics'
Publication date: 01/01/2013

A recurrent task in applied statistics is the (mostly manual) preparation of model output for inclusion in LaTeX, Microsoft Word, or HTML documents – usually with more than one model presented in a single table along with several goodness-of-fit statistics. However, statistical models in R have diverse object structures and summary methods, which makes this process cumbersome. This article first develops a set of guidelines for converting statistical model output to LaTeX and HTML tables, then assesses to what extent existing packages meet these requirements, and finally presents the texreg package as a solution that meets all of the criteria set out in the beginning. After providing various usage examples, a blueprint for writing custom model extensions is proposed.

CiteSeerX

Crossref

Directory of Open Access Journals

Journal of Statistical Software

Enlighten

MPG.PuRe

UC-7 Software Engineer – Clarity LLC

Author: Mullins Amy
Publication venue: DigitalCommons@Kennesaw State University
Publication date: 27/04/2021

Clarity makes an app called CaptionMate that provides closed captions for phone calls. During the internship, a website was built to visualize metrics collected on users, such as calls made, minutes used, time active, region, age, theme, font, and platform used. Bar charts show minutes and calls used on each day of the week. 100% bar charts show how much each day contributes to app usage, how much a user contributes to minutes, calls, and platform usage, and the share of incoming vs. outgoing calls. Line graphs show growth in app usage and the number of new users over time. All the data is displayed in tables too. Who used the CaptionMate app, how much, and when? Where are these users located? What age group do they fall into? What themes and fonts do they use? What platforms are they on? Visual Studio with C# .NET Core, SQL Server, and ChartJS were used to build the website. C# was used for the backend of the website; it executed queries and stored results in lists. SQL Server was used to write queries. These results are displayed using HTML tables. ChartJS was then used to make visualizations after retrieving the necessary data from the HTML tables. Many results were found. Users use more minutes and make more calls during the week than on weekends. Users use iOS more than any other platform. They like the gray theme most out of all themes and are located all around the US. The number of users has been growing slowly since the app was released. Some users stop using the app after some time, while other users do not try to make calls or make only unsuccessful calls.

Advisor(s): Prof. Dawn Tatum
Data/Data Analytics
CSE 498

DigitalCommons@Kennesaw State University

Indexing relations on the web

Author: Freire Juliana
Mergen Sergio Luis Sardi
Publication venue: EDBT
Publication date: 01/01/2010

Journal Article

There has been a substantial increase in the volume of (semi) structured data on the Web. This opens new opportunities for exploring and querying these data that go beyond the keyword-based queries traditionally used on the Web. But supporting queries over a very large number of apparently disconnected Web sources is challenging. In this paper we propose index methods that capture both the structure of the sources and connections between them. The indexes are designed for data that is represented as relations, such as HTML tables, and support queries with predicates. We show how associations between overlapping sources are discovered, captured in the indexes, and used to derive query rewritings that join multiple sources. We demonstrate, through an experimental evaluation
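
As a rough illustration of the kind of index this abstract describes, the sketch below builds two in-memory maps over relations: one from attribute names to the sources that expose them (structure), and one from attribute/value pairs to sources (overlap). Sources that share a value under the same attribute are reported as join candidates. This is an assumed simplification, not the paper's index design; `build_indexes` and `joinable_pairs` are hypothetical names, and the sample relations are made up for the example.

```python
from collections import defaultdict


def build_indexes(relations):
    """relations: dict mapping a source name to a list of row dicts."""
    attr_index = defaultdict(set)   # attribute name -> sources that have it
    value_index = defaultdict(set)  # (attribute, value) -> sources containing it
    for source, rows in relations.items():
        for row in rows:
            for attr, value in row.items():
                attr_index[attr].add(source)
                value_index[(attr, value)].add(source)
    return attr_index, value_index


def joinable_pairs(value_index):
    """Map each pair of overlapping sources to the attributes they overlap on."""
    pairs = defaultdict(set)
    for (attr, _value), sources in value_index.items():
        ordered = sorted(sources)
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                pairs[(left, right)].add(attr)
    return dict(pairs)


# Made-up sample relations, standing in for tables scraped from Web pages.
relations = {
    "cities": [{"city": "Seville", "country": "Spain"}],
    "teams": [{"team": "Sevilla FC", "city": "Seville"}],
    "rivers": [{"river": "Elbe", "city": "Dresden"}],
}
attr_index, value_index = build_indexes(relations)
print(sorted(attr_index["city"]))
print(joinable_pairs(value_index))
```

Here `rivers` has a `city` attribute but shares no value with the other sources, so it appears in the structural index without being proposed as a join partner.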

The University of Utah: J. Willard Marriott Digital Library