1,047 research outputs found

    HTML Tables

    Get PDF

    Htab2RDF: Mapping HTML Tables to RDF Triples

    Get PDF
    The Web has become a tremendously huge data source hidden under linked documents. A significant number of Web documents include HTML tables generated dynamically from relational databases. Often, there is no direct public access to the databases themselves. On the other hand, RDF (Resource Description Framework) gives an efficient mechanism to represent directly data on the Web based on a Web-scalable architecture for identification and interpretation of terms. This leads to the concept of Linked Data on the Web. To allow direct access to data on the Web as Linked Data, we propose in this paper an approach to transform HTML tables into RDF triples. It consists of three main phases: refining, pre-treatment and mapping. The whole process is assisted by a domain ontology and the WordNet lexical database. A tool called Htab2RDF has been implemented. Experiments have been carried out to evaluate and show efficiency of the proposed approach

    A clustering approach to extract data from HTML tables

    Get PDF
    HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial due to the many different layouts, encodings, and formats available. In this article, we introduce Melva, which is an unsupervised domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps make label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3 000 HTML tables from the Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiencyMinisterio de Ciencia e Innovación PID2020-112540RB-C44Ministerio de Economía y Competitividad TIN2016-75394-RJunta de Andalucía P18-RT-106

    A hybrid quantum approach to leveraging data from HTML tables

    Get PDF
    The Web provides many data that are encoded using HTML tables. This facilitates rendering them, but obfuscates their structure and makes it difficult for automated business processes to leverage them. This has motivated many authors to work on proposals to extract them as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach in which a standard computer is used to perform pre and post-processing tasks and a quantum computer is used to perform the core task: guessing whether the cells have labels or values. The problem is addressed using a clustering approach that is known to be NP using standard computers, but our proposal can solve it in polynomial time, which implies a significant performance improvement. It is novel in that it relies on an entropy-preservation metaphor that has proven to work very well on two large collections of real-world tables from the Wikipedia and the Dresden Web Table Corpus. Our experiments prove that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.Ministerio de Economía y Competitividad TIN2016-75394-RMinisterio de Ciencia e Innovación PID2020-112540RB-C44Junta de Andalucía P18-RT-106

    DeExcelerator: A Framework for Extracting Relational Data From Partially Structured Documents

    Get PDF
    Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, which is a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables

    TOMATE: A heuristic-based approach to extract data from HTML tables

    Get PDF
    Extracting data from user-friendly HTML tables is difficult because of their different lay outs, formats, and encoding problems. In this article, we present a new proposal that first applies several pre-processing heuristics to clean the tables, then performs functional anal ysis, and finally applies some post-processing heuristics to produce the output. Our most important contribution is regarding functional analysis, which we address by projecting the cells onto a high-dimensional feature space in which a standard clustering technique is used to make the meta-data cells apart from the data cells. We experimented with two large repositories of real-world HTML tables and our results confirm that our proposal can extract data from them with an F1 score of 89:50% in just 0:09 CPU seconds per table. We confronted our proposal with several competitors and the statistical analysis confirmed its superiority in terms of effectiveness, while it keeps very competitive in terms of efficiency.Ministerio de Economía y Competitividad TIN2013-40848-RMinisterio de Economía y Competitividad TIN2016-75394-RJunta de Andalucía P18-RT-1060Ministerio de Ciencia e Innovación PID2020-112540RB-C4

    texreg: conversion of statistical model output in R to LaTeX and HTML tables

    Get PDF
    A recurrent task in applied statistics is the (mostly manual) preparation of model output for inclusion in LaTeX, Microsoft Word, or HTML documents – usually with more than one model presented in a single table along with several goodness-of-fit statistics. However, statistical models in R have diverse object structures and summary methods, which makes this process cumbersome. This article first develops a set of guidelines for converting statistical model output to LaTeX and HTML tables, then assesses to what extent existing packages meet these requirements, and finally presents the texreg package as a solution that meets all of the criteria set out in the beginning. After providing various usage examples, a blueprint for writing custom model extensions is proposed

    UC-7 Software Engineer – Clarity LLC

    Get PDF
    Clarity makes an app called CaptionMate that does closed captions for phone calls.. During the internship, a website was made to visualize metrics that are collected on users such as calls made, minutes used, time active, region, age, theme, font, and platform used. Bar charts are used to show minutes and calls used on days of the week. 100% bar charts are used to show how much a day contributes to the usage of the app; a user contributes to minutes, calls and platform usage; and show calls incoming vs. outgoing. Line graphs were created to show growth in app usage and number of new users over time. All the data iis displayed in tables too. Who used the CaptionMate app, how much and when? Where are these users located? what age group do they fall into? what themes and fonts do they use? What platforms are they on? Visual studio with C# .NET Core, SQL Server, and ChartJS were used in the making of the website. C# was used for the backend of the website; it executed queries and stored results in lists. SQL Server was used to write queries. These results are displayed using HTML tables. ChartJS was then used to make visualizations after retreiving necessary data form the HTML tables. Many results were found. Users use more minutes and make more calls during the week then on weekends. Users use IOS more than any other platform. They like the gray theme most out of all themes and are located all around the US. The number of users has been growing slowly since the app was released. Some users stop using the app after some time of using it while other users do not try to make calls or make only unsuccessful calls.Advisors(s): Prof. Dawn Tatum [email protected](s): Data/Data AnalyticsCSE 498

    Indexing relations on the web

    Get PDF
    Journal ArticleThere has been a substantial increase in the volume of (semi) structured data on the Web. This opens new opportunities for exploring and querying these data that goes beyond the keyword-based queries traditionally used on the Web. But supporting queries over a very large number of apparently disconnected Web sources is challenging. In this paper we propose index methods that capture both the structure of the sources and connections between them. The indexes are designed for data that is represented as relations, such as HTML tables, and support queries with predicates. We show how associations between overlapping sources are discovered, captured in the indexes, and used to derive query rewritings that join multiple sources. We demonstrate, through an experimental evaluation
    corecore