
22 research outputs found

Identifying Web Tables - Supporting a Neglected Type of Content on the Web

Author: A Silva
D Embley
J Hu
M Kolchin
MA Babyak
Y Tijerino
Publication date: 23/03/2015

The abundance of data on the Internet facilitates the improvement of extraction and processing tools. The trend in open data publishing encourages the adoption of structured formats like CSV and RDF. However, there is still a plethora of unstructured data on the Web which we assume contains semantics. For this reason, we propose an approach to derive semantics from web tables, which are still the most popular publishing tool on the Web. The paper also discusses methods and services for unstructured data extraction and processing, as well as machine learning techniques to enhance such a workflow. The eventual result is a framework to process, publish and visualize linked open data. The software enables table extraction from various open data sources in HTML format and automatic export to RDF, making the data linked. The paper also evaluates machine learning techniques in conjunction with string similarity functions applied to a table recognition task.
Comment: 9 pages, 4 figures

arXiv.org e-Print Archive

Crossref

Web Search using Improved Concept Based Query Refinement

Author: Suresh Ralla
V Swetha
Vemuri Saritha
Publication venue: Institute for Project Management Pvt. Ltd
Publication date: 29/08/2020

The information extracted from Web pages can be used for effective query expansion. Improving the accuracy of web search engines requires the inclusion of metadata, not only to analyze Web content but also to interpret it. With the Web of today being unstructured and semantically heterogeneous, keyword-based queries are likely to miss important results. Using data mining methods, our system derives dependency rules and applies them to concept-based queries. This paper presents a novel approach for query expansion that applies dependency rules mined from a large Web World, combining several existing techniques for data extraction and mining, to integrate the system into COMPACT, our prototype implementation of a concept-based search engine.

Interscience Research Network

On-the-fly Table Generation

Author: Nguyen Thanh Tam
Sekhavat Yoones A.
Yahya Mohamed
Yin Pengcheng
Zwicklbauer Stefan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/05/2018

Many information needs revolve around entities, which would be better answered by summarizing results in a tabular format, rather than presenting them as a ranked list. Unlike previous work, which is limited to retrieving existing tables, we aim to answer queries by automatically compiling a table in response to a query. We introduce and address the task of on-the-fly table generation: given a query, generate a relational table that contains relevant entities (as rows) along with their key properties (as columns). This problem is decomposed into three specific subtasks: (i) core column entity ranking, (ii) schema determination, and (iii) value lookup. We employ a feature-based approach for entity ranking and schema determination, combining deep semantic features with task-specific signals. We further show that these two subtasks are not independent of each other and can assist each other in an iterative manner. For value lookup, we combine information from existing tables and a knowledge base. Using two sets of entity-oriented queries, we evaluate our approach both on the component level and on the end-to-end table generation task.
Comment: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval

arXiv.org e-Print Archive

Crossref

Automatic Web Table Transcoding for Mobile Devices Based on Table Classification

Author: 周清江
Publication venue: International Association for Development of the Information Society(IADIS)

Many techniques have been proposed to improve web browsing experiences on mobile devices by transcoding the original web content. However, the original semantics of web tables tend to be broken in the transcoded results. We capture basic features of web tables from their DOM-tree (Document Object Model Tree) semantic information. We propose a new table feature called Cell Extension Direction (CED) to capture the extension direction of cell content as one-directional or bi-directional. CED is computed by checking the difference between the average composite object type (ACOT) of rows and that of columns. These features are used to classify web tables into data tables and layout tables. The classification results, along with the CC/PP configurations of the mobile device, are then used to guide the application of the following three transcoding strategies for tables: zooming, transposition, and one-column-view. We demonstrate that the table semantics can be preserved in the transcoding results.
Conference type: International
Conference date: 20110722~2011072

Tamkang University Institutional Repository

Automatic Wrapper Generation and Maintenance

Author: Ge Fujiang
Xia Yingju
Yang Yuhang
Zhang Shu
Publication venue: Yu, Hao
Publication date: 01/01/2011

Waseda University Repository

Cell Classification for Layout Recognition in Spreadsheets

Author: Koci Elvis
Lehner Wolfgang
Romero Oscar
Thiele Maik
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 28/07/2021

Spreadsheets compose a notably large and valuable dataset of documents within the enterprise settings and on the Web. Although spreadsheets are intuitive to use and equipped with powerful functionalities, extracting and reusing data from them remains a cumbersome and mostly manual task. Their greatest strength, the large degree of freedom they provide to the user, is at the same time also their greatest weakness, since data can be arbitrarily structured. Therefore, in this paper we propose a supervised learning approach for layout recognition in spreadsheets. We work on the cell level, aiming at predicting their correct layout role, out of five predefined alternatives. For this task we have considered a large number of features not covered before by related work. Moreover, we gather a considerably large dataset of annotated cells, from spreadsheets exhibiting variability in format and content. Our experiments, with five different classification algorithms, show that we can predict cell layout roles with high accuracy. Subsequently, in this paper we focus on revising the classification results, with the aim of repairing misclassifications. We propose a sophisticated approach, composed of three steps, which effectively corrects a reasonable number of inaccurate predictions

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Technische Universität Dresden: Qucosa

Automatic Table Recognition and Extraction from Heterogeneous Documents

Author: B A Ojokoh
F F Babatunde
S A Oluwadare
Publication date: 24/04/2020

This paper examines automatic recognition and extraction of tables from a large collection of heterogeneous documents. The heterogeneous documents are initially pre-processed and converted to HTML code, after which an algorithm recognises the table portion of the documents. A Hidden Markov Model (HMM) is then applied to the HTML code in order to extract the tables. The model was trained and tested with five hundred and twenty-six (526) self-generated tables: three hundred and twenty-one (321) tables for training and two hundred and five (205) tables for testing. The Viterbi algorithm was implemented for the testing part. The system was evaluated in terms of accuracy, precision, recall and F-measure. The overall evaluation results show 88.8% accuracy, 96.8% precision, 91.7% recall and 88.8% F-measure, revealing that the method is good at solving the problem of table extraction.

CiteSeerX
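
The abstract above mentions decoding an HMM over HTML code with the Viterbi algorithm but gives no model details. The following is a minimal sketch of generic Viterbi decoding, not the authors' implementation. The two states (`table`, `other`), the token alphabet (`td`, `p`) and every probability are made-up assumptions chosen only to make the example runnable.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: all names and numbers below are illustrative assumptions.
STATES = ["table", "other"]
START_P = {"table": 0.3, "other": 0.7}
TRANS_P = {
    "table": {"table": 0.8, "other": 0.2},
    "other": {"table": 0.2, "other": 0.8},
}
EMIT_P = {
    "table": {"td": 0.9, "p": 0.1},
    "other": {"td": 0.1, "p": 0.9},
}

decoded = viterbi(["p", "td", "td", "p"], STATES, START_P, TRANS_P, EMIT_P)
print(decoded)
```

With these toy parameters the two `td` tokens are labelled `table` and the surrounding `p` tokens are labelled `other`. Log-probabilities are used to avoid numeric underflow on long sequences, which assumes that no probability in the model is exactly zero.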