
4 research outputs found

Clustering header categories extracted from web tables

Author: Abu-Tarif
Adelfio
Astrakhantsev
Ball
Bing
Chandran
Costa e Silva
Dalvi
Embley
Gatterbauer
Gonzalez
Handley
Hirayama
Hu
Hurst
Hurst
Itonori
Kieninger
Kieninger
Laurentini
Long
McQueen
Pinto
Pyreddy
Shamalian
Venetis
Wang
Zuyev
Publication venue: 'SPIE-Intl Soc Optical Eng'

Crossref

Multi-hypothesis CSV parsing

Author: Boncz P.A. (Peter)
Döhmen T.R. (Till)
Mühleisen H.F. (Hannes)
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 27/06/2017

Crossref

CWI's Institutional Repository

Information Extraction and Classification on Journal Papers

Author: Yu Lei
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/11/2021

The importance of journals for diffusing the results of scientific research has increased considerably. In the digital era, Portable Document Format (PDF) became the established format for electronic journal articles. This structured form, combined with regular and wide dissemination, spreads scientific advancements easily and quickly. However, the rapidly increasing number of published scientific articles requires more time and effort for systematic literature reviews, searches, and screening. Comprehending and extracting useful information from digital documents is also challenging because of the complex structure of PDF. To help a soil science team from the United States Department of Agriculture (USDA) build a queryable journal paper system, we used a web crawler to download soil science articles from the digital library. We applied named entity recognition and table analysis to extract useful information, including authors, journal name and type, publication date, abstract, DOI, and experiment location, and to highlight paper characteristics in a computer-queryable format in the system. Text classification is applied to identify the parts of interest to users and save them search time. We used traditional machine learning techniques, including logistic regression, support vector machine, decision tree, naive Bayes, k-nearest neighbors, random forest, ensemble modeling, and neural networks, for text classification and compare the advantages of these approaches at the end. Advisor: Stephen D. Scot

DigitalCommons@University of Nebraska

Clustering header categories extracted from web tables

Author: Embley David W.
Krishnamoorthy Mukkai
Nagy George
Seth Sharad C.
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/02/2015

Revealing related content among heterogeneous web tables is part of our long-term objective of formulating queries over multiple sources of information. Two hundred HTML tables from institutional web sites are segmented, and each table cell is classified according to the fundamental indexing property of row and column headers. The categories that correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row/column) headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category headers (and also table titles) are computed. We show how about one-third of our heterogeneous collection can be clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
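
The Jaccard distance between category headers mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenization, the lowercasing, the sample header strings, and the function names (`tokenize`, `jaccard_distance`, `pairwise_distances`) are all assumptions made for illustration, and the abstract does not say how headers were tokenized or which clustering method was applied afterwards.

```python
from itertools import combinations


def tokenize(header: str) -> set[str]:
    # Assumption: a header is compared as a set of lowercased,
    # whitespace-separated words. The abstract does not specify this.
    return set(header.lower().split())


def jaccard_distance(a: set[str], b: set[str]) -> float:
    # Jaccard distance = 1 - |A intersect B| / |A union B|.
    # Two empty sets are treated as identical (distance 0.0).
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(headers: dict[str, str]) -> dict[tuple[str, str], float]:
    # Compute the distance for every unordered pair of tables,
    # keyed by the (alphabetically ordered) pair of table names.
    token_sets = {name: tokenize(text) for name, text in headers.items()}
    return {
        (x, y): jaccard_distance(token_sets[x], token_sets[y])
        for x, y in combinations(sorted(token_sets), 2)
    }


# Hypothetical category headers, invented for this example.
headers = {
    "table_a": "Year Region Population",
    "table_b": "Year Region Income",
    "table_c": "Species Habitat",
}

for pair, distance in pairwise_distances(headers).items():
    print(pair, round(distance, 2))
```

Here `table_a` and `table_b` share two of four distinct words, so their distance is 0.5, while `table_c` shares nothing with either and sits at the maximum distance of 1.0. A clustering step would then group tables whose pairwise distances are small.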

Crossref

DigitalCommons@University of Nebraska