
    Structured digital tables on the Semantic Web: toward a structured digital literature

    In parallel to the growth in bioscience databases, biomedical publications have increased exponentially in the past decade. However, the extraction of high-quality information from the corpus of scientific literature has been hampered by the lack of machine-interpretable content, despite text-mining advances. To address this, we propose creating a structured digital table as part of an overall effort in developing machine-readable, structured digital literature. In particular, we envision transforming publication tables into standardized triples using Semantic Web approaches. We identify three canonical types of tables (conveying information about properties, networks, and concept hierarchies) and show how more complex tables can be built from these basic types. We envision that authors would create tables initially using the structured triples for canonical types and then have them visually rendered for publication, and we present examples for converting representative tables into triples. Finally, we discuss how 'stub' versions of structured digital tables could serve as a useful bridge connecting the literature with databases, allowing the former to more precisely document the latter.
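    The core conversion the abstract describes, from a row of a "property" table to subject-predicate-object triples, can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's actual encoding; the gene record and column names below are invented examples.

```python
# Minimal sketch: one row of a property table becomes a set of
# (subject, predicate, object) triples, where each column name acts
# as the predicate. The gene record below is a hypothetical example.

def row_to_triples(subject, row):
    """Convert a dict of column -> value into (subject, predicate, object) triples."""
    return [(subject, column, value) for column, value in row.items()]

gene_row = {"organism": "S. cerevisiae", "length_aa": 423, "localization": "nucleus"}
triples = row_to_triples("gene:YAL001C", gene_row)
for t in triples:
    print(t)
# -> ('gene:YAL001C', 'organism', 'S. cerevisiae')
#    ('gene:YAL001C', 'length_aa', 423)
#    ('gene:YAL001C', 'localization', 'nucleus')
```

    In a real Semantic Web pipeline the strings would be URIs and the triples would be serialized in a standard format such as RDF/Turtle; plain tuples are used here only to keep the sketch self-contained.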

    ARDA: Automatic Relational Data Augmentation for Machine Learning

    Automatic machine learning (AutoML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.
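    The second component, pruning noisy features after the join, can be illustrated with a toy filter that keeps only features correlated with the target. ARDA's actual selection algorithm is more sophisticated; the data, feature names, and threshold below are invented for illustration.

```python
# Toy sketch of post-join feature pruning: score each candidate feature
# by |Pearson r| against the target and drop weakly correlated ones.
# The joined columns and threshold are hypothetical examples.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (0.0 if either side is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def prune_features(features, target, threshold=0.5):
    """Keep only features whose |Pearson r| with the target meets the threshold."""
    return {name: vals for name, vals in features.items()
            if abs(pearson(vals, target)) >= threshold}

joined = {
    "income": [30, 42, 55, 61, 75],  # strongly related to the target
    "noise":  [7, 3, 9, 1, 5],       # irrelevant column picked up by the join
}
target = [0, 0, 1, 1, 1]
kept = prune_features(joined, target)
print(sorted(kept))  # -> ['income']
```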

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains. (Published in Knowledge-Based Systems.)

    Table extraction using spatial reasoning on the CSS2 visual box model

    Tables on web pages contain a huge amount of semantically explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition from the Web. However, the task of table extraction from web pages is difficult because HTML was designed to convey visual rather than semantic information. In this paper, we propose a robust technique for table extraction from arbitrary web pages. This technique relies upon the positional information of visualized DOM element nodes in a browser and thereby separates the intricacies of code implementation from the actual intended visual appearance. The novel aspect of the proposed web table extraction technique is the effective use of spatial reasoning on the CSS2 visual box model, which shows a high level of robustness even without any form of learning (F-measure ≈ 90%). We describe the ideas behind our approach, the tabular pattern recognition algorithm operating on a double topographical grid structure and allowing for effective and robust extraction, and general observations on web tables that should be borne in mind by any automatic web table extraction mechanism.
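    The central intuition, that rendered boxes sharing a top edge form a row and boxes sharing a left edge form a column, can be sketched compactly. The box coordinates below are invented; in the paper's setting they would come from a browser's CSS2 visual box model, and the real algorithm operates on a double topographical grid rather than this one-dimensional clustering.

```python
# Simplified sketch of spatial table detection: cluster rendered cell
# boxes by their y coordinate (rows) and x coordinate (columns), with a
# small pixel tolerance for rendering jitter. Boxes are hypothetical.

def group_by(boxes, key, tolerance=2):
    """Cluster boxes whose `key` coordinate agrees within `tolerance` pixels."""
    groups = []
    for box in sorted(boxes, key=lambda b: b[key]):
        if groups and abs(groups[-1][0][key] - box[key]) <= tolerance:
            groups[-1].append(box)  # aligned with the current cluster
        else:
            groups.append([box])    # start a new row/column cluster
    return groups

boxes = [  # positions of four rendered cells forming a 2x2 grid
    {"x": 10, "y": 20}, {"x": 110, "y": 21},
    {"x": 10, "y": 60}, {"x": 110, "y": 60},
]
rows = group_by(boxes, "y")
cols = group_by(boxes, "x")
print(len(rows), len(cols))  # -> 2 2
```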

    A teachable semi-automatic web information extraction system based on evolved regular expression patterns

    This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model to enhance the automatic extractor. The system uses a human teacher to identify and extract relevant information from semi-structured HTML web pages. Regular expressions, which have been chosen as the pattern matching tool, are automatically generated from the training data to provide an improved grammar and lexicon. This particularly benefits the GP system, which may need to extend its lexicon when new tokens appear in the web pages; these tokens allow the GP method to produce new extraction patterns for new requirements.
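    The evolutionary loop the abstract alludes to can be sketched as a greedy search over mutated regular expressions scored against teacher-labelled examples. This is a heavily simplified stand-in for a real GP system (no crossover, a hand-written mutation table, a hill-climb instead of a population); the patterns and training snippets are invented.

```python
import re

# Toy sketch of evolving extraction regexes: score each candidate against
# teacher-labelled (text, expected value) pairs, keep the best mutant per
# generation. Mutation table and training data are hypothetical.

def fitness(pattern, examples):
    """Count training snippets where the pattern extracts the labelled value."""
    score = 0
    for text, expected in examples:
        m = re.search(pattern, text)
        if m:
            got = m.group(1) if m.lastindex else m.group(0)
            if got == expected:
                score += 1
    return score

def evolve(seed, mutations, examples, generations=3):
    """Greedy hill-climb: each generation keeps the best-scoring candidate."""
    best = seed
    for _ in range(generations):
        candidates = [best] + mutations.get(best, [])
        best = max(candidates, key=lambda p: fitness(p, examples))
    return best

TRAIN = [("Price: $19", "19"), ("Total: $249", "249"), ("Only $5!", "5")]
MUTATIONS = {r"\d": [r"\d+"], r"\d+": [r"\$(\d+)"]}

best = evolve(r"\d", MUTATIONS, TRAIN)
print(best, fitness(best, TRAIN))
```

    A real GP system would instead maintain a population of pattern trees and apply crossover and random mutation, but the fitness-driven selection step is the same in spirit.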

    Intelligent Information Systems for Web Product Search

    Over the last few years, we have experienced an increase in online shopping. Consequently, there is a need for efficient and effective product search engines. The rapid growth of e-commerce, however, has also introduced some challenges. Studies show that users can get overwhelmed by the information and offerings presented online while searching for products. In an attempt to lighten this information overload burden on consumers, several product search engines aggregate product descriptions and price information from the Web and allow the user to easily query this information. Most of these search engines expect to receive the data from the participating Web shops in a specific format, which means Web shops need to transform their data more than once, as each product search engine requires a different format. Because most product information aggregation services currently rely on Web shops to send them their data, there is a significant opportunity for solutions that tackle this problem using a more automated approach. This dissertation addresses key aspects of implementing such a system, including hierarchical product classification, entity resolution, ontology population and schema mapping, and lastly, the optimization of faceted user interfaces. The findings of this work show how one can design Web product search engines that automatically aggregate product information while allowing users to perform effective and efficient queries.
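    One of the building blocks named in the abstract, entity resolution, amounts to deciding whether two shops' listings describe the same product. A deliberately simple stand-in for the dissertation's methods is token-set Jaccard similarity over normalized titles; the product titles and threshold below are invented.

```python
# Minimal sketch of entity resolution for product listings: two titles
# are treated as the same product when the Jaccard similarity of their
# lower-cased token sets meets a threshold. Titles are hypothetical.

def tokens(title):
    return set(title.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def same_product(title_a, title_b, threshold=0.6):
    return jaccard(tokens(title_a), tokens(title_b)) >= threshold

print(same_product("Sony WH-1000XM4 Wireless Headphones",
                   "Sony Wireless Headphones WH-1000XM4"))  # -> True
print(same_product("Sony WH-1000XM4 Wireless Headphones",
                   "Apple AirPods Pro"))                    # -> False
```

    Production systems typically combine several such signals (brand, model number, price) and learn the decision boundary rather than fixing a threshold by hand.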