Automatic extraction of knowledge from web documents

Author: Alani Harith
Hall Wendy
Kim Sanghee
Lewis Paul H.
Millard David E.
Shadbolt Nigel R.
Weal Mark J.
Publication venue
Publication date: 01/01/2003
Field of study

A large amount of the digital information available is written as text documents in the form of web pages, reports, papers, emails, etc. Extracting the knowledge of interest from such documents, drawn from multiple sources, in a timely fashion is therefore crucial. This paper provides an update on the Artequakt system, which uses natural language tools to automatically extract knowledge about artists from multiple documents based on a predefined ontology. The ontology represents the type and form of knowledge to extract. This knowledge is then used to generate tailored biographies. The information extraction process of Artequakt is detailed and evaluated in this paper.

CiteSeerX

Southampton (e-Prints Soton)

Open Research Online (The Open University)

ExaCT: automatic extraction of clinical trial characteristics from journal publications

Author: Carini Simona
de Bruijn Berry
Kiritchenko Svetlana
Martin Joel
Sim Ida
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Background: Clinical trials are one of the most important sources of evidence for guiding evidence-based practice and the design of new trials. However, most of this information is available only in free text (e.g., in journal publications), which is labour intensive to process for systematic reviews, meta-analyses, and other evidence synthesis studies. This paper presents an automatic information extraction system, called ExaCT, that assists users with locating and extracting key trial characteristics (e.g., eligibility criteria, sample size, drug dosage, primary outcomes) from full-text journal articles reporting on randomized controlled trials (RCTs).
Methods: ExaCT consists of two parts: an information extraction (IE) engine that searches the article for text fragments that best describe the trial characteristics, and a web browser-based user interface that allows human reviewers to assess and modify the suggested selections. The IE engine uses a statistical text classifier to locate those sentences that have the highest probability of describing a trial characteristic. Then, the IE engine's second stage applies simple rules to these sentences to extract text fragments containing the target answer. The same approach is used for all 21 trial characteristics selected for this study.
Results: We evaluated ExaCT using 50 previously unseen articles describing RCTs. The text classifier (first stage) was able to recover 88% of relevant sentences among its top five candidates (top5 recall), with the topmost candidate being relevant in 80% of cases (top1 precision). Precision and recall of the extraction rules (second stage) were 93% and 91%, respectively. Together, the two stages of the extraction engine were able to provide (partially) correct solutions in 992 out of 1050 test tasks (94%), with a majority of these (696) representing fully correct and complete answers.
Conclusions: Our experiments confirmed the applicability and efficacy of ExaCT. Furthermore, they demonstrated that combining a statistical method with 'weak' extraction rules can identify a variety of study characteristics. The system is flexible and can be extended to handle other characteristics and document types (e.g., study protocols).
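
The two-stage design described in the abstract above (rank sentences, then apply a weak rule to the top candidates) can be illustrated with a minimal Python sketch. This is not ExaCT's implementation: the keyword weights stand in for the statistical text classifier, and the regular expression is a hypothetical rule for a single characteristic (sample size). All names below are assumptions made for illustration.

```python
import re

# ASSUMPTION: hand-picked keyword weights standing in for a trained
# statistical sentence classifier; these are not ExaCT's features.
SAMPLE_SIZE_WEIGHTS = {
    "randomized": 2.0,
    "patients": 1.5,
    "participants": 1.5,
    "enrolled": 1.0,
}

# ASSUMPTION: one hypothetical 'weak' rule, a number followed by a participant noun.
SAMPLE_SIZE_RULE = re.compile(
    r"\b(\d[\d,]*)\s+(?:patients|participants|subjects)\b", re.IGNORECASE
)


def score_sentence(sentence, weights):
    """Stage 1 scoring: sum the weights of the known keywords in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(weights.get(token, 0.0) for token in tokens)


def top_candidates(sentences, weights, k=5):
    """Stage 1: keep the k highest-scoring sentences (ties keep document order)."""
    ranked = sorted(sentences, key=lambda s: score_sentence(s, weights), reverse=True)
    return ranked[:k]


def extract_fragment(candidates, rule):
    """Stage 2: return the first rule match among the ranked candidates, or None."""
    for sentence in candidates:
        match = rule.search(sentence)
        if match:
            return match.group(1)
    return None


def extract_sample_size(sentences, k=5):
    """Run both stages for the sample-size characteristic."""
    candidates = top_candidates(sentences, SAMPLE_SIZE_WEIGHTS, k)
    return extract_fragment(candidates, SAMPLE_SIZE_RULE)
```

Because the rule only runs on the shortlisted sentences, it can stay simple: the ranking stage is what keeps it away from unrelated numbers elsewhere in the article.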

NRC Publications Archive

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

BlogForever D2.6: Data Extraction Methodology

Author: Banos V.
Davis R.
Gkotsis G.
Pincent E.
Stepanyan K.
Publication venue
Publication date: 25/10/2013
Field of study

This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Web Data Extraction, Applications and Techniques: A Survey

Author: Emilio Ferrara
Giacomo Fiumara
Pasquale De Meo
Robert Baumgartner
Publication venue: Elsevier BV
Publication date: 09/06/2014
Field of study

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, and this offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-based System

arXiv.org e-Print Archive

Crossref

Intelligent Self-Repairable Web Wrappers

Author: A. Laender
B. Chidlovskii
E. Ferrara
K. Lerman
N. Kushmerick
P. Bille
R. Baumgartner
S. Sarawagi
S. Selkow
W. Yang
X. Meng
Y. Kim
Publication venue
Publication date: 01/01/2011
Field of study

The amount of information available on the Web grows at an incredibly high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically cope with possible malfunctions or failures. On the other, the literature lacks solutions for the maintenance of these systems. Procedures that extract Web data may be tightly coupled to the structure of the data source itself; thus, malfunctions or the acquisition of corrupted data could be caused, for example, by structural modifications made to data sources by their owners. Nowadays, verification of data integrity and maintenance are mostly managed manually, in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to create procedures able to extract data from Web sources -- the so-called Web wrappers -- which can cope with malfunctions caused by modifications of the structure of the data source, and can automatically repair themselves.

arXiv.org e-Print Archive

Crossref

CogPrints Cognitive Sciences Eprint Archive

The impact of learning styles on student grouping for collaborative learning: a case study

Author: Alfonseca Cubero Enrique
Carro Rosa M.
Martín Estefanía
Ortigosa Álvaro
Paredes Barragán Pedro
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2006
Field of study

The final publication is available at Springer via http://dx.doi.org/10.1007/s11257-006-9012-7. Learning style models constitute a valuable tool for improving individual learning by the use of adaptation techniques based on them. In this paper, we present how the benefit of considering learning styles for adaptation purposes, as part of the user model, can be extended to the context of collaborative learning as a key feature for group formation. We explore the effects that combining students with different learning styles in specific groups may have on the final results of the tasks they accomplish collaboratively. With this aim, a case study with 166 students of computer science has been carried out, from which conclusions are drawn. We also describe how an existing web-based system can take advantage of learning style information in order to form more productive groups. Our ongoing work concerning the automatic extraction of grouping rules starting from data about previous interactions within the system is also outlined. Finally, we present our challenges, related to the continuous improvement of collaboration by the use and dynamic modification of automatic grouping rules. This project has been funded by the Spanish Ministry of Science and Education, TIN2004-03140.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

An Infrastructure for acquiring high quality semantic metadata

Author: B. Popov
C. Fellbaum
F. Harmelen van
L. Stojanovic
M. Vargas-Vera
N.F. Noy
T. Berners-Lee
V. Lopez
Y. Sure
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2006
Field of study

Because metadata that underlies semantic web applications is gathered from distributed and heterogeneous data sources, it is important to ensure its quality (i.e., reduce duplicates, spelling errors, ambiguities). However, current infrastructures that acquire and integrate semantic data have only marginally addressed the issue of metadata quality. In this paper we present our metadata acquisition infrastructure, ASDI, which pays special attention to ensuring that high quality metadata is derived. Central to the architecture of ASDI is a verification engine that relies on several semantic web tools to check the quality of the derived data. We tested our prototype in the context of building a semantic web portal for our lab, KMi. An experimental evaluation comparing the automatically extracted data against manual annotations indicates that the verification engine enhances the quality of the extracted semantic metadata.

CiteSeerX

Crossref

Open Research Online (The Open University)