2,507 research outputs found

Harvesting Entities from the Web Using Unique Identifiers -- IBEX

Author: Banko M.
Baumgartner R.
Crescenzi V.
Freitag D.
Nakashole N.
Probst K.
Putthividhya D.
Talaika A.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 04/05/2015

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73–96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.

Comment: 30 pages, 5 figures, 9 tables. Complete technical report for A. Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting Entities from the Web Using Unique Identifiers. WebDB workshop, 201

arXiv.org e-Print Archive

CiteSeerX

Crossref
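
The IBEX abstract above says that properties of unique identifiers can be used to filter noise out of a raw extraction. One such property, for ISBN-13 and GTIN-13, is the trailing check digit. The following is a minimal sketch of that idea only, not the authors' pipeline: the regular expression and the function names are assumptions made for illustration.

```python
import re

# Hypothetical candidate pattern: 13 digits, optionally separated
# by single hyphens or spaces (an assumption, not taken from the paper).
CANDIDATE = re.compile(r"\b\d(?:[- ]?\d){12}\b")


def gtin13_is_valid(digits: str) -> bool:
    """Check the GTIN-13 / ISBN-13 check digit (weights 1, 3, 1, 3, ...)."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


def harvest_gtin13(text: str) -> list[str]:
    """Return candidate identifiers in the text that pass the checksum."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group())
        if gtin13_is_valid(digits):
            found.append(digits)
    return found
```

A digit string that merely looks like an identifier (a phone number, a price list) fails the checksum in roughly nine cases out of ten, which is why a check of this kind is a cheap first noise filter.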

Intelligent Information Access to Linked Data - Weaving the Cultural Heritage Web

Author: Kummer Robert
Publication date: 01/01/2013

The subject of the dissertation is an information alignment experiment (ALAP) between two cultural heritage information systems: the Perseus Digital Library and Arachne. In modern societies, information integration is gaining importance for many tasks such as business decision making or even catastrophe management. It is beyond doubt that the information available in digital form can offer users new ways of interaction. In the humanities and cultural heritage communities, too, more and more information is being published online. But in many situations the way that information has been made publicly available is disruptive to the research process due to its heterogeneity and distribution. Integrated information will therefore be a key factor in pursuing successful research, and the need for information alignment is widely recognized. ALAP is an attempt to integrate information from Perseus and Arachne, not only on a schema level, but also by performing entity resolution. To that end, technical peculiarities and philosophical implications of the concepts of identity and co-reference are discussed. Multiple approaches to information integration and entity resolution are discussed and evaluated. The methodology that is used to implement ALAP is mainly rooted in the fields of information retrieval and knowledge discovery. First, an exploratory analysis was performed on both information systems to get a first impression of the data. After that, (semi-)structured information from both systems was extracted and normalized. Then, a clustering algorithm was used to reduce the number of needed entity comparisons. Finally, a thorough matching was performed on the different clusters. ALAP helped with identifying challenges and highlighted the opportunities that arise during the attempt to align cultural heritage information systems.

Kölner UniversitätsPublikationsServer
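
The last two steps in the abstract above (group records first to cut down the number of comparisons, then match thoroughly inside each group) can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: the blocking key (first three alphanumeric characters), the similarity measure (difflib's ratio), the threshold, and the record format are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(name: str) -> str:
    """Assumed cheap grouping key: first three alphanumeric characters."""
    return "".join(ch for ch in name.lower() if ch.isalnum())[:3]


def candidate_pairs(records: dict):
    """Yield only pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, name in records.items():
        blocks[blocking_key(name)].append(rec_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def match(records: dict, threshold: float = 0.8) -> list:
    """Compare names pairwise inside each block; keep pairs above threshold."""
    matches = []
    for a, b in candidate_pairs(records):
        score = SequenceMatcher(None, records[a].lower(), records[b].lower()).ratio()
        if score >= threshold:
            matches.append((a, b))
    return matches
```

With n records, comparing every pair costs n(n-1)/2 comparisons; grouping first means only records inside the same group are compared, at the price of missing matches whose keys differ.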

A structural SVM approach for reference parsing

Author: A Takasu
AK McCallum
AR Aronson
CC Chang
D Besagni
D Lee
Daniel X Le
E Cortez
E Herbst
F Parmentier
G Chowdhury
George R Thoma
I Kim
I Tsochantaridis
IA Huang
IG Councill
J Lafferty
J Zou
Jie Zou
MY Day
S Lawrence
T Joachims
T Okada
Xiaoli Zhang
Y Ding
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

PubMed Central

Improved bibliographic reference parsing based on repeated patterns

Author: Böhm Klemens
Sautter Guido
Publication date: 30/04/2014

uploaded by Plaz

ZENODO

RefConcile – automated online reconciliation of bibliographic references

Author: A. Polaszek
D. Defays
D. Geer
G. Sautter
H. Köpcke
J. Beall
K. Davies
K.S. Jones
M.A. Jaro
T. Blakely
V.I. Levenshtein
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

Comprehensive bibliographies often rely on community contributions. In such a setting, de-duplication is mandatory for the bibliography to be useful. Ideally, it works online, i.e., during the addition of new references, so the bibliography remains duplicate-free at all times. While de-duplication is well researched, generic approaches do not achieve the result quality required for automated reconciliation. To overcome this problem, we propose a new duplicate detection and reconciliation technique called RefConcile. Aimed specifically at bibliographic references, it uses dedicated blocking and matching techniques tailored to this type of data. Our evaluation, based on a large real-world collection of bibliographic references, shows that RefConcile scales well and that it detects and reconciles duplicates with high accuracy.

Crossref

Open Research Online (The Open University)
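
The RefConcile abstract above describes de-duplication that runs online, at the moment a reference is added. The sketch below shows that general pattern only and is not RefConcile itself: the class name, the blocking key (year and first-author surname), the Jaccard title similarity, and the threshold are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokens(text: str) -> set:
    """Lowercased alphanumeric word set of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Overlap of two token sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class Bibliography:
    """Hypothetical store that rejects near-duplicate titles on insertion."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.blocks = defaultdict(list)  # (year, surname) -> list of titles

    def add(self, surname: str, year: int, title: str):
        """Return the existing duplicate title, or None if the title was added."""
        key = (year, surname.lower())
        new_tokens = tokens(title)
        for existing in self.blocks[key]:
            if jaccard(new_tokens, tokens(existing)) >= self.threshold:
                return existing
        self.blocks[key].append(title)
        return None
```

Because each new reference is compared only against entries with the same key, the cost of an insertion depends on the size of one block instead of the size of the whole bibliography.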

Sailing the Information Ocean with Awareness of Currents: Discovery and Application of Source Dependence

Author: Berti-Equille Laure
Dong
Marian Amelie
Sarma Anish Das
Srivastava Divesh
Xin
Publication date: 01/01/2009

The Web has enabled the availability of a huge amount of useful information, but has also eased the ability to spread false information and rumors across multiple sources, making it hard to distinguish between what is true and what is not. Recent examples include the premature Steve Jobs obituary, the second bankruptcy of United Airlines, the creation of black holes by the operation of the Large Hadron Collider, etc. Since it is important to permit the expression of dissenting and conflicting opinions, it would be a fallacy to try to ensure that the Web provides only consistent information. However, to help in separating the wheat from the chaff, it is essential to be able to determine dependence between sources. Given the huge number of data sources and the vast volume of conflicting data available on the Web, doing so in a scalable manner is extremely challenging and has not been addressed by existing work yet. In this paper, we present a set of research problems and propose some preliminary solutions on the issues involved in discovering dependence between sources. We also discuss how this knowledge can benefit a variety of technologies, such as data integration and Web 2.0, that help users manage and access the totality of the available information from various sources.

Comment: CIDR 200

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

The Syllabus Based Web Content Extractor (SBWCE)

Author: Hilal Saba
Rizvi S.A.M.
Publication venue: CSUSB ScholarWorks
Publication date: 02/06/2014

The Syllabus Based Web Content Extractor (SBWCE) introduces a new technique of Syllabus Based Web Content Mining. It makes Syllabus Based Web Content Extraction easy and creates an instant online book view based on the links relevant to the given Syllabus. The current work makes three contributions. First, since Syllabus based content requires educational information in multiple formats, the technique makes such content easier to find. Second, it uses a new approach for capturing and recording the heuristics that experts apply while searching. Third, it exploits the grouping of Syllabus Words for precise extraction. This paper introduces SBWCE and presents the related details.

CSUSB ScholarWorks