
    Harvesting Entities from the Web Using Unique Identifiers -- IBEX

    In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
    Comment: 30 pages, 5 figures, 9 tables. Complete technical report for A. Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting Entities from the Web Using Unique Identifiers. WebDB workshop, 2015.
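
    The filtering step the abstract describes rests on a useful property of many identifier schemes: they carry internal checksums. The following is a minimal sketch, not the authors' implementation, of how an ISBN-13 extractor could use the checksum to discard noisy candidates; the regex and function names are illustrative assumptions.

```python
import re

# Hypothetical sketch: extract ISBN-13 candidates from page text, then use the
# identifier's built-in checksum to filter out extraction noise.
ISBN13_RE = re.compile(r"\b(97[89][-\s]?(?:\d[-\s]?){9}\d)\b")

def isbn13_checksum_ok(candidate: str) -> bool:
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) != 13:
        return False
    # ISBN-13 rule: digits weighted 1,3,1,3,... must sum to a multiple of 10.
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

def harvest_isbns(page_text: str) -> list[str]:
    # Keep only candidates whose checksum validates.
    return [m for m in ISBN13_RE.findall(page_text) if isbn13_checksum_ok(m)]

print(harvest_isbns("Order 978-3-16-148410-0 today, not 978-3-16-148410-1."))
# ['978-3-16-148410-0']  (the second candidate fails the checksum)
```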

    Intelligent Information Access to Linked Data - Weaving the Cultural Heritage Web

    The subject of this dissertation is an information alignment experiment between two cultural heritage information systems (ALAP): the Perseus Digital Library and Arachne. In modern societies, information integration is gaining importance for many tasks, such as business decision making or even catastrophe management. It is beyond doubt that information available in digital form can offer users new ways of interaction. In the humanities and cultural heritage communities, too, more and more information is being published online. But in many situations, the way that information has been made publicly available is disruptive to the research process due to its heterogeneity and distribution. Integrated information will therefore be a key factor in successful research, and the need for information alignment is widely recognized. ALAP is an attempt to integrate information from Perseus and Arachne, not only on the schema level but also by performing entity resolution. To that end, the technical peculiarities and philosophical implications of the concepts of identity and co-reference are discussed. Multiple approaches to information integration and entity resolution are presented and evaluated. The methodology used to implement ALAP is rooted mainly in the fields of information retrieval and knowledge discovery. First, an exploratory analysis was performed on both information systems to get a first impression of the data. Then, (semi-)structured information from both systems was extracted and normalized. Next, a clustering algorithm was used to reduce the number of entity comparisons needed. Finally, a thorough matching was performed within the resulting clusters. ALAP helped identify challenges and highlighted opportunities that arise in attempts to align cultural heritage information systems.
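
    The cluster-then-match pipeline the abstract outlines is a standard way to avoid comparing every record against every other. Below is a minimal sketch of that idea under assumed record fields ("title"); the blocking key and similarity threshold are illustrative choices, not ALAP's actual parameters.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(record: dict) -> str:
    # Cheap normalization: first token of the title, lowercased.
    return record["title"].lower().split()[0]

def match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio() >= threshold

def resolve(records: list[dict]) -> list[tuple[dict, dict]]:
    # Clustering (blocking) step: only records sharing a key are compared.
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    pairs = []
    for group in blocks.values():
        # Thorough pairwise matching happens only inside each block.
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if match(group[i], group[j]):
                    pairs.append((group[i], group[j]))
    return pairs
```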

    Improved bibliographic reference parsing based on repeated patterns


    RefConcile – automated online reconciliation of bibliographic references

    Comprehensive bibliographies often rely on community contributions. In such a setting, de-duplication is mandatory for the bibliography to be useful. Ideally, it works online, i.e., during the addition of new references, so the bibliography remains duplicate-free at all times. While de-duplication is well researched, generic approaches do not achieve the result quality required for automated reconciliation. To overcome this problem, we propose a new duplicate detection and reconciliation technique called RefConcile. Aimed specifically at bibliographic references, it uses dedicated blocking and matching techniques tailored to this type of data. Our evaluation, based on a large real-world collection of bibliographic references, shows that RefConcile scales well and detects and reconciles duplicates with high accuracy.
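
    The online setting means each new reference is checked at insertion time rather than in a batch pass. The following is a minimal sketch of that flow, assuming references with 'authors', 'year', and 'title' fields; the blocking key (first-author surname plus year) and the 0.9 threshold are illustrative assumptions, not RefConcile's actual techniques.

```python
from difflib import SequenceMatcher

class Bibliography:
    def __init__(self):
        self.blocks = {}  # (first-author surname, year) -> list of references

    def _key(self, ref: dict) -> tuple:
        return (ref["authors"][0].split()[-1].lower(), ref["year"])

    def add(self, ref: dict) -> dict:
        # Blocking: only compare against references sharing the key.
        candidates = self.blocks.setdefault(self._key(ref), [])
        for existing in candidates:
            sim = SequenceMatcher(None, ref["title"].lower(),
                                  existing["title"].lower()).ratio()
            if sim >= 0.9:
                return existing  # duplicate found: reconcile instead of adding
        candidates.append(ref)
        return ref
```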

    Sailing the Information Ocean with Awareness of Currents: Discovery and Application of Source Dependence

    The Web has enabled the availability of a huge amount of useful information, but has also eased the ability to spread false information and rumors across multiple sources, making it hard to distinguish between what is true and what is not. Recent examples include the premature Steve Jobs obituary, the second bankruptcy of United Airlines, the creation of black holes by the operation of the Large Hadron Collider, etc. Since it is important to permit the expression of dissenting and conflicting opinions, it would be a fallacy to try to ensure that the Web provides only consistent information. However, to help in separating the wheat from the chaff, it is essential to be able to determine dependence between sources. Given the huge number of data sources and the vast volume of conflicting data available on the Web, doing so in a scalable manner is extremely challenging and has not yet been addressed by existing work. In this paper, we present a set of research problems and propose some preliminary solutions on the issues involved in discovering dependence between sources. We also discuss how this knowledge can benefit a variety of technologies, such as data integration and Web 2.0, that help users manage and access the totality of the available information from various sources.
    Comment: CIDR 2009.
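
    One intuition in this line of work is that two sources agreeing on claims that few other sources make (often the false ones) is stronger evidence of dependence than agreeing on widely shared claims. The toy score below illustrates only that intuition; the weighting is an assumption for illustration, not the paper's model.

```python
def dependence_score(source_a: dict, source_b: dict, all_sources: list[dict]) -> float:
    # Each source maps a data item (key) to the value it claims.
    score = 0.0
    for key in source_a.keys() & source_b.keys():
        if source_a[key] != source_b[key]:
            continue
        # How many sources share this exact (key, value) claim?
        support = sum(1 for s in all_sources if s.get(key) == source_a[key])
        score += 1.0 / support  # rarely shared claims contribute more
    return score
```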