2,812 research outputs found

    Improved bibliographic reference parsing based on repeated patterns

    Get PDF
    uploaded by Plaz

    The NASA Astrophysics Data System: Architecture

    Full text link
    The powerful discovery capabilities available in the ADS bibliographic services are possible thanks to the design of a flexible search and retrieval system based on a relational database model. Bibliographic records are stored as a corpus of structured documents containing fielded data and metadata, while discipline-specific knowledge is segregated in a set of files independent of the bibliographic data itself. The creation and management of links to both internal and external resources associated with each bibliography in the database is made possible by representing them as a set of document properties and their attributes. To improve global access to the ADS data holdings, a number of mirror sites have been created by cloning the database contents and software on a variety of hardware and software platforms. The procedures used to create and manage the database and its mirrors have been written as a set of scripts that can be run in either an interactive or unsupervised fashion. The ADS can be accessed at http://adswww.harvard.eduComment: 25 pages, 8 figures, 3 table

    RefConcile – automated online reconciliation of bibliographic references

    Get PDF
    Comprehensive bibliographies often rely on community contributions. In such a setting, de-duplication is mandatory for the bibliography to be useful. Ideally, it works online, i.e., during the addition of new references, so the bibliography remains duplicate-free at all times. While de-duplication is well researched, generic approaches do not achieve the result quality required for automated reconciliation. To overcome this problem, we propose a new duplicate detection and reconciliation technique called RefConcile. Aimed specifically at bibliographic references, it uses dedicated blocking and matching techniques tailored to this type of data. Our evaluation based on a large real-world collection of bibliographic references shows that RefConcile scales well, and that it detects and reconciles duplicates highly accurately

    Building linguistic corpora from Wikipedia articles and discussions

    Get PDF
    Wikipedia is a valuable resource, useful as a lingustic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus - DeReKo). Our approach is a two-stage conversion combining parsing using the Sweble parser, and transformation using XSLT stylesheets. The conversion approach is able to successfully generate rich and valid corpora regardless of languages. We also introduce a method to segment user contributions in talk pages into postings

    The Symbiotic Relationship Between Information Retrieval and Informetrics

    Get PDF
    Informetrics and information retrieval (IR) represent fundamental areas of study within information science. Historically, researchers have not fully capitalized on the potential research synergies that exist between these two areas. Data sources used in traditional informetrics studies have their analogues in IR, with similar types of empirical regularities found in IR system content and use. Methods for data collection and analysis used in informetrics can help to inform IR system development and evaluation. Areas of application have included automatic indexing, index term weighting and understanding user query and session patterns through the quantitative analysis of user transaction logs. Similarly, developments in database technology have made the study of informetric phenomena less cumbersome, and recent innovations used in IR research, such as language models and ranking algorithms, provide new tools that may be applied to research problems of interest to informetricians. Building on the author’s previous work (Wolfram 2003), this paper reviews a sample of relevant literature published primarily since 2000 to highlight how each area of study may help to inform and benefit the other
    • …
    corecore