3,721 research outputs found

    A geo-temporal information extraction service for processing descriptive metadata in digital libraries

    Get PDF
    In the context of digital map libraries, resources are usually described according to metadata records that define the relevant subject, location, time-span, format and keywords. On what concerns locations and time-spans, metadata records are often incomplete or they provide information in a way that is not machine-understandable (e.g. textual descriptions). This paper presents techniques for extracting geotemporal information from text, using relatively simple text mining methods that leverage on a Web gazetteer service. The idea is to go from human-made geotemporal referencing (i.e. using place and period names in textual expressions) into geo-spatial coordinates and time-spans. A prototype system, implementing the proposed methods, is described in detail. Experimental results demonstrate the efficiency and accuracy of the proposed approaches

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Full text link
    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system, where we combine a query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, the technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords based on its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which corresponds words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance

    User evaluation of a pilot terminologies server for a distributed multi-scheme environment

    Get PDF
    The present paper reports on a user-centred evaluation of a pilot terminology service developed as part of the High Level Thesaurus (HILT) project at the Centre for Digital Library Research (CDLR) in the University of Strathclyde in Glasgow. The pilot terminology service was developed as an experimental platform to investigate issues relating to mapping between various subject schemes, namely Dewey Decimal Classification (DDC), Library of Congress Subject Headings (LCSH), the Unesco thesaurus, and the MeSH thesaurus, in order to cater for cross-browsing and cross-searching across distributed digital collections and services. The aim of the evaluation reported here was to investigate users' thought processes, perceptions, and attitudes towards the pilot terminology service and to identify user requirements for developing a full-blown pilot terminology service

    Onebox: Free-Text Interfaces as an Alternative to Complex Web Forms

    Get PDF
    This paper investigates the problem of translating free-text\ud queries into key-value pairs as an alternative means for searching `behind' web forms. We introduce a novel specication language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks

    Word-Entity Duet Representations for Document Ranking

    Full text link
    This paper presents a word-entity duet framework for utilizing knowledge bases in ad-hoc retrieval. In this work, the query and documents are modeled by word-based representations and entity-based representations. Ranking features are generated by the interactions between the two representations, incorporating information from the word space, the entity space, and the cross-space connections through the knowledge graph. To handle the uncertainties from the automatically constructed entity representations, an attention-based ranking model AttR-Duet is developed. With back-propagation from ranking labels, the model learns simultaneously how to demote noisy entities and how to rank documents with the word-entity duet. Evaluation results on TREC Web Track ad-hoc task demonstrate that all of the four-way interactions in the duet are useful, the attention mechanism successfully steers the model away from noisy entities, and together they significantly outperform both word-based and entity-based learning to rank systems

    MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach

    Full text link
    Entity linking has recently been the subject of a significant body of research. Currently, the best performing approaches rely on trained mono-lingual models. Porting these approaches to other languages is consequently a difficult endeavor as it requires corresponding training data and retraining of the models. We address this drawback by presenting a novel multilingual, knowledge-based agnostic and deterministic approach to entity linking, dubbed MAG. MAG is based on a combination of context-based retrieval on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data sets and in 7 languages. Our results show that the best approach trained on English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse on datasets in other languages. MAG, on the other hand, achieves state-of-the-art performance on English datasets and reaches a micro F-measure that is up to 0.6 higher than that of PBOH on non-English languages.Comment: Accepted in K-CAP 2017: Knowledge Capture Conferenc

    Query Expansion with Locally-Trained Word Embeddings

    Full text link
    Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus and query specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings

    Matching MEDLINE/PubMed data with Web of Science (WoS): a routine in R language

    Get PDF
    We present a novel routine, namely medlineR, based on R language, that enables the user to match data from MEDLINE/PubMed with records indexed in the ISI Web of Science (WoS) database. The matching allows exploiting the rich and controlled vocabulary of Medical Sub- ject Headings (MeSH) of MEDLINE/PubMed with additional fields of WoS. The integration provides data (e.g. citation data, list of cited reference, list of the addresses of authors’ host organisations, WoS subject categories) to perform a variety of scientometric analyses. This brief communication describes medlineR, the methodology on which it relies, and the steps the user should follow to perform the matching across the two databases. In order to specify the differences from Leydesdorff and Opthof (2013), we conclude the brief communication by testing the routine on the case of the "Burgada Syndrome"
    • 

    corecore