819 research outputs found

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Full text link
    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system, where we combine a query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, the technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords based on its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which corresponds words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance

    ParaNames: A Massively Multilingual Entity Name Corpus

    Full text link
    This preprint describes work in progress on ParaNames, a multilingual parallel name resource consisting of names for approximately 14 million entities. The included names span over 400 languages, and almost all entities are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to-date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English. Our resource is released at \url{https://github.com/bltlab/paranames} under a Creative Commons license (CC BY 4.0)

    Automatically generated, phonemic Arabic-IPA pronunciation tiers for the boundary annotated Qur'an dataset for machine learning (version 2.0)

    Get PDF
    In this paper, we augment the Boundary Annotated Qur?an dataset published at LREC 2012 (Brierley et al 2012; Sawalha et al 2012a) with automatically generated phonemic transcriptions of Arabic words. We have developed and evaluated a comprehensive grapheme-phoneme mapping from Standard Arabic \ensuremath> IPA (Brierley et al under review), and implemented the mapping in Arabic transcription technology which achieves 100% accuracy as measured against two gold standards: one for Qur?anic or Classical Arabic, and one for Modern Standard Arabic (Sawalha et al [1]). Our mapping algorithm has also been used to generate a pronunciation guide for a subset of Qur?anic words with heightened prosody (Brierley et al 2014). This is funded research under the EPSRC " Working Together" theme
    • …
    corecore