819 research outputs found
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, has of late become one of the major topics within the
information retrieval community. This paper proposes a Japanese/English CLIR
system, where we combine query translation and retrieval modules. We
currently target the retrieval of technical documents, and therefore the
performance of our system is highly dependent on the quality of the translation
of technical terms. However, technical term translation is still
problematic in that technical terms are often compound words, and thus new
terms are progressively created by combining existing base words. In addition,
Japanese often represents loanwords using its special phonograms (katakana).
Consequently, existing dictionaries find it difficult to achieve sufficient
coverage. To counter the first problem, we produce a Japanese/English
dictionary for base words, and translate compound words on a word-by-word
basis. We also use a probabilistic method to resolve translation ambiguity. For
the second problem, we use a transliteration method, which maps words
unlisted in the base word dictionary to their phonetic equivalents in the
target language. We evaluate our system using a test collection for CLIR, and
show that both the compound word translation and transliteration methods
improve system performance.
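The word-by-word compound translation with probabilistic disambiguation described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dictionary entries, the bigram probabilities, the transliteration table, and all function names (`candidates`, `score`, `translate_compound`) are hypothetical toy values chosen for illustration, and a simple bigram product stands in for whatever probabilistic model the paper actually uses.

```python
from itertools import product

# Hypothetical toy base-word dictionary (Japanese -> English candidates).
BASE_DICT = {
    "情報": ["information", "intelligence"],
    "検索": ["retrieval", "search"],
}

# Hypothetical toy bigram probabilities over target-language word pairs.
BIGRAM_PROB = {
    ("information", "retrieval"): 0.6,
    ("information", "search"): 0.2,
    ("intelligence", "retrieval"): 0.05,
    ("intelligence", "search"): 0.15,
}

# Hypothetical stand-in for a transliteration step: a lookup table for
# loanwords that are missing from the base-word dictionary.
TRANSLIT = {"データ": "data"}

UNSEEN = 1e-6  # assumed floor probability for unseen word pairs


def candidates(base_word):
    """Return translation candidates, falling back to transliteration."""
    if base_word in BASE_DICT:
        return BASE_DICT[base_word]
    if base_word in TRANSLIT:
        return [TRANSLIT[base_word]]
    return [base_word]  # leave unknown words untranslated


def score(words):
    """Score a candidate sequence as a product of bigram probabilities."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= BIGRAM_PROB.get((prev, cur), UNSEEN)
    return p


def translate_compound(base_words):
    """Pick the highest-scoring combination of per-word candidates."""
    options = [candidates(w) for w in base_words]
    return max(product(*options), key=score)


print(translate_compound(["情報", "検索"]))
```

Enumerating every combination is only practical for short compounds; a longer compound would call for a dynamic-programming search over the same scores.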
ParaNames: A Massively Multilingual Entity Name Corpus
This preprint describes work in progress on ParaNames, a multilingual
parallel name resource consisting of names for approximately 14 million
entities. The included names span over 400 languages, and almost all entities
are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a
source, we create the largest resource of this type to date. We describe our
approach to filtering and standardizing the data to provide the best quality
possible. ParaNames is useful for multilingual language processing, both in
defining tasks for name translation/transliteration and as supplementary data
for tasks such as named entity recognition and linking. We demonstrate an
application of ParaNames by training a multilingual model for canonical name
translation to and from English. Our resource is released at
https://github.com/bltlab/paranames under a Creative Commons license (CC
BY 4.0).
Automatically generated, phonemic Arabic-IPA pronunciation tiers for the boundary annotated Qur'an dataset for machine learning (version 2.0)
In this paper, we augment the Boundary Annotated Qur'an dataset published at LREC 2012 (Brierley et al. 2012; Sawalha et al. 2012a) with automatically generated phonemic transcriptions of Arabic words. We have developed and evaluated a comprehensive grapheme-phoneme mapping from Standard Arabic > IPA (Brierley et al. under review), and implemented the mapping in Arabic transcription technology which achieves 100% accuracy as measured against two gold standards: one for Qur'anic or Classical Arabic, and one for Modern Standard Arabic (Sawalha et al. [1]). Our mapping algorithm has also been used to generate a pronunciation guide for a subset of Qur'anic words with heightened prosody (Brierley et al. 2014). This is funded research under the EPSRC "Working Together" theme.