972 research outputs found

    English-Hindi transliteration using context-informed PB-SMT: the DCU system for NEWS 2009

    Get PDF
    This paper presents English—Hindi transliteration in the NEWS 2009 Machine Transliteration Shared Task adding source context modeling into state-of-the-art log-linear phrase-based statistical machine translation (PB-SMT). Source context features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. We use a memory-based classification framework that enables efficient estimation of these features while avoiding data sparseness problems.We carried out experiments both at character and transliteration unit (TU) level. Position-dependent source context features produce significant improvements in terms of all evaluation metrics

    Moses-based official baseline for NEWS 2016

    Get PDF
    Transliteration is the phonetic translation between two different languages. There are many works that approach transliteration using machine translation methods. This paper describes the official baseline system for the NEWS 2016 workshop shared task. This baseline is based on a standard phrase-based machine translation system using Moses. Results are between the range of best and worst from last year’s workshops providing a nice starting point for participants this year.Postprint (published version

    A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics

    Full text link
    © {Owner/Author | ACM} {Year}. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation, http://dx.doi.org/10.1145/2824864.2824876[EN] In this paper, we describe a hybrid approach for word-level language (WLL) identification of Bangla words written in Roman script and mixed with English words as part of our participation in the shared task on transliterated search at Forum for Information Retrieval Evaluation (FIRE) in 2014. A CRF based machine learning model and post-processing heuristics are employed for the WLL identification task. In addition to language identification, two transliteration systems were built to transliterate detected Bangla words written in Roman script into native Bangla script. The system demonstrated an overall token level language identification accuracy of 0.905. The token level Bangla and English language identification F-scores are 0.899, 0.920 respectively. The two transliteration systems achieved accuracies of 0.062 and 0.037. The word-level language identification system presented in this paper resulted in the best scores across almost all metrics among all the participating systems for the Bangla-English language pair.We acknowledge the support of the Department of Electronics and Information Technology (DeitY), Government of India, through the project “CLIA System Phase II”. The research work of the last author was carried out in the framework of WIQ-EI IRSES (Grant No. 269180) within the FP 7 Marie Curie, DIANA-APPLICATIONS (TIN2012-38603-C02-01) projects and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.Banerjee, S.; Kuila, A.; Roy, A.; Naskar, SK.; Rosso, P.; Bandyopadhyay, S. (2014). A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics. En FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation. ACM. 170-173. https://doi.org/10.1145/2824864.2824876S170173Y. Al-Onaizan and K. Knight. Named entity translation: Extended abstract. In HLT, pages 122--124. Singapore, 2002.P. J. Antony, V. P. Ajith, and K. P. Suman. Feature extraction based english to kannada transliteration. In In hird International conference on Semantic E-business and Enterprise Computing. SEEC 2010, 2010.P. J. Antony, V. P. Ajith, and K. P. Suman. Kernel method for english to kannada transliteration. In International conference on-Recent trends in Information, Telecommunication and computing. ITC2010, 2010.M. Arbabi, S. M. Fischthal, V. C. Cheng, and E. Bart. Algorithms for arabic name transliteration. In IBM Journal of Research and Development, page 183. TeX Users Group, 1994.S. Banerjee, S. Naskar, and S. Bandyopadhyay. Bengali named entity recognition using margin infused relaxed algorithm. In TSD, pages 125--132. Springer International Publishing, 2014.U. Barman, J. Wagner, G. Chrupala, and J. Foster. Identification of languages and encodings in a multilingual document. page 127. EMNLP, 2014.K. R. Beesley. Language identifier: A computer program for automatic natural-language identification of on-line text. pages 47--54. ATA, 1988.P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer. Mercer: The mathematics of statistical machine translation: parameter estimation. pages 263--311. Computational Linguistics, 1993.M. Carpuat. Mixed-language and code-switching in the canadian hansard. page 107. EMNLP, 2014.G. Chittaranjan, Y. Vyas, K. Bali, and M. Choudhury. Word-level language identification using crf: Code-switching shared task report of msr india system. pages 73--79. EMNLP, 2014.A. Das, A. Ekbal, T. Mandal, and S. Bandyopadhyay. English to hindi machine transliteration system at news. pages 80--83. Proceeding of the Named Entities Workshop ACL-IJCNLP, Singapore, 2009.A. Ekbal, S. Naskar, and S. Bandyopadhyay. A modified joint source channel model for transliteration. pages 191--198. COLING-ACL Australia, 2006.I. Goto, N. Kato, N. Uratani, and T. Ehara. Transliteration considering context information based on the maximum entropy method. pages 125--132. MT-Summit IX, New Orleans, USA, 2003.R. Haque, S. Dandapat, A. K. Srivastava, S. K. Naskar, and A. Way. English to hindi transliteration using context-informed pb-smt:the dcu system for news 2009. NEWS 2009, 2009.S. Y. Jung, S. Hong, and E. Paek. An english to korean transliteration model of extended markov window.S. Y. Jung, S. L. Hong, and E. Paek. An english to korean transliteration model of extended markov window. pages 383--389. COLING, 2000.B. J. Kang and K. S. Choi. Automatic transliteration and back-transliteration by decision tree learning. LERC, May 2000.B. King and S. Abney. Labeling the languages of words in mixed-language documents using weakly supervised methods. pages 1110--1119. NAACL-HLT, 2013.R. Kneser and H. Ney. Improved backing-off for m-gram language modeling. In ICASSP, pages 181--184. Detroit, MI, 1995.R. Kneser and H. Ney. SRILM-an extensible language modeling toolkit. In Intl. Conf. on Spoken Language Processing, pages 901--904, 2002.K. Knight and J. Graehl. Machine transliteration. in computational linguistics. pages 599--612, 1998.P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: open source toolkit for statistical machine translation. In ACL, pages 177--180, 2007.P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In HLT-NAACL, 2003.A. Kumaran and T. Kellner. A generic framework for machine transliteration. In 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 721--722. ACM, 2007.H. Li, Z. Min, and J. Su. A joint source-channel model for machine transliteration. In ACL, page 159, 2004.C. Lignos and M. Marcus. Toward web-scale analysis of codeswitching. In Annual Meeting of the Linguistic Society of America, 2013.J. H. Oh and K. S. Choi. An english-korean transliteration model using pronunciation and contextual rules. In 19th international conference on Computational linguistics. ACL, 2002.T. Rama and K. Gali. Modeling machine transliteration as a phrase based statistical machine translation problem. In Language Technologies Research Centre. IIIT, Hyderabad, India, 2009.A. K. Singh and J. Gorla. Identification of languages and encodings in a multilingual document. In ACL-SIGWAC's Web As Corpus3, page 95. Presses univ. de Louvain, 2007.V. Sowmya, M. Choudhury, K. Bali, T. Dasgupta, and A. Basu. Resource creation for training and testing of transliteration systems for indian languages. In LREC, pages 2902--2907, 2010.V. Sowmya and V. Varma. Transliteration based text input methods for telugu. In ICCPOL-2009, 2009.B. G. Stalls and J. Graehl. Translating names and technical terms in arabic text. In Workshop on Computational Approaches to Semitic Languages, pages 34--41. ACL, 1998.S. Sumaja, R. Loganathan, and K. P. Suman. English to malayalam transliteration using sequence labeling approach. International Journal of Recent Trends in Engineering, 1(2), 2009.M. S. Vijaya, V. P. Ajith, G. Shivapratap, and K. P. Soman. English to tamil transliteration using weka. International Journal of Recent Trends in Engineering, 2009

    Filling Knowledge Gaps in a Broad-Coverage Machine Translation System

    Full text link
    Knowledge-based machine translation (KBMT) techniques yield high quality in domains with detailed semantic models, limited vocabulary, and controlled input grammar. Scaling up along these dimensions means acquiring large knowledge resources. It also means behaving reasonably when definitive knowledge is not yet available. This paper describes how we can fill various KBMT knowledge gaps, often using robust statistical techniques. We describe quantitative and qualitative results from JAPANGLOSS, a broad-coverage Japanese-English MT system.Comment: 7 pages, Compressed and uuencoded postscript. To appear: IJCAI-9

    A Comparison of Different Machine Transliteration Models

    Full text link
    Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance
    corecore