
    A Transliteration of CRF based Manipuri POS Tagging

    Transliteration is common to all languages that have multiple scripts. Manipuri, one of the Scheduled Indian Languages, is one of them. This language has two scripts: a borrowed Bengali script and the original Meitei Mayek (script). Part-of-Speech (POS) tagging of Bengali-script Manipuri text is performed using a Conditional Random Field (CRF), which is then followed by transliteration into Meitei Mayek.
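    As an illustration of the general setup such a tagger relies on (not the paper's actual features or data), here is a minimal CRF sequence-tagging sketch using the sklearn-crfsuite library; the toy sentences and the feature template are stand-ins for demonstration only.

```python
# Minimal sketch of CRF-based POS tagging with sklearn-crfsuite.
# The two toy sentences and the feature template are illustrative
# stand-ins, not the paper's Manipuri data or feature set.
import sklearn_crfsuite

train_sents = [  # [(word, POS), ...] per sentence; toy placeholders
    [("ei", "PRON"), ("chak", "NOUN"), ("chai", "VERB")],
    [("ma", "PRON"), ("lairik", "NOUN"), ("pai", "VERB")],
]

def word_features(sent, i):
    word = sent[i][0]
    feats = {
        "word": word,
        "suffix2": word[-2:],   # suffixes carry strong POS cues
        "prefix2": word[:2],
        "is_first": i == 0,
        "is_last": i == len(sent) - 1,
    }
    if i > 0:
        feats["prev_word"] = sent[i - 1][0]
    return feats

def sent2features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

X = [sent2features(s) for s in train_sents]
y = [[tag for _, tag in s] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])  # predicted tags for the first toy sentence
```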

    A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics

    In this paper, we describe a hybrid approach for word-level language (WLL) identification of Bangla words written in Roman script and mixed with English words, as part of our participation in the shared task on transliterated search at the Forum for Information Retrieval Evaluation (FIRE) 2014. A CRF-based machine learning model and post-processing heuristics are employed for the WLL identification task. In addition to language identification, two transliteration systems were built to transliterate detected Roman-script Bangla words into the native Bangla script. The system demonstrated an overall token-level language identification accuracy of 0.905. The token-level Bangla and English language identification F-scores are 0.899 and 0.920, respectively. The two transliteration systems achieved accuracies of 0.062 and 0.037. The word-level language identification system presented in this paper achieved the best scores across almost all metrics among the participating systems for the Bangla-English language pair. We acknowledge the support of the Department of Electronics and Information Technology (DeitY), Government of India, through the project "CLIA System Phase II". The research work of the last author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within FP7 Marie Curie, the DIANA-APPLICATIONS project (TIN2012-38603-C02-01), and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Banerjee, S.; Kuila, A.; Roy, A.; Naskar, S. K.; Rosso, P.; Bandyopadhyay, S. (2014). A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics. In FIRE '14: Proceedings of the Forum for Information Retrieval Evaluation, ACM, pp. 170-173. https://doi.org/10.1145/2824864.2824876
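    To make the shape of the hybrid pipeline concrete, the sketch below refines raw CRF token labels with two generic post-processing heuristics (context smoothing and a dictionary override); these particular heuristics are assumptions for illustration, not necessarily the ones the paper used.

```python
# Illustrative hybrid pipeline shape: raw CRF word-language labels
# refined by post-processing heuristics. Both heuristics below are
# generic examples, not necessarily those used in the paper.
ENGLISH_DICT = {"the", "school", "movie"}  # stand-in for a real wordlist

def post_process(tokens, labels):
    labels = list(labels)
    # Heuristic 1: smooth a label that disagrees with both neighbours.
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            labels[i] = labels[i - 1]
    # Heuristic 2: a dictionary hit overrides the CRF decision.
    for i, tok in enumerate(tokens):
        if tok.lower() in ENGLISH_DICT:
            labels[i] = "EN"
    return labels

tokens = ["ami", "school", "jabo", "na"]   # Romanised Bangla mixed with EN
crf_labels = ["BN", "BN", "BN", "BN"]      # toy CRF output
print(post_process(tokens, crf_labels))    # ['BN', 'EN', 'BN', 'BN']
```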

    A Comprehensive Review of Sentiment Analysis on Indian Regional Languages: Techniques, Challenges, and Trends

    Sentiment analysis (SA) is the process of understanding emotion within a text. It helps identify the opinion, attitude, and tone of a text by categorizing it as positive, negative, or neutral. SA is used frequently today as more and more people get a chance to put out their thoughts thanks to the advent of social media. Sentiment analysis benefits industries around the globe, such as finance, advertising, marketing, travel, and hospitality. Although the majority of work done in this field is on global languages like English, in recent years the importance of SA in local languages has also been widely recognized. This has led to considerable research in the analysis of Indian regional languages. This paper comprehensively reviews SA in the following major Indian regional languages: Marathi, Hindi, Tamil, Telugu, Malayalam, Bengali, Gujarati, and Urdu. Furthermore, this paper presents techniques, challenges, findings, recent research trends, and future scope for improving result accuracy.
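    Purely as an illustration of the positive/negative/neutral categorization described above (the systems surveyed in the paper use lexicon-based, machine learning, and deep learning approaches), a toy lexicon-based polarity classifier might look like this; the lexicon is a made-up stand-in.

```python
# Toy polarity classifier illustrating the three-way categorization.
POLARITY = {"good": 1, "great": 1, "bad": -1, "terrible": -1}  # stand-in

def classify(text):
    score = sum(POLARITY.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("the food was great"))    # positive
print(classify("the service was bad"))   # negative
print(classify("it opens at nine"))      # neutral
```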

    My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

    Research on code-mixed data is limited due to the unavailability of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi, which lacks any prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets, MeHate, MeSent, and MeLID, for the downstream tasks of code-mixed Mr-En hate speech detection, sentiment analysis, and language identification, respectively. These evaluation datasets individually consist of roughly 12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus significantly outperform the existing state-of-the-art BERT models. This is the first work that presents artifacts for code-mixed Marathi research. All datasets and models are publicly released at https://github.com/l3cube-pune/MarathiNLP
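    Assuming the released models are published on the Hugging Face Hub (the identifier below is a guess from the paper's naming; the linked repository lists the actual names), loading one for a downstream task would follow the standard transformers pattern:

```python
# Sketch: loading a code-mixed BERT model for a 3-class task such as
# MeSent sentiment analysis. MODEL_ID is a hypothetical hub identifier;
# check https://github.com/l3cube-pune/MarathiNLP for the real names.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "l3cube-pune/me-bert"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=3)

inputs = tokenizer("he movie mala khup aavdla, totally loved it",
                   return_tensors="pt")
logits = model(**inputs).logits  # fine-tune on MeSent before relying on these
```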

    An Urdu semantic tagger - lexicons, corpora, methods and tools

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These tools identify or tag only part of the core semantic information in language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words) is still needed for the Urdu language, based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, in order to provide comprehensive semantic analysis of Urdu text. This research work reports on the development of an Urdu semantic tagging tool and discusses the challenging issues faced in this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of new tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer, and a part-of-speech tagger. Results for these tools are as follows: the word tokenizer reports an F1 of 94.01% and an accuracy of 97.21%; the sentence tokenizer shows an F1 of 92.59% and an accuracy of 93.15%; and the POS tagger shows an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (a lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques. Furthermore, all semantic lexicons have been developed using a novel combination of automatic and semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entities. A large multi-target annotated corpus is also constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; the proposed corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming loss of 0.06 and an accuracy of 0.94. Lexical coverage of 88.59%, 99.63%, 96.71%, and 89.63% is obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.
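    To make the reported multi-target evaluation concrete, here is a small sketch using scikit-learn's ClassifierChain together with Hamming loss and subset accuracy on synthetic data; it mirrors the evaluation in spirit only and does not reproduce the thesis's features or its RAkEL-style classifier.

```python
# Multi-target classification evaluated with Hamming loss and accuracy,
# on synthetic stand-in data (the thesis's corpus is not reproduced).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, hamming_loss
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                # stand-in token features
Y = (rng.random((200, 5)) < 0.3).astype(int)  # 5 binary semantic tags

chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0)
chain.fit(X[:150], Y[:150])
pred = chain.predict(X[150:])

print("Hamming loss:", hamming_loss(Y[150:], pred))       # lower is better
print("Subset accuracy:", accuracy_score(Y[150:], pred))  # exact-match rate
```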

    Cross-view Embeddings for Information Retrieval

    In this dissertation, we deal with cross-view tasks related to information retrieval using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script IR, which deals with the challenges faced by an IR system when a language is written in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an abstract space, and CAE provides state-of-the-art performance. We study a wide variety of models for cross-language information retrieval (CLIR) and propose a model based on compositional neural networks (XCNN), which overcomes the limitations of the existing methods and achieves the best results for many CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval, and cross-language plagiarism detection. We empirically test the proposed models for these tasks on publicly available datasets and present the results with analyses. In this dissertation, we also explore an effective method to incorporate contextual similarity for lexical selection in machine translation. Concretely, we investigate a feature based on the context available in the source sentence, calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over strong baselines for English-to-Spanish and English-to-Hindi translation tasks. Finally, we explore methods to evaluate the quality of autoencoder-generated representations of text data and analyse their architectural properties. For this, we propose two metrics based on the reconstruction capabilities of the autoencoders: the structure preservation index (SPI) and the similarity accumulation index (SAI). We also introduce the concept of a critical bottleneck dimensionality (CBD), below which structural information is lost, and present analyses linking CBD and language perplexity.
    Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457
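    As a small illustration of the character n-gram view of mixed-script terms described above (the cross-view autoencoder itself is not reproduced), the sketch below maps a term in either script into the same finite n-gram feature space that such a model could consume:

```python
# Character n-gram features for a term; a cross-view model would learn
# to relate the Roman-script and native-script views of the same word.
from collections import Counter

def char_ngrams(term, n_min=1, n_max=3):
    padded = f"#{term}#"  # mark word boundaries
    return Counter(padded[i:i + n]
                   for n in range(n_min, n_max + 1)
                   for i in range(len(padded) - n + 1))

# The same word ("thanks") written in Roman and in Devanagari script:
print(char_ngrams("dhanyavad"))
print(char_ngrams("धन्यवाद"))
```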