128 research outputs found

    Revisiting Low Resource Status of Indian Languages in Machine Translation

    Full text link
    Indian language machine translation performance is hampered due to the lack of large scale multi-lingual sentence aligned corpora and robust benchmarks. Through this paper, we provide and analyse an automated framework to obtain such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module that is used to work with publicly available websites such as press releases by the government. The main contribution towards this effort is to obtain an incremental method that uses the above pipeline to iteratively improve the size of the corpus as well as improve each of the components of our system. Through our work, we also evaluate the design choices such as the choice of pivoting language and the effect of iterative incremental increase in corpus size. Our work in addition to providing an automated framework also results in generating a relatively larger corpus as compared to existing corpora that are available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.Comment: 10 pages, few figures, Preprint under revie

    A Hybrid Optimization Approach for Neural Machine Translation Using LSTM+RNN with MFO for Under Resource Language (Telugu)

    Get PDF
    NMT (Neural Machine Translation) is an innovative approach in the field of machine translation, in contrast to SMT (statistical machine translation) and Rule-based techniques which has resulted annotable improvements. This is because NMT is able to overcome many of the shortcomings that are inherent in the traditional approaches. The Development of NMT has grown tremendously in the recent years but NMT performance remain under optimal when applied to low resource language pairs like Telugu, Tamil and Hindi. In this work a proposedmethod fortranslating pairs (Telugu to English) is attempted, an optimal approach which enhancesthe accuracy and execution time period.A hybrid method approach utilizing Long short-term memory (LSTM) and traditional Recurrent Neural Network (RNN) are used for testing and training of the dataset. In the event of long-range dependencies, LSTM will generate more accurate results than a standard RNN would endure and the hybrid technique enhances the performance of LSTM. LSTM is used during the encoding and RNN is used in decoding phases of NMT. Moth Flame Optimization (MFO) is utilized in the proposed system for the purpose of providing the encoder and decoder model with the best ideal points for training the data

    Register Variation Remains Stable Across 60 Languages

    Full text link
    This paper measures the stability of cross-linguistic register variation. A register is a variety of a language that is associated with extra-linguistic context. The relationship between a register and its context is functional: the linguistic features that make up a register are motivated by the needs and constraints of the communicative situation. This view hypothesizes that register should be universal, so that we expect a stable relationship between the extra-linguistic context that defines a register and the sets of linguistic features which the register contains. In this paper, the universality and robustness of register variation is tested by comparing variation within vs. between register-specific corpora in 60 languages using corpora produced in comparable communicative situations: tweets and Wikipedia articles. Our findings confirm the prediction that register variation is, in fact, universal

    On the use of word embedding for cross language plagiarism detection

    Full text link
    [EN] Cross language plagiarism is the unacknowledged reuse of text across language pairs. It occurs if a passage of text is translated from source language to target language and no proper citation is provided. Although various methods have been developed for detection of cross language plagiarism, less attention has been paid to measure and compare their performance, especially when tackling with different types of paraphrasing through translation. In this paper, we investigate various approaches to cross language plagiarism detection. Moreover, we present a novel approach to cross language plagiarism detection using word embedding methods and explore its performance against other state-of-the-art plagiarism detection algorithms. In order to evaluate the methods, we have constructed an English-Persian bilingual plagiarism detection corpus (referred to as HAMTA-CL) comprised of seven types of obfuscation. The results show that the word embedding approach outperforms the other approaches with respect to recall when encountering heavily paraphrased passages. On the other hand, translation based approach performs well when the precision is the main consideration of the cross language plagiarism detection system.Asghari, H.; Fatemi, O.; Mohtaj, S.; Faili, H.; Rosso, P. (2019). On the use of word embedding for cross language plagiarism detection. Intelligent Data Analysis. 23(3):661-680. https://doi.org/10.3233/IDA-183985S661680233H. Asghari, K. Khoshnava, O. Fatemi and H. Faili, Developing bilingual plagiarism detection corpus using sentence aligned parallel corpus: Notebook for {PAN} at {CLEF} 2015, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.A. Barrón-Cede no, M. Potthast, P. Rosso and B. Stein, Corpus and evaluation measures for automatic plagiarism detection, In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias, editors, Proceedings of the International Conference on Language Resources and Evaluation, {LREC} 2010, 17–23 May 2010, Valletta, Malta. European Language Resources Association, 2010.A. Barrón-Cede no, P. Rosso, D. Pinto and A. Juan, On cross-lingual plagiarism analysis using a statistical model, In B. Stein, E. Stamatatos and M. Koppel, editors, Proceedings of the ECAI’08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008, volume 377 of {CEUR} Workshop Proceedings. CEUR-WS.org, 2008.Farghaly, A., & Shaalan, K. (2009). Arabic Natural Language Processing. ACM Transactions on Asian Language Information Processing, 8(4), 1-22. doi:10.1145/1644879.1644881J. Ferrero, F. Agnès, L. Besacier and D. Schwab, A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection, In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk and S. Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation {LREC} 2016, Portorož, Slovenia, May 23–28, 2016, European Language Resources Association {(ELRA)}, 2016.Franco-Salvador, M., Gupta, P., Rosso, P., & Banchs, R. E. (2016). Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowledge-Based Systems, 111, 87-99. doi:10.1016/j.knosys.2016.08.004Franco-Salvador, M., Rosso, P., & Montes-y-Gómez, M. (2016). A systematic study of knowledge graph analysis for cross-language plagiarism detection. Information Processing & Management, 52(4), 550-570. doi:10.1016/j.ipm.2015.12.004C.K. Kent and N. Salim, Web based cross language plagiarism detection, CoRR, abs/0912.3, 2009.McNamee, P., & Mayfield, J. (2004). Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval, 7(1/2), 73-97. doi:10.1023/b:inrt.0000009441.78971.beT. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient estimation of word representations in vector space, CoRR, abs/1301.3, 2013.S. Mohtaj, B. Roshanfekr, A. Zafarian and H. Asghari, Parsivar: A language processing toolkit for persian, In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis and T. Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7–12, 2018, European Language Resources Association ELRA, 2018.R.M.A. Nawab, M. Stevenson and P.D. Clough, University of Sheffield – Lab Report for {PAN} at {CLEF} 2010, In M. Braschler, D. Harman and E. Pianta, editors, {CLEF} 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy, volume 1176 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2010.G. Oberreuter, G. L’Huillier, S.A. Rios and J.D. Velásquez, Approaches for intrinsic and external plagiarism detection – Notebook for {PAN} at {CLEF} 2011, In V. Petras, P. Forner and P.D. Clough, editors, {CLEF} 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands, volume 1177 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2011.Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to crosslingual natural language tasks. Journal of Algorithms, 64(1), 51-60. doi:10.1016/j.jalgor.2009.02.005M. Potthast, A. Barrón-Cede no, A. Eiselt, B. Stein and P. Rosso, Overview of the 2nd international competition on plagiarism detection, In M. Braschler, D. Harman and E. Pianta, editors, {CLEF} 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy, volume 1176 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2010.Potthast, M., Barrón-Cedeño, A., Stein, B., & Rosso, P. (2010). Cross-language plagiarism detection. Language Resources and Evaluation, 45(1), 45-62. doi:10.1007/s10579-009-9114-zM. Potthast, A. Eiselt, A. Barrón-Cede no, B. Stein and P. Rosso, Overview of the 3rd international competition on plagiarism detection, In V. Petras, P. Forner and P.D. Clough, editors, {CLEF} 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands, volume 1177 of {CEUR} Workshop Proceedings. CEUR-WS.org, 2011.M. Potthast, S. Goering, P. Rosso and B. Stein, Towards data submissions for shared tasks: First experiences for the task of text alignment, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.Potthast, M., Stein, B., & Anderka, M. (s. f.). A Wikipedia-Based Multilingual Retrieval Model. Advances in Information Retrieval, 522-530. doi:10.1007/978-3-540-78646-7_51B. Pouliquen, R. Steinberger and C. Ignat, Automatic identification of document translations in large multilingual document collections, CoRR, abs/cs/060, 2006.B. Stein, E. Stamatatos and M. Koppel, Proceedings of the ECAI’08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008, volume 377 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2008.J. Wieting, M. Bansal, K. Gimpel and K. Livescu, Towards universal paraphrastic sentence embeddings, CoRR, abs/1511.0, 2015.V. Zarrabi, J. Rafiei, K. Khoshnava, H. Asghari and S. Mohtaj, Evaluation of text reuse corpora for text alignment task of plagiarism detection, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.Barrón-Cedeño, A., Gupta, P., & Rosso, P. (2013). Methods for cross-language plagiarism detection. Knowledge-Based Systems, 50, 211-217. doi:10.1016/j.knosys.2013.06.01
    corecore