6 research outputs found

    Deep Level Lexical Features for Cross-lingual Authorship Attribution

    Cross-lingual document classification aims to classify documents written in different languages that share a common genre, topic or author. Knowledge-based methods and methods based on machine translation deliver state-of-the-art classification accuracy; however, because they rely on external resources, poorly resourced languages present a challenge for these types of methods. In this paper, we propose a novel set of language-independent features that capture language use in a document at a deep level, using features that are intrinsic to the document. These features are based on vocabulary richness measurements and are text-length independent and self-contained, meaning that no external resources such as lexicons or machine translation software are needed. Preliminary evaluation shows promising results for the task of cross-lingual authorship attribution, outperforming similar methods.
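
    The abstract does not say which vocabulary richness measurements the authors use. As a rough illustration only, the sketch below computes two well-known measures from a document's own tokens, with no lexicon or translation step: the type-token ratio and Yule's K, the latter often described as largely insensitive to text length. The function names and the regex tokenizer are assumptions made for this example, not the paper's method.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then take runs of Unicode word characters.
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    # Distinct word types divided by total tokens (this one does vary with length).
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def yules_k(tokens):
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the number of
    # types occurring i times and N is the token count. Summing freq^2 over all
    # types gives the same value as sum_i i^2 * V_i.
    n = len(tokens)
    if n == 0:
        return 0.0
    freqs = Counter(tokens)
    s2 = sum(f * f for f in freqs.values())
    return 1e4 * (s2 - n) / (n * n)


def richness_features(text):
    # Feature vector built only from the document itself.
    tokens = tokenize(text)
    return {"ttr": type_token_ratio(tokens), "yules_k": yules_k(tokens)}
```

    Calling `richness_features(document_text)` on texts by the same author in different languages would give vectors that a standard classifier could compare; whether these particular measures transfer across languages is exactly what a study like this one would need to evaluate.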

    Modeling Classifier for Code Mixed Cross Script Questions.

    With the boom in internet use, social media text has been increasing day by day, and user-generated content (such as tweets and blogs) in Indian languages is often written in Roman script for various socio-cultural and technological reasons. A majority of these posts are multilingual, and many involve code mixing, where lexical items and grammatical features from two languages appear in one sentence. In this multilingual scenario, code-mixed cross-script (i.e., non-native script) data gives rise to a new problem and presents serious challenges to automatic Question Answering (QA). Question classification, which detects the category of a question, is an important step towards QA. This paper proposes an approach to cross-script question classification.

    MSIR@FIRE: A Comprehensive Report from 2013 to 2016

    [EN] India is a nation of geographical and cultural diversity where over 1600 dialects are spoken. With technological advancement, growing internet penetration and cheaper access to mobile data, India has recently seen a sudden growth in internet users. These users generate content either in English or in vernacular Indian languages. To develop technological solutions for content generated in Indian languages, the Forum for Information Retrieval Evaluation (FIRE) was established and held for the first time in 2008. Although Indian languages are written using indigenous scripts, websites and user-generated content (such as tweets and blogs) in these languages are often written in Roman script for various socio-cultural and technological reasons. A challenge that search engines face while processing transliterated queries and documents is extensive spelling variation. The MSIR track was first introduced at FIRE in 2013. Its aim was to systematically formalize several research problems that must be solved to tackle code mixing in Web search for users of many languages around the world, to develop related data sets and test benches and, most importantly, to build a research community focusing on this important problem, which has received very little attention. This document is a comprehensive report on the 4 years of the MSIR track evaluated at FIRE between 2013 and 2016.
    Somnath Banerjee and Sudip Kumar Naskar are supported by Media Lab Asia, MeitY, Government of India, under the Visvesvaraya PhD Scheme for Electronics & IT. The work of Paolo Rosso was partially supported by the MISMIS research project PGC2018-096212-B-C31 funded by the Spanish MICINN.
    Banerjee, S.; Choudhury, M.; Chakma, K.; Kumar Naskar, S.; Das, A.; Bandyopadhyay, S.; Rosso, P. (2020). MSIR@FIRE: A Comprehensive Report from 2013 to 2016. SN Computer Science. 1(55):1-15. https://doi.org/10.1007/s42979-019-0058-0

    Code Mixed Cross Script Factoid Question Classification - A Deep Learning Approach

    [EN] Before the advent of the Internet era, code-mixing was mainly used in spoken form. However, with the recent popularity of informal social media platforms such as Facebook, Twitter and Instagram, code-mixing is increasingly used in written form. User-generated social media content is becoming an increasingly important resource in applied linguistics, and recent trends in social media usage have led to a proliferation of studies on such content. Multilingual social media users often write native-language content in a non-native script (cross-script). Recently, Banerjee et al. [9] introduced the code-mixed cross-script question answering research problem and reported that the ever-increasing social media content could serve as a potential digital resource for building question answering systems for less-computerized languages. Question classification is a core task in question answering in which questions are assigned one or more classes that denote the expected answer type(s). In this research work, we address the question classification task as part of the code-mixed cross-script question answering research problem. We combine a deep learning framework with feature engineering to address the question classification task and improve on the state-of-the-art question classification accuracy by over 4% for code-mixed cross-script questions.
    The work of the third author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project.
    Banerjee, S.; Kumar Naskar, S.; Rosso, P.; Bandyopadhyay, S. (2018). Code Mixed Cross Script Factoid Question Classification - A Deep Learning Approach. Journal of Intelligent & Fuzzy Systems. 34(5):2959-2969. https://doi.org/10.3233/JIFS-169481

    ECIR 2016 Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 16)

    [EN] The First International Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine) was held in conjunction with the 2016 European Conference on Information Retrieval (ECIR), in Padua, Italy. This report presents an overview of the motivations and objectives underlying the establishment of this workshop. It also provides a summary of the contributing papers and of the main research topics and trends discussed among the participants.
    This work is partially funded by research project PON 2007-2013 “BA2Know - Business Analytics to Know”, funded by Italian Ministry of Instruction, University, and Research, and by the research project TIN2015-71147-C2-1-P of the Spanish Ministry of Economy and Competitiveness.
    Ienco, D.; Roche, M.; Romeo, S.; Rosso, P.; Tagarelli, A. (2016). ECIR 2016 Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 16). ACM SIGIR Forum. 50(2):89-95. https://doi.org/10.1145/3053408.3053424

    An Unexpected Journey: Towards Runtime Verification of Multiagent Systems and Beyond

    The Trace Expression formalism derives from work started in 2012 and is mainly used to specify and verify interaction protocols at runtime, although other applications have been devised. More specifically, this thesis describes how to extend and apply this formalism in the engineering process of distributed artificial intelligence systems (such as multiagent systems). The thesis extends the state of the art through four different contributions: 1. Theoretical: the thesis extends the original formalism to represent parametric and probabilistic specifications (parametric trace expressions and probabilistic trace expressions, respectively). 2. Algorithmic: the thesis proposes algorithms for verifying trace expressions at runtime in a decentralized way. The algorithms have been designed to be as general as possible, but their implementation and experimentation address scenarios where the modelled and observed events are communicative events (interactions) inside a multiagent system. 3. Application: the thesis analyzes the relations between runtime and static verification (e.g. model checking), proposing hybrid integrations in both directions. First, the thesis proposes a trace expression model checking approach that shows how to statically verify an LTL property on a trace expression specification. It then presents a novel approach for supporting static verification through the addition of monitors at runtime (post-process). 4. Implementation: the thesis presents RIVERtools, a tool supporting the writing, the syntactic analysis and the decentralization of trace expressions.