2,999 research outputs found

    MSIR@FIRE: A Comprehensive Report from 2013 to 2016

    Full text link
    [EN] India is a nation of geographical and cultural diversity where over 1600 dialects are spoken by the people. With the technological advancement, penetration of the internet and cheaper access to mobile data, India has recently seen a sudden growth of internet users. These Indian internet users generate contents either in English or in other vernacular Indian languages. To develop technological solutions for the contents generated by the Indian users using the Indian languages, the Forum for Information Retrieval Evaluation (FIRE) was established and held for the first time in 2008. Although Indian languages are written using indigenous scripts, often websites and user-generated content (such as tweets and blogs) in these Indian languages are written using Roman script due to various socio-cultural and technological reasons. A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. MSIR track was first introduced in 2013 at FIRE and the aim of MSIR was to systematically formalize several research problems that one must solve to tackle the code mixing in Web search for users of many languages around the world, develop related data sets, test benches and most importantly, build a research community focusing on this important problem that has received very little attention. This document is a comprehensive report on the 4 years of MSIR track evaluated at FIRE between 2013 and 2016.Somnath Banerjee and Sudip Kumar Naskar are supported by Media Lab Asia, MeitY, Government of India, under the Visvesvaraya PhD Scheme for Electronics & IT. The work of Paolo Rosso was partially supported by the MISMIS research project PGC2018-096212-B-C31 funded by the Spanish MICINN.Banerjee, S.; Choudhury, M.; Chakma, K.; Kumar Naskar, S.; Das, A.; Bandyopadhyay, S.; Rosso, P. (2020). MSIR@FIRE: A Comprehensive Report from 2013 to 2016. SN Computer Science. 1(55):1-15. https://doi.org/10.1007/s42979-019-0058-0S115155Ahmed UZ, Bali K, Choudhury M, Sowmya VB. Challenges in designing input method editors for Indian languages: the role of word-origin and context. In: Advances in text input methods (WTIM 2011). 2011. pp. 1–9Banerjee S, Chakma K, Naskar SK, Das A, Rosso P, Bandyopadhyay S, Choudhury M. Overview of the mixed script information retrieval (MSIR) at fire-2016. In: Forum for information retrieval evaluation. Springer; 2016. pp. 39–49.Banerjee S, Kuila A, Roy A, Naskar SK, Rosso P, Bandyopadhyay S. A hybrid approach for transliterated word-level language identification: CRF with post-processing heuristics. In: Proceedings of the forum for information retrieval evaluation, ACM, 2014. pp. 54–59.Banerjee S, Naskar S, Rosso P, Bandyopadhyay S. Code mixed cross script factoid question classification—a deep learning approach. J Intell Fuzzy Syst. 2018;34(5):2959–69.Banerjee S, Naskar SK, Rosso P, Bandyopadhyay S. The first cross-script code-mixed question answering corpus. In: Proceedings of the workshop on modeling, learning and mining for cross/multilinguality (MultiLingMine 2016), co-located with the 38th European Conference on Information Retrieval (ECIR). 2016.Banerjee S, Naskar SK, Rosso P, Bandyopadhyay S. Named entity recognition on code-mixed cross-script social media content. Comput Sistemas. 2017;21(4):681–92.Barman U, Das A, Wagner J, Foster J. Code mixing: a challenge for language identification in the language of social media. In: Proceedings of the first workshop on computational approaches to code switching. 2014. pp. 13–23.Bhardwaj P, Pakray P, Bajpeyee V, Taneja A. Information retrieval on code-mixed Hindi–English tweets. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. 2016.Bhargava R, Khandelwal S, Bhatia A, Sharmai Y. Modeling classifier for code mixed cross script questions. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. CEUR-WS.org. 2016.Bhattacharjee D, Bhattacharya, P. Ensemble classifier based approach for code-mixed cross-script question classification. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. CEUR-WS.org. 2016.Chakma K, Das A. CMIR: a corpus for evaluation of code mixed information retrieval of Hindi–English tweets. In: The 17th international conference on intelligent text processing and computational linguistics (CICLING). 2016.Choudhury M, Chittaranjan G, Gupta P, Das A. Overview of fire 2014 track on transliterated search. Proceedings of FIRE. 2014. pp. 68–89.Ganguly D, Pal S, Jones GJ. Dcu@fire-2014: fuzzy queries with rule-based normalization for mixed script information retrieval. In: Proceedings of the forum for information retrieval evaluation, ACM, 2014. pp. 80–85.Gella S, Sharma J, Bali K. Query word labeling and back transliteration for Indian languages: shared task system description. FIRE Working Notes. 2013;3.Gupta DK, Kumar S, Ekbal A. Machine learning approach for language identification and transliteration. In: Proceedings of the forum for information retrieval evaluation, ACM, 2014. pp. 60–64.Gupta P, Bali K, Banchs RE, Choudhury M, Rosso P. Query expansion for mixed-script information retrieval. In: Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, ACM, 2014. pp. 677–686.Gupta P, Rosso P, Banchs RE. Encoding transliteration variation through dimensionality reduction: fire shared task on transliterated search. In: Fifth forum for information retrieval evaluation. 2013.HB Barathi Ganesh, M Anand Kumar, KP Soman. Distributional semantic representation for information retrieval. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. 2016.HB Barathi Ganesh, M Anand Kumar, KP Soman. Distributional semantic representation for text classification. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. CEUR-WS.org. 2016.Järvelin K, Kekäläinen J. Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst. 2002;20:422–46. https://doi.org/10.1145/582415.582418.Joshi H, Bhatt A, Patel H. Transliterated search using syllabification approach. In: Forum for information retrieval evaluation. 2013.King B, Abney S. Labeling the languages of words in mixed-language documents using weakly supervised methods. In: Proceedings of NAACL-HLT, 2013. pp. 1110–1119.Londhe N, Srihari RK. Exploiting named entity mentions towards code mixed IR: working notes for the UB system submission for MSIR@FIRE’16. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. 2016.Anand Kumar M, Soman KP. Amrita-CEN@MSIR-FIRE2016: Code-mixed question classification using BoWs and RNN embeddings. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. CEUR-WS.org. 2016.Majumder G, Pakray P. NLP-NITMZ@MSIR 2016 system for code-mixed cross-script question classification. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. CEUR-WS.org. 2016.Mandal S, Banerjee S, Naskar SK, Rosso P, Bandyopadhyay S. Adaptive voting in multiple classifier systems for word level language identification. In: FIRE workshops, 2015. pp. 47–50.Mukherjee A, Ravi A , Datta K. Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing. In: Proceedings of the Forum for Information Retrieval Evaluation, Bangalore, India, 2014.Pakray P, Bhaskar P. Transliterated search system for Indian languages. In: Pre-proceedings of the 5th FIRE-2013 workshop, forum for information retrieval evaluation (FIRE). 2013.Patel S, Desai V. Liga and syllabification approach for language identification and back transliteration: a shared task report by da-iict. In: Proceedings of the forum for information retrieval evaluation, ACM, 2014. pp. 43–47.Prabhakar DK, Pal S. Ism@fire-2013 shared task on transliterated search. In: Post-Proceedings of the 4th and 5th workshops of the forum for information retrieval evaluation, ACM, 2013. p. 17.Prabhakar DK, Pal S. Ism@ fire-2015: mixed script information retrieval. In: FIRE workshops. 2015. pp. 55–58.Prakash A, Saha SK. A relevance feedback based approach for mixed script transliterated text search: shared task report by bit Mesra. In: Proceedings of the Forum for Information Retrieval Evaluation, Bangalore, India, 2014.Raj A, Karfa S. A list-searching based approach for language identification in bilingual text: shared task report by asterisk. In: Working notes of the shared task on transliterated search at forum for information retrieval evaluation FIRE’14. 2014.Roy RS, Choudhury M, Majumder P, Agarwal K. Overview of the fire 2013 track on transliterated search. In: Post-proceedings of the 4th and 5th workshops of the forum for information retrieval evaluation, ACM, 2013. p. 4.Saini A. Code mixed cross script question classification. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. CEUR-WS.org. 2016.Salton G, McGill MJ. Introduction to modern information retrieval. New York: McGraw-Hill, Inc.; 1986.Sequiera R, Choudhury M, Gupta P, Rosso P, Kumar S, Banerjee S, Naskar SK, Bandyopadhyay S, Chittaranjan G, Das A, et al. Overview of fire-2015 shared task on mixed script information retrieval. FIRE Workshops. 2015;1587:19–25.Singh S, M Anand Kumar, KP Soman. CEN@Amrita: information retrieval on code mixed Hindi–English tweets using vector space models. In: Working notes of FIRE 2016—forum for information retrieval evaluation, Kolkata, India, December 7–10, 2016, CEUR workshop proceedings. 2016.Sinha N, Srinivasa G. Hindi–English language identification, named entity recognition and back transliteration: shared task system description. In: Working notes os shared task on transliterated search at forum for information retrieval evaluation FIRE’14. 2014.Voorhees EM, Tice DM. The TREC-8 question answering track evaluation. In: TREC-8, 1999. pp. 83–105.Vyas Y, Gella S, Sharma J, Bali K, Choudhury M. Pos tagging of English–Hindi code-mixed social media content. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014. pp. 974–979

    Cross-view Embeddings for Information Retrieval

    Full text link
    In this dissertation, we deal with the cross-view tasks related to information retrieval using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script IR, which deals with the challenges faced by an IR system when a language is written in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an abstract space and CAE provides the state-of-the-art performance. We study a wide variety of models for cross-language information retrieval (CLIR) and propose a model based on compositional neural networks (XCNN) which overcomes the limitations of the existing methods and achieves the best results for many CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval and cross-language plagiarism detection. We empirically test the proposed models for these tasks on publicly available datasets and present the results with analyses. In this dissertation, we also explore an effective method to incorporate contextual similarity for lexical selection in machine translation. Concretely, we investigate a feature based on context available in source sentence calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over the strong baselines for English-to-Spanish and English-to-Hindi translation tasks. Finally, we explore the the methods to evaluate the quality of autoencoder generated representations of text data and analyse its architectural properties. For this, we propose two metrics based on reconstruction capabilities of the autoencoders: structure preservation index (SPI) and similarity accumulation index (SAI). We also introduce a concept of critical bottleneck dimensionality (CBD) below which the structural information is lost and present analyses linking CBD and language perplexity.En esta disertación estudiamos problemas de vistas-múltiples relacionados con la recuperación de información utilizando técnicas de representación en espacios de baja dimensionalidad. Estudiamos las técnicas existentes y proponemos nuevas técnicas para solventar algunas de las limitaciones existentes. Presentamos formalmente el concepto de recuperación de información con escritura mixta, el cual trata las dificultades de los sistemas de recuperación de información cuando los textos contienen escrituras en distintos alfabetos debido a razones tecnológicas y socioculturales. Las palabras en escritura mixta son representadas en un espacio de características finito y reducido, compuesto por n-gramas de caracteres. Proponemos los auto-codificadores de vistas-múltiples (CAE, por sus siglas en inglés) para modelar dichas palabras en un espacio abstracto, y esta técnica produce resultados de vanguardia. En este sentido, estudiamos varios modelos para la recuperación de información entre lenguas diferentes (CLIR, por sus siglas en inglés) y proponemos un modelo basado en redes neuronales composicionales (XCNN, por sus siglas en inglés), el cual supera las limitaciones de los métodos existentes. El método de XCNN propuesto produce mejores resultados en diferentes tareas de CLIR tales como la recuperación de información ad-hoc, la identificación de oraciones equivalentes en lenguas distintas y la detección de plagio entre lenguas diferentes. Para tal efecto, realizamos pruebas experimentales para dichas tareas sobre conjuntos de datos disponibles públicamente, presentando los resultados y análisis correspondientes. En esta disertación, también exploramos un método eficiente para utilizar similitud semántica de contextos en el proceso de selección léxica en traducción automática. Específicamente, proponemos características extraídas de los contextos disponibles en las oraciones fuentes mediante el uso de auto-codificadores. El uso de las características propuestas demuestra mejoras estadísticamente significativas sobre sistemas de traducción robustos para las tareas de traducción entre inglés y español, e inglés e hindú. Finalmente, exploramos métodos para evaluar la calidad de las representaciones de datos de texto generadas por los auto-codificadores, a la vez que analizamos las propiedades de sus arquitecturas. Como resultado, proponemos dos nuevas métricas para cuantificar la calidad de las reconstrucciones generadas por los auto-codificadores: el índice de preservación de estructura (SPI, por sus siglas en inglés) y el índice de acumulación de similitud (SAI, por sus siglas en inglés). También presentamos el concepto de dimensión crítica de cuello de botella (CBD, por sus siglas en inglés), por debajo de la cual la información estructural se deteriora. Mostramos que, interesantemente, la CBD está relacionada con la perplejidad de la lengua.En aquesta dissertació estudiem els problemes de vistes-múltiples relacionats amb la recuperació d'informació utilitzant tècniques de representació en espais de baixa dimensionalitat. Estudiem les tècniques existents i en proposem unes de noves per solucionar algunes de les limitacions existents. Presentem formalment el concepte de recuperació d'informació amb escriptura mixta, el qual tracta les dificultats dels sistemes de recuperació d'informació quan els textos contenen escriptures en diferents alfabets per motius tecnològics i socioculturals. Les paraules en escriptura mixta són representades en un espai de característiques finit i reduït, composat per n-grames de caràcters. Proposem els auto-codificadors de vistes-múltiples (CAE, per les seves sigles en anglès) per modelar aquestes paraules en un espai abstracte, i aquesta tècnica produeix resultats d'avantguarda. En aquest sentit, estudiem diversos models per a la recuperació d'informació entre llengües diferents (CLIR , per les sevas sigles en anglès) i proposem un model basat en xarxes neuronals composicionals (XCNN, per les sevas sigles en anglès), el qual supera les limitacions dels mètodes existents. El mètode de XCNN proposat produeix millors resultats en diferents tasques de CLIR com ara la recuperació d'informació ad-hoc, la identificació d'oracions equivalents en llengües diferents, i la detecció de plagi entre llengües diferents. Per a tal efecte, realitzem proves experimentals per aquestes tasques sobre conjunts de dades disponibles públicament, presentant els resultats i anàlisis corresponents. En aquesta dissertació, també explorem un mètode eficient per utilitzar similitud semàntica de contextos en el procés de selecció lèxica en traducció automàtica. Específicament, proposem característiques extretes dels contextos disponibles a les oracions fonts mitjançant l'ús d'auto-codificadors. L'ús de les característiques proposades demostra millores estadísticament significatives sobre sistemes de traducció robustos per a les tasques de traducció entre anglès i espanyol, i anglès i hindú. Finalment, explorem mètodes per avaluar la qualitat de les representacions de dades de text generades pels auto-codificadors, alhora que analitzem les propietats de les seves arquitectures. Com a resultat, proposem dues noves mètriques per quantificar la qualitat de les reconstruccions generades pels auto-codificadors: l'índex de preservació d'estructura (SCI, per les seves sigles en anglès) i l'índex d'acumulació de similitud (SAI, per les seves sigles en anglès). També presentem el concepte de dimensió crítica de coll d'ampolla (CBD, per les seves sigles en anglès), per sota de la qual la informació estructural es deteriora. Mostrem que, de manera interessant, la CBD està relacionada amb la perplexitat de la llengua.Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457TESI

    Exploiting semantics for improving clinical information retrieval

    Get PDF
    Clinical information retrieval (IR) presents several challenges including terminology mismatch and granularity mismatch. One of the main objectives in clinical IR is to fill the semantic gap among the queries and documents and going beyond keywords matching. To address these issues, in this study we attempt to use semantic information to improve the performance of clinical IR systems by representing queries in an expressive and meaningful context. In this study we propose query context modeling to improve the effectiveness of clinical IR systems. To model query contexts we propose two novel approaches to modeling medical query contexts. The first approach concerns modeling medical query contexts based on mining semantic-based AR for improving clinical text retrieval. The query context is derived from the rules that cover the query and then weighted according to their semantic relatedness to the query concepts. In our second approach we model a representative query context by developing query domain ontology. To develop query domain ontology we extract all the concepts that have semantic relationship with the query concept(s) in UMLS ontologies. Query context represents concepts extracted from query domain ontology and weighted according to their semantic relatedness to the query concept(s). The query context is then exploited in the patient records query expansion and re-ranking for improving clinical retrieval performance. We evaluate this approach on the TREC Medical Records dataset. Results show that our proposed approach significantly improves the retrieval performance compare to classic keyword-based IR model

    Mixed-Language Arabic- English Information Retrieval

    Get PDF
    Includes abstract.Includes bibliographical references.This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve most relevant documents, regardless of their languages. To achieve this goal, however, it is essential firstly to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and which result in biasing the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in non-English language) would likely overweight and skew the impact of those technical terms (mostly those in English) due to high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries

    Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study

    Get PDF
    Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications which can be described as recall-oriented IR tasks have received increased attention in the IR research domain. Prominent among these IR applications are patent search and legal search, where users are typically ready to check hundreds or possibly thousands of documents in order to find any possible relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. For precision-oriented tasks, mean average precision continues to be used as the primary evaluation metric for almost all IR applications. For recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, is different from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task. The study includes IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general and particularly patent search, and improving the efficiency of multilingual search for this kind of task

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    Get PDF
    After addressing the state-of-the-art during the first year of Chorus and establishing the existing landscape in multimedia search engines, we have identified and analyzed gaps within European research effort during our second year. In this period we focused on three directions, notably technological issues, user-centred issues and use-cases and socio- economic and legal aspects. These were assessed by two central studies: firstly, a concerted vision of functional breakdown of generic multimedia search engine, and secondly, a representative use-cases descriptions with the related discussion on requirement for technological challenges. Both studies have been carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations in international conferences, and surveys addressed to EU projects coordinators as well as National initiatives coordinators. Based on the obtained feedback we identified two types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges, but have impact on innovation progress. New socio-economic trends are presented as well as emerging legal challenges

    Search beyond traditional probabilistic information retrieval

    Get PDF
    "This thesis focuses on search beyond probabilistic information retrieval. Three ap- proached are proposed beyond the traditional probabilistic modelling. First, term associ- ation is deeply examined. Term association considers the term dependency using a factor analysis based model, instead of treating each term independently. Latent factors, con- sidered the same as the hidden variables of ""eliteness"" introduced by Robertson et al. to gain understanding of the relation among term occurrences and relevance, are measured by the dependencies and occurrences of term sequences and subsequences. Second, an entity-based ranking approach is proposed in an entity system named ""EntityCube"" which has been released by Microsoft for public use. A summarization page is given to summarize the entity information over multiple documents such that the truly relevant entities can be highly possibly searched from multiple documents through integrating the local relevance contributed by proximity and the global enhancer by topic model. Third, multi-source fusion sets up a meta-search engine to combine the ""knowledge"" from different sources. Meta-features, distilled as high-level categories, are deployed to diversify the baselines. Three modified fusion methods are employed, which are re- ciprocal, CombMNZ and CombSUM with three expanded versions. Through extensive experiments on the standard large-scale TREC Genomics data sets, the TREC HARD data sets and the Microsoft EntityCube Web collections, the proposed extended models beyond probabilistic information retrieval show their effectiveness and superiority.
    corecore