
    Initial Observations on Query Based Sampling in Distributed CLIR

    Cross Language Information Retrieval (CLIR) enables people to search for information written in languages other than their query language. Information can be retrieved either from a single cross-lingual collection or from a variety of distributed cross-lingual sources. This paper presents initial results exploring the effectiveness of distributed CLIR using query-based sampling techniques, which to the best of our knowledge has not been investigated before. In distributed retrieval over multiple databases, query-based sampling provides a simple and effective way to acquire accurate resource descriptions, which help to select which databases to search. Observations from our initial experiments show that the negative impact of query-based sampling on cross-language search may not be as great as it is on monolingual retrieval.
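    As a rough sketch of the sampling idea, the following Python fragment (hypothetical names, toy data) builds a resource description of an otherwise opaque database by issuing one-term queries and accumulating the vocabulary of the documents that come back:

```python
import random
from collections import Counter

def query_based_sample(search, seed_terms, n_queries=5, docs_per_query=2, rng=None):
    """Build a resource description (a term-frequency profile) of an
    uncooperative database by issuing random one-term queries and
    accumulating the vocabulary of the returned documents."""
    rng = rng or random.Random(0)
    description = Counter()
    pool = list(seed_terms)
    for _ in range(n_queries):
        term = rng.choice(pool)
        for doc in search(term)[:docs_per_query]:
            tokens = doc.lower().split()
            description.update(tokens)
            pool.extend(tokens)  # newly learned terms seed later queries
    return description

# Toy "remote" collection, searchable only through a query interface.
docs = ["distributed retrieval selects databases",
        "cross language retrieval translates queries",
        "sampling estimates vocabulary statistics"]
search = lambda t: [d for d in docs if t in d]
desc = query_based_sample(search, seed_terms=["retrieval"])
```

    A database-selection step can then compare such descriptions against a user query instead of searching every database exhaustively.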

    Improving cross language information retrieval using corpus based query suggestion approach

    Users seeking information may not find material relevant to their information need in a specific language. The information may exist in a language different from their own, but they may not know that language, and so may have difficulty accessing information present in other languages. Since the retrieval process depends on translating the user query, obtaining the right translation raises many issues. For a pair of languages chosen by a user, the available resources, such as an incomplete dictionary or an inaccurate machine translation system, may be insufficient to map the query terms in one language to their equivalent terms in the other. Moreover, a given query may have multiple correct translations. The underlying corpus evidence may suggest a clue for selecting a probable set of translations that could eventually yield better retrieval. In this paper, we present a cross-language information retrieval approach that effectively retrieves information present in a language other than that of the user query, using a corpus-driven query suggestion approach. The idea is to utilize the corpus-based evidence of one language to improve the retrieval and re-ranking of news documents in the other language. We use the FIRE corpora (Tamil and English news collections) in our experiments and illustrate the effectiveness of the proposed cross-language information retrieval approach.
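    The corpus-driven selection of translations can be illustrated with a small sketch that scores each combination of candidate translations by their co-occurrence in the target corpus (a toy table stands in for real corpus statistics; all names are illustrative):

```python
from itertools import product

def suggest_translation(candidates, cooccur):
    """Pick, for each source term, the target-language translation whose
    combination co-occurs most often in the target corpus.
    `candidates` maps each source term to its dictionary translations;
    `cooccur(a, b)` returns a corpus co-occurrence count."""
    best, best_score = None, -1
    for combo in product(*candidates.values()):
        score = sum(cooccur(a, b) for i, a in enumerate(combo)
                    for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(candidates.keys(), best))

# Toy co-occurrence table standing in for target-corpus statistics.
counts = {("election", "minister"): 9, ("choice", "minister"): 1,
          ("election", "priest"): 0, ("choice", "priest"): 2}
cooccur = lambda a, b: counts.get((a, b), counts.get((b, a), 0))
query = {"term1": ["election", "choice"], "term2": ["minister", "priest"]}
picked = suggest_translation(query, cooccur)
```

    Here the pair that co-occurs most in the target corpus ("election", "minister") wins, even though each term also has a dictionary-valid alternative.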

    Using Language Models for Information Retrieval

    Because of the world wide web, information retrieval systems are now used by millions of untrained users all over the world. The search engines that perform the information retrieval task often retrieve thousands of potentially interesting documents for a query. The documents should be ranked in decreasing order of relevance in order to be useful to the user. This book describes a mathematical model of information retrieval based on statistical language models. The approach uses simple document-based unigram models to compute, for each document, the probability that it generates the query. This probability is used to rank the documents. The study makes the following research contributions. * The development of a model that integrates term weighting, relevance feedback and structured queries. * The development of a model that supports multiple representations of a request or information need by integrating a statistical translation model. * The development of a model that supports multiple representations of a document, for instance by allowing proximity searches or searches for terms from a particular record field (e.g. a search for terms from the title). * A mathematical interpretation of stop word removal and stemming. * A mathematical interpretation of operators for mandatory terms, wildcards and synonyms. * A practical comparison of a language model-based retrieval system with similar systems that are based on well-established models and term weighting algorithms in a controlled experiment. * The application of the model to cross-language information retrieval and adaptive information filtering, and the evaluation of two prototype systems in a controlled experiment. Experimental results on three standard tasks show that the language model-based algorithms work as well as, or better than, today's top-performing retrieval algorithms. The standard tasks investigated are ad-hoc retrieval (when there are no previously retrieved documents to guide the search), retrospective relevance weighting (finding the optimum model for a given set of relevant documents), and ad-hoc retrieval using manually formulated Boolean queries. The applications to cross-language retrieval and adaptive filtering show the practical use of structured queries and relevance feedback, respectively.
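    The core ranking idea, scoring each document by the probability that its unigram language model generates the query, can be sketched as follows (here with Jelinek-Mercer smoothing against the collection model, one common choice):

```python
import math
from collections import Counter

def query_likelihood_rank(query, docs, lam=0.5):
    """Rank documents by the probability that each document's unigram
    language model generates the query, smoothing the document model
    with the collection model (Jelinek-Mercer interpolation)."""
    doc_tfs = [Counter(d.split()) for d in docs]
    coll_tf = Counter()
    for tf in doc_tfs:
        coll_tf.update(tf)
    coll_len = sum(coll_tf.values())
    scores = []
    for i, tf in enumerate(doc_tfs):
        dlen = sum(tf.values())
        logp = 0.0
        for w in query.split():
            p_doc = tf[w] / dlen            # document model P(w|D)
            p_coll = coll_tf[w] / coll_len  # collection model P(w|C)
            logp += math.log(lam * p_doc + (1 - lam) * p_coll + 1e-12)
        scores.append((logp, i))
    return [i for _, i in sorted(scores, reverse=True)]

docs = ["language models rank documents",
        "statistical translation of queries",
        "models of language generate the query"]
ranking = query_likelihood_rank("language models", docs)
```

    The smoothing term is what gives a mathematical interpretation to classical heuristics such as term weighting: rare collection terms contribute more to the score difference between documents than common ones.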

    Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations

    Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities of zero-shot cross-lingual transfer without direct cross-lingual supervision. While these results are promising, follow-up work found that, within the multilingual embedding spaces, there exists strong language identity information which hinders the expression of linguistic factors shared across languages. For semantic tasks like cross-lingual sentence retrieval, it is desirable to remove such language identity signals to fully leverage the semantic information. In this work, we provide a novel view of projecting away language-specific factors from a multilingual embedding space. Specifically, we discover that there exists a low-rank subspace that primarily encodes information irrelevant to semantics (e.g., syntactic information). To identify this subspace, we present a simple but effective unsupervised method based on singular value decomposition with multiple monolingual corpora as input. Once the subspace is found, we can directly project the original embeddings into its null space to boost language agnosticism without finetuning. We systematically evaluate our method on various tasks, including the challenging language-agnostic QA retrieval task. Empirical results show that applying our method consistently leads to improvements over commonly used ML-LMs. Comment: 17 pages, 7 figures, EMNLP 2022 (main conference).
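    A minimal NumPy sketch of the projection step, assuming the top singular directions of the centered embedding matrix approximate the language-identity subspace (the paper's actual procedure may differ in detail):

```python
import numpy as np

def remove_language_subspace(embeddings, rank=2):
    """Find the top singular directions of the mean-centered embedding
    matrix -- assumed here to encode language identity rather than
    semantics -- and project every vector onto their null space."""
    X = embeddings - embeddings.mean(axis=0)
    # Rows of Vt are right singular vectors; keep the top `rank` of them.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:rank].T                       # (dim, rank) language subspace
    P = np.eye(X.shape[1]) - V @ V.T      # projector onto the null space
    return embeddings @ P

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))            # stand-in for ML-LM embeddings
cleaned = remove_language_subspace(emb, rank=2)
```

    After projection, every embedding has zero component along the removed directions, so similarity comparisons no longer reward same-language pairs for sharing those directions. No finetuning of the underlying model is needed.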

    A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles

    Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and Cross-Language Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language links. However, the extent to which these articles are similar is highly variable, and this may affect the use of Wikipedia as a comparable resource. In this paper we compare various language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word-count ratio, and an approach based on outlinks. These approaches are compared against a baseline utilising MT resources. The measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (character n-grams, outlinks, and word-count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis). The work of the first author was in the framework of the Tacardi research project (TIN2012-38523-C02-00). The work of the fourth author was in the framework of the DIANA-Applications (TIN2012-38603-C02-01) and WIQ-EI IRSES (FP7 Marie Curie No. 269180) research projects. Barrón Cedeño, L.A.; Paramita, M.L.; Clough, P.; Rosso, P. (2014). A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles. In: Advances in Information Retrieval. Springer (Germany), pp. 424-429. https://doi.org/10.1007/978-3-319-06028-6_36
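    Of the language-independent measures compared, character n-grams are the simplest to reproduce; a minimal sketch (illustrative, not the authors' code) computes the cosine between character trigram profiles:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram profile of a text, padded with spaces."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_cosine(a, b, n=3):
    """Language-independent similarity: cosine between the character
    n-gram profiles of two texts. Cognates and shared named entities
    keep this high even across related languages."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

sim_related = ngram_cosine("information retrieval", "informatie retrieval")
sim_unrelated = ngram_cosine("information retrieval", "qwzx vbnm")
```

    Because the measure needs no dictionary or MT system, it can be applied to any Wikipedia language pair, which is what makes it attractive as a building block for the combined model.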

    MSIR@FIRE: A Comprehensive Report from 2013 to 2016

    [EN] India is a nation of geographical and cultural diversity where over 1600 dialects are spoken. With technological advancement, penetration of the internet, and cheaper access to mobile data, India has recently seen a sudden growth in internet users. These users generate content either in English or in other vernacular Indian languages. To develop technological solutions for the content generated by Indian users in Indian languages, the Forum for Information Retrieval Evaluation (FIRE) was established and held for the first time in 2008. Although Indian languages are written using indigenous scripts, websites and user-generated content (such as tweets and blogs) in these languages are often written in Roman script for various socio-cultural and technological reasons. A challenge that search engines face while processing transliterated queries and documents is extensive spelling variation. The MSIR track was first introduced at FIRE in 2013. Its aim was to systematically formalize several research problems that must be solved to tackle code mixing in Web search for users of many languages around the world, to develop related data sets and test benches, and, most importantly, to build a research community focusing on this important problem, which has received very little attention. This document is a comprehensive report on the four years of the MSIR track evaluated at FIRE between 2013 and 2016. Somnath Banerjee and Sudip Kumar Naskar are supported by Media Lab Asia, MeitY, Government of India, under the Visvesvaraya PhD Scheme for Electronics & IT. The work of Paolo Rosso was partially supported by the MISMIS research project PGC2018-096212-B-C31 funded by the Spanish MICINN. Banerjee, S.; Choudhury, M.; Chakma, K.; Kumar Naskar, S.; Das, A.; Bandyopadhyay, S.; Rosso, P. (2020). MSIR@FIRE: A Comprehensive Report from 2013 to 2016. SN Computer Science. 1(55):1-15. https://doi.org/10.1007/s42979-019-0058-0

    SMAN : Stacked Multi-Modal Attention Network for cross-modal image-text retrieval

    This article focuses on the task of cross-modal image-text retrieval, an interdisciplinary topic in both the computer vision and natural language processing communities. Existing methods based on global representation alignment fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes suffer from a huge computational burden when exhaustively aggregating the similarities of visual fragments and textual words. In this article, we propose a stacked multimodal attention network (SMAN) that uses a stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal and multimodal information as guidance to perform multiple-step attention reasoning so that the fine-grained correlation between image and text can be modeled. As a consequence, we can discover the semantically meaningful visual regions or words in a sentence, which contributes to measuring cross-modal similarity more precisely. Moreover, we present a novel bidirectional ranking loss that pulls pairwise multimodal instances closer together. Doing so allows us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that SMAN consistently yields competitive performance compared to state-of-the-art methods.
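    The bidirectional ranking idea can be illustrated with a common hinge-based formulation over a batch of matched image/text embeddings (a sketch; the paper's exact loss may differ):

```python
import numpy as np

def bidirectional_ranking_loss(img, txt, margin=0.2):
    """Hinge-based bidirectional ranking loss over a batch of matched
    image/text embeddings: each image should be closer to its own
    caption than to any other caption by at least `margin`, and
    symmetrically for each caption against the images."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T                       # cosine similarity matrix
    pos = np.diag(sim)                      # similarities of matched pairs
    cost_i2t = np.maximum(0, margin + sim - pos[:, None])  # image -> texts
    cost_t2i = np.maximum(0, margin + sim - pos[None, :])  # text -> images
    n = sim.shape[0]
    mask = 1 - np.eye(n)                    # ignore the matched diagonal
    return float(((cost_i2t + cost_t2i) * mask).sum() / n)

img = np.eye(4)                             # toy embeddings
matched = bidirectional_ranking_loss(img, np.eye(4))
shuffled = bidirectional_ranking_loss(img, np.eye(4)[[1, 2, 3, 0]])
```

    A perfectly aligned batch incurs zero loss, while mismatched pairs are penalized from both retrieval directions, which is what lets the loss use all pairwise supervision in the batch.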

    Kannada and Telugu Native Languages to English Cross Language Information Retrieval

    One of the crucial challenges in cross-lingual information retrieval is retrieving relevant information for a query expressed in a native language. While retrieving relevant documents is slightly easier, analysing the relevance of the retrieved documents and presenting the results to the users are non-trivial tasks. To accomplish this, we present our Kannada-English and Telugu-English CLIR systems as part of the Ad-Hoc Bilingual task. We take a query-translation-based approach using bilingual dictionaries. When query words are not found in the dictionary, they are transliterated using a simple rule-based approach that utilizes the corpus to return the 'k' closest English transliterations of the given Kannada/Telugu word. The resulting multiple translation/transliteration choices for each query word are disambiguated using an iterative PageRank-style algorithm which, based on term-term co-occurrence statistics, produces the final translated query. Finally, we conduct experiments on these translated queries using a Kannada/Telugu document collection and a set of English queries, report the resulting improvements, and give a statistical analysis of the performance achieved on each task.
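    The PageRank-style disambiguation can be sketched as a small fixed-point computation in which each candidate translation is reinforced by co-occurrence with the weighted candidates of the other query terms (illustrative names; a toy co-occurrence table stands in for corpus statistics):

```python
def disambiguate(candidates, cooccur, iters=20):
    """Iteratively re-score translation candidates: each candidate is
    reinforced by co-occurrence with the currently weighted candidates
    of the *other* query terms, then scores are renormalized per term."""
    score = {t: {c: 1.0 / len(cs) for c in cs} for t, cs in candidates.items()}
    for _ in range(iters):
        new = {}
        for t, cs in candidates.items():
            new[t] = {}
            for c in cs:
                new[t][c] = sum(score[u][d] * cooccur(c, d)
                                for u, ds in candidates.items() if u != t
                                for d in ds)
            z = sum(new[t].values()) or 1.0
            new[t] = {c: v / z for c, v in new[t].items()}
        score = new
    # Keep the highest-scoring candidate per query term.
    return {t: max(cs, key=cs.get) for t, cs in score.items()}

counts = {("minister", "government"): 9, ("minister", "church"): 1,
          ("priest", "government"): 1, ("priest", "church"): 2}
cooccur = lambda a, b: counts.get((a, b), counts.get((b, a), 0))
query = {"w1": ["minister", "priest"], "w2": ["government", "church"]}
picked = disambiguate(query, cooccur)
```

    Candidates that fit together reinforce each other across iterations, so the final translated query favors a mutually coherent set of choices rather than scoring each word in isolation.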

    Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding

    Learning multi-lingual sentence embeddings is a fundamental and significant task in natural language processing. Recent approaches to learning both mono-lingual and multi-lingual sentence embeddings are mainly based on contrastive learning (CL) with an anchor, one positive, and multiple negative instances. In this work, we argue that leveraging multiple positives should be considered for multi-lingual sentence embeddings because (1) positives in a diverse set of languages can benefit cross-lingual learning, and (2) transitive similarity across multiple positives can provide reliable structural information to learn from. To investigate the impact of CL with multiple positives, we propose a novel approach, MPCL, to effectively utilize multiple positive instances to improve the learning of multi-lingual sentence embeddings. Our experimental results on various backbone models and downstream tasks support that, compared with conventional CL, MPCL leads to better retrieval, semantic similarity, and classification performance. We also observe that on unseen languages, sentence embedding models trained on multiple positives have better cross-lingual transfer performance than models trained on a single positive instance. Comment: 14 pages, 4 figures.
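    A minimal sketch of contrastive learning with multiple positives, averaging an InfoNCE-style term for each positive against a shared negative set (illustrative; not necessarily the paper's exact MPCL objective):

```python
import numpy as np

def mpcl_loss(anchor, positives, negatives, tau=0.1):
    """Contrastive loss with multiple positives: average the InfoNCE
    term of each positive (e.g. translations of the anchor sentence)
    against a shared set of negatives."""
    def sim(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    neg_exp = sum(np.exp(sim(anchor, n) / tau) for n in negatives)
    losses = []
    for p in positives:
        pos_exp = np.exp(sim(anchor, p) / tau)
        losses.append(-np.log(pos_exp / (pos_exp + neg_exp)))
    return float(np.mean(losses))

# Toy vectors: positives stand in for translations of the anchor.
e = np.eye(4)
anchor = e[0]
positives = [e[0], e[0] + 0.1 * e[1]]
negatives = [e[1], e[2], e[3]]
aligned = mpcl_loss(anchor, positives, negatives)
mismatched = mpcl_loss(anchor, [e[2]], [e[0], e[1], e[3]])
```

    With a single positive the inner list collapses to one term and the expression reduces to standard InfoNCE; averaging over several positives is what lets a diverse set of languages contribute to each anchor's update.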