5,878 research outputs found

Evaluation of MIRACLE approach results for CLEF 2003

Author: Fombella Mourelle Jorge
García Serrano Ana
González Cristóbal José Carlos
Goñi Menoyo José Miguel
Martínez Fernández José Luis
Martínez Fernández Paloma
Ruiz Cristina Alberto
Villena Román Julio
Publication venue: E.T.S.I. Telecomunicación (UPM)
Publication date: 01/01/2003

This paper describes the MIRACLE (Multilingual Information RetrievAl for the CLEf campaign) approach and results for the mono-, bi- and multilingual Cross Language Evaluation Forum tasks. The approach combines linguistic and statistical techniques to perform indexing and retrieval tasks.

Exploring motivation and clil: a literature review

Author: Pozo Beamud Marta del
Publication venue: Editorial Universidad de Almeria
Publication date: 01/01/2018

Combination approaches for multilingual text retrieval

Author: Braschler Martin
Publication venue: Springer
Publication date: 01/01/2004

Mixed-Language Arabic-English Information Retrieval

Author: Mustafa Ali Mohammed
Publication venue: Department of Computer Science
Publication date: 01/01/2013

This thesis addresses the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, it is first essential to suppress the problems that the mixed-language feature causes in both queries and documents and that bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, the term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of language, while the model also considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in the non-English language) are likely to be overweighted and to skew the impact of technical terms (mostly those in English), owing to the high document frequencies (and thus low weights) of the latter in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a re-weighted Inverse Document Frequency (IDF) to moderate the effect of overweighted terms in mixed queries.
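
The weighting imbalance described in this abstract can be illustrated with a minimal sketch that uses the standard textbook IDF, log(N / df). The collection sizes and document frequencies below are made-up numbers chosen only for illustration, and the code does not reproduce the thesis's re-weighting model, whose formula the abstract does not give.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Standard textbook inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)


# Hypothetical figures (assumptions, not taken from the thesis).
# A technical English term that is frequent in a large English collection:
technical_english_idf = idf(num_docs=100_000, doc_freq=20_000)

# A non-technical Arabic term that is rarer in a smaller Arabic collection:
non_technical_arabic_idf = idf(num_docs=10_000, doc_freq=100)

# When each term is weighted against its own collection, the non-technical
# term receives the larger weight and can dominate the mixed query.
print(f"technical (English) IDF:    {technical_english_idf:.3f}")
print(f"non-technical (Arabic) IDF: {non_technical_arabic_idf:.3f}")
```

Under these assumed numbers the non-technical term is weighted roughly three times as heavily as the technical one, which is the kind of skew a re-weighted IDF would aim to moderate.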

MIRACLE Approaches to Multilingual Information Retrieval: A Baseline for Future Research

Author: Fombella Mourelle Jorge
García Serrano Ana
González Cristóbal José Carlos
Goñi Menoyo José Miguel
Martínez Fernández José Luis
Martínez Fernández Paloma
Villena Román Julio
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2004

This paper describes the first set of experiments defined by the MIRACLE (Multilingual Information RetrievAl for the CLEf campaign) research group for some of the cross-language tasks defined by CLEF. These experiments combine different basic techniques, both linguistic-oriented and statistic-oriented, and apply them to the indexing and retrieval processes.

Making Connections Oakland: A Case Study for GCIR

Author: William Wong
Publication venue: Grantmakers Concerned with Immigrants and Refugees (GCIR)
Publication date: 10/10/2008

GCIR profiles Making Connections Oakland (MCO), a comprehensive initiative that helps newcomers gain an economic foothold and become full participating members of society. The program was designed to build united neighborhoods and stronger families through strategies that illustrate the cornerstones of GCIR's Immigrant Integration Framework: mutual responsibility, change and benefits; multi-sector involvement; and multi-strategy approaches. Each method has its unique strengths with regard to immigrant integration and is highlighted in this document. As the examples in this report demonstrate, foundations do not need to build an immigrant integration program from scratch. Grantmakers can use resources that already exist in their communities to continue supporting their funding priorities.

Crosslingual Document Embedding as Reduced-Rank Ridge Regression

Author: Jaggi Martin
Josifoski Martin
Paskov Hristo S.
Paskov Ivan S.
West Robert
Publication venue: Association for Computing Machinery (ACM)
Publication date: 13/02/2019

There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-word features in order to predict the concept that a given document is about. We show that, when constraining the learned weight matrix to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks.

Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19)

arXiv.org e-Print Archive

“Appropriateness” in foreign language acquisition and use: some theoretical, methodological and ethical considerations

Author: Jean-Marc Dewaele
Publication venue: Walter de Gruyter GmbH
Publication date: 01/01/2008

In this contribution, I focus on the concept of “appropriateness” in the usage, the learning and the teaching of foreign languages. Using a participant-based emic perspective, I investigate multilinguals’ perceptions of appropriateness in their foreign languages. Referring to the existing literature, and using previously unpublished material collected through a web questionnaire (Dewaele and Pavlenko 2001–2003), I will show that multilinguals develop their judgements of appropriateness, a crucial aspect of sociopragmatic and sociocultural competence, as part of their socialisation in a new language/culture. However, their ability to judge appropriateness accurately does not imply that they will always act “appropriately”. Indeed, the presence of conflicting norms in their other languages may contribute to conscious or unconscious divergence from the “appropriate” norm in a particular language. Some implications for foreign language teaching will be considered.

Birkbeck Institutional Research Online

Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer

Author: Dhawan Kunal
Ginsburg Boris
Rekesh Dima
Publication date: 16/09/2023

Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models can transcribe speech containing two or more alternating languages during a conversation. This paper proposes (1) a new method for creating code-switching ASR datasets from purely monolingual data sources, and (2) a novel Concatenated Tokenizer that enables ASR models to generate language ID for each emitted text token while reusing existing monolingual tokenizers. The efficacy of these approaches for building CS ASR models is demonstrated for two language pairs, English-Hindi and English-Spanish, where we achieve new state-of-the-art results on the Miami Bangor CS evaluation corpus. In addition to competitive ASR performance, the proposed Concatenated Tokenizer models are highly effective for spoken language identification, achieving 98%+ accuracy on the out-of-distribution FLEURS dataset.
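
One way to picture how a concatenated tokenizer could expose a language ID per token is sketched below. This is only an illustration under stated assumptions, not the paper's implementation: `ToyTokenizer`, `ConcatenatedTokenizer`, and the tiny vocabularies are hypothetical names and data, and the ID-offset scheme is an assumed design in which each monolingual tokenizer keeps its own vocabulary but occupies a disjoint range of the combined ID space.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in for a monolingual tokenizer (whitespace words)."""

    def __init__(self, vocab: List[str]) -> None:
        self.vocab = vocab
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text: str) -> List[int]:
        return [self.token_to_id[word] for word in text.split()]


class ConcatenatedTokenizer:
    """Reuses monolingual tokenizers, shifting each into its own ID range."""

    def __init__(self, tokenizers: Dict[str, ToyTokenizer]) -> None:
        self.tokenizers = tokenizers
        self.offsets: Dict[str, int] = {}
        offset = 0
        for lang, tokenizer in tokenizers.items():
            self.offsets[lang] = offset
            offset += len(tokenizer.vocab)
        self.vocab_size = offset

    def encode(self, text: str, lang: str) -> List[int]:
        # Shift the monolingual IDs into this language's range.
        base = self.offsets[lang]
        return [base + i for i in self.tokenizers[lang].encode(text)]

    def language_of(self, token_id: int) -> str:
        # The language is recoverable from the ID range alone.
        for lang, tokenizer in self.tokenizers.items():
            start = self.offsets[lang]
            if start <= token_id < start + len(tokenizer.vocab):
                return lang
        raise ValueError(f"token id {token_id} is outside the combined vocabulary")


# Toy vocabularies (assumptions for illustration only).
combined = ConcatenatedTokenizer({
    "en": ToyTokenizer(["hello", "world"]),
    "es": ToyTokenizer(["hola", "mundo"]),
})
mixed_ids = combined.encode("hello", "en") + combined.encode("mundo", "es")
mixed_langs = [combined.language_of(i) for i in mixed_ids]
print(mixed_ids, mixed_langs)
```

Because the ranges never overlap, a model that emits IDs from the combined vocabulary implicitly labels every token with its language, with no separate language-ID output required in this sketch.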

arXiv.org e-Print Archive

MIRACLE at GeoCLEF Query Parsing 2007: Extraction and Classification of Geographical Information

Author: Goñi Menoyo José Miguel
Lana Serrano Sara
Villena Román Julio
Publication venue: E.T.S.I. Telecomunicación (UPM)
Publication date: 01/01/2007

This paper describes the participation of the MIRACLE research consortium in the Query Parsing task of GeoCLEF 2007. Our system is composed of three main modules. First, the Named Geo-entity Identifier performs geo-entity identification and tagging, i.e., it extracts the “where” component of the geographical query, should there be any. This module is based on a gazetteer built from the Geonames geographical database and carries out a sequential process in three steps: geo-entity recognition, geo-entity selection and query tagging. Then, the Query Analyzer parses this tagged query to identify the “what” and “geo-relation” components by means of a rule-based grammar. Finally, a two-level multiclassifier first decides whether the query is indeed a geographical query and, if so, then determines the query type according to the type of information that the user is supposed to be looking for: map, yellow page or information. According to a strict evaluation criterion where a match must have all fields correct, our system reaches a precision of 42.8% and a recall of 56.6%, and our submission is ranked 1st out of 6 participants in the task. A detailed evaluation of the confusion matrices reveals that some extra effort must be invested in “user-oriented” disambiguation techniques to improve the first-level binary classifier for detecting geographical queries, as it is a key component for eliminating many false positives.
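
The strict evaluation criterion mentioned in this abstract can be sketched as follows. This is a minimal illustration under assumptions, not the official GeoCLEF scorer: the field names, the toy queries, and the convention that `None` marks a non-geographical query are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Parse = Dict[str, str]  # hypothetical fields: where, what, geo_relation, type


def strict_precision_recall(
    gold: Dict[str, Optional[Parse]],
    predicted: Dict[str, Optional[Parse]],
) -> Tuple[float, float]:
    """Count a prediction as correct only if every field matches the gold parse."""
    n_predicted = sum(1 for p in predicted.values() if p is not None)
    n_gold = sum(1 for g in gold.values() if g is not None)
    n_correct = sum(
        1
        for query_id, p in predicted.items()
        if p is not None and gold.get(query_id) == p
    )
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Toy data (made up for illustration).
gold = {
    "q1": {"where": "Madrid", "what": "hotels", "geo_relation": "IN", "type": "Yellow page"},
    "q2": {"where": "Lisbon", "what": "map", "geo_relation": "NONE", "type": "Map"},
    "q3": None,  # not a geographical query
}
predicted = {
    "q1": {"where": "Madrid", "what": "hotels", "geo_relation": "IN", "type": "Yellow page"},
    "q2": {"where": "Porto", "what": "map", "geo_relation": "NONE", "type": "Map"},  # wrong "where"
    "q3": {"where": "Paris", "what": "hilton", "geo_relation": "NONE", "type": "Information"},  # false positive
}
precision, recall = strict_precision_recall(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy run a single wrong field in `q2` makes the whole parse count as an error, and the false positive on `q3` lowers precision without affecting recall, which shows why the first-level geographical/non-geographical decision matters under a strict criterion.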