Evaluation of MIRACLE approach results for CLEF 2003
This paper describes the MIRACLE (Multilingual Information RetrievAl for the CLEf campaign) approach and its results for the mono-, bi- and multilingual Cross Language Evaluation Forum tasks. The approach combines linguistic and statistical techniques to perform indexing and retrieval.
Mixed-Language Arabic-English Information Retrieval
This thesis addresses the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, it is first essential to suppress the problems that the mixed-language feature causes in both queries and documents, which bias the final ranked list. A cross-lingual re-weighting model was therefore developed. In this model, the term frequency, document frequency and document length components in mixed queries are estimated and adjusted regardless of language, while the model also accounts for features unique to mixed-language queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly in the non-English language) are likely to be overweighted and to skew the impact of technical terms (mostly in English), because the latter have high document frequencies (and thus low weights) in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of English in scientific domains. Accordingly, the thesis also proposes a re-weighted Inverse Document Frequency (IDF) to moderate the effect of overweighted terms in mixed queries.
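The weighting skew described in the abstract above can be illustrated with a minimal sketch. It assumes the classic IDF formula log(N / df) computed per language sub-collection. The collection sizes, document frequencies, example terms, and the `moderated_idf` capping rule are all hypothetical and are not the re-weighting model proposed in the thesis.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classic inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)


# Hypothetical collection statistics, invented for illustration only.
ENGLISH_DOCS = 10_000  # size of the English sub-collection
ARABIC_DOCS = 1_000    # size of the Arabic sub-collection

# A mixed query: one English technical term and one Arabic non-technical
# term (shown transliterated). Each term maps to
# (sub-collection size, document frequency in that sub-collection).
query_stats = {
    "deadlock": (ENGLISH_DOCS, 2_000),  # technical term, frequent in English documents
    "mafhum": (ARABIC_DOCS, 20),        # non-technical term, rare in Arabic documents
}

# Per-language IDF: the rare non-technical term receives the larger weight.
per_language_idf = {term: idf(n, df) for term, (n, df) in query_stats.items()}


def moderated_idf(weights: dict[str, float], cap_ratio: float = 1.5) -> dict[str, float]:
    """Hypothetical moderation rule (an assumption, not the thesis's formula):
    no term may weigh more than cap_ratio times the lightest query term."""
    ceiling = cap_ratio * min(weights.values())
    return {term: min(weight, ceiling) for term, weight in weights.items()}


moderated = moderated_idf(per_language_idf)
```

With these invented numbers, the non-technical term starts with more than twice the weight of the technical term, and the cap narrows that gap. Any real re-weighting scheme would need to be derived from the thesis itself.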
MIRACLE Approaches to Multilingual Information Retrieval: A Baseline for Future Research
This paper describes the first set of experiments defined by the MIRACLE (Multilingual Information RetrievAl for the CLEf campaign) research group for some of the cross-language tasks defined by CLEF. These experiments combine different basic techniques, both linguistically oriented and statistically oriented, applied to the indexing and retrieval processes.
Making Connections Oakland: A Case Study for GCIR
GCIR profiles Making Connections Oakland (MCO), a comprehensive initiative that helps newcomers gain an economic foothold and become fully participating members of society. The program was designed to build united neighborhoods and stronger families through strategies that illustrate the cornerstones of GCIR's Immigrant Integration Framework: mutual responsibility, change and benefits; multi-sector involvement; and multi-strategy approaches. Each method has its unique strengths with regard to immigrant integration and is highlighted in this document. As the examples in this report demonstrate, foundations do not need to build an immigrant integration program from scratch. Grantmakers can use resources that already exist in their communities to continue supporting their funding priorities.
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-words
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks. Comment: In The Twelfth ACM International Conference
on Web Search and Data Mining (WSDM '19)
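The low-rank factorization mentioned in the abstract above can be written out using one standard formulation of reduced-rank ridge regression. The symbols below ($X$, $Y$, $\lambda$, $r$, $\Phi$) are assumptions introduced here for illustration, and the paper's exact setup may differ. Here $X$ is taken to be an $n \times d$ bag-of-words matrix whose columns cover the vocabularies of all languages, and $Y$ an $n \times c$ indicator matrix assigning each document to a concept.

```latex
\begin{align}
\hat{W} &= \arg\min_{W:\ \operatorname{rank}(W) \le r}
           \|XW - Y\|_F^2 + \lambda \|W\|_F^2, \\
\hat{W}_{\text{ridge}} &= (X^\top X + \lambda I)^{-1} X^\top Y, \\
X^{*} &= \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}, \qquad
X^{*} \hat{W}_{\text{ridge}} = U \Sigma V^\top
  \quad \text{(singular value decomposition)}, \\
\hat{W} &= \hat{W}_{\text{ridge}} V_r V_r^\top = \Phi\, V_r^\top, \qquad
\Phi = \hat{W}_{\text{ridge}} V_r .
\end{align}
```

Here $V_r$ holds the first $r$ right singular vectors. Under this reading, $\Phi$ is the $d \times r$ map from bags-of-words to embeddings: a document with bag-of-words row vector $x$ is embedded as $x\Phi$, and the rows of $\Phi$ that belong to a given language's vocabulary form that language's mapping.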
“Appropriateness” in foreign language acquisition and use: some theoretical, methodological and ethical considerations
In this contribution, I focus on the concept of “appropriateness” in the usage, the learning and the teaching of foreign languages. Using a participant-based emic perspective, I investigate multilinguals’ perceptions of appropriateness in their foreign languages. Referring to the existing literature, and using previously unpublished material collected through a web questionnaire (Dewaele and Pavlenko 2001–2003), I will show that multilinguals develop their judgements of appropriateness, a crucial aspect of sociopragmatic and sociocultural competence, as part of their socialisation in a new language/culture. However, their ability to judge appropriateness accurately does not imply that they will always act “appropriately”. Indeed, the presence of conflicting norms in their other languages may contribute to conscious or unconscious divergence from the “appropriate” norm in a particular language. Some implications for foreign language teaching will be considered.
Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer
Code-Switching (CS) multilingual Automatic Speech Recognition (ASR) models
can transcribe speech containing two or more alternating languages during a
conversation. This paper proposes (1) a new method for creating code-switching
ASR datasets from purely monolingual data sources, and (2) a novel Concatenated
Tokenizer that enables ASR models to generate language ID for each emitted text
token while reusing existing monolingual tokenizers. The efficacy of these
approaches for building CS ASR models is demonstrated for two language pairs,
English-Hindi and English-Spanish, where we achieve new state-of-the-art
results on the Miami Bangor CS evaluation corpus. In addition to competitive
ASR performance, the proposed Concatenated Tokenizer models are highly
effective for spoken language identification, achieving 98%+ accuracy on the
out-of-distribution FLEURS dataset.
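One way to picture a concatenated tokenizer is as a wrapper that gives each monolingual tokenizer its own disjoint range of token IDs, so that the language of a token follows from its ID alone. The sketch below rests on that assumption, which the abstract does not spell out. `ToyTokenizer`, `ConcatenatedTokenizer`, and the tiny vocabularies are hypothetical stand-ins, not the authors' implementation.

```python
class ToyTokenizer:
    """Whitespace tokenizer over a fixed vocabulary (stand-in for a real monolingual tokenizer)."""

    def __init__(self, vocab):
        self.tokens = list(vocab)
        self.ids = {token: i for i, token in enumerate(self.tokens)}

    def __len__(self):
        return len(self.tokens)

    def encode(self, text):
        return [self.ids[word] for word in text.split()]


class ConcatenatedTokenizer:
    """Hypothetical wrapper: shifts each tokenizer's IDs into its own disjoint range."""

    def __init__(self, tokenizers):
        self.tokenizers = dict(tokenizers)  # language code -> tokenizer, in order
        self.offsets = {}
        offset = 0
        for lang, tokenizer in self.tokenizers.items():
            self.offsets[lang] = offset
            offset += len(tokenizer)
        self.vocab_size = offset

    def encode(self, text, lang):
        """Encode monolingual text and shift its IDs into that language's range."""
        offset = self.offsets[lang]
        return [offset + i for i in self.tokenizers[lang].encode(text)]

    def lang_of(self, token_id):
        """Recover the language ID of a single token from its ID range."""
        for lang, tokenizer in self.tokenizers.items():
            offset = self.offsets[lang]
            if offset <= token_id < offset + len(tokenizer):
                return lang
        raise ValueError(f"token id {token_id} is outside the concatenated vocabulary")

    def decode(self, token_ids):
        """Map shifted IDs back to text via the owning tokenizer's vocabulary."""
        words = []
        for token_id in token_ids:
            lang = self.lang_of(token_id)
            words.append(self.tokenizers[lang].tokens[token_id - self.offsets[lang]])
        return " ".join(words)


english = ToyTokenizer(["i", "like", "the"])
spanish = ToyTokenizer(["la", "playa", "me"])
concatenated = ConcatenatedTokenizer({"en": english, "es": spanish})

# A code-switched utterance assembled from two monolingual segments.
token_ids = concatenated.encode("i like", "en") + concatenated.encode("la playa", "es")
token_langs = [concatenated.lang_of(t) for t in token_ids]
```

Because the existing tokenizers are reused unchanged, only the ID offsets are new. The per-token language labels come for free from the ID ranges.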
MIRACLE at GeoCLEF Query Parsing 2007: Extraction and Classification of Geographical Information
This paper describes the participation of the MIRACLE research consortium in the Query Parsing task of GeoCLEF 2007. Our system is composed of three main modules. First, the Named Geo-entity Identifier performs geo-entity identification and tagging, i.e., it extracts the “where” component of the geographical query, if any. This module is based on a gazetteer built from the Geonames geographical database and carries out a sequential process in three steps: geo-entity recognition, geo-entity selection and query tagging. Then, the Query Analyzer parses this tagged query to identify the “what” and “geo-relation” components by means of a rule-based grammar. Finally, a two-level multiclassifier first decides whether the query is indeed a geographical query and, if so, then determines the query type according to the type of information that the user is supposed to be looking for: map, yellow page or information. According to a strict evaluation criterion where a match must have all fields correct, our system reaches a precision of 42.8% and a recall of 56.6%, and our submission is ranked 1st out of 6 participants in the task. A detailed evaluation of the confusion matrices reveals that some extra effort must be invested in “user-oriented” disambiguation techniques to improve the first-level binary classifier for detecting geographical queries, as it is a key component for eliminating many false positives.
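The first two modules described above can be sketched in a few lines. This is a minimal illustration, assuming a longest-match gazetteer lookup and a single suffix rule for the geo-relation. The `GAZETTEER` entries, the `GEO_RELATIONS` list, and the `parse_query` function are hypothetical. The actual system uses a Geonames-based gazetteer and a fuller rule-based grammar, and the classifier stage is not shown.

```python
# Hypothetical stand-ins for the Geonames-based gazetteer and the relation lexicon.
GAZETTEER = {"new york", "madrid", "paris"}
GEO_RELATIONS = ("in", "near", "north of", "south of")


def parse_query(query: str) -> dict:
    """Split a query into 'what', 'geo_relation' and 'where' components."""
    tokens = query.lower().split()

    # Geo-entity recognition and selection: keep the longest gazetteer match.
    best = None  # (start, end) token span of the selected entity
    for start in range(len(tokens)):
        for end in range(len(tokens), start, -1):
            if " ".join(tokens[start:end]) in GAZETTEER:
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                break  # longest match from this start position found

    if best is None:
        # No "where" component: the whole query is the "what".
        return {"what": " ".join(tokens), "geo_relation": "", "where": ""}

    start, end = best
    where = " ".join(tokens[start:end])
    before = " ".join(tokens[:start])

    # Toy rule: the geo-relation is the longest known relation phrase
    # that ends immediately before the entity.
    relation = ""
    for candidate in sorted(GEO_RELATIONS, key=len, reverse=True):
        if before == candidate or before.endswith(" " + candidate):
            relation = candidate
            break

    what = before[: len(before) - len(relation)].strip()
    return {"what": what, "geo_relation": relation, "where": where}


parsed = parse_query("hotels near New York")
```

This sketch ignores any tokens that follow the geo-entity and does no disambiguation between place names and ordinary words, which is the kind of false-positive problem the abstract identifies.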