Cross-language Information Retrieval
Two key assumptions shape the usual view of ranked retrieval: (1) that the
searcher can choose words for their query that might appear in the documents
that they wish to see, and (2) that ranking retrieved documents will suffice
because the searcher will be able to recognize those which they wished to find.
When the documents to be searched are in a language not known by the searcher,
neither assumption is true. In such cases, Cross-Language Information Retrieval
(CLIR) is needed. This chapter reviews the state of the art for CLIR and
outlines some open research questions. Comment: 49 pages, 0 figures
Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only
We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks. Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach. Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced relying on word-level and document-level alignments. We then demonstrate that further improvements can be achieved by unsupervised ensemble CLIR models. We believe that the proposed framework is the first step towards development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent
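As a rough illustration of ranking in a shared embedding space, the sketch below assumes such a space already exists and skips the adversarial induction step entirely. The 2-d vectors, the `SHARED_SPACE` table, the word-averaging aggregation, and the function names are all hypothetical stand-ins chosen for illustration, not the authors' models.

```python
import math

# Hypothetical 2-d vectors standing in for a shared cross-lingual space.
# A real space would be induced from monolingual corpora, not written by hand.
SHARED_SPACE = {
    "dog": (0.90, 0.10), "hond": (0.88, 0.12),      # English / Dutch
    "barks": (0.80, 0.20), "blaft": (0.82, 0.18),
    "music": (0.10, 0.90), "muziek": (0.12, 0.88),
    "loud": (0.30, 0.70), "luid": (0.28, 0.72),
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [SHARED_SPACE[w] for w in text.lower().split() if w in SHARED_SPACE]
    if not vectors:
        return None
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Order documents by cosine similarity to the query, best first."""
    q = embed(query)
    scored = []
    for doc in documents:
        d = embed(doc)
        score = cosine(q, d) if q is not None and d is not None else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# An English query ranked against Dutch documents.
ranking = rank("dog barks", ["luid muziek", "hond blaft"])
```

Because queries and documents live in the same space, no translation step appears anywhere in the ranking code; the language of each text only matters when looking up its word vectors.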
Data Scarcity in Event Analysis and Abusive Language Detection
Lack of data is almost always the cause of the suboptimal performance of neural networks. Even though data-scarce scenarios can be simulated for any task by assuming limited access to training data, we study two problem areas where data scarcity is a practical challenge: event analysis and abusive content detection. Journalists, social scientists and political scientists need to retrieve and analyze event mentions in unstructured text to compute useful statistical information to understand society. We claim that it is hard to specify an information need about events using a keyword-based representation and propose a Query by Example (QBE) setting for event retrieval. In the QBE setting, we assume that there are a few example sentences mentioning the event class a user is interested in and we aim to retrieve relevant events using only the examples as a query. Traditional event detection approaches are not applicable in this setting as event detection datasets are constructed based on pre-defined schemas, which limits them to a small set of event and event-argument types. Moreover, the amount of annotated data in event detection datasets is so limited that it only allows us to build a retrieval corpus for evaluation. Thus we assume that there are no relevance judgments to train an event retrieval model -- except for the few examples of a specific event type. We create three QBE evaluation settings from three event detection datasets: PoliceKilling, ACE, and IndiaPoliceEvents. For the PoliceKilling dataset, where a relevant sentence describes a police killing event, we show that a query model constructed from the NLP features extracted from the few given examples is effective compared to event detection baselines. For the ACE dataset, where there are thirty-three types of events, we construct a QBE setting for each type and show that a sentence embedding approach effectively transfers for event matching.
Finally, we conducted a unified evaluation of all three datasets using the sentence-embedding-based model and showed that it outperforms strong baselines.
We further examine the effect of data scarcity in abusive language detection. We first study a specific type of abusive language -- hate speech. Neural hate speech detection models trained on one dataset generalize poorly to another dataset from a different domain. This is because characteristics of hate speech vary based on racial and cultural aspects. Our data scarcity scenario assumes that we have a hate speech dataset from one domain and that a model trained on it needs to generalize to a test set from another domain using only the unlabeled data from the test domain. Thus we assume zero labeled target-domain data in this scenario. To tackle the data scarcity, we propose an unsupervised domain adaptation approach to augment labeled data for hate speech detection. We evaluate the approach with three different models (character CNNs, BiLSTMs, and BERT) on three different collections. We show our approach improves Area under the Precision/Recall curve by as much as 42% and recall by as much as 278%, with no loss (and in some cases a significant gain) in precision.
Finally, we examine the cross-lingual abusive language detection problem. Abusive language is a superclass of hate speech that includes profanity, aggression, offensiveness, cyberbullying, toxicity, and hate speech itself. There is a large collection of abusive language detection datasets in English such as Jigsaw. For other languages there exist datasets for abusive language detection but with very limited data. We propose a cross-lingual transfer learning approach to learn an effective neural abusive language classifier for such low-resource languages with help from a dataset from a resource-rich language. The framework is based on a nearest-neighbor architecture and is thus interpretable by design. It is a modern instantiation of the classic k-nearest neighbor model, as we use transformer representations in all its components. Unlike prior work on neighborhood-based approaches, we encode the neighborhood information based on query-neighbor interactions. We propose two encoding schemes and show their effectiveness using both qualitative and quantitative analyses. Our evaluation results on eight languages from two different datasets for abusive language detection show sizable improvements in F1 over strong baselines
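For orientation, the classic k-nearest neighbor model that the abstract builds on can be sketched as below. This is only the textbook majority-vote variant, not the proposed framework: the transformer representations and the query-neighbor interaction encoding are not reproduced, and the vectors, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(query_vec, labeled_vecs, k=3):
    """Return the majority label among the k most similar labeled examples.

    The neighbors are returned as well, since being able to inspect them
    is what makes neighborhood-based classifiers interpretable.
    """
    neighbors = sorted(
        labeled_vecs,
        key=lambda item: cosine(query_vec, item[0]),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors


# Hypothetical 2-d representations of labeled source-language examples.
TRAIN = [
    ((1.0, 0.0), "abusive"),
    ((0.9, 0.1), "abusive"),
    ((0.0, 1.0), "neutral"),
    ((0.1, 0.9), "neutral"),
    ((0.2, 0.8), "neutral"),
]

label, evidence = knn_predict((0.95, 0.05), TRAIN, k=3)
```

In a cross-lingual setting, the labeled examples would come from the resource-rich language and the query from the low-resource one, which presupposes that both are encoded into a common representation space.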
End-to-End Multilingual Information Retrieval with Massively Large Synthetic Datasets
End-to-end neural networks have revolutionized various fields of artificial intelligence. However, advancements in the field of Cross-Lingual Information Retrieval (CLIR) have been stalled due to the lack of large-scale labeled data. CLIR is a retrieval task in which search queries and candidate documents are in different languages. CLIR can be very useful in some scenarios: for example, a reporter may want to search foreign-language news to obtain different perspectives for her story; an inventor may explore the patents in another country to understand prior art.
This dissertation addresses the bottleneck in end-to-end neural CLIR research by synthesizing large-scale CLIR training data and examining techniques that can exploit this in various CLIR tasks. We publicly release the Large-Scale CLIR dataset and CLIRMatrix, two synthetic CLIR datasets covering a large variety of language directions. We explore and evaluate several neural architectures for end-to-end CLIR modeling. Results show that multilingual information retrieval systems trained on these synthetic CLIR datasets are helpful for many language pairs, especially those in low-resource settings. We further show how these systems can be adapted to real-world scenarios
A Grand Challenges-Based Research Agenda for Scholarly Communication and Information Science [MIT Grand Challenge PubPub Participation Platform]
Identifying Grand Challenges
A global and multidisciplinary community of stakeholders came together in March 2018 to identify, scope, and prioritize a common vision for specific grand research challenges related to the fields of information science and scholarly communications. The participants included domain researchers in academia, practitioners, and those who are aiming to democratize scholarship. An explicit goal of the summit was to identify research needs related to barriers in the development of scalable, interoperable, socially beneficial, and equitable systems for scholarly information; and to explore the development of non-market approaches to governing the scholarly knowledge ecosystem.
To spur discussion and exploration, grand challenge provocations were suggested by participants and framed into one of three sections: scholarly discovery, digital curation and preservation, and open scholarship. A few people participated in three segments, but most only attended discussions around a single topic.
To create the guest list of desired participants within our three workshop target areas we invited a distribution of expertise providing diversity across several facets. In addition to having expertise in the specific focus area, we aimed for the participants in each track to be diverse across sectors, disciplines, and regions of the world. Each track had approximately 20-25 people from different parts of the world—including the United States, European Union, South Africa, and India. Domain researchers brought perspectives from a range of scientific disciplines, while practitioners brought perspectives from different roles (drawn from commercial, non-profit, and governmental sectors). Notwithstanding, we were constrained by our social networks, and by the location of the workshop in Cambridge, Massachusetts— and most of the participants were affiliated with US and European institutions.
During our discussions, it quickly became clear that the grand challenges themselves cannot be neatly categorized into discovery, curation and preservation, and open scholarship—or even, for that matter, limited to library science and information sciences. Several cross-cutting themes emerged, such as a strong need to include underrepresented voices and communities outside of mainstream publishing and academic institutions, a need to identify incentives that will motivate people to make changes in their own approaches and processes toward a more open and trusted framework, and a need to identify collaborators and partners from multiple disciplines in order to build strong programs.
The discussions were full of energy, insights, and enthusiasm for inclusive participation—and concluded with a desire for a global call to action to spark changes that will enable more equitable and open scholarship. Some important and productive tensions surfaced in our discussions, particularly around the best paths forward on the challenges we identified. On many core topics, however, there was widespread agreement among participants, especially on the urgent need to address the exclusion of knowledge production and access of so many people around the globe, and the troubling overrepresentation in the scholarly record of white, male, English-language voices. Ultimately, all agreed that we have an obligation to better enrich and greatly expand this space so that our communities can be catalysts for change.
Towards a more inclusive, open, equitable, and sustainable scholarly knowledge ecosystem: Vision; Broadest impacts; Recommendations for broad impact.
Research landscape: Challenges, threats, and barriers; Challenges to participation in the research community; Restrictions on forms of knowledge; Threats to integrity and trust; Threats to the durability of knowledge; Threats to individual agency; Incentives to sustain a scholarly knowledge ecosystem that is inclusive, equitable, trustworthy, and sustainable; Grand Challenges research areas; Recommendations for research areas and programs.
Targeted research questions, research challenges: Legal, economic, policy, and organizational design for enduring, equitable, open scholarship; Measuring, predicting, and adapting to use and utility across scholarly communities; Designing and governing algorithms in the scholarly knowledge ecosystem to support accountability, credibility, and agency; Integrating oral and tacit knowledge into the scholarly knowledge ecosystem.
Integrating research, practice, and policy: The need for leadership to coordinate research, policy, and practice initiatives; Role of libraries and archives as advocates and collaborators; Incorporating values of openness, sustainability, and equity into scholarly infrastructure and practice; Funders, catalysts, and coordinators; Recommendations for integrating research, practice, and policy
Retrieval and identification of moments in images
In our modern society almost anyone is able to capture moments and record
events due to the easy accessibility of smartphones. This leads to the question,
if we record so much of our life how can we easily retrieve specific
moments? The answer to this question would open the door for a big leap
in human life quality. The possibilities are endless, from trivial problems like
finding a photo of a birthday cake to being capable of analyzing the progress
of mental illnesses in patients or even tracking people with infectious diseases.
With so much data being created everyday, the answer to this question becomes
more complex. There is no streamlined approach to solving the problem
of moment localization in a large dataset of images, and investigations into
this problem only started a few years ago. ImageCLEF is one competition
where researchers participate and try to achieve new and better results
in the task of moment retrieval.
This complex problem, along with the interest in participating in the ImageCLEF
Lifelog Moment Retrieval Task posed a good challenge for the
development of this dissertation.
The proposed solution consists of developing a system capable of retrieving
images automatically according to specified moments described in a corpus
of text without any sort of user interaction and using only state-of-the-art
image and text processing methods.
The developed retrieval system achieves this objective by extracting and
categorizing relevant information from text while being able to compute a
similarity score with the extracted labels from the image processing stage. In
this way, the system is capable of telling if images are related to the specified
moment in text and therefore able to retrieve the pictures accordingly.
In the ImageCLEF Life Moment Retrieval 2020 subtask the proposed automatic
retrieval system achieved a score of 0.03 in the F1-measure@10
evaluation methodology. Even though these scores are not competitive when
compared to other teams' systems, the built system presents a good
baseline for future work.
Master's in Electronics and Telecommunications Engineering
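To make the reported F1-measure@10 figure easier to read, the sketch below computes the textbook form of F1 at a rank cutoff, as the harmonic mean of precision and recall over the top k results. This is an assumption about the general shape of the metric: the task's official definition may differ (for example, by using a diversity-oriented recall variant), and the image identifiers are made up.

```python
def f1_at_k(ranked_ids, relevant_ids, k=10):
    """Harmonic mean of precision@k and recall@k for one query.

    ranked_ids: retrieved item ids, best first.
    relevant_ids: set of ids judged relevant for the query.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    if hits == 0:
        return 0.0
    precision = hits / k                 # share of the top k that is relevant
    recall = hits / len(relevant_ids)    # share of relevant items found in the top k
    return 2 * precision * recall / (precision + recall)


# Hypothetical run: 20 retrieved images, 4 relevant ones, 2 of them in the top 10.
ranked = ["img%d" % i for i in range(20)]
relevant = {"img0", "img3", "img15", "img99"}
score = f1_at_k(ranked, relevant, k=10)
```

Here precision@10 is 2/10 and recall@10 is 2/4, so the score is 2/7, roughly 0.29. Under this textbook definition, an average of 0.03 would mean that relevant images rarely reach the top ten.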
Neural Models for Information Retrieval without Labeled Data
Recent developments of machine learning models, and in particular deep neural networks, have yielded significant improvements on several computer vision, natural language processing, and speech recognition tasks. Progress with information retrieval (IR) tasks has been slower, however, due to the lack of large-scale training data as well as neural network models specifically designed for effective information retrieval. In this dissertation, we address these two issues by introducing task-specific neural network architectures for a set of IR tasks and proposing novel unsupervised or weakly supervised solutions for training the models. The proposed learning solutions do not require labeled training data. Instead, in our weak supervision approach, neural models are trained on a large set of noisy and biased training data obtained from external resources, existing models, or heuristics.
We first introduce relevance-based embedding models that learn distributed representations for words and queries. We show that the learned representations can be effectively employed for a set of IR tasks, including query expansion, pseudo-relevance feedback, and query classification.
We further propose a standalone learning to rank model based on deep neural networks. Our model learns a sparse representation for queries and documents. This enables us to perform efficient retrieval by constructing an inverted index in the learned semantic space. Our model outperforms state-of-the-art retrieval models, while performing as efficiently as term matching retrieval models.
We additionally propose a neural network framework for predicting the performance of a retrieval model for a given query. Inspired by existing query performance prediction models, our framework integrates several information sources, such as retrieval score distribution and term distribution in the top retrieved documents. This leads to state-of-the-art results for the performance prediction task on various standard collections.
We finally bridge the gap between retrieval and recommendation models, as the two key components in most information systems. Search and recommendation often share the same goal: helping people get the information they need at the right time. Therefore, joint modeling and optimization of search engines and recommender systems could potentially benefit both systems. In more detail, we introduce a retrieval model that is trained using user-item interaction (e.g., recommendation data), with no need for query-document relevance information for training.
Our solutions and findings in this dissertation smooth the path towards learning efficient and effective models for various information retrieval and related tasks, especially when large-scale training data is not available
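The claim that sparse representations allow retrieval through an inverted index can be illustrated with a small sketch. It assumes the sparse vectors are already given as `{dimension: weight}` dictionaries; the learned model that would produce them is not reproduced, and the document ids, dimensions, weights, and function names are hypothetical.

```python
from collections import defaultdict


def build_index(doc_vectors):
    """Map each non-zero dimension to the documents that activate it."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for dim, weight in vec.items():
            if weight != 0.0:
                index[dim].append((doc_id, weight))
    return index


def search(index, query_vec, top_n=10):
    """Score documents by dot product, visiting only the postings lists
    of the dimensions that are non-zero in the query."""
    scores = defaultdict(float)
    for dim, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical sparse representations in a latent space.
DOCS = {
    "d1": {3: 0.5, 7: 1.0},
    "d2": {7: 0.2, 9: 0.9},
    "d3": {1: 0.4},
}

index = build_index(DOCS)
results = search(index, {7: 1.0, 9: 0.5})
```

The sparser the vectors, the shorter the postings lists and the fewer documents are touched per query, which is the same property that makes term-matching retrieval efficient. Document `d3` is never visited in this example because it shares no active dimension with the query.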