17 research outputs found

    Cross-language Information Retrieval

    Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for CLIR and outlines some open research questions. Comment: 49 pages, 0 figures.

    Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only

    We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks. Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach. Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced from word-level and document-level alignments. We then demonstrate that further improvements can be achieved by unsupervised ensemble CLIR models. We believe that the proposed framework is the first step towards the development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.
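The idea of ranking documents in a shared cross-lingual embedding space can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny hand-written vectors stand in for embeddings that the paper induces adversarially from monolingual corpora (that induction step is not shown), and averaging word vectors is just one simple way to represent queries and documents, not necessarily the aggregation the authors use.

```python
import math

# Hypothetical 3-dimensional shared space holding English and Dutch words.
# Real cross-lingual embeddings would be learned, not written by hand.
SHARED_EMBEDDINGS = {
    "dog": [0.90, 0.10, 0.00],
    "hond": [0.88, 0.12, 0.02],   # Dutch for "dog"
    "house": [0.10, 0.90, 0.00],
    "huis": [0.12, 0.88, 0.05],   # Dutch for "house"
}


def embed_text(tokens, embeddings):
    """Average the vectors of all known tokens; return None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query_tokens, documents, embeddings):
    """Score every document against the query and sort by descending similarity."""
    query_vec = embed_text(query_tokens, embeddings)
    scored = []
    for doc_id, tokens in documents.items():
        doc_vec = embed_text(tokens, embeddings)
        if query_vec is None or doc_vec is None:
            score = 0.0  # nothing to compare, treat as unrelated
        else:
            score = cosine(query_vec, doc_vec)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# English query against two Dutch "documents".
ranking = rank_documents(
    ["dog"],
    {"nl_doc_1": ["huis"], "nl_doc_2": ["hond"]},
    SHARED_EMBEDDINGS,
)
```

In this toy setup the Dutch document containing "hond" ranks above the one containing "huis" for the English query "dog", because the two translation-equivalent words lie close together in the shared space.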

    End-to-End Multilingual Information Retrieval with Massively Large Synthetic Datasets

    End-to-end neural networks have revolutionized various fields of artificial intelligence. However, advancements in the field of Cross-Lingual Information Retrieval (CLIR) have stalled due to the lack of large-scale labeled data. CLIR is a retrieval task in which search queries and candidate documents are in different languages. CLIR can be very useful in some scenarios: for example, a reporter may want to search foreign-language news to obtain different perspectives for her story; an inventor may explore the patents in another country to understand prior art. This dissertation addresses the bottleneck in end-to-end neural CLIR research by synthesizing large-scale CLIR training data and examining techniques that can exploit this in various CLIR tasks. We publicly release the Large-Scale CLIR dataset and CLIRMatrix, two synthetic CLIR datasets covering a large variety of language directions. We explore and evaluate several neural architectures for end-to-end CLIR modeling. Results show that multilingual information retrieval systems trained on these synthetic CLIR datasets are helpful for many language pairs, especially those in low-resource settings. We further show how these systems can be adapted to real-world scenarios.

    Cross-Market Product Recommendation

    A Grand Challenges-Based Research Agenda for Scholarly Communication and Information Science [MIT Grand Challenge PubPub Participation Platform]

    Identifying Grand Challenges: A global and multidisciplinary community of stakeholders came together in March 2018 to identify, scope, and prioritize a common vision for specific grand research challenges related to the fields of information science and scholarly communications. The participants included domain researchers in academia, practitioners, and those who are aiming to democratize scholarship. An explicit goal of the summit was to identify research needs related to barriers in the development of scalable, interoperable, socially beneficial, and equitable systems for scholarly information; and to explore the development of non-market approaches to governing the scholarly knowledge ecosystem. To spur discussion and exploration, grand challenge provocations were suggested by participants and framed into one of three sections: scholarly discovery, digital curation and preservation, and open scholarship. A few people participated in three segments, but most only attended discussions around a single topic. To create the guest list of desired participants within our three workshop target areas, we invited a distribution of expertise providing diversity across several facets. In addition to having expertise in the specific focus area, we aimed for the participants in each track to be diverse across sectors, disciplines, and regions of the world. Each track had approximately 20-25 people from different parts of the world, including the United States, the European Union, South Africa, and India. Domain researchers brought perspectives from a range of scientific disciplines, while practitioners brought perspectives from different roles (drawn from commercial, non-profit, and governmental sectors). Nevertheless, we were constrained by our social networks and by the location of the workshop in Cambridge, Massachusetts, and most of the participants were affiliated with US and European institutions.
During our discussions, it quickly became clear that the grand challenges themselves cannot be neatly categorized into discovery, curation and preservation, and open scholarship, or even, for that matter, limited to library science and information sciences. Several cross-cutting themes emerged, such as a strong need to include underrepresented voices and communities outside of mainstream publishing and academic institutions, a need to identify incentives that will motivate people to make changes in their own approaches and processes toward a more open and trusted framework, and a need to identify collaborators and partners from multiple disciplines in order to build strong programs. The discussions were full of energy, insights, and enthusiasm for inclusive participation, and concluded with a desire for a global call to action to spark changes that will enable more equitable and open scholarship. Some important and productive tensions surfaced in our discussions, particularly around the best paths forward on the challenges we identified. On many core topics, however, there was widespread agreement among participants, especially on the urgent need to address the exclusion of so many people around the globe from knowledge production and access, and the troubling overrepresentation in the scholarly record of white, male, English-language voices. Ultimately, all agreed that we have an obligation to better enrich and greatly expand this space so that our communities can be catalysts for change. Towards a more inclusive, open, equitable, and sustainable scholarly knowledge ecosystem: Vision; Broadest impacts; Recommendations for broad impact.
Research landscape: Challenges, threats, and barriers; Challenges to participation in the research community; Restrictions on forms of knowledge; Threats to integrity and trust; Threats to the durability of knowledge; Threats to individual agency; Incentives to sustain a scholarly knowledge ecosystem that is inclusive, equitable, trustworthy, and sustainable; Grand Challenges research areas; Recommendations for research areas and programs. Targeted research questions, research challenges: Legal, economic, policy, and organizational design for enduring, equitable, open scholarship; Measuring, predicting, and adapting to use and utility across scholarly communities; Designing and governing algorithms in the scholarly knowledge ecosystem to support accountability, credibility, and agency; Integrating oral and tacit knowledge into the scholarly knowledge ecosystem. Integrating research, practice, and policy: The need for leadership to coordinate research, policy, and practice initiatives; Role of libraries and archives as advocates and collaborators; Incorporating values of openness, sustainability, and equity into scholarly infrastructure and practice; Funders, catalysts, and coordinators; Recommendations for integrating research, practice, and policy.

    Recuperação e identificação de momentos em imagens [Retrieval and identification of moments in images]

    In modern society, almost anyone can capture moments and record events thanks to easy access to smartphones. This raises a question: if we record so much of our lives, how can we easily retrieve specific moments? An answer to this question would open the door to a big leap in quality of life. The possibilities are endless, from trivial problems like finding a photo of a birthday cake to analyzing the progress of mental illnesses in patients or even tracking people with infectious diseases. With so much data being created every day, the answer to this question becomes more complex. There is no streamlined approach to the problem of moment localization in a large dataset of images, and investigation of this problem started only a few years ago. ImageCLEF is one competition where researchers participate and try to achieve new and better results in the task of moment retrieval. This complex problem, along with the interest in participating in the ImageCLEF Lifelog Moment Retrieval Task, posed a good challenge for the development of this dissertation. The proposed solution consists of a system capable of retrieving images automatically according to specified moments described in a corpus of text, without any user interaction and using only state-of-the-art image and text processing methods. The developed retrieval system achieves this objective by extracting and categorizing relevant information from the text and computing a similarity score against the labels extracted in the image processing stage. In this way, the system can tell whether images are related to the moment specified in the text and retrieve the pictures accordingly. In the ImageCLEF Life Moment Retrieval 2020 subtask, the proposed automatic retrieval system achieved a score of 0.03 on the F1-measure@10 metric.
Although this score is not competitive with the scores of other teams' systems, the system provides a good baseline for future work. Master's in Electronics and Telecommunications Engineering.
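The matching step described in the abstract above, scoring each image by comparing labels extracted from it with terms extracted from the text, might look roughly like the sketch below. The abstract does not name the similarity measure, so Jaccard overlap is used here purely as a stand-in, and the function names, the threshold, and the sample labels are all hypothetical.

```python
def label_similarity(query_terms, image_labels):
    """Jaccard overlap between query terms and image labels (an assumed measure)."""
    query_set = {term.lower() for term in query_terms}
    label_set = {label.lower() for label in image_labels}
    if not query_set or not label_set:
        return 0.0
    return len(query_set & label_set) / len(query_set | label_set)


def retrieve(query_terms, images, threshold=0.2, top_k=10):
    """Return up to top_k (image_id, score) pairs scoring at or above threshold."""
    scored = [
        (image_id, label_similarity(query_terms, labels))
        for image_id, labels in images.items()
    ]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical labels, as an image-processing stage might produce them.
images = {
    "img_001": ["cake", "candle", "table"],
    "img_002": ["car", "road"],
    "img_003": ["cake", "plate"],
}
results = retrieve(["birthday", "cake"], images)
```

With these sample labels, `img_003` and `img_001` are returned in that order and `img_002` is filtered out. A real system would need a softer notion of similarity than exact label overlap, for example one that treats related words as partial matches.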