55 research outputs found

    Empowering Dual-Encoder with Query Generator for Cross-Lingual Dense Retrieval

    In monolingual dense retrieval, much work focuses on distilling knowledge from a cross-encoder re-ranker into a dual-encoder retriever, and these methods achieve better performance owing to the effectiveness of the cross-encoder re-ranker. However, we find that the performance of the cross-encoder re-ranker is heavily influenced by the number of training samples and the quality of negative samples, which are hard to obtain in the cross-lingual setting. In this paper, we propose to use a query generator as the teacher in the cross-lingual setting, which is less dependent on abundant training samples and high-quality negative samples. In addition to traditional knowledge distillation, we further propose a novel enhancement method that uses the query generator to help the dual-encoder align queries from different languages without needing any additional parallel sentences. The experimental results show that our method outperforms the state-of-the-art methods on two benchmark datasets. Comment: EMNLP 2022 main conference
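
    The "traditional knowledge distillation" step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the teacher score for each candidate passage is a log-likelihood of the query given that passage and that the student score is a dot product of dual-encoder embeddings; the abstract specifies neither, and all names and numbers below are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the candidate passages of one query."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy data (hypothetical): one query and three candidate passages.
query_vec = [0.2, 0.9]
passage_vecs = [[0.1, 1.0], [0.8, 0.1], [0.5, 0.5]]

# Student: dual-encoder similarity. Teacher: assumed query-generator
# log-likelihoods log P(query | passage); these values are made up.
student_scores = [dot(query_vec, p) for p in passage_vecs]
teacher_scores = [-1.2, -6.5, -3.0]

loss = distillation_loss(teacher_scores, student_scores)
print(f"distillation loss: {loss:.4f}")
```

    Minimizing this loss pushes the student's ranking distribution toward the teacher's. The loss is zero when the two distributions coincide.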

    Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval

    Recent multilingual pre-trained models have shown better performance in various multilingual tasks. However, these models perform poorly on multilingual retrieval tasks owing to a lack of multilingual training data. In this paper, we propose to mine and generate self-supervised training data from a large-scale unlabeled corpus. We carefully design a mining method that combines sparse and dense models to mine the relevance of unlabeled queries and passages. We also introduce a query generator to generate more queries in target languages for unlabeled passages. Through extensive experiments on the Mr. TyDi dataset and an industrial dataset from a commercial search engine, we demonstrate that our method performs better than baselines based on various pre-trained multilingual models. Our method even achieves on-par performance with the supervised method on the latter dataset. Comment: EMNLP 2022 Findings

    Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation

    The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures, where indexing and retrieval are two separate components, DSI uses a single transformer model to perform both. In this paper, we identify and tackle an important issue of current DSI models: the data distribution mismatch between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing, current DSI methods learn to connect the text of long documents to the document identifiers, whereas at retrieval the document identifiers are retrieved using queries that are commonly much shorter than the indexed documents. This problem is further exacerbated when using DSI for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem, we propose a simple yet effective indexing framework for DSI, called DSI-QG. When indexing, DSI-QG represents each document with a number of potentially relevant queries that are generated by a query generation model and then re-ranked and filtered by a cross-encoder ranker. The presence of these queries at indexing allows the DSI model to connect a document identifier to a set of queries, thereby mitigating the data distribution mismatch between the indexing and retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model. Comment: 11 pages
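
    The indexing step described in this abstract (generate queries per document, re-rank them, keep the best, and pair each with the document identifier) can be sketched as follows. This is only an illustration under stated assumptions: `toy_generate_queries` and `toy_score` are hypothetical stand-ins for the query generation model and the cross-encoder ranker, and the `num_generated` and `keep_top` parameters are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

def build_indexing_pairs(
    documents: Dict[str, str],
    generate_queries: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_generated: int = 4,
    keep_top: int = 2,
) -> List[Tuple[str, str]]:
    """Represent each document by its top-scoring generated queries.

    Returns (query, doc_id) pairs that an indexing model could be trained on
    in place of (document text, doc_id) pairs.
    """
    pairs: List[Tuple[str, str]] = []
    for doc_id, text in documents.items():
        candidates = generate_queries(text, num_generated)
        # Re-rank candidates by relevance to the document, best first.
        ranked = sorted(candidates, key=lambda q: score(q, text), reverse=True)
        # Filter: keep only the highest-scoring queries.
        pairs.extend((query, doc_id) for query in ranked[:keep_top])
    return pairs

def toy_generate_queries(text: str, n: int) -> List[str]:
    """Hypothetical generator: the first n word bigrams of the document."""
    words = text.lower().split()
    return [" ".join(words[i:i + 2]) for i in range(min(n, max(len(words) - 1, 0)))]

def toy_score(query: str, text: str) -> float:
    """Hypothetical ranker: count longer query words that occur in the document."""
    doc_words = set(text.lower().split())
    return float(sum(1 for w in query.split() if len(w) > 3 and w in doc_words))

docs = {
    "d1": "the yellow fever outbreak in angola",
    "d2": "roll cladding of magnesium and steel sheets",
}
pairs = build_indexing_pairs(docs, toy_generate_queries, toy_score)
for query, doc_id in pairs:
    print(f"{query!r} -> {doc_id}")
```

    Because the indexing inputs are now short queries rather than long documents, they resemble what the model sees at retrieval time, which is the mismatch the abstract sets out to reduce.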

    Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

    Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned on clean, well-curated text data. Misspelled queries are rarely found in this data, so misspelled queries observed at inference time are out-of-distribution relative to the data used for training and fine-tuning. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness in downstream retrieval tasks. ToRoDer uses an encoder-decoder architecture in which the encoder takes misspelled text with masked tokens as input and outputs bottlenecked information to the decoder. The decoder then takes as input the bottlenecked embeddings, along with token embeddings of the original text with the misspelled tokens masked out. The pre-training task is to recover the masked tokens for both the encoder and the decoder. Our extensive experimental results and detailed ablation studies show that DRs pre-trained with ToRoDer exhibit significantly higher effectiveness on misspelled queries, appreciably closing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries. Comment: 10 pages, accepted at SIGIR-A

    Evaluating the importation of yellow fever cases into China in 2016 and strategies used to prevent and control the spread of the disease

    During the yellow fever epidemic in Angola in 2016, cases of yellow fever were reported in China for the first time. The 11 cases, all Chinese nationals returning from Angola, were identified in March and April 2016, one to two weeks after the peak of the Angolan epidemic. One patient died; the other 10 recovered after treatment. This paper reviews the epidemiological characteristics of the 11 yellow fever cases imported into China. It examines case detection, disease control, and surveillance, and presents recommendations for further action to prevent additional importation of yellow fever into China.

    Fe/Mg/Fe Multilayer Composite Sheet Fabricated by Roll Cladding

    A new multilayer composite sheet consisting of Fe/Mg/Fe was fabricated from galvanized steel and Mg alloy sheets via roll cladding. The clad steel improved the Mg surface hardness from HV 65 to HV 132. Bonding occurred once the reduction ratio exceeded 10%. Investigation of the microstructure of the Mg/steel interface revealed a 5 μm to 10 μm thick transition layer between the Mg and each steel sheet, consisting of Zn and an intermetallic compound (0.97Mg–0.03Zn). The zinc coating on the galvanized steel sheet improved the metallurgical bonding between Mg and Fe by forming new intermetallic phases.