Empowering Dual-Encoder with Query Generator for Cross-Lingual Dense Retrieval
In monolingual dense retrieval, many works focus on how to distill
knowledge from cross-encoder re-ranker to dual-encoder retriever and these
methods achieve better performance due to the effectiveness of cross-encoder
re-ranker. However, we find that the performance of the cross-encoder re-ranker
is heavily influenced by the number of training samples and the quality of
negative samples, both of which are hard to obtain in the cross-lingual setting. In this
paper, we propose to use a query generator as the teacher in the cross-lingual
setting, which depends less on abundant training samples and high-quality
negative samples. In addition to traditional knowledge distillation, we further
propose a novel enhancement method, which uses the query generator to help the
dual-encoder align queries from different languages, but does not need any
additional parallel sentences. The experimental results show that our method
outperforms the state-of-the-art methods on two benchmark datasets.
Comment: EMNLP 2022 main conference
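The "traditional knowledge distillation" step mentioned in this abstract can be illustrated with a generic score-distillation loss. This is a minimal sketch, not the authors' implementation: it assumes the teacher assigns one relevance score per candidate passage (for a query generator, one plausible choice is the log-likelihood of the query given the passage, which is an assumption here) and that the dual-encoder student produces one similarity score per passage. The function names and the temperature parameter are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into a probability distribution over candidates."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(
    teacher_scores: List[float],
    student_scores: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) over the candidate passages of one query.

    teacher_scores: hypothetical teacher relevance scores, one per passage.
    student_scores: hypothetical dual-encoder similarity scores, one per passage.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student ranks and weights the candidates exactly as the teacher does, and grows as the two distributions diverge.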
Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval
Recent multilingual pre-trained models have shown better performance in
various multilingual tasks. However, these models perform poorly on
multilingual retrieval tasks due to a lack of multilingual training data. In this
paper, we propose to mine and generate self-supervised training data based on a
large-scale unlabeled corpus. We carefully design a mining method which
combines the sparse and dense models to mine the relevance of unlabeled queries
and passages. We also introduce a query generator to generate more queries in
target languages for unlabeled passages. Through extensive experiments on the Mr.
TYDI dataset and an industrial dataset from a commercial search engine, we
demonstrate that our method performs better than baselines based on various
pre-trained multilingual models. Our method even achieves on-par performance
with the supervised method on the latter dataset.
Comment: EMNLP 2022 Findings
Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation
The Differentiable Search Index (DSI) is an emerging paradigm for information
retrieval. Unlike traditional retrieval architectures where index and retrieval
are two different and separate components, DSI uses a single transformer model
to perform both indexing and retrieval.
In this paper, we identify and tackle an important issue of current DSI
models: the data distribution mismatch that occurs between the DSI indexing and
retrieval processes. Specifically, we argue that, at indexing, current DSI
methods learn to build connections between the text of long documents and the
identifier of the documents, but then retrieval of document identifiers is
based on queries that are commonly much shorter than the indexed documents.
This problem is further exacerbated when using DSI for cross-lingual retrieval,
where document text and query text are in different languages.
To address this fundamental problem of current DSI models, we propose a
simple yet effective indexing framework for DSI, called DSI-QG. When indexing,
DSI-QG represents documents with a number of potentially relevant queries
generated by a query generation model and re-ranked and filtered by a
cross-encoder ranker. The presence of these queries at indexing allows the DSI
models to connect a document identifier to a set of queries, hence mitigating
data distribution mismatches present between the indexing and the retrieval
phases. Empirical results on popular mono-lingual and cross-lingual passage
retrieval datasets show that DSI-QG significantly outperforms the original DSI
model.
Comment: 11 pages
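The indexing idea described in this abstract (generate candidate queries per document, rank and filter them, then map the surviving queries to the document identifier) can be sketched as below. This is a minimal illustration under stated assumptions: `generate_queries` and `score` stand in for the query generation model and the cross-encoder ranker, and their signatures, along with `num_generated` and `keep_top_k`, are hypothetical rather than taken from the paper. The toy generator and scorer exist only so the sketch runs without any model.

```python
from typing import Callable, Dict, List, Tuple


def build_dsi_qg_pairs(
    docs: Dict[str, str],
    generate_queries: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_generated: int = 10,
    keep_top_k: int = 3,
) -> List[Tuple[str, str]]:
    """Build (query, doc_id) indexing pairs from generated queries.

    generate_queries(text, n): hypothetical stand-in for a query generator.
    score(query, text): hypothetical stand-in for a cross-encoder ranker.
    """
    pairs: List[Tuple[str, str]] = []
    for doc_id, text in docs.items():
        candidates = generate_queries(text, num_generated)
        # Rank candidates by relevance to the document and keep the best ones.
        ranked = sorted(candidates, key=lambda q: score(q, text), reverse=True)
        pairs.extend((query, doc_id) for query in ranked[:keep_top_k])
    return pairs


def toy_generate(text: str, n: int) -> List[str]:
    """Toy generator: adjacent word pairs from the document."""
    words = text.split()
    count = min(n, max(len(words) - 1, 0))
    return [" ".join(words[i:i + 2]) for i in range(count)]


def toy_score(query: str, text: str) -> float:
    """Toy scorer: longer queries score higher."""
    return float(len(query))
```

The resulting pairs would then replace (or complement) the raw document text as the indexing input, so that the model sees short, query-like text at both indexing and retrieval time.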
Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval
Current dense retrievers (DRs) are limited in their ability to effectively
process misspelled queries, which constitute a significant portion of query
traffic in commercial search engines. The main issue is that the pre-trained
language model-based encoders used by DRs are typically trained and fine-tuned
using clean, well-curated text data. Misspelled queries are typically not found
in the data used for training these models, and thus misspelled queries
observed at inference time are out-of-distribution compared to the data used
for training and fine-tuning. Previous efforts to address this issue have
focused on fine-tuning strategies, but their effectiveness on
misspelled queries remains lower than that of pipelines that employ separate
state-of-the-art spell-checking components. To address this challenge, we
propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse
Retrieval), a novel pre-training strategy for DRs that increases their
robustness to misspelled queries while preserving their effectiveness in
downstream retrieval tasks. ToRoDer utilizes an encoder-decoder architecture
where the encoder takes misspelled text with masked tokens as input and outputs
bottlenecked information to the decoder. The decoder then takes as input the
bottlenecked embeddings, along with token embeddings of the original text with
the misspelled tokens masked out. The pre-training task is to recover the
masked tokens for both the encoder and decoder. Our extensive experimental
results and detailed ablation studies show that DRs pre-trained with ToRoDer
exhibit significantly higher effectiveness on misspelled queries, considerably
closing the gap with pipelines that use a separate, complex spell-checker
component, while retaining their effectiveness on correctly spelled queries.
Comment: 10 pages, accepted at SIGIR-A
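The input construction described in this abstract can be sketched as a small data-preparation routine: the encoder receives misspelled text with some tokens masked, and the decoder receives the original text with the misspelled tokens masked out. This is only an illustration under assumptions. The typo type (swapping two adjacent characters), the probabilities `typo_prob` and `mask_prob`, and all function names are hypothetical and are not taken from the paper; the bottlenecked embedding passed from encoder to decoder is not modelled here.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"


def swap_typo(word: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two adjacent characters (assumed typo type)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def make_pretraining_example(
    tokens: List[str],
    rng: random.Random,
    typo_prob: float = 0.3,
    mask_prob: float = 0.3,
) -> Tuple[List[str], List[str], List[int]]:
    """Return (encoder_input, decoder_input, typo_positions) for one text."""
    encoder_input: List[str] = []
    decoder_input: List[str] = []
    typo_positions: List[int] = []
    for i, tok in enumerate(tokens):
        is_typo = rng.random() < typo_prob
        if is_typo:
            typo_positions.append(i)
        noisy = swap_typo(tok, rng) if is_typo else tok
        # Encoder side: misspelled text, with some tokens additionally masked.
        encoder_input.append(MASK if rng.random() < mask_prob else noisy)
        # Decoder side: original text, with the misspelled tokens masked out.
        decoder_input.append(MASK if is_typo else tok)
    return encoder_input, decoder_input, typo_positions
```

In both cases the recovery targets are the original `tokens` at the masked positions; passing a seeded `random.Random` makes the corruption reproducible.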
Evaluating the importation of yellow fever cases into China in 2016 and strategies used to prevent and control the spread of the disease
During the yellow fever epidemic in Angola in 2016, cases of yellow fever were reported in China for the first time. The 11 cases, all Chinese nationals returning from Angola, were identified in March and April 2016, one to two weeks after the peak of the Angolan epidemic. One patient died; the other 10 patients recovered after treatment. This paper reviews the epidemiological characteristics of the 11 yellow fever cases imported into China. It examines case detection, disease control, and surveillance, and presents recommendations for further action to prevent additional importation of yellow fever into China.
Fe/Mg/Fe Multilayer Composite Sheet Fabricated by Roll Cladding
A new multilayer composite sheet consisting of Fe/Mg/Fe was fabricated from galvanized steels and Mg alloy sheets via roll cladding. The clad steel improved the Mg surface hardness from HV 65 to HV 132. Bonding occurred once the reduction ratio exceeded 10%. Investigation of the microstructure of the Mg/steel interface revealed a 5 μm- to 10 μm-thick transition layer between Mg and each steel sheet, consisting of Zn and an intermetallic compound (0.97Mg–0.03Zn). The zinc coating on the galvanized steel sheet improved the metallurgical bonding between Mg and Fe by forming new intermetallic phases.