MILL: Mutual Verification with Large Language Models for Zero-Shot Query Expansion
Query expansion is a commonly-used technique in many search systems to better
represent users' information needs with additional query terms. Existing
studies for this task usually propose to expand a query with retrieved or
generated contextual documents. However, both types of methods have clear
limitations. For retrieval-based methods, the documents retrieved with the
original query might not be accurate enough to reveal the search intent,
especially when the query is brief or ambiguous. For generation-based methods,
existing models can hardly be trained or aligned on a particular corpus, due to
the lack of corpus-specific labeled data. In this paper, we propose a novel
Large Language Model (LLM) based mutual verification framework for query
expansion, which alleviates the aforementioned limitations. Specifically, we
first design a query-query-document generation pipeline, which can effectively
leverage the contextual knowledge encoded in LLMs to generate sub-queries and
corresponding documents from multiple perspectives. Next, we employ a mutual
verification method for both generated and retrieved contextual documents,
where 1) retrieved documents are filtered with the external contextual
knowledge in generated documents, and 2) generated documents are filtered with
the corpus-specific knowledge in retrieved documents. Overall, the proposed
method allows retrieved and generated documents to complement each other to
finalize a better query expansion. We conduct extensive experiments on three
information retrieval datasets, i.e., TREC-DL-2020, TREC-COVID, and MSMARCO.
The results demonstrate that our method significantly outperforms other baselines.
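A minimal sketch of the mutual-verification step, assuming the generated and retrieved documents are already available and using a generic sentence-embedding model; the model name and the top-k cosine-similarity filter are illustrative assumptions, not the paper's exact scoring pipeline:

```python
# Minimal sketch of mutual verification for query expansion (illustrative only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def mutual_verify(generated_docs, retrieved_docs, top_k=3):
    """Keep retrieved docs supported by generated docs, and vice versa."""
    gen_emb = model.encode(generated_docs, convert_to_tensor=True)
    ret_emb = model.encode(retrieved_docs, convert_to_tensor=True)
    sim = util.cos_sim(ret_emb, gen_emb)          # [n_retrieved, n_generated]

    # 1) filter retrieved docs by their best agreement with generated knowledge
    ret_keep = sim.max(dim=1).values.topk(min(top_k, len(retrieved_docs))).indices
    # 2) filter generated docs by their best agreement with corpus knowledge
    gen_keep = sim.max(dim=0).values.topk(min(top_k, len(generated_docs))).indices

    return ([generated_docs[i] for i in gen_keep.tolist()],
            [retrieved_docs[i] for i in ret_keep.tolist()])

def expand_query(query, kept_generated, kept_retrieved):
    # Append the surviving contexts to the original query terms.
    return " ".join([query] + kept_generated + kept_retrieved)
```

In this reading, each side keeps only the documents best supported by the other side, and the survivors are appended to the original query.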
A Robust Semantics-based Watermark for Large Language Model against Paraphrasing
Large language models (LLMs) have shown great ability in various natural
language tasks. However, there are concerns that LLMs may be used
improperly or even illegally. To prevent the malicious usage of LLMs, detecting
LLM-generated text becomes crucial in the deployment of LLM applications.
Watermarking is an effective strategy to detect the LLM-generated content by
encoding a pre-defined secret watermark to facilitate the detection process.
However, most existing watermarking methods leverage simple hashes of
preceding tokens to partition the vocabulary. Such watermarks can be easily
eliminated by paraphrasing, which greatly compromises detection effectiveness.
Thus, to enhance robustness against paraphrasing, we propose a semantics-based
watermark framework, SemaMark. It leverages semantics as an alternative to
simple token hashes, since paraphrasing will likely preserve the semantic
meaning of the sentences. Comprehensive experiments demonstrate the
effectiveness and robustness of SemaMark under different paraphrases.
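An illustrative sketch of the semantics-keyed idea, not the exact SemaMark algorithm: the green-list seed is derived from a quantized sentence embedding of the preceding context instead of a hash of the previous token, so a paraphrase that preserves meaning tends to reproduce the same vocabulary partition. The embedding model, quantization scheme, and bias strength are all assumptions:

```python
# Illustrative semantics-keyed watermark sketch (not the exact SemaMark method).
import hashlib
import torch
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_seed(context: str, n_bins: int = 8) -> int:
    """Quantize the context embedding so paraphrases tend to share a seed."""
    emb = embedder.encode(context)
    buckets = (emb * n_bins).round().astype(int)      # coarse discretization
    return int(hashlib.sha256(buckets.tobytes()).hexdigest(), 16) % (2**31)

def green_list(vocab_size: int, context: str, gamma: float = 0.5) -> torch.Tensor:
    """Pseudo-randomly mark a gamma fraction of the vocabulary as 'green'."""
    g = torch.Generator().manual_seed(semantic_seed(context))
    perm = torch.randperm(vocab_size, generator=g)
    return perm[: int(gamma * vocab_size)]

def watermark_logits(logits: torch.Tensor, context: str, delta: float = 2.0) -> torch.Tensor:
    """Bias next-token logits toward the green list before sampling."""
    biased = logits.clone()
    biased[green_list(logits.shape[-1], context)] += delta
    return biased
```

Detection would then count how often sampled tokens fall in the green list recomputed from the same semantic seeds.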
Enhancing Graph Neural Networks with Structure-Based Prompt
Graph Neural Networks (GNNs) are powerful in learning semantics of graph
data. Recently, a new "pre-train, prompt" paradigm has shown promising results
in adapting GNNs to various tasks with less supervised data. The success of
such a paradigm can be attributed to the more consistent objectives of
pre-training and task-oriented prompt tuning, where the pre-trained knowledge
can be effectively transferred to downstream tasks. However, an overlooked
issue in existing studies is that the structure information of graphs is usually
exploited during pre-training for learning node representations, yet
neglected in the prompt-tuning stage for learning task-specific parameters. To
bridge this gap, we propose a novel structure-based prompting method for GNNs,
namely SAP, which consistently exploits structure information in both
pre-training and prompt tuning stages. In particular, SAP 1) employs a
dual-view contrastive learning to align the latent semantic spaces of node
attributes and graph structure, and 2) incorporates structure information into
the prompted graph to elicit more pre-trained knowledge in prompt tuning. We
conduct extensive experiments on node classification and graph classification
tasks to show the effectiveness of SAP. Moreover, we show that SAP can lead to
better performance in more challenging few-shot scenarios on both homophilous
and heterophilous graphs.
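A compact sketch of the dual-view contrastive objective described in point 1, treating the attribute-view and structure-view embeddings of the same node as a positive pair; the encoder choices and temperature are assumptions rather than SAP's exact design:

```python
# Dual-view contrastive loss sketch: attribute view vs. structure view.
import torch
import torch.nn.functional as F

def dual_view_contrastive_loss(z_attr: torch.Tensor,
                               z_struct: torch.Tensor,
                               temperature: float = 0.5) -> torch.Tensor:
    """InfoNCE loss treating (attribute, structure) embeddings of the same
    node as a positive pair and other nodes in the batch as negatives."""
    z_attr = F.normalize(z_attr, dim=-1)
    z_struct = F.normalize(z_struct, dim=-1)
    logits = z_attr @ z_struct.t() / temperature      # [N, N] similarity matrix
    targets = torch.arange(z_attr.size(0), device=z_attr.device)
    # symmetric loss: attribute-to-structure and structure-to-attribute
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage sketch: z_attr could come from an MLP over node features and z_struct
# from a GNN over the graph structure, e.g. a GCN with identity features.
```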
Towards Verifiable Text Generation with Evolving Memory and Self-Reflection
Despite the remarkable ability of large language models (LLMs) in language
comprehension and generation, they often suffer from producing factually
incorrect information, also known as hallucination. A promising solution to
this issue is verifiable text generation, which prompts LLMs to generate
content with citations for accuracy verification. However, verifiable text
generation is non-trivial due to the focus-shifting phenomenon, the intricate
reasoning needed to align the claim with correct citations, and the dilemma
between the precision and breadth of retrieved documents. In this paper, we
present VTG, an innovative framework for Verifiable Text Generation with
evolving memory and self-reflection. VTG introduces evolving long short-term
memory to retain both valuable documents and recent documents. A two-tier
verifier equipped with an evidence finder is proposed to rethink and reflect on
the relationship between the claim and citations. Furthermore, active retrieval
and diverse query generation are utilized to enhance both the precision and
breadth of the retrieved documents. We conduct extensive experiments on five
datasets across three knowledge-intensive tasks and the results reveal that VTG
significantly outperforms baselines.
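A high-level control-loop sketch of the generate-verify-retrieve cycle with evolving long- and short-term memory; the callables passed in (generate_claim, verify, diversify, retrieve) are hypothetical stand-ins for the LLM- and retriever-backed components, and the loop structure is one plausible reading rather than VTG's exact algorithm:

```python
# Sketch of verifiable generation with evolving memory (placeholders assumed).
from typing import Callable, List, Optional, Tuple

def verifiable_generate(question: str,
                        generate_claim: Callable[[str, List[str]], Optional[str]],
                        verify: Callable[[str, List[str]], Optional[str]],
                        diversify: Callable[[str], List[str]],
                        retrieve: Callable[[str], List[str]],
                        max_claims: int = 5) -> List[Tuple[str, str]]:
    long_term: List[str] = []     # valuable documents retained across steps
    short_term: List[str] = []    # most recently retrieved documents
    output: List[Tuple[str, str]] = []
    for _ in range(max_claims):
        claim = generate_claim(question, long_term + short_term)
        if claim is None:                       # generation finished
            break
        evidence = verify(claim, long_term + short_term)
        if evidence is None:                    # active retrieval path
            docs = [d for q in diversify(claim) for d in retrieve(q)]
            short_term = docs[-10:]             # keep only recent documents
            evidence = verify(claim, short_term)
            if evidence is not None:
                long_term.append(evidence)      # promote useful evidence
        if evidence is not None:
            output.append((claim, evidence))    # claim paired with its citation
    return output
```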
Self-supervised Heterogeneous Graph Variational Autoencoders
Heterogeneous Information Networks (HINs), which consist of various types of
nodes and edges, have recently demonstrated excellent performance in graph
mining. However, most existing heterogeneous graph neural networks (HGNNs)
ignore the problems of missing attributes, inaccurate attributes and scarce
labels for nodes, which limits their expressiveness. In this paper, we propose
a generative self-supervised model SHAVA to address these issues
simultaneously. Specifically, SHAVA first initializes all the nodes in the
graph with a low-dimensional representation matrix. After that, based on the
variational graph autoencoder framework, SHAVA learns both node-level and
attribute-level embeddings in the encoder, which can provide fine-grained
semantic information to construct node attributes. In the decoder, SHAVA
reconstructs both links and attributes. Instead of directly reconstructing raw
features for attributed nodes, SHAVA generates the initial low-dimensional
representation matrix for all the nodes, based on which raw features of
attributed nodes are further reconstructed to leverage accurate attributes. In
this way, SHAVA can not only complete informative features for non-attributed
nodes, but also rectify inaccurate ones for attributed nodes. Finally, we conduct
extensive experiments to show the superiority of SHAVA in tackling HINs with
missing and inaccurate attributes.
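A compact sketch in the spirit of the description above, following the generic variational graph autoencoder recipe (graph-convolution-style encoder, inner-product link decoder, linear attribute decoder); the layer choices and dense-adjacency simplification are assumptions, not SHAVA's exact architecture:

```python
# Generic variational graph autoencoder sketch reconstructing links and attributes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphVAE(nn.Module):
    def __init__(self, in_dim, hid_dim, lat_dim):
        super().__init__()
        self.enc = nn.Linear(in_dim, hid_dim)
        self.mu = nn.Linear(hid_dim, lat_dim)
        self.logvar = nn.Linear(hid_dim, lat_dim)
        self.attr_dec = nn.Linear(lat_dim, in_dim)    # rebuild node representations

    def gcn(self, x, adj):
        # one-layer graph convolution with a row-normalized dense adjacency
        return F.relu(adj @ self.enc(x))

    def forward(self, x, adj):
        h = self.gcn(x, adj)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        adj_rec = torch.sigmoid(z @ z.t())            # link reconstruction
        x_rec = self.attr_dec(z)                      # attribute reconstruction
        return adj_rec, x_rec, mu, logvar

def vae_loss(adj_rec, x_rec, adj, x, mu, logvar):
    # adj is assumed to be a dense 0/1 float matrix, x the input representations
    rec = F.binary_cross_entropy(adj_rec, adj) + F.mse_loss(x_rec, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```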
Graph Enhanced BERT for Query Understanding
Query understanding plays a key role in exploring users' search intents and
helping users locate their most desired information. However, it is
inherently challenging since it needs to capture semantic information from
short and ambiguous queries and often requires massive task-specific labeled
data. In recent years, pre-trained language models (PLMs) have advanced various
natural language processing tasks because they can extract general semantic
information from large-scale corpora. Therefore, there are unprecedented
opportunities to adopt PLMs for query understanding. However, there is a gap
between the goal of query understanding and existing pre-training strategies --
the goal of query understanding is to boost search performance while existing
strategies rarely consider this goal. Thus, directly applying them to query
understanding is sub-optimal. On the other hand, search logs contain user
clicks between queries and URLs, which provide rich information about users'
search behavior on queries beyond their content. Therefore, in this paper, we aim
to fill this gap by exploring search logs. In particular, to incorporate search
logs into pre-training, we first construct a query graph where nodes are
queries and two queries are connected if they lead to clicks on the same URLs.
Then we propose a novel graph-enhanced pre-training framework, GE-BERT, which
can leverage both query content and the query graph. In other words, GE-BERT
can capture both the semantic information and the users' search behavioral
information of queries. Extensive experiments on various query understanding
tasks have demonstrated the effectiveness of the proposed framework.
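A small sketch of the query-graph construction described above, assuming the click log is available as (query, URL) pairs; the log format and the edge representation are assumptions:

```python
# Build a co-click query graph: two queries share an edge if they led to
# clicks on the same URL.
from collections import defaultdict
from itertools import combinations

def build_query_graph(click_log):
    """click_log: iterable of (query, url) pairs -> set of query-query edges."""
    url_to_queries = defaultdict(set)
    for query, url in click_log:
        url_to_queries[url].add(query)

    edges = set()
    for queries in url_to_queries.values():
        for q1, q2 in combinations(sorted(queries), 2):
            edges.add((q1, q2))                 # co-click edge
    return edges

# Example:
# build_query_graph([("cheap flights", "kayak.com"),
#                    ("flight deals", "kayak.com")])
# -> {("cheap flights", "flight deals")}
```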
A Simple yet Effective Framework for Active Learning to Rank
While China has become the biggest online market in the world with around 1
billion internet users, Baidu runs the world's largest Chinese search engine,
serving hundreds of millions of daily active users and responding to billions
of queries per day. To handle the diverse query requests from users at
web scale, Baidu has made tremendous efforts in understanding users' queries,
retrieving relevant content from a pool of trillions of webpages, and ranking
the most relevant webpages at the top of the results. Among these components used in
Baidu search, learning to rank (LTR) plays a critical role, and we need to
promptly label an extremely large number of queries together with relevant
webpages to train and update the online LTR models. To reduce the cost and
time of query/webpage labeling, in this work we study the problem of Active
Learning to Rank (active LTR), which selects unlabeled queries for annotation
and training. Specifically, we first investigate the criterion
Ranking Entropy (RE), which characterizes the entropy of relevant webpages under a
query produced by a sequence of online LTR models updated by different
checkpoints, using a Query-By-Committee (QBC) method. Then, we explore a new
criterion namely Prediction Variances (PV) that measures the variance of
prediction results for all relevant webpages under a query. Our empirical
studies find that RE may favor low-frequency queries from the pool for labeling
while PV prioritizes high-frequency queries. Finally, we combine these
two complementary criteria as the sample selection strategies for active
learning. Extensive experiments with comparisons to baseline algorithms show
that the proposed approach can train LTR models achieving higher Discounted
Cumulative Gain (i.e., a relative improvement of ΔDCG4 = 1.38%) with the
same budgeted labeling effort.
Comment: This paper is accepted to Machine Intelligence Research, and a short
version is presented in the NeurIPS 2022 Workshop on Human in the Loop Learning.
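A hedged sketch of the two selection criteria under one plausible reading of the abstract: RE as the Query-By-Committee entropy of which webpage the committee members rank first, and PV as the variance of predicted scores over a query's webpages; the exact definitions and the combination rule in the paper may differ:

```python
# Sketch of Ranking Entropy (RE) and Prediction Variance (PV) for active LTR
# query selection; definitions are one plausible reading, not the paper's exact ones.
import numpy as np

def ranking_entropy(committee_scores: np.ndarray) -> float:
    """QBC-style entropy: each committee member (row) votes for the document it
    ranks first; RE is the entropy of that vote distribution.
    committee_scores: [n_models, n_docs] relevance scores for one query."""
    votes = committee_scores.argmax(axis=1)
    counts = np.bincount(votes, minlength=committee_scores.shape[1])
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def prediction_variance(scores: np.ndarray) -> float:
    """PV: variance of a model's predicted scores over one query's documents."""
    return float(np.var(scores))

def select_queries(re_scores, pv_scores, k, alpha=0.5):
    """Combine the two criteria (a simple weighted sum is an assumption)."""
    combined = alpha * np.asarray(re_scores) + (1 - alpha) * np.asarray(pv_scores)
    return np.argsort(-combined)[:k]        # indices of the top-k queries to label
```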