124 research outputs found
HetCAN: A Heterogeneous Graph Cascade Attention Network with Dual-Level Awareness
Heterogeneous graph neural networks (HGNNs) have recently shown impressive
capability in modeling heterogeneous graphs, which are ubiquitous in real-world
applications. Most existing methods for heterogeneous graphs mainly learn node
embeddings by stacking multiple convolutional or attentional layers, which can
be viewed as capturing high-order information from the node-level aspect.
However, since different types of nodes in heterogeneous graphs have diverse
features, it is also necessary to capture interactions among node features,
namely high-order information from the feature-level aspect. In addition, most
methods first align node features by mapping them into the same low-dimensional
space, which may lose some node type information. To
address these problems, in this paper, we propose a novel Heterogeneous graph
Cascade Attention Network (HetCAN) composed of multiple cascade blocks. Each
cascade block includes two components, the type-aware encoder and the
dimension-aware encoder. Specifically, the type-aware encoder compensates for
the loss of node type information and aims to make full use of graph
heterogeneity. The dimension-aware encoder is able to learn the feature-level
high-order information by capturing the interactions among node features. With
the assistance of these components, HetCAN can comprehensively encode
information of node features, graph heterogeneity and graph structure in node
embeddings. Extensive experiments demonstrate the superiority of HetCAN over
advanced competitors and also exhibit its efficiency and robustness.
Comment: Accepted by ECML-PKDD 202
XLBench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies
Large Language Models (LLMs) have demonstrated remarkable performance across
diverse tasks but are constrained by their small context window sizes. Various
efforts have been proposed to expand the context window to accommodate even up
to 200K input tokens. Meanwhile, building high-quality benchmarks with much
longer text lengths and more demanding tasks to provide comprehensive
evaluations is of immense practical interest to facilitate long context
understanding research of LLMs. However, prior benchmarks create datasets that
ostensibly cater to long-text comprehension by expanding the inputs of
traditional tasks, which falls short of exhibiting the unique characteristics of
long-text understanding, including long-dependency tasks and text lengths
compatible with modern LLMs' context window sizes. In this paper, we introduce a
benchmark for extremely long context understanding with long-range
dependencies, XLBench, which includes three scenarios: Fiction Reading,
Paper Reading, and Law Reading, and four tasks of increasing complexity: Memory
Retrieval, Detailed Understanding, Overall Understanding, and Open-ended
Generation, covering 27 subtasks in English and Chinese. It has an average
length of 100K+ words (English) and 200K+ characters (Chinese). Evaluating six
leading LLMs on XLBench, we find that their performance significantly lags
behind human levels. Moreover, the observed decline in performance across both
the original and enhanced datasets underscores the efficacy of our approach to
mitigating data contamination.
Comment: Work in progress
MILL: Mutual Verification with Large Language Models for Zero-Shot Query Expansion
Query expansion is a commonly-used technique in many search systems to better
represent users' information needs with additional query terms. Existing
studies for this task usually propose to expand a query with retrieved or
generated contextual documents. However, both types of methods have clear
limitations. For retrieval-based methods, the documents retrieved with the
original query might not be accurate enough to reveal the search intent,
especially when the query is brief or ambiguous. For generation-based methods,
existing models can hardly be trained or aligned on a particular corpus, due to
the lack of corpus-specific labeled data. In this paper, we propose a novel
Large Language Model (LLM) based mutual verification framework for query
expansion, which alleviates the aforementioned limitations. Specifically, we
first design a query-query-document generation pipeline, which can effectively
leverage the contextual knowledge encoded in LLMs to generate sub-queries and
corresponding documents from multiple perspectives. Next, we employ a mutual
verification method for both generated and retrieved contextual documents,
where 1) retrieved documents are filtered with the external contextual
knowledge in generated documents, and 2) generated documents are filtered with
the corpus-specific knowledge in retrieved documents. Overall, the proposed
method allows retrieved and generated documents to complement each other to
finalize a better query expansion. We conduct extensive experiments on three
information retrieval datasets, i.e., TREC-DL-2020, TREC-COVID, and MSMARCO.
The results demonstrate that our method significantly outperforms other
baselines.
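The mutual-verification step described above can be sketched in a few lines. This is an illustrative toy, not MILL's actual implementation: the bag-of-words cosine stands in for whatever relevance scorer the method uses, and `mutual_verify`, `top_k`, and the example documents are all hypothetical names and parameters.

```python
from collections import Counter
import math

def cosine(a, b):
    """Bag-of-words cosine similarity between two text strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mutual_verify(generated, retrieved, top_k=2):
    """Keep the documents in each set best supported by the *other* set:
    retrieved docs are filtered by generated ones, and vice versa."""
    keep_ret = sorted(retrieved,
                      key=lambda d: max(cosine(d, g) for g in generated),
                      reverse=True)[:top_k]
    keep_gen = sorted(generated,
                      key=lambda d: max(cosine(d, r) for r in retrieved),
                      reverse=True)[:top_k]
    return keep_gen, keep_ret
```

Here a generated document that overlaps with the retrieved set survives the filter, while an off-topic one in either set is dropped, which is the complementary filtering the abstract describes.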
A Robust Semantics-based Watermark for Large Language Model against Paraphrasing
Large language models (LLMs) have shown great ability in various natural
language tasks. However, there are concerns that LLMs may be used
improperly or even illegally. To prevent the malicious usage of LLMs, detecting
LLM-generated text becomes crucial in the deployment of LLM applications.
Watermarking is an effective strategy to detect the LLM-generated content by
encoding a pre-defined secret watermark to facilitate the detection process.
However, the majority of existing watermark methods leverage simple hashes
of preceding tokens to partition the vocabulary. Such watermarks can be easily
eliminated by paraphrasing, and the detection effectiveness will
correspondingly be greatly compromised. Thus, to enhance robustness against
paraphrasing, we propose SemaMark, a semantics-based watermark framework. It
leverages semantics as an alternative to simple hashes of tokens, since
paraphrasing will likely preserve the semantic meaning of the sentences.
Comprehensive experiments demonstrate the effectiveness and robustness of
SemaMark under different paraphrases.
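The core idea, seeding the green-list split from the semantics of the preceding text rather than from exact token hashes, can be sketched as follows. This is a loose illustration: `semantic_signature` is a crude stand-in (a hash of the sorted content words) for the sentence embedding and discretization SemaMark actually uses, and all function names and parameters here are hypothetical.

```python
import hashlib
import random

def semantic_signature(text, n_buckets=16):
    """Coarse stand-in for a semantic embedding: bucket the text by its set
    of content words, so reorderings that keep the same words map to the
    same bucket (a real system would use sentence embeddings)."""
    words = sorted({w.lower().strip(".,") for w in text.split() if len(w) > 3})
    digest = hashlib.sha256(" ".join(words).encode()).digest()
    return digest[0] % n_buckets

def green_list(vocab, context, gamma=0.5):
    """Partition the vocabulary using the semantic signature of the context
    as the seed, instead of a hash of the exact preceding tokens."""
    rng = random.Random(semantic_signature(context))
    shuffled = sorted(vocab)
    rng.shuffle(shuffled)
    return set(shuffled[: int(gamma * len(shuffled))])
```

Because a word-order paraphrase leaves the signature unchanged, the same green list is reproduced at detection time, which is the robustness property the abstract targets; token-hash schemes lose this as soon as the preceding tokens change.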
Enhancing Graph Neural Networks with Structure-Based Prompt
Graph Neural Networks (GNNs) are powerful in learning semantics of graph
data. Recently, a new paradigm "pre-train, prompt" has shown promising results
in adapting GNNs to various tasks with less supervised data. The success of
this paradigm can be attributed to the more consistent objectives of
pre-training and task-oriented prompt tuning, through which the pre-trained
knowledge can be effectively transferred to downstream tasks. However, an
overlooked issue in existing studies is that the structure information of
graphs is usually exploited during pre-training for learning node
representations, but neglected in the prompt tuning stage for learning
task-specific parameters. To
bridge this gap, we propose a novel structure-based prompting method for GNNs,
namely SAP, which consistently exploits structure information in both
pre-training and prompt tuning stages. In particular, SAP 1) employs a
dual-view contrastive learning to align the latent semantic spaces of node
attributes and graph structure, and 2) incorporates structure information in
the prompted graph to elicit more pre-trained knowledge in prompt tuning. We
conduct extensive experiments on node classification and graph classification
tasks to show the effectiveness of SAP. Moreover, we show that SAP can lead to
better performance in more challenging few-shot scenarios on both homophilous
and heterophilous graphs.
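The dual-view alignment in step 1) is a standard contrastive objective: embeddings of the same node under the attribute view and the structure view are pulled together, all other pairs pushed apart. A minimal InfoNCE-style sketch, assuming the two views are already encoded as vectors (the encoders, temperature, and names here are illustrative, not SAP's actual design):

```python
import math

def info_nce(view_a, view_b, tau=0.5):
    """Contrastive loss aligning two views: view_a[i] and view_b[i] are the
    attribute-view and structure-view embeddings of node i. Matching indices
    are positives; every other cross-view pair is a negative."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def sim(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

    n = len(view_a)
    loss = 0.0
    for i in range(n):
        logits = [math.exp(sim(view_a[i], view_b[j]) / tau) for j in range(n)]
        loss += -math.log(logits[i] / sum(logits))
    return loss / n
```

When the two views agree (each node's embeddings match across views), the loss is low; shuffling the correspondence raises it, which is what drives the latent spaces of the two views into alignment.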
Towards Verifiable Text Generation with Evolving Memory and Self-Reflection
Despite the remarkable ability of large language models (LLMs) in language
comprehension and generation, they often suffer from producing factually
incorrect information, also known as hallucination. A promising solution to
this issue is verifiable text generation, which prompts LLMs to generate
content with citations for accuracy verification. However, verifiable text
generation is non-trivial due to the focus-shifting phenomenon, the intricate
reasoning needed to align the claim with correct citations, and the dilemma
between the precision and breadth of retrieved documents. In this paper, we
present VTG, an innovative framework for Verifiable Text Generation with
evolving memory and self-reflection. VTG introduces evolving long short-term
memory to retain both valuable documents and recent documents. A two-tier
verifier equipped with an evidence finder is proposed to rethink and reflect on
the relationship between the claim and citations. Furthermore, active retrieval
and diverse query generation are utilized to enhance both the precision and
breadth of the retrieved documents. We conduct extensive experiments on five
datasets across three knowledge-intensive tasks, and the results reveal that
VTG significantly outperforms baselines.
A Simple yet Effective Framework for Active Learning to Rank
While China has become the biggest online market in the world with around 1
billion internet users, Baidu runs the world's largest Chinese search engine,
serving hundreds of millions of daily active users and responding to billions
of queries per day. To handle the diverse query requests from users at
web-scale, Baidu has made tremendous efforts in understanding users' queries,
retrieving relevant content from a pool of trillions of webpages, and ranking
the most relevant webpages at the top of the results. Among the components used
in Baidu search, learning to rank (LTR) plays a critical role, and an extremely
large number of queries together with relevant webpages must be promptly
labeled to train and update the online LTR models. To reduce the cost and time
of query/webpage labeling, we study the problem of Active
Learning to Rank (active LTR) that selects unlabeled queries for annotation and
training in this work. Specifically, we first investigate the criterion
Ranking Entropy (RE), which characterizes the entropy of relevant webpages
under a query produced by a sequence of online LTR models updated at different
checkpoints, using a Query-By-Committee (QBC) method. Then, we explore a new
criterion, Prediction Variance (PV), which measures the variance of prediction
results for all relevant webpages under a query. Our empirical studies find
that RE may favor low-frequency queries from the pool for labeling, while PV
prioritizes high-frequency queries. Finally, we combine these two
complementary criteria as the sample selection strategy for active learning.
Extensive experiments with comparisons to baseline algorithms show that the
proposed approach can train LTR models achieving higher Discounted Cumulative
Gain (a relative improvement of ΔDCG4 = 1.38%) with the same budgeted labeling
effort.
Comment: This paper is accepted to Machine Intelligence Research and a short
version is presented in the NeurIPS 2022 Workshop on Human in the Loop
Learning.
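The two criteria can be sketched concretely. The snippet below is an interpretation of the abstract's description, not the paper's exact formulas: RE is computed here as the entropy of which document each committee member (a model checkpoint) ranks first, and PV as the variance of one model's predicted relevance scores over a query's documents; the function names are assumptions.

```python
import math
from statistics import pvariance

def ranking_entropy(committee_scores):
    """Query-By-Committee Ranking Entropy sketch.
    committee_scores[m][d] = relevance score of document d from committee
    member m. Estimate how often each document is ranked top-1 across the
    committee, then take the entropy of that distribution: unanimous
    committees give 0, disagreeing committees give a positive value."""
    n_docs = len(committee_scores[0])
    top1_counts = [0] * n_docs
    for scores in committee_scores:
        top1_counts[max(range(n_docs), key=lambda d: scores[d])] += 1
    m = len(committee_scores)
    probs = [c / m for c in top1_counts]
    return -sum(p * math.log(p) for p in probs if p > 0)

def prediction_variance(doc_scores):
    """Prediction Variance sketch: variance of a single model's predicted
    relevance over all candidate documents for one query."""
    return pvariance(doc_scores)
```

Queries scoring high under either criterion are the ones worth sending to annotators; the paper's finding is that RE and PV select different parts of the query pool, so combining them gives a better labeling budget.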
Graph Enhanced BERT for Query Understanding
Query understanding plays a key role in exploring users' search intents and
helping users locate their most desired information. However, it is
inherently challenging since it needs to capture semantic information from
short and ambiguous queries and often requires massive task-specific labeled
data. In recent years, pre-trained language models (PLMs) have advanced various
natural language processing tasks because they can extract general semantic
information from large-scale corpora. Therefore, there are unprecedented
opportunities to adopt PLMs for query understanding. However, there is a gap
between the goal of query understanding and existing pre-training strategies --
the goal of query understanding is to boost search performance while existing
strategies rarely consider this goal. Thus, directly applying them to query
understanding is sub-optimal. On the other hand, search logs contain user
clicks between queries and URLs, which provide rich behavioral information
about queries beyond their content. Therefore, in this paper, we aim
to fill this gap by exploring search logs. In particular, to incorporate search
logs into pre-training, we first construct a query graph where nodes are
queries and two queries are connected if they lead to clicks on the same URLs.
Then we propose a novel graph-enhanced pre-training framework, GE-BERT, which
can leverage both query content and the query graph. In other words, GE-BERT
can capture both the semantic information and the users' search behavioral
information of queries. Extensive experiments on various query understanding
tasks have demonstrated the effectiveness of the proposed framework.
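The query-graph construction described above is simple enough to sketch directly: nodes are queries, and an edge connects two queries whenever they led to clicks on the same URL. This is a minimal illustration of that rule (the function name and the shape of the click log are assumptions; GE-BERT's actual preprocessing may differ, e.g. in edge weighting or click-count thresholds):

```python
from collections import defaultdict
from itertools import combinations

def build_query_graph(click_log):
    """click_log: iterable of (query, clicked_url) pairs.
    Returns the edge set of the query graph: an undirected edge (q1, q2)
    for every pair of queries that share a clicked URL."""
    url_to_queries = defaultdict(set)
    for query, url in click_log:
        url_to_queries[url].add(query)
    edges = set()
    for queries in url_to_queries.values():
        # every pair of queries co-clicking this URL gets an edge;
        # sorting canonicalizes the undirected edge as an ordered tuple
        for q1, q2 in combinations(sorted(queries), 2):
            edges.add((q1, q2))
    return edges
```

On this graph, co-clicked queries become neighbors even when they share no words, which is exactly the behavioral signal beyond query content that the pre-training framework exploits.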