8 research outputs found
Evaluating Modules in Graph Contrastive Learning
The recent emergence of contrastive learning approaches facilitates the
research on graph representation learning (GRL), introducing graph contrastive
learning (GCL) into the literature. These methods contrast semantically similar
and dissimilar sample pairs to encode the semantics into node or graph
embeddings. However, most existing works only perform model-level evaluation
and do not explore the combination space of modules for more comprehensive and
systematic studies. For effective module-level evaluation, we propose a
framework that decomposes GCL models into four modules: (1) a sampler to
generate anchor, positive and negative data samples (nodes or graphs); (2) an
encoder and a readout function to get sample embeddings; (3) a discriminator to
score each sample pair (anchor-positive and anchor-negative); and (4) an
estimator to define the loss function. Based on this framework, we conduct
controlled experiments over a wide range of architectural designs and
hyperparameter settings on node and graph classification tasks. Specifically,
we manage to quantify the impact of a single module, investigate the
interaction between modules, and compare the overall performance with current
model architectures. Our key findings include a set of module-level guidelines
for GCL, e.g., simple samplers from LINE and DeepWalk are strong and robust; an
MLP encoder combined with a Sum readout can achieve competitive performance
on graph classification. Finally, we release our implementations and results as
OpenGCL, a modularized toolkit that allows convenient reproduction, standard
model and module evaluation, and easy extension.
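As a rough illustration of the four-module decomposition described above (sampler, encoder/readout, discriminator, estimator), the following minimal PyTorch sketch wires the modules into a single contrastive loss for node-level learning. All names are hypothetical, and the specific choices (random-neighbor positives, inner-product discriminator, NCE-style estimator) are only one point in the studied combination space; this is not the OpenGCL API.

    import random
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # (1) Sampler: anchor / positive / negative node indices.
    #     Assumes every node has at least one neighbor in adj_list.
    def sampler(adj_list, k_neg=5):
        n = len(adj_list)
        anchors = list(range(n))
        positives = [random.choice(adj_list[i]) for i in anchors]      # a neighbor
        negatives = [[random.randrange(n) for _ in range(k_neg)] for _ in anchors]
        return torch.tensor(anchors), torch.tensor(positives), torch.tensor(negatives)

    # (2) Encoder; a readout (e.g. Sum over nodes) would be added for graph-level tasks.
    encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))

    # (3) Discriminator: score an (anchor, other) embedding pair by inner product.
    def discriminator(a, b):
        return (a * b).sum(dim=-1)

    # (4) Estimator: turn pair scores into a loss, here a simple NCE-style objective.
    def estimator(pos_score, neg_score):
        return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())

    def contrastive_loss(x, adj_list):       # x: node features, shape [n, 64]
        anchors, positives, negatives = sampler(adj_list)
        z = encoder(x)
        pos_score = discriminator(z[anchors], z[positives])               # [n]
        neg_score = discriminator(z[anchors].unsqueeze(1), z[negatives])  # [n, k_neg]
        return estimator(pos_score, neg_score)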
Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
This paper reexamines the research on out-of-distribution (OOD) robustness in
the field of NLP. We find that the distribution shift settings in previous
studies commonly lack adequate challenges, hindering the accurate evaluation of
OOD robustness. To address these issues, we propose a benchmark construction
protocol that ensures clear differentiation and challenging distribution
shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution
robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we
conduct a series of experiments on pre-trained language models for analysis and
evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the
relationship between in-distribution (ID) and OOD performance. We identify
three typical relationship types that reveal the models' inner learning
mechanisms and could potentially help forecast how OOD robustness evolves as ID
performance advances. Then, we evaluate 5 classic methods on BOSS and
find that, despite exhibiting some effectiveness in specific cases, they do not
offer significant improvement compared to vanilla fine-tuning. Further, we
evaluate 5 LLMs with various adaptation paradigms and find that when sufficient
ID data is available, fine-tuned domain-specific models significantly
outperform LLMs on ID examples. However, on OOD instances, prioritizing LLMs
with in-context learning yields better results. We find that both
fine-tuned small models and LLMs face challenges in effectively addressing
downstream tasks. The code is public at
\url{https://github.com/lifan-yuan/OOD_NLP}.
Comment: Accepted to the NeurIPS 2023 Datasets and Benchmarks Track. Code is
available at \url{https://github.com/lifan-yuan/OOD_NLP}.
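The ID-versus-OOD comparison at the heart of this analysis can be sketched in a few lines of Python. The names below (model.predict, the split containers) are hypothetical conveniences, not the BOSS benchmark code:

    # A minimal sketch of comparing in-distribution and OOD performance.
    def accuracy(model, split):
        # split: list of (text, label) pairs; model.predict is a hypothetical API
        correct = sum(model.predict(text) == label for text, label in split)
        return correct / len(split)

    def id_ood_report(model, id_split, ood_splits):
        id_acc = accuracy(model, id_split)
        report = {"id": id_acc}
        for name, split in ood_splits.items():        # e.g. one split per OOD dataset
            ood_acc = accuracy(model, split)
            report[name] = ood_acc
            report[name + "_gap"] = id_acc - ood_acc  # robustness gap
        return report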
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Textual adversarial attacks can discover models' weaknesses by adding
semantic-preserved but misleading perturbations to the inputs. The long-lasting
adversarial attack-and-defense arms race in Natural Language Processing (NLP)
is algorithm-centric, providing valuable techniques for automatic robustness
evaluation. However, the existing practice of robustness evaluation may exhibit
issues of incomplete evaluation, impractical evaluation protocols, and
invalid adversarial samples. In this paper, we aim to set up a unified
automatic robustness evaluation framework, shifting towards model-centric
evaluation to further exploit the advantages of adversarial attacks. To address
the above challenges, we first determine robustness evaluation dimensions based
on model capabilities and specify a suitable algorithm to generate
adversarial samples for each dimension. Then we establish the evaluation
protocol, including evaluation settings and metrics, under realistic demands.
Finally, we use the perturbation degree of adversarial samples to control the
sample validity. We implement a toolkit RobTest that realizes our automatic
robustness evaluation framework. In our experiments, we conduct a robustness
evaluation of RoBERTa models to demonstrate the effectiveness of our evaluation
framework, and further show the rationality of each component in the framework.
The code will be made public at \url{https://github.com/thunlp/RobTest}.
Comment: Accepted to Findings of ACL 2023.
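As a toy example of the idea that the perturbation degree of adversarial samples can be used to control sample validity, the sketch below applies a bounded character-level perturbation and reports the resulting accuracy drop. It is a simplified illustration, not the RobTest implementation:

    import random

    def perturb(text, degree=0.1, alphabet="abcdefghijklmnopqrstuvwxyz"):
        # degree caps the fraction of edited characters, keeping samples valid
        chars = list(text)
        n_edits = int(len(chars) * degree)
        for idx in random.sample(range(len(chars)), n_edits):
            chars[idx] = random.choice(alphabet)     # random character swap
        return "".join(chars)

    def robustness_drop(model, dataset, degree=0.1):
        # dataset: list of (text, label); model.predict is a hypothetical API
        clean = sum(model.predict(t) == y for t, y in dataset)
        attacked = sum(model.predict(perturb(t, degree)) == y for t, y in dataset)
        return (clean - attacked) / len(dataset)     # accuracy lost under attack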
A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks
Textual backdoor attacks are a kind of practical threat to NLP systems. By
injecting a backdoor in the training phase, the adversary could control model
predictions via predefined triggers. As various attack and defense models have
been proposed, it is of great significance to perform rigorous evaluations.
However, we highlight two issues in previous backdoor learning evaluations: (1)
The differences between real-world scenarios (e.g., releasing poisoned datasets
or models) are neglected, and we argue that each scenario has its own
constraints and concerns and thus requires specific evaluation protocols; (2) The
evaluation metrics only consider whether the attacks could flip the models'
predictions on poisoned samples and retain performances on benign samples, but
ignore that poisoned samples should also be stealthy and semantic-preserving.
To address these issues, we categorize existing works into three practical
scenarios in which attackers release datasets, pre-trained models, and
fine-tuned models respectively, then discuss their unique evaluation
methodologies. Regarding metrics, to evaluate poisoned samples more completely,
we use grammatical error increase and perplexity difference for stealthiness, along with
text similarity for validity. After formalizing the frameworks, we develop an
open-source toolkit OpenBackdoor to foster the implementations and evaluations
of textual backdoor learning. With this toolkit, we perform extensive
experiments to benchmark attack and defense models under the suggested
paradigm. To facilitate the underexplored defenses against poisoned datasets,
we further propose CUBE, a simple yet strong clustering-based defense baseline.
We hope that our frameworks and benchmarks could serve as the cornerstones for
future model development and evaluations.
Comment: NeurIPS 2022 Datasets & Benchmarks; toolkit available at
https://github.com/thunlp/OpenBackdoor
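The clustering intuition behind CUBE can be sketched as follows: embed the training samples, cluster the representations within each label, and discard suspiciously small clusters as likely poison. The code below is a simplified stand-in using scikit-learn's KMeans, not the actual CUBE algorithm:

    import numpy as np
    from sklearn.cluster import KMeans

    def filter_poison(embeddings, labels, min_cluster_frac=0.3, n_clusters=2):
        # embeddings: [N, d] sample representations; labels: [N] training labels
        keep = np.ones(len(labels), dtype=bool)
        for y in np.unique(labels):
            idx = np.where(labels == y)[0]
            assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings[idx])
            for c in range(n_clusters):
                members = idx[assign == c]
                if len(members) < min_cluster_frac * len(idx):  # suspiciously small
                    keep[members] = False                       # drop as likely poison
        return keep  # boolean mask over the training set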
Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models
Underlying data distributions of natural language, programming code, and
mathematical symbols vary vastly, presenting a complex challenge for large
language models (LLMs) that strive to achieve high performance across all three
domains simultaneously. Achieving a very high level of proficiency for an LLM
within a specific domain often requires extensive training with relevant
corpora, which is typically accompanied by a sacrifice in performance in other
domains. In this paper, we propose to directly fuse models that are already
highly specialized. The proposed fusion framework, UltraFuser,
consists of three distinct specialists that are already sufficiently trained on
language, coding, and mathematics. A token-level gating mechanism is introduced
to blend the specialists' outputs. A two-stage training strategy accompanied by
balanced sampling is designed to ensure stability. To effectively train the
fused model, we further construct a high-quality supervised instruction tuning
dataset, UltraChat 2, which includes text, code, and mathematical content. This
dataset comprises approximately 300,000 instructions and covers a wide range of
topics in each domain. Experiments show that our model can simultaneously
achieve mastery of the three crucial domains.
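The token-level gating mechanism can be sketched as a small gate network that mixes the three specialists' output distributions token by token. The module below is a hypothetical PyTorch illustration, not the UltraFuser implementation:

    import torch
    import torch.nn as nn

    class TokenGatedFusion(nn.Module):
        def __init__(self, specialists, hidden_dim, num_experts=3):
            super().__init__()
            self.specialists = nn.ModuleList(specialists)   # text / code / math LMs
            self.gate = nn.Linear(hidden_dim, num_experts)  # one weight per expert

        def forward(self, input_ids, hidden_states):
            # Each specialist maps input_ids to per-token logits over the vocabulary.
            expert_logits = torch.stack(
                [m(input_ids) for m in self.specialists], dim=-1
            )                                                # [batch, seq, vocab, experts]
            weights = torch.softmax(self.gate(hidden_states), dim=-1)  # [batch, seq, experts]
            # Blend the experts' output distributions token by token.
            return (expert_logits * weights.unsqueeze(2)).sum(dim=-1)  # [batch, seq, vocab]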