Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models
Novelty detection is a fundamental task of machine learning which aims to
detect abnormal (i.e., out-of-distribution (OOD)) samples. Since
diffusion models have recently emerged as the de facto standard generative
framework with surprising generation results, novelty detection via diffusion
models has also gained much attention. Recent methods have mainly utilized the
reconstruction property of in-distribution samples. However, they often
struggle to detect OOD samples that share similar background information with the
in-distribution data. Based on our observation that diffusion models can
project any sample to an in-distribution sample with similar background
information, we propose Projection Regret (PR), an efficient novelty
detection method that mitigates the bias of non-semantic information. To be
specific, PR computes the perceptual distance between the test image and its
diffusion-based projection to detect abnormality. Since the perceptual distance
often fails to capture semantic changes when the background information is
dominant, we cancel out the background bias by comparing it against recursive
projections. Extensive experiments demonstrate that PR outperforms the prior
art of generative-model-based novelty detection methods by a significant
margin.
Comment: NeurIPS 202
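The scoring idea in the abstract above (distance to the projection, corrected by distances between recursive projections) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `project` and `distance` are hypothetical callables standing in for the diffusion-based projection and the perceptual distance, `n_recursive` is an assumed parameter, and the toy "images" are just (semantic, background) pairs chosen so that the distance is visibly biased by the background.

```python
from statistics import mean
from typing import Callable, Tuple

Image = Tuple[float, float]  # toy stand-in: (semantic value, background value)


def projection_regret(
    x: Image,
    project: Callable[[Image], Image],
    distance: Callable[[Image, Image], float],
    n_recursive: int = 4,
) -> float:
    """Distance from x to its projection, minus the average distance
    between the projection and its own (recursive) projections."""
    projected = project(x)
    base = distance(x, projected)
    # Assumption: the background-induced bias also shows up when an
    # already in-distribution sample is projected again, so it cancels.
    correction = mean(
        distance(projected, project(projected)) for _ in range(n_recursive)
    )
    return base - correction


# Toy stand-ins (hypothetical): projection resets the semantic part to the
# in-distribution value 0.0 and keeps the background unchanged.
def toy_project(img: Image) -> Image:
    return (0.0, img[1])


# Toy distance: semantic change plus a term that grows with the background.
def toy_distance(a: Image, b: Image) -> float:
    return abs(a[0] - b[0]) + 0.5 * a[1]


in_dist_busy = (0.0, 8.0)  # in-distribution semantics, dominant background
ood_plain = (3.0, 2.0)     # novel semantics, mild background

raw_in = toy_distance(in_dist_busy, toy_project(in_dist_busy))
raw_ood = toy_distance(ood_plain, toy_project(ood_plain))
pr_in = projection_regret(in_dist_busy, toy_project, toy_distance)
pr_ood = projection_regret(ood_plain, toy_project, toy_distance)
```

In this toy setup the raw distances of the two samples coincide, so the raw distance cannot separate them, while the corrected score is zero for the in-distribution sample and positive for the novel one.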
YTCommentQA: Video Question Answerability in Instructional Videos
Instructional videos provide detailed how-to guides for various tasks, with
viewers often posing questions regarding the content. Addressing these
questions is vital for comprehending the content, yet receiving immediate
answers is difficult. While numerous computational models have been developed
for Video Question Answering (Video QA) tasks, they are primarily trained on
questions generated based on video content, aiming to produce answers from
within the content. However, in real-world situations, users may pose questions
that go beyond the video's informational boundaries, highlighting the necessity
to determine if a video can provide the answer. Discerning whether a question
can be answered by video content is challenging due to the multi-modal nature
of videos, where visual and verbal information are intertwined. To bridge this
gap, we present the YTCommentQA dataset, which contains naturally-generated
questions from YouTube, categorized by their answerability and required
modality to answer -- visual, script, or both. Experiments with answerability
classification tasks demonstrate the complexity of YTCommentQA and emphasize
the need to comprehend the combined role of visual and script information in
video reasoning. The dataset is available at
https://github.com/lgresearch/YTCommentQA.
Comment: AAAI 202
Curve Your Attention: Mixed-Curvature Transformers for Graph Representation Learning
Real-world graphs naturally exhibit hierarchical or cyclical structures that
are unfit for the typical Euclidean space. While there exist graph neural
networks that leverage hyperbolic or spherical spaces to learn representations
that embed such structures more accurately, these methods are confined to
the message-passing paradigm, making the models vulnerable to side effects
such as oversmoothing and oversquashing. More recent works have proposed global
attention-based graph Transformers that can easily model long-range
interactions, but their extensions towards non-Euclidean geometry remain
unexplored. To bridge this gap, we propose Fully Product-Stereographic
Transformer, a generalization of Transformers towards operating entirely on the
product of constant curvature spaces. When combined with tokenized graph
Transformers, our model can learn the curvature appropriate for the input graph
in an end-to-end fashion, without the need for additional tuning on different
curvature initializations. We also provide a kernelized approach to
non-Euclidean attention, which enables our model to run with time and memory cost
linear in the number of nodes and edges while respecting the underlying
geometry. Experiments on graph reconstruction and node classification
demonstrate the benefits of generalizing Transformers to the non-Euclidean
domain.
Comment: 19 pages, 7 figures
Discriminator-Guided Multi-step Reasoning with Language Models
In the context of multi-step reasoning, language model (LM) probabilities
are often miscalibrated -- solutions with high probabilities are not always
correct. Therefore, greedy decoding, which is the standard decoding method for
reasoning tasks, often yields incorrect solutions. In addition, methods such as
self-consistency and verifiers rely on sampling from the LM distribution and do
not tackle the underlying issue. To address this, we introduce Guiding
Multi-step ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise
decoding approach that nudges the model towards producing correct reasoning
steps. GRACE employs a discriminator model, which is trained to differentiate
correct steps from invalid ones, to adjust decoding preferences based on the
correctness of each reasoning step. Importantly, GRACE does not require
fine-tuning or re-training the LMs. When compared with conventional decoding
strategies over four popular math reasoning benchmarks, GRACE exhibits
significant improvements in both final answer accuracy and step correctness,
outperforming both greedy decoding and self-consistency. Our code can
be found at https://github.com/mukhal/grace.
Comment: 19 pages, 7 figures, and 8 tables
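The stepwise decoding described in the abstract above can be sketched roughly as below. This is a toy illustration, not the GRACE code: `propose_steps` and `discriminator` are hypothetical callables standing in for LM sampling and the trained correctness discriminator, and the weighted combination controlled by `beta` is an assumed scoring rule chosen for illustration.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (step text, LM log-probability)


def guided_decode(
    question: str,
    propose_steps: Callable[[str, List[str]], List[Candidate]],
    discriminator: Callable[[str, List[str], str], float],
    beta: float = 0.8,
    max_steps: int = 10,
    stop_token: str = "<end>",
) -> List[str]:
    """Build a solution one step at a time, picking at each step the
    candidate with the best mix of LM score and discriminator score."""
    prefix: List[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(question, prefix)
        if not candidates:
            break
        # Assumed scoring rule: (1 - beta) * LM log-prob + beta * discriminator.
        best_step, _ = max(
            candidates,
            key=lambda c: (1 - beta) * c[1]
            + beta * discriminator(question, prefix, c[0]),
        )
        prefix.append(best_step)
        if best_step == stop_token:
            break
    return prefix


# Toy stand-ins (hypothetical): the "LM" prefers a wrong first step.
def toy_propose(question: str, prefix: List[str]) -> List[Candidate]:
    if not prefix:
        return [("2 + 3 = 6", -0.1), ("2 + 3 = 5", -0.9)]
    return [("<end>", 0.0)]


def toy_discriminator(question: str, prefix: List[str], step: str) -> float:
    return 1.0 if step in ("2 + 3 = 5", "<end>") else -1.0


guided = guided_decode("What is 2 + 3?", toy_propose, toy_discriminator, beta=0.8)
lm_only = guided_decode("What is 2 + 3?", toy_propose, toy_discriminator, beta=0.0)
```

With `beta=0.0` the loop reduces to picking the most likely step under the toy LM, which selects the incorrect step; with `beta=0.8` the discriminator score outweighs the LM preference. The LM itself is never updated, which matches the abstract's point that no fine-tuning of the LM is required.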
Exploring Demonstration Ensembling for In-context Learning
In-context learning (ICL) operates by showing language models (LMs) examples
of input-output pairs for a given task, i.e., demonstrations. The standard
approach for ICL is to prompt the LM with concatenated demonstrations followed
by the test input. This approach suffers from some issues. First, concatenation
offers almost no control over the contribution of each demo to the model
prediction. This can be sub-optimal when some demonstrations are irrelevant to
the test example. Second, due to the input length limit of some transformer
models, it might be infeasible to fit many examples into the context,
especially when dealing with long-input tasks. In this work, we explore
Demonstration Ensembling (DENSE) as an alternative to simple concatenation.
DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations and
then combines the output probabilities resulting from each subset to produce
the final prediction. We study different ensembling methods using GPT-J and
experiment on 12 language tasks. Our experiments show that weighted max
ensembling outperforms vanilla concatenation by as much as 2.4 average points.
Code available at https://github.com/mukhal/icl-ensembling.
Comment: Published at ME-FoMo workshop at ICLR 2023. arXiv version includes
evaluation on 5 more tasks
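The bucket-then-combine procedure described in the abstract above might look like the following sketch. It assumes each bucket of demonstrations has already produced a probability distribution over labels; the function name `ensemble`, the per-bucket `weights`, and the exact form of the weighted mean and weighted max rules are illustrative assumptions and may differ from the definitions used in the paper.

```python
from typing import Dict, List, Optional, Tuple


def ensemble(
    bucket_probs: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
    method: str = "mean",
) -> Tuple[str, Dict[str, float]]:
    """Combine per-bucket label probabilities into one prediction.

    bucket_probs holds one label -> probability dict per bucket.
    weights are hypothetical per-bucket relevance weights (default: uniform).
    """
    if weights is None:
        weights = [1.0] * len(bucket_probs)
    scores: Dict[str, float] = {}
    for label in bucket_probs[0]:
        weighted = [w * p[label] for w, p in zip(weights, bucket_probs)]
        if method == "mean":
            scores[label] = sum(weighted) / sum(weights)
        elif method == "max":
            scores[label] = max(weighted)
        else:
            raise ValueError(f"unknown method: {method}")
    prediction = max(scores, key=lambda label: scores[label])
    return prediction, scores


# Made-up probabilities from three buckets of demonstrations.
probs = [
    {"pos": 0.6, "neg": 0.4},
    {"pos": 0.3, "neg": 0.7},
    {"pos": 0.55, "neg": 0.45},
]

mean_pred, _ = ensemble(probs, method="mean")
max_pred, _ = ensemble(probs, method="max")
# Down-weighting the second bucket (e.g. judged less relevant to the test input).
weighted_max_pred, _ = ensemble(probs, weights=[1.0, 0.2, 1.0], method="max")
```

Unlike plain concatenation, the weights give explicit control over how much each subset of demonstrations contributes: here the down-weighted second bucket no longer decides the prediction.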
Merging Generated and Retrieved Knowledge for Open-Domain QA
Open-domain question answering (QA) systems are often built with retrieval
modules. However, retrieving passages from a given source is known to suffer
from insufficient knowledge coverage. Alternatively, prompting large language
models (LLMs) to generate contextual passages based on their parametric
knowledge has been shown to improve QA performance. Yet, LLMs tend to
"hallucinate" content that conflicts with the retrieved knowledge. Based on the
intuition that answers supported by both sources are more likely to be correct,
we propose COMBO, a Compatibility-Oriented knowledge Merging for Better
Open-domain QA framework, to effectively leverage the two sources of
information. Concretely, we match LLM-generated passages with retrieved
counterparts into compatible pairs, based on discriminators trained with silver
compatibility labels. Then a Fusion-in-Decoder-based reader model handles
passage pairs to arrive at the final answer. Experiments show that COMBO
outperforms competitive baselines on three out of four tested open-domain QA
benchmarks. Further analysis reveals that our proposed framework demonstrates
greater efficacy in scenarios with a higher degree of knowledge conflicts.
Comment: EMNLP 2023 - Camera Ready
Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost
To overcome the quadratic cost of self-attention, recent works have proposed
various sparse attention modules, most of which fall under one of two groups:
1) sparse attention under hand-crafted patterns and 2) full attention
followed by a sparse variant of softmax such as α-entmax. Unfortunately,
the first group lacks adaptability to data while the second still requires
quadratic cost in training. In this work, we propose SBM-Transformer, a model
that resolves both problems by endowing each attention head with a
mixed-membership Stochastic Block Model (SBM). Then, each attention head
data-adaptively samples a bipartite graph, the adjacency of which is used as an
attention mask for each input. During backpropagation, a straight-through
estimator is used to flow gradients beyond the discrete sampling step and
adjust the probabilities of sampled edges based on the predictive loss. The
forward and backward costs are thus linear in the number of edges, which each
attention head can also choose flexibly based on the input. By assessing the
distribution of graphs, we theoretically show that SBM-Transformer is a
universal approximator for arbitrary sequence-to-sequence functions in
expectation. Empirical evaluations under the LRA and GLUE benchmarks
demonstrate that our model outperforms previous efficient variants as well as
the original Transformer with full attention. Our implementation can be found
at https://github.com/sc782/SBM-Transformer.
Comment: 19 pages, 8 figures