Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering
Many vision and language tasks require commonsense reasoning beyond
data-driven image and natural language processing. Here we adopt Visual
Question Answering (VQA) as an example task, where a system is expected to
answer a question in natural language about an image. Current state-of-the-art
systems attempt to solve the task using deep neural architectures and achieve
promising performance. However, the resulting systems are generally opaque and
struggle to answer questions that require external knowledge. In this paper, we
present an explicit reasoning layer on top of a
set of penultimate neural network based systems. The reasoning layer enables
reasoning and answering questions where additional knowledge is required, and
at the same time provides an interpretable interface to the end users.
Specifically, the reasoning layer adopts a Probabilistic Soft Logic (PSL) based
engine to reason over a basket of inputs: visual relations, the semantic parse
of the question, and background ontological knowledge from word2vec and
ConceptNet. Experimental analysis of the answers and the key evidential
predicates generated on the VQA dataset validate our approach.
Comment: 9 pages, 3 figures, AAAI 201
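To give a flavor of the machinery such a PSL engine builds on, the sketch below implements the Lukasiewicz relaxations that PSL uses to score rules over soft truth values in [0, 1]. The predicate names and numbers are invented for illustration and are not taken from the paper.

```python
# Minimal sketch of the Lukasiewicz relaxations used by PSL-style engines.
# All predicate names and truth values below are illustrative only.

def luk_and(a, b):
    # Lukasiewicz t-norm: soft conjunction of truth values in [0, 1]
    return max(0.0, a + b - 1.0)

def luk_or(a, b):
    # Lukasiewicz t-conorm: soft disjunction
    return min(1.0, a + b)

def rule_satisfaction(body, head):
    # A rule body -> head is satisfied to degree min(1, 1 - body + head);
    # PSL inference minimizes the distance to full satisfaction (1.0).
    return min(1.0, 1.0 - body + head)

# Hypothetical evidential predicates for "What is on the table?"
visual_relation = 0.9   # detector confidence for on(cup, table)
word_similarity = 0.8   # word2vec similarity of "cup" to the question focus
answer_truth = 0.7      # current soft truth of answer("cup")

body = luk_and(visual_relation, word_similarity)
print(rule_satisfaction(body, answer_truth))
```

Inspecting which grounded rules are far from full satisfaction is what makes the evidential predicates interpretable to an end user.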
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI
Natural Language Inference (NLI) is considered a representative task to test
natural language understanding (NLU). In this work, we propose an extensible
framework to collectively yet categorically test diverse Logical reasoning
capabilities required for NLI (and by extension, NLU). Motivated by behavioral
testing, we create a semi-synthetic large test-bench (363 templates, 363k
examples) and an associated framework that offers the following utilities: 1)
individually test and analyze reasoning capabilities along 17 reasoning
dimensions (including pragmatic reasoning); 2) design experiments to study
cross-capability information content (leave one out or bring one in); and 3)
control for artifacts and biases through the synthetic nature of the data.
Automated test-case instantiation from free-form natural language templates
(inherited from CheckList) and a well-defined taxonomy of capabilities enable
us to extend to (cognitively) harder test cases while varying the complexity
of the natural language. Through our analysis of
state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and
non-trivial even with training on additional resources). Some capabilities
stand out as harder. Further fine-grained analysis and fine-tuning experiments
reveal more insights about these capabilities and the models -- supporting and
extending previous observations. Towards the end, we also perform a user study
to investigate whether behavioral information can be utilized to generalize
better for some models than for others.
Comment: arXiv admin note: substantial text overlap with arXiv:2107.0722
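The CheckList-style template instantiation that powers such a test bench can be sketched as follows. The template and word lists here are invented for illustration and are not drawn from the actual 363-template benchmark.

```python
from itertools import product

# Illustrative NLI template in the spirit of CheckList-style instantiation;
# the template and lexicons below are invented, not from the LoNLI benchmark.
template = {
    "premise": "{name} is {adj1} than {other}.",
    "hypothesis": "{other} is {adj2} than {name}.",
    "label": "entailment",  # holds when adj2 is the antonym of adj1
}
lexicons = {
    "name": ["John", "Mary"],
    "other": ["Tom"],
    ("adj1", "adj2"): [("taller", "shorter"), ("older", "younger")],
}

def instantiate(template, lexicons):
    # Expand the template into concrete premise/hypothesis/label test cases.
    names, others, pairs = lexicons["name"], lexicons["other"], lexicons[("adj1", "adj2")]
    cases = []
    for name, other, (adj1, adj2) in product(names, others, pairs):
        fill = dict(name=name, other=other, adj1=adj1, adj2=adj2)
        cases.append({
            "premise": template["premise"].format(**fill),
            "hypothesis": template["hypothesis"].format(**fill),
            "label": template["label"],
        })
    return cases

cases = instantiate(template, lexicons)
print(len(cases))  # 2 names x 1 other x 2 adjective pairs = 4 examples
```

Because every example is generated from a controlled template, the label is known by construction, which is what lets a synthetic benchmark control for annotation artifacts.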
Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
Recent explorations with commercial Large Language Models (LLMs) have shown
that non-expert users can jailbreak LLMs simply by manipulating the prompts,
resulting in degenerate output behavior, privacy and security breaches,
offensive outputs, and violations of content-regulation policies. Few formal
studies have been carried out to formalize and analyze these attacks and their
mitigations. We bridge this gap by proposing a formalism and a taxonomy of
known (and possible) jailbreaks. We perform a survey of existing jailbreak
methods and their effectiveness on open-source and commercial LLMs (such as
GPT-3.5, OPT, BLOOM, and FLAN-T5-XXL). We further propose a limited set of
prompt guards and discuss their effectiveness against known attack types.
Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs
Reasoning is a fundamental component of language understanding. Recent
prompting techniques, such as chain of thought, have consistently improved
LLMs' performance on various reasoning tasks. Nevertheless, there is still
little understanding of what triggers reasoning abilities in LLMs in the
inference stage. In this paper, we introduce code prompting, a chain of prompts
that transforms a natural language problem into code and directly prompts the
LLM using the generated code without resorting to external code execution. We
hypothesize that code prompts can elicit certain reasoning capabilities of LLMs
trained on text and code and utilize the proposed method to improve conditional
reasoning, the ability to infer different conclusions depending on the
fulfillment of certain conditions. We find that code prompting yields a
substantial performance boost for multiple LLMs (up to 22.52 percentage points
on GPT-3.5, 7.75 on Mixtral, and 16.78 on Mistral) across multiple conditional
reasoning datasets. We then conduct comprehensive experiments to understand how
code prompts trigger reasoning abilities and which capabilities are elicited in
the underlying models. Our analysis of GPT-3.5 reveals that the code formatting
of the input problem is essential for performance improvement. Furthermore,
code prompts improve sample efficiency of in-context learning and facilitate
state tracking of variables or entities.
Comment: Code, prompt templates, prompts, and outputs are publicly available
at https://github.com/UKPLab/arxiv2024-conditional-reasoning-llm
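As a toy rendition of the idea (the actual prompt templates are in the linked repository), a conditional-reasoning question might be rewritten as code that itself becomes the prompt, without ever being executed. The rewriting below is hand-written for illustration; the method generates it automatically.

```python
# Toy sketch of code prompting: the natural-language problem is rewritten
# as code, and that code (not its execution result) becomes the LLM prompt.
# This hand-written example only illustrates the spirit of the method.

question = (
    "You can apply for a permit if you are over 18 and a resident. "
    "Alice is 20 and a resident. Can Alice apply?"
)

code_prompt = '''\
# You can apply for a permit if you are over 18 and a resident.
def can_apply(age, resident):
    return age > 18 and resident

# Alice is 20 and a resident. Can Alice apply?
alice_can_apply = can_apply(age=20, resident=True)
'''

# The code is sent as plain text; the LLM is asked to predict the value of
# `alice_can_apply` rather than the code being run by an interpreter.
llm_input = f"{code_prompt}\n# Question: what is the value of alice_can_apply?"
print("alice_can_apply" in llm_input)
```

The conditional structure of the problem is made explicit as an `if`-like predicate and a function call, which is the kind of formatting the paper finds essential for the performance gain.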
Multilingual CheckList: Generation and Evaluation
The recently proposed CheckList (Ribeiro et al., 2020) approach to evaluation
of NLP systems has revealed high failure rates for basic capabilities for
multiple state-of-the-art and commercial models. However, the CheckList
creation process is manual, creating a bottleneck for building multilingual
CheckLists that cater to hundreds of languages. In this work, we explore
multiple approaches to generate and evaluate the quality of Multilingual
CheckList. We devise an algorithm -- Automated Multilingual Checklist
Generation (AMCG) -- for automatically transferring a CheckList from a source
to a target language, relying on a reasonable machine translation system. We then
compare the CheckList generated by AMCG with CheckLists generated with
different levels of human intervention. Through in-depth crosslingual
experiments between English and Hindi, and broad multilingual experiments
spanning 11 languages, we show that the automatic approach can provide accurate
estimates of failure rates of a model across capabilities, as would a
human-verified CheckList, and better than CheckLists generated by humans from
scratch.
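The core transfer step such an approach implies, translating a template while keeping its placeholders intact, might look like the sketch below. The masking scheme and the stub translator are my assumptions for illustration; the real AMCG algorithm is more involved.

```python
import re

# Sketch of a template-transfer step in the spirit of AMCG: translate a
# CheckList template while preserving its placeholders. The stub translator
# stands in for a real machine translation system.

def transfer_template(template: str, translate) -> str:
    # 1. Mask placeholders so the MT system does not mangle them.
    slots = re.findall(r"\{\w+\}", template)
    masked = template
    for i, slot in enumerate(slots):
        masked = masked.replace(slot, f"SLOT{i}", 1)
    # 2. Translate the masked sentence with the MT system.
    translated = translate(masked)
    # 3. Restore the placeholders in the translated sentence.
    for i, slot in enumerate(slots):
        translated = translated.replace(f"SLOT{i}", slot)
    return translated

# Stand-in for a real MT system: a fixed English -> French lookup.
def fake_translate(text):
    return {"SLOT0 is a great SLOT1 .": "SLOT0 est un excellent SLOT1 ."}.get(text, text)

print(transfer_template("{name} is a great {profession} .", fake_translate))
```

Once the template is in the target language, target-language lexicons can be plugged into the placeholders to instantiate test cases exactly as in the monolingual setting.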
Melting of the vortex lattice through intermediate hexatic fluid in a-MoGe thin film
The hexatic fluid is a phase intermediate between a solid and a liquid, with
short-range positional order but quasi-long-range orientational order. The
celebrated theory of Berezinskii, Kosterlitz and Thouless, subsequently
refined by Halperin, Nelson and Young, predicted that a 2-dimensional
hexagonal solid can melt in two steps: first, through a transformation from a
solid to a hexatic fluid which retains quasi-long-range orientational order,
and then from a hexatic fluid to an isotropic liquid. In this paper, using a
combination of real-space imaging and transport measurements, we show that the
2-dimensional vortex lattice in an a-MoGe thin film follows this sequence of
melting as the magnetic field is increased. Identifying the signatures of the
various transitions in the bulk transport properties of the superconductor, we
construct a vortex phase diagram for a two-dimensional superconductor.
Comment: New data added in this version