Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning
Visual question answering requires high-order reasoning about an image, which
is a fundamental capability needed by machine systems to follow complex
directives. Recently, modular networks have been shown to be an effective
framework for performing visual reasoning tasks. While modular networks were
initially designed with a degree of model transparency, their performance on
complex visual reasoning benchmarks was lacking. Current state-of-the-art
approaches do not provide an effective mechanism for understanding the
reasoning process. In this paper, we close the performance gap between
interpretable models and state-of-the-art visual reasoning methods. We propose
a set of visual-reasoning primitives which, when composed, manifest as a model
capable of performing complex reasoning tasks in an explicitly interpretable
manner. The fidelity and interpretability of the primitives' outputs enable an
unparalleled ability to diagnose the strengths and weaknesses of the resulting
model. Critically, we show that these primitives are highly performant,
achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show
that our model is able to effectively learn generalized representations when
provided a small amount of data containing novel object attributes. Using the
CoGenT generalization task, we show more than a 20 percentage point improvement
over the current state of the art. Comment: CVPR 2018 pre-print
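To make the compositional idea concrete, here is a minimal, non-neural Python sketch. The paper's primitives are differentiable attention modules; the plain functions and toy CLEVR-style scene below are hypothetical stand-ins that only illustrate how composed primitives expose inspectable intermediate outputs.

```python
# Non-neural sketch of composable visual-reasoning primitives.
# Scene, attributes, and primitives are illustrative stand-ins for the
# paper's differentiable attention modules.

scene = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "red", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]

def filter_attr(objects, attr, value):
    """Primitive: attend to objects whose attribute matches a value."""
    return [o for o in objects if o[attr] == value]

def count(objects):
    """Primitive: reduce an attended set to a number."""
    return len(objects)

# Compose primitives into a program for "How many red things are there?"
attended = filter_attr(scene, "color", "red")  # inspectable intermediate
answer = count(attended)
print(attended)  # the two red objects -- a diagnosable reasoning trace
print(answer)    # 2
```

The point of the structure is that each primitive's output can be inspected on its own, which is what makes failures of the composed model diagnosable.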
Behavioral Analysis of Vision-and-Language Navigation Agents
To be successful, Vision-and-Language Navigation (VLN) agents must be able to
ground instructions to actions based on their surroundings. In this work, we
develop a methodology to study agent behavior on a skill-specific basis --
examining how well existing agents ground instructions about stopping, turning,
and moving towards specified objects or rooms. Our approach is based on
generating skill-specific interventions and measuring changes in agent
predictions. We present a detailed case study analyzing the behavior of a
recent agent and then compare multiple agents in terms of skill-specific
competency scores. This analysis suggests that biases from training have
lasting effects on agent behavior and that existing models are able to ground
simple referring expressions. Our comparisons between models show that
skill-specific scores correlate with improvements in overall VLN task
performance. Comment: accepted to CVPR202
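As a rough sketch of the intervention methodology, the Python below perturbs an instruction with a skill-specific edit and measures how much an agent's action distribution shifts; `agent_policy` is a hypothetical stub standing in for a trained VLN model, and the paper's actual interventions are generated in simulation.

```python
# Sketch of skill-specific intervention analysis for a VLN agent.
import numpy as np

def agent_policy(instruction: str) -> np.ndarray:
    """Hypothetical stub: returns a distribution over four actions
    (forward, left, right, stop). Replace with a real VLN agent."""
    rng = np.random.default_rng(sum(map(ord, instruction)))
    logits = rng.normal(size=4)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def skill_sensitivity(base: str, intervened: str) -> float:
    """Total-variation distance between the agent's action
    distributions before and after the intervention."""
    p, q = agent_policy(base), agent_policy(intervened)
    return 0.5 * float(np.abs(p - q).sum())

# Object-grounding intervention: swap the referred object and check
# whether the agent's predictions change accordingly.
score = skill_sensitivity(
    "walk down the hall and stop at the chair",
    "walk down the hall and stop at the table",
)
print(f"object-skill sensitivity: {score:.3f}")
```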
Analysis of General Aviation fixed-wing aircraft accidents involving inflight loss of control using a state-based approach
Inflight loss of control (LOC-I) is a significant cause of General Aviation (GA) fixed-wing aircraft accidents. The United States National Transportation Safety Board (NTSB) database provides a rich source of accident data, but conventional analyses of the database yield limited insight into LOC-I. We investigate the causes of 5,726 LOC-I fixed-wing GA aircraft accidents in the United States in 1999–2008 and 2009–2017 using a state-based modeling approach. The multi-year analysis helps discern changes in causation trends over the last two decades. Our analysis highlights LOC-I causes such as pilot actions and mechanical issues that were not discernible in previous research efforts. The logic rules in the state-based approach help infer missing information from NTSB accident reports. We inferred that 4.84% (1999–2008) and 7.46% (2009–2017) of LOC-I accidents involved a preflight hazardous aircraft condition. We also inferred that 20.11% (1999–2008) and 19.59% (2009–2017) of LOC-I accidents happened because the aircraft hit an object or terrain. By removing redundant coding and identifying when codes are missing, the state-based approach potentially provides a more consistent way of coding accidents compared to the current coding system.
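To illustrate how logic rules can recover information missing from a report, here is a minimal Python sketch; the states and rules are hypothetical simplifications for illustration, not the paper's coding scheme or the NTSB taxonomy.

```python
# Sketch of state-based inference over accident codes: apply logic rules
# until a fixed point, inferring states the report leaves implicit.
# States and rules below are hypothetical simplifications.

RULES = [
    # premises (all must hold)                 -> inferred conclusion
    ({"component_failure", "phase_preflight"}, "preflight_hazardous_condition"),
    ({"hit_object_or_terrain"}, "impact_event"),
]

def apply_rules(states: set) -> set:
    """Repeatedly apply rules until no new states are inferred."""
    inferred = set(states)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= inferred and conclusion not in inferred:
                inferred.add(conclusion)
                changed = True
    return inferred

report = {"component_failure", "phase_preflight", "loss_of_control_inflight"}
print(apply_rules(report))  # preflight_hazardous_condition is inferred
```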
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings
We present a scalable approach for learning open-world object-goal navigation
(ObjectNav) -- the task of asking a virtual robot (agent) to find any instance
of an object in an unexplored environment (e.g., "find a sink"). Our approach
is entirely zero-shot -- i.e., it does not require ObjectNav rewards or
demonstrations of any kind. Instead, we train on the image-goal navigation
(ImageNav) task, in which agents find the location where a picture (i.e., goal
image) was captured. Specifically, we encode goal images into a multimodal,
semantic embedding space to enable training semantic-goal navigation
(SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D).
After training, SemanticNav agents can be instructed to find objects described
in free-form natural language (e.g., "sink", "bathroom sink", etc.) by
projecting language goals into the same multimodal, semantic embedding space.
As a result, our approach enables open-world ObjectNav. We extensively evaluate
our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe
absolute improvements in success of 4.2%-20.0% over existing zero-shot
methods. For reference, these gains are similar to or better than the 5%
improvement in success between the Habitat 2020 and 2021 ObjectNav challenge
winners. In an open-world setting, we discover that our agents can generalize
to compound instructions with a room explicitly mentioned (e.g., "Find a
kitchen sink") and when the target room can be inferred (e.g., "Find a sink and
a stove").Comment: code: https://github.com/gunagg/zso
A Review on working of Cflow
In the field of program analysis, call graphs provide a succinct, human-readable visual form of the function flows in a program. Call graphs are typically directed graphs that capture the sequence of subroutine invocations, depicting caller-callee dependencies. They are used to trace the flow a program takes during execution, laying a foundation for further analysis. Call graph generators take a program as input and produce its call graph; GNU Cflow is one such tool. It accepts one or more C programs as input and generates a procedure flow with a clear caller-callee sequence distinguished by level indentation, with callee functions indented inside their callers. The output can be altered by supplying flags and output-formatting options to suit the requirement. There is considerable scope to extend the Cflow source code and make use of its output. In this paper, we discuss how Cflow works, its expected output, its limitations, and directions for future research.
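To illustrate the form of output such a tool produces, the Python sketch below renders a hypothetical caller-callee graph depth-first with level indentation, the same presentation Cflow uses; Cflow itself extracts this graph from C source files.

```python
# Sketch of a cflow-style indented call tree over a toy call graph.
call_graph = {
    "main": ["parse_args", "run"],
    "run": ["load_input", "process"],
    "process": ["process"],  # direct recursion
}

def print_tree(func, depth=0, seen=frozenset()):
    """Depth-first print with level indentation; cut recursive cycles."""
    marker = " (recursive)" if func in seen else ""
    print("    " * depth + func + marker)
    if func in seen:
        return
    for callee in call_graph.get(func, []):
        print_tree(callee, depth + 1, seen | {func})

print_tree("main")
# main
#     parse_args
#     run
#         load_input
#         process
#             process (recursive)
```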
OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
We present a single neural network architecture composed of task-agnostic
components (ViTs, convolutions, and LSTMs) that achieves state-of-the-art
results on both the ImageNav ("go to location in <this picture>") and
ObjectNav ("find a chair") tasks without any task-specific modules such as
object detection, segmentation, mapping, or planning. Such general-purpose
methods offer
advantages of simplicity in design, positive scaling with available compute,
and versatile applicability to multiple tasks. Our work builds upon the recent
success of self-supervised learning (SSL) for pre-training vision transformers
(ViT). However, while the training recipes for convolutional networks are
mature and robust, the recipes for ViTs remain contingent and brittle, and,
in the case of ViTs for visual navigation, are yet to be fully discovered. Specifically,
we find that vanilla ViTs do not outperform ResNets on visual navigation. We
propose the use of a compression layer operating over ViT patch representations
to preserve spatial information, along with improvements to policy training. These
improvements allow us to demonstrate positive scaling laws for the first time
in visual navigation tasks. Consequently, our model advances state-of-the-art
performance on ImageNav from 54.2% to 82.0% success and performs competitively
against the concurrent state of the art on ObjectNav, with a success rate of 64.0% vs.
65.0%. Overall, this work does not present a fundamentally new approach, but
rather recommendations for training a general-purpose architecture that
achieves state-of-the-art performance today and could serve as a strong baseline
for future methods. Comment: 15 pages, 7 figures, 9 tables
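As a rough illustration of the compression-layer idea (shapes and channel sizes are ours, not the paper's exact configuration), the sketch below reshapes ViT patch tokens back to their 2D grid and compresses them with a strided convolution, so spatial layout survives into the policy input instead of being pooled away.

```python
# Sketch of a compression layer over ViT patch representations (PyTorch).
import torch
import torch.nn as nn

class PatchCompression(nn.Module):
    def __init__(self, embed_dim=768, out_dim=128, grid=14):
        super().__init__()
        self.grid = grid
        # A strided conv halves the spatial grid while mixing channels,
        # keeping coarse spatial structure in the compressed features.
        self.conv = nn.Conv2d(embed_dim, out_dim, kernel_size=3,
                              stride=2, padding=1)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, grid*grid, embed_dim), CLS token removed.
        b, n, d = patch_tokens.shape
        x = patch_tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        x = self.conv(x)      # (batch, out_dim, grid/2, grid/2)
        return x.flatten(1)   # flat feature vector for the policy

tokens = torch.randn(2, 14 * 14, 768)    # e.g., ViT-B/16 on 224x224 input
print(PatchCompression()(tokens).shape)  # torch.Size([2, 6272])
```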
The International Workshop on Osteoarthritis Imaging Knee MRI Segmentation Challenge: A Multi-Institute Evaluation and Analysis Framework on a Standardized Dataset
Purpose: To organize a knee MRI segmentation challenge for characterizing the
semantic and clinical efficacy of automatic segmentation methods relevant for
monitoring osteoarthritis progression.
Methods: A dataset partition consisting of 3D knee MRI from 88 subjects at
two timepoints with ground-truth articular (femoral, tibial, patellar)
cartilage and meniscus segmentations was standardized. Challenge submissions
and a majority-vote ensemble were evaluated using Dice score, average symmetric
surface distance, volumetric overlap error, and coefficient of variation on a
hold-out test set. Similarities in network segmentations were evaluated using
pairwise Dice correlations. Articular cartilage thickness was computed per-scan
and longitudinally. Correlation between thickness error and segmentation
metrics was measured using Pearson's coefficient. Two empirical upper bounds
for ensemble performance were computed using combinations of model outputs that
consolidated true positives and true negatives.
Results: Six teams (T1-T6) submitted entries for the challenge. No
significant differences were observed across all segmentation metrics for all
tissues (p=1.0) among the four top-performing networks (T2, T3, T4, T6). Dice
correlations between network pairs were high (>0.85). Per-scan thickness errors
were negligible among T1-T4 (p=0.99) and longitudinal changes showed minimal
bias (<0.03mm). Low correlations (<0.41) were observed between segmentation
metrics and thickness error. The majority-vote ensemble was comparable to top
performing networks (p=1.0). Empirical upper bound performances were similar
for both combinations (p=1.0).
Conclusion: Diverse networks learned to segment the knee similarly, and high
segmentation accuracy did not correlate with cartilage thickness accuracy. Voting
ensembles did not outperform individual networks but may help regularize
individual models. Comment: Submitted to Radiology: Artificial Intelligence; Fixed typo
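Two of the evaluation building blocks, the Dice score and the majority-vote ensemble, can be sketched in a few lines; the random toy masks below stand in for 3D knee MRI segmentations.

```python
# Sketch of Dice overlap and per-voxel majority voting over binary masks.
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice score between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * float(inter) / float(denom) if denom else 1.0

def majority_vote(masks) -> np.ndarray:
    """Label each voxel foreground if a strict majority of models agree."""
    stacked = np.stack(masks).astype(np.int32)
    return (2 * stacked.sum(axis=0) > len(masks)).astype(np.uint8)

rng = np.random.default_rng(0)
preds = [rng.integers(0, 2, size=(8, 8, 8)) for _ in range(5)]  # 5 models
truth = rng.integers(0, 2, size=(8, 8, 8))
ensemble = majority_vote(preds)
print(f"ensemble Dice vs. ground truth: {dice(ensemble, truth):.3f}")
```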
Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
We present the largest and most comprehensive empirical study of pre-trained
visual representations (PVRs) or visual 'foundation models' for Embodied AI.
First, we curate CortexBench, consisting of 17 different tasks spanning
locomotion, navigation, dexterous manipulation, and mobile manipulation. Next, we
systematically evaluate existing PVRs and find that none are universally
dominant.
To study the effect of pre-training data scale and diversity, we combine over
4,000 hours of egocentric videos from 7 different sources (over 5.6M images)
and ImageNet to train different-sized vision transformers using Masked
Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior
work, we find that scaling dataset size and diversity does not improve
performance universally (but does so on average).
Our largest model, named VC-1, outperforms all prior PVRs on average but does
not universally dominate either. Finally, we show that task or domain-specific
adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving
performance competitive with or superior to the best known results on all of the
benchmarks in CortexBench. These models required over 10,000 GPU-hours to train
and can be found on our website for the benefit of the research community. Comment: Project website: https://eai-vc.github.i
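As a hedged sketch of the Masked Auto-Encoding step used for pre-training (the encoder and decoder are omitted, and the mask ratio is illustrative), the code below shows the random-masking bookkeeping: most patch tokens are dropped, and only the visible remainder is encoded.

```python
# Sketch of MAE-style random masking over ViT patch embeddings (PyTorch).
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim). Returns the visible subset the
    encoder sees plus the permutation needed to restore patch order."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)            # one random score per patch
    ids_shuffle = noise.argsort(dim=1)  # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]  # indices of visible patches
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d)
    )
    return visible, ids_shuffle

x = torch.randn(4, 196, 768)  # ViT-B/16 patch embeddings, 224x224 input
visible, ids = random_masking(x)
print(visible.shape)          # torch.Size([4, 49, 768])
```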