
    What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study

    Most existing work on one-stage referring expression comprehension (REC) focuses on multi-modal fusion and reasoning, while the influence of other factors on this task has not been explored in depth. To fill this gap, we conduct an empirical study in this paper. Concretely, we first build a very simple REC network called SimREC and ablate 42 candidate designs/settings, covering the entire process of one-stage REC from network design to model training. We then conduct over 100 experimental trials on three benchmark REC datasets. The extensive experimental results not only reveal the key factors that affect REC performance in addition to multi-modal fusion, e.g., multi-scale features and data augmentation, but also yield findings that run counter to conventional understanding. For example, as a vision and language (V&L) task, REC is less impacted by language priors. In addition, with a proper combination of these findings, we can improve the performance of SimREC by a large margin, e.g., +27.12% on RefCOCO+, outperforming all existing REC methods. But the most encouraging finding is that, with much less training overhead and far fewer parameters, SimREC can still achieve better performance than a set of large-scale pre-trained models, e.g., UNITER and VILLA, highlighting the special role of REC in existing V&L research.
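
    The ablation above centres on design choices such as multi-scale features and language-conditioned fusion. The sketch below illustrates one plausible way to fuse a sentence embedding with visual features at several scales in a one-stage detector head; the module names, dimensions, and gating rule are assumptions for illustration, not the SimREC implementation.

```python
# Minimal sketch of one-stage REC fusion over multi-scale visual features.
# All module names and sizes are illustrative, not the SimREC implementation.
import torch
import torch.nn as nn

class MultiScaleRECHead(nn.Module):
    def __init__(self, vis_dims=(256, 512, 1024), lang_dim=768, hidden=256):
        super().__init__()
        # Project every visual scale and the sentence embedding to a shared space.
        self.vis_proj = nn.ModuleList(nn.Conv2d(d, hidden, 1) for d in vis_dims)
        self.lang_proj = nn.Linear(lang_dim, hidden)
        # Per-location box regression: (cx, cy, w, h) + confidence.
        self.box_head = nn.Conv2d(hidden, 5, 1)

    def forward(self, feats, lang_emb):
        # feats: list of (B, C_i, H_i, W_i); lang_emb: (B, lang_dim)
        lang = self.lang_proj(lang_emb)[:, :, None, None]        # (B, hidden, 1, 1)
        outputs = []
        for proj, f in zip(self.vis_proj, feats):
            fused = proj(f) * torch.sigmoid(lang)                 # language-gated fusion
            outputs.append(self.box_head(fused).flatten(2))       # (B, 5, H_i*W_i)
        return torch.cat(outputs, dim=-1)                         # predictions over all scales

feats = [torch.randn(2, c, s, s) for c, s in [(256, 32), (512, 16), (1024, 8)]]
preds = MultiScaleRECHead()(feats, torch.randn(2, 768))
```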

    HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images

    Visual question answering (VQA) is an important and challenging multimodal task in computer vision. Recently, a few efforts have been made to bring the VQA task to aerial images, owing to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, both the huge variation in the appearance, scale, and orientation of concepts in aerial images and the scarcity of well-annotated datasets restrict the development of VQA in this domain. In this paper, we introduce a new dataset, HRVQA, which provides 53,512 aerial images of 1024×1024 pixels and 1,070,240 semi-automatically generated QA pairs. To benchmark the understanding capability of VQA models on aerial images, we evaluate the relevant methods on HRVQA. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion module. The experiments show that the proposed dataset is quite challenging, especially for attribute-related questions. Our method achieves superior performance in comparison to the previous state-of-the-art approaches. The dataset and the source code will be released at https://hrvqa.nl/.
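
    As a rough illustration of the gated attention fusion the abstract mentions, the sketch below shows a cross-attention block in which question tokens attend over visual tokens and a learned gate controls how much attended evidence passes through. The names, dimensions, and gating rule are assumptions, not the released GFTransformer code.

```python
# Minimal sketch of a gated cross-attention block for image/question fusion.
# The gating rule and dimensions are assumptions, not the released GFTransformer.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_tokens, img_tokens):
        # q_tokens: (B, Lq, dim) question features; img_tokens: (B, Lv, dim) visual features
        attended, _ = self.attn(q_tokens, img_tokens, img_tokens)
        # The gate decides, per position, how much attended visual evidence to pass through.
        g = torch.sigmoid(self.gate(torch.cat([q_tokens, attended], dim=-1)))
        return self.norm(q_tokens + g * attended)

fused = GatedCrossAttention()(torch.randn(2, 20, 512), torch.randn(2, 196, 512))
```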

    Reusable Slotwise Mechanisms

    Agents with the ability to comprehend and reason about the dynamics of objects would be expected to exhibit improved robustness and generalization in novel scenarios. However, achieving this capability requires not only an effective scene representation but also an understanding of the mechanisms governing interactions among subsets of objects. Recent studies have made significant progress in representing scenes using object slots. In this work, we introduce Reusable Slotwise Mechanisms, or RSM, a framework that models object dynamics by leveraging communication among slots along with a modular architecture capable of dynamically selecting reusable mechanisms for predicting the future state of each object slot. Crucially, RSM leverages Central Contextual Information (CCI), enabling the selected mechanisms to access the remaining slots through a bottleneck, effectively allowing the modeling of higher-order and complex interactions that may involve only a sparse subset of objects. Experimental results demonstrate the superior performance of RSM compared to state-of-the-art methods across various future prediction and related downstream tasks, including Visual Question Answering and action planning. Furthermore, we showcase RSM's Out-of-Distribution generalization ability to handle scenes in intricate scenarios.
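
    A minimal sketch of the general idea of selecting reusable per-slot mechanisms with a shared context bottleneck is given below. The mechanism pool, router, soft selection, and context size are illustrative assumptions rather than the RSM implementation, which per the abstract selects mechanisms dynamically and exposes the remaining slots through a bottleneck.

```python
# Minimal sketch of slotwise mechanism selection with a shared context bottleneck.
# Names (mechanism pool, router, context size) are illustrative, not the RSM code.
import torch
import torch.nn as nn

class ReusableSlotwiseStep(nn.Module):
    def __init__(self, slot_dim=64, ctx_dim=32, n_mech=4):
        super().__init__()
        # Bottleneck summarising all slots (stands in for the Central Contextual Information).
        self.to_ctx = nn.Linear(slot_dim, ctx_dim)
        # Pool of reusable mechanisms; each maps (slot, context) -> next slot state.
        self.mechanisms = nn.ModuleList(
            nn.Sequential(nn.Linear(slot_dim + ctx_dim, slot_dim), nn.ReLU(),
                          nn.Linear(slot_dim, slot_dim))
            for _ in range(n_mech))
        self.router = nn.Linear(slot_dim, n_mech)   # scores mechanisms per slot

    def forward(self, slots):
        # slots: (B, K, slot_dim)
        ctx = self.to_ctx(slots.mean(dim=1, keepdim=True)).expand(-1, slots.size(1), -1)
        inp = torch.cat([slots, ctx], dim=-1)
        choice = torch.softmax(self.router(slots), dim=-1)               # soft selection for clarity
        updates = torch.stack([m(inp) for m in self.mechanisms], dim=-1) # (B, K, D, n_mech)
        return slots + (updates * choice.unsqueeze(2)).sum(dim=-1)

next_slots = ReusableSlotwiseStep()(torch.randn(2, 6, 64))
```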

    Visual Question Answering: Exploring Trade-offs Between Task Accuracy and Explainability

    Given visual input and a natural language question about it, the visual question answering (VQA) task is to answer the question correctly. To improve a system's reliability and trustworthiness, it is imperative that it link the text (question and answer) to specific visual regions. This dissertation first explores the VQA task in a multi-modal setting where questions are based on video as well as subtitles. An algorithm is introduced to process each modality, and their features are fused to solve the task. Additionally, to understand the model's emphasis on visual data, this study collects a diagnostic set of questions that strictly require knowledge of the visual input, based on a human annotator's judgment. The next phase of this research deals with grounding in VQA systems without any detectors or object annotations. To this end, weak supervision is employed for grounding by training on the VQA task alone. In the initial part of this study, a rubric is provided to measure grounding performance. This reveals that high accuracy is no guarantee of good grounding, i.e., the system can produce the correct answer despite not attending to the visual evidence. Techniques are introduced to improve VQA grounding by combining attention and capsule networks; this approach benefits the grounding ability of both CNNs and transformers. Lastly, we focus on question answering in videos. By depicting activities and objects as well as their relationships as a graph, a video can be represented compactly while capturing the information necessary to produce an answer. An algorithm is devised that learns to construct such graphs and uses question-to-graph attention; this solution obtains significant improvements on complex reasoning-based questions in the STAR and AGQA benchmarks. Hence, by obtaining higher accuracy and better grounding, this dissertation bridges the gap between task accuracy and explainability of reasoning in VQA systems.
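
    The final part of the dissertation relies on question-to-graph attention over a graph representation of the video. The sketch below shows one simple form of that idea: the question scores each graph node, and the question-conditioned pooled summary feeds an answer classifier. The dimensions, module names, and scoring function are assumptions, not the dissertation's implementation.

```python
# Minimal sketch of question-to-graph attention: the question attends over node
# embeddings of a video graph before answer classification. All sizes are assumed.
import torch
import torch.nn as nn

class QuestionToGraphAttention(nn.Module):
    def __init__(self, node_dim=256, q_dim=256, n_answers=1000):
        super().__init__()
        self.score = nn.Linear(node_dim + q_dim, 1)          # relevance of each node to the question
        self.classifier = nn.Linear(node_dim + q_dim, n_answers)

    def forward(self, nodes, question):
        # nodes: (B, N, node_dim) graph node embeddings; question: (B, q_dim)
        q = question.unsqueeze(1).expand(-1, nodes.size(1), -1)
        weights = torch.softmax(self.score(torch.cat([nodes, q], dim=-1)).squeeze(-1), dim=-1)
        pooled = (weights.unsqueeze(-1) * nodes).sum(dim=1)   # question-conditioned graph summary
        return self.classifier(torch.cat([pooled, question], dim=-1))

logits = QuestionToGraphAttention()(torch.randn(2, 30, 256), torch.randn(2, 256))
```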