20 research outputs found

    ICSVR: Investigating Compositional and Semantic Understanding in Video Retrieval Models

    Full text link
    Video retrieval (VR) involves retrieving the ground truth video from a video database given a text caption, or vice versa. Two important components of compositionality, objects & attributes and actions, are joined using correct semantics to form a proper text query. These components (objects & attributes, actions and semantics) each play an important role in distinguishing among videos and retrieving the correct ground truth video. However, it is unclear what effect these components have on video retrieval performance. We therefore conduct a systematic study to evaluate the compositional and semantic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD and DiDeMo. The study covers two categories of video retrieval models: (i) models pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g., Frozen-in-Time, Violet, MCQ) and (ii) models that adapt pre-trained image-text representations such as CLIP for video retrieval (e.g., CLIP4Clip, XCLIP, CLIP2Video). Our experiments reveal that actions and semantics play a minor role compared to objects & attributes in video understanding. Moreover, video retrieval models that use pre-trained image-text representations (CLIP) have better semantic and compositional understanding than models pre-trained on video-text data.
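
    A minimal sketch of the CLIP-adaptation recipe evaluated in the second model category (e.g., CLIP4Clip-style models): frames and captions are encoded separately, frame features are mean-pooled into a video embedding, and videos are ranked by cosine similarity to the text query. The encoders below are stand-in random projections, not real CLIP weights.

```python
# Illustrative text-to-video retrieval scoring; encoders are hypothetical stand-ins.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EMB_DIM = 512

def encode_frames(frames: torch.Tensor) -> torch.Tensor:
    """Hypothetical frame encoder: (T, 3, H, W) -> (T, EMB_DIM)."""
    return torch.randn(frames.shape[0], EMB_DIM)

def encode_text(num_captions: int) -> torch.Tensor:
    """Hypothetical text encoder producing one embedding per caption."""
    return torch.randn(num_captions, EMB_DIM)

# Ten videos of 8 frames each, and ten text queries.
videos = [torch.rand(8, 3, 224, 224) for _ in range(10)]
video_emb = torch.stack([encode_frames(v).mean(dim=0) for v in videos])  # (10, EMB_DIM)
text_emb = encode_text(10)                                               # (10, EMB_DIM)

# Cosine similarity matrix: rows = text queries, columns = videos.
sim = F.normalize(text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T
ranking = sim.argsort(dim=-1, descending=True)
print("Top-1 retrieved video per query:", ranking[:, 0].tolist())
```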

    Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

    Full text link
    Foundational multimodal models pre-trained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is not always feasible due to the difficulty of collecting large-scale clean and aligned data and the exponential computational costs involved in the pretraining phase. Therefore, the pertinent question to ask is: can image-text models be adapted to video tasks, and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question through a detailed study of the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that includes video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP). Our experiments show that image-text models exhibit impressive performance on video AR, video RT and video MC. Furthermore, they perform moderately on video captioning and poorly on video QA. These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
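
    A minimal sketch of the kind of zero-shot adaptation the study evaluates for video AR: sampled frames are scored against text prompts built from action names using an image-text model (CLIP via Hugging Face here), and frame-level scores are averaged into a video-level prediction. The prompt template, the random stand-in frames and the mean-pooling choice are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

actions = ["archery", "juggling", "playing guitar"]
prompts = [f"a video of a person {a}" for a in actions]

# Stand-in for 8 decoded video frames (replace with real frames from a video decoder).
frames = [Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(8)]

inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image has shape (num_frames, num_prompts); average over frames.
video_scores = out.logits_per_image.softmax(dim=-1).mean(dim=0)
print("Predicted action:", actions[int(video_scores.argmax())])
```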

    COCO-Counterfactuals: Automatically Constructed Counterfactual Examples for Image-Text Pairs

    Full text link
    Counterfactual examples have proven to be valuable in the field of natural language processing (NLP) for both evaluating and improving the robustness of language models to spurious correlations in datasets. Despite their demonstrated utility for NLP, multimodal counterfactual examples have remained relatively unexplored due to the difficulty of creating paired image-text data with minimal counterfactual changes. To address this challenge, we introduce a scalable framework for the automatic generation of counterfactual examples using text-to-image diffusion models. We use our framework to create COCO-Counterfactuals, a multimodal counterfactual dataset of paired images and text captions based on the MS-COCO dataset. We validate the quality of COCO-Counterfactuals through human evaluations and show that existing multimodal models are challenged by our counterfactual image-text pairs. Additionally, we demonstrate the usefulness of COCO-Counterfactuals for improving out-of-domain generalization of multimodal vision-language models via training data augmentation. Comment: Accepted to the NeurIPS 2023 Datasets and Benchmarks Track.
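
    A highly simplified sketch of generating a counterfactual image-text pair with a text-to-image diffusion model: render an original caption and a minimally edited caption from the same seed so the two images differ mainly in the edited concept. The model checkpoint, the shared-seed trick and the GPU assumption are illustrative stand-ins; the paper's pipeline involves additional generation controls and filtering.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

original = "a brown dog sitting on a wooden bench"
counterfactual = "a brown cat sitting on a wooden bench"  # minimal single-concept edit

def render(prompt: str, seed: int = 0):
    # Fixing the seed keeps layout and style similar across the pair.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt, generator=generator).images[0]

render(original).save("original.png")
render(counterfactual).save("counterfactual.png")
```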

    Blockchain-based trust management and authentication of devices in smart grid

    Get PDF
    The digitalization of the power grid and advancements in intelligent technologies have enabled service providers to convert the existing electrical grid into a smart grid. This transformation will help integrate cleaner energy technologies with energy management to improve power network efficiency. Internet of things (IoT) devices and various network components need to be deployed to harness the full potential of the smart grid. Integrating intermittent renewable energy sources, energy storage, intelligent control of selected power-intensive loads, etc., will also improve energy efficiency. However, deploying these information and communication technologies makes the grid more vulnerable to cyber attacks. In this work, a blockchain-based self-sovereign identification and authentication technique is presented to avert identity theft and masquerading. The proposed approach can minimize the chances of identity-based security breaches in the smart grid. This paper provides an overview of a model for the identification and authentication of IoT devices in the smart grid based on blockchain technology. A blockchain-based implementation of device identification and authentication is proposed to validate the model in a distributed electrical energy network. The model is able to authenticate devices in a trusted manner using the blockchain. The system validates the authenticity of a transaction at a node in log(n) time, which supports the presented results.
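
    The abstract reports transaction validation at a node in log(n) time. One standard way a blockchain achieves this is a Merkle proof over the transactions in a block; the sketch below illustrates that mechanism as an assumption, not the paper's actual implementation.

```python
# Merkle-tree inclusion proof: verifying one record touches O(log n) hashes.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root_and_proof(leaves, index):
    """Return the Merkle root and an O(log n) inclusion proof for leaves[index]."""
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:                       # duplicate last node on odd levels
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], index % 2 == 0))   # (hash, sibling-is-right?)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

def verify(leaf, proof, root) -> bool:
    node = h(leaf)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root

txs = [f"device-{i}:auth-record".encode() for i in range(8)]
root, proof = merkle_root_and_proof(txs, index=5)
print("valid:", verify(txs[5], proof, root))     # True, using log2(8) = 3 hashes
```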

    Spectrum Sensing in Cognitive Radio Using CNN-RNN and Transfer Learning

    Get PDF
    Cognitive radio has been proposed to improve spectrum utilization in wireless communication. Spectrum sensing is an essential component of cognitive radio. Traditional methods of spectrum sensing are based on feature extraction from a signal received at a given point. Developments in artificial intelligence and deep learning have provided an opportunity to improve the accuracy of spectrum sensing through cooperative spectrum sensing and analysis of the radio scene. This research proposes a hybrid model of convolutional and recurrent neural networks for spectrum sensing. The research further enhances sensing accuracy for low-SNR signals through transfer learning. The modelling results show an improvement in spectrum sensing using CNN-RNN compared to other models studied in this field. The complexity of the algorithm is analyzed to demonstrate its improved performance.
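
    A minimal PyTorch sketch of a hybrid CNN-RNN spectrum-sensing classifier: 1-D convolutions extract local features from raw I/Q samples and a GRU models the temporal context before a binary "channel occupied / idle" decision. The layer sizes and input format are illustrative assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class CNNRNNSensor(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.rnn = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # logits for occupied vs. idle

    def forward(self, iq: torch.Tensor) -> torch.Tensor:
        # iq: (batch, 2, num_samples) with in-phase and quadrature channels.
        feats = self.cnn(iq)                  # (batch, 64, num_samples // 4)
        feats = feats.transpose(1, 2)         # (batch, time, 64) for the GRU
        _, last = self.rnn(feats)             # last hidden state: (1, batch, hidden)
        return self.head(last.squeeze(0))     # (batch, 2)

model = CNNRNNSensor()
logits = model(torch.randn(8, 2, 1024))       # batch of 8 windows of 1024 samples
print(logits.shape)                           # torch.Size([8, 2])
```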

    NeuroComparatives: Neuro-Symbolic Distillation of Comparative Knowledge

    Full text link
    Comparative knowledge (e.g., steel is stronger and heavier than styrofoam) is an essential component of our world knowledge, yet it is understudied in prior literature. In this paper, we study the task of comparative knowledge acquisition, motivated by the dramatic improvements in the capabilities of extreme-scale language models like GPT-4, which have fueled efforts towards harvesting their knowledge into knowledge bases. While acquiring such comparative knowledge is much easier from models like GPT-4 than from their considerably smaller and weaker counterparts such as GPT-2, not even the most powerful models are exempt from making errors. We thus ask: to what extent are models at different scales able to generate valid and diverse comparative knowledge? We introduce NeuroComparatives, a novel framework for comparative knowledge distillation in which knowledge is overgenerated from language models such as GPT variants and Llama and then subjected to stringent filtering. Our framework acquires comparative knowledge between everyday objects, producing a corpus of up to 8.8M comparisons over 1.74M entity pairs - 10X larger and 30% more diverse than existing resources. Moreover, human evaluations show that NeuroComparatives outperform existing resources (up to a 32% absolute improvement). We also demonstrate the utility of our distilled NeuroComparatives on three downstream tasks. Our results show that neuro-symbolic manipulation of smaller models offers complementary benefits to the currently dominant practice of prompting extreme-scale language models for knowledge distillation.
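
    An illustrative sketch of the overgenerate-then-filter recipe: sample many candidate comparative statements about an entity pair from a (small) language model, then keep only the candidates that pass validity filters. The prompt, the GPT-2 model choice and the crude regex filters are stand-in assumptions; the paper's constrained decoding and filtering are far stricter.

```python
import re
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def overgenerate(entity_a: str, entity_b: str, n: int = 20):
    # Overgeneration step: many diverse samples from one comparative prompt.
    prompt = f"Compared to {entity_b}, {entity_a} is"
    outputs = generator(prompt, max_new_tokens=12, num_return_sequences=n,
                        do_sample=True, temperature=0.9)
    return [o["generated_text"] for o in outputs]

def keep(statement: str, entity_a: str, entity_b: str) -> bool:
    # Crude filters: mentions both entities and uses a comparative construction.
    s = statement.lower()
    has_entities = entity_a in s and entity_b in s
    has_comparative = bool(re.search(r"\b(\w+er|more \w+|less \w+)\b", s))
    return has_entities and has_comparative

candidates = overgenerate("steel", "styrofoam")
comparisons = [c for c in candidates if keep(c, "steel", "styrofoam")]
print(f"kept {len(comparisons)} of {len(candidates)} candidates")
```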

    MuMUR : Multilingual Multimodal Universal Retrieval

    Full text link
    Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose MuMUR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval. We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs. We then use this data to learn a joint vision-text representation in which English and non-English text queries are represented in a common embedding space based on pretrained multilingual models. We evaluate our proposed approach on a diverse set of retrieval datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades and multilingual MSRVTT) and two image retrieval datasets (Flickr30k and Multi30k). Experimental results demonstrate that our approach achieves state-of-the-art results on all video retrieval datasets, outperforming previous models. Additionally, MuMUR significantly outperforms prior models on multilingual video retrieval. We also observe that MuMUR exhibits strong performance on image retrieval. This demonstrates the universal ability of MuMUR to perform retrieval across all visual inputs (image and video) and text inputs (monolingual and multilingual). Comment: This is an extension of the previous MKTVR paper (a reference is available here: https://dl.acm.org/doi/abs/10.1007/978-3-031-28244-7_42 or in a previous version on arXiv). This version was published in the Information Retrieval Journal.
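
    A minimal sketch of the training signal the abstract describes: captions are machine-translated into other languages to form pseudo ground-truth pairs, and a shared embedding space is learned with a symmetric contrastive loss so English and non-English queries align with the same visual embedding. The translate() stub and the random encoder outputs are illustrative stand-ins, not the paper's models.

```python
import torch
import torch.nn.functional as F

def translate(caption: str, lang: str) -> str:
    return f"[{lang}] {caption}"          # stand-in for a real machine translation model

def contrastive_loss(text_emb: torch.Tensor, vis_emb: torch.Tensor, temp: float = 0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    vis_emb = F.normalize(vis_emb, dim=-1)
    logits = text_emb @ vis_emb.T / temp                  # (batch, batch)
    targets = torch.arange(logits.size(0))
    # Symmetric InfoNCE: text-to-visual plus visual-to-text.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

batch, dim = 16, 256
captions = [translate(f"caption {i}", "fr") for i in range(batch)]
text_emb = torch.randn(batch, dim, requires_grad=True)    # multilingual text encoder output
vis_emb = torch.randn(batch, dim, requires_grad=True)     # image/video encoder output
loss = contrastive_loss(text_emb, vis_emb)
loss.backward()
print(float(loss))
```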

    ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

    Full text link
    Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream VL tasks. Although the most advanced work improves performance by building bridges between encoders, it suffers from ineffective layer-by-layer utilization of uni-modal representations and cannot flexibly exploit different levels of uni-modal semantic knowledge. In this work, we propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels. The managers introduced in each cross-modal layer can adaptively aggregate uni-modal semantic knowledge to facilitate more comprehensive cross-modal alignment and fusion. ManagerTower outperforms previous strong baselines both with and without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower achieves superior performance on various downstream VL tasks, notably 79.15% accuracy on VQAv2 Test-Std, and 86.56% IR@1 and 95.64% TR@1 on Flickr30K. Code and checkpoints are available at https://github.com/LooperXX/ManagerTower. Comment: Accepted by ACL 2023 Main Conference, Oral.
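
    A sketch of the aggregation idea behind the "managers": each cross-modal layer receives the outputs of all uni-modal encoder layers and combines them with learned, input-adaptive weights instead of using only the top layer. The gating design below is an illustrative assumption, not the published architecture.

```python
import torch
import torch.nn as nn

class Manager(nn.Module):
    def __init__(self, num_expert_layers: int, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_expert_layers)   # one weight per expert layer

    def forward(self, expert_states: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # expert_states: (layers, batch, seq, dim) from a pre-trained uni-modal encoder
        # query: (batch, seq, dim) hidden state of the current cross-modal layer
        weights = torch.softmax(self.gate(query), dim=-1)        # (batch, seq, layers)
        weights = weights.permute(2, 0, 1).unsqueeze(-1)         # (layers, batch, seq, 1)
        return (weights * expert_states).sum(dim=0)              # (batch, seq, dim)

layers, batch, seq, dim = 12, 4, 32, 768
manager = Manager(layers, dim)
fused = manager(torch.randn(layers, batch, seq, dim), torch.randn(batch, seq, dim))
print(fused.shape)   # torch.Size([4, 32, 768])
```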

    VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

    Full text link
    Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems. However, although visualization and interpretability tools have become available for NLP models, the internal mechanisms of vision and multimodal transformers remain largely opaque. With the success of these transformers, it is increasingly critical to understand their inner workings, as unraveling these black boxes will lead to more capable and trustworthy models. To contribute to this quest, we propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task-agnostic and integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT through the analysis of KD-VLP, an end-to-end pretrained vision-language multimodal transformer-based model, on Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. Furthermore, we also present a few interesting findings about multimodal transformer behaviors that were learned through our tool. Comment: CVPR 2022 demo track.
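
    A minimal sketch of the kind of view described in item (2): a cross-modal attention map rendered as a heatmap, with language tokens on one axis and image patches on the other. The attention matrix here is random; in the tool it would come from a chosen head and layer of the multimodal transformer.

```python
import numpy as np
import matplotlib.pyplot as plt

tokens = ["a", "dog", "on", "a", "bench"]
num_patches = 16                                    # e.g. a 4x4 patch grid
attention = np.random.dirichlet(np.ones(num_patches), size=len(tokens))

fig, ax = plt.subplots(figsize=(6, 3))
im = ax.imshow(attention, aspect="auto", cmap="viridis")
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("image patch index")
ax.set_ylabel("language token")
fig.colorbar(im, ax=ax, label="attention weight")
plt.tight_layout()
plt.savefig("cross_modal_attention.png")
```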

    Tpl2 is required for VEGF-A-stimulated signal transduction and endothelial cell function

    Get PDF
    New blood vessel sprouting (angiogenesis) and vascular physiology are fundamental features of metazoan species, but we do not fully understand how signal transduction pathways regulate diverse vascular responses. The vascular endothelial growth factor (VEGF) family binds membrane-bound receptor tyrosine kinases (VEGFRs), which trigger multiple signal transduction pathways and diverse cellular responses. We evaluated whether the MAP3K family member and proto-oncoprotein Tpl2 (MAP3K8) regulates basal and VEGF-A-stimulated signal transduction in endothelial cells. Notably, stimulation with exogenous VEGF-A increased Tpl2 mRNA levels and consequently de novo protein synthesis. Depletion of Tpl2 reveals a role in both basal and VEGF-A-stimulated endothelial cell responses, including endothelial-leukocyte interactions, monolayer permeability and new blood vessel formation. Under basal conditions, Tpl2 modulates a signal transduction cascade resulting in phosphorylation of a nuclear transcription factor (ATF-2) and altered endothelial gene expression, a pathway previously identified as crucial in VEGF-dependent vascular responses. Loss of Tpl2 expression or activity impairs signal transduction through Akt, eNOS and ATF-2, broadly impacting endothelial function. Our study thus provides a mechanism for Tpl2 as a central component of signal transduction pathways in the endothelium.