20 research outputs found
ICSVR: Investigating Compositional and Semantic Understanding in Video Retrieval Models
Video retrieval (VR) involves retrieving the ground-truth video from a
video database given a text caption, or vice versa. The two important components
of compositionality, objects & attributes and actions, are joined using correct
semantics to form a proper text query. These components (objects & attributes,
actions and semantics) each play an important role in distinguishing among
videos and retrieving the correct ground-truth video. However, it is unclear
what effect these components have on video retrieval performance. We
therefore conduct a systematic study to evaluate the compositional and
semantic understanding of video retrieval models on standard benchmarks such as
MSRVTT, MSVD and DiDeMo. The study is performed on two categories of video
retrieval models: (i) models pre-trained on video-text pairs and fine-tuned
on downstream video retrieval datasets (e.g., Frozen-in-Time, Violet, MCQ), and
(ii) models that adapt pre-trained image-text representations such as CLIP for
video retrieval (e.g., CLIP4Clip, XCLIP, CLIP2Video). Our experiments reveal that
actions and semantics play a minor role compared to objects & attributes in
video understanding. Moreover, video retrieval models that use pre-trained
image-text representations (CLIP) have better semantic and compositional
understanding as compared to models pre-trained on video-text data.
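The kind of compositional probe described above can be illustrated with a minimal sketch: perturb a caption so that only one component (here, the action) changes, and check whether the retrieval score drops. The toy word-overlap scorer and the tiny video-tag "database" below are hypothetical stand-ins, not the paper's evaluation code or models.

```python
# Minimal sketch of a compositional probe for text-video retrieval.
# All data and the scoring function are illustrative stand-ins, not
# the actual ICSVR evaluation.

def perturb_caption(caption, swaps):
    """Replace selected words to break the action (or object) component."""
    return " ".join(swaps.get(w, w) for w in caption.split())

def bag_of_words_score(query, video_tags):
    """Toy relevance score: word overlap between caption and video tags."""
    q = set(query.lower().split())
    return len(q & video_tags) / max(len(q), 1)

video_tags = {"v1": {"dog", "runs", "park"}, "v2": {"cat", "sleeps", "sofa"}}
caption = "dog runs park"

# The original caption should retrieve v1 with the highest score.
orig = {v: bag_of_words_score(caption, t) for v, t in video_tags.items()}

# Swapping only the action word should lower the match for v1 if the
# scorer (standing in for a retrieval model) is sensitive to actions.
perturbed = perturb_caption(caption, {"runs": "sleeps"})
pert = {v: bag_of_words_score(perturbed, t) for v, t in video_tags.items()}
```

Comparing `orig` and `pert` per component (objects, attributes, actions) is the pattern the study applies at scale to real retrieval models.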
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Foundational multimodal models pre-trained on large scale image-text pairs or
video-text pairs or both have shown strong generalization abilities on
downstream tasks. However, unlike image-text models, pretraining video-text
models is not always feasible due to the difficulty of collecting large-scale
clean and aligned data, and exponential computational costs involved in the
pretraining phase. Therefore, the pertinent question to ask is: Can image-text
models be adapted to video tasks and is there any benefit to using these models
over pretraining directly on videos? In this work, we focus on this question by
proposing a detailed study on the generalization abilities of image-text models
when evaluated on video understanding tasks in a zero-shot setting. We
investigate 9 foundational image-text models on a diverse set of video tasks
that include video action recognition (video AR), video retrieval (video RT),
video question answering (video QA), video multiple choice (video MC) and video
captioning (video CP). Our experiments show that image-text models exhibit
impressive performance on video AR, video RT and video MC. Furthermore, they
perform moderately on video captioning and poorly on video QA. These findings
shed light on the benefits of adapting foundational image-text models to an
array of video tasks while avoiding the costly pretraining step.
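A common way to adapt an image-text model to zero-shot video action recognition, consistent with the setting above, is to mean-pool per-frame image embeddings and match the result against class-name text embeddings. The sketch below uses random vectors as stand-ins for real CLIP-style features; only the pooling-and-matching logic is the point.

```python
# Hedged sketch: zero-shot video action recognition from an image-text
# model by mean-pooling frame embeddings. Embeddings are random stand-ins
# for real encoder outputs.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-in frame-encoder output: 8 frames x 512-dim embeddings.
frame_embs = l2_normalize(rng.normal(size=(8, 512)))

# Stand-in text embeddings for 3 action-class prompts.
class_embs = l2_normalize(rng.normal(size=(3, 512)))

# Video embedding = mean over frames, re-normalized.
video_emb = l2_normalize(frame_embs.mean(axis=0))

# Zero-shot prediction = nearest class by cosine similarity.
scores = class_embs @ video_emb
pred = int(np.argmax(scores))
```

No video-specific training is involved, which is exactly why this adaptation route avoids the costly video-text pretraining step.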
COCO-Counterfactuals: Automatically Constructed Counterfactual Examples for Image-Text Pairs
Counterfactual examples have proven to be valuable in the field of natural
language processing (NLP) for both evaluating and improving the robustness of
language models to spurious correlations in datasets. Despite their
demonstrated utility for NLP, multimodal counterfactual examples have been
relatively unexplored due to the difficulty of creating paired image-text data
with minimal counterfactual changes. To address this challenge, we introduce a
scalable framework for automatic generation of counterfactual examples using
text-to-image diffusion models. We use our framework to create
COCO-Counterfactuals, a multimodal counterfactual dataset of paired image and
text captions based on the MS-COCO dataset. We validate the quality of
COCO-Counterfactuals through human evaluations and show that existing
multimodal models are challenged by our counterfactual image-text pairs.
Additionally, we demonstrate the usefulness of COCO-Counterfactuals for
improving out-of-domain generalization of multimodal vision-language models via
training data augmentation.
Comment: Accepted to NeurIPS 2023 Datasets and Benchmarks Track.
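The text half of the counterfactual-pair construction can be sketched as a minimal caption edit: change exactly one token so the caption (and the image later rendered from it by a text-to-image diffusion model, omitted here) differs in a single controlled respect. The helper and swap table are hypothetical illustrations, not the paper's pipeline.

```python
# Sketch of minimal counterfactual caption editing. The diffusion-based
# image generation step is omitted; the swap table is illustrative.

def counterfactual_caption(caption: str, swap: dict) -> str:
    """Replace the first word found in `swap`, leaving the rest untouched."""
    words = caption.split()
    for i, w in enumerate(words):
        if w in swap:
            words[i] = swap[w]
            break
    return " ".join(words)

original = "a dog sitting on a bench in the park"
edited = counterfactual_caption(original, {"dog": "cat"})
# The two captions differ in a single token; rendering each caption with
# the same diffusion seed would yield a minimally different image pair.
```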
Blockchain-based trust management and authentication of devices in smart grid
The digitalization of the power grid and advances in intelligent technologies have enabled service providers to convert the existing electrical grid into a smart grid. This transformation will help integrate cleaner energy technologies with energy management to improve power-network efficiency. Internet of things (IoT) devices and various network components need to be deployed to harness the full potential of the smart grid. Integrating intermittent renewable energy sources, energy storage, intelligent control of selected power-intensive loads, etc., will also improve energy efficiency. However, deploying these information and communication technologies makes the grid more vulnerable to cyber attacks. In this work, a blockchain-based self-sovereign identification and authentication technique is presented to avert identity theft and masquerading. The proposed approach can minimize the chances of identity-based security breaches in the smart grid. This paper provides an overview of a model for the identification and authentication of IoT devices in the smart grid based on blockchain technology. A blockchain-based implementation of device identification and authentication is proposed to validate the model in a distributed electrical energy network. The model is able to authenticate a device using blockchain in a trusted manner, validating the authenticity of a transaction in a node in O(log n) time, which the presented results justify.
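One standard mechanism behind O(log n) transaction authentication in a blockchain is a Merkle tree: a membership proof touches one sibling hash per tree level. The toy tree below illustrates that property only; it is not the paper's smart-grid implementation, and the "meter"/"relay" records are invented examples.

```python
# Merkle-tree membership proof: verifying one of n records needs only
# log2(n) sibling hashes. Illustrative sketch, not the paper's system.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """Build all tree levels bottom-up; len(leaves) assumed a power of two."""
    levels = [[h(x) for x in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def proof(levels, index):
    """Sibling hashes from leaf to root: one per level, i.e. O(log n)."""
    path = []
    for level in levels[:-1]:
        sib = index ^ 1
        path.append((level[sib], sib % 2))  # (hash, 1 if sibling is right)
        index //= 2
    return path

def verify(leaf, path, root):
    node = h(leaf)
    for sib, sib_is_right in path:
        node = h(node + sib) if sib_is_right else h(sib + node)
    return node == root

txs = [b"meter-42 reading", b"meter-7 reading", b"relay-3 auth", b"ev-9 charge"]
levels = build_levels(txs)
root = levels[-1][0]
p = proof(levels, 2)
ok = verify(txs[2], p, root)  # proof has log2(4) = 2 sibling hashes
```

A node therefore stores only the block's Merkle root and checks any single device transaction against it in logarithmic time.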
Spectrum Sensing in Cognitive Radio Using CNN-RNN and Transfer Learning
Cognitive radio has been proposed to improve spectrum utilization in wireless communication, and spectrum sensing is an essential component of cognitive radio. Traditional methods of spectrum sensing are based on feature extraction from a signal received at a given point. Developments in artificial intelligence and deep learning have created an opportunity to improve the accuracy of spectrum sensing by using cooperative spectrum sensing and analyzing the radio scene. This research proposes a hybrid model of convolutional and recurrent neural networks for spectrum sensing. It further enhances sensing accuracy for low-SNR signals through transfer learning. Modelling results show an improvement in spectrum sensing using the CNN-RNN compared to other models studied in this field. The complexity of the algorithm is analyzed to show an improvement in its performance.
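The hybrid idea can be sketched as a forward pass: a 1-D convolution extracts local features from the received samples, a recurrent layer summarizes them over time, and a threshold on the final state gives an occupied/idle decision. All weights below are random stand-ins; the paper's trained architecture and transfer-learning setup are not reproduced.

```python
# Toy CNN-RNN forward pass for spectrum sensing (random weights,
# illustrative only).
import numpy as np

rng = np.random.default_rng(1)

def conv1d(x, kernels):
    """Valid 1-D convolution: (T,) signal, (K, W) kernels -> (K, T-W+1)."""
    K, W = kernels.shape
    T = len(x)
    return np.array([[k @ x[t:t + W] for t in range(T - W + 1)] for k in kernels])

def rnn_last_state(feats, Wx, Wh):
    """Tanh RNN over time; feats is (K, T'); returns the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for t in range(feats.shape[1]):
        h = np.tanh(Wx @ feats[:, t] + Wh @ h)
    return h

# Noisy sinusoid standing in for received I/Q magnitude samples.
signal = np.sin(np.linspace(0, 20 * np.pi, 256)) + 0.1 * rng.normal(size=256)
kernels = rng.normal(size=(4, 8))                    # CNN feature extractor
Wx, Wh = rng.normal(size=(6, 4)), rng.normal(size=(6, 6))  # RNN weights

feats = conv1d(signal, kernels)
h = rnn_last_state(feats, Wx, Wh)
occupied = float(h.mean()) > 0.0   # stand-in decision threshold
```

In the transfer-learning variant, the convolutional weights would be initialized from a model trained at higher SNR rather than drawn at random.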
NeuroComparatives: Neuro-Symbolic Distillation of Comparative Knowledge
Comparative knowledge (e.g., steel is stronger and heavier than styrofoam) is
an essential component of our world knowledge, yet understudied in prior
literature. In this paper, we study the task of comparative knowledge
acquisition, motivated by the dramatic improvements in the capabilities of
extreme-scale language models like GPT-4, which have fueled efforts towards
harvesting their knowledge into knowledge bases. While acquisition of such
comparative knowledge is much easier from models like GPT-4, compared to their
considerably smaller and weaker counterparts such as GPT-2, not even the most
powerful models are exempt from making errors. We thus ask: to what extent are
models at different scales able to generate valid and diverse comparative
knowledge?
We introduce NeuroComparatives, a novel framework for comparative knowledge
distillation that overgenerates candidates from language models such as GPT
variants and LLaMA, followed by stringent filtering of the generated knowledge. Our framework
acquires comparative knowledge between everyday objects, producing a corpus of
up to 8.8M comparisons over 1.74M entity pairs - 10X larger and 30% more
diverse than existing resources. Moreover, human evaluations show that
NeuroComparatives outperform existing resources (up to 32% absolute
improvement). We also demonstrate the utility of our distilled
NeuroComparatives on three downstream tasks. Our results show that
neuro-symbolic manipulation of smaller models offers complementary benefits to
the currently dominant practice of prompting extreme-scale language models for
knowledge distillation.
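The overgenerate-then-filter pattern at the heart of the framework can be sketched in a few lines: enumerate many candidate comparative statements (standing in for LM sampling), then keep only those that pass validity filters. The entities, adjectives, and filters below are invented toys, not the paper's actual constraints.

```python
# Sketch of overgenerate-then-filter comparative knowledge distillation.
# Entities, adjectives, and the filters are illustrative stand-ins.
from itertools import product

entities = ["steel", "styrofoam", "glass"]
adjectives = ["stronger", "heavier", "softer"]

# Overgeneration: every (a, adjective, b) triple, including invalid ones.
candidates = [f"{a} is {adj} than {b}"
              for a, adj, b in product(entities, adjectives, entities)]

def valid(statement):
    a, _, adj, _, b = statement.split()
    if a == b:                               # no self-comparisons
        return False
    if adj == "softer" and a == "steel":     # toy factual filter
        return False
    return True

kept = [s for s in candidates if valid(s)]
```

The real pipeline replaces the template enumeration with constrained LM decoding and the toy filter with stringent linguistic and consistency checks, but the generate-wide/filter-hard structure is the same.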
MuMUR : Multilingual Multimodal Universal Retrieval
Multi-modal retrieval has seen tremendous progress with the development of
vision-language models. However, further improving these models requires
additional labelled data, which entails a huge manual effort. In this paper, we
propose a framework MuMUR, that utilizes knowledge transfer from a multilingual
model to boost the performance of multi-modal (image and video) retrieval. We
first use state-of-the-art machine translation models to construct pseudo
ground-truth multilingual visual-text pairs. We then use this data to learn a
joint vision-text representation where English and non-English text queries are
represented in a common embedding space based on pretrained multilingual
models. We evaluate our proposed approach on a diverse set of retrieval
datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades
and MSRVTT multilingual) and two image retrieval datasets (Flickr30k and
Multi30k). Experimental results demonstrate that our approach achieves
state-of-the-art results on all video retrieval datasets, outperforming previous
models. Additionally, our framework MuMUR significantly outperforms prior models
on multilingual video retrieval. We also observe that MuMUR exhibits
strong performance on image retrieval. This demonstrates the universal ability
of MuMUR to perform retrieval across all visual inputs (image and video) and
text inputs (monolingual and multilingual).
Comment: This is an extension of the previous MKTVR paper (for which a
reference can be found here:
https://dl.acm.org/doi/abs/10.1007/978-3-031-28244-7_42, or in a previous
version on arXiv). This version was published in the Information Retrieval
Journal.
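The shared-embedding-space idea above can be sketched with toy vectors: queries in any language are embedded into the same space as the visual items and ranked by cosine similarity, so a caption and its translation should retrieve the same video. The hand-crafted embeddings below are stand-ins for the outputs of the pretrained multilingual text encoder and the visual encoder.

```python
# Toy retrieval in a shared multilingual embedding space. Embeddings are
# hand-crafted stand-ins, not real encoder outputs.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Shared space: two videos, 4-dim embeddings.
videos = normalize(np.array([[1.0, 0.2, 0.0, 0.0],
                             [0.0, 0.1, 1.0, 0.3]]))

# An English query and its (pseudo-)translated counterpart should embed
# close to each other, and both close to video 0.
query_en = normalize(np.array([0.9, 0.3, 0.1, 0.0]))
query_de = normalize(np.array([0.95, 0.25, 0.05, 0.0]))

rank_en = np.argsort(-(videos @ query_en))  # best match first
rank_de = np.argsort(-(videos @ query_de))
```

Training pulls each pseudo-translated caption toward the same video as its English source, which is what makes the monolingual and multilingual queries interchangeable at retrieval time.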
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Two-Tower Vision-Language (VL) models have shown promising improvements on
various downstream VL tasks. Although the most advanced work improves
performance by building bridges between encoders, it suffers from ineffective
layer-by-layer utilization of uni-modal representations and cannot flexibly
exploit different levels of uni-modal semantic knowledge. In this work, we
propose ManagerTower, a novel VL model architecture that gathers and combines
the insights of pre-trained uni-modal experts at different levels. The managers
introduced in each cross-modal layer can adaptively aggregate uni-modal
semantic knowledge to facilitate more comprehensive cross-modal alignment and
fusion. ManagerTower outperforms previous strong baselines both with and
without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower
achieves superior performance on various downstream VL tasks, notably
79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1 and 95.64% TR@1 on Flickr30K.
Code and checkpoints are available at https://github.com/LooperXX/ManagerTower.
Comment: Accepted by ACL 2023 Main Conference, Oral.
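The core "manager" idea of aggregating uni-modal knowledge from several encoder layers, rather than using only the top layer, can be sketched as a learned convex combination over layer outputs. The random features and gate weights below are stand-ins; ManagerTower's actual manager modules are more elaborate.

```python
# Sketch of adaptive multi-layer aggregation (the "manager" idea).
# Features and gate weights are random stand-ins.
import numpy as np

rng = np.random.default_rng(7)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

L, T, D = 6, 4, 16                        # layers, tokens, hidden dim
layer_feats = rng.normal(size=(L, T, D))  # one feature map per encoder layer
gate_logits = rng.normal(size=L)          # learned per-layer importance

w = softmax(gate_logits)                  # convex combination over layers
aggregated = np.einsum("l,ltd->td", w, layer_feats)
```

Because the weights are learned per cross-modal layer, each fusion step can draw on whichever uni-modal semantic level helps it most, rather than a fixed choice.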
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
Breakthroughs in transformer-based models have revolutionized not only the
NLP field, but also vision and multimodal systems. However, although
visualization and interpretability tools have become available for NLP models,
internal mechanisms of vision and multimodal transformers remain largely
opaque. With the success of these transformers, it is increasingly critical to
understand their inner workings, as unraveling these black-boxes will lead to
more capable and trustworthy models. To contribute to this quest, we propose
VL-InterpreT, which provides novel interactive visualizations for interpreting
the attentions and hidden representations in multimodal transformers.
VL-InterpreT is a task-agnostic and integrated tool that (1) tracks a variety
of statistics in attention heads throughout all layers for both vision and
language components, (2) visualizes cross-modal and intra-modal attentions
through easily readable heatmaps, and (3) plots the hidden representations of
vision and language tokens as they pass through the transformer layers. In this
paper, we demonstrate the functionalities of VL-InterpreT through the analysis
of KD-VLP, an end-to-end pretraining vision-language multimodal
transformer-based model, in the tasks of Visual Commonsense Reasoning (VCR) and
WebQA, two visual question answering benchmarks. We also present a
few interesting findings about multimodal transformer behaviors that were
learned through our tool.
Comment: CVPR 2022 demo track.
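One example of the attention-head statistics such a tool can track is the fraction of each head's attention mass that crosses modalities (vision tokens attending to language tokens and vice versa). The random attention tensor below is a stand-in for values read out of a real forward pass.

```python
# Per-head cross-modal attention fraction over a joint vision+language
# token sequence. The attention tensor is a random stand-in.
import numpy as np

rng = np.random.default_rng(3)
H, V, Lg = 4, 5, 7                 # heads, vision tokens, language tokens
N = V + Lg

attn = rng.random(size=(H, N, N))
attn = attn / attn.sum(axis=-1, keepdims=True)   # each row sums to 1

is_vision = np.arange(N) < V
cross = is_vision[:, None] ^ is_vision[None, :]  # True where modalities differ

# Mean cross-modal attention mass per head, averaged over query tokens.
cross_frac = (attn * cross).sum(axis=-1).mean(axis=-1)
```

Plotting such per-head, per-layer statistics is how heads that specialize in cross-modal alignment can be spotted at a glance.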
Tpl2 is required for VEGF-A-stimulated signal transduction and endothelial cell function
New blood vessel sprouting (angiogenesis) and vascular physiology are fundamental features of metazoan species, but we do not fully understand how signal transduction pathways regulate diverse vascular responses. The vascular endothelial growth factor (VEGF) family binds membrane-bound receptor tyrosine kinases (VEGFRs), which trigger multiple signal transduction pathways and diverse cellular responses. We evaluated whether the MAP3K family member and proto-oncoprotein Tpl2 (MAP3K8) regulates basal and VEGF-A-stimulated signal transduction in endothelial cells. Notably, stimulation with exogenous VEGF-A increased Tpl2 mRNA levels and consequently de novo protein synthesis. Depletion of Tpl2 reveals a role in both basal and VEGF-A-stimulated endothelial cell responses, including endothelial-leukocyte interactions, monolayer permeability and new blood vessel formation. Under basal conditions, Tpl2 modulates a signal transduction cascade resulting in phosphorylation of a nuclear transcription factor (ATF-2) and altered endothelial gene expression, a pathway previously identified as crucial in VEGF-dependent vascular responses. Loss of Tpl2 expression or activity impairs signal transduction through Akt, eNOS and ATF-2, broadly impacting endothelial function. Our study thus provides a mechanism for Tpl2 as a central component of signal transduction pathways in the endothelium.