79 research outputs found
Towards Robust Referring Image Segmentation
Referring Image Segmentation (RIS) is a fundamental vision-language task that connects image and language by outputting the object mask corresponding to a given text description. Although many works have achieved considerable progress on RIS, in this work we explore an essential question: "what if the text description is wrong or misleading?". We term such a sentence a negative sentence, and we find that existing works cannot handle this setting. To this end, we propose a novel formulation of RIS, named Robust Referring Image Segmentation (R-RIS), which considers negative sentence inputs in addition to the regular text inputs. We present three different datasets built by augmenting the inputs with negative sentences, along with a new metric that unifies both input types. Furthermore, we design a new transformer-based model named RefSegformer, in which we introduce a token-based vision and language fusion module. This module can be easily extended to our R-RIS setting by adding extra blank tokens. RefSegformer achieves new state-of-the-art results on three regular RIS datasets and three R-RIS datasets, serving as a solid baseline for further research. The project page is at \url{https://lxtgh.github.io/project/robust_ref_seg/}.
Comment: technical report
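To make the token-based fusion idea concrete, the following is a minimal PyTorch sketch of a vision-language fusion module extended with blank tokens; the class name, dimensions, and attention layout are illustrative assumptions, not the authors' released RefSegformer implementation:

import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Cross-attention fusion with extra blank tokens (illustrative)."""
    def __init__(self, dim=256, num_heads=8, num_blank_tokens=4):
        super().__init__()
        # Learnable blank tokens that can absorb attention when the
        # sentence matches no object (the R-RIS negative-sentence case).
        self.blank_tokens = nn.Parameter(torch.zeros(1, num_blank_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens, language_tokens):
        # vision_tokens:   (B, N_v, dim) flattened image features
        # language_tokens: (B, N_l, dim) encoded text features
        b = vision_tokens.size(0)
        keys = torch.cat([language_tokens,
                          self.blank_tokens.expand(b, -1, -1)], dim=1)
        # Vision tokens attend over language + blank tokens; attention
        # mass on the blanks signals a non-matching (negative) sentence.
        fused, attn_weights = self.attn(vision_tokens, keys, keys)
        return fused, attn_weights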
PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter
The Retrieval Question Answering (ReQA) task employs a retrieval-augmented framework composed of a retriever and a generator. The generator formulates the answer based on the documents retrieved by the retriever. Incorporating Large Language Models (LLMs) as generators is beneficial due to their advanced QA capabilities, but they are typically too large to be fine-tuned under budget constraints, and some are only accessible via APIs. To tackle this issue and further improve ReQA performance, we propose a trainable Pluggable Reward-Driven Contextual Adapter (PRCA) that keeps the generator as a black box. Positioned between the retriever and the generator in a pluggable manner, PRCA refines the retrieved information with a token-autoregressive strategy, maximizing rewards during a reinforcement learning phase. Our experiments validate PRCA's effectiveness, enhancing ReQA performance by up to 20% on three datasets while fitting black-box LLMs into existing frameworks, demonstrating its considerable potential in the LLM era.
Comment: Accepted by the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
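To illustrate where the adapter sits, here is a minimal Python sketch of the PRCA-style inference flow; retriever, adapter, and call_llm_api are hypothetical stand-ins, and the reinforcement learning loop that trains the adapter is omitted:

def answer_question(question, retriever, adapter, call_llm_api, top_k=5):
    # 1) Retrieve candidate passages for the question.
    passages = retriever.retrieve(question, top_k=top_k)
    # 2) The adapter autoregressively distills the passages into a
    #    compact context; only the adapter is trainable, optimized with
    #    rewards derived from the black-box generator's answer quality.
    context = adapter.generate(question=question, passages=passages)
    # 3) The frozen, black-box generator answers from the refined context.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return call_llm_api(prompt)

Because gradients never flow through the generator, the adapter can be paired with API-only LLMs, which is the point of keeping the generator a black box.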
Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval
Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, poses a great challenge due to the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in two ways: 1) most use contrastive learning to construct a common subspace where similarities among data can be measured, but they consider only cross-modal transformation and neglect intra-modal separability; moreover, the temperature parameter is not adaptively adjusted with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, latent representation reconstruction modules are embedded into the CMR framework, improving modal interaction. Experiments comparing CLSR with several state-of-the-art methods on two audio-text datasets validate its superiority.
Comment: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
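As a sketch of the two ingredients named in the abstract, the following PyTorch loss combines cross-modal contrastive learning with intra-modal separability terms and a learnable temperature; the weighting and exact formulation are assumptions, not the authors' code:

import torch
import torch.nn.functional as F

def clsr_style_loss(audio_emb, text_emb, log_temp):
    # audio_emb, text_emb: (B, D) embeddings of matched audio-text pairs
    # log_temp: learnable scalar tensor, so the temperature adapts in training
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    temp = log_temp.exp()
    labels = torch.arange(a.size(0), device=a.device)
    # Cross-modal InfoNCE: matched pairs are positives.
    logits = a @ t.T / temp
    cross = F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)
    # Intra-modal separability: push apart distinct samples within a modality.
    intra = F.cross_entropy(a @ a.T / temp, labels) \
          + F.cross_entropy(t @ t.T / temp, labels)
    return cross + 0.5 * intra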
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
In this work, we focus on open vocabulary instance segmentation, expanding a segmentation model to classify and segment instance-level novel categories. Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and words in captions. However, such methods build noisy supervision by matching non-visual words, such as adjectives and verbs, to image regions. Meanwhile, context words are also important for inferring the existence of novel objects, as they show high inter-correlations with novel categories. To overcome these limitations, we devise a joint \textbf{Caption Grounding and Generation (CGG)} framework, which incorporates a novel grounding loss that focuses only on matching object nouns, improving learning efficiency. We also introduce a caption generation head that provides additional supervision and contextual modeling as a complement to the grounding loss. Our analysis and results demonstrate that the grounding and generation components complement each other, significantly enhancing segmentation performance for novel classes. Experiments on the COCO dataset under two settings, Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS), demonstrate the superiority of CGG. Specifically, CGG achieves a substantial improvement of 6.8% mAP for novel classes without extra data on the OVIS task and a 15% PQ improvement for novel classes on the OSPS benchmark.
Comment: ICCV 2023
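To show what a noun-only grounding loss can look like, here is a short Python sketch that tags the caption with spaCy and matches only nouns to region features; the POS tagger choice and the max-matching scheme are illustrative assumptions, not the CGG implementation:

import spacy
import torch
import torch.nn.functional as F

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def noun_grounding_loss(caption, word_embed, region_feats):
    # word_embed: callable mapping a word to a (D,) text embedding
    # region_feats: (R, D) embeddings of predicted instance regions
    nouns = [tok.text for tok in nlp(caption) if tok.pos_ == "NOUN"]
    if not nouns:
        return region_feats.new_zeros(())
    noun_feats = torch.stack([word_embed(w) for w in nouns])  # (N, D)
    sim = F.normalize(noun_feats, dim=-1) @ F.normalize(region_feats, dim=-1).T
    # Ground each noun in its best-matching region; adjectives and verbs
    # never enter the loss, avoiding the noisy supervision described above.
    return -sim.max(dim=1).values.mean()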
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
In the realm of Large Language Models, the balance between instruction data quality and quantity has become a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing manual curation and the potential cost of instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal tool to identify discrepancies between a model's expected responses and its autonomous generation prowess. Through the adept application of IFD, cherry samples are pinpointed, leading to a marked improvement in model training efficiency. Empirical validations on renowned datasets like Alpaca and WizardLM underpin our findings: with a mere 10% of the conventional data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the optimization of LLMs, promising both efficiency and resource-conscious advancements. Codes, data, and models
are available at https://github.com/MingLiiii/Cherry_LLM
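The paper defines IFD as the ratio between the model's loss on a response conditioned on its instruction and its loss on the response alone; a high ratio means the instruction barely helps, marking a difficult (and informative) sample. A minimal Hugging Face sketch follows; prompt templating and tokenization details may differ from the released code:

import torch

def ifd_score(model, tokenizer, instruction, response):
    def response_loss(prefix):
        enc = tokenizer(prefix + response, return_tensors="pt")
        n_prefix = len(tokenizer(prefix)["input_ids"])
        labels = enc["input_ids"].clone()
        labels[:, :n_prefix] = -100  # score only the response tokens
        with torch.no_grad():
            return model(**enc, labels=labels).loss.item()
    conditioned = response_loss(instruction)  # L(response | instruction)
    direct = response_loss("")                # L(response) alone
    return conditioned / direct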
Shoggoth: Towards Efficient Edge-Cloud Collaborative Real-Time Video Inference via Adaptive Online Learning
This paper proposes Shoggoth, an efficient edge-cloud collaborative architecture for boosting inference performance on real-time video of changing scenes. Shoggoth uses online knowledge distillation to improve the accuracy of models suffering from data drift and offloads the labeling process to the cloud, alleviating the constrained resources of edge devices. At the edge, we design adaptive training using small batches to adapt models under limited computing power, and adaptive sampling of training frames for robustness and reduced bandwidth. Evaluations on realistic datasets show a 15%-20% model accuracy improvement compared to the edge-only strategy and lower network costs than the cloud-only strategy.
Comment: Accepted by the 60th ACM/IEEE Design Automation Conference (DAC 2023)
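At a high level, the described loop can be sketched as below; every component name is a hypothetical stand-in, since the abstract does not fix an API:

def edge_loop(stream, edge_model, cloud, sampler, batch_size=8):
    buffer = []
    for frame in stream:
        prediction = edge_model.infer(frame)  # real-time edge inference
        if sampler.should_upload(frame, prediction):
            # Cloud teacher labels sampled frames (online knowledge
            # distillation), sparing edge resources and bandwidth.
            buffer.append((frame, cloud.label(frame)))
        if len(buffer) >= batch_size:
            # Adaptive training on small batches fits limited edge compute.
            edge_model.finetune(buffer)
            buffer.clear()
        yield prediction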
Blur the Linguistic Boundary: Interpreting Chinese Buddhist Sutra in English via Neural Machine Translation
Buddhism is an influential religion with a long-standing history and profound philosophy. Nowadays, more and more people worldwide aspire to learn the essence of Buddhism, making its dissemination increasingly important. However, Buddhist scriptures written in classical Chinese are obscure to most people and to machine translation applications; for instance, general Chinese-English neural machine translation (NMT) fails in this domain. In this paper, we propose a novel approach to building a practical NMT model for Buddhist scriptures. Our translation pipeline achieves highly promising results in ablation experiments under three criteria.
Comment: Accepted by the 34th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2022)
EdgeMA: Model Adaptation System for Real-Time Video Analytics on Edge Devices
Real-time video analytics on edge devices for changing scenes remains a difficult task. As edge devices are usually resource-constrained, edge deep neural networks (DNNs) have fewer weights and shallower architectures than general DNNs; as a result, they perform well only in limited scenarios and are sensitive to data drift. In this paper, we introduce EdgeMA, a practical and efficient video analytics system designed to adapt models to shifts in real-world video streams over time, addressing the data drift problem. EdgeMA extracts statistical texture features based on the gray-level co-occurrence matrix and uses a Random Forest classifier to detect domain shift. Moreover, we incorporate a model adaptation method based on importance weighting, specifically designed to update models to cope with label distribution shift. Through rigorous evaluation of EdgeMA on a real-world dataset, our results show that EdgeMA significantly improves inference accuracy.
Comment: Accepted by the 30th International Conference on Neural Information Processing (ICONIP 2023)
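The shift detector can be sketched directly with scikit-image and scikit-learn, since both the GLCM features and the Random Forest are standard components; the specific texture properties and hyperparameters below are assumptions rather than the paper's exact setup:

import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def glcm_features(gray_frame):
    # gray_frame: 2-D uint8 array (a grayscale video frame)
    glcm = graycomatrix(gray_frame, distances=[1],
                        angles=[0, np.pi / 2], levels=256, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

clf = RandomForestClassifier(n_estimators=100)
# Train on frames with known domain labels, then flag shifts online:
# clf.fit(np.stack([glcm_features(f) for f in frames]), domain_labels)
# shifted = clf.predict(glcm_features(new_frame)[None])[0] != current_domain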
Towards Open Vocabulary Learning: A Survey
In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate under the closed-set assumption, meaning that the model can only identify pre-defined categories present in the training set. Recently, open vocabulary settings have been proposed, driven by the rapid progress of vision-language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more general, practical, and effective than weakly supervised and zero-shot settings. This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field. In particular, we begin by comparing it to related concepts such as zero-shot learning, open-set recognition, and out-of-distribution detection. Then, we review several closely related tasks in segmentation and detection, including long-tail problems and few-shot and zero-shot settings. For the method survey, we first present the fundamentals of closed-set detection and segmentation as preliminaries. Next, we examine various scenarios in which open vocabulary learning is used, identifying common design elements and core ideas. Then, we compare recent detection and segmentation approaches on commonly used datasets and benchmarks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To our knowledge, this is the first comprehensive literature review of open vocabulary learning. We continuously track related works at https://github.com/jianzongwu/Awesome-Open-Vocabulary.
Comment: Project page at https://github.com/jianzongwu/Awesome-Open-Vocabulary
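A minimal sketch of the core mechanism the survey covers, using text embeddings from a vision-language model as an open-ended classifier (here via the OpenAI CLIP package; any VLM with paired encoders would do), with the category list and prompt template as illustrative choices:

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

model, preprocess = clip.load("ViT-B/32")
categories = ["cat", "dog", "zebra"]  # may include novel, unannotated classes
prompts = clip.tokenize([f"a photo of a {c}" for c in categories])

with torch.no_grad():
    text_emb = model.encode_text(prompts)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def classify(image):
    # image: tensor (3, H, W) produced by `preprocess`
    with torch.no_grad():
        img_emb = model.encode_image(image.unsqueeze(0))
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return categories[(img_emb @ text_emb.T).argmax().item()]

Swapping or extending the category list requires no retraining, which is what lets open vocabulary models recognize classes beyond the annotated label space.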