79 research outputs found
Towards Robust Referring Image Segmentation
Referring Image Segmentation (RIS) is a fundamental vision-language task that connects image and language by outputting the object mask corresponding to a given text description. Although many works have achieved considerable progress on RIS, in this work we explore an essential question: "what if the text description is wrong or misleading?". We term such a sentence a negative sentence, and we find that existing works cannot handle this setting. To this end, we propose a novel formulation of RIS, named Robust Referring Image Segmentation (R-RIS), which considers negative sentence inputs in addition to the regular text inputs. We present three different datasets built by augmenting the inputs with negative sentences, along with a new metric that unifies both input types. Furthermore, we design a new transformer-based model named RefSegformer, in which we introduce a token-based vision and language fusion module. This module can be easily extended to our R-RIS setting by adding extra blank tokens. RefSegformer achieves new state-of-the-art results on three regular RIS datasets and three R-RIS datasets, serving as a solid baseline for further research. The project page is at \url{https://lxtgh.github.io/project/robust_ref_seg/}.
Comment: technical report
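To make the token-based fusion idea concrete, the following is a minimal PyTorch sketch of a vision-language fusion module extended with blank tokens; the class name, dimensions, and attention layout are illustrative assumptions, not the authors' released RefSegformer implementation:

import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Cross-attention fusion with extra blank tokens (illustrative)."""
    def __init__(self, dim=256, num_heads=8, num_blank_tokens=4):
        super().__init__()
        # Learnable blank tokens that can absorb attention when the
        # sentence matches no object (the R-RIS negative-sentence case).
        self.blank_tokens = nn.Parameter(torch.zeros(1, num_blank_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens, language_tokens):
        # vision_tokens:   (B, N_v, dim) flattened image features
        # language_tokens: (B, N_l, dim) encoded text features
        b = vision_tokens.size(0)
        keys = torch.cat([language_tokens,
                          self.blank_tokens.expand(b, -1, -1)], dim=1)
        # Vision tokens attend over language + blank tokens; attention
        # mass on the blanks signals a non-matching (negative) sentence.
        fused, attn_weights = self.attn(vision_tokens, keys, keys)
        return fused, attn_weights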
PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter
The Retrieval Question Answering (ReQA) task employs a retrieval-augmented framework composed of a retriever and a generator. The generator formulates the answer based on the documents retrieved by the retriever. Incorporating Large Language Models (LLMs) as generators is beneficial due to their advanced QA capabilities, but they are typically too large to be fine-tuned under budget constraints, and some are only accessible via APIs. To tackle this issue and further improve ReQA performance, we propose a trainable Pluggable Reward-Driven Contextual Adapter (PRCA) that keeps the generator as a black box. Positioned between the retriever and the generator in a pluggable manner, PRCA refines the retrieved information with a token-autoregressive strategy, maximizing rewards during a reinforcement learning phase. Our experiments validate PRCA's effectiveness, enhancing ReQA performance by up to 20% on three datasets while fitting black-box LLMs into existing frameworks, demonstrating its considerable potential in the LLM era.
Comment: Accepted by the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
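To illustrate where the adapter sits, here is a minimal Python sketch of the PRCA-style inference flow; retriever, adapter, and call_llm_api are hypothetical stand-ins, and the reinforcement learning loop that trains the adapter is omitted:

def answer_question(question, retriever, adapter, call_llm_api, top_k=5):
    # 1) Retrieve candidate passages for the question.
    passages = retriever.retrieve(question, top_k=top_k)
    # 2) The adapter autoregressively distills the passages into a
    #    compact context; only the adapter is trainable, optimized with
    #    rewards derived from the black-box generator's answer quality.
    context = adapter.generate(question=question, passages=passages)
    # 3) The frozen, black-box generator answers from the refined context.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return call_llm_api(prompt)

Because gradients never flow through the generator, the adapter can be paired with API-only LLMs, which is the point of keeping the generator a black box.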
Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval
Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, poses a great challenge due to the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in two ways: 1) most use contrastive learning to construct a common subspace where similarities among data can be measured, but they consider only cross-modal transformation and neglect intra-modal separability; moreover, the temperature parameter is not adaptively adjusted with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, latent representation reconstruction modules are embedded into the CMR framework, improving modal interaction. Experiments comparing CLSR with several state-of-the-art methods on two audio-text datasets validate its superiority.
Comment: Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
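As a sketch of the two ingredients named in the abstract, the following PyTorch loss combines cross-modal contrastive learning with intra-modal separability terms and a learnable temperature; the weighting and exact formulation are assumptions, not the authors' code:

import torch
import torch.nn.functional as F

def clsr_style_loss(audio_emb, text_emb, log_temp):
    # audio_emb, text_emb: (B, D) embeddings of matched audio-text pairs
    # log_temp: learnable scalar tensor, so the temperature adapts in training
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    temp = log_temp.exp()
    labels = torch.arange(a.size(0), device=a.device)
    # Cross-modal InfoNCE: matched pairs are positives.
    logits = a @ t.T / temp
    cross = F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)
    # Intra-modal separability: push apart distinct samples within a modality.
    intra = F.cross_entropy(a @ a.T / temp, labels) \
          + F.cross_entropy(t @ t.T / temp, labels)
    return cross + 0.5 * intra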
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
In this work, we focus on open vocabulary instance segmentation, expanding a segmentation model to classify and segment instance-level novel categories. Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and words in captions. However, such methods build noisy supervision by matching non-visual words, such as adjectives and verbs, to image regions. Meanwhile, context words are also important for inferring the existence of novel objects, as they show high inter-correlations with novel categories. To overcome these limitations, we devise a joint \textbf{Caption Grounding and Generation (CGG)} framework, which incorporates a novel grounding loss that focuses only on matching object nouns, improving learning efficiency. We also introduce a caption generation head that provides additional supervision and contextual modeling as a complement to the grounding loss. Our analysis and results demonstrate that the grounding and generation components complement each other, significantly enhancing segmentation performance for novel classes. Experiments on the COCO dataset under two settings, Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS), demonstrate the superiority of CGG. Specifically, CGG achieves a substantial improvement of 6.8% mAP for novel classes without extra data on the OVIS task and a 15% PQ improvement for novel classes on the OSPS benchmark.
Comment: ICCV 2023
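To show what a noun-only grounding loss can look like, here is a short Python sketch that tags the caption with spaCy and matches only nouns to region features; the POS tagger choice and the max-matching scheme are illustrative assumptions, not the CGG implementation:

import spacy
import torch
import torch.nn.functional as F

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def noun_grounding_loss(caption, word_embed, region_feats):
    # word_embed: callable mapping a word to a (D,) text embedding
    # region_feats: (R, D) embeddings of predicted instance regions
    nouns = [tok.text for tok in nlp(caption) if tok.pos_ == "NOUN"]
    if not nouns:
        return region_feats.new_zeros(())
    noun_feats = torch.stack([word_embed(w) for w in nouns])  # (N, D)
    sim = F.normalize(noun_feats, dim=-1) @ F.normalize(region_feats, dim=-1).T
    # Ground each noun in its best-matching region; adjectives and verbs
    # never enter the loss, avoiding the noisy supervision described above.
    return -sim.max(dim=1).values.mean()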
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
In the realm of Large Language Models, the balance between instruction data quality and quantity has become a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing manual curation and the potential cost of instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal tool to identify discrepancies between a model's expected responses and its autonomous generation prowess. Through the adept application of IFD, cherry samples are pinpointed, leading to a marked improvement in model training efficiency. Empirical validations on renowned datasets like Alpaca and WizardLM underpin our findings: with a mere 10% of the conventional data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the optimization of LLMs, promising both efficiency and resource-conscious advancements. Codes, data, and models
are available at https://github.com/MingLiiii/Cherry_LLM
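The paper defines IFD as the ratio between the model's loss on a response conditioned on its instruction and its loss on the response alone; a high ratio means the instruction barely helps, marking a difficult (and informative) sample. A minimal Hugging Face sketch follows; prompt templating and tokenization details may differ from the released code:

import torch

def ifd_score(model, tokenizer, instruction, response):
    def response_loss(prefix):
        enc = tokenizer(prefix + response, return_tensors="pt")
        n_prefix = len(tokenizer(prefix)["input_ids"])
        labels = enc["input_ids"].clone()
        labels[:, :n_prefix] = -100  # score only the response tokens
        with torch.no_grad():
            return model(**enc, labels=labels).loss.item()
    conditioned = response_loss(instruction)  # L(response | instruction)
    direct = response_loss("")                # L(response) alone
    return conditioned / direct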
Shoggoth: Towards Efficient Edge-Cloud Collaborative Real-Time Video Inference via Adaptive Online Learning
This paper proposes Shoggoth, an efficient edge-cloud collaborative architecture for boosting inference performance on real-time video of changing scenes. Shoggoth uses online knowledge distillation to improve the accuracy of models suffering from data drift and offloads the labeling process to the cloud, alleviating the constrained resources of edge devices. At the edge, we design adaptive training using small batches to adapt models under limited computing power, and adaptive sampling of training frames for robustness and reduced bandwidth. Evaluations on realistic datasets show a 15%-20% model accuracy improvement compared to the edge-only strategy and lower network costs than the cloud-only strategy.
Comment: Accepted by the 60th ACM/IEEE Design Automation Conference (DAC 2023)
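At a high level, the described loop can be sketched as below; every component name is a hypothetical stand-in, since the abstract does not fix an API:

def edge_loop(stream, edge_model, cloud, sampler, batch_size=8):
    buffer = []
    for frame in stream:
        prediction = edge_model.infer(frame)  # real-time edge inference
        if sampler.should_upload(frame, prediction):
            # Cloud teacher labels sampled frames (online knowledge
            # distillation), sparing edge resources and bandwidth.
            buffer.append((frame, cloud.label(frame)))
        if len(buffer) >= batch_size:
            # Adaptive training on small batches fits limited edge compute.
            edge_model.finetune(buffer)
            buffer.clear()
        yield prediction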
Blur the Linguistic Boundary: Interpreting Chinese Buddhist Sutra in English via Neural Machine Translation
Buddhism is an influential religion with a long-standing history and profound philosophy. Nowadays, more and more people worldwide aspire to learn the essence of Buddhism, making its dissemination increasingly important. However, Buddhist scriptures written in classical Chinese are obscure to most people and to machine translation applications; for instance, general Chinese-English neural machine translation (NMT) fails in this domain. In this paper, we propose a novel approach to building a practical NMT model for Buddhist scriptures. Our translation pipeline achieves highly promising results in ablation experiments under three criteria.
Comment: Accepted by the 34th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2022)
EdgeMA: Model Adaptation System for Real-Time Video Analytics on Edge Devices
Real-time video analytics on edge devices for changing scenes remains a difficult task. As edge devices are usually resource-constrained, edge deep neural networks (DNNs) have fewer weights and shallower architectures than general DNNs; as a result, they perform well only in limited scenarios and are sensitive to data drift. In this paper, we introduce EdgeMA, a practical and efficient video analytics system designed to adapt models to shifts in real-world video streams over time, addressing the data drift problem. EdgeMA extracts statistical texture features based on the gray-level co-occurrence matrix and uses a Random Forest classifier to detect domain shift. Moreover, we incorporate a model adaptation method based on importance weighting, specifically designed to update models to cope with label distribution shift. Through rigorous evaluation of EdgeMA on a real-world dataset, our results show that EdgeMA significantly improves inference accuracy.
Comment: Accepted by the 30th International Conference on Neural Information Processing (ICONIP 2023)
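The shift detector can be sketched directly with scikit-image and scikit-learn, since both the GLCM features and the Random Forest are standard components; the specific texture properties and hyperparameters below are assumptions rather than the paper's exact setup:

import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def glcm_features(gray_frame):
    # gray_frame: 2-D uint8 array (a grayscale video frame)
    glcm = graycomatrix(gray_frame, distances=[1],
                        angles=[0, np.pi / 2], levels=256, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

clf = RandomForestClassifier(n_estimators=100)
# Train on frames with known domain labels, then flag shifts online:
# clf.fit(np.stack([glcm_features(f) for f in frames]), domain_labels)
# shifted = clf.predict(glcm_features(new_frame)[None])[0] != current_domain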
Towards Open Vocabulary Learning: A Survey
In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate under the closed-set assumption, meaning that the model can only identify pre-defined categories present in the training set. Recently, open vocabulary settings have been proposed, driven by the rapid progress of vision-language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more general, practical, and effective than weakly supervised and zero-shot settings. This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field. In particular, we begin by comparing it to related concepts such as zero-shot learning, open-set recognition, and out-of-distribution detection. Then, we review several closely related tasks in segmentation and detection, including long-tail problems and few-shot and zero-shot settings. For the method survey, we first present the fundamentals of closed-set detection and segmentation as preliminaries. Next, we examine various scenarios in which open vocabulary learning is used, identifying common design elements and core ideas. Then, we compare recent detection and segmentation approaches on commonly used datasets and benchmarks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To our knowledge, this is the first comprehensive literature review of open vocabulary learning. We continuously track related works at https://github.com/jianzongwu/Awesome-Open-Vocabulary.
Comment: Project page at https://github.com/jianzongwu/Awesome-Open-Vocabulary
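A minimal sketch of the core mechanism the survey covers, using text embeddings from a vision-language model as an open-ended classifier (here via the OpenAI CLIP package; any VLM with paired encoders would do), with the category list and prompt template as illustrative choices:

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

model, preprocess = clip.load("ViT-B/32")
categories = ["cat", "dog", "zebra"]  # may include novel, unannotated classes
prompts = clip.tokenize([f"a photo of a {c}" for c in categories])

with torch.no_grad():
    text_emb = model.encode_text(prompts)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def classify(image):
    # image: tensor (3, H, W) produced by `preprocess`
    with torch.no_grad():
        img_emb = model.encode_image(image.unsqueeze(0))
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return categories[(img_emb @ text_emb.T).argmax().item()]

Swapping or extending the category list requires no retraining, which is what lets open vocabulary models recognize classes beyond the annotated label space.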