ODN: Opening the Deep Network for Open-set Action Recognition
In recent years, the performance of action recognition has been significantly
improved with the help of deep neural networks. Most of the existing action
recognition works hold the \textit{closed-set} assumption that all action
categories are known beforehand while deep networks can be well trained for
these categories. However, action recognition in the real world is essentially
an \textit{open-set} problem, namely, it is impossible to know all action
categories beforehand and consequently infeasible to prepare sufficient
training samples for those emerging categories. In this case, applying
closed-set recognition methods will definitely lead to unseen-category errors.
To address this challenge, we propose the Open Deep Network (ODN) for the
open-set action recognition task. Technologically, ODN detects new categories
by applying a multi-class triplet thresholding method, and then dynamically
reconstructs the classification layer and "opens" the deep network by adding
predictors for new categories continually. In order to transfer the learned
knowledge to the new category, two novel methods, Emphasis Initialization and
Allometry Training, are adopted to initialize and incrementally train the new
predictor so that only a few samples are needed to fine-tune the model.
Extensive experiments show that ODN can effectively detect and recognize new
categories with little human intervention, making it applicable to open-set
action recognition tasks in the real world. Moreover, ODN can even achieve
performance comparable to some closed-set methods.
Comment: 6 pages, 3 figures, ICME 201
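To make the mechanism concrete, below is a minimal PyTorch-style sketch of the
"opening" step described above, assuming a plain linear classification head.
The confidence threshold and the top-k seeding of the new weight row are
illustrative simplifications standing in for the paper's multi-class triplet
thresholding and Emphasis Initialization, not the actual ODN implementation.

    import torch
    import torch.nn as nn

    class OpenHead(nn.Module):
        """Illustrative open-set head; not the paper's actual ODN code."""

        def __init__(self, feat_dim, num_known, threshold=0.5):
            super().__init__()
            self.fc = nn.Linear(feat_dim, num_known)
            self.threshold = threshold  # stand-in for the triplet thresholds

        def is_unknown(self, feats):
            # Flag samples whose top softmax confidence falls below the threshold.
            probs = self.fc(feats).softmax(dim=-1)
            return probs.max(dim=-1).values < self.threshold

        def open_new_category(self, k=2):
            # "Open" the network: append one predictor for the new category,
            # seeding its weights from the k most salient existing predictors
            # (a rough stand-in for Emphasis Initialization).
            old = self.fc
            self.fc = nn.Linear(old.in_features, old.out_features + 1)
            with torch.no_grad():
                self.fc.weight[:-1] = old.weight
                self.fc.bias[:-1] = old.bias
                top = old.weight.norm(dim=1).topk(k).indices
                self.fc.weight[-1] = old.weight[top].mean(dim=0)
                self.fc.bias[-1] = old.bias[top].mean()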
LLaSM: Large Language and Speech Model
Multi-modal large language models have garnered significant interest
recently. However, most existing work focuses on vision-language multi-modal
models that provide strong capabilities in following vision-and-language
instructions. We argue that speech is also an important modality through which
humans interact with the world. Hence, it is crucial for a general-purpose
assistant to be able to follow multi-modal speech-and-language instructions. In
this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an
end-to-end trained large multi-modal speech-language model with cross-modal
conversational abilities, capable of following speech-and-language
instructions. Our early experiments show that LLaSM offers a more convenient
and natural way for humans to interact with artificial intelligence. We also
release a large speech instruction-following dataset, LLaSM-Audio-Instructions.
Code and demo are available at
https://github.com/LinkSoul-AI/LLaSM and
https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions
dataset is available at
https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions
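As an illustration of the kind of architecture the abstract describes, the
sketch below wires a speech encoder to a language model through a linear modal
adapter. The module interfaces and dimensions are assumptions made for the
example, not LLaSM's actual implementation.

    import torch
    import torch.nn as nn

    class SpeechLanguageModel(nn.Module):
        """Illustrative end-to-end speech-and-language model (not LLaSM's code)."""

        def __init__(self, speech_encoder, llm, speech_dim=1024, llm_dim=4096):
            super().__init__()
            self.speech_encoder = speech_encoder           # audio -> (B, T_a, speech_dim)
            self.adapter = nn.Linear(speech_dim, llm_dim)  # cross-modal projection
            self.llm = llm                                 # decoder-only LM over embeddings

        def forward(self, audio, text_embeds):
            speech_feats = self.speech_encoder(audio)
            speech_tokens = self.adapter(speech_feats)     # map into the LLM embedding space
            inputs = torch.cat([speech_tokens, text_embeds], dim=1)
            return self.llm(inputs)                        # next-token logits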
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Text-to-Image (T2I) diffusion models have achieved remarkable success in
image generation. Despite their progress, challenges remain in prompt-following
ability and image quality, as well as a lack of the high-quality datasets that
are essential for refining these models. As acquiring labeled data is
costly, we introduce AGFSync, a framework that enhances T2I diffusion models
through Direct Preference Optimization (DPO) in a fully AI-driven approach.
AGFSync utilizes Vision-Language Models (VLMs) to assess image quality across
style, coherence, and aesthetics, generating feedback data within an AI-driven
loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and
SDXL, our extensive experiments on the TIFA dataset demonstrate notable
improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2
benchmark, consistently outperforming the base models. AGFSync's method of
refining T2I diffusion models paves the way for scalable alignment techniques.
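A minimal sketch of the AI-driven feedback loop described above: candidate
images are generated per prompt, scored by a VLM on style, coherence, and
aesthetics, and the best and worst candidates form a DPO preference pair. The
generate_images and score_image callables are assumed interfaces, not
AGFSync's actual API.

    def build_preference_pairs(prompts, generate_images, score_image, n_candidates=4):
        """Turn VLM feedback into DPO-style (prompt, chosen, rejected) triples.

        generate_images(prompt, n) -> list of candidate images from the T2I model
        score_image(prompt, image) -> scalar combining style/coherence/aesthetics
        Both callables are assumed, illustrative interfaces.
        """
        pairs = []
        for prompt in prompts:
            candidates = generate_images(prompt, n_candidates)
            ranked = sorted(candidates, key=lambda img: score_image(prompt, img))
            # Highest-scored image is preferred; lowest-scored is rejected.
            pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
        return pairs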
AutoAgents: A Framework for Automatic Agent Generation
Large language models (LLMs) have enabled remarkable advances in automated
task-solving with multi-agent systems. However, most existing LLM-based
multi-agent approaches rely on predefined agents to handle simple tasks,
limiting the adaptability of multi-agent collaboration to different scenarios.
Therefore, we introduce AutoAgents, an innovative framework that adaptively
generates and coordinates multiple specialized agents to build an AI team
according to different tasks. Specifically, AutoAgents couples tasks and roles
by dynamically generating the required agents based on the task content and
planning a solution for the current task with the generated expert agents.
Multiple specialized agents collaborate with each
other to efficiently accomplish tasks. Concurrently, an observer role is
incorporated into the framework to reflect on the designated plans and agents'
responses and improve upon them. Our experiments on various benchmarks
demonstrate that AutoAgents generates more coherent and accurate solutions than
the existing multi-agent methods. This underscores the significance of
assigning different roles to different tasks and of team cooperation, offering
new perspectives for tackling complex tasks. The repository of this project is
available at https://github.com/Link-AGI/AutoAgents
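The generate-then-coordinate loop can be pictured with the short sketch below:
an LLM drafts the specialized roles for a given task, each generated agent
contributes, and an observer reviews and merges the responses. The prompts and
the call_llm interface are placeholders, not the framework's actual API.

    import json

    def run_auto_team(task, call_llm):
        """Illustrative agent-generation loop; call_llm(prompt) -> str is assumed."""
        # 1. Dynamically draft specialized agent roles for this task.
        roles = json.loads(call_llm(
            f"Task: {task}\nReturn a JSON list of agent roles, each with "
            '"name" and "responsibility" fields.'))

        # 2. Each generated expert agent contributes its part of the solution.
        drafts = [
            call_llm(f"You are {r['name']} ({r['responsibility']}).\n"
                     f"Task: {task}\nProvide your contribution.")
            for r in roles
        ]

        # 3. An observer reflects on the plan and the agents' responses.
        return call_llm("You are the observer. Review and merge the contributions "
                        f"below, fixing any inconsistencies.\nTask: {task}\n"
                        + "\n---\n".join(drafts))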
Chinese Open Instruction Generalist: A Preliminary Release
Instruction tuning is widely recognized as a key technique for building
generalist language models, which has attracted the attention of researchers
and the public with the release of InstructGPT~\citep{ouyang2022training} and
ChatGPT\footnote{\url{https://chat.openai.com/}}. Despite impressive progress
in English-oriented large-scale language models (LLMs), it remains
under-explored whether, with well-designed instruction tuning, English-based
foundation LLMs can perform as well on multilingual tasks as on English tasks,
and how to construct the corpora needed for such tuning.
To remedy this gap, we propose this project as an attempt to create a Chinese
instruction dataset using various methods adapted to the intrinsic
characteristics of four sub-tasks. We collect around 200k Chinese instruction
tuning samples,
which have been manually checked to guarantee high quality. We also summarize
the existing English and Chinese instruction corpora and briefly describe some
potential applications of the newly constructed Chinese instruction corpora.
The resulting \textbf{C}hinese \textbf{O}pen \textbf{I}nstruction
\textbf{G}eneralist (\textbf{COIG}) corpora are available in
Huggingface\footnote{\url{https://huggingface.co/datasets/BAAI/COIG}} and
Github\footnote{\url{https://github.com/FlagOpen/FlagInstruct}}, and will be
continuously updated.
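For reference, the corpora on the Hugging Face Hub should be loadable with the
datasets library; a minimal sketch, assuming the repository resolves directly
through load_dataset (see the dataset card for the exact sub-corpus files or
configurations):

    from datasets import load_dataset

    # Load the COIG corpora from the Hub; a specific sub-corpus or data file
    # may need to be selected explicitly -- check the dataset card.
    coig = load_dataset("BAAI/COIG")
    print(coig)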
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Self-supervised learning (SSL) has recently emerged as a promising paradigm
for training generalisable models on large-scale data in the fields of vision,
text, and speech. Although SSL has been proven effective in speech and audio,
its application to music audio has yet to be thoroughly explored. This is
primarily due to the distinctive challenges associated with modelling musical
knowledge, particularly the tonal and pitched characteristics of music. To
address this research gap, we propose an acoustic Music undERstanding model
with large-scale self-supervised Training (MERT), which incorporates teacher
models to provide pseudo labels for masked language modelling (MLM)-style
acoustic pre-training. In our exploration, we identified a superior combination
of teacher models that outperforms conventional speech and audio approaches.
This combination includes an acoustic teacher based on
Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical
teacher based on the Constant-Q Transform (CQT). These teachers effectively
guide our student model, a BERT-style transformer encoder, to better model
music audio. In addition, we introduce an in-batch noise mixture augmentation
to enhance the representation robustness. Furthermore, we explore a wide range
of settings to overcome the instability in acoustic language model
pre-training, which allows our designed paradigm to scale from 95M to 330M
parameters. Experimental results indicate that our model can generalise and
perform well on 14 music understanding tasks and attains state-of-the-art
(SOTA) overall scores. The code and models are online:
https://github.com/yizhilll/MERT
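To make the pre-training objective concrete, below is a minimal sketch of a
dual-teacher masked-prediction loss: masked frames are classified against
RVQ-VAE codewords (acoustic teacher) and regressed against CQT frames (musical
teacher). The loss forms and weighting are illustrative simplifications, not
MERT's exact recipe.

    import torch
    import torch.nn.functional as F

    def dual_teacher_mlm_loss(student_out, acoustic_codes, cqt_target, mask,
                              code_head, cqt_head, w_acoustic=1.0, w_musical=1.0):
        """Illustrative dual-teacher masked prediction loss (simplified).

        student_out:    (B, T, D) student encoder outputs
        acoustic_codes: (B, T)    discrete targets from an RVQ-VAE acoustic teacher
        cqt_target:     (B, T, F) CQT frames from the musical teacher
        mask:           (B, T)    True where frames were masked
        """
        masked = student_out[mask]                                     # (N, D)
        # Acoustic teacher: classify the masked frame's RVQ codeword.
        acoustic_loss = F.cross_entropy(code_head(masked), acoustic_codes[mask])
        # Musical teacher: regress the masked frame's CQT spectrum.
        musical_loss = F.mse_loss(cqt_head(masked), cqt_target[mask])
        return w_acoustic * acoustic_loss + w_musical * musical_loss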
Centralized Space Learning for open-set computer-aided diagnosis
In computer-aided diagnosis (CAD), diagnosing untrained diseases as known
categories can cause serious medical accidents, which makes it crucial to
distinguish new classes (open set) while preserving performance on the known
classes (closed set) so as to enhance robustness. However, how to accurately
define the decision boundary between known and unknown classes remains an open
problem, as unknown classes are never seen during training, especially in the
medical domain. Moreover, manipulating the latent distribution of the known
classes also influences that of the unknown classes, making the problem even
harder. In this paper, we propose the Centralized Space Learning (CSL) method
to address the open-set recognition problem in CAD by learning a centralized
space that separates the known and unknown classes with the assistance of proxy
images generated by a generative adversarial network (GAN). Through three
steps, namely known space initialization, unknown anchor generation and
centralized space refinement, CSL learns an optimized space distribution in
which unknown samples cluster around the center while known samples spread away
from it, achieving a clear separation between the known and the unknown.
Extensive experiments on multiple datasets and tasks demonstrate the
practicality of CSL in CAD and its state-of-the-art open-set recognition
performance.
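The centralized-space idea can be sketched as a simple loss in which
GAN-generated proxy (unknown) features are pulled toward a shared center while
known-class features are pushed beyond a margin; the distance form and margin
below are illustrative choices, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def centralized_space_loss(known_feats, proxy_feats, center, margin=1.0):
        """Illustrative loss: unknowns cluster at the center, knowns spread away.

        known_feats: (Nk, D) features of known-class (closed-set) samples
        proxy_feats: (Nu, D) features of GAN-generated proxy "unknown" samples
        center:      (D,)    learnable or fixed center of the feature space
        """
        d_known = (known_feats - center).norm(dim=1)
        d_proxy = (proxy_feats - center).norm(dim=1)
        pull_unknown = d_proxy.pow(2).mean()                  # proxies -> center
        push_known = F.relu(margin - d_known).pow(2).mean()   # knowns >= margin away
        return pull_unknown + push_known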