ODN: Opening the Deep Network for Open-set Action Recognition
In recent years, the performance of action recognition has been significantly
improved with the help of deep neural networks. Most of the existing action
recognition works hold the \textit{closed-set} assumption that all action
categories are known beforehand while deep networks can be well trained for
these categories. However, action recognition in the real world is essentially
an \textit{open-set} problem, namely, it is impossible to know all action
categories beforehand and consequently infeasible to prepare sufficient
training samples for those emerging categories. In this case, applying
closed-set recognition methods will definitely lead to unseen-category errors.
To address this challenge, we propose the Open Deep Network (ODN) for the
open-set action recognition task. Technologically, ODN detects new categories
by applying a multi-class triplet thresholding method, and then dynamically
reconstructs the classification layer and "opens" the deep network by adding
predictors for new categories continually. In order to transfer the learned
knowledge to the new category, two novel methods, Emphasis Initialization and
Allometry Training, are adopted to initialize and incrementally train the new
predictor so that only a few samples are needed to fine-tune the model.
Extensive experiments show that ODN can effectively detect and recognize new
categories with little human intervention, making it applicable to open-set
action recognition tasks in the real world. Moreover, ODN can even achieve
performance comparable to some closed-set methods.
Comment: 6 pages, 3 figures, ICME 201
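To make the mechanism concrete, below is a minimal PyTorch-style sketch of the
"opening" step described above, assuming a plain linear classification head.
The confidence threshold and the top-k seeding of the new weight row are
illustrative simplifications standing in for the paper's multi-class triplet
thresholding and Emphasis Initialization, not the actual ODN implementation.

    import torch
    import torch.nn as nn

    class OpenHead(nn.Module):
        """Illustrative open-set head; not the paper's actual ODN code."""

        def __init__(self, feat_dim, num_known, threshold=0.5):
            super().__init__()
            self.fc = nn.Linear(feat_dim, num_known)
            self.threshold = threshold  # stand-in for the triplet thresholds

        def is_unknown(self, feats):
            # Flag samples whose top softmax confidence falls below the threshold.
            probs = self.fc(feats).softmax(dim=-1)
            return probs.max(dim=-1).values < self.threshold

        def open_new_category(self, k=2):
            # "Open" the network: append one predictor for the new category,
            # seeding its weights from the k most salient existing predictors
            # (a rough stand-in for Emphasis Initialization).
            old = self.fc
            self.fc = nn.Linear(old.in_features, old.out_features + 1)
            with torch.no_grad():
                self.fc.weight[:-1] = old.weight
                self.fc.bias[:-1] = old.bias
                top = old.weight.norm(dim=1).topk(k).indices
                self.fc.weight[-1] = old.weight[top].mean(dim=0)
                self.fc.bias[-1] = old.bias[top].mean()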
LLaSM: Large Language and Speech Model
Multi-modal large language models have garnered significant interest
recently. However, most existing work focuses on vision-language multi-modal
models that provide strong capabilities in following vision-and-language
instructions. We argue that speech is also an important modality through which
humans interact with the world. Hence, it is crucial for a general-purpose
assistant to be able to follow multi-modal speech-and-language instructions. In
this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an
end-to-end trained large multi-modal speech-language model with cross-modal
conversational abilities, capable of following speech-and-language
instructions. Our early experiments show that LLaSM offers a more convenient
and natural way for humans to interact with artificial intelligence. We also
release a large speech instruction-following dataset, LLaSM-Audio-Instructions.
Code and demo are available at
https://github.com/LinkSoul-AI/LLaSM and
https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions
dataset is available at
https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions
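As an illustration of the kind of architecture the abstract describes, the
sketch below wires a speech encoder to a language model through a linear modal
adapter. The module interfaces and dimensions are assumptions made for the
example, not LLaSM's actual implementation.

    import torch
    import torch.nn as nn

    class SpeechLanguageModel(nn.Module):
        """Illustrative end-to-end speech-and-language model (not LLaSM's code)."""

        def __init__(self, speech_encoder, llm, speech_dim=1024, llm_dim=4096):
            super().__init__()
            self.speech_encoder = speech_encoder           # audio -> (B, T_a, speech_dim)
            self.adapter = nn.Linear(speech_dim, llm_dim)  # cross-modal projection
            self.llm = llm                                 # decoder-only LM over embeddings

        def forward(self, audio, text_embeds):
            speech_feats = self.speech_encoder(audio)
            speech_tokens = self.adapter(speech_feats)     # map into the LLM embedding space
            inputs = torch.cat([speech_tokens, text_embeds], dim=1)
            return self.llm(inputs)                        # next-token logits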
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Text-to-Image (T2I) diffusion models have achieved remarkable success in
image generation. Despite their progress, challenges remain in prompt-following
ability and image quality, as well as a lack of the high-quality datasets that
are essential for refining these models. As acquiring labeled data is
costly, we introduce AGFSync, a framework that enhances T2I diffusion models
through Direct Preference Optimization (DPO) in a fully AI-driven approach.
AGFSync utilizes Vision-Language Models (VLMs) to assess image quality across
style, coherence, and aesthetics, generating feedback data within an AI-driven
loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and
SDXL, our extensive experiments on the TIFA dataset demonstrate notable
improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2
benchmark, consistently outperforming the base models. AGFSync's method of
refining T2I diffusion models paves the way for scalable alignment techniques.
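A minimal sketch of the AI-driven feedback loop described above: candidate
images are generated per prompt, scored by a VLM on style, coherence, and
aesthetics, and the best and worst candidates form a DPO preference pair. The
generate_images and score_image callables are assumed interfaces, not
AGFSync's actual API.

    def build_preference_pairs(prompts, generate_images, score_image, n_candidates=4):
        """Turn VLM feedback into DPO-style (prompt, chosen, rejected) triples.

        generate_images(prompt, n) -> list of candidate images from the T2I model
        score_image(prompt, image) -> scalar combining style/coherence/aesthetics
        Both callables are assumed, illustrative interfaces.
        """
        pairs = []
        for prompt in prompts:
            candidates = generate_images(prompt, n_candidates)
            ranked = sorted(candidates, key=lambda img: score_image(prompt, img))
            # Highest-scored image is preferred; lowest-scored is rejected.
            pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
        return pairs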
AutoAgents: A Framework for Automatic Agent Generation
Large language models (LLMs) have enabled remarkable advances in automated
task-solving with multi-agent systems. However, most existing LLM-based
multi-agent approaches rely on predefined agents to handle simple tasks,
limiting the adaptability of multi-agent collaboration to different scenarios.
Therefore, we introduce AutoAgents, an innovative framework that adaptively
generates and coordinates multiple specialized agents to build an AI team
according to different tasks. Specifically, AutoAgents couples tasks and roles
by dynamically generating the required agents based on the task content and
planning a solution for the current task with the generated expert agents.
Multiple specialized agents collaborate with each
other to efficiently accomplish tasks. Concurrently, an observer role is
incorporated into the framework to reflect on the designated plans and agents'
responses and improve upon them. Our experiments on various benchmarks
demonstrate that AutoAgents generates more coherent and accurate solutions than
the existing multi-agent methods. This underscores the significance of
assigning different roles to different tasks and of team cooperation, offering
new perspectives for tackling complex tasks. The repository of this project is
available at https://github.com/Link-AGI/AutoAgents
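The generate-then-coordinate loop can be pictured with the short sketch below:
an LLM drafts the specialized roles for a given task, each generated agent
contributes, and an observer reviews and merges the responses. The prompts and
the call_llm interface are placeholders, not the framework's actual API.

    import json

    def run_auto_team(task, call_llm):
        """Illustrative agent-generation loop; call_llm(prompt) -> str is assumed."""
        # 1. Dynamically draft specialized agent roles for this task.
        roles = json.loads(call_llm(
            f"Task: {task}\nReturn a JSON list of agent roles, each with "
            '"name" and "responsibility" fields.'))

        # 2. Each generated expert agent contributes its part of the solution.
        drafts = [
            call_llm(f"You are {r['name']} ({r['responsibility']}).\n"
                     f"Task: {task}\nProvide your contribution.")
            for r in roles
        ]

        # 3. An observer reflects on the plan and the agents' responses.
        return call_llm("You are the observer. Review and merge the contributions "
                        f"below, fixing any inconsistencies.\nTask: {task}\n"
                        + "\n---\n".join(drafts))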
Chinese Open Instruction Generalist: A Preliminary Release
Instruction tuning is widely recognized as a key technique for building
generalist language models, which has attracted the attention of researchers
and the public with the release of InstructGPT~\citep{ouyang2022training} and
ChatGPT\footnote{\url{https://chat.openai.com/}}. Despite impressive progress
in English-oriented large-scale language models (LLMs), it remains
under-explored whether, with well-designed instruction tuning, English-based
foundation LLMs can perform as well on multilingual tasks as on English tasks,
and how to construct the corpora needed for such tuning.
To remedy this gap, we propose this project as an attempt to create a Chinese
instruction dataset using various methods adapted to the intrinsic
characteristics of four sub-tasks. We collect around 200k Chinese instruction
tuning samples,
which have been manually checked to guarantee high quality. We also summarize
the existing English and Chinese instruction corpora and briefly describe some
potential applications of the newly constructed Chinese instruction corpora.
The resulting \textbf{C}hinese \textbf{O}pen \textbf{I}nstruction
\textbf{G}eneralist (\textbf{COIG}) corpora are available in
Huggingface\footnote{\url{https://huggingface.co/datasets/BAAI/COIG}} and
Github\footnote{\url{https://github.com/FlagOpen/FlagInstruct}}, and will be
continuously updated.
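For reference, the corpora on the Hugging Face Hub should be loadable with the
datasets library; a minimal sketch, assuming the repository resolves directly
through load_dataset (see the dataset card for the exact sub-corpus files or
configurations):

    from datasets import load_dataset

    # Load the COIG corpora from the Hub; a specific sub-corpus or data file
    # may need to be selected explicitly -- check the dataset card.
    coig = load_dataset("BAAI/COIG")
    print(coig)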
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Self-supervised learning (SSL) has recently emerged as a promising paradigm
for training generalisable models on large-scale data in the fields of vision,
text, and speech. Although SSL has been proven effective in speech and audio,
its application to music audio has yet to be thoroughly explored. This is
primarily due to the distinctive challenges associated with modelling musical
knowledge, particularly the tonal and pitched characteristics of music. To
address this research gap, we propose an acoustic Music undERstanding model
with large-scale self-supervised Training (MERT), which incorporates teacher
models to provide pseudo labels for masked language modelling (MLM)-style
acoustic pre-training. In our exploration, we identified a superior combination
of teacher models that outperforms conventional speech and audio approaches.
This combination includes an acoustic teacher based on
Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical
teacher based on the Constant-Q Transform (CQT). These teachers effectively
guide our student model, a BERT-style transformer encoder, to better model
music audio. In addition, we introduce an in-batch noise mixture augmentation
to enhance the representation robustness. Furthermore, we explore a wide range
of settings to overcome the instability in acoustic language model
pre-training, which allows our designed paradigm to scale from 95M to 330M
parameters. Experimental results indicate that our model can generalise and
perform well on 14 music understanding tasks and attains state-of-the-art
(SOTA) overall scores. The code and models are online:
https://github.com/yizhilll/MERT
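To make the pre-training objective concrete, below is a minimal sketch of a
dual-teacher masked-prediction loss: masked frames are classified against
RVQ-VAE codewords (acoustic teacher) and regressed against CQT frames (musical
teacher). The loss forms and weighting are illustrative simplifications, not
MERT's exact recipe.

    import torch
    import torch.nn.functional as F

    def dual_teacher_mlm_loss(student_out, acoustic_codes, cqt_target, mask,
                              code_head, cqt_head, w_acoustic=1.0, w_musical=1.0):
        """Illustrative dual-teacher masked prediction loss (simplified).

        student_out:    (B, T, D) student encoder outputs
        acoustic_codes: (B, T)    discrete targets from an RVQ-VAE acoustic teacher
        cqt_target:     (B, T, F) CQT frames from the musical teacher
        mask:           (B, T)    True where frames were masked
        """
        masked = student_out[mask]                                     # (N, D)
        # Acoustic teacher: classify the masked frame's RVQ codeword.
        acoustic_loss = F.cross_entropy(code_head(masked), acoustic_codes[mask])
        # Musical teacher: regress the masked frame's CQT spectrum.
        musical_loss = F.mse_loss(cqt_head(masked), cqt_target[mask])
        return w_acoustic * acoustic_loss + w_musical * musical_loss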
Centralized Space Learning for open-set computer-aided diagnosis
In computer-aided diagnosis (CAD), diagnosing untrained diseases as known
categories can cause serious medical accidents, which makes it crucial to
distinguish new classes (open set) while preserving performance on the known
classes (closed set) so as to enhance robustness. However, how to accurately
define the decision boundary between known and unknown classes remains an open
problem, as unknown classes are never seen during training, especially in the
medical domain. Moreover, manipulating the latent distribution of the known
classes also influences that of the unknown classes, making the problem even
harder. In this paper, we propose the Centralized Space Learning (CSL) method
to address the open-set recognition problem in CAD by learning a centralized
space that separates the known and unknown classes with the assistance of proxy
images generated by a generative adversarial network (GAN). Through three
steps, namely known space initialization, unknown anchor generation and
centralized space refinement, CSL learns an optimized space distribution in
which unknown samples cluster around the center while known samples spread away
from it, achieving a clear separation between the known and the unknown.
Extensive experiments on multiple datasets and tasks demonstrate the
practicality of CSL in CAD and its state-of-the-art open-set recognition
performance.
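The centralized-space idea can be sketched as a simple loss in which
GAN-generated proxy (unknown) features are pulled toward a shared center while
known-class features are pushed beyond a margin; the distance form and margin
below are illustrative choices, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def centralized_space_loss(known_feats, proxy_feats, center, margin=1.0):
        """Illustrative loss: unknowns cluster at the center, knowns spread away.

        known_feats: (Nk, D) features of known-class (closed-set) samples
        proxy_feats: (Nu, D) features of GAN-generated proxy "unknown" samples
        center:      (D,)    learnable or fixed center of the feature space
        """
        d_known = (known_feats - center).norm(dim=1)
        d_proxy = (proxy_feats - center).norm(dim=1)
        pull_unknown = d_proxy.pow(2).mean()                  # proxies -> center
        push_known = F.relu(margin - d_known).pow(2).mean()   # knowns >= margin away
        return pull_unknown + push_known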