LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad
Launchpad is a musical instrument that allows users to create and perform
music by pressing illuminated buttons. To assist and inspire the design of the
Launchpad light effect, and provide a more accessible approach for beginners to
create music visualizations with this instrument, we propose the LaunchpadGPT
model to generate music visualization designs on Launchpad automatically. Built
on a language model with strong generation ability, our proposed
LaunchpadGPT takes an audio piece of music as input and outputs the lighting
effects of Launchpad-playing in the form of a video (Launchpad-playing video).
We collect Launchpad-playing videos and process them to obtain music and the
corresponding video frames of Launchpad-playing as prompt-completion pairs for
training the language model. The experimental results show that the proposed
method creates better music visualizations than random generation methods and
holds potential for a broader range of music visualization applications. Our code is
available at https://github.com/yunlong10/LaunchpadGPT/.
Comment: Accepted by International Computer Music Conference (ICMC) 202
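The prompt-completion pairing described in the LaunchpadGPT abstract can be illustrated with a minimal sketch. It assumes each video frame has already been reduced to an 8x8 grid of RGB pad colors and each audio segment to a short list of feature values. The text format and the names `frame_to_text` and `make_pair` are hypothetical and are not taken from the authors' repository.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]


def frame_to_text(frame: List[List[RGB]]) -> str:
    """Serialize lit pads as 'row,col:r,g,b' tokens (hypothetical format).

    Pads whose color is pure black are treated as unlit and skipped.
    """
    tokens = []
    for r, row in enumerate(frame):
        for c, (red, green, blue) in enumerate(row):
            if (red, green, blue) != (0, 0, 0):
                tokens.append(f"{r},{c}:{red},{green},{blue}")
    return " ".join(tokens)


def make_pair(audio_features: List[float], frame: List[List[RGB]]) -> Dict[str, str]:
    """Build one prompt-completion pair from audio features and a pad grid."""
    # Assumption: audio features are written as space-separated decimals.
    prompt = " ".join(f"{x:.2f}" for x in audio_features)
    return {"prompt": prompt, "completion": frame_to_text(frame)}
```

A language model trained on such pairs would receive the audio text as the prompt and be asked to produce the pad text, which can then be rendered back into a frame.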
A Two-Stage Framework with Self-Supervised Distillation For Cross-Domain Text Classification
Cross-domain text classification aims to adapt models to a target domain that
lacks labeled data. It leverages or reuses rich labeled data from different
but related source domain(s) and unlabeled data from the target domain. To this
end, previous work focuses on either extracting domain-invariant features or
task-agnostic features, ignoring domain-aware features that may be present in
the target domain and could be useful for the downstream task. In this paper,
we propose a two-stage framework for cross-domain text classification. In the
first stage, we fine-tune the model with masked language modeling (MLM) and
labeled data from the source domain. In the second stage, we further fine-tune
the model with self-supervised distillation (SSD) and unlabeled data from the
target domain. We evaluate its performance on a public cross-domain text
classification benchmark, and the experimental results show that our method
achieves new state-of-the-art results for both single-source domain adaptations
(94.17%, ↑1.03%) and multi-source domain adaptations (95.09%,
↑1.34%).
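The first-stage objective in the two-stage framework above, masked language modeling, can be sketched as follows. This is a simplified illustration, assuming whitespace-level tokens and a single `[MASK]` replacement rule. The function name `mask_tokens`, the default masking rate, and the seeding are assumptions for illustration, and the abstract does not state the exact masking procedure used in the paper.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly replace tokens with MASK and return the prediction targets.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a training loss would only be computed at masked positions.
    """
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

The model is then trained to recover the original token at each masked position, which requires no labels beyond the text itself.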
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning
Our winning entry for the CVPR 2023 Generic Event Boundary Captioning (GEBC)
competition is detailed in this paper. Unlike conventional video captioning
tasks, GEBC demands that the captioning model possess an understanding of
immediate changes in status around the designated video boundary, making it a
difficult task. This paper proposes an effective model, LLMVA-GEBC (Large
Language Model with Video Adapter for Generic Event Boundary Captioning): (1)
We utilize a pretrained LLM for generating human-like captions with high
quality. (2) To adapt the model to the GEBC task, we use a video Q-former as
an adapter and train it with the visual feature extractors and LLM kept frozen.
Our proposed method achieved a score of 76.14 on the test set and won first
place in the challenge. Our code is available at
https://github.com/zjr2000/LLMVA-GEBC.
Comment: Winner solution to Generic Event Boundary Captioning task in LOVEU
Challenge (CVPR 2023 workshop