City-Identification of Flickr Videos Using Semantic Acoustic Features
City-identification of videos aims to determine the likelihood of a video
belonging to a set of cities. In this paper, we present an approach that uses
only audio, without any additional modality such as images, user-tags, or
geo-tags. In this manner, we show to what extent the city-location of videos
correlates with their acoustic information. Success in this task suggests
improvements can be made to complement the other modalities. In particular, we
present a method to compute and use semantic acoustic features to perform
city-identification, and these features provide semantic evidence for the
identification. The semantic evidence is given by a taxonomy of urban sounds
and expresses the potential presence of these sounds in the city soundtracks.
We used the MediaEval Placing Task set, which contains Flickr videos labeled by
city. In addition, we used the UrbanSound8K set containing audio clips labeled
by sound type. Our method improved the state-of-the-art performance and
provides a novel semantic approach to this task.
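As a rough illustration of the idea, the sketch below (not the authors' code) treats the posterior probabilities of the ten UrbanSound8K classes as a semantic feature vector per soundtrack and feeds it to a simple city classifier; `urban_sound_scores` is a hypothetical stand-in for a pretrained urban-sound detector, and all data here are random placeholders.

```python
# Minimal sketch, assuming a hypothetical urban-sound detector: semantic
# acoustic features (urban-sound posteriors) used for city-identification.
import numpy as np
from sklearn.linear_model import LogisticRegression

URBAN_TAXONOMY = ["air_conditioner", "car_horn", "children_playing", "dog_bark",
                  "drilling", "engine_idling", "gun_shot", "jackhammer",
                  "siren", "street_music"]  # the UrbanSound8K classes

def urban_sound_scores(soundtrack: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: posterior probability of each urban sound class."""
    logits = np.random.randn(len(URBAN_TAXONOMY))      # placeholder detector output
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Semantic features: one probability vector per video soundtrack.
rng = np.random.default_rng(0)
X = np.stack([urban_sound_scores(rng.standard_normal(16000)) for _ in range(200)])
y = rng.integers(0, 5, size=200)                       # placeholder city labels

clf = LogisticRegression(max_iter=1000).fit(X, y)      # city-identification model
print("train accuracy (toy data):", clf.score(X, y))
```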
AudioPairBank: Towards A Large-Scale Tag-Pair-Based Audio Content Analysis
Recently, sound recognition has been used to identify sounds, such as car and
river. However, sounds have nuances that may be better described by
adjective-noun pairs such as slow car, and verb-noun pairs such as flying
insects, which are underexplored. Therefore, in this work we investigate the
relation between audio content and both adjective-noun pairs and verb-noun
pairs. Due to the lack of datasets with these kinds of annotations, we
collected and processed the AudioPairBank corpus consisting of a combined total
of 1,123 pairs and over 33,000 audio files. One contribution is the previously
unavailable documentation of the challenges and implications of collecting
audio recordings with these types of labels. A second contribution is to show
the degree of correlation between the audio content and the labels through
sound recognition experiments, which yielded an accuracy of 70%, thereby also
providing a performance benchmark. The results and study in this paper
encourage further exploration of the nuances in audio and are meant to
complement similar research performed on images and text in multimedia
analysis. Comment: This paper is a revised version of "AudioSentibank: Large-scale
Semantic Ontology of Acoustic Concepts for Audio Content Analysis".
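A minimal sketch of the kind of sound recognition experiment described above, under assumptions: the clip features and pair labels below are random placeholders, and the pair vocabulary shown is illustrative rather than taken from AudioPairBank.

```python
# Sketch (assumptions, not the AudioPairBank pipeline): classify clips by
# adjective-noun / verb-noun pair labels and report accuracy as a benchmark.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

PAIR_LABELS = ["slow_car", "fast_car", "flying_insects", "buzzing_insects"]  # illustrative

rng = np.random.default_rng(1)
features = rng.standard_normal((400, 64))              # placeholder per-clip features
labels = rng.integers(0, len(PAIR_LABELS), size=400)   # placeholder pair labels

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=1)
model = SVC(kernel="rbf").fit(X_tr, y_tr)
print("pair-label accuracy (toy data):", model.score(X_te, y_te))
```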
Experiments on the DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording
In this paper we present our work on Task 1, Acoustic Scene Classification,
and Task 3, Sound Event Detection in Real Life Recordings. Our experiments
include low-level and high-level features, classifier optimization, and other
heuristics specific to each task. Our performance for both tasks improved the
baseline from DCASE: for Task 1 we achieved an overall accuracy of 78.9%
compared to the baseline of 72.6% and for Task 3 we achieved a Segment-Based
Error Rate of 0.76 compared to the baseline of 0.91.
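The pipeline below is a hedged sketch of a low-level-feature baseline in the spirit of Task 1, not the paper's exact system: MFCC statistics pooled per recording and a generic classifier, run here on synthetic noise instead of the DCASE development recordings.

```python
# Sketch of an acoustic scene classification baseline under stated assumptions:
# low-level MFCC features, simple pooling, and an off-the-shelf classifier.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def clip_features(y: np.ndarray, sr: int = 44100) -> np.ndarray:
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # low-level features
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # per-clip pooling

rng = np.random.default_rng(2)
clips = [rng.standard_normal(44100) for _ in range(60)]  # 1 s of noise per "recording"
X = np.stack([clip_features(c) for c in clips])
y = rng.integers(0, 15, size=60)                         # 15 scene classes in DCASE 2016

clf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, y)
print("scene accuracy (toy data):", clf.score(X, y))
```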
Natural Language Supervision for General-Purpose Audio Representations
Audio-Language models jointly learn multimodal text and audio representations
that enable Zero-Shot inference. These models rely on their encoders to create
powerful representations of the input and to generalize to multiple tasks
spanning sounds, music, and speech. Although models have achieved remarkable
performance, there is still a performance gap with task-specific models. In
this paper, we propose a Contrastive Language-Audio Pretraining model that is
pretrained with a diverse collection of 4.6M audio-text pairs employing two
innovative encoders for Zero-Shot inference. To learn audio representations, we
trained an audio encoder on 22 audio tasks, instead of the standard training on
sound event classification alone. To learn language representations, we trained an
autoregressive decoder-only model instead of the standard encoder-only models.
Then, the audio and language representations are brought into a joint
multimodal space using Contrastive Learning. Our encoders improve the
downstream performance by a clear margin. We extensively evaluated the
generalization of our representations on 26 downstream tasks, the largest in
the literature. Our model achieves state-of-the-art results on several tasks,
leading the way towards general-purpose audio representations.
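A minimal sketch of the contrastive objective described above, assuming placeholder tensors in place of the actual audio and text encoder outputs; the projection sizes and temperature initialization are illustrative, not the paper's settings.

```python
# Contrastive language-audio pretraining objective (sketch with random tensors).
import torch
import torch.nn.functional as F

batch, d_audio, d_text, d_joint = 8, 512, 768, 256
audio_emb = torch.randn(batch, d_audio)          # placeholder audio-encoder outputs
text_emb = torch.randn(batch, d_text)            # placeholder text-encoder outputs

proj_a = torch.nn.Linear(d_audio, d_joint)       # projections into the joint space
proj_t = torch.nn.Linear(d_text, d_joint)
logit_scale = torch.nn.Parameter(torch.tensor(2.65))   # learnable log-temperature

za = F.normalize(proj_a(audio_emb), dim=-1)
zt = F.normalize(proj_t(text_emb), dim=-1)
logits = logit_scale.exp() * za @ zt.t()         # pairwise audio-text similarities

targets = torch.arange(batch)                    # matching pairs lie on the diagonal
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
print("contrastive loss:", loss.item())
```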
Prompting Audios Using Acoustic Properties For Emotion Representation
Emotions lie on a continuum, but current models treat emotions as a
finite-valued discrete variable. This representation does not capture the diversity in
the expression of emotion. To better represent emotions we propose the use of
natural language descriptions (or prompts). In this work, we address the
challenge of automatically generating these prompts and training a model to
better learn emotion representations from audio and prompt pairs. We use
acoustic properties that are correlated with emotion, such as pitch, intensity,
speech rate, and articulation rate, to automatically generate prompts, i.e.,
'acoustic prompts'. We use a contrastive learning objective to map speech to
their respective acoustic prompts. We evaluate our model on Emotion Audio
Retrieval (EAR) and Speech Emotion Recognition (SER). Our results show that the
acoustic prompts significantly improve the model's performance on EAR across
various Precision@K metrics. In SER, we observe a 3.8% relative accuracy
improvement on the RAVDESS dataset. Comment: arXiv admin note: substantial text overlap with arXiv:2211.0773
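A small sketch of how such acoustic prompts might be templated from measured properties; the wording and the low/medium/high thresholds below are assumptions for illustration, not the authors' exact prompt design.

```python
# Sketch: turn measured acoustic properties into a natural-language
# "acoustic prompt" by binning each property into low / medium / high.
def bin_level(value: float, low: float, high: float) -> str:
    """Map a raw measurement to a coarse descriptive level (assumed thresholds)."""
    if value < low:
        return "low"
    return "high" if value > high else "medium"

def acoustic_prompt(pitch_hz: float, intensity_db: float, speech_rate_sps: float) -> str:
    return (f"a speaker with {bin_level(pitch_hz, 120, 220)} pitch, "
            f"{bin_level(intensity_db, 55, 70)} intensity, and "
            f"{bin_level(speech_rate_sps, 3.5, 5.0)} speech rate")

print(acoustic_prompt(pitch_hz=240.0, intensity_db=72.0, speech_rate_sps=3.0))
# -> "a speaker with high pitch, high intensity, and low speech rate"
```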
Training Audio Captioning Models without Audio
Automated Audio Captioning (AAC) is the task of generating natural language
descriptions given an audio stream. A typical AAC system requires manually
curated training data of audio segments and corresponding text caption
annotations. The creation of these audio-caption pairs is costly, resulting in
general data scarcity for the task. In this work, we address this major
limitation and propose an approach to train AAC systems using only text. Our
approach leverages the multimodal space of contrastively trained audio-text
models, such as CLAP. During training, a decoder generates captions conditioned
on the pretrained CLAP text encoder. During inference, the text encoder is
replaced with the pretrained CLAP audio encoder. To bridge the modality gap
between text and audio embeddings, we propose the use of noise injection or a
learnable adapter, during training. We find that the proposed text-only
framework performs competitively with state-of-the-art models trained with
paired audio, showing that efficient text-to-audio transfer is possible.
Finally, we showcase both stylized audio captioning and caption enrichment
while training without audio or human-created text captions.
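To make the noise-injection idea concrete, the sketch below conditions a toy decoder on a noised text embedding during training and swaps in an audio embedding at inference; the linear "encoders", GRU decoder, and dimensions are stand-ins, not CLAP or the paper's architecture.

```python
# Sketch of text-only captioning with noise injection (toy stand-in modules).
import torch
import torch.nn as nn

d_clap, d_model, vocab = 512, 256, 1000
text_encoder = nn.Linear(300, d_clap)    # stand-in for a frozen CLAP text encoder
audio_encoder = nn.Linear(64, d_clap)    # stand-in for a frozen CLAP audio encoder
to_prefix = nn.Linear(d_clap, d_model)   # maps an embedding to a decoder prefix
decoder = nn.GRU(d_model, d_model, batch_first=True)
lm_head = nn.Linear(d_model, vocab)

def caption_logits(clap_emb: torch.Tensor, noise_std: float = 0.0) -> torch.Tensor:
    """Condition the decoder on a (possibly noised) embedding."""
    noised = clap_emb + noise_std * torch.randn_like(clap_emb)   # bridges the modality gap
    prefix = to_prefix(noised).unsqueeze(1)                      # (B, 1, d_model)
    out, _ = decoder(prefix)
    return lm_head(out)

# Training: text-only, with noise injected on the text embedding.
train_logits = caption_logits(text_encoder(torch.randn(4, 300)), noise_std=0.1)
# Inference: swap in the audio encoder, no noise.
test_logits = caption_logits(audio_encoder(torch.randn(4, 64)))
print(train_logits.shape, test_logits.shape)
```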