3,960 research outputs found
Learning Audio Sequence Representations for Acoustic Event Classification
Acoustic Event Classification (AEC) has become a significant task for
machines to perceive the surrounding auditory scene. However, extracting
effective representations that capture the underlying characteristics of the
acoustic events is still challenging. Previous methods mainly focused on
designing the audio features in a 'hand-crafted' manner. Interestingly,
data-learnt features have been recently reported to show better performance. Up
to now, these were only considered on the frame-level. In this paper, we
propose an unsupervised learning framework to learn a vector representation of
an audio sequence for AEC. This framework consists of a Recurrent Neural
Network (RNN) encoder and a RNN decoder, which respectively transforms the
variable-length audio sequence into a fixed-length vector and reconstructs the
input sequence on the generated vector. After training the encoder-decoder, we
feed the audio sequences to the encoder and then take the learnt vectors as the
audio sequence representations. Compared with previous methods, the proposed
method can not only deal with the problem of arbitrary-lengths of audio
streams, but also learn the salient information of the sequence. Extensive
evaluation on a large-size acoustic event database is performed, and the
empirical results demonstrate that the learnt audio sequence representation
yields a significant performance improvement by a large margin compared with
other state-of-the-art hand-crafted sequence features for AEC
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.Comment: 15 pages, 2 pdf figure
Pop Music Highlighter: Marking the Emotion Keypoints
The goal of music highlight extraction is to get a short consecutive segment
of a piece of music that provides an effective representation of the whole
piece. In a previous work, we introduced an attention-based convolutional
recurrent neural network that uses music emotion classification as a surrogate
task for music highlight extraction, for Pop songs. The rationale behind that
approach is that the highlight of a song is usually the most emotional part.
This paper extends our previous work in the following two aspects. First,
methodology-wise we experiment with a new architecture that does not need any
recurrent layers, making the training process faster. Moreover, we compare a
late-fusion variant and an early-fusion variant to study which one better
exploits the attention mechanism. Second, we conduct and report an extensive
set of experiments comparing the proposed attention-based methods against a
heuristic energy-based method, a structural repetition-based method, and a few
other simple feature-based methods for this task. Due to the lack of
public-domain labeled data for highlight extraction, following our previous
work we use the RWC POP 100-song data set to evaluate how the detected
highlights overlap with any chorus sections of the songs. The experiments
demonstrate the effectiveness of our methods over competing methods. For
reproducibility, we open source the code and pre-trained model at
https://github.com/remyhuang/pop-music-highlighter/.Comment: Transactions of the ISMIR vol. 1, no.
Easing Embedding Learning by Comprehensive Transcription of Heterogeneous Information Networks
Heterogeneous information networks (HINs) are ubiquitous in real-world
applications. In the meantime, network embedding has emerged as a convenient
tool to mine and learn from networked data. As a result, it is of interest to
develop HIN embedding methods. However, the heterogeneity in HINs introduces
not only rich information but also potentially incompatible semantics, which
poses special challenges to embedding learning in HINs. With the intention to
preserve the rich yet potentially incompatible information in HIN embedding, we
propose to study the problem of comprehensive transcription of heterogeneous
information networks. The comprehensive transcription of HINs also provides an
easy-to-use approach to unleash the power of HINs, since it requires no
additional supervision, expertise, or feature engineering. To cope with the
challenges in the comprehensive transcription of HINs, we propose the HEER
algorithm, which embeds HINs via edge representations that are further coupled
with properly-learned heterogeneous metrics. To corroborate the efficacy of
HEER, we conducted experiments on two large-scale real-words datasets with an
edge reconstruction task and multiple case studies. Experiment results
demonstrate the effectiveness of the proposed HEER model and the utility of
edge representations and heterogeneous metrics. The code and data are available
at https://github.com/GentleZhu/HEER.Comment: 10 pages. In Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, London, United Kingdom,
ACM, 201
An Analysis of Rhythmic Staccato-Vocalization Based on Frequency Demodulation for Laughter Detection in Conversational Meetings
Human laugh is able to convey various kinds of meanings in human
communications. There exists various kinds of human laugh signal, for example:
vocalized laugh and non vocalized laugh. Following the theories of psychology,
among all the vocalized laugh type, rhythmic staccato-vocalization
significantly evokes the positive responses in the interactions. In this paper
we attempt to exploit this observation to detect human laugh occurrences, i.e.,
the laughter, in multiparty conversations from the AMI meeting corpus. First,
we separate the high energy frames from speech, leaving out the low energy
frames through power spectral density estimation. We borrow the algorithm of
rhythm detection from the area of music analysis to use that on the high energy
frames. Finally, we detect rhythmic laugh frames, analyzing the candidate
rhythmic frames using statistics. This novel approach for detection of
`positive' rhythmic human laughter performs better than the standard laughter
classification baseline.Comment: 5 pages, 1 figure, conference pape
- …