Deep factorization for speech signal
Various informative factors are mixed in speech signals, which makes decoding any single factor highly difficult. An intuitive idea is to factorize each speech frame into individual informative factors, though this turns out to be very hard. Recently, we found that speaker traits, which were assumed to be long-term distributional properties, are actually short-time patterns and can be learned by a carefully designed deep neural network (DNN). This discovery motivated the cascade deep factorization (CDF) framework presented in this paper. The proposed framework infers speech factors sequentially, using previously inferred factors as conditional variables when inferring the remaining ones. We show that this approach can effectively factorize speech signals and that, from these factors, the original speech spectrum can be recovered with high accuracy. This factorization-and-reconstruction approach offers potential value for many speech processing tasks, e.g., speaker recognition and emotion recognition, as demonstrated in the paper.
Comment: Accepted by ICASSP 2018. arXiv admin note: substantial text overlap with arXiv:1706.0177
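The cascade inference idea is easy to picture in code. Below is a minimal, hypothetical PyTorch-style sketch, not the paper's exact architecture: the module names, dimensions, and the speaker-then-emotion ordering are illustrative assumptions. A first network infers a speaker factor from the spectral frames, and a second network infers an emotion factor conditioned on both the frames and the already-inferred speaker factor; a decoder then reconstructs the spectrum from the two factors.

```python
import torch
import torch.nn as nn

class FactorNet(nn.Module):
    """One stage of the cascade: maps input features to a factor embedding."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

class CascadeDeepFactorization(nn.Module):
    """Hypothetical CDF sketch: infer factors sequentially, feeding
    earlier factors in as conditional variables for later stages."""
    def __init__(self, frame_dim=40, spk_dim=64, emo_dim=32):
        super().__init__()
        self.speaker_net = FactorNet(frame_dim, spk_dim)
        # emotion stage is conditioned on the frame AND the speaker factor
        self.emotion_net = FactorNet(frame_dim + spk_dim, emo_dim)
        # reconstruction head: recover the spectrum from the inferred factors
        self.decoder = FactorNet(spk_dim + emo_dim, frame_dim)

    def forward(self, frames):                    # frames: (batch, frame_dim)
        spk = self.speaker_net(frames)            # stage 1: speaker factor
        emo = self.emotion_net(torch.cat([frames, spk], dim=-1))  # stage 2
        recon = self.decoder(torch.cat([spk, emo], dim=-1))
        return spk, emo, recon

spk, emo, recon = CascadeDeepFactorization()(torch.randn(8, 40))
```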
Survey of deep representation learning for speech emotion recognition
Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features designed through feature engineering. However, designing handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has enabled deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important future areas of research. Our survey bridges a gap in the literature, since existing surveys either focus on SER with hand-engineered features or on representation learning in general without focusing on SER.
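To make the contrast concrete, here is a minimal, hypothetical sketch (the layer sizes, feature choices, and names are illustrative assumptions, not from the survey): a handcrafted pipeline computes fixed statistics of the signal, while a deep encoder learns hierarchical representations whose stacked layers can be trained end-to-end for SER.

```python
import torch
import torch.nn as nn

# Handcrafted route: fixed, hand-designed statistics of the signal.
def handcrafted_features(signal: torch.Tensor) -> torch.Tensor:
    # toy statistics; real systems use far richer hand-engineered sets
    return torch.stack(
        [signal.mean(-1), signal.std(-1), signal.abs().max(-1).values], dim=-1)

# Learned route: stacked layers yield increasingly abstract
# (hierarchical) representations, trained from data.
class DeepRepresentationEncoder(nn.Module):
    def __init__(self, n_mels=64, emb_dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, padding=2), nn.ReLU(),  # low-level patterns
            nn.Conv1d(128, 128, 5, padding=2), nn.ReLU(),     # mid-level structure
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, emb_dim),                          # utterance embedding
        )
    def forward(self, mel):        # mel: (batch, n_mels, frames)
        return self.layers(mel)
```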
An open-source voice type classifier for child-centered daylong recordings
Spontaneous conversations in real-world settings, such as those found in child-centered recordings, have been shown to be among the most challenging audio files to process. Nevertheless, building speech processing models that handle such a wide variety of conditions would be particularly useful for language acquisition studies, in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and for measuring the effects of remediation. In this paper, we present our approach to designing an open-source neural network that classifies audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora that sum to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA) system, which has been used in numerous child language studies.
Comment: accepted to Interspeech 202
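Below is a minimal PyTorch-style sketch of this kind of architecture; the filter counts, layer sizes, pooling factor, and initialization are illustrative assumptions, not the paper's exact configuration. A learnable sinc-based band-pass front end operates on the raw waveform, followed by stacked recurrent layers and a four-way per-frame classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    """Simplified SincNet-style layer: each output channel is a learnable
    band-pass filter parameterized only by its two cutoff frequencies."""
    def __init__(self, out_channels=40, kernel_size=251):
        super().__init__()
        # normalized cutoffs in cycles/sample (illustrative initialization)
        self.low = nn.Parameter(torch.linspace(0.01, 0.3, out_channels))
        self.band = nn.Parameter(torch.full((out_channels,), 0.05))
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size))
        self.pad = kernel_size // 2

    def forward(self, wav):                        # wav: (batch, 1, samples)
        f1 = self.low.abs()
        f2 = f1 + self.band.abs()
        # ideal low-pass impulse response: 2f*sinc(2fn); band-pass = difference
        def lowpass(f):
            f = f.unsqueeze(1)                     # (channels, 1)
            return 2 * f * torch.sinc(2 * f * self.n)
        filters = (lowpass(f2) - lowpass(f1)) * self.window
        return F.conv1d(wav, filters.unsqueeze(1), padding=self.pad)

class VoiceTypeClassifier(nn.Module):
    """Sinc front end + recurrent stack + 4-way output:
    key child / other children / adult male / adult female."""
    def __init__(self, n_filters=40, hidden=128, n_classes=4):
        super().__init__()
        self.sinc = SincConv(n_filters)
        self.pool = nn.MaxPool1d(160)              # coarse downsampling to frames
        self.rnn = nn.LSTM(n_filters, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        x = torch.relu(self.sinc(wav))
        x = self.pool(x).transpose(1, 2)           # (batch, frames, n_filters)
        x, _ = self.rnn(x)
        return self.out(x)                         # per-frame class logits
```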
Deep Learning-Based Speech Emotion Recognition Using Librosa
Speech Emotion Recognition is a challenge in computational paralinguistics and speech processing that tries to identify and classify the emotions expressed in spoken language. The objective is to infer a speaker's emotional state, such as happiness, rage, sadness, or frustration, from their speech patterns, such as prosody, pitch, and rhythm. In the modern world, emotion detection is one of the most crucial marketing tactics: products and services can be tailored to best fit a person's interests. Due to this, we decided to work on a project where we could identify a person's emotions based solely on their speech, which would enable a variety of AI-related applications. Examples include call centers playing calming music during tense exchanges, or a smart automobile that slows down when the driver is scared or furious. In Python, we processed and extracted features from the audio files using the Librosa module. Librosa is a Python library for audio and music analysis; it offers the fundamental components required to develop music information retrieval systems. Because of this, there is a lot of potential for this kind of application in the market, helping businesses and ensuring customer safety.
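As a concrete illustration of such a pipeline, here is a minimal, hypothetical sketch; the file names, labels, and the choice of MFCC features with an MLP classifier are illustrative assumptions rather than the paper's exact setup, though the Librosa and scikit-learn calls are real APIs.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def extract_features(path: str) -> np.ndarray:
    """Load one audio file and summarize it as a fixed-length feature vector."""
    y, sr = librosa.load(path, sr=22050)                # waveform + sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # (40, frames)
    return mfcc.mean(axis=1)                            # average over time

# Hypothetical training data: paths and labels would come from an emotion corpus
train_files = ["happy_01.wav", "angry_01.wav"]          # placeholder file names
train_labels = ["happy", "angry"]

X = np.stack([extract_features(f) for f in train_files])
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)
clf.fit(X, train_labels)

print(clf.predict([extract_features("unknown.wav")]))   # predicted emotion
```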