CDSD: Chinese Dysarthria Speech Database
We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. The database comprises speech data from 24 participants with dysarthria: each participant recorded one hour of speech, and one of them recorded an additional 10 hours, yielding 34 hours of speech material in total. To accommodate participants with varying cognitive levels, the text pool draws primarily on content from the AISHELL-1 dataset and on speeches by primary and secondary school students. Participants read these texts aloud and were recorded with either a mobile device or a ZOOM F8n multi-track field recorder. In this paper, we describe the data collection and annotation processes and present an approach for establishing a baseline for dysarthric speech recognition. Furthermore, we conducted a speaker-dependent dysarthric speech recognition experiment using the additional 10 hours of speech from one participant. Our findings indicate that, starting from a model trained on extensive data, fine-tuning on a limited quantity of an individual's speech yields strong results in speaker-dependent dysarthric speech recognition. However, we observe significant variation in recognition results across different dysarthric speakers. These insights provide valuable reference points for speaker-dependent dysarthric speech recognition.
Comment: 9 pages, 3 figures
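Mandarin ASR baselines such as this one are conventionally scored with character error rate (CER), i.e. character-level Levenshtein distance normalized by reference length. The paper does not publish its scoring code; the following is a minimal stand-alone sketch, with an illustrative function name and interface:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(ref), len(hyp)
    # dp[j] holds the edit distance between ref[:i] and hyp[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)
```

For Chinese, scoring per character (rather than per word) avoids the ambiguity of word segmentation, which is why CER rather than WER is the standard metric.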
Wavelet-based voice morphing
This paper presents a new multi-scale voice morphing algorithm. The algorithm enables a user to transform one person's speech pattern into another person's, giving it a new identity while preserving the original content. Morphing is performed in separate subbands obtained through wavelet decomposition, and the spectral conversion in each subband is modelled with Radial Basis Function neural networks. Results obtained on the TIMIT speech database demonstrate effective transformation of the speaker identity.
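As a toy illustration of this kind of subband processing (not the paper's algorithm, which uses a full wavelet filter bank and trained RBF-network conversion models), a one-level Haar split reconstructs the signal exactly when each subband is passed through unchanged; the per-subband conversion would replace the identity step:

```python
def haar_analyze(x):
    """One-level Haar split into approximation (a) and detail (d) subbands."""
    a = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return a, d

def haar_synthesize(a, d):
    """Invert haar_analyze; with unmodified subbands this is a perfect
    reconstruction, which is the property morphing algorithms rely on."""
    x = []
    for ai, di in zip(a, d):
        x += [ai + di, ai - di]
    return x
```

In a morphing system, a learned mapping (an RBF network in the paper) would transform the spectral envelope of each subband before synthesis.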
Design of a phonetic corpus for speech recognition in Catalan
In this paper, we present the design of a corpus for speech recognition, to be used for recording a speech database in Catalan. A previous database in Spanish served as the reference for specifying the characteristics of the sentences and the minimum number of units required. An analysis of unit frequencies was carried out in order to determine which units were relevant for training and to compare the results with the figures from the designed corpus. Three different sub-corpora were generated, one for training, ...
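A common way to turn such unit-frequency statistics into a recording script is a greedy set-cover over phonetic units. The sketch below uses character bigrams as a stand-in for the phonetic units analysed in the paper; function names and the unit definition are illustrative, not taken from the corpus design:

```python
def diphones(sent):
    """Character bigrams with '#' as a boundary marker, standing in
    for proper phonetic units such as diphones."""
    s = "#" + sent + "#"
    return [s[i:i + 2] for i in range(len(s) - 1)]

def greedy_select(sentences, k):
    """Greedily pick k sentences, each maximizing newly covered units."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(sentences, key=lambda s: len(set(diphones(s)) - covered))
        chosen.append(best)
        covered |= set(diphones(best))
        sentences = [s for s in sentences if s != best]  # drop the pick
    return chosen, covered
```

Greedy selection is not optimal, but it gives near-complete unit coverage with far fewer sentences than random sampling, which is the practical goal of corpus design.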
EMOVO Corpus: an Italian Emotional Speech Database
This article describes the first emotional corpus for the Italian language, named EMOVO. The database was built from the voices of six actors, each of whom performed 14 sentences simulating six emotional states (disgust, fear, anger, joy, surprise, sadness) plus the neutral state. These emotions are the well-known Big Six found in most of the literature on emotional speech. The recordings were made with professional equipment in the Fondazione Ugo Bordoni laboratories. The paper also describes a subjective validation test of the corpus, based on emotion discrimination of two sentences, carried out by two different groups of 24 listeners. The test was successful, yielding an overall recognition accuracy of 80%. The emotions least easy to recognize are joy and disgust, whereas the easiest to detect are anger, sadness and the neutral state.
Gender classification in two emotional speech databases
Gender classification is a challenging problem with applications in speaker indexing, speaker recognition, speaker diarization, annotation and retrieval of multimedia databases, voice synthesis, smart human-computer interaction, biometrics, and social robots. Although it has been studied for more than thirty years, it is by no means a solved problem, and processing emotional speech to identify the speaker's gender makes the problem even more interesting. A large pool of 1379 features is created, including 605 novel features. A branch-and-bound feature selection algorithm is applied to select a subset of 15 features from the 1379 originally extracted. Support vector machines with various kernels are tested as gender classifiers on two databases, namely the Berlin database of Emotional Speech and the Danish Emotional Speech database. The reported classification results outperform those obtained by state-of-the-art techniques, since a perfect classification accuracy is obtained. © 2008 IEEE
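Branch-and-bound feature selection relies on a class-separability criterion that can be evaluated on any candidate subset. As a hedged, brute-force stand-in for that search (exhaustive enumeration scored by a per-feature Fisher ratio; all names are illustrative and not taken from the paper):

```python
from itertools import combinations

def fisher_ratio(xs_a, xs_b):
    """1-D class separability: squared mean gap over summed variances."""
    ma = sum(xs_a) / len(xs_a)
    mb = sum(xs_b) / len(xs_b)
    va = sum((x - ma) ** 2 for x in xs_a) / len(xs_a)
    vb = sum((x - mb) ** 2 for x in xs_b) / len(xs_b)
    return (ma - mb) ** 2 / (va + vb + 1e-9)

def best_subset(A, B, k):
    """Exhaustively pick the k feature indices with the largest summed
    Fisher ratios; branch-and-bound prunes this same search space."""
    n = len(A[0])
    score = lambda idx: sum(
        fisher_ratio([a[i] for a in A], [b[i] for b in B]) for i in idx)
    return max(combinations(range(n), k), key=score)
```

Exhaustive search over 1379 features is infeasible for k = 15, which is precisely why the paper uses branch-and-bound: with a monotonic criterion, whole branches of the subset tree can be pruned without evaluation.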