Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities
Voice conversion (VC) using sequence-to-sequence learning of context
posterior probabilities is proposed. Conventional VC using shared context
posterior probabilities predicts target speech parameters from the context
posterior probabilities estimated from the source speech parameters. Although
conventional VC can be built from non-parallel data, it is difficult to convert
speaker individuality such as phonetic property and speaking rate contained in
the posterior probabilities because the source posterior probabilities are
directly used for predicting target speech parameters. In this work, we assume
that the training data partly include parallel speech data and propose
sequence-to-sequence learning between the source and target posterior
probabilities. The conversion models perform non-linear and variable-length
transformation from the source probability sequence to the target one. Further,
we propose a joint training algorithm for the modules. In contrast to
conventional VC, which separately trains the speech recognition module that estimates posterior probabilities and the speech synthesis module that predicts target speech parameters, our proposed method jointly trains these modules along with the proposed probability conversion modules. Experimental results demonstrate that our approach outperforms conventional VC.
Comment: Accepted to INTERSPEECH 201
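To make the described pipeline concrete, here is a minimal sketch of a sequence-to-sequence model that maps a source posterior-probability sequence to a target sequence of a different length. The phone-set size, hidden width, and training loss are illustrative assumptions, not the authors' architecture.

```python
# Minimal seq2seq sketch over context posterior sequences (hypothetical
# shapes and hyperparameters; not the paper's exact model).
import torch
import torch.nn as nn

class PosteriorSeq2Seq(nn.Module):
    def __init__(self, num_phones=43, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(num_phones, hidden, batch_first=True)
        self.decoder = nn.GRU(num_phones, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, num_phones)

    def forward(self, src_post, tgt_post):
        # src_post: (batch, T_src, num_phones) source posteriors
        # tgt_post: (batch, T_tgt, num_phones) teacher-forced targets
        _, state = self.encoder(src_post)          # summarize source sequence
        out, _ = self.decoder(tgt_post, state)     # variable-length decoding
        return self.proj(out).log_softmax(dim=-1)  # converted log-posteriors

model = PosteriorSeq2Seq()
src = torch.rand(2, 120, 43).softmax(dim=-1)  # 120 source frames
tgt = torch.rand(2, 100, 43).softmax(dim=-1)  # 100 target frames
log_post = model(src, tgt)
loss = -(tgt * log_post).sum(dim=-1).mean()   # cross-entropy between posteriors
loss.backward()
print(log_post.shape)  # torch.Size([2, 100, 43])
```

Note the decoder consumes the target sequence during training (teacher forcing), which is what allows the non-linear, variable-length mapping between the two probability sequences.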
Are words easier to learn from infant- than adult-directed speech? A quantitative corpus-based investigation
We investigate whether infant-directed speech (IDS) could facilitate word
form learning when compared to adult-directed speech (ADS). To study this, we
examine the distribution of word forms at two levels, acoustic and
phonological, using a large database of spontaneous speech in Japanese. At the
acoustic level we show that, as has been documented before for phonemes, the
realizations of words are more variable and less discriminable in IDS than in
ADS. At the phonological level, we find an effect in the opposite direction:
the IDS lexicon contains more distinctive words (such as onomatopoeias) than
the ADS counterpart. Combining the acoustic and phonological metrics together
in a global discriminability score reveals that the bigger separation of
lexical categories in the phonological space does not compensate for the
opposite effect observed at the acoustic level. As a result, IDS word forms are
still globally less discriminable than ADS word forms, even though the effect
is numerically small. We discuss the implications of these findings for the view that the functional role of IDS is to improve language learnability.
Comment: Draft
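As a rough illustration of what a global discriminability score can look like, the toy metric below contrasts between-category and within-category acoustic distances over word tokens. The silhouette-style formula and the synthetic data are assumptions for illustration only, not the paper's actual measures.

```python
# Toy discriminability score: how well word categories separate in a
# feature space (hypothetical metric, not the paper's exact one).
import numpy as np

def discriminability(tokens, labels):
    """Mean margin between between-category and within-category distances."""
    tokens, labels = np.asarray(tokens), np.asarray(labels)
    margins = []
    for i in range(len(tokens)):
        d = np.linalg.norm(tokens - tokens[i], axis=1)
        same = labels == labels[i]
        same[i] = False                          # exclude the token itself
        within = d[same].mean()                  # spread inside the category
        between = d[labels != labels[i]].mean()  # separation from other words
        margins.append((between - within) / max(between, within))
    return float(np.mean(margins))

rng = np.random.default_rng(0)
# Two word categories; IDS-like tokens get a larger acoustic spread.
ads = np.vstack([rng.normal(0, 0.5, (50, 12)), rng.normal(2, 0.5, (50, 12))])
ids_ = np.vstack([rng.normal(0, 1.0, (50, 12)), rng.normal(2, 1.0, (50, 12))])
y = np.array([0] * 50 + [1] * 50)
print(discriminability(ads, y), ">", discriminability(ids_, y))
```

The higher variability of the IDS-like tokens lowers the score, mirroring the reported finding that IDS word forms are less discriminable acoustically.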
Arabic Speaker-Independent Continuous Automatic Speech Recognition Based on a Phonetically Rich and Balanced Speech Corpus
This paper proposes an efficient framework for the design and development of a
speaker-independent continuous automatic Arabic speech recognition system based on a phonetically rich and balanced
speech corpus. The speech corpus contains a total of 415 sentences recorded by 40 (20 male and 20 female) Arabic native
speakers from 11 different Arab countries representing the three major regions (Levant, Gulf, and Africa) in the Arab world.
The proposed Arabic speech recognition system is based on the Carnegie Mellon University (CMU) Sphinx tools, and the
Cambridge HTK tools were also used at some testing stages. The speech engine uses 3-emitting-state Hidden Markov Models (HMMs) for triphone-based acoustic models. Experimental analysis of about 7 hours of training speech data showed that the best-performing acoustic model uses continuous observation probability densities with 16 Gaussian mixture components, with the state distributions tied to 500 senones. The language model contains both bi-grams and tri-grams. For similar speakers but different sentences, the system obtained a word recognition accuracy of 92.67% and 93.88% and a Word Error Rate (WER) of 11.27% and 10.07% with and without diacritical marks, respectively. For different speakers with similar sentences, the system obtained a word recognition accuracy of 95.92% and 96.29% and a WER of 5.78% and 5.45% with and without diacritical marks, respectively. For different speakers and different sentences, the system obtained a word recognition accuracy of 89.08% and 90.23% and a WER of 15.59% and 14.44% with and without diacritical marks, respectively.
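For reference, WER figures like those reported follow the standard edit-distance definition; a minimal sketch (illustrative, not the evaluation code used in the paper):

```python
# Word Error Rate via word-level edit distance:
# (substitutions + deletions + insertions) / reference length.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat on"))  # 1 insertion / 3 words ≈ 0.33
```

Note that word recognition accuracy and WER need not sum to 100%: insertion errors raise the WER without reducing the count of correctly recognized words, which is why the paired figures above do not add up exactly.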
Automatic Prosodic Segmentation by F0 Clustering Using Superpositional Modeling
In this paper, we propose an automatic method for detecting
accent phrase boundaries in Japanese continuous speech by
using F0 information. In the training phase, hand-labeled accent patterns are parameterized according to the superpositional model proposed by Fujisaki and assigned to clusters by a clustering method, in which accent templates are calculated as the centroids of the clusters. In the segmentation
phase, automatic N-best extraction of boundaries is
performed by One-Stage DP matching between the reference
templates and the target F0 contour. About 90% of accent phrase boundaries were correctly detected in speaker-independent experiments with the ATR Japanese continuous speech database.
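The boundary extraction step can be pictured as dynamic-programming alignment between an accent template and the observed contour. The sketch below uses plain DTW over 1-D log-F0 sequences as a simplified stand-in for the One-Stage DP matching of the paper; the template and contour values are invented.

```python
# DP alignment of a reference F0 template against a target F0 contour
# (simplified stand-in for One-Stage DP matching).
import numpy as np

def dtw_cost(template, contour):
    """Dynamic time warping cost between two 1-D F0 sequences (log Hz)."""
    n, m = len(template), len(contour)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (template[i - 1] - contour[j - 1]) ** 2
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalized matching cost

t = np.log(np.linspace(180, 120, 20))  # falling accent template (Hz)
c = np.log(np.linspace(175, 125, 30))  # observed contour, different length
print(dtw_cost(t, c))
```

In the full method, N-best boundary hypotheses are kept by running such matching against every cluster template along the utterance.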
Modified Minimum Classification Error Learning and Its Application to Neural Networks
A novel method to improve the generalization performance of Minimum Classification Error (MCE) / Generalized Probabilistic Descent (GPD) learning is proposed. MCE/GPD learning, proposed by Juang and Katagiri in 1992, yields better recognition performance than maximum-likelihood (ML) based learning in various areas of pattern recognition. Despite its superior recognition performance, it still suffers, as other learning algorithms do, from over-fitting to the training samples. In the present study, a regularization technique is applied to MCE learning to overcome this problem. Feed-forward neural networks are employed as a recognition platform to evaluate the recognition performance of the proposed method. Recognition experiments are conducted on several kinds of data sets.
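The flavor of the modification can be sketched as the standard MCE smooth loss plus a weight penalty. Here the max-based misclassification measure (the limiting case of the usual softmax-smoothed competitor score) and the penalty weight are illustrative assumptions, not the paper's exact formulation.

```python
# MCE-style smooth loss with an added L2 regularization term against
# over-fitting (penalty weight lam is hypothetical).
import numpy as np

def mce_loss(scores, label, weights, alpha=10.0, lam=1e-3):
    """scores: discriminant g_k(x) per class; label: correct class index."""
    g_true = scores[label]
    g_rivals = np.delete(scores, label)
    # Misclassification measure: best rival score minus true-class score.
    d = np.max(g_rivals) - g_true
    # Smooth 0-1 loss (sigmoid), differentiable for gradient descent.
    loss = 1.0 / (1.0 + np.exp(-alpha * d))
    # Regularizer discouraging large network weights.
    penalty = lam * np.sum(weights ** 2)
    return loss + penalty

w = np.array([0.3, -1.2, 0.7])
print(mce_loss(np.array([2.0, 1.5, 0.1]), label=0, weights=w))
```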
An Evaluation of Target Speech for a Nonaudible Murmur Enhancement System in Noisy Environments
Nonaudible murmur (NAM) is a soft whispered voice recorded with a NAM microphone through body conduction. NAM allows for silent speech communication, as it makes it possible for the speaker to convey a message in a nonaudible voice. However, its intelligibility and naturalness are significantly degraded compared to those of natural speech owing to acoustic changes caused by body conduction. To address this issue, statistical voice conversion (VC) methods from NAM to normal speech (NAM-to-Speech) and to a whispered voice (NAM-to-Whisper) have been proposed. It has been reported that these NAM enhancement methods significantly improve the speech quality and intelligibility of NAM, and that NAM-to-Whisper is more effective than NAM-to-Speech. However, it is still not obvious which method is more effective when a listener hears the enhanced speech in noisy environments, a situation that often arises in silent speech communication. In this paper, assuming a typical situation in which NAM is uttered by a speaker in a quiet environment and conveyed to a listener in noisy environments, we investigate what kinds of target speech are more effective for NAM enhancement. We also propose NAM enhancement methods for converting NAM to other types of target voiced speech. Experiments show that the conversion process into voiced speech is more effective than that into unvoiced speech for generating more intelligible speech in noisy environments.
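The underlying statistical VC step can be pictured as a frame-wise conditional-mean mapping under a joint source-target feature model. The single-Gaussian sketch below is a toy stand-in for the GMM-based mapping such systems use; all dimensions and parameter values are invented for illustration.

```python
# Frame-level sketch of statistical VC: map a NAM feature frame x to a
# target frame via the conditional mean of a joint Gaussian (real systems
# use many-mixture GMMs with dynamic features).
import numpy as np

def convert_frame(x, mu_x, mu_y, S_yx, S_xx):
    """E[y | x] under one joint Gaussian over source/target features."""
    return mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x)

d = 24                   # e.g. mel-cepstral coefficients per frame
mu_x, mu_y = np.zeros(d), np.ones(d)
S_xx = np.eye(d)
S_yx = 0.8 * np.eye(d)   # cross-covariance learned from parallel data
x = np.random.randn(d)   # one NAM feature frame
print(convert_frame(x, mu_x, mu_y, S_yx, S_xx)[:3])
```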
Analysis of Speaker Adaptation Algorithms for HMM-based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm
In this paper we analyze the effects of several factors and configuration choices encountered during training and model construction when the goal is better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR), whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. Here we investigate several major aspects of speaker adaptation: initial models; transform functions; estimation criteria; and the sensitivity of several linear regression adaptation algorithms. Analyzing the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of both gender-dependent models with the use of a single gender-dependent model. Analyzing the effect of the transform functions, we compare a transform function for mean vectors only with one for both mean vectors and covariance matrices. Analyzing the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and examine methods combining MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.
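The constrained linear-regression idea itself is compact: one affine transform adapts the Gaussian means and, in the constrained variant, the same transform also adapts the covariances. A minimal sketch, leaving out the structural-MAP estimation of the transform from adaptation data:

```python
# Constrained linear-regression adaptation of one Gaussian component:
# the same affine transform (A, b) is applied to mean and covariance
# (illustrative; transform estimation is omitted).
import numpy as np

def adapt_gaussian(mu, Sigma, A, b):
    mu_adapted = A @ mu + b          # transformed mean vector
    Sigma_adapted = A @ Sigma @ A.T  # constrained: same A on covariance
    return mu_adapted, Sigma_adapted

d = 4
mu, Sigma = np.zeros(d), np.eye(d)
A = np.eye(d) + 0.1 * np.random.randn(d, d)  # regression matrix
b = 0.05 * np.random.randn(d)                # bias vector
print(adapt_gaussian(mu, Sigma, A, b)[0])
```

Tying one such transform to each node of a regression-class tree, with MAP priors propagated down the tree, is what distinguishes the structural variants compared in the paper.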
Mechanisms of vowel devoicing in Japanese
The processes of vowel devoicing in Standard Japanese were examined with respect
to the phonetic and phonological environments and the syllable structure of Japanese, in
comparison with vowel reduction processes in other languages, in most of which vowel
reduction occurs optionally in fast or casual speech. This thesis examined whether
Japanese vowel devoicing was a phonetic phenomenon caused by glottal assimilation
between a high vowel and its adjacent voiceless consonants, or a more phonologically controlled, compulsory process.
Experimental results showed that Japanese high vowel devoicing must be analysed
separately in two devoicing conditions, namely single and consecutive devoicing
environments. Devoicing was almost compulsory regardless of the presence of
proposed blocking factors, such as the type of preceding consonant, accentuation, and position in an utterance, as long as there was no devoiceable vowel in adjacent morae (the single devoicing condition). However, under consecutive devoicing conditions, blocking
factors became effective and prevented some devoiceable vowels from becoming
voiceless.
The effect of speaking rate was also generally minimal in the single devoicing
condition, but in the consecutive devoicing condition, vowels were devoiced more at faster tempi than at slower tempi, which created many examples of consecutively
devoiced vowels over two morae.
Durational observations found that vowel devoicing involves not only phonatory
change, but also slight durational reduction. However, the shorter duration of devoiced syllables was adjusted at the word level, so that the whole duration of a word with devoiced vowels remained similar to that of the word without devoiced vowels, regardless of
the number of devoiced vowels in the word.
It must be noted that there was no clear-cut distinction between voiced and
devoiced vowels, and the phonetic realisation of a devoiced vowel could vary from
fully voiced to completely voiceless. A high vowel may be voiced in a typical
devoicing environment, but its intensity is significantly weaker than that of vowels in a non-devoicing environment, at all speaking tempi. The mean differences in vowel intensity between these environments were generally larger at faster tempi.
The results implied that even when the vowel was voiced, its production process
moved in favour of devoicing. However, in consecutive devoicing conditions, this
process did not always apply. When some of the devoiceable vowels were devoiced in the consecutive devoicing environment, the intensities of the remaining devoiceable vowels were not significantly lower than those of other vowels.
The results of intensity measurements of voiced vowels in the devoicing and non-devoicing environments suggested that Japanese vowel devoicing was part of an overall process of complex vowel weakening, and that a completely devoiced vowel was the final state of the weakening process. Japanese vowel devoicing is primarily a process of glottal assimilation, but the results in the consecutive devoicing condition showed that this process was constrained by Japanese syllable structure.
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid progress of technology in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from their voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach to speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN), and Radial Basis Function Neural Network (RBF NN), combined with an AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora, and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to classical Mel-Frequency Cepstral Coefficients (MFCC) and reduced the recognition time by 40% compared to the Back-Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM), and Principal Component Analysis (PCA).
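The AND voting scheme itself is simple: an identity is accepted only when every network agrees. A minimal sketch with hypothetical speaker labels standing in for the three networks' outputs:

```python
# AND voting across classifiers: accept an identity only on a unanimous
# vote, otherwise reject (speaker labels here are stand-ins).
def and_vote(predictions):
    """predictions: list of speaker IDs, one per classifier."""
    first = predictions[0]
    return first if all(p == first for p in predictions) else None

print(and_vote(["spk07", "spk07", "spk07"]))  # spk07: unanimous, accepted
print(and_vote(["spk07", "spk03", "spk07"]))  # None: disagreement, rejected
```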
Another novel approach using vowel formant analysis is implemented using Linear Discriminant Analysis (LDA). Vowel-formant-based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage- and time-efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme does not require any training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, but the performance of the proposed score-based methodology stays almost linear.
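A minimal sketch of the formant-plus-LDA idea, with synthetic (F1, F2) values standing in for LPC-extracted formants and two hypothetical speakers:

```python
# Vowel-formant speaker identification with LDA (toy data; real features
# would be extracted with LPC from vowel segments).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# (F1, F2) pairs in Hz for two hypothetical speakers saying the same vowel.
spk_a = rng.normal([700, 1200], 40, size=(20, 2))
spk_b = rng.normal([650, 1350], 40, size=(20, 2))
X = np.vstack([spk_a, spk_b])
y = ["A"] * 20 + ["B"] * 20

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[690, 1210]]))  # -> ['A']
```

Storing only a few formant statistics per speaker is what makes the scheme cheap in both storage and lookup time, as the abstract notes.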
Finally, a novel audio-visual fusion-based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score, and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform feature-level fusion in terms of accuracy and error resilience. This result is in line with the distinct nature of the two modalities, which is lost when they are combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy.
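A minimal sketch of score-level fusion as described: normalized matcher scores from the two modalities are combined with a weighted sum and the best-scoring identity wins. The weight and the scores here are hypothetical.

```python
# Score-level fusion of a voice matcher (GMM/MFCC) and a face matcher
# (PCA); the fusion weight w is a hypothetical tuning parameter.
def fuse_scores(voice_scores, face_scores, w=0.6):
    """Each argument maps speaker ID -> normalized match score in [0, 1]."""
    fused = {spk: w * voice_scores[spk] + (1 - w) * face_scores[spk]
             for spk in voice_scores}
    return max(fused, key=fused.get)  # identity with the best fused score

voice = {"spk01": 0.82, "spk02": 0.55}
face = {"spk01": 0.48, "spk02": 0.71}
print(fuse_scores(voice, face))  # spk01: voice evidence dominates at w=0.6
```

Because each modality is scored independently before fusion, a failure in one matcher degrades the fused score gracefully instead of corrupting a joint feature vector, which matches the reported advantage over feature-level fusion.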