11 research outputs found

    Investigating Stranded GMM for Improving Automatic Speech Recognition

    This paper investigates the recently proposed Stranded Gaussian Mixture acoustic Model (SGMM) for Automatic Speech Recognition (ASR). This model extends the conventional hidden Markov model (HMM-GMM) by explicitly introducing dependencies between components of the observation Gaussian mixture densities. The main objective of the paper is to study experimentally how useful SGMM can be for dealing with data that contain different sources of acoustic variability. The first sources of variability studied are age and gender in a quiet environment (TIdigits task, including child speech). Second, SGMM modeling is applied to data produced by different speakers and corrupted by non-stationary noise (CHiME 2013 challenge data). Finally, SGMM is applied to the same noisy data, but after performing speech enhancement (i.e., the remaining variability mostly comes from residual noise and different speakers). Although SGMM was originally proposed for robust speech recognition of noisy data, this work found that the model is more efficient for handling speaker variability in a quiet environment.
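As a rough illustration of what "dependencies between components" can mean, the sketch below compares a conventional GMM, where each frame picks its mixture component independently, with a chain in which the component at frame t depends on the component at frame t-1. It is a minimal sketch under stated assumptions, not the paper's implementation: it covers a single HMM state with 1-D features, and the names `init_weights` and `trans` (a component-to-component transition matrix) are hypothetical.

```python
import math


def gauss(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_likelihood(frames, weights, means, variances):
    """Conventional GMM: every frame chooses its component independently."""
    total = 1.0
    for x in frames:
        total *= sum(w * gauss(x, m, v) for w, m, v in zip(weights, means, variances))
    return total


def stranded_likelihood(frames, init_weights, trans, means, variances):
    """Chain of components: trans[i][j] is the assumed probability of moving
    from component i at frame t-1 to component j at frame t.
    Assumes `frames` is non-empty."""
    n = len(means)
    # Forward variable over components for the first frame.
    alpha = [init_weights[j] * gauss(frames[0], means[j], variances[j]) for j in range(n)]
    for x in frames[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * gauss(x, means[j], variances[j])
            for j in range(n)
        ]
    return sum(alpha)
```

When every row of `trans` equals the mixture weights, the component choice no longer depends on the previous frame and the two likelihoods coincide, which is one way to see the conventional GMM as a special case of this sketch.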

    Voice Activity Detection Using Deep Neural Network

    13301 Kō No. 4842, Doctor of Engineering, Kanazawa University. Doctoral dissertation summary (abstract). Published in: Eurasip Journal on Audio, Speech and Music Processing 2018(1), pp. 1-15, 2018. Springer International Publishing. Co-authors: Suci Dwijayanti, Kei Yamamori, Masato Miyosh

    Intelligence architectonics methodology: international language training of students

    The intelligence architectonics methodology refers to a model of personification principles aimed at creating conditions for widening the boundaries of a student's language potential in intercultural communication. It builds on emotional balance and comfort, as well as on the development of intellectual capacity and ability through varied activities at all stages of the human speech communication process, within the “chain” of educational programs: at school, at university, and at in-service institutions for specialists based on recurrent education. Given the intricacy of the dynamic communication process and its fundamental importance in human intercultural communication, this survey is intended to provide a comprehensive model of speech dynamics priority for addressing the following issue: how to consider the cultural paradigm that reads “from observation to generalization and replication through cooperation”. Special emphasis is placed on the social significance of language and the cultural mission of educational organizations in modeling the corpus of life-based integrative communicative situations in the security education and information space. In this regard, the main principles of language education via teaching intercultural communication require thorough investigation of the analysis-synthesis activities within dynamic intellect and culture development.

    Speech analysis using very low-dimensional bottleneck features and phone-class dependent neural networks

    The first part of this thesis focuses on very low-dimensional bottleneck features (BNFs), extracted from deep neural networks (DNNs) for speech analysis and recognition. Very low-dimensional BNFs are analysed in terms of their capability of representing speech and their suitability for modelling speech dynamics. Nine-dimensional BNFs obtained from a phone discrimination DNN are shown to give phone recognition accuracy comparable to that of 39-dimensional MFCCs, and an average of 34% higher phone recognition accuracy than formant-based features of the same dimensionality. They also preserve trajectory continuity well and thus hold promise for modelling speech dynamics. Visualisations and interpretations of the BNFs are presented, with phonetically motivated studies of the strategies that DNNs employ to create these features. The relationships between BNF representations resulting from different initialisations of DNNs are explored. The second part of this thesis considers BNFs from the perspective of feature extraction. It is motivated by the observation that different types of speech sounds lend themselves to different acoustic analyses, and that the mapping from spectra-in-context to phone posterior probabilities implemented by the DNN is a continuous approximation to a discontinuous function. This suggests that it may be advantageous to replace the single DNN with a set of phone-class-dependent DNNs. In this case, the appropriate mathematical structure is a manifold. It is shown that this approach leads to significant improvements in frame-level phone classification accuracy.

    An analysis-by-synthesis approach to vocal tract modeling for robust speech recognition

    In this thesis we present a novel approach to speech recognition that incorporates knowledge of the speech production process. The major contribution is the development of a speech recognition system that is motivated by the physical generative process of speech, rather than the purely statistical approach that has been the basis for virtually all current recognizers. We follow an analysis-by-synthesis approach. We begin by attributing a physical meaning to the inner states of the recognition system, relating them to the configurations the human vocal tract takes over time. We utilize a geometric model of the vocal tract, adapt it to our speakers, and derive realistic vocal tract shapes from electromagnetic articulograph (EMA) measurements in the MOCHA database. We then synthesize speech from the vocal tract configurations using a physiologically motivated articulatory synthesis model of speech generation. Finally, the observation probability of the Hidden Markov Model (HMM) used for phone classification is a function of the distortion between the speech synthesized from the vocal tract configurations and the real speech. The output of each state in the HMM is based on a mixture of density functions.

    Dynamical models for neonatal intensive care monitoring

    The vital signs monitoring data of an infant receiving intensive care are a rich source of information about its health condition. One major concern about the state of health of such patients is the onset of neonatal sepsis, a life-threatening bloodstream infection. As early signs are subtle and current diagnosis procedures involve slow laboratory testing, sepsis detection based on the monitored physiological dynamics is a clinically significant task. This challenging problem can be thoroughly modelled as real-time inference within a machine learning framework. In this thesis, we develop probabilistic dynamical models centred around the goal of providing useful predictions about the onset of neonatal sepsis. This research is characterised by the careful incorporation of domain knowledge for the purpose of extracting the infant’s true physiology from the monitoring data. We make two main contributions. The first is the formulation of sepsis detection as learning and inference in an Auto-Regressive Hidden Markov Model (AR-HMM). The model investigates the extent to which physiological events observed in the patient’s monitoring traces could be used for the early detection of neonatal sepsis. In addition, the proposed approach involves exact marginalisation over missing data at inference time. When applying the AR-HMM to a real-world dataset, we found that it can produce effective predictions about the onset of sepsis. Second, both sepsis and clinical event detection are formulated as learning and inference in a Hierarchical Switching Linear Dynamical System (HSLDS). The HSLDS models dynamical systems where complex interactions between modes of operation can be represented as a two-level hidden discrete hierarchical structure. For neonatal condition monitoring, the lower layer models clinical events and is controlled by upper-layer variables with sepsis/non-sepsis semantics. The model parameterisation and estimation procedures are adapted to the specifics of physiological monitoring data. We demonstrate that the performance of the HSLDS for the detection of sepsis is not statistically different from that of the AR-HMM, despite the fact that the latter model is given “ground truth” annotations of the patient’s physiology.
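To make "inference in an AR-HMM" concrete, the following is a minimal sketch of forward filtering in a generic auto-regressive HMM. It is not the thesis's model: it assumes scalar observations, first-order AR emissions of the form x_t ~ N(c_k + a_k * x_{t-1}, var_k) for hidden state k, and no missing data. The function name `ar_hmm_filter` and the parameter layout are hypothetical.

```python
import math


def gauss(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def ar_hmm_filter(obs, init, trans, ar_params):
    """Filtered state posteriors P(s_t | x_0..x_t) for t = 1..len(obs)-1.

    obs       -- list of scalar observations; obs[0] is only conditioned on
    init      -- prior over the hidden state at t = 1
    trans     -- trans[i][j] = P(s_t = j | s_{t-1} = i)
    ar_params -- one (a, c, var) tuple per state (assumed AR(1) emission)
    """
    n = len(init)
    posteriors = []
    belief = list(init)
    for t in range(1, len(obs)):
        if t > 1:
            # Predict: push the previous posterior through the transition matrix.
            belief = [sum(belief[i] * trans[i][j] for i in range(n)) for j in range(n)]
        # Update: weight each state by its AR(1) likelihood of the new sample.
        joint = [
            belief[k] * gauss(obs[t], ar_params[k][1] + ar_params[k][0] * obs[t - 1], ar_params[k][2])
            for k in range(n)
        ]
        norm = sum(joint)
        belief = [p / norm for p in joint]
        posteriors.append(belief)
    return posteriors
```

For example, with one state describing smooth dynamics (`a = 0.9`, small variance) and another describing erratic dynamics (`a = 0.0`, large variance), a slowly decaying trace such as `[1.0, 0.9, 0.8, 0.75]` puts most of the filtered posterior mass on the smooth state.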

    The application of continuous state HMMs to an automatic speech recognition task

    Hidden Markov Models (HMMs) have been a popular choice for automatic speech recognition (ASR) for several decades due to their mathematical formulation and computational efficiency, which have consistently resulted in better performance compared to other methods during this period. However, HMMs are based on the assumption of statistical independence among speech frames, which conflicts with the physiological basis of speech production. Consequently, researchers have produced a substantial amount of literature to extend the HMM model assumptions and incorporate dynamic properties of speech into the underlying model. One such approach involves segmental models, which address the frame-wise independence assumption. However, the computational inefficiencies associated with segmental models have limited their practical application. In recent years, there has been a shift from HMM-based systems to neural networks (NN) and deep learning approaches, which offer superior performance compared to conventional statistical models. However, as the complexity of neural models increases, so does the number of parameters involved, requiring a greater dependency on training data to optimise model parameters. The present study extends prior research on segmental HMMs by introducing a Segmental Continuous-State Hidden Markov Model (CSHMM) that examines a resolution to the issue of inter-segmental continuity. This is an alternative to contemporary speech modelling methods that rely on data-centric NN techniques, with the goal of establishing a statistical model that more accurately reflects the speech production process. The Continuous-State Segmental model offers a flexible mathematical framework that can impose a continuity constraint between adjoining segments, addressing a fundamental drawback of conventional HMMs, namely the independence assumption. The CSHMM also benefits from a practical training and decoding algorithm that overcomes the computational inefficiency inherent in conventional decoding algorithms for traditional Segmental HMMs. This study has formulated four trajectory-based segmental models using a CSHMM framework. CSHMMs have not been extensively studied for ASR tasks due to the absence of open-source standardised speech toolkits that enable convenient exploration of CSHMMs. As a result, to perform sufficient experiments in this study, training and decoding software has been developed, which can be accessed in (Seivwright, 2015). The experiments in this study report baseline phone recognition results for the four distinct Segmental CSHMM systems using the TIMIT database. These baseline results are compared against a simple Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) system. In all experiments, a compact acoustic feature representation in the form of bottleneck features (BNFs) is employed, motivated by an investigation into the BNFs and their relationship to articulatory properties. Although the proposed CSHMM systems do not surpass discrete-state HMMs in performance, this research has demonstrated a strong association between inter-segmental continuity and the corresponding phonetic categories being modelled. Furthermore, this thesis presents a method for achieving finer control over continuity between segments, which can be expanded to investigate co-articulation in the context of CSHMMs.