8 research outputs found

    Volterra Series for Analyzing MLP based Phoneme Posterior Probability Estimator

    Get PDF
    We present a framework to apply Volterra series to analyze multilayered perceptrons trained to estimate the posterior probabilities of phonemes in automatic speech recognition. The identified Volterra kernels reveal the spectro-temporal patterns that are learned by the trained system for each phoneme. To demonstrate the applicability of Volterra series, we analyze a multilayered perceptron trained using Mel filter bank energy features and analyze its first order Volterra kernels

    Template-based ASR using Posterior features and synthetic references: comparing different TTS systems

    Get PDF
    In recent works, the use of phone class-conditional posterior probabilities (posterior features) directly as features provided successful results in template-based ASR systems. In this paper, motivated by the high quality of current text-to-speech systems and the robustness of posterior features toward undesired variability, we investigate the use of synthetic speech to generate reference templates. The use of synthetic speech in template-based ASR not only allows to address the issue of in-domain data collection but also expansion of vocabulary. On 75- and 600-word task-independent and speaker-independent setup of Phonebook corpus, we show the feasibility of this approach by investigating different synthetic voices produced by HTS-based synthesizer trained on two different databases. Our study shows that synthetic speech templates can yield performance comparable to the natural speech templates, especially with synthetic voices that have high intelligibility

    Computational Models of Representation and Plasticity in the Central Auditory System

    Get PDF
    The performance for automated speech processing tasks like speech recognition and speech activity detection rapidly degrades in challenging acoustic conditions. It is therefore necessary to engineer systems that extract meaningful information from sound while exhibiting invariance to background noise, different speakers, and other disruptive channel conditions. In this thesis, we take a biomimetic approach to these problems, and explore computational strategies used by the central auditory system that underlie neural information extraction from sound. In the first part of this thesis, we explore coding strategies employed by the central auditory system that yield neural responses that exhibit desirable noise robustness. We specifically demonstrate that a coding strategy based on sustained neural firings yields richly structured spectro-temporal receptive fields (STRFs) that reflect the structure and diversity of natural sounds. The emergent receptive fields are comparable to known physiological neuronal properties and can be employed as a signal processing strategy to improve noise invariance in a speech recognition task. Next, we extend the model of sound encoding based on spectro-temporal receptive fields to incorporate the cognitive effects of selective attention. We propose a framework for modeling attention-driven plasticity that induces changes to receptive fields driven by task demands. We define a discriminative cost function whose optimization and solution reflect a biologically plausible strategy for STRF adaptation that helps listeners better attend to target sounds. Importantly, the adaptation patterns predicted by the framework have a close correspondence with known neurophysiological data. We next generalize the framework to act on the spectro-temporal dynamics of task-relevant stimuli, and make predictions for tasks that have yet to be experimentally measured. We argue that our generalization represents a form of object-based attention, which helps shed light on the current debate about auditory attentional mechanisms. Finally, we show how attention-modulated STRFs form a high-fidelity representation of the attended target, and we apply our results to obtain improvements in a speech activity detection task. Overall, the results of this thesis improve our general understanding of central auditory processing, and our computational frameworks can be used to guide further studies in animal models. Furthermore, our models inspire signal processing strategies that are useful for automated speech and sound processing tasks

    "Can you hear me now?":Automatic assessment of background noise intrusiveness and speech intelligibility in telecommunications

    Get PDF
    This thesis deals with signal-based methods that predict how listeners perceive speech quality in telecommunications. Such tools, called objective quality measures, are of great interest in the telecommunications industry to evaluate how new or deployed systems affect the end-user quality of experience. Two widely used measures, ITU-T Recommendations P.862 âPESQâ and P.863 âPOLQAâ, predict the overall listening quality of a speech signal as it would be rated by an average listener, but do not provide further insight into the composition of that score. This is in contrast to modern telecommunication systems, in which components such as noise reduction or speech coding process speech and non-speech signal parts differently. Therefore, there has been a growing interest for objective measures that assess different quality features of speech signals, allowing for a more nuanced analysis of how these components affect quality. In this context, the present thesis addresses the objective assessment of two quality features: background noise intrusiveness and speech intelligibility. The perception of background noise is investigated with newly collected datasets, including signals that go beyond the traditional telephone bandwidth, as well as Lombard (effortful) speech. We analyze listener scores for noise intrusiveness, and their relation to scores for perceived speech distortion and overall quality. We then propose a novel objective measure of noise intrusiveness that uses a sparse representation of noise as a model of high-level auditory coding. The proposed approach is shown to yield results that highly correlate with listener scores, without requiring training data. With respect to speech intelligibility, we focus on the case where the signal is degraded by strong background noises or very low bit-rate coding. Considering that listeners use prior linguistic knowledge in assessing intelligibility, we propose an objective measure that works at the phoneme level and performs a comparison of phoneme class-conditional probability estimations. The proposed approach is evaluated on a large corpus of recordings from public safety communication systems that use low bit-rate coding, and further extended to the assessment of synthetic speech, showing its applicability to a large range of distortion types. The effectiveness of both measures is evaluated with standardized performance metrics, using corpora that follow established recommendations for subjective listening tests

    Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics

    Full text link
    Three recent breakthroughs due to AI in arts and science serve as motivation: An award winning digital image, protein folding, fast matrix multiplication. Many recent developments in artificial neural networks, particularly deep learning (DL), applied and relevant to computational mechanics (solid, fluids, finite-element technology) are reviewed in detail. Both hybrid and pure machine learning (ML) methods are discussed. Hybrid methods combine traditional PDE discretizations with ML methods either (1) to help model complex nonlinear constitutive relations, (2) to nonlinearly reduce the model order for efficient simulation (turbulence), or (3) to accelerate the simulation by predicting certain components in the traditional integration methods. Here, methods (1) and (2) relied on Long-Short-Term Memory (LSTM) architecture, with method (3) relying on convolutional neural networks. Pure ML methods to solve (nonlinear) PDEs are represented by Physics-Informed Neural network (PINN) methods, which could be combined with attention mechanism to address discontinuous solutions. Both LSTM and attention architectures, together with modern and generalized classic optimizers to include stochasticity for DL networks, are extensively reviewed. Kernel machines, including Gaussian processes, are provided to sufficient depth for more advanced works such as shallow networks with infinite width. Not only addressing experts, readers are assumed familiar with computational mechanics, but not with DL, whose concepts and applications are built up from the basics, aiming at bringing first-time learners quickly to the forefront of research. History and limitations of AI are recounted and discussed, with particular attention at pointing out misstatements or misconceptions of the classics, even in well-known references. Positioning and pointing control of a large-deformable beam is given as an example.Comment: 275 pages, 158 figures. Appeared online on 2023.03.01 at CMES-Computer Modeling in Engineering & Science

    Models and analysis of vocal emissions for biomedical applications

    Get PDF
    This book of Proceedings collects the papers presented at the 4th International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2005, held 29-31 October 2005, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments, in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies
    corecore