
    Intelligibility model optimisation approaches for speech pre-enhancement

    The goal of improving the intelligibility of broadcast speech is being addressed by a recent direction in speech enhancement: near-end intelligibility enhancement. In contrast to the conventional speech enhancement approach, which processes the corrupted speech at the receiver side of the communication chain, the near-end intelligibility enhancement approach pre-processes the clean speech at the transmitter side, i.e. before it is played into the environmental noise. In this work, we describe an optimisation-based approach to near-end intelligibility enhancement that uses models of speech intelligibility to improve the intelligibility of speech in noise. This thesis first presents a survey of speech intelligibility models and of how adverse acoustic conditions affect the intelligibility of speech. The purpose of this survey is to identify models that we can adopt in the design of the pre-enhancement system. We then investigate the strategies humans use to increase speech intelligibility in noise and relate these strategies to existing algorithms for near-end intelligibility enhancement. A closed-loop feedback approach to near-end intelligibility enhancement is then introduced. In this framework, speech modifications are guided by a model of intelligibility. For the closed-loop system to work, we develop a simple spectral modification strategy that modifies the first few coefficients of an auditory cepstral representation so as to maximise an intelligibility measure. We experiment with two contrasting measures of objective intelligibility. The first, used as a baseline, is an audibility measure named 'glimpse proportion', computed as the proportion of the spectro-temporal representation of the speech signal that is free from masking. We then propose a discriminative intelligibility model, building on the principles of missing-data speech recognition, to model the likelihood of specific phonetic confusions that may occur when speech is presented in noise.
    The discriminative intelligibility measure is computed using a statistical model of speech from the speaker whose speech is to be enhanced. Interim results showed that, unlike the glimpse-proportion-based system, the discriminative system did not improve intelligibility. On investigating the reason, we found that the discriminative system was unable to target the phonetic confusions with a fixed spectral shaping. To address this, we introduce a time-varying spectral modification. We also propose to perform the optimisation on a segment-by-segment basis, which makes the solution robust to fluctuating noise. We further combine our system with a noise-independent enhancement technique, dynamic range compression. We found a significant improvement in the non-stationary noise condition, but no significant difference from the state-of-the-art system (spectral shaping and dynamic range compression) was found in the stationary noise condition.
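    The glimpse proportion measure described above can be sketched in a few lines: a spectro-temporal cell counts as a glimpse when the local speech-to-noise ratio exceeds a threshold (3 dB is a commonly used value in glimpsing models). The function name, threshold default, and toy inputs below are illustrative, not the thesis implementation, which operates on an auditory spectro-temporal representation.

    ```python
    import numpy as np

    def glimpse_proportion(speech_db, noise_db, threshold_db=3.0):
        """Fraction of spectro-temporal cells where the local SNR exceeds
        a threshold, i.e. where the speech is 'glimpsed' above the masker.
        Inputs are time-frequency energy maps in dB, e.g. from a
        gammatone filterbank (bands x frames)."""
        glimpsed = (speech_db - noise_db) > threshold_db
        return float(glimpsed.mean())

    # Toy example with random excitation patterns: 34 bands x 100 frames.
    rng = np.random.default_rng(0)
    speech = rng.normal(60.0, 10.0, size=(34, 100))
    noise = rng.normal(60.0, 10.0, size=(34, 100))
    gp = glimpse_proportion(speech, noise)
    ```

    A pre-enhancement loop of the kind described above would adjust the speech representation and recompute such a measure until it is maximised under an energy constraint.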

    Investigating supra-intelligibility aspects of speech

    Synthetic and recorded speech form a large part of our everyday listening experience, and much of our exposure to these forms of speech occurs in potentially noisy settings such as on public transport, in the classroom or workplace, while driving, and in our homes. Optimising speech output so that salient information is received both correctly and effortlessly is a central concern for designers of applications that use the speech modality. Most of the work on adapting speech output to challenging listening conditions has focused on intelligibility, and specifically on enhancing intelligibility by modifying speech prior to presentation. However, the quality of the generated speech is not always satisfying for the recipient, which can lead to fatigue or to reluctance to use this communication modality. Consequently, a sole focus on intelligibility enhancement provides an incomplete picture of a listener's experience, since the effect of modified or synthetic speech on other characteristics risks being ignored. These concerns motivate the study of 'supra-intelligibility' factors such as the additional cognitive demand that modified speech may impose upon listeners, as well as quality, naturalness, distortion and pleasantness. This thesis reports on an investigation into two supra-intelligibility factors: listening effort and listener preferences. Differences in listening effort across four speech types (plain natural, Lombard, algorithmically-enhanced, and synthetic speech) were measured using existing methods, including pupillometry, subjective judgements, and intelligibility scores. To explore the effects of speech features on listener preferences, a new tool, SpeechAdjuster, was developed. SpeechAdjuster allows the manipulation of virtually any aspect of speech and supports the joint elicitation of listener preferences and intelligibility measures.
    The tool reverses the roles of listener and experimenter by giving listeners direct control of speech characteristics in real time. Several experiments exploring the effects of speech properties on listening preferences and intelligibility were conducted using SpeechAdjuster. Participants were permitted to change a speech feature during an open-ended adjustment phase, followed by a test phase in which they identified speech presented with the feature value selected at the end of the adjustment phase. Experiments with native normal-hearing listeners measured the consequences of allowing listeners to change speech rate, fundamental frequency, and other features that led to spectral energy redistribution. Speech stimuli were presented in both quiet and masked conditions. Results revealed that listeners prefer feature modifications similar to those observed in naturally modified speech in noise (Lombard speech). Further, Lombard speech required the least listening effort compared with plain natural, algorithmically-enhanced, or synthetic speech. For stationary noise, as the noise level increased listeners chose slower speech rates and flatter spectral tilts compared with the original speech. Only the choice of fundamental frequency was not consistent with that observed in Lombard speech. It is possible that features such as fundamental frequency that talkers naturally modify are by-products of the speech type (e.g. hyperarticulated speech) and might not be advantageous for the listener. Findings suggest that listener preferences provide information about the processing of speech over and above that measured by intelligibility. One of the listeners' concerns was to maximise intelligibility.
    In noise, listeners preferred the feature values for which more information survived masking, choosing speech rates that contrasted with the modulation rate of the masker, or modifications that shifted the concentration of spectral energy to frequencies higher than those of the masker. For all features modified by listeners, preferences were evident even when intelligibility was at or close to ceiling. Such preferences might result from a desire to reduce the cognitive effort of understanding speech, to reproduce the sound of typical speech features experienced in real-world noisy conditions, or to optimise the quality of the modified signal. Investigation of supra-intelligibility aspects of speech promises to improve the quality of speech enhancement algorithms, bringing with it the potential to reduce the effort of understanding artificially modified or generated forms of speech.

    Context-aware speech synthesis: A human-inspired model for monitoring and adapting synthetic speech

    The aim of this PhD thesis is to describe the development of a computational model for speech synthesis that mimics the behaviour of human speakers when they adapt their production to their communicative conditions. The PhD project was motivated by the observed differences between state-of-the-art synthetic speech and human production. In particular, synthesiser output does not exhibit any adaptation to the communicative context, such as environmental disturbances, the listener's needs, or the meaning of the speech content, as human speech does. Standard synthesisers perform no evaluation to check whether their output is suitable for the communication requirements. Inspired by Lindblom's Hyper and Hypo articulation (H&H) theory of speech production, a computational model of hyper- and hypo-articulation (C2H) is proposed. This novel computational model for automatic speech production is designed to monitor its output and to control the effort involved in generating the synthetic speech. Speech transformations are based on the hypothesis that low-effort attractors for the human speech production system can be identified: acoustic configurations close to the minimum effort that a speaker can make in speech production. Interpolation and extrapolation along the key dimension of hypo/hyper-articulation can be motivated by energetic considerations of phonetic contrast. Fully reactive speech synthesis is enabled by adding a negative perception feedback loop to the speech production chain, which constantly assesses the communicative effectiveness of the proposed adaptation. The distance to the original communicative intent is the control signal that drives the speech transformations. A hidden Markov model (HMM)-based speech synthesiser, along with continuous adaptation of its statistical models, is used to implement the C2H model.
    A standard version of the synthesis software does not allow speech to be transformed during parameter generation. Therefore, the generation algorithm of one of the most well-known speech synthesis frameworks, the HMM/DNN-based speech synthesis framework (HTS), is modified. A short-time implementation of the speech intelligibility index (SII), the extended speech intelligibility index (eSII), is chosen as the main perception measure in the feedback loop that controls the transformation. The effectiveness of the proposed model is tested through acoustic analysis and objective and subjective evaluations. A key assessment is the control of speech clarity in noisy conditions, and the similarity between the emerging modifications and human behaviour. Two objective scoring methods are used to assess the speech intelligibility of the implemented system: the speech intelligibility index (SII) and an index based upon the Dau measure (Dau). Results indicate that the intelligibility of C2H-generated speech can be continuously controlled. The effectiveness of reactive speech synthesis and of the phonetic-contrast-motivated transforms is confirmed by the acoustic and objective results. More precisely, for the maximum-strength hyper-articulation transformations, the improvement over non-adapted speech is above 10% for all intelligibility indices and tested noise conditions.
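    As a rough illustration of the SII-style measures used as control signals here, the core of the SII is a band-importance-weighted sum of per-band audibility. The sketch below assumes the standard linear mapping of band SNR from -15 dB to +15 dB onto [0, 1] and omits the ANSI S3.5 corrections (spread of masking, level distortion) as well as the short-time windowing that distinguishes the eSII.

    ```python
    import numpy as np

    def sii(band_snr_db, band_importance):
        """Simplified Speech Intelligibility Index: each band's SNR is
        mapped linearly from [-15, +15] dB onto an audibility value in
        [0, 1], then weighted by the (normalised) band-importance
        function. Illustrative sketch only."""
        audibility = np.clip((np.asarray(band_snr_db) + 15.0) / 30.0, 0.0, 1.0)
        w = np.asarray(band_importance, dtype=float)
        return float(np.sum(w / w.sum() * audibility))
    ```

    An eSII-style measure would evaluate this per short-time frame and average, which is what allows a feedback loop to react to fluctuating noise.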

    A Process for the Restoration of Performances from Musical Errors on Live Progressive Rock Albums

    In the course of my practice of producing live progressive rock albums, a significant challenge has emerged: how to repair performance errors while retaining the intended expressive performance. Using a practice-as-research methodology, I develop a novel process, Error Analysis and Performance Restoration (EAPR), to restore a performer's intention where an error is assessed to have been made. In developing this process, within the context of my practice, I investigate the nature of live albums and the groups to which I am accountable, a definition of performance errors, an examination of their causes, and the existing literature on these topics. In presenting EAPR, I demonstrate, drawing from existing research, a mechanism by which originally intended performances can be extracted from recorded errors. The EAPR process exists as a conceptual model; each album has a specific implementation to address the needs of that album and the currently available technology, and restoration techniques are developed as part of this implementation. EAPR is developed and demonstrated through my work restoring performances on a front-line commercial live release, the Creative Submission Album. The specific EAPR implementation I designed for it is laid out, and detailed examples of its techniques are demonstrated.

    Variation and change in the vowel system of Tyneside English

    This thesis presents a variationist account of phonological variation and change in the vowel system of Tyneside English. The distributions of the phonetic exponents of five vowel variables are assessed with respect to the social variables sex, age and social class. Using a corpus of conversational and word-list material, for which 32 speakers of Tyneside English were recorded, between 30 and 40 tokens per speaker of the variables (i), (u), (e), (o) and (3) were transcribed impressionistically and subclassified by following phonological context. The results of this analysis are significant on several counts. First, the speakers sampled appear to differentiate themselves within the speech community through the variable use of certain socially marked phonetic variants, which can be correlated with the sex, age and class variables. Second, the speakers style-shift to a greater or lesser degree according to combinations of the three social factors, such that surface variability is reduced as a function of increased formality. Third, the overall pattern among the sample population seems to be one of increasing uniformity or convergence: it is speculated that social mobility among upper working- and lower middle-class groups may lead to accent levelling, whereby local speech forms are supplanted by supra-local or innovative intermediate ones. That is, the patterns observed here may be indicative of change in progress. Last, a comparison of the results for the (phonologically) paired variables (i u) and (e o) shows a strong tendency for Tyneside speakers to use these 'symmetrically', in that the choice of variant in one variable predicts the choice of variant in the other. It is suggested that this symmetry in the system is exploited by Tyneside speakers for the purpose of indicating social affiliation and identity, and is in this sense an extra sociolinguistic resource upon which speakers can draw.
    In addition, the variants of (3) are discussed with reference to the reported merger of this variable with (a); it is suggested that the apparent 'unmerging' of these two classes is unproblematic from a structural point of view, as the putative (3)-(o) merger appears never to have been completed. (UK Economic and Social Research Council, award number R00429524350.)

    Sonic interactions in virtual environments

    This book tackles the design of 3D spatial interactions from an audio-centred, audio-first perspective, providing the fundamental notions related to the creation and evaluation of immersive sonic experiences. The key elements that enhance the sensation of place in a virtual environment (VE) are: immersive audio, i.e. the computational aspects of the acoustical-space properties of virtual reality (VR) technologies; sonic interaction, i.e. the human-computer interplay through auditory feedback in VEs; and VR systems, which naturally support multimodal integration and impact different application domains. Sonic Interactions in Virtual Environments features state-of-the-art research on real-time auralization, sonic interaction design in VR, quality of experience in multimodal scenarios, and applications. Contributors and editors include interdisciplinary experts from the fields of computer science, engineering, acoustics, psychology, design, humanities, and beyond. Their mission is to shape an emerging field of study at the intersection of sonic interaction design and immersive media, embracing an archipelago of existing research spread across different audio communities, and to raise awareness among VR researchers and practitioners of the importance of sonic elements when designing immersive environments.

    Augmentative communication device design, implementation and evaluation

    The ultimate aim of this thesis was to design and implement an advanced software-based Augmentative Communication Device (ACD), or Voice Output Communication Aid (VOCA), for non-vocal learning-disabled individuals by applying current psychological models, theories, and experimental techniques. By taking account of potential users' cognitive and linguistic abilities, a symbol-based device (Easy Speaker) was produced which outputs naturalistic digitised human speech and sound and makes use of a photorealistic symbol set. In order to increase the size of the available symbol set, a hypermedia-style dynamic-screen approach was employed, and the relevance of the hypermedia metaphor in relation to models of knowledge representation and language processing was explored. Laboratory-based studies suggested that potential users could learn to operate the software productively, and became faster and more efficient over time when performing set conversational tasks. Studies with unimpaired individuals supported the notion that digitised speech was less cognitively demanding to decode, or listen to. With highly portable, touch-based, PC-compatible systems beginning to appear, it is hoped that the otherwise silent will be able to use the software as their primary means of communication with the speaking world. Extensive field trials over a six-month period with a prototype device, conducted in collaboration with users' caregivers, strongly suggested this might be the case. Off-device improvements were also noted, suggesting that Easy Speaker, or similar software, has the potential to be used as a communication training tool; such training would be likely to improve overall communicative effectiveness. To conclude, a model for successful ACD development is proposed.