11,258 research outputs found

    Automatic generation of audio content for open learning resources

    This paper describes how digital talking books (DTBs) with embedded functionality for learners can be generated from content structured according to the OU OpenLearn schema. It includes examples showing how a software transformation developed from open source components can be used to remix OpenLearn content, and discusses issues concerning the generation of synthesised speech for educational purposes. Factors which may affect the quality of a learner's experience with open educational audio resources are identified, and, in conclusion, plans for testing the effect of these factors are outlined.
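
    As a rough illustration of the kind of transformation described above, the sketch below walks an OpenLearn-style XML unit and hands each section to a text-to-speech engine. The element names (Section, Title, Paragraph) and the choice of pyttsx3 as the synthesiser are assumptions made for illustration, not the authors' actual schema or toolchain.

    import xml.etree.ElementTree as ET
    import pyttsx3

    def unit_to_audio(xml_path, out_prefix):
        """Synthesise one audio file per section of a course unit (sketch)."""
        root = ET.parse(xml_path).getroot()
        engine = pyttsx3.init()
        for i, section in enumerate(root.iter("Section")):        # hypothetical tag name
            title = section.findtext("Title", default="")          # hypothetical tag name
            paragraphs = [p.text or "" for p in section.iter("Paragraph")]  # hypothetical tag name
            narration = ". ".join([title] + paragraphs)
            engine.save_to_file(narration, f"{out_prefix}_{i:02d}.wav")
        engine.runAndWait()  # flush all queued synthesis jobs to disk

    unit_to_audio("openlearn_unit.xml", "dtb_section")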

    A silent speech system based on permanent magnet articulography and direct synthesis

    In this paper we present a silent speech interface (SSI) system aimed at restoring speech communication for individuals who have lost their voice due to laryngectomy or diseases affecting the vocal folds. In the proposed system, articulatory data captured from the lips and tongue using permanent magnet articulography (PMA) are converted into audible speech using a speaker-dependent transformation learned from simultaneous recordings of PMA and audio signals acquired before laryngectomy. The transformation is represented using a mixture of factor analysers, which is a generative model that allows us to efficiently model non-linear behaviour and perform dimensionality reduction at the same time. The learned transformation is then deployed during normal usage of the SSI to restore the acoustic speech signal associated with the captured PMA data. The proposed system is evaluated using objective quality measures and listening tests on two databases containing PMA and audio recordings for normal speakers. Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step. The SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable, but problems remain in scaling up the process to function consistently for phonetically rich vocabularies.
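
    The paper's mapping is a mixture of factor analysers learned on parallel PMA and audio frames; the sketch below substitutes a full-covariance Gaussian mixture fitted on joint [PMA; acoustic] vectors and converts new PMA frames by the usual conditional-expectation rule. It is a simplified stand-in for the method described above, with feature dimensions and component counts chosen arbitrarily.

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    def train_joint_gmm(pma, acoustic, n_components=8):
        """Fit a joint-density GMM on parallel (frames x dims) PMA and acoustic features."""
        joint = np.hstack([pma, acoustic])
        return GaussianMixture(n_components, covariance_type="full").fit(joint)

    def pma_to_acoustic(gmm, pma, d_x):
        """Minimum mean-square-error conversion: E[y | x] under the joint GMM."""
        n, d_y = len(pma), gmm.means_.shape[1] - d_x
        resp = np.zeros((n, gmm.n_components))
        cond = []
        for k in range(gmm.n_components):
            mu_x, mu_y = gmm.means_[k, :d_x], gmm.means_[k, d_x:]
            S = gmm.covariances_[k]
            S_xx, S_xy = S[:d_x, :d_x], S[:d_x, d_x:]
            # responsibility of component k given the PMA frame only
            resp[:, k] = gmm.weights_[k] * multivariate_normal.pdf(pma, mu_x, S_xx)
            # per-component conditional mean E[y | x, k]
            A = np.linalg.solve(S_xx, S_xy).T              # equals S_yx @ inv(S_xx)
            cond.append(mu_y + (pma - mu_x) @ A.T)
        resp /= resp.sum(axis=1, keepdims=True)
        est = np.zeros((n, d_y))
        for k in range(gmm.n_components):
            est += resp[:, k:k + 1] * cond[k]
        return est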

    Speech vocoding for laboratory phonology

    Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing and, in broader terms, relations between the abstract and physical structures of a speech signal. Our goal is to take a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP, the most compact phonological speech representation, performs comparably to the systems with a higher number of phonological features. The parametric TTS based on phonological speech representation, and trained from an unlabelled audiobook in an unsupervised manner, achieves 85% of the intelligibility of state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models, and on the other hand, the speech processing community may utilise the concepts developed for the theoretical phonological models to improve current state-of-the-art applications.
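
    To make the notion of a featural representation concrete, the toy sketch below maps a handful of phones to binary SPE-style feature vectors and stacks a phone sequence into the kind of feature matrix a phonological vocoder or TTS front end would condition on. The feature inventory and values are illustrative only and do not reproduce the GP, SPE or eSPE sets used in the paper.

    import numpy as np

    SPE_FEATURES = ["voiced", "nasal", "high", "back", "continuant"]

    PHONE_TO_FEATURES = {          # illustrative values only
        "p": [0, 0, 0, 0, 0],
        "m": [1, 1, 0, 0, 0],
        "s": [0, 0, 0, 0, 1],
        "i": [1, 0, 1, 0, 1],
        "u": [1, 0, 1, 1, 1],
    }

    def phones_to_matrix(phones):
        """Stack per-phone feature vectors into a (frames x features) matrix."""
        return np.array([PHONE_TO_FEATURES[p] for p in phones], dtype=float)

    print(phones_to_matrix(["m", "i", "s", "u"]))   # 4 x 5 binary feature matrix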

    Animation of a hierarchical image based facial model and perceptual analysis of visual speech

    In this thesis a hierarchical image-based 2D talking head model is presented, together with robust automatic and semi-automatic animation techniques, and a novel perceptual method for evaluating visual speech based on the McGurk effect. The novelty of the hierarchical facial model stems from the fact that sub-facial areas are modelled individually. To produce a facial animation, animations for a set of chosen facial areas are first produced, either by key-framing sub-facial parameter values or by using a continuous input speech signal, and then combined into a full facial output. Modelling hierarchically has several attractive qualities. It isolates variation in sub-facial regions from the rest of the face, and therefore provides a high degree of control over different facial parts along with meaningful image-based animation parameters. The automatic synthesis of animations may be achieved using speech not originally included in the training set. The model is also able to automatically animate pauses, hesitations and non-verbal (or non-speech related) sounds and actions. To automatically produce visual speech, two novel analysis and synthesis methods are proposed. The first method utilises a Speech-Appearance Model (SAM), and the second uses a Hidden Markov Coarticulation Model (HMCM), based on a Hidden Markov Model (HMM). To evaluate synthesised animations (irrespective of whether they are rendered semi-automatically or using speech), a new perceptual analysis approach based on the McGurk effect is proposed. This measure provides both an unbiased and quantitative method for evaluating talking head visual speech quality and overall perceptual realism. A combination of this new approach, along with other objective and perceptual evaluation techniques, is employed for a thorough evaluation of hierarchical model animations.
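
    The hierarchical composition described above can be pictured as in the sketch below: each sub-facial region carries its own parameter trajectory, and a full-face frame is assembled by concatenating the per-region parameters at each time step. The region names, parameter dimensions and random trajectories are placeholders for illustration, not the thesis's actual model.

    import numpy as np

    REGIONS = ["mouth", "eyes", "brows"]          # illustrative sub-facial areas

    def combine_regions(region_tracks):
        """Concatenate per-region (frames x params) tracks into full-face frames."""
        n_frames = min(len(t) for t in region_tracks.values())
        return np.hstack([region_tracks[r][:n_frames] for r in REGIONS])

    # e.g. key-framed or speech-driven parameter tracks, one array per region
    tracks = {r: np.random.rand(100, 4) for r in REGIONS}
    full_face = combine_regions(tracks)           # shape (100, 12)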

    Recording and Analysis of Head Movements, Interaural Level and Time Differences in Rooms and Real-World Listening Scenarios

    The science of how we use interaural differences to localise sounds has been studied for over a century and in many ways is well understood. However, in many of these psychophysical experiments listeners are required to keep their head still, because head movements cause changes in interaural level and time differences (ILD and ITD respectively); yet a fixed head is unrealistic. Here we report an analysis of the ILDs and ITDs that actually occur as people move naturally, and relate them to gyroscope measurements of that motion. We used recordings of binaural signals in a number of rooms and listening scenarios (home, office, busy street, etc.). The listener's head movements were recorded in synchrony with the audio using a micro-electromechanical gyroscope. We calculated the instantaneous ILDs and ITDs and analysed them over time and frequency, comparing them with the measurements of head movement. The results showed that instantaneous ITDs were widely distributed across time and frequency in some multi-source environments, while ILDs were less widely distributed. The type of listening environment affected head motion. These findings suggest a complex interaction between interaural cues, egocentric head movement and the identification of sound sources in real-world listening situations.
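
    The instantaneous cues analysed above can be computed frame by frame from a binaural recording, for example as in the sketch below: ILD as the per-frame level ratio in decibels, and ITD as the lag of the cross-correlation peak within a physiologically plausible range. The frame length and the +/- 1 ms lag window are arbitrary illustrative choices.

    import numpy as np

    def frame_ild_itd(left, right, fs, frame=1024):
        """Per-frame ILD (dB) and ITD (s) for a two-channel binaural signal."""
        ilds, itds = [], []
        max_lag = int(0.001 * fs)                 # restrict lag search to +/- 1 ms
        lags = np.arange(-frame + 1, frame)
        keep = np.abs(lags) <= max_lag
        for start in range(0, len(left) - frame, frame):
            l = left[start:start + frame]
            r = right[start:start + frame]
            ilds.append(10 * np.log10((np.sum(l ** 2) + 1e-12) / (np.sum(r ** 2) + 1e-12)))
            xcorr = np.correlate(l, r, mode="full")   # lags run from -(N-1) to N-1
            itds.append(lags[keep][np.argmax(xcorr[keep])] / fs)
        return np.array(ilds), np.array(itds)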

    The effect direction plot: visual display of non-standardised effects across multiple outcome domains

    Visual display of reported impacts is a valuable aid to both reviewers and readers of systematic reviews. Forest plots are routinely prepared to report standardised effect sizes, but where standardised effect sizes are not available for all included studies a forest plot may misrepresent the available evidence. Tabulated data summaries to accompany the narrative synthesis can be lengthy and inaccessible. Moreover, the link between the data and the synthesis conclusions may be opaque. This paper details the preparation of visual summaries of effect direction for multiple outcomes across 29 quantitative studies of the health impacts of housing improvement. A one-page summary of reported health outcomes was prepared to accompany a 10,000-word narrative synthesis. The one-page summary included details of study design, internal validity, sample size and time of follow-up, as well as changes in intermediate outcomes such as housing condition. This approach to visually summarising complex data can aid the reviewer in cross-study analysis and improve the accessibility and transparency of the narrative synthesis where standardised effect sizes are not available.
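
    A minimal effect direction plot of this kind can be drawn with standard plotting tools, as in the sketch below: one row per study, one column per outcome domain, an upward marker for a reported improvement, a downward marker for a deterioration and a dot for no clear change. The studies, domains and directions shown are placeholders, not the review's data.

    import matplotlib.pyplot as plt

    studies = ["Study A", "Study B", "Study C"]               # placeholder rows
    domains = ["Respiratory", "Mental health", "Injuries"]    # placeholder columns
    directions = [[+1, +1, 0],                                # +1 up, -1 down, 0 unclear
                  [+1, -1, 0],
                  [0, +1, +1]]

    marker_for = {+1: ("^", "green"), -1: ("v", "red"), 0: (".", "grey")}
    fig, ax = plt.subplots()
    for i, row in enumerate(directions):
        for j, d in enumerate(row):
            marker, colour = marker_for[d]
            ax.scatter(j, i, marker=marker, color=colour, s=120)
    ax.set_xticks(range(len(domains)))
    ax.set_xticklabels(domains)
    ax.set_yticks(range(len(studies)))
    ax.set_yticklabels(studies)
    ax.invert_yaxis()
    ax.set_title("Effect direction by outcome domain (illustrative)")
    plt.show()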

    A mechatronic approach to supernormal auditory localisation

    Remote audio perception is a fundamental requirement for telepresence and teleoperation in applications that range from work in hostile environments to security and entertainment. This paper presents the use of a mechatronic system to test the efficacy of audio for telepresence. It describes work to determine whether the use of a supernormal inter-aural distance is a valid means of enhancing hearing for telepresence. The particular audio variable investigated is the azimuth error angle; the construction of a dedicated mechatronic test rig is reported, along with the results obtained. The paper concludes by observing that the combination of the mechatronic system and supernormal audition does enhance the ability to localise sound sources, and that further work in this area is justified.
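
    The expected benefit of a supernormal inter-aural distance can be sketched with a simple spherical-head (Woodworth-style) approximation: the interaural time difference produced by a source at a given azimuth grows with the microphone separation, so a wider artificial "head" stretches the localisation cue. The 0.18 m and 0.36 m separations below are illustrative values only, not the dimensions of the paper's rig.

    import numpy as np

    C = 343.0                                      # speed of sound in air, m/s

    def itd(azimuth_deg, separation_m):
        """Far-field ITD (s) at the given azimuth, Woodworth approximation."""
        theta = np.radians(azimuth_deg)
        return (separation_m / 2.0) * (theta + np.sin(theta)) / C

    for az in (10, 30, 60):
        normal = itd(az, 0.18)                     # roughly head-sized spacing
        wide = itd(az, 0.36)                       # doubled, "supernormal" spacing
        print(f"{az:2d} deg: {normal * 1e6:6.1f} us -> {wide * 1e6:6.1f} us")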