
    The Natural Statistics of Audiovisual Speech

    Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time-varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input-space characterization is unavailable is human speech. We do not understand which signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept, versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech, which suggest that speech communication is a reciprocally coupled, multisensory event whereby the outputs of the signaler are matched to the neural processes of the receiver.
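    The correlations and 2–7 Hz modulation reported above can be estimated with standard signal-processing tools. The following is a minimal sketch (not the authors' analysis code), assuming a mono waveform `audio` at sample rate `sr` and a per-video-frame mouth-opening area `mouth_area` at `fps` frames per second; all names are illustrative.

    ```python
    import numpy as np
    from scipy.signal import hilbert, resample, correlate, welch

    def audiovisual_stats(audio, sr, mouth_area, fps=30.0):
        # Wideband amplitude envelope via the Hilbert transform.
        envelope = np.abs(hilbert(audio))
        # Resample the envelope to the video frame rate so the two
        # time series share a common time base.
        n_frames = len(mouth_area)
        env_frames = resample(envelope, n_frames)

        # Zero-lag Pearson correlation between mouth area and envelope.
        r = np.corrcoef(mouth_area, env_frames)[0, 1]

        # Lag (in ms) at which the cross-correlation peaks: a crude
        # estimate of the audio-visual temporal offset.
        xcorr = correlate(mouth_area - mouth_area.mean(),
                          env_frames - env_frames.mean(), mode="full")
        lag_ms = (np.argmax(xcorr) - (n_frames - 1)) / fps * 1000.0

        # Modulation spectrum of the mouth-area signal; speech typically
        # concentrates energy in the 2-7 Hz range.
        freqs, psd = welch(mouth_area - mouth_area.mean(), fs=fps,
                           nperseg=min(256, n_frames))
        peak_mod_hz = freqs[np.argmax(psd)]
        return r, lag_ms, peak_mod_hz
    ```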

    Glottal-synchronous speech processing

    Glottal-synchronous speech processing is a field of speech science in which the pseudoperiodicity of voiced speech is exploited. Traditionally, speech processing involves segmenting and processing short speech frames of predefined length; this may fail to exploit the inherent periodic structure of voiced speech, which glottal-synchronous speech frames have the potential to harness. Glottal-synchronous frames are often derived from the glottal closure instants (GCIs) and glottal opening instants (GOIs). The SIGMA algorithm was developed for the detection of GCIs and GOIs from the electroglottograph signal, with a measured accuracy of up to 99.59%. For GCI and GOI detection from speech signals, the YAGA algorithm provides a measured accuracy of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to reverberation than single-channel algorithms. The GCIs are applied to real-world applications including speech dereverberation, where SNR is improved by up to 5 dB, and prosodic manipulation, where the importance of voicing detection in glottal-synchronous algorithms is demonstrated by subjective testing. The GCIs are further exploited in a new area of data-driven speech modelling, providing new insights into speech production and a set of tools to aid deployment in real-world applications. The technique is shown to be applicable in areas of speech coding, identification, and artificial bandwidth extension of telephone speech.
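    As an illustration of the framing idea described above (the SIGMA and YAGA detectors themselves are not reproduced), the sketch below assumes the GCIs have already been detected as sample indices and returns two-pitch-period, Hann-windowed analysis frames; the function name and arguments are hypothetical.

    ```python
    import numpy as np

    def glottal_synchronous_frames(speech, gcis):
        # Each analysis frame spans two glottal cycles around gcis[i],
        # instead of using a fixed, predefined frame length.
        frames = []
        for i in range(1, len(gcis) - 1):
            start, end = gcis[i - 1], gcis[i + 1]
            segment = speech[start:end]
            frames.append(segment * np.hanning(len(segment)))
        return frames
    ```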

    Coding Strategies for Cochlear Implants Under Adverse Environments

    Cochlear implants are electronic prosthetic devices that restore partial hearing in patients with severe to profound hearing loss. Although most coding strategies have significantly improved the perception of speech in quiet listening conditions, limitations remain on speech perception under adverse environments such as background noise, reverberation, and band-limited channels. We propose strategies that improve the intelligibility of speech transmitted over telephone networks, of reverberant speech, and of speech in the presence of background noise. For telephone-processed speech, we examine the effects of adding low-frequency and high-frequency information to the band-limited telephone speech. Four listening conditions were designed to simulate the receiving frequency characteristics of telephone handsets. Results indicated improvement in cochlear implant and bimodal listening when telephone speech was augmented with high-frequency information, and this study therefore provides support for the design of algorithms that extend the bandwidth towards higher frequencies. The results also indicated added benefit from hearing aids for bimodal listeners in all four types of listening conditions. Speech understanding in acoustically reverberant environments is always a difficult task for hearing-impaired listeners. Reverberant sound consists of the direct sound, early reflections, and late reflections; late reflections are known to be detrimental to speech intelligibility. In this study, we propose a reverberation suppression strategy based on spectral subtraction to suppress the reverberant energy from late reflections. Results from listening tests for two reverberant conditions (RT60 = 0.3 s and 1.0 s) indicated significant improvement when stimuli were processed with the spectral subtraction (SS) strategy. The proposed strategy operates with little to no prior information on the signal and the room characteristics and can therefore potentially be implemented in real-time CI speech processors. For speech in background noise, we propose a mechanism underlying the contribution of harmonics to the benefit of electroacoustic stimulation in cochlear implants. The proposed strategy is based on harmonic modeling and uses a synthesis-driven approach to synthesize the harmonics in voiced segments of speech. Based on objective measures, results indicated improvement in speech quality. This study warrants further work on the development of algorithms to regenerate harmonics of voiced segments in the presence of noise.
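    The reverberation-suppression idea can be sketched as spectral subtraction of an estimated late-reverberant power spectrum. The snippet below is a rough illustration under the common assumption that late reverberation behaves like a delayed, exponentially decayed copy of the reverberant speech power; it is not the exact strategy or parameterization evaluated in the thesis, and all parameter values are placeholders.

    ```python
    import numpy as np
    from scipy.signal import stft, istft

    def suppress_late_reverb(x, fs, rt60=0.5, late_delay_ms=50.0, floor=0.1):
        nperseg = 512
        hop = nperseg // 2
        f, t, X = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
        power = np.abs(X) ** 2

        # Frame delay corresponding to the start of the "late" reflections.
        d = max(1, int(round(late_delay_ms / 1000.0 * fs / hop)))
        # Exponential energy decay over that delay, from the RT60 definition.
        decay = np.exp(-2.0 * (3.0 * np.log(10.0) / rt60) * (d * hop / fs))

        # Estimated late-reverberant power: delayed, attenuated speech power.
        late = np.zeros_like(power)
        late[:, d:] = decay * power[:, :-d]

        # Subtract and apply a spectral floor to limit musical noise.
        clean_power = np.maximum(power - late, floor * power)
        Y = np.sqrt(clean_power) * np.exp(1j * np.angle(X))
        _, y = istft(Y, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
        return y
    ```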

    Objective and Subjective Evaluation of Wideband Speech Quality

    Traditional landline and cellular communications use a bandwidth of 300-3400 Hz for transmitting speech. This narrow bandwidth impacts the quality, intelligibility, and naturalness of transmitted speech. There is an impending change within the telecommunication industry towards using wider-bandwidth speech, but the enlarged bandwidth also introduces a few challenges in speech processing. Echo and noise are two challenging issues in wideband telephony, due to increased perceptual sensitivity by users. Subjective and/or objective measurements of speech quality are important in benchmarking speech processing algorithms and in evaluating the effect of parameters like noise, echo, and delay in wideband telephony. Subjective measures include ratings of speech quality by listeners, whereas objective measures compute a metric based on the reference and degraded speech samples. While subjective quality ratings are the gold standard, they are also time- and resource-consuming. An objective metric that correlates highly with subjective data is attractive, as it can act as a substitute for subjective quality scores in gauging the performance of different algorithms and devices. This thesis reports results from a series of experiments on subjective and objective speech quality evaluation for wideband telephony applications. First, a custom wideband noise reduction database was created that contained speech samples corrupted by different background noises at different signal-to-noise ratios (SNRs) and processed by six different noise reduction algorithms. Comprehensive subjective evaluation of this database revealed an interaction between algorithm performance, noise type, and SNR. Several auditory-based objective metrics, such as the Loudness Pattern Distortion (LPD) measure based on the Moore-Glasberg auditory model, were evaluated in predicting the subjective scores. In addition, the performance of Bayesian Multivariate Regression Splines (BMLS) was evaluated in mapping the scores calculated by the objective metrics to the true quality scores. The combination of LPD and BMLS resulted in high correlation with the subjective scores and was used as a substitute for subjective scores when fine-tuning the noise reduction algorithms. Second, the effect of echo and delay on wideband speech was evaluated in both listening and conversational contexts, through both subjective and objective measures. A database containing speech samples corrupted by echo with different delay and frequency response characteristics was created and later used to collect subjective quality ratings. The LPD-BMLS objective metric was then validated using the subjective scores. Third, to evaluate the effect of echo and delay in a conversational context, a real-time simulator was developed. Pairs of subjects conversed over the simulated system and rated the quality of their conversations, which were degraded by different amounts of echo and delay. The quality scores were analysed, and the LPD-BMLS combination was found to be effective in predicting subjective impressions of quality for condition-averaged data.
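    The validation step, mapping objective scores to subjective ratings and checking the correlation, can be sketched as follows. A plain polynomial fit stands in for the BMLS mapping, and the score arrays are synthetic placeholders rather than data from the thesis.

    ```python
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    objective = rng.uniform(0.0, 1.0, size=24)                      # placeholder per-condition LPD-like scores
    subjective = 1.0 + 4.0 * (1 - objective) + rng.normal(0, 0.2, 24)  # placeholder condition-averaged MOS

    # Fit a simple polynomial mapping from objective score to MOS.
    coeffs = np.polyfit(objective, subjective, deg=3)
    predicted = np.polyval(coeffs, objective)

    # Correlation between mapped objective scores and true subjective MOS.
    r, _ = pearsonr(predicted, subjective)
    print(f"Pearson correlation after mapping: {r:.3f}")
    ```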

    Telesonar: Robocall Alarm System by Detecting Echo Channel and Breath Timing


    The Effect of Narrow-Band Transmission on Recognition of Paralinguistic Information From Human Vocalizations

    Practically no knowledge exists on the effects of speech coding and narrow-band transmission of speech signals within certain frequency ranges, especially in relation to the recognition of paralinguistic cues in speech. We thus investigated the impact of narrow-band standard speech coders on the machine-based classification of affective vocalizations and clinical vocal recordings. In addition, we analyzed the effect of low-pass filtering speech with a set of different cut-off frequencies, either chosen as static values in the 0.5-5 kHz range or given dynamically by different upper limits from the first five speech formants (F1-F5). Speech coding and recognition were tested, first, with respect to short-term speaker states, using affective vocalizations from the Geneva Multimodal Emotion Portrayals. Second, in relation to long-term speaker traits, we tested vocal recordings from clinical populations involving speech impairments, as found in the Child Pathological Speech Database. We employed a large acoustic feature space derived from the Interspeech Computational Paralinguistics Challenge. Besides analyzing the degradation itself, we analyzed the potential of matched-condition and multicondition training as opposed to mismatched-condition training. In the results, first, multicondition and matched-condition training significantly increase performance compared with mismatched-condition training. Second, drops in classification accuracy occur only at comparatively severe levels of low-pass filtering; these drops appear especially for multi-categorical rather than binary decisions and can be dealt with reasonably well by the aforementioned training strategies.
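    The static low-pass filtering conditions can be sketched as follows; the coders, corpora, and classifier are not reproduced, and the cut-off list simply mirrors the 0.5-5 kHz range mentioned above.

    ```python
    from scipy.signal import butter, sosfiltfilt

    def lowpass_versions(x, fs, cutoffs_hz=(500, 1000, 2000, 3000, 4000, 5000)):
        """Return band-limited copies of `x` for each cut-off below Nyquist."""
        versions = {}
        for fc in cutoffs_hz:
            if fc >= fs / 2:   # skip cut-offs at or above the Nyquist frequency
                continue
            sos = butter(8, fc, btype="low", fs=fs, output="sos")
            versions[fc] = sosfiltfilt(sos, x)   # zero-phase low-pass filtering
        return versions
    ```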

    ADAPTIVE SPEECH QUALITY IN VOICE-OVER-IP COMMUNICATIONS

    The quality of VoIP communication relies significantly on the network that transports the voice packets, because this network does not usually guarantee the available bandwidth, delay, and loss that are critical for real-time voice traffic. The solution proposed here is to manage the voice-over-IP stream dynamically, changing parameters as needed to assure quality. The main objective of this dissertation is to develop an adaptive speech encoding system that can be applied to conventional (telephony-grade) and wideband voice communications. This comprehensive study includes the investigation and development of three key components of the system. First, to manage VoIP quality dynamically, a tool is needed to measure real-time changes in quality. The E-model, which exists for narrowband communication, is extended to a single computational technique that measures speech quality for narrowband and wideband VoIP codecs. This part of the dissertation also develops important theoretical work in the area of wideband telephony. The second system component is a variable speech-encoding algorithm. Although VoIP performance is affected by multiple codec- and network-based factors, only three factors can be managed dynamically: voice payload size, speech compression, and jitter-buffer management. Using an existing adaptive jitter-buffer algorithm, voice packet size and compression variation are studied as they affect speech quality under different network conditions. This study explains the relationships among multiple parameters as they affect speech transmission and its resulting quality. Based on these two components, the third system component is a novel adaptive rate-control algorithm that establishes the interaction between a VoIP sender and receiver and manages voice quality in real time. Simulations demonstrate that the system provides better average voice quality than traditional VoIP.
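    For reference, the narrowband E-model that the dissertation extends combines a base rating with impairment terms and maps the result to an estimated MOS. The sketch below loosely follows ITU-T G.107 with placeholder impairment values (Ie, Bpl); the wideband extension developed in the dissertation is not reproduced.

    ```python
    def r_factor(ie=11.0, bpl=25.1, ppl=2.0, burst_r=1.0, id_delay=0.0, ro=93.2):
        # Effective equipment impairment under (possibly bursty) packet loss.
        ie_eff = ie + (95.0 - ie) * ppl / (ppl / burst_r + bpl)
        # Simplified rating: base value minus delay and equipment impairments.
        return ro - id_delay - ie_eff

    def r_to_mos(r):
        # Standard mapping from the R rating to an estimated MOS.
        if r <= 0:
            return 1.0
        if r >= 100:
            return 4.5
        return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

    print(r_to_mos(r_factor(ppl=2.0)))   # estimated MOS at 2% packet loss
    ```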

    Recent Advances in Signal Processing

    Signal processing is a critical task in the majority of new technological inventions and challenges, in a variety of applications across both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian, and they have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand: image processing, speech processing, communication systems, time-series analysis, and educational packages. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.

    Subband spectral features for speaker recognition.

    Tam Yuk Yin. Thesis (M.Phil.), Chinese University of Hong Kong, 2004. Includes bibliographical references; abstracts in English and Chinese. Contents: Chapter 1, Introduction (biometrics for user authentication; voice-based user authentication; motivation and focus of this work; thesis outline). Chapter 2, Fundamentals of Automatic Speaker Recognition (speech production; features of the speaker's voice in the speech signal; basics of speaker recognition; existing approaches, covering feature extraction with Mel-frequency cepstral coefficients (MFCC), speaker modeling with Gaussian mixture models (GMM), and speaker identification (SID)). Chapter 3, Data Collection and Baseline System (data collection; baseline system set-up, results, and analysis). Chapter 4, Subband Spectral Envelope Features (spectral envelope features; subband spectral envelope features; feature extraction procedures; SID experiments). Chapter 5, Fusion of Subband Features (model-level fusion; feature-level fusion; discussion). Chapter 6, Utterance-Level SID with Text-Dependent Weights (motivation; utterance-level SID; baseline system; text-dependent weights; text-dependent feature weights; text-dependent weights applied in score combination and subband features; discussion). Chapter 7, Conclusions and Suggested Future Work. Appendix 1: speech content for data collection.
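    As a hedged illustration of the MFCC/GMM baseline named in the contents above (not the thesis's proposed subband spectral envelope features), the sketch below trains one Gaussian mixture model per enrolled speaker on MFCC frames and identifies a test utterance by maximum log-likelihood; the data layout and parameters are assumptions.

    ```python
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def train_speaker_models(enrollment, sr=16000, n_mfcc=13, n_components=32):
        # enrollment: dict mapping speaker id -> list of 1-D waveforms
        models = {}
        for spk, waveforms in enrollment.items():
            feats = np.vstack([librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
                               for y in waveforms])
            models[spk] = GaussianMixture(n_components=n_components,
                                          covariance_type="diag").fit(feats)
        return models

    def identify(models, y, sr=16000, n_mfcc=13):
        # Return the enrolled speaker whose model best explains the utterance.
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
        return max(models, key=lambda spk: models[spk].score(feats))
    ```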

    Non-Intrusive Subscriber Authentication for Next Generation Mobile Communication Systems

    The last decade has witnessed massive growth in both the technological development and the consumer adoption of mobile devices such as mobile handsets and PDAs. The recent introduction of wideband mobile networks has enabled the deployment of new services with access to traditionally well-protected personal data, such as banking details or medical records. Secure user access to this data has, however, remained a function of the mobile device's authentication system, which is only protected from masquerade abuse by the traditional PIN, originally designed to protect against telephony abuse. This thesis presents novel research in relation to advanced subscriber authentication for mobile devices. The research began by assessing the threat of masquerade attacks on such devices by way of a survey of end users. This revealed that the current methods of mobile authentication remain extensively unused, leaving terminals highly vulnerable to masquerade attack. Further investigation revealed that, in the context of the more advanced wideband-enabled services, users are receptive to many advanced authentication techniques and principles, including the discipline of biometrics, which naturally lends itself to the area of advanced subscriber-based authentication. To address the requirement for a more personal authentication capable of being applied in a continuous context, a novel non-intrusive biometric authentication technique was conceived, drawn from the distinct disciplines of biometrics and auditory evoked responses. The technique forms a hybrid multi-modal biometric in which variations in the behavioural stimulus of the human voice (due to the propagation effects of acoustic waves within the human head) are used to verify the identity of a user. The resulting approach is known as the Head Authentication Technique (HAT). Evaluation of the HAT authentication process is realised in two stages. Firstly, the generic authentication procedures of registration and verification are automated within a prototype implementation. Secondly, a HAT demonstrator is used to evaluate the authentication process through a series of experimental trials involving a representative user community. The results from the trials confirm that multiple HAT samples from the same user exhibit a high degree of correlation, yet samples between users exhibit a high degree of discrepancy. Statistical analysis of the prototype's performance realised early system error rates of FNMR = 6% and FMR = 0.025%. The results clearly demonstrate the authentication capabilities of this novel biometric approach and the contribution this new work can make to the protection of subscriber data in next-generation mobile networks. Orange Personal Communication Services Ltd.
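    The reported error rates can be computed from genuine (same-user) and impostor (different-user) match scores at a chosen decision threshold. The sketch below shows only that evaluation step, not the HAT feature extraction or matching itself; the score arrays are assumed inputs.

    ```python
    import numpy as np

    def fnmr_fmr(genuine_scores, impostor_scores, threshold):
        genuine = np.asarray(genuine_scores)
        impostor = np.asarray(impostor_scores)
        # False non-match rate: genuine attempts rejected at this threshold.
        fnmr = np.mean(genuine < threshold)
        # False match rate: impostor attempts accepted at this threshold.
        fmr = np.mean(impostor >= threshold)
        return fnmr, fmr
    ```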