
    The PASCAL CHiME Speech Separation and Recognition Challenge

    Distant-microphone speech recognition systems that operate with human-like robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple competing sound sources. This paper describes a recent speech recognition evaluation that was designed to bring together researchers from multiple communities in order to foster novel approaches to this problem. The task was to identify keywords from sentences reverberantly mixed into audio backgrounds binaurally recorded in a busy domestic environment. The challenge was designed to model the essential difficulties of the multisource-environment problem while remaining on a scale that would make it accessible to a wide audience. Compared to previous ASR evaluations, a particular novelty of the task is that the utterances to be recognised were provided within a continuous audio background rather than as pre-segmented utterances, thus allowing a range of background modelling techniques to be employed. The challenge attracted thirteen submissions. This paper describes the challenge problem, provides an overview of the systems that were entered, and compares them against both a baseline recognition system and human performance. The paper discusses insights gained from the challenge and lessons learnt for the design of future such evaluations.
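    To make the keyword-scoring task concrete, here is a minimal sketch of keyword accuracy, the kind of utterance-level measure such a task implies. The two-keywords-per-utterance layout (e.g. a letter and a digit, as in Grid-style commands) and all names below are illustrative assumptions, not the challenge's official scoring code.

```python
# Hypothetical sketch: keyword accuracy over a test set, assuming each
# utterance carries a fixed number of target keywords. Illustrative only.

def keyword_accuracy(references, hypotheses):
    """Fraction of target keywords recognised correctly."""
    correct = total = 0
    for ref, hyp in zip(references, hypotheses):
        for r, h in zip(ref, hyp):
            correct += int(r == h)
            total += 1
    return correct / total if total else 0.0

refs = [("b", "4"), ("g", "7")]      # (letter, digit) keywords per utterance
hyps = [("b", "4"), ("g", "1")]      # recogniser output
print(keyword_accuracy(refs, hyps))  # 0.75
```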

    Detection and handling of overlapping speech for speaker diarization

    For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit increased complexity due to the spontaneity of speech, reverberation effects, and the presence of overlapping speech. Overlapping speech refers to situations in which two or more speakers are speaking simultaneously. In meeting data, a substantial portion of the errors of conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can lead to corrupted single-speaker models and thus to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or a multi-layer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps: first, a ranking according to the mRMR criterion is obtained, and then a standard hill-climbing wrapper approach is applied to determine the optimal number of features. The novel spatial and prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site. In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from model training. The proposed overlap labeling technique is integrated into Viterbi decoding, a part of the diarization algorithm. During system development it was discovered that it is favorable to optimize overlap exclusion and overlap labeling independently with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT '09 meeting recordings as well. The addition of beamforming and a TDOA feature stream to the baseline diarization system, aimed at improving the clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large contrasts in improvement between individual meeting recordings, as well as between various settings of the overlap detection operating point. However, high performance variability across different recordings is also typical of the baseline diarization system without any overlap handling.
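    As a rough illustration of the spatial cues described above, the sketch below computes one cross-correlation-based feature per distant-microphone pair and fuses the per-pair features with principal component analysis. The window length, channel count, and random test data are illustrative assumptions; this is not the thesis implementation.

```python
# A rough sketch (not the thesis code) of a spatial cross-correlation
# feature per microphone pair, fused across pairs with PCA.
import numpy as np
from itertools import combinations
from sklearn.decomposition import PCA

def xcorr_peak(x, y):
    """Normalised cross-correlation peak between two channel frames."""
    x = x - x.mean()
    y = y - y.mean()
    c = np.correlate(x, y, mode="full")
    return c.max() / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)

def spatial_features(frames):
    """frames: (n_channels, n_samples) array for one analysis window."""
    return [xcorr_peak(frames[i], frames[j])
            for i, j in combinations(range(frames.shape[0]), 2)]

# Fuse the per-pair features over many windows into one component.
rng = np.random.default_rng(0)
windows = np.array([spatial_features(rng.standard_normal((4, 400)))
                    for _ in range(100)])
fused = PCA(n_components=1).fit_transform(windows)  # shape (100, 1)
```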

    ‘Did the speaker change?’: Temporal tracking for overlapping speaker segmentation in multi-speaker scenarios

    Diarization systems are an essential part of many speech processing applications, such as speaker indexing, improving automatic speech recognition (ASR) performance, and making single-speaker-based algorithms available for use in multi-speaker domains. This thesis focuses on the first task of the diarization process, namely speaker segmentation, which can be thought of as trying to answer the question ‘Did the speaker change?’ in an audio recording. The thesis starts by showing that time-varying pitch properties can be used advantageously within the segmentation step of a multi-talker diarization system. It is then highlighted that an individual’s pitch is smoothly varying and can therefore be predicted by means of a Kalman filter. Subsequently, it is shown that if the pitch is not predictable, then this is most likely due to a change in speaker. Finally, a novel system is proposed that uses this approach of pitch prediction for speaker change detection. The thesis then demonstrates how voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker’s utterance in the presence of an additional active speaker. This work is then extended to explore a new multimodal approach for overlapping speaker segmentation that tracks both the fundamental frequency (F0) and the direction of arrival (DoA) of each speaker simultaneously. The proposed multiple hypothesis tracking system, which tracks both features simultaneously, shows an improvement in segmentation performance compared to tracking these features separately. Lastly, the thesis focuses on the DoA estimation part of the newly proposed multimodal approach by exploring a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluating its performance when using speech sound sources.
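    The pitch-prediction idea lends itself to a compact illustration: a Kalman filter tracks a smoothly varying F0 contour, and frames where the innovation (prediction residual) is large are flagged as candidate speaker changes. The state model, noise variances, and threshold below are illustrative assumptions, not the thesis's tuned system.

```python
# A minimal sketch of Kalman-filter pitch prediction for speaker change
# detection: a constant-velocity model over F0, flagging frames whose
# innovation exceeds a threshold. All parameter values are illustrative.
import numpy as np

def pitch_change_flags(f0, q=1.0, r=4.0, thresh=30.0):
    x = np.array([f0[0], 0.0])             # state: [pitch, pitch velocity]
    P = np.eye(2) * 10.0                   # state covariance
    F = np.array([[1.0, 1.0], [0.0, 1.0]]) # constant-velocity transition
    H = np.array([[1.0, 0.0]])             # we observe pitch only
    Q = np.eye(2) * q
    flags = []
    for z in f0[1:]:
        x = F @ x                          # predict
        P = F @ P @ F.T + Q
        innov = z - (H @ x)[0]             # prediction residual
        flags.append(abs(innov) > thresh)  # large residual -> change?
        S = (H @ P @ H.T)[0, 0] + r
        K = (P @ H.T / S).ravel()
        x = x + K * innov                  # update
        P = (np.eye(2) - np.outer(K, H)) @ P
    return flags

f0 = [120, 121, 122, 121, 123, 190, 191, 192]  # abrupt jump at frame 5
print(pitch_change_flags(f0))                  # the jump yields a True flag
```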

    The effects of adverse conditions on speech recognition by non-native listeners: Electrophysiological and behavioural evidence

    This thesis investigated speech recognition by native (L1) and non-native (L2) listeners (i.e., native English and Korean speakers) in diverse adverse conditions using electroencephalography (EEG) and behavioural measures. Study 1 investigated speech recognition in noise for read and casually produced, spontaneous speech using behavioural measures. The results showed that the detrimental effect of casual speech was greater for L2 than for L1 listeners, demonstrating real-life L2 speech recognition problems caused by casual speech. Intelligibility was also shown to decrease when the accents of the talker and listener did not match, for casual as well as read speech. Study 2 set out to develop EEG methods to measure L2 speech processing difficulties for natural, continuous speech. This study examined neural entrainment to the amplitude envelope of speech (i.e., slow amplitude fluctuations in speech) while subjects listened to their L1, their L2, and a language that they did not understand. The results demonstrate that neural entrainment to the speech envelope is not modulated by whether or not listeners understand the language, contrary to previously reported positive relationships between speech entrainment and intelligibility. Study 3 investigated speech processing in a two-talker situation using measures of neural entrainment and the N400, combined with a behavioural speech recognition task. L2 listeners showed greater entrainment to target talkers than L1 listeners did, likely because their difficulty with L2 speech comprehension caused them to focus greater attention on the speech signal. L2 listeners also showed a greater degree of lexical processing (i.e., a larger N400) for highly predictable words than native listeners did, while native listeners showed greater lexical processing when listening to foreign-accented speech. The results suggest that the increased listening effort experienced by L2 listeners during speech recognition modulates their auditory and lexical processing.
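    A common way to quantify neural entrainment to the speech envelope, in the spirit of Study 2, is to correlate the EEG with the low-pass-filtered amplitude envelope of the speech. The sketch below assumes pre-aligned single-channel signals; the 8 Hz cut-off and lag range are illustrative assumptions, not the study's exact analysis pipeline.

```python
# A minimal entrainment sketch: Hilbert envelope of the speech, low-pass
# both signals, then take the peak cross-correlation over small lags.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def envelope_entrainment(eeg, speech, fs, max_lag_s=0.25):
    env = np.abs(hilbert(speech))                # speech amplitude envelope
    b, a = butter(4, 8 / (fs / 2), btype="low")  # keep slow fluctuations
    env = filtfilt(b, a, env)
    eeg = filtfilt(b, a, eeg)
    env = (env - env.mean()) / env.std()
    eeg = (eeg - eeg.mean()) / eeg.std()
    max_lag = int(max_lag_s * fs)
    full = np.correlate(eeg, env, mode="full") / len(env)
    mid = len(env) - 1                           # zero-lag index
    return full[mid - max_lag: mid + max_lag + 1].max()
```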

    The effect of hearing aid microphone mode on performance in an auditory orienting task

    OBJECTIVES: Although directional microphones on a hearing aid provide a signal-to-noise ratio benefit in a noisy background, the amount of benefit depends on how close the signal of interest is to the front of the user. It is assumed that when the signal of interest is off-axis, users can reorient themselves to the signal to make use of the directional microphones to improve the signal-to-noise ratio. The present study tested this assumption by measuring the head-orienting behavior of bilaterally fit hearing-impaired individuals with their microphones set to omnidirectional and directional modes. The authors hypothesized that listeners using directional microphones would have greater difficulty in rapidly and accurately orienting to off-axis signals than they would when using omnidirectional microphones. DESIGN: The authors instructed hearing-impaired individuals to turn and face a female talker in simultaneous surrounding male-talker babble. Participants pressed a button when they felt they were accurately oriented in the direction of the female talker. Participants completed three blocks of trials with their hearing aids in omnidirectional mode and three blocks in directional mode, with mode order randomized. Using a Vicon motion-tracking system, the authors measured head position and computed fixation error, fixation latency, trajectory complexity, and proportion of misorientations. RESULTS: Results showed that for larger off-axis target angles, listeners using directional microphones took longer to reach their targets than they did when using omnidirectional microphones, although they were just as accurate. They also used more complex movements and frequently made initial turns in the wrong direction. For smaller off-axis target angles, this pattern was reversed, and listeners using directional microphones oriented more quickly and smoothly to the targets than when using omnidirectional microphones. CONCLUSIONS: The authors argue that an increase in movement complexity indicates a switch from a simple orienting movement to a search behavior. For the most off-axis target angles, listeners using directional microphones appear not to know which direction to turn, so they pick a direction at random and simply rotate their heads until the signal becomes more audible. The changes in fixation latency and head-orientation trajectories suggest that the decrease in off-axis audibility is a primary concern in the use of directional microphones, and listeners could experience a loss of initial target speech while turning toward a new signal of interest. If hearing-aid users are to receive maximum directional benefit in noisy environments, both adaptive directionality in hearing aids and clinical advice on using directional microphones should take head movement and orientation behavior into account.
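    Two of the reported head-tracking measures are straightforward to compute from a yaw trajectory. The sketch below derives fixation error and fixation latency under an assumed settling criterion (head within 10 degrees of its final orientation, held to the end of the trial); the criterion and names are illustrative, not the authors' definitions.

```python
# Illustrative computation of fixation error and fixation latency from a
# sampled head-yaw trajectory; the settling criterion is an assumption.
import numpy as np

def fixation_metrics(yaw_deg, target_deg, fs):
    """yaw_deg: head yaw over time (deg); returns (error_deg, latency_s)."""
    final = yaw_deg[-1]
    error = abs(final - target_deg)            # fixation error
    settled = np.abs(yaw_deg - final) <= 10.0  # near final orientation
    idx = len(settled) - 1                     # walk back over the final
    while idx > 0 and settled[idx - 1]:        # settled run to find its start
        idx -= 1
    return error, idx / fs

yaw = np.concatenate([np.linspace(0, 58, 120), np.full(60, 58.0)])
print(fixation_metrics(yaw, target_deg=60.0, fs=120))  # error 2.0 deg, ~0.83 s
```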

    Proceedings: Voice Technology for Interactive Real-Time Command/Control Systems Application

    Speech understanding among researchers and managers, current developments in voice technology, and an exchange of information concerning government voice technology efforts are discussed.

    Ultra-high-speed imaging of bubbles interacting with cells and tissue

    Ultrasound contrast microbubbles are exploited in molecular imaging, where bubbles are directed to target cells and where their high scattering cross section to ultrasound allows for the detection of pathologies at a molecular level. In therapeutic applications, vibrating bubbles close to cells may alter the permeability of cell membranes, and these systems are therefore highly interesting for drug and gene delivery applications using ultrasound. In a more extreme regime, bubbles are driven through shock waves to sonoporate or kill cells through intense stresses or jets following inertial bubble collapse. Here, we elucidate some of the underlying mechanisms using the 25-Mfps camera Brandaris128, resolving the bubble dynamics and its interactions with cells. We quantify acoustic microstreaming around oscillating bubbles close to rigid walls and evaluate the shear stresses on nonadherent cells. In a study on the fluid-dynamical interaction of cavitation bubbles with adherent cells, we find that the nonspherical collapse of bubbles is responsible for cell detachment. We also visualized the dynamics of vibrating microbubbles in contact with endothelial cells, followed by fluorescence imaging of the transport of propidium iodide, used as a membrane integrity probe, into these cells, showing a direct correlation between cell deformation and cell membrane permeability.
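    For a sense of scale, classical microstreaming scalings give an order-of-magnitude estimate of the wall shear stress near an oscillating bubble: a streaming velocity v_s ~ ε²ωR₀ acting across a Stokes boundary layer of thickness δ = √(2ν/ω). The sketch below uses illustrative parameter values and is not the paper's quantitative analysis.

```python
# Order-of-magnitude estimate (not the paper's analysis) of microstreaming
# shear stress near an oscillating microbubble. Values are illustrative.
import math

rho, mu = 1000.0, 1.0e-3           # water density (kg/m^3), viscosity (Pa s)
f = 1.0e6                          # driving frequency (Hz)
R0 = 2.0e-6                        # bubble rest radius (m)
eps = 0.3                          # relative radial oscillation amplitude

omega = 2 * math.pi * f
nu = mu / rho                      # kinematic viscosity (m^2/s)
v_s = eps**2 * omega * R0          # microstreaming velocity scale (m/s)
delta = math.sqrt(2 * nu / omega)  # Stokes boundary-layer thickness (m)
tau = mu * v_s / delta             # shear-stress estimate (Pa), kPa-scale here
print(f"v_s ~ {v_s:.3g} m/s, delta ~ {delta:.3g} m, tau ~ {tau:.3g} Pa")
```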