28 research outputs found

    Spatial context-aware person-following for a domestic robot

    Get PDF
    Domestic robots are in the focus of research in terms of service providers in households and even as robotic companion that share the living space with humans. A major capability of mobile domestic robots that is joint exploration of space. One challenge to deal with this task is how could we let the robots move in space in reasonable, socially acceptable ways so that it will support interaction and communication as a part of the joint exploration. As a step towards this challenge, we have developed a context-aware following behav- ior considering these social aspects and applied these together with a multi-modal person-tracking method to switch between three basic following approaches, namely direction-following, path-following and parallel-following. These are derived from the observation of human-human following schemes and are activated depending on the current spatial context (e.g. free space) and the relative position of the interacting human. A combination of the elementary behaviors is performed in real time with our mobile robot in different environments. First experimental results are provided to demonstrate the practicability of the proposed approach

    Human Detection And Tracking For Human-Robot Interaction On The REEM-C Humanoid Robot

    Get PDF
    The interactions between humanoid robots and humans is a growing area of research, as frameworks and models are being continuously developed to improving the ways in which humanoids may integrate into society. These humanoids often require intelligence beyond what they are originally endowed with in order to handle more complex human-robot interaction scenarios. This intelligence can come from the use of additional sensors, including microphones and cameras, which can allow the robot to better perceive its environment. This thesis explores the scenarios of moving conversational partners, and the ways in which the REEM-C Humanoid Robot may interact with them. The additional developed intelligence focuses on external microphones deployed to the robot, with a consideration for computer vision algorithms built using the camera in the REEM-C's head. The first topic of this thesis explores how binaural acoustic intelligence can be used to estimate the direction of arrival of human speech on the REEM-C Humanoid. This includes the development of audio signal processing techniques, their optimization, and their deployment for real-time use on the REEM-C. The second topic highlights the computer vision approaches that can be used for a robotic system that may allow better human-robot interaction. This section describes the relevant algorithms and their development, in a way that is efficient and accurate for real-time robot usage. The third topic explores the natural behaviours of humans in conversation with moving interlocutors. This is measured via a motion capture study and modeled with mathematical formulations, which are then used on the REEM-C Humanoid Robot. The REEM-C uses this tracking model to follow detected human speakers using the intelligence outlined in previous sections. The final topic focuses on how the acoustic intelligence, vision algorithms and tracking model can be used in tandem for human-robot interaction with potentially multiple human subjects. This includes sensor fusion approaches that help correct for limitations in the audio and video algorithms, synchronization and evaluation of behaviour in the form of a short user study. Applications of this framework are discussed, and relevant quantitative and qualitative results are presented. A chapter to introduce the work done to establish a chatbot conversational system is also included. The final thesis work is an amalgamation of the above topics, and presents a complete and robust human-robot interaction framework with the REEM-C based on tracking moving conversational partners with audio and video intelligence

    Studies on noise robust automatic speech recognition

    Get PDF
    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK

    Non-Intrusive Speech Intelligibility Prediction

    Get PDF

    Binaural scene analysis : localization, detection and recognition of speakers in complex acoustic scenes

    Get PDF
    The human auditory system has the striking ability to robustly localize and recognize a specific target source in complex acoustic environments while ignoring interfering sources. Surprisingly, this remarkable capability, which is referred to as auditory scene analysis, is achieved by only analyzing the waveforms reaching the two ears. Computers, however, are presently not able to compete with the performance achieved by the human auditory system, even in the restricted paradigm of confronting a computer algorithm based on binaural signals with a highly constrained version of auditory scene analysis, such as localizing a sound source in a reverberant environment or recognizing a speaker in the presence of interfering noise. In particular, the problem of focusing on an individual speech source in the presence of competing speakers, termed the cocktail party problem, has been proven to be extremely challenging for computer algorithms. The primary objective of this thesis is the development of a binaural scene analyzer that is able to jointly localize, detect and recognize multiple speech sources in the presence of reverberation and interfering noise. The processing of the proposed system is divided into three main stages: localization stage, detection of speech sources, and recognition of speaker identities. The only information that is assumed to be known a priori is the number of target speech sources that are present in the acoustic mixture. Furthermore, the aim of this work is to reduce the performance gap between humans and machines by improving the performance of the individual building blocks of the binaural scene analyzer. First, a binaural front-end inspired by auditory processing is designed to robustly determine the azimuth of multiple, simultaneously active sound sources in the presence of reverberation. The localization model builds on the supervised learning of azimuthdependent binaural cues, namely interaural time and level differences. Multi-conditional training is performed to incorporate the uncertainty of these binaural cues resulting from reverberation and the presence of competing sound sources. Second, a speech detection module that exploits the distinct spectral characteristics of speech and noise signals is developed to automatically select azimuthal positions that are likely to correspond to speech sources. Due to the established link between the localization stage and the recognition stage, which is realized by the speech detection module, the proposed binaural scene analyzer is able to selectively focus on a predefined number of speech sources that are positioned at unknown spatial locations, while ignoring interfering noise sources emerging from other spatial directions. Third, the speaker identities of all detected speech sources are recognized in the final stage of the model. To reduce the impact of environmental noise on the speaker recognition performance, a missing data classifier is combined with the adaptation of speaker models using a universal background model. This combination is particularly beneficial in nonstationary background noise

    Distant Speech Recognition of Natural Spontaneous Multi-party Conversations

    Get PDF
    Distant speech recognition (DSR) has gained wide interest recently. While deep networks keep improving ASR overall, the performance gap remains between using close-talking recordings and distant recordings. Therefore the work in this thesis aims at providing some insights for further improvement of DSR performance. The investigation starts with collecting the first multi-microphone and multi-media corpus of natural spontaneous multi-party conversations in native English with the speaker location tracked, i.e. the Sheffield Wargame Corpus (SWC). The state-of-the-art recognition systems with the acoustic models trained standalone and adapted both show word error rates (WERs) above 40% on headset recordings and above 70% on distant recordings. A comparison between SWC and AMI corpus suggests a few unique properties in the real natural spontaneous conversations, e.g. the very short utterances and the emotional speech. Further experimental analysis based on simulated data and real data quantifies the impact of such influence factors on DSR performance, and illustrates the complex interaction among multiple factors which makes the treatment of each influence factor much more difficult. The reverberation factor is studied further. It is shown that the reverberation effect on speech features could be accurately modelled with a temporal convolution in the complex spectrogram domain. Based on that a polynomial reverberation score is proposed to measure the distortion level of short utterances. Compared to existing reverberation metrics like C50, it avoids a rigid early-late-reverberation partition without compromising the performance on ranking the reverberation level of recording environments and channels. Furthermore, the existing reverberation measurement is signal independent thus unable to accurately estimate the reverberation distortion level in short recordings. Inspired by the phonetic analysis on the reverberation distortion via self-masking and overlap-masking, a novel partition of reverberation distortion into the intra-phone smearing and the inter-phone smearing is proposed, so that the reverberation distortion level is first estimated on each part and then combined

    Automatic speech recognition: from study to practice

    Get PDF
    Today, automatic speech recognition (ASR) is widely used for different purposes such as robotics, multimedia, medical and industrial application. Although many researches have been performed in this field in the past decades, there is still a lot of room to work. In order to start working in this area, complete knowledge of ASR systems as well as their weak points and problems is inevitable. Besides that, practical experience improves the theoretical knowledge understanding in a reliable way. Regarding to these facts, in this master thesis, we have first reviewed the principal structure of the standard HMM-based ASR systems from technical point of view. This includes, feature extraction, acoustic modeling, language modeling and decoding. Then, the most significant challenging points in ASR systems is discussed. These challenging points address different internal components characteristics or external agents which affect the ASR systems performance. Furthermore, we have implemented a Spanish language recognizer using HTK toolkit. Finally, two open research lines according to the studies of different sources in the field of ASR has been suggested for future work
    corecore