
    Multimodal Based Audio-Visual Speech Recognition for Hard-of-Hearing: State of the Art Techniques and Challenges

    Multimodal Integration (MI) is the study of merging the knowledge acquired by the nervous system through sensory modalities such as speech, vision, touch, and gesture. Applications of MI span Audio-Visual Speech Recognition (AVSR), Sign Language Recognition (SLR), Emotion Recognition (ER), Biometric Applications (BMA), Affect Recognition (AR), Multimedia Retrieval (MR), etc. Fusions of modalities such as hand gesture with facial features, or lip movement with hand position, are the sensory combinations most commonly used in developing multimodal systems for the hearing impaired. This paper gives an overview of the multimodal systems reported in the literature for hearing-impaired studies, and also discusses work on the acoustic analysis of hearing-impaired speech. It is observed that far fewer algorithms have been developed for hearing-impaired AVSR than for normal-hearing speech; audio-visual speech recognition systems for the hearing impaired are therefore in high demand, particularly for people trying to communicate in the languages natively spoken around them. The paper also highlights the state-of-the-art techniques in AVSR and the challenges researchers face in developing AVSR systems.
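
    As a concrete illustration of the modality fusion such systems rely on, below is a minimal decision-level (late) fusion sketch in Python: per-class scores from an acoustic model and a visual (lip) model are combined by a weighted sum before the best class is picked. The models, the weight value, and the three-word example are illustrative assumptions, not details taken from any of the surveyed systems.

        import numpy as np

        # Hypothetical late-fusion sketch: combine per-class posteriors
        # from an audio model and a visual model by a weighted sum.
        # The weights and score values below are made-up illustrations.

        def late_fusion(audio_scores, visual_scores, audio_weight=0.7):
            """Return the class index chosen from weighted audio/visual scores."""
            fused = (audio_weight * np.asarray(audio_scores)
                     + (1.0 - audio_weight) * np.asarray(visual_scores))
            return int(np.argmax(fused))

        # Example: three candidate words; audio favours word 0, lips word 1.
        audio = [0.6, 0.3, 0.1]    # posteriors from the acoustic model
        visual = [0.2, 0.7, 0.1]   # posteriors from the lip-reading model
        print(late_fusion(audio, visual))  # -> 0 with audio_weight=0.7

    In a noisy environment one would lower audio_weight so that the visual stream carries more of the decision.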

    Using the Multi-Stream Approach for Continuous Audio-Visual Speech Recognition: Experiments on the M2VTS Database

    The Multi-Stream automatic speech recognition approach was investigated in this work as a framework for Audio-Visual data fusion and speech recognition. This method presents many potential advantages for such a task: in particular, it allows for synchronous decoding of continuous speech while still permitting some asynchrony between the visual and acoustic information streams. First, the Multi-Stream formalism is briefly recalled. Then, building on the Multi-Stream motivations, experiments on the M2VTS multimodal database are presented and discussed. To our knowledge, these are the first experiments on multi-speaker continuous Audio-Visual Speech Recognition (AVSR). It is shown that the Multi-Stream approach can yield improved Audio-Visual speech recognition performance when the acoustic signal is corrupted by noise, as well as for clean speech.
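
    To make the combination rule concrete, the sketch below (an illustrative assumption, not the paper's implementation) merges per-state emission likelihoods of the acoustic and visual streams as a weighted product, p(x|q) = p(x_a|q)^w_a * p(x_v|q)^w_v, computed in the log domain. The stream weights and the three-state example values are invented for illustration.

        import numpy as np

        # Multi-stream combination: a weighted sum of per-HMM-state
        # log-likelihoods, equivalent to a weighted product of the
        # stream likelihoods. All values below are illustrative only.

        def multistream_log_likelihood(log_lik_audio, log_lik_visual, w_audio=0.8):
            """Combine per-state log-likelihoods of the audio and visual streams."""
            w_visual = 1.0 - w_audio
            return (w_audio * np.asarray(log_lik_audio)
                    + w_visual * np.asarray(log_lik_visual))

        # Three HMM states; under acoustic noise one would lower w_audio
        # so that the noise-robust visual stream dominates decoding.
        la = np.log([0.5, 0.3, 0.2])   # acoustic emission likelihoods
        lv = np.log([0.1, 0.6, 0.3])   # visual emission likelihoods
        print(multistream_log_likelihood(la, lv, w_audio=0.8))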


    Audio-Visual Large Vocabulary Continuous Speech Recognition In The Broadcast Domain

    We consider the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition of speech. Although significant progress has been made in machine transcription of large vocabulary continuous speech (LVCSR) over the last few years, the technology to date is most effective only under controlled conditions such as low noise, speaker-dependent recognition, and read speech (as opposed to conversational speech). On the other hand, while augmenting the recognition of speech utterances with visual cues has attracted the attention of researchers over the last couple of years, most efforts in this domain can be considered only preliminary in the sense that, unlike LVCSR efforts, tasks have been limited to small vocabularies (e.g., commands, digits) and often to speaker-dependent training or isolated-word speech where word boundaries are artificially well defined. The potential for joint audio-visual-based speech recognition is well established...