2 research outputs found

    Multimodal Based Audio-Visual Speech Recognition for Hard-of-Hearing: State of the Art Techniques and Challenges

    Multimodal Integration (MI) is the study of merging the knowledge acquired by the nervous system through sensory modalities such as speech, vision, touch, and gesture. Its applications span Audio-Visual Speech Recognition (AVSR), Sign Language Recognition (SLR), Emotion Recognition (ER), Biometric Applications (BMA), Affect Recognition (AR), Multimedia Retrieval (MR), and related areas. Fused modality pairs such as hand gestures with facial cues, or lip movements with hand position, are the sensory combinations most commonly used to build multimodal systems for the hearing impaired. This paper gives an overview of the multimodal systems reported in the literature for hearing-impaired studies and also discusses several studies on hearing-impaired acoustic analysis. It observes that far fewer algorithms have been developed for hearing-impaired AVSR than for normal-hearing speech recognition, so audio-visual speech recognition systems for the hearing impaired are in strong demand, particularly for people communicating in their native languages. The paper also highlights the state-of-the-art techniques in AVSR and the challenges researchers face in developing AVSR systems.
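Many of the multimodal systems this survey covers combine modality-specific recognizers through decision-level (late) fusion. The following is a minimal, hypothetical sketch of that idea in Python; the class posteriors, weights, and function names are illustrative assumptions, not taken from any paper in this listing.

```python
# Hypothetical sketch of decision-level (late) fusion of two modality
# classifiers, a common pattern in AVSR; weights and posteriors are invented.
import numpy as np

def late_fusion(audio_probs, visual_probs, audio_weight=0.6):
    """Weighted average of per-class posteriors from each modality."""
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * visual_probs
    return int(np.argmax(fused))

# Example posteriors over three word classes from each single-modality model.
audio_probs = np.array([0.2, 0.5, 0.3])   # acoustic model, e.g. degraded by noise
visual_probs = np.array([0.1, 0.2, 0.7])  # lip-reading model
print("fused class:", late_fusion(audio_probs, visual_probs))
```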

    Robust front-end for audio, visual and audio–visual speech classification

    This paper proposes a robust front-end for speech classification which can be employed with acoustic, visual or audio–visual information interchangeably. Wavelet multiresolution analysis is employed to represent the temporal input data associated with speech information. These wavelet-based features are then used as inputs to a Random Forest classifier to perform the speech classification. The performance of the proposed scheme is evaluated in different scenarios, namely, considering only acoustic information, only visual information (lip-reading), and fused audio–visual information. These evaluations are carried out over three audio–visual databases, two of them public and the remaining one compiled by the authors of this paper. Experimental results show that good performance is achieved with the proposed system over the three databases and for the different kinds of input information considered. In addition, the proposed method outperforms other methods reported in the literature over the same two public databases. All the experiments were run with the same configuration parameters. These results also indicate that the proposed method performs satisfactorily without requiring the wavelet decomposition parameters or the Random Forest classifier parameters to be tuned for each particular database or input modality.
    Fil: Terissi, Lucas Daniel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina
    Fil: Sad, Gonzalo Daniel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina
    Fil: Gómez, Juan Carlos. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina
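As a rough illustration of the pipeline this abstract describes, the sketch below extracts wavelet multiresolution features from a 1-D signal and feeds them to a Random Forest classifier. It assumes PyWavelets and scikit-learn are available; the wavelet choice, sub-band statistics, and synthetic data are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (not the authors' code): wavelet multiresolution features
# summarized per sub-band, then classified with a Random Forest.
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def wavelet_features(signal, wavelet="db4", level=3):
    """Summarize each multiresolution sub-band by its energy and spread."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    feats = []
    for c in coeffs:
        feats.extend([np.sum(c ** 2), np.std(c)])
    return np.array(feats)

# Synthetic stand-in for per-utterance acoustic (or visual) time series.
rng = np.random.default_rng(0)
X = np.array([wavelet_features(rng.standard_normal(256) + cls)
              for cls in (0, 1) for _ in range(50)])
y = np.repeat([0, 1], 50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```

Consistent with the abstract's claim, the fixed defaults above stand in for a setup where neither the wavelet decomposition level nor the forest parameters are tuned per database or modality.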