6 research outputs found
Multimodal Based Audio-Visual Speech Recognition for Hard-of-Hearing: State of the Art Techniques and Challenges
Multimodal Integration (MI) is the study of merging the knowledge acquired by the nervous system using sensory modalities such as speech, vision, touch, and gesture. The applications of MI expand over the areas of Audio-Visual Speech Recognition (AVSR), Sign Language Recognition (SLR), Emotion Recognition (ER), Bio Metrics Applications (BMA), Affect Recognition (AR), Multimedia Retrieval (MR), etc. The fusion of modalities such as hand gestures- facial, lip- hand position, etc., are mainly used sensory modalities for the development of hearing-impaired multimodal systems. This paper encapsulates an overview of multimodal systems available within literature towards hearing impaired studies. This paper also discusses some of the studies related to hearing-impaired acoustic analysis. It is observed that very less algorithms have been developed for hearing impaired AVSR as compared to normal hearing. Thus, the study of audio-visual based speech recognition systems for the hearing impaired is highly demanded for the people who are trying to communicate with natively speaking languages. This paper also highlights the state-of-the-art techniques in AVSR and the challenges faced by the researchers for the development of AVSR systems
A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation
Body language (BL) refers to the non-verbal communication expressed through
physical movements, gestures, facial expressions, and postures. It is a form of
communication that conveys information, emotions, attitudes, and intentions
without the use of spoken or written words. It plays a crucial role in
interpersonal interactions and can complement or even override verbal
communication. Deep multi-modal learning techniques have shown promise in
understanding and analyzing these diverse aspects of BL. The survey emphasizes
their applications to BL generation and recognition. Several common BLs are
considered i.e., Sign Language (SL), Cued Speech (CS), Co-speech (CoS), and
Talking Head (TH), and we have conducted an analysis and established the
connections among these four BL for the first time. Their generation and
recognition often involve multi-modal approaches. Benchmark datasets for BL
research are well collected and organized, along with the evaluation of SOTA
methods on these datasets. The survey highlights challenges such as limited
labeled data, multi-modal learning, and the need for domain adaptation to
generalize models to unseen speakers or languages. Future research directions
are presented, including exploring self-supervised learning techniques,
integrating contextual information from other modalities, and exploiting
large-scale pre-trained multi-modal models. In summary, this survey paper
provides a comprehensive understanding of deep multi-modal learning for various
BL generations and recognitions for the first time. By analyzing advancements,
challenges, and future directions, it serves as a valuable resource for
researchers and practitioners in advancing this field. n addition, we maintain
a continuously updated paper list for deep multi-modal learning for BL
recognition and generation: https://github.com/wentaoL86/awesome-body-language
Inner Lips Features Extraction based on CLNF with Hybrid Dynamic Template for Cued Speech
International audienceIn previous French Cued Speech (CS) studies, one of the widely used methods is painting blue color on the speaker’s lips to make lips feature extraction easier. In this paper, in order to get rid of this artifice, a novel automatic method to extract the inner lips contour of CS speakers is presented. This method is based on a recent facial contour extraction model developed in computer vision, called Constrained Local Neural Field (CLNF), which provides eight characteristic landmarks describing the inner lips contour. However, directly applied to our CS data, CLNF fails in about 41.4% of cases. Therefore, we propose two methods to correct the B parameter (aperture of inner lips) and A parameter (width of inner lips), respectively. For correcting the B parameter, a hybrid dynamic correlation template method (HD-CTM) using the first derivative of smoothed luminance variation is proposed. HD-CTM is first applied to detect the outer lower lips position. Then, the inner lower lips position is obtained by subtracting the validated lower lips thickness (VLLT). For correcting the A parameter, a periodical spline interpolation with a geometrical deformation of six CLNF inner lips landmarks is explored. Combined with an automatic round lips detector, this method is efficient to correct A parameter for round lips (the third vowel viseme made of French vowels with a small opening). HD-CTM is evaluated on 4800 images of three French speakers. It corrects about 95% CLNF errors of the B parameter, and total RMSE of one pixel (i.e., 0.05 cm on average) is achieved. The periodical spline interpolation method is tested on 927 round lips images. The total error of CLNF is reduced significantly, which is comparable to the state of the art. Moreover, the third viseme is properly distributed in the parameter A and B plane after using this method