    Emotional Storyteller for Vision Impaired and Hearing-Impaired Children

    Tellie is a mobile app designed to offer an immersive and emotionally enriched storytelling experience for children who are visually or hearing impaired. It pursues four main objectives. Text extraction uses the CRAFT model together with Convolutional Neural Networks (CNNs), Connectionist Temporal Classification (CTC), and Long Short-Term Memory (LSTM) networks to extract and recognize text from storybook images. Sentence-level emotion recognition employs BERT to detect and distinguish emotions such as happiness, anger, sadness, and surprise. Conversion of text to natural, emotionally expressive audio uses Tacotron2 and WaveGlow, applying emotional styles to the synthesized speech to create engaging audio narratives. Conversion of text to sign language, aimed at the Deaf and hard-of-hearing community, uses CNNs to translate text into signs aligned with real sign language expressions. Together these objectives make Tellie an app that gives visually and hearing-impaired children access to captivating storytelling experiences, promoting accessibility and inclusivity through the integration of language, creativity, and technology. This research demonstrates the potential of advanced technologies in fostering inclusive and emotionally engaging storytelling for all children.
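
    The sentence-level emotion tagging step can be illustrated with a minimal Python sketch, assuming a Hugging Face BERT checkpoint already fine-tuned for emotion classification. The checkpoint name, label set, and helper function below are illustrative stand-ins, not the authors' actual model or code.

    # Minimal sketch of sentence-level emotion tagging with a BERT classifier.
    # The checkpoint below is an assumed publicly available stand-in; the paper
    # fine-tunes BERT itself on its own emotion labels.
    from transformers import pipeline

    emotion_classifier = pipeline(
        "text-classification",
        model="bhadresh-savani/bert-base-uncased-emotion",  # assumed stand-in checkpoint
    )

    def tag_sentences(sentences):
        """Return (sentence, emotion label, confidence) triples, one per sentence."""
        results = emotion_classifier(sentences)
        return [
            (sent, res["label"], res["score"])
            for sent, res in zip(sentences, results)
        ]

    story_sentences = [
        "The little fox laughed and danced in the rain.",
        "Suddenly, the bridge cracked beneath her feet!",
    ]
    for sent, label, score in tag_sentences(story_sentences):
        print(f"{label:>10} ({score:.2f})  {sent}")

    In the pipeline described above, each tagged sentence would then be handed to the Tacotron2 + WaveGlow stage with the matching emotional style applied to the synthesized narration.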

    Multimodal Language Analysis with Recurrent Multistage Fusion

    Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each focused on a subset of multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled with this multistage fusion approach, which builds upon the intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating the proposed fusion approach with a system of recurrent neural networks. RMFN achieves state-of-the-art performance in modeling human multimodal language across three public datasets for multimodal sentiment analysis, emotion recognition, and speaker traits recognition. We provide visualizations showing that each stage of fusion focuses on a different subset of multimodal signals, learning increasingly discriminative multimodal representations. Comment: EMNLP 2018
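
    The multistage fusion idea can be sketched roughly in PyTorch: at each time step, several fusion stages each soft-select ("highlight") a subset of the concatenated language, visual, and acoustic features and refine an intermediate fused vector, which then drives a recurrent cell over the sequence. The layer sizes, gating mechanism, and feature dimensions below are simplified assumptions, not the paper's exact equations.

    # Simplified sketch of recurrent multistage fusion (not the authors' exact model).
    import torch
    import torch.nn as nn

    class MultistageFusion(nn.Module):
        def __init__(self, dims, fused_dim, num_stages=3):
            super().__init__()
            total = sum(dims)  # language + visual + acoustic feature sizes
            self.num_stages = num_stages
            # Per-stage gate that highlights a subset of the multimodal signals.
            self.highlight = nn.ModuleList(
                nn.Linear(total + fused_dim, total) for _ in range(num_stages)
            )
            # Per-stage fusion that refines the intermediate fused representation.
            self.fuse = nn.ModuleList(
                nn.Linear(total + fused_dim, fused_dim) for _ in range(num_stages)
            )

        def forward(self, lang, vis, aco, fused_prev):
            x = torch.cat([lang, vis, aco], dim=-1)   # per-step multimodal features
            z = fused_prev                            # intermediate fused state
            for k in range(self.num_stages):
                gate = torch.sigmoid(self.highlight[k](torch.cat([x, z], dim=-1)))
                z = torch.tanh(self.fuse[k](torch.cat([gate * x, z], dim=-1)))
            return z                                  # cross-modal summary for this step

    class RecurrentMultistageFusion(nn.Module):
        """Wraps the stagewise fusion in a recurrent loop over the sequence."""
        def __init__(self, dims=(300, 35, 74), fused_dim=128, num_stages=3):
            super().__init__()
            self.fusion = MultistageFusion(dims, fused_dim, num_stages)
            self.rnn_cell = nn.LSTMCell(fused_dim, fused_dim)
            self.head = nn.Linear(fused_dim, 1)       # e.g. sentiment regression

        def forward(self, lang, vis, aco):
            # lang/vis/aco: (batch, time, feature_dim) word-aligned sequences
            b, t, _ = lang.shape
            h = lang.new_zeros(b, self.rnn_cell.hidden_size)
            c = torch.zeros_like(h)
            for step in range(t):
                z = self.fusion(lang[:, step], vis[:, step], aco[:, step], h)
                h, c = self.rnn_cell(z, (h, c))
            return self.head(h)

    # Example with random aligned features (dims loosely follow CMU-MOSI-style inputs).
    model = RecurrentMultistageFusion()
    out = model(torch.randn(4, 20, 300), torch.randn(4, 20, 35), torch.randn(4, 20, 74))
    print(out.shape)  # torch.Size([4, 1])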