
    Transfer Learning for Speech and Language Processing

    Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example, in speech recognition, an acoustic model trained for one language can be used to recognize speech in another language with little or no re-training data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual) and has traditionally been studied under the name 'model adaptation'. Recent advances in deep learning show that transfer learning becomes much easier and more effective with high-level abstract features learned by deep models, and that the 'transfer' can be conducted not only between data distributions and data types, but also between model structures (e.g., shallow nets and deep nets) or even model types (e.g., Bayesian models and neural models). This review paper summarizes some recent prominent research in this direction, particularly for speech and language processing. We also report some results from our group and highlight the potential of this very interesting research field.
    Comment: 13 pages, APSIPA 2015
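    The cross-lingual acoustic-model example above is the classic adaptation recipe: reuse the high-level features of a model trained on a source language and re-train only a small language-specific head on the target language. Below is a minimal PyTorch sketch of that recipe; the model, layer sizes, and phone counts are all illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative source-language acoustic model: a feature stack plus a
# phone classifier. In practice this would be a trained ASR encoder.
class AcousticModel(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_phones=40):
        super().__init__()
        self.encoder = nn.Sequential(          # high-level feature extractor
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, n_phones)  # language-specific head

    def forward(self, x):
        return self.classifier(self.encoder(x))

# 1) Pretend this model was trained on the source language.
model = AcousticModel(n_phones=40)

# 2) Transfer: keep the encoder, swap in a head for the target language's
#    phone set (35 is a hypothetical count) and freeze the shared layers.
model.classifier = nn.Linear(256, 35)
for p in model.encoder.parameters():
    p.requires_grad = False                    # adapt only the new head

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# 3) A few steps on the (small) target-language data can suffice because
#    the high-level features already generalize across languages.
feats = torch.randn(32, 80)                    # dummy batch of frames
labels = torch.randint(0, 35, (32,))
optimizer.zero_grad()
loss = loss_fn(model(feats), labels)
loss.backward()
optimizer.step()
```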

    FACE READERS: The Frontier of Computer Vision and Math Learning

    The future of AI-assisted individualized learning includes computer vision to inform intelligent tutors and teachers about student affect, motivation and performance. Facial expression recognition is essential for recognizing subtle differences when students ask for hints or fail to solve problems. Facial features and classification labels enable intelligent tutors to predict students’ performance and recommend activities. Videos can capture students’ faces and model their effort and progress; machine learning classifiers can support intelligent tutors in providing interventions. One goal of this research is to support deep dives by teachers to identify students’ individual needs through facial expressions and to provide immediate feedback. Another goal is to develop data-directed education to gauge students’ pre-existing knowledge and analyze real-time data that will engage both teachers and students in more individualized and precision teaching and learning. This paper identifies three phases in the process of recognizing and predicting student progress based on analyzing facial features: Phase I: collecting datasets and identifying salient labels for facial features and student attention/engagement; Phase II: building and training deep learning models of facial features; and Phase III: predicting student problem-solving outcomes.
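    As a concrete illustration of Phases II and III, the sketch below fine-tunes a standard vision backbone to classify engagement from face crops and then pools per-frame predictions into a video-level estimate. The label set, backbone choice, and pooling scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

# Phase II sketch: fine-tune a standard vision backbone to map face
# crops to engagement labels. Labels and sizes are illustrative.
NUM_LABELS = 3                                  # e.g. engaged / neutral / frustrated

backbone = models.resnet18(weights=None)        # or pretrained weights
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_LABELS)

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

faces = torch.randn(8, 3, 224, 224)             # dummy batch of face crops
labels = torch.randint(0, NUM_LABELS, (8,))
loss = loss_fn(backbone(faces), labels)
loss.backward()
optimizer.step()

# Phase III sketch: per-frame predictions over a video are pooled into a
# single engagement estimate that the tutor could act on.
with torch.no_grad():
    frame_probs = backbone(faces).softmax(dim=1)
    video_estimate = frame_probs.mean(dim=0)    # average over frames
```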

    Distilled Non-Semantic Speech Embeddings with Binary Neural Networks for Low-Resource Devices

    This work introduces BRILLsson, a novel binary neural network-based representation learning model for a broad range of non-semantic speech tasks. We train the model with knowledge distillation from a large, real-valued TRILLsson model, using only a fraction of the dataset used to train TRILLsson. The resulting BRILLsson models are only 2 MB in size with a latency of less than 8 ms, making them suitable for deployment on low-resource devices such as wearables. We evaluate BRILLsson on eight benchmark tasks (including but not limited to spoken language identification, emotion recognition, health condition diagnosis, and keyword spotting), and demonstrate that our proposed ultra-light and low-latency models perform as well as large-scale models.
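    A minimal sketch of the distillation setup described above: a small student is trained to reproduce the embeddings of a frozen real-valued teacher. The sign-based straight-through binarization stands in for a full binary-neural-network toolkit, and all layer sizes are illustrative rather than BRILLsson's actual architecture.

```python
import torch
import torch.nn as nn

# Straight-through sign binarization: the forward pass uses sign(w), the
# backward pass lets gradients through unchanged. A simple stand-in for
# a real BNN training stack.
class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)
    @staticmethod
    def backward(ctx, g):
        return g

class BinaryLinear(nn.Linear):
    def forward(self, x):
        return nn.functional.linear(x, BinarizeSTE.apply(self.weight), self.bias)

# Hypothetical teacher/student pair: a large real-valued embedding model
# (TRILLsson-like) and a tiny binary student with the same output size.
teacher = nn.Sequential(nn.Linear(64, 1024), nn.ReLU(), nn.Linear(1024, 1024))
student = nn.Sequential(BinaryLinear(64, 256), nn.ReLU(), BinaryLinear(256, 1024))
teacher.eval()                             # teacher is frozen during distillation

opt = torch.optim.Adam(student.parameters(), lr=1e-3)

audio_feats = torch.randn(16, 64)          # dummy batch of audio features
with torch.no_grad():
    target = teacher(audio_feats)          # teacher embeddings as targets
loss = nn.functional.mse_loss(student(audio_feats), target)  # distillation loss
loss.backward()
opt.step()
```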

    Theseus: a 3D virtual reality orientation game with a real-time guidance system for cognitive training

    Studies support cognitive training as an efficient method to slow cognitive decline in older adults. Virtual reality (VR) based serious games have found application in this field due to the high level of immersion and interactivity offered by virtual environments (VE). This project implements a fully immersive 3D virtual reality orientation game with a real-time guidance system, to be used as an exercise for cognitive training. The immediate aftereffects of playing the orientation game on memory and attention abilities were studied in fifteen older adults with subjective cognitive decline (SCD). It was observed that while there was no significant improvement in attention exercises, the participants performed better in specific memory exercises after playing the orientation game.
    A lack of success in achieving the required objective may increase negative emotions in humans, and more so in people who suffer from cognitive decline. Hence, the game was equipped with a real-time guidance system with location hints to control negative emotions and help participants complete the tasks. The guidance system is based on logical rules; each hint is delivered if a specific condition is met. The change in participants' emotions showed that hints are effective in reducing frustration, given that the hints are easily comprehensible and designed to give positive feedback.
    The final part of the project focuses on the guidance system and implements a way to activate it entirely based on a person's emotions. The problem calls for identifying the emotional state that should trigger the guidance system's activation. This problem takes the form of a Markov decision process (MDP), which can be solved in a reinforcement learning (RL) framework. A deep Q-network (DQN) with experience replay (ER), one of the state-of-the-art reinforcement learning algorithms for predicting actions in a discrete action space, was used in this context. The algorithm was trained on simulated emotion data and tested on the data of fifteen older adults acquired in experiments conducted in the first part of the project. The RL-based method is observed to perform better than the rule-based method at identifying a person's mental state in order to provide hints.
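    The hint-triggering policy described above is a textbook use of DQN with experience replay. The sketch below shows the core loop under illustrative assumptions: a small state vector standing in for recent emotion estimates, two actions (give a hint or wait), and a dummy simulated environment; the separate target network that full DQN implementations normally use is omitted for brevity.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99   # illustrative sizes

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)              # experience replay buffer

def act(state, eps=0.1):
    if random.random() < eps:              # epsilon-greedy exploration
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return q_net(state).argmax().item()

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.stack(x) if torch.is_tensor(x[0]) else torch.tensor(x)
                         for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                  # one-step TD target
        target = r + GAMMA * q_net(s2).max(dim=1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Interaction loop; the thesis trained on simulated emotion data, which
# the random transitions below merely stand in for.
state = torch.randn(STATE_DIM)
for _ in range(100):
    action = act(state)
    next_state, reward, done = torch.randn(STATE_DIM), random.random(), False
    replay.append((state, action, reward, next_state, done))
    train_step()
    state = next_state
```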

    Non-acted multi-view audio-visual dyadic interactions. Project master thesis: multitask learning for facial attributes analysis

    Final project of the Master in Foundations of Data Science, Faculty of Mathematics, Universitat de Barcelona. Year: 2019. Advisors: Sergio Escalera Guerrero, Cristina Palmero and Julio C. S. Jacques Junior. In this thesis we explore the use of Multitask Learning for improving performance in facial attribute tasks such as gender, age and ethnicity prediction. These tasks, along with emotion recognition, will be part of a new dyadic interaction dataset which was recorded during the development of this thesis. This work includes the implementation of two state-of-the-art multitask deep learning models and a discussion of the results obtained from these methods on a preliminary dataset, as well as a first evaluation on a sample of the dyadic interaction dataset. This will serve as a baseline for a future implementation of Multitask Learning methods on the fully annotated dyadic interaction dataset.
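    The standard way to set up such a multitask model is hard parameter sharing: one backbone feeds one lightweight head per attribute, and the per-task losses are summed. The sketch below illustrates this pattern; the encoder, task heads, and class counts are placeholders, not the thesis's models.

```python
import torch
import torch.nn as nn

# Hard parameter sharing: a shared backbone plus one head per task.
class MultitaskFaceModel(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(        # stands in for a CNN encoder
            nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
        self.heads = nn.ModuleDict({
            "gender":    nn.Linear(feat_dim, 2),
            "ethnicity": nn.Linear(feat_dim, 5),
            "age":       nn.Linear(feat_dim, 1),   # regression head
        })

    def forward(self, x):
        z = self.backbone(x)
        return {task: head(z) for task, head in self.heads.items()}

model = MultitaskFaceModel()
out = model(torch.randn(4, 3, 64, 64))

# The joint loss is a (possibly weighted) sum of per-task losses.
targets = {"gender": torch.randint(0, 2, (4,)),
           "ethnicity": torch.randint(0, 5, (4,)),
           "age": torch.rand(4, 1) * 80}
loss = (nn.functional.cross_entropy(out["gender"], targets["gender"])
        + nn.functional.cross_entropy(out["ethnicity"], targets["ethnicity"])
        + nn.functional.l1_loss(out["age"], targets["age"]))
loss.backward()
```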

    Unified Pretraining Target Based Video-music Retrieval With Music Rhythm And Video Optical Flow Information

    Background music (BGM) can enhance a video's emotional impact. However, selecting an appropriate BGM often requires domain knowledge, which has motivated the development of video-music retrieval techniques. Most existing approaches utilize pretrained video/music feature extractors trained with different target sets to obtain average video/music-level embeddings. The drawbacks are twofold. First, different target sets for video/music pretraining may make the generated embeddings difficult to match. Second, the underlying temporal correlation between video and music is ignored. In this paper, our proposed approach leverages a unified target set to perform video/music pretraining and produces clip-level embeddings to preserve temporal information. The downstream cross-modal matching is based on the clip-level features with embedded music rhythm and optical flow information. Experiments demonstrate that our proposed method achieves superior performance over state-of-the-art methods by a significant margin.
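    One common way to implement clip-level cross-modal matching of this kind is to project per-clip video and music embeddings into a shared space and align them with a symmetric contrastive loss. The sketch below follows that pattern; the dimensions, temperature, and loss choice are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes; rhythm and optical-flow features would be
# concatenated into the clip embeddings before projection.
D_VIDEO, D_MUSIC, D_SHARED, N_CLIPS = 512, 128, 256, 8

video_proj = nn.Linear(D_VIDEO, D_SHARED)
music_proj = nn.Linear(D_MUSIC, D_SHARED)

video_clips = torch.randn(N_CLIPS, D_VIDEO)   # per-clip video embeddings
music_clips = torch.randn(N_CLIPS, D_MUSIC)   # temporally aligned music clips

v = F.normalize(video_proj(video_clips), dim=1)
m = F.normalize(music_proj(music_clips), dim=1)

logits = v @ m.t() / 0.07                     # clip-to-clip similarity matrix
labels = torch.arange(N_CLIPS)                # aligned clips are the positives
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
loss.backward()
```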

    Audio-Infused Automatic Image Colorization by Exploiting Audio Scene Semantics

    Automatic image colorization is inherently an ill-posed problem with uncertainty, which requires an accurate semantic understanding of scenes to estimate reasonable colors for grayscale images. Although recent interaction-based methods have achieved impressive performance, it is still very difficult to infer realistic and accurate colors automatically. To reduce the difficulty of semantic understanding of grayscale scenes, this paper utilizes the corresponding audio, which naturally contains extra semantic information about the same scene. Specifically, a novel audio-infused automatic image colorization (AIAIC) network is proposed, which consists of three stages. First, we take color image semantics as a bridge and pretrain a colorization network guided by color image semantics. Second, the natural co-occurrence of audio and video is utilized to learn the color semantic correlations between audio and visual scenes. Third, the implicit audio semantic representation is fed into the pretrained network to finally realize the audio-guided colorization. The whole process is trained in a self-supervised manner without human annotation. In addition, an audiovisual colorization dataset is established for training and testing. Experiments demonstrate that audio guidance can effectively improve the performance of automatic colorization, especially for scenes that are difficult to understand from the visual modality alone.
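    The third stage, feeding an audio semantic representation into a pretrained colorization network, can be illustrated with a simple conditioning scheme in which the audio embedding is broadcast over the spatial feature map. The fusion mechanism, layer sizes, and color-space choice below are assumptions for illustration, not the AIAIC architecture.

```python
import torch
import torch.nn as nn

# Toy audio-conditioned colorizer: grayscale in, chrominance out.
class AudioGuidedColorizer(nn.Module):
    def __init__(self, audio_dim=128, ch=32):
        super().__init__()
        self.enc = nn.Conv2d(1, ch, 3, padding=1)          # grayscale encoder
        self.audio_fc = nn.Linear(audio_dim, ch)           # audio -> channel bias
        self.dec = nn.Conv2d(ch, 2, 3, padding=1)          # predict ab channels

    def forward(self, gray, audio_emb):
        feat = torch.relu(self.enc(gray))
        # Broadcast the audio semantics over all spatial positions.
        bias = self.audio_fc(audio_emb)[:, :, None, None]
        return self.dec(torch.relu(feat + bias))           # chrominance estimate

model = AudioGuidedColorizer()
gray = torch.rand(2, 1, 64, 64)          # L channel of a grayscale frame
audio_emb = torch.randn(2, 128)          # embedding of the co-occurring audio
ab = model(gray, audio_emb)              # (2, 2, 64, 64) predicted color channels
```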