29,998 research outputs found

    Mandarin Singing Voice Synthesis Based on Harmonic Plus Noise Model and Singing Expression Analysis

    Full text link
    The purpose of this study is to investigate how humans interpret musical scores expressively, and then to design machines that sing like humans. We consider six factors that strongly influence the expression of human singing; these factors relate to the acoustic, phonetic, and musical features of a real singing signal. Given real singing voices recorded following the MIDI scores and lyrics, our analysis module can extract the expression parameters from the real singing signals semi-automatically. The expression parameters are used to control a singing voice synthesis (SVS) system for Mandarin Chinese based on the harmonic plus noise model (HNM). The results of perceptual experiments show that integrating the expression factors into the SVS system yields a notable improvement in perceptual naturalness, clarity, and expressiveness. By one-to-one mapping of the real singing signal and expression controls to the synthesizer, our SVS system can simulate the interpretation of a real singer with the timbre of a speaker. Comment: 8 pages, technical report.
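The core synthesis idea named in the abstract, the harmonic plus noise model, decomposes a voiced sound into a periodic part (a sum of sinusoids at multiples of the fundamental) and a stochastic part. A minimal sketch of that decomposition, with hypothetical parameter values not taken from the paper:

```python
import numpy as np

def hnm_synthesize(f0, harmonic_amps, noise_level, duration, sr=16000):
    """Render a short sound as harmonics plus noise (HNM idea).

    The periodic part is a sum of sinusoids at integer multiples of the
    fundamental f0; the stochastic part is scaled white noise. A real
    HNM synthesizer would use time-varying, spectrally shaped parts.
    """
    t = np.arange(int(duration * sr)) / sr
    harmonic = np.zeros_like(t)
    for k, amp in enumerate(harmonic_amps, start=1):
        harmonic += amp * np.sin(2 * np.pi * k * f0 * t)
    noise = noise_level * np.random.randn(len(t))
    return harmonic + noise

# Example: a 220 Hz tone with three decaying harmonics and light noise.
signal = hnm_synthesize(f0=220.0, harmonic_amps=[1.0, 0.5, 0.25],
                        noise_level=0.01, duration=0.1)
```

In the paper's system, expression parameters extracted from real singing (e.g. pitch and timing deviations) would modulate such per-frame parameters rather than being fixed constants as here.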

    Computer vision methods for unconstrained gesture recognition in the context of sign language annotation

    Get PDF
    This PhD thesis concerns the study of computer vision methods for the automatic recognition of unconstrained gestures in the context of sign language annotation. Sign Language (SL) is a visual-gestural language developed by deaf communities. Continuous SL consists of a sequence of signs performed one after another, involving manual and non-manual features that convey information simultaneously. Even though standard signs are defined in dictionaries, their realization varies greatly with context. In addition, signs are often linked by movement epenthesis, the meaningless transitional gesture between signs. This extreme variability and the co-articulation effect make automatic SL processing challenging. Numerous annotated SL video corpora are needed in order to study the language and to train statistical machine translation systems. Annotation of SL video corpora is generally performed manually by linguists or computer scientists experienced in SL; manual annotation, however, is error-prone, unreproducible, and time-consuming, and the quality of the results depends on the annotator's knowledge of SL. Combining annotator expertise with image processing techniques facilitates the annotation task, increasing robustness and reducing the time required. The goal of this research is the study and development of image processing techniques to assist the annotation of SL video corpora: body-part tracking, hand segmentation, temporal segmentation, and gloss recognition. Throughout this thesis we address the problem of gloss annotation of SL video corpora. First, we aim to detect the boundaries corresponding to the beginning and end of each sign. This annotation method requires several low-level components to segment the signs and to extract motion and hand-shape features. We first propose a particle-filter-based approach for tracking the hands and face that is robust to occlusions. Then, a hand segmentation method is developed to extract the hand region even when the hand is in front of the face. Motion features are used to produce an initial temporal segmentation of the signs, which is subsequently improved using hand-shape features; these allow segmentation boundaries detected in the middle of a sign to be discarded. Once the signs are segmented, visual features are extracted and the signs are recognized as glosses using lexical descriptions of signs based on phonological models. We evaluated our algorithms on international corpora in order to show their advantages and limitations. The evaluation demonstrates the robustness of the proposed methods with respect to high dynamics and the numerous occlusions between body parts. The resulting annotation is independent of the annotator and represents a substantial gain in annotation consistency.
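The motion-based temporal segmentation step described in the abstract can be illustrated with a toy version: candidate sign boundaries are placed where hand velocity drops below a threshold, mimicking the low-motion transitions between signs. The feature (2D hand position) and the threshold value are hypothetical simplifications, not the thesis's actual method:

```python
import numpy as np

def segment_signs(hand_positions, velocity_threshold=0.05):
    """Detect candidate sign boundaries from hand motion.

    hand_positions: array of shape (n_frames, 2), one (x, y) per frame.
    A boundary is emitted at each frame where the per-frame hand
    velocity drops from above to below the threshold.
    """
    # per-frame displacement magnitude between consecutive frames
    velocity = np.linalg.norm(np.diff(hand_positions, axis=0), axis=1)
    low_motion = velocity < velocity_threshold
    boundaries = []
    for i in range(1, len(low_motion)):
        if low_motion[i] and not low_motion[i - 1]:
            boundaries.append(i)
    return boundaries

# Motion, then a near-still transition, then motion again:
positions = np.array([[0.0, 0], [0.2, 0], [0.4, 0],
                      [0.41, 0], [0.42, 0], [0.6, 0], [0.8, 0]])
# → one boundary, at the frame where motion first stalls
```

In the thesis, such motion-derived boundaries are then filtered using hand-shape features to discard boundaries falling mid-sign.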

    Parametric synthesis of sign language

    Get PDF
    The isolation of the deaf community from mainstream society is in part due to the lack of knowledge most hearing people have of sign language. To most, there seems to be little need to learn a language spoken by such a small minority unless, perhaps, a relative is unable to hear. Even with a desire to learn, the task may seem insurmountable due to the unique formational and grammatical rules of the language. This linguistic rift has led to the call for an automatic translation system with the ability to take voice or written text as input and produce a comprehensible sequence of signed gestures through computing. This thesis focused on developing the foundation of a system that would receive English language input and generate a sequence of related signed gestures, each synthesized from its basic kinematic parameters. A technique of sign specification for a computer-based translation system was developed through the use of Python objects and functions. Sign definitions, written as Python algorithms, were used to drive the simulation engine of a human-modeling software known as Jack. This research suggests that 3-dimensional computer graphics can be utilized in the production of sign representations that are intelligible and natural in appearance.
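The abstract describes signs specified as Python objects whose kinematic parameters drive an animation engine. A simplified, hypothetical sketch of that idea (the class fields, parameter vocabulary, and keyframe plan are illustrative; the actual Jack API calls are not shown):

```python
from dataclasses import dataclass, field

@dataclass
class SignDefinition:
    """A hypothetical sign specification: a sign is described by its
    basic kinematic parameters rather than a prerecorded animation."""
    gloss: str
    handshape: str                                 # e.g. "flat-B", "index-G"
    location: str                                  # e.g. "forehead", "neutral-space"
    movement: list = field(default_factory=list)   # ordered keyframe targets

def synthesize_sequence(signs):
    """Flatten sign definitions into an ordered keyframe plan that an
    animation engine could execute command by command."""
    plan = []
    for sign in signs:
        plan.append(("set_handshape", sign.handshape))
        plan.append(("move_to", sign.location))
        for target in sign.movement:
            plan.append(("move_to", target))
    return plan

hello = SignDefinition("HELLO", "flat-B", "forehead", ["outward"])
plan = synthesize_sequence([hello])
```

The appeal of this parametric design, as the thesis suggests, is that new signs are added as data-plus-code definitions rather than hand-animated clips.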

    Pauses and the temporal structure of speech

    Get PDF
    Natural-sounding speech synthesis requires close control over the temporal structure of the speech flow. This includes a full predictive scheme for the durational structure, in particular the prolongation of final syllables of lexemes, as well as for the pausal structure of the utterance. In this chapter, a description of the temporal structure and a summary of the numerous factors that modify it are presented. In the second part, predictive schemes for the temporal structure of speech ("performance structures") are introduced, and their potential for characterising the overall prosodic structure of speech is demonstrated.
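A common way to realize such a predictive duration scheme is multiplicatively: a syllable's intrinsic duration is stretched by contextual factors such as final lengthening and pre-pausal position. The sketch below illustrates the shape of such a model; the factor values are invented for illustration and are not taken from the chapter:

```python
def predict_duration(base_ms, is_lexeme_final, before_pause, speaking_rate=1.0):
    """Toy multiplicative duration model.

    base_ms: intrinsic syllable duration in milliseconds.
    Contextual factors stretch the duration; speaking rate rescales it.
    Factor values here are hypothetical placeholders.
    """
    duration = base_ms
    if is_lexeme_final:
        duration *= 1.3   # prolongation of lexeme-final syllables
    if before_pause:
        duration *= 1.5   # pre-pausal lengthening
    return duration / speaking_rate

# A lexeme-final syllable before a pause, at normal rate:
d = predict_duration(100, is_lexeme_final=True, before_pause=True)
```

Real schemes of this family (e.g. Klatt-style rule systems) use many more factors, but the multiplicative composition is the same.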

    A new visual speech modelling approach for visual speech recognition

    Get PDF
    In this paper we propose a new learning-based representation, referred to as the Visual Speech Unit (VSU), for visual speech recognition (VSR). The Visual Speech Unit concept extends the standard viseme model currently applied in VSR by including in the representation not only the data associated with the visemes, but also the transitory information between consecutive visemes. The developed speech recognition system consists of several computational stages: (a) lip segmentation, (b) construction of Expectation-Maximization Principal Component Analysis (EM-PCA) manifolds from the input video images, (c) registration between the VSU models and the EM-PCA data constructed from the input image sequence, and (d) recognition of the VSUs using a standard Hidden Markov Model (HMM) classification scheme. In this paper we were particularly interested in evaluating the classification accuracy obtained for our new VSU models compared with that attained for standard (MPEG-4) viseme models. The experimental results indicate a 90% recognition rate when the system was applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 52%.
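The final stage, HMM classification, typically scores an observation sequence against one HMM per class and picks the best-scoring model. A minimal sketch of that stage using the scaled forward algorithm over discrete observations (the toy models and symbol alphabet below are hypothetical, not the paper's EM-PCA features):

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm.
    pi: (n_states,) initial probs; A: (n_states, n_states) transitions;
    B: (n_states, n_symbols) emission probs."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    log_like = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()
        log_like += np.log(c)
        alpha = alpha / c
    return log_like

def recognize(obs, models):
    """Pick the class label whose HMM scores the sequence highest."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))

# Two single-state toy "VSU" models over a 2-symbol alphabet:
pi = np.array([1.0])
A = np.array([[1.0]])
models = {"vsu_a": (pi, A, np.array([[0.9, 0.1]])),
          "vsu_b": (pi, A, np.array([[0.1, 0.9]]))}
```

In the paper's system each VSU class would have its own trained HMM over the registered EM-PCA manifold features, but the decision rule is this argmax over per-model likelihoods.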

    Sign Language Tutoring Tool

    Full text link
    In this project, we have developed a sign language tutor that lets users learn isolated signs by watching recorded videos and by trying the same signs. The system records the user's video and analyses it. If the sign is recognized, both verbal and animated feedback is given to the user. The system is able to recognize complex signs that involve both hand gestures and head movements and expressions. Our performance tests yield a 99% recognition rate on signs involving only manual gestures and an 85% recognition rate on signs that involve both manual and non-manual components, such as head movement and facial expressions. Comment: eNTERFACE'06 Summer Workshop on Multimodal Interfaces, Dubrovnik, Croatia (2007).

    Linguistics As Structure In Computer Animation: Toward A More Effective Synthesis Of Brow Motion In American Sign Language

    Get PDF
    Computer-generated three-dimensional animation holds great promise for synthesizing utterances in American Sign Language (ASL) that are not only grammatical but also well tolerated by members of the Deaf community. Unfortunately, animation poses several challenges stemming from the necessity of grappling with massive amounts of data. However, the linguistics of ASL can aid in surmounting this challenge by providing structure and rules for organizing animation data. An exploration of the linguistic and extralinguistic behavior of the brows from an animator's viewpoint yields a new approach for synthesizing nonmanuals that departs from the conventional animation of anatomy and instead animates the effects of interacting levels of linguistic function. Results of formal testing with Deaf users indicate that this is a promising approach.

    How2Sign: A large-scale multimodal dataset for continuous American sign language

    Get PDF
    One of the factors that have hindered progress in the areas of sign language recognition, translation, and production is the absence of large annotated datasets. Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth. A three-hour subset was further recorded in the Panoptic studio, enabling detailed 3D pose estimation. To evaluate the potential of How2Sign for real-world impact, we conduct a study with ASL signers and show that synthesized videos using our dataset can indeed be understood. The study further gives insights on challenges that computer vision should address in order to make progress in this field. Dataset website: http://how2sign.github.io/ This work received funding from Facebook through gifts to CMU and UPC; through projects TEC2016-75976-R, TIN2015-65316-P, SEV-2015-0493 and PID2019-107255GB-C22 of the Spanish Government and 2017-SGR-1414 of Generalitat de Catalunya. This work used XSEDE's "Bridges" system at the Pittsburgh Supercomputing Center (NSF award ACI-1445606). Amanda Duarte has received support from la Caixa Foundation (ID 100010434) under the fellowship code LCF/BQ/IN18/11660029.
Shruti Palaskar was supported by the Facebook Fellowship program.