3 research outputs found

    The ACM Multimedia 2023 Computational Paralinguistics Challenge: emotion share & requests

    The ACM Multimedia 2023 Computational Paralinguistics Challenge addresses two different problems for the first time in a research competition under well-defined conditions: in the Emotion Share Sub-Challenge, a regression on speech has to be made; and in the Requests Sub-Challenge, requests and complaints need to be detected. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the 'usual' ComParE features, the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectrum toolkit; in addition, wav2vec2 models are used.

    Identification audio pour la reconnaissance de la parole (Audio identification for speech recognition)

    This PhD thesis addresses speech recognition in audio documents. Its aim is to adapt the principles of audio identification to speech recognition and to design and develop robust identification techniques. Audio fingerprinting systems are designed for indexing music excerpts but do not handle the specific properties of the speech signal. First, various audio fingerprinting methods are studied, together with an initial adaptation to speech recognition. This work is followed by the development of an audio fingerprinting system dedicated to the task of acoustic-phonetic decoding. New types of sub-fingerprint based on standard speech parameters are then proposed. Second, the various types of variability of the speech signal are described, as well as the main parameters for the acoustic representation of the speech signal. The robustness of various types of sub-fingerprint to extrinsic and intrinsic variability is evaluated. In the presence of disturbances related to the environment and to the transmission conditions of the speech signal (CTIMIT), a type of sub-fingerprint derived from audio identification proves the most robust.

    Multi-Channel Video Segmentation

    A video is a multimedia document structured into scenes and shots. Scenes are lists of consecutive shots characterized by common visual and audio features. Shots are sets of consecutive frames separated by cuts, which can be easily recognized by existing techniques. Video segmentation into scenes is a new and open problem. It is needed for scene retrieval, especially in authoring and interactive video applications. We propose a new approach to video segmentation into scenes, which is based on several media and takes into account the film syntax:
    * We characterize a scene by some similarity between the color histogram of the current shot and that of one of the most recent previous shots (shot (t-1), shot (t-2), etc.). Similarity between a frame of the current shot and a frame of a previous shot may indicate the presence of alternating shots, which belong to the same scene. Other techniques based on Projective Geometry are presented in a companion paper 1. These techniques enable to detect the movemen..
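The histogram criterion in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the histogram-intersection measure, the `window` of three previous shots, the `threshold` of 0.7, and the function names are all assumptions chosen for illustration. Each shot is represented by a single color histogram, and a shot opens a new scene only when it resembles none of the most recent previous shots.

```python
from typing import List, Sequence


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1] between two histograms, each normalized to sum to 1."""
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 or s2 == 0:
        return 0.0
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))


def segment_scenes(
    shot_histograms: List[Sequence[float]],
    window: int = 3,         # hypothetical: how many previous shots to compare with
    threshold: float = 0.7,  # hypothetical: minimum similarity to stay in the scene
) -> List[int]:
    """Assign a scene index to every shot, in order."""
    scene_ids: List[int] = []
    current_scene = 0
    for t, hist in enumerate(shot_histograms):
        if t > 0:
            # Compare with shot (t-1), shot (t-2), ... up to `window` shots back.
            recent = shot_histograms[max(0, t - window):t]
            best = max(histogram_intersection(hist, prev) for prev in recent)
            if best < threshold:
                current_scene += 1  # no recent shot is similar: start a new scene
        scene_ids.append(current_scene)
    return scene_ids


if __name__ == "__main__":
    # Toy 3-bin histograms: A, A, B, A, C
    shots = [[10, 0, 0], [10, 0, 0], [0, 10, 0], [10, 0, 0], [0, 0, 10]]
    print(segment_scenes(shots))
```

In the toy input, the fourth shot matches an earlier shot inside the window and so stays in the same scene as the shot before it, which is the alternating-shot case the abstract mentions. The sketch has no lookahead, so the first appearance of a new shot type still opens a new scene.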