14 research outputs found

    The Natural Statistics of Audiovisual Speech

    Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time-varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input-space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept, versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and the vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
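The core measurement described above, correlating a mouth-opening time series with the auditory amplitude envelope, can be sketched in a few lines. This is a minimal illustration rather than the paper's actual pipeline: the signals, sampling rates, and the 2–7 Hz band-pass below are placeholder assumptions.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample

fs_audio = 16000          # assumed audio sampling rate (Hz)
fs_video = 100            # assumed mouth-area sampling rate (Hz)

rng = np.random.default_rng(0)
audio = rng.standard_normal(fs_audio * 10)       # placeholder 10 s waveform
mouth_area = rng.standard_normal(fs_video * 10)  # placeholder mouth areas

# Wideband amplitude envelope via the Hilbert transform.
envelope = np.abs(hilbert(audio))

# Downsample the envelope to the video rate so the two series align.
envelope = resample(envelope, mouth_area.size)

# Restrict both series to the 2-7 Hz range where the paper reports the
# strongest temporal modulation.
b, a = butter(3, [2, 7], btype="bandpass", fs=fs_video)
env_band = filtfilt(b, a, envelope)
mouth_band = filtfilt(b, a, mouth_area)

# Pearson correlation between the two band-limited series.
r = np.corrcoef(mouth_band, env_band)[0, 1]
print(f"mouth-envelope correlation: {r:.3f}")
```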

    Segmentation and tracking of the outer and inner lip contours for virtual makeup and lip-reading telephony applications

    In recent years, the analysis of faces has attracted growing interest in the field of computer vision. The face is a powerful communication medium between human beings: it provides relevant cues about a person's identity, emotional state, and what he or she is saying. GIPSA-lab has carried out several studies on facial feature segmentation for multimedia applications (mixed reality, telephone terminals, human-computer interaction, gesture interpretation for nonverbal communication, interactive driving simulators, ...). These studies deal with face location; with eye, eyebrow, and mandibular arch contour extraction; and, more recently, with mouth contour segmentation. This work introduces an automatic outer and inner lip contour segmentation method for static images and video sequences. The algorithm is composed of two modules: a static module and a tracking module. In the case of static images, the first steps are face location and mouth bounding box extraction. The lip contours are then detected by combining active contours and parametric models. Jumping snakes are used to find key points that position a cubic-curve parametric model suited to the possible lip shape deformations. Two inner parametric models have been built: one for open mouths and another for closed mouths. Finally, maximization of relevant gradient flows is used to optimize the model parameter estimation. In the case of video sequences, the same static process is carried out on the first frame to initialize the tracking algorithm. On subsequent images, the tracking method is based on key point tracking techniques, and the model is again adjusted by gradient flow maximization. The contributions of this work are: 1) a flexible parametric lip model; 2) several gradients combining luminance and chrominance information to highlight the lip contours; 3) quantitative and qualitative evaluation of the segmentation algorithm's performance for virtual makeup and lip-reading applications.
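A minimal sketch of the cubic-curve idea behind the parametric model may help: each contour segment passes through a few detected key points, and a cubic through four points is an exact interpolant. The key-point coordinates below are hypothetical, and the thesis constrains its curves further than this.

```python
import numpy as np

# Hypothetical key points along the upper outer contour (pixels):
# left mouth corner, two points around the Cupid's bow, right corner.
xs = np.array([20.0, 45.0, 65.0, 90.0])
ys = np.array([60.0, 42.0, 43.0, 62.0])

# A degree-3 polynomial through four points interpolates them exactly,
# which is the flavor of "cubic curve" segment the model is built from.
curve = np.poly1d(np.polyfit(xs, ys, deg=3))

# Sample the modeled contour densely between the mouth corners.
x_dense = np.linspace(xs[0], xs[-1], 100)
y_dense = curve(x_dense)
print(y_dense[:5])
```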

    Inner lip contour segmentation combining active contours and parametric models

    Lip-reading applications require precise information about the movement and shape of the lips, characterized by both the outer and the inner contour of the mouth. In this article, we introduce a new method for detecting the inner contour. Starting from the outer contour provided by a pre-existing algorithm, we use key points to initialize an active contour called a "jumping snake". Using optimal luminance and chrominance gradient information, the active contour adjusts the position of two different parametric models composed of cubic curves (one model for closed mouths and a second for open mouths). The models yield a flexible and accurate inner contour. Finally, we present several experimental results demonstrating the effectiveness of the proposed algorithm.
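As a rough illustration of a combined luminance/chrominance cue of the kind this work relies on, the sketch below mixes the "pseudo-hue" quantity R/(R+G), known in this literature to separate lips from skin, with luminance before taking a vertical gradient. The exact gradient mixture used in the paper may differ, so treat this as an assumption-laden sketch.

```python
import numpy as np

def lip_gradient(rgb):
    """rgb: float array (H, W, 3) scaled to [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    pseudo_hue = r / (r + g + 1e-6)          # higher on lips than on skin
    luminance = 0.299 * r + 0.587 * g + 0.114 * b
    # Mix the two cues, then take the vertical gradient: lip contours in
    # a frontal face are mostly horizontal edges.
    mixed = pseudo_hue - luminance
    gy, gx = np.gradient(mixed)
    return gy

# Usage on a hypothetical frame:
frame = np.random.default_rng(1).random((120, 160, 3))
edges = lip_gradient(frame)
print(edges.shape)
```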

    Lip contour segmentation and tracking compliant with lip-reading application constraints

    We propose to use both active contours and parametric models for lip contour extraction and tracking. In the first image, jumping snakes are used to detect outer and inner contour key points. These points initialize a parametric lip model composed of several cubic curves suited to mouth deformations. Guided by a combined luminance and chrominance gradient, the initial model is optimized and precisely locked onto the lip contours. On subsequent images, the segmentation is based on the mouth bounding box and key point tracking. Quantitative and qualitative evaluations show the effectiveness of the algorithm for lip-reading applications.
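The "locked onto the lip contours" step can be pictured as maximizing the mean gradient magnitude sampled along the model curve. The sketch below does this for a toy parabolic curve with a generic optimizer; the curve family, gradient map, and optimizer are illustrative stand-ins for the paper's cubic model and its dedicated maximization.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
grad_map = rng.random((120, 160))    # placeholder gradient-magnitude image

def mean_gradient_along_curve(params, x=np.arange(30, 130)):
    a, b, c = params                 # y = a*x^2 + b*x + c (toy curve)
    y = np.clip(a * x**2 + b * x + c, 0, grad_map.shape[0] - 1)
    # A real implementation would interpolate the map smoothly instead
    # of rounding to integer pixel rows.
    return grad_map[y.astype(int), x].mean()

# Maximize the gradient flow by minimizing its negative.
res = minimize(lambda p: -mean_gradient_along_curve(p),
               x0=[0.001, -0.1, 60.0], method="Nelder-Mead")
print("optimized curve parameters:", res.x)
```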

    Inner and outer lip contour tracking using cubic curve parametric models

    The first step in lip-reading applications is mouth contour extraction, which provides the link between lip shape and the oral message. In our approach, the lip contours are detected in the first image with the two algorithms developed in [1] and [2] for static images. On subsequent images of the sequence, several key points (the mouth corners and the inner and outer middle contour points) are tracked with the Lucas-Kanade method to define an initial parametric lip model of the mouth. Guided by a combined luminance and chrominance gradient, the model is optimized and precisely locked onto the lip contours. The algorithm's performance is evaluated with respect to a lip-reading application.
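Since the abstract names the Lucas-Kanade method, a sketch of the key-point tracking step using OpenCV's pyramidal Lucas-Kanade tracker is shown below. The frames, point coordinates, and window settings are hypothetical; only the call itself reflects a real API.

```python
import numpy as np
import cv2

prev_gray = np.zeros((240, 320), dtype=np.uint8)   # placeholder frames
gray = np.zeros((240, 320), dtype=np.uint8)

# Mouth corners and mid-contour points from the static detection step,
# shaped (N, 1, 2) float32 as calcOpticalFlowPyrLK expects.
pts = np.array([[[100.0, 150.0]], [[160.0, 150.0]],
                [[130.0, 140.0]], [[130.0, 162.0]]], dtype=np.float32)

new_pts, status, err = cv2.calcOpticalFlowPyrLK(
    prev_gray, gray, pts, None, winSize=(15, 15), maxLevel=2)

tracked = new_pts[status.ravel() == 1]
print(f"{len(tracked)} of {len(pts)} key points tracked")
```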

    Lip Modelling and Segmentation

    Lip segmentation is the first step of any audio-visual speech reading system, and the accuracy of this segmentation has a major influence on the performance of the overall system. It is, however, a very difficult task. Lip shape can undergo strong deformations during a speech sequence, and, as with many other image processing algorithms, the segmentation is also affected by the illumination conditions and by the orientation of the object to be segmented. In this chapter, we present an overview of lip modeling and lip segmentation (region-based and contour-based methods), limiting our study to the problem of lip segmentation in frontal faces. Section I reviews the chrominance information used for lip segmentation and proposes a comparison between different chrominance cues. Section II presents region-based approaches and training steps. Section III focuses on contour-based approaches and parametric lip models. Section IV inventories methods for evaluating lip segmentation accuracy. Some specific applications are briefly presented in Section V.
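A comparison between chrominance cues, as surveyed in Section I, can be framed as scoring how well each cue separates lip pixels from skin pixels. The cues and the two-class criterion in the sketch below are common choices in this literature, not necessarily the chapter's own.

```python
import numpy as np

def cues(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return {
        "pseudo_hue": r / (r + g + 1e-6),
        "normalized_r": r / (r + g + b + 1e-6),
        "r_minus_g": r - g,
    }

def separability(lip_vals, skin_vals):
    # Simple two-class criterion: distance between the class means
    # divided by the pooled standard deviation.
    pooled = np.concatenate([lip_vals, skin_vals]).std() + 1e-6
    return abs(lip_vals.mean() - skin_vals.mean()) / pooled

rng = np.random.default_rng(3)
lip_px = rng.random((500, 3)) * [1.0, 0.6, 0.6]    # redder placeholder pixels
skin_px = rng.random((500, 3)) * [1.0, 0.9, 0.8]

for name, lip_c in cues(lip_px).items():
    print(f"{name}: {separability(lip_c, cues(skin_px)[name]):.2f}")
```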

    Average correlations between visual and auditory segments for longer speech materials.

    A - Top: inter-lip distance and the auditory envelope for a single 20-second segment from a single subject in the x-ray database, as a function of time. X-axes depict time in seconds. Y-axes on the left depict the distance between the lower and upper lip in millimeters; Y-axes on the right depict the power in the wideband envelope. Bottom: a zoomed-in view of the 8–12 second portion of the same data shown at top. Clear correspondences are present between the inter-lip distance and the auditory envelope. B - Scatter plot of the envelope power and inter-lip distance along with the corresponding regression line. Each red circle denotes a single point in the speech time series; the black line denotes the linear regression between the inter-lip distance and the envelope power. The correlation coefficient between the auditory and visual components for this sentence was 0.49 (p<0.0001). C - Average rank-ordered intact correlations (red bars) and shuffled correlations (green bars) for the 15 subjects analyzed in the Wisconsin x-ray database. X-axes depict subject number; Y-axes depict the correlations. Intact correlations for each subject were averaged across all speech segments analyzed for that subject. Error bars denote the standard error of the mean. Shuffled correlations were computed as the average correlation between all non-paired auditory envelopes and the inter-lip distance for each subject.
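The shuffled-correlation control described in this caption pairs each inter-lip distance trace with envelopes from other, non-matching segments and averages those mismatched correlations. A sketch with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(4)
n_segments, n_samples = 10, 500
lip_dist = rng.standard_normal((n_segments, n_samples))    # placeholder data
envelopes = rng.standard_normal((n_segments, n_samples))

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

intact = np.mean([corr(lip_dist[i], envelopes[i])
                  for i in range(n_segments)])
shuffled = np.mean([corr(lip_dist[i], envelopes[j])
                    for i in range(n_segments)
                    for j in range(n_segments) if i != j])
print(f"intact: {intact:.3f}  shuffled: {shuffled:.3f}")
```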

    Coherence between vision and audition.

    A - Left: heat map showing the coherence between the mouth-area function and the auditory signal as a function of both spectral frequency band and temporal modulation frequency for a single subject from the GRID corpus. X-axes depict temporal modulation frequency in Hz; Y-axes depict the spectral frequency in kHz. The square drawn in dashed lines marks the region of maximal coherence between the visual and auditory signals. Right: heat map for another subject from the GRID corpus; figure conventions as in the left panel. B - Average coherence between the mouth-area function and the auditory signal as a function of both spectral frequency band and temporal modulation frequency for the twenty subjects in the GRID corpus; figure conventions as in the left panel of A. C - Average coherence between the mouth-area function and the auditory signal for four different spectral frequencies (8.8 kHz, orange; 2.3 kHz, red; 161 Hz, blue; 460 Hz, green) across all subjects in the GRID corpus, as a function of temporal frequency. Shaded regions denote the standard error of the mean. D - Average coherence between the inter-lip distance and the wideband auditory envelope as a function of both spectral frequency band and temporal modulation frequency for the fifteen subjects in the Wisconsin x-ray database; figure conventions as in A. E - Average coherence between the area of the mouth opening and the wideband auditory envelope as a function of both spectral frequency band and temporal modulation frequency, averaged across the two subjects from the French spontaneous-speech database; figure conventions as in A.
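An analysis in the spirit of this figure, magnitude-squared coherence between the mouth-area function and one band's envelope as a function of temporal modulation frequency, can be sketched with scipy. The signals and sampling rate below are placeholders.

```python
import numpy as np
from scipy.signal import coherence

fs = 100                       # assumed common sampling rate (Hz)
rng = np.random.default_rng(5)
mouth_area = rng.standard_normal(fs * 30)
band_envelope = rng.standard_normal(fs * 30)   # e.g. envelope of one band

f, Cxy = coherence(mouth_area, band_envelope, fs=fs, nperseg=256)

# Report coherence in the 2-7 Hz modulation range highlighted above.
mask = (f >= 2) & (f <= 7)
print(f"mean 2-7 Hz coherence: {Cxy[mask].mean():.3f}")
```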