13 research outputs found

    Deep Multimodal Learning for Audio-Visual Speech Recognition

    In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean conditions on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of 35.83%, demonstrating the tremendous value of the visual channel in phone classification even in audio with high signal-to-noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class-specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of 34.03%. Comment: ICASSP 201
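    As a rough illustration of the bilinear softmax idea, the sketch below scores each class k with a class-specific bilinear form a^T W_k v + b_k over an audio vector a and a visual vector v, then normalizes with a softmax. This is only one plausible reading of the abstract: the function name, the bias term, and the exact parameterization are assumptions, not the authors' architecture.

```python
import math

def bilinear_softmax(audio, visual, weights, biases):
    """Hypothetical bilinear softmax: posterior over classes from two modalities.

    weights[k] is a len(audio) x len(visual) matrix for class k (assumed form).
    """
    logits = []
    for W_k, b_k in zip(weights, biases):
        # Class-specific interaction between every audio/visual feature pair
        score = sum(
            audio[i] * W_k[i][j] * visual[j]
            for i in range(len(audio))
            for j in range(len(visual))
        )
        logits.append(score + b_k)
    # Numerically stable softmax
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two classes, 2-dim audio and visual vectors (toy values)
posteriors = bilinear_softmax(
    audio=[1.0, 0.0],
    visual=[0.0, 1.0],
    weights=[[[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]],
    biases=[0.0, 0.0],
)
```

    In the toy call, only class 0 has a nonzero weight on the active audio/visual pair, so it receives the larger posterior.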

    Self-critical Sequence Training for Image Captioning

    Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a "baseline" to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) are avoided, while at the same time harmonizing the model with respect to its test-time inference procedure. Empirically we find that directly optimizing the CIDEr metric with SCST and greedy decoding at test time is highly effective. Our results on the MSCOCO evaluation server establish a new state of the art on the task, improving the best result in terms of CIDEr from 104.9 to 114.7. Comment: CVPR 2017 + additional analysis + fixed baseline results, 16 page
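    The core of the self-critical idea can be sketched in a few lines: the reward of a sampled caption is compared against the reward of the greedily decoded caption, and the difference scales the REINFORCE term. This is a minimal per-example sketch; the function name and the toy reward values are illustrative assumptions, and a real system would compute rewards with a CIDEr scorer and gradients with an autodiff framework.

```python
import math

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE loss for one sampled caption, with the reward of the
    greedy (test-time) caption used as the baseline."""
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the log-probability of samples that beat
    # greedy decoding and lowers it for samples that do worse.
    return -advantage * sum(sample_logprobs)

# Toy example: a two-word sampled caption, each word with probability 0.5
logprobs = [math.log(0.5), math.log(0.5)]
loss_better = scst_loss(logprobs, sample_reward=1.2, greedy_reward=1.0)
loss_equal = scst_loss(logprobs, sample_reward=1.0, greedy_reward=1.0)
```

    When the sample scores the same as the greedy caption, the loss (and hence the gradient) vanishes, which is how the baseline reduces variance without a learned critic.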

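    LÍNEAS DE FUERZAS QUE ATRAVIESAN EL DISPOSITIVO TEATRAL CONTEMPORÁNEO EN LA CIUDAD DE CÓRDOBA, ARGENTINA: VINCULACIONES ENTRE EL ÁMBITO ACADÉMICO Y EL TEATRO INDEPENDIENTE [Lines of force that cross through the contemporary theatrical device in the city of Córdoba, Argentina: connections between academia and independent theater]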

    This work accounts for research carried out at independent theaters in the city of Córdoba, and at the Department of Theater, School of Arts, National University of Córdoba (UNC). The inquiry aimed at surveying on-stage procedures and techniques, considering them as lines of force which cross through the contemporary theatrical device. It also aimed at establishing connections between academic and independent circles. Interviews were conducted with directors, actors and different theater makers. More than a hundred plays were seen and recorded, both independent productions and final projects from different classes at the Theater Department, School of Arts, National University of Córdoba. Many of the latter were eventually put on at independent venues. State, national, and international congresses, festivals, and various events related to theater making were also recorded. Furthermore, theatrical productions and theoretical approaches by the authors of this paper were also taken into account. The data yielded by this survey was analyzed against the backdrop of relevant theories and philosophy of drama and aesthetics. Finally, this work took as reference research carried out for the Argentinian National Theater Institute, which compared and contrasted theatrical productions in Córdoba to those in Buenos Aires. It is concluded that some of the problems that appear most strongly in current theatricality are the crisis of the reality-fiction relationship and the question of art about itself. It is noticed that this weak limit between reality and fiction, which is perceived in the social dynamics, is assumed by the theater, which in turn erases its own limits, as a critical view on the web of power that constructs reality. Key words: contemporary theater - Córdoba - University - independent scene.

    Scattering vs. Discrete Cosine Transform Features in Visual Speech Processing

    Appearance-based feature extraction constitutes the dominant approach for visual speech representation in a variety of problems, such as automatic speechreading, visual speech detection, and others. To obtain the necessary visual features, typically a rectangular region-of-interest (ROI) containing the speaker’s mouth is first extracted, followed, most commonly, by a discrete cosine transform (DCT) of the ROI pixel values and a feature selection step. The approach, although algorithmically simple and computationally efficient, suffers from lack of DCT invariance to typical ROI deformations, stemming, primarily, from speaker’s head pose variability and small tracking inaccuracies. To address the problem, in this paper, the recently introduced scattering transform is investigated as an alternative to DCT within the appearance-based framework for ROI representation, suitable for visual speech applications. A number of such tasks are considered, namely, visual-only speech activity detection, visual-only and audio-visual sub-phonetic classification, as well as audio-visual speech synchrony detection, all employing deep neural network classifiers with either DCT or scattering-based visual features. Comparative experiments of the resulting systems are conducted on a large audio-visual corpus of frontal face videos, demonstrating, in all cases, the scattering transform superiority over the DCT. © 2015 Auditory-Visual Speech Processing 2015, AVSP 2015, held in conjunction with Facial Analysis and Animation, FAA 2015 - 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015. All rights reserved
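    The baseline DCT pipeline described above (ROI, then 2-D DCT, then feature selection) can be sketched as follows. This is a pure-Python illustration under stated assumptions: the orthonormal DCT-II is used, and "feature selection" is taken to mean keeping the top-left block of low-frequency coefficients, which is a common choice but is not specified in the abstract. The function names and the `keep` parameter are hypothetical.

```python
import math

def dct2_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct2_2d(roi):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct2_1d(r) for r in roi]
    cols = [dct2_1d(list(c)) for c in zip(*rows)]
    # Transpose back so the result is indexed as coeffs[u][v]
    return [list(r) for r in zip(*cols)]

def low_freq_features(roi, keep=3):
    """Assumed selection step: keep the keep x keep lowest-frequency coefficients."""
    coeffs = dct2_2d(roi)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

# A flat 4x4 mouth ROI carries all of its energy in the DC coefficient
flat_roi = [[2.0] * 4 for _ in range(4)]
features = low_freq_features(flat_roi, keep=2)
```

    Because each coefficient depends on absolute pixel positions inside the ROI, a small shift of the mouth changes the coefficients, which is the lack of invariance the abstract refers to.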

    Detecting audio-visual synchrony using deep neural networks

    In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not. This is of interest in a wide range of applications, for example spoof detection in biometrics, lip-syncing, speaker detection and diarization in multi-subject videos, and video data quality assurance. In our adopted approach, we investigate the use of deep neural networks (DNNs) for this purpose. The proposed synchrony DNNs operate directly on audio and visual features over relatively wide contexts, or, alternatively, on appropriate hidden (bottleneck) or output layers of DNNs trained for single-modal or audio-visual automatic speech recognition. In all cases, the synchrony DNN classes consist of the "in-sync" and a number of "out-of-sync" targets, the latter considered at multiples of ± 30 msec steps of overall asynchrony between the two modalities. We apply the proposed approach on two multi-subject audio-visual databases, one of high-quality data recorded in studio-like conditions, and one of data recorded by smart cell-phone devices. On both sets, and under a speaker-independent experimental framework, we are able to achieve very low equal-error-rates in distinguishing "in-sync" from "out-of-sync" data. Copyright © 2015 ISCA
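    The class inventory described above, one "in-sync" target plus "out-of-sync" targets at multiples of ± 30 msec, can be sketched as a small label table. The number of steps on each side (`max_steps`) and the ordering of the classes are assumptions made for illustration; the abstract does not state them.

```python
STEP_MS = 30  # asynchrony step size stated in the abstract

def synchrony_classes(max_steps):
    """Return class offsets in msec: index 0 is in-sync, the rest are out-of-sync."""
    offsets = [0]
    for k in range(1, max_steps + 1):
        offsets.extend([-k * STEP_MS, k * STEP_MS])
    return offsets

def label_for_offset(offset_ms, max_steps):
    """Map an audio/video offset (a multiple of STEP_MS) to its class index."""
    return synchrony_classes(max_steps).index(offset_ms)

classes = synchrony_classes(2)      # [0, -30, 30, -60, 60]
label = label_for_offset(60, 2)     # 4
```

    Training examples for the out-of-sync classes would then be produced by shifting one modality against the other by the corresponding offset.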

    Rapid feature space speaker adaptation for multi-stream HMM-based audio-visual speech recognition

    Multi-stream hidden Markov models (HMMs) have recently been very successful in audio-visual speech recognition, where the audio and visual streams are fused at the final decision level. In this paper we investigate fast feature space speaker adaptation using multi-stream HMMs for audio-visual speech recognition. In particular, we focus on studying the performance of feature-space maximum likelihood linear regression (fMLLR), a fast and effective method for estimating feature space transforms. Unlike the common speaker adaptation techniques of MAP or MLLR, fMLLR does not change the audio or visual HMM parameters, but simply applies a single transform to the testing features. We also address the problem of fast and robust on-line fMLLR adaptation using feature space maximum a posteriori linear regression (fMAPLR). Adaptation experiments are reported on the IBM infrared headset audio-visual database. On average for a 20-speaker hour independent test set, the multi-stream fMLLR achieves [...] relative gain on the clean audio condition and [...] relative gain on the noisy audio condition (approximately 7 dB) as compared to the baseline multi-stream system.
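    The point that fMLLR leaves the HMM parameters untouched and only transforms the test features can be illustrated with the affine map x' = A x + b applied to every feature vector. The sketch below shows only the application step; estimating A and b by maximum likelihood is not shown, and the function name and toy values are assumptions.

```python
def apply_fmllr(features, A, b):
    """Apply the affine feature-space transform x' = A x + b to each frame.

    features: list of feature vectors; A: square matrix; b: bias vector.
    The acoustic/visual model parameters are not modified.
    """
    return [
        [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(A, b)]
        for x in features
    ]

# Toy 2-dim features with a hypothetical speaker transform
frames = [[1.0, 2.0], [3.0, 4.0]]
A = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
adapted = apply_fmllr(frames, A, b)
```

    In a multi-stream setting one might estimate a separate transform for each stream, but that design choice is an assumption here rather than something stated in the abstract.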