
    Combining Multiple Views for Visual Speech Recognition

    Visual speech recognition is a challenging research problem with a particular practical application of aiding audio speech recognition in noisy scenarios. Multiple camera setups can be beneficial for visual speech recognition systems in terms of improved performance and robustness. In this paper, we explore this aspect and provide a comprehensive study on combining multiple views for visual speech recognition. The thorough analysis covers fusion of all possible view angle combinations at both the feature level and the decision level. The visual speech recognition system employed in this study extracts features through a PCA-based convolutional neural network, followed by an LSTM network. Finally, these features are processed in a tandem system, being fed into a GMM-HMM scheme. The decision fusion acts after this point by combining the Viterbi path log-likelihoods. The results show that the complementary information contained in recordings from different view angles improves the results significantly. For example, sentence correctness on the test set increases from 76% for the highest-performing single view (30°) to up to 83% when combining this view with the frontal and 60° view angles.
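    The decision-level fusion described above can be illustrated with a short sketch: per-view Viterbi path log-likelihoods for each candidate sentence are summed (optionally with view weights) and the highest-scoring hypothesis is selected. The view set, weights, and scores below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical Viterbi path log-likelihoods for three candidate sentences.
# Rows: candidate hypotheses, columns: camera views (frontal, 30 deg, 60 deg).
log_likelihoods = np.array([
    [-310.2, -298.7, -305.1],   # hypothesis 0
    [-305.9, -301.3, -307.8],   # hypothesis 1
    [-320.4, -310.0, -318.6],   # hypothesis 2
])

# Optional per-view weights (uniform here); the weighted sum is the decision-level fusion.
view_weights = np.array([1.0, 1.0, 1.0])
fused_scores = log_likelihoods @ view_weights

best_hypothesis = int(np.argmax(fused_scores))
print("fused scores:", fused_scores)
print("selected hypothesis:", best_hypothesis)
```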

    Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

    Speechreading, or lipreading, is the technique of understanding and extracting phonetic features from a speaker's visual cues, such as the movement of the lips, face, teeth and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aiding people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaker may be available, they have not been used to handle different poses. To this end, this paper presents the world's first multi-view speech reading and reconstruction system. This work extends the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speech reading and reconstruction system. The work further identifies the camera placement that leads to maximum speech intelligibility. Finally, it lays out various applications for the proposed system, focusing on its potential impact not just in the security arena but in many other multimedia analytics problems. Comment: 2018 ACM Multimedia Conference (MM '18), October 22--26, 2018, Seoul, Republic of Korea.

    Face Mask Extraction in Video Sequence

    Inspired by recent developments in deep network-based methods for semantic image segmentation, we introduce an end-to-end trainable model for face mask extraction in video sequences. Compared to landmark-based sparse face shape representation, our method can produce segmentation masks of individual facial components, which better reflect their detailed shape variations. By integrating the Convolutional LSTM (ConvLSTM) algorithm with Fully Convolutional Networks (FCN), our ConvLSTM-FCN model works on a per-sequence basis and takes advantage of the temporal correlation in video clips. In addition, we propose a novel loss function, called Segmentation Loss, to directly optimise Intersection over Union (IoU) performance. In practice, to further increase segmentation accuracy, one primary model and two additional models were trained to focus on the face, eyes, and mouth regions, respectively. Our experiments show that the proposed method achieves a 16.99% relative improvement (from 54.50% to 63.76% mean IoU) over the baseline FCN model on the 300 Videos in the Wild (300VW) dataset.
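    The abstract does not spell out the exact form of the proposed Segmentation Loss; a common differentiable surrogate that directly targets IoU is the soft-IoU loss sketched below. NumPy is used only to show the arithmetic; in training it would be computed on framework tensors so gradients can flow.

```python
import numpy as np

def soft_iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss: 1 - (intersection / union), with probabilities instead of hard masks.

    pred:   predicted mask probabilities in [0, 1], shape (H, W)
    target: ground-truth binary mask, shape (H, W)
    """
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return 1.0 - intersection / (union + eps)

# Toy example: a 4x4 predicted mask vs. its ground truth.
pred = np.array([[0.9, 0.8, 0.1, 0.0],
                 [0.7, 0.9, 0.2, 0.0],
                 [0.1, 0.2, 0.1, 0.0],
                 [0.0, 0.0, 0.0, 0.0]])
target = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]], dtype=float)
print("soft IoU loss:", soft_iou_loss(pred, target))
```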

    Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features

    Extraction of relevant lip features is of continuing interest in the visual speech domain. Using end-to-end feature extraction can produce good results, but at the cost of the results being difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric, glimpse-based psychological research into facial barcodes, and demonstrate that these simple, easy-to-extract 3D geometric features (produced using Gabor-based image patches) can successfully be used for speech recognition with LSTM-based machine learning. This approach can extract low-dimensionality lip parameters with a minimum of processing. One key difference between these Gabor-based features and other features, such as traditional DCT or the currently fashionable CNN features, is that they are human-centric features that can be visualised and analysed by humans. This means that it is easier to explain and visualise the results. They can also be used for reliable speech recognition, as demonstrated on the Grid corpus. Results for overlapping speakers using our lightweight system gave a recognition rate of over 82%, which compares well to less explainable features in the literature.
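    As a rough illustration of Gabor-based patch features (the exact filter parameters and patch geometry used in the paper are not given in the abstract; the values below are assumptions), OpenCV can generate a small bank of Gabor kernels and filter a lip region of interest, reducing each response map to a few statistics per frame:

```python
import cv2
import numpy as np

# Stand-in for a grayscale lip region of interest (e.g. cropped from a Grid corpus frame).
lip_roi = np.random.randint(0, 256, size=(64, 96), dtype=np.uint8)

# Bank of Gabor kernels at a few orientations; sigma/lambda/gamma are illustrative choices.
responses = []
for theta in np.linspace(0, np.pi, 4, endpoint=False):
    kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                lambd=10.0, gamma=0.5, psi=0, ktype=cv2.CV_32F)
    filtered = cv2.filter2D(lip_roi.astype(np.float32), -1, kernel)
    # Reduce each response map to simple low-dimensional statistics.
    responses.extend([filtered.mean(), filtered.std()])

frame_features = np.array(responses)   # low-dimensional feature vector for one frame
print(frame_features.shape)            # (8,): 4 orientations x 2 statistics
```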

    Deep audio-visual speech recognition

    Decades of research in acoustic speech recognition have led to systems that we use in our everyday life. However, even the most advanced speech recognition systems fail in the presence of noise. The degraded performance can be compensated for by introducing visual speech information. However, Visual Speech Recognition (VSR) in naturalistic conditions is very challenging, in part due to the lack of architectures and annotations. This thesis contributes to the problem of Audio-Visual Speech Recognition (AVSR) from several aspects. Firstly, we develop AVSR models for isolated words. In contrast to previous state-of-the-art methods, which consist of a two-step approach of feature extraction and recognition, we present an End-to-End (E2E) approach inside a deep neural network, which has led to significant improvements in audio-only, visual-only and audio-visual experiments. We further replace Bi-directional Gated Recurrent Units (BGRUs) with Temporal Convolutional Networks (TCNs) to greatly simplify the training procedure. Secondly, we extend our AVSR model to continuous speech by presenting a hybrid Connectionist Temporal Classification (CTC)/Attention model that can be trained in an end-to-end manner. We then propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentation. Next, we present a self-supervised framework, Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech, and find that this pre-trained model can be leveraged for word-level and sentence-level lip-reading. We also investigate the influence of the Lombard effect on an end-to-end AVSR system; this is the first work using end-to-end deep architectures and presenting results on unseen speakers. We show that if even a relatively small amount of Lombard speech is added to the training set, performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. Lastly, we propose a detection method against adversarial examples in an AVSR system, in which the strong correlation between the audio and visual streams is leveraged. The synchronisation confidence score serves as a proxy for audio-visual correlation, and based on it we can detect adversarial attacks. We apply recent adversarial attacks to two AVSR models, and the experimental results demonstrate that the proposed approach is an effective way of detecting such attacks.
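    A minimal sketch of the hybrid CTC/attention training objective mentioned above: the two losses are combined with a weighting factor (often written λ). The tensor shapes, vocabulary size, and λ value below are illustrative assumptions, not the thesis's settings.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, ctc_input_lengths,
                              att_logits, targets, target_lengths,
                              blank_id=0, pad_id=-100, lam=0.3):
    """Weighted combination of a CTC loss and an attention-decoder cross-entropy loss.

    ctc_log_probs: (T, N, C) log-probabilities from the CTC branch
    att_logits:    (N, L, C) logits from the attention decoder
    targets:       (N, L) target token ids
    """
    # CTC branch: flatten targets into the concatenated form F.ctc_loss expects.
    flat_targets = torch.cat([targets[i, :int(target_lengths[i])]
                              for i in range(targets.size(0))])
    ctc = F.ctc_loss(ctc_log_probs, flat_targets, ctc_input_lengths, target_lengths,
                     blank=blank_id, zero_infinity=True)
    # Attention branch: token-level cross-entropy, ignoring padding positions.
    att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                          targets.reshape(-1), ignore_index=pad_id)
    return lam * ctc + (1.0 - lam) * att

# Toy shapes: batch of 2, 50 encoder frames, 10 decoder steps, vocabulary of 30 tokens.
T, N, L, C = 50, 2, 10, 30
ctc_log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
att_logits = torch.randn(N, L, C)
targets = torch.randint(1, C, (N, L))
loss = hybrid_ctc_attention_loss(ctc_log_probs, torch.full((N,), T, dtype=torch.long),
                                 att_logits, targets, torch.full((N,), L, dtype=torch.long))
print(loss.item())
```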

    Lectura de labios en imágenes de vídeo (Lip reading in video images)

    During a conversation, our brain combines information from multiple senses in order to improve our ability to interpret the perceived message. In addition, several studies have demonstrated the relationship between facial expressions and their corresponding sounds. This has driven us towards building a system capable of reading lips using only information from the visual channel, that is, capable of mimicking the human ability to interpret speech by reading the interlocutor's lips. To this end, we first constructed a dataset composed of frontal views of four television news anchors, together with the transcriptions associated with each speech. For each speech, the mouth region is located using machine learning libraries such as scikit-learn. After compiling this dataset, the recordings were processed so that they can be interpreted by the system, and feature selection techniques were applied to discard data that does not provide relevant information for speech recognition. Our system is composed of several modules, among which we highlight Continuous Hidden Markov Models, given their great contribution to fields such as speech and handwritten text recognition. These models are trained on a subset of the constructed dataset, and their performance is evaluated on the remaining data. However, the results obtained with the experimental protocol were not acceptable. This demonstrates the difficulty of interpreting continuous speech, all the more so given the challenge posed by the absence of a sense as crucial as hearing. Our system therefore points towards future work on which to focus the rest of our efforts.
    Gimeno Gómez, D. (2019). Lectura de labios en imágenes de vídeo. http://hdl.handle.net/10251/125008
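    A minimal sketch of the HMM side of such a pipeline, simplified to isolated-word classification (one Gaussian HMM per word, decision by maximum log-likelihood). The thesis only names scikit-learn; hmmlearn and SelectKBest here are stand-ins, and all shapes and parameters are illustrative assumptions.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Toy lip-feature sequences: 20 frames x 40 raw features per utterance, two word classes.
def make_sequences(n, offset):
    return [rng.normal(offset, 1.0, size=(20, 40)) for _ in range(n)]

train = {"hola": make_sequences(5, 0.0), "adios": make_sequences(5, 1.0)}

# Frame-level feature selection: keep the 10 most discriminative dimensions.
frames = np.vstack([seq for seqs in train.values() for seq in seqs])
labels = np.concatenate([[w] * (len(seqs) * 20) for w, seqs in train.items()])
selector = SelectKBest(f_classif, k=10).fit(frames, labels)

# One continuous-density HMM per word, trained on the selected features.
models = {}
for word, seqs in train.items():
    X = np.vstack([selector.transform(s) for s in seqs])
    lengths = [len(s) for s in seqs]
    models[word] = GaussianHMM(n_components=3, covariance_type="diag",
                               n_iter=20, random_state=0).fit(X, lengths)

# Recognise a new utterance with the model giving the highest log-likelihood.
test_seq = selector.transform(rng.normal(1.0, 1.0, size=(20, 40)))
scores = {word: m.score(test_seq) for word, m in models.items()}
print(max(scores, key=scores.get), scores)
```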

    Visual speech recognition: from traditional to deep learning frameworks

    Speech is the most natural means of communication for humans. Therefore, since the beginning of computers it has been a goal to interact with machines via speech. While there have been gradual improvements in this field over the decades, and recent drastic progress means more and more commercial software allowing voice commands is available, there are still many ways in which it can be improved. One way to do this is with visual speech information, more specifically, the visible articulations of the mouth. Based on the information contained in these articulations, visual speech recognition (VSR) transcribes an utterance from a video sequence. It thus helps extend speech recognition from audio-only to other scenarios, such as silent or whispered speech (e.g. in cybersecurity), mouthings in sign language, an additional modality in noisy audio scenarios for audio-visual automatic speech recognition, better understanding of speech production and disorders, or on its own for human-machine interaction and transcription. In this thesis, we present and compare different ways to build systems for VSR. We start with the traditional hidden Markov models that have been used in the field for decades, especially in combination with handcrafted features. These are compared to models that take into account recent developments in computer vision and speech recognition through deep learning. While their superior performance is confirmed, certain limitations of these systems with respect to computing power are also discussed. This thesis also addresses multi-view processing and fusion, which is an important topic for many current applications, since a single camera view often cannot provide enough flexibility when speakers move in front of the camera. Technology companies are willing to integrate more cameras into their products, such as cars and mobile devices, due to the lower hardware cost of both cameras and processing units, as well as the availability of higher processing power and high-performance algorithms. Multi-camera and multi-view solutions are thus becoming more common, which means that algorithms can benefit from taking them into account. In this work we propose several methods of fusing the views of multiple cameras to improve the overall results, as sketched below. We show that both relying on deep learning-based approaches for feature extraction and sequence modelling, and taking into account the complementary information contained in several views, improve performance considerably. To further improve the results, it would be necessary to move from data recorded in a lab environment to multi-view data in realistic scenarios. Furthermore, the findings and models could be transferred to other domains, such as audio-visual speech recognition or the study of speech production and disorders.
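    One common way to realise the feature-level fusion of camera views discussed above is frame-wise concatenation of per-view feature vectors before the sequence model; a minimal sketch follows, where all dimensions and view names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-view feature sequences for one utterance: T frames x D features per view.
T, D = 75, 32
views = {
    "frontal": rng.normal(size=(T, D)),
    "30deg":   rng.normal(size=(T, D)),
    "60deg":   rng.normal(size=(T, D)),
}

# Feature-level fusion: frame-wise concatenation across views -> (T, 3*D) input
# for a single sequence model (e.g. an LSTM or HMM back-end).
fused = np.concatenate([views[v] for v in ("frontal", "30deg", "60deg")], axis=1)
print(fused.shape)  # (75, 96)
```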