
    Robust visual speech recognition using optical flow analysis and rotation invariant features

    The focus of this thesis is the development of computer vision algorithms for a visual speech recognition system that identifies visemes. The majority of existing speech recognition systems are based on audio-visual signals, were developed for speech enhancement, and remain prone to acoustic noise. Considering this problem, the aim of this research is to investigate and develop a visual-only speech recognition system suitable for noisy environments. Potential applications of such a system include lip-reading mobile phones, human-computer interfaces (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech-based computer control in noisy environments, and the rehabilitation of people who have undergone laryngectomy surgery. In the literature, several models and algorithms are available for visual feature extraction. These features are extracted from static mouth images and characterized as appearance- and shape-based features. However, such methods rarely incorporate the time-dependent information of mouth dynamics. This dissertation presents two optical flow based approaches to visual feature extraction, which capture the mouth motions in an image sequence. The motivation for using motion features is that human lip-reading perception is concerned with the temporal dynamics of mouth motion. The first approach extracts features from the vertical component of the optical flow. This component is decomposed into multiple non-overlapping fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. To overcome the large variation in speaking rate, each utterance is normalized using a simple linear interpolation method. In the second approach, four directional motion templates based on optical flow are developed, each representing the consolidated motion information of an utterance in one of four directions (i.e., up, down, left and right). This approach is an evolution of a view-based approach known as the motion history image (MHI). One of the main issues with the MHI method is its motion overwriting problem caused by self-occlusion; the directional motion history images (DMHIs) appear to solve this overwriting issue. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs. A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component, and the Zernike and Hu moments, separately. For identification of visemes, a multiclass SVM approach was employed. A video speech corpus of seven subjects was used to evaluate the efficiency of the proposed lip-reading methods. The experimental results demonstrate the promising performance of the optical flow based mouth movement representations. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique. A video-based ad hoc temporal segmentation method for isolated utterances is also proposed in the thesis. It is used to detect the start and end frames of an utterance in an image sequence, based on a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available data set, which contains short pauses between utterances.
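    As a rough illustration of the first approach, the following sketch computes per-block statistics of the optical flow vertical component between two successive grayscale mouth frames. OpenCV's Farnebäck estimator, the 16-pixel block size and the choice of mean and variance as the block statistics are illustrative assumptions, not the thesis's exact configuration.

```python
# A minimal sketch: per-block statistics of the optical flow vertical
# component between successive mouth frames (assumed setup, see above).
import cv2
import numpy as np

def vertical_flow_block_features(prev_gray, curr_gray, block=16):
    """Mean and variance of the vertical flow component per block."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    v = flow[..., 1]                             # vertical component
    h, w = v.shape
    feats = []
    for y in range(0, h - h % block, block):     # non-overlapping blocks
        for x in range(0, w - w % block, block):
            patch = v[y:y + block, x:x + block]
            feats.extend([patch.mean(), patch.var()])
    return np.asarray(feats)
```

    Per-frame feature vectors for a whole utterance would then be length-normalized, for example by linear interpolation, before being passed to the SVM classifier.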

    Change blindness: eradication of gestalt strategies

    Arrays of eight texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task in which there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval, and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al., 2003, Vision Research 43, 149–164). Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this, we changed the spatial position of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference in performance between this and the standard task [F(1,4) = 2.565, p = 0.185]. This may suggest two things: (i) Gestalt grouping is not used as a strategy in these tasks, and (ii) it gives further weight to the argument that objects may be stored in and retrieved from a pre-attentional store during this task.

    Audio-coupled video content understanding of unconstrained video sequences

    Unconstrained video understanding is a difficult task. The main aim of this thesis is to recognise the nature of objects, activities and environment in a given video clip using both audio and video information. Traditionally, audio and video information has not been applied together to solve such a complex task, and for the first time we propose, develop, implement and test a new framework of multi-modal (audio and video) data analysis for context understanding and labelling of unconstrained videos. The framework relies on feature selection techniques and introduces a novel algorithm (PCFS) that is faster than the well-established SFFS algorithm. We use the framework to study the benefits of combining audio and video information in a number of different problems. We begin by developing two independent content recognition modules. The first is based on image sequence analysis alone, and uses a range of colour, shape, texture and statistical features from image regions with a trained classifier to recognise the identity of the objects, activities and environment present. The second module uses audio information only, and recognises activities and environment. Both approaches are preceded by detailed pre-processing to ensure that correct video segments containing both audio and video content are present, and that the developed system is robust to changes in camera movement, illumination, random object behaviour, etc. For both audio and video analysis, we use a hierarchical approach of multi-stage classification so that difficult classification tasks can be decomposed into simpler, smaller ones. When combining the two modalities, we compare fusion techniques at different levels of integration and propose a novel algorithm that combines the advantages of both feature- and decision-level fusion. The analysis is evaluated on a large amount of test data comprising unconstrained videos collected for this work. Finally, we propose a decision correction algorithm which shows that further steps towards effectively combining multi-modal classification information with semantic knowledge generate the best possible results.
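    For context, a plain decision-level fusion baseline of the kind such frameworks are compared against might look like the sketch below: each modality's classifier outputs class posteriors, combined by a weighted sum. The weight `alpha` and the assumption of scikit-learn-style classifiers exposing `predict_proba` are illustrative; the thesis's own hybrid feature/decision-level algorithm and PCFS are not reproduced here.

```python
# A hedged sketch of decision-level fusion of two modality classifiers.
import numpy as np

def fuse_decisions(audio_clf, video_clf, X_audio, X_video, alpha=0.5):
    """Weighted-sum fusion of per-class posterior probabilities."""
    p_audio = audio_clf.predict_proba(X_audio)   # shape (n_samples, n_classes)
    p_video = video_clf.predict_proba(X_video)
    fused = alpha * p_audio + (1.0 - alpha) * p_video
    return fused.argmax(axis=1)                  # fused class decisions
```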

    Skin texture features for face recognition

    Face recognition has been deployed in a wide range of important applications, including surveillance and forensic identification. However, it remains a challenging problem, as its performance degrades severely under illumination, pose and expression variations, as well as with occlusions and aging. In this thesis, we have investigated the use of local facial skin data as a source of biometric information to improve human recognition. Skin texture features are exploited in three major tasks: (i) improving the performance of conventional face recognition systems, (ii) building an adaptive skin-based face recognition system, and (iii) dealing with circumstances in which a full view of the face may not be available. Additionally, a fully automated scheme is presented for localizing the eyes and mouth and segmenting four facial regions: forehead, right cheek, left cheek and chin. These four regions are divided into non-overlapping patches of equal size. A novel skin/non-skin classifier is proposed for detecting patches containing only skin texture, and thereby detecting the pure-skin regions. Experiments using the XM2VTS database indicate that the forehead region carries the most significant biometric information. The use of forehead texture features improves the rank-1 identification rate of the Eigenfaces system from 77.63% to 84.07%, and reaches 93.56% when this region is fused with the Kernel Direct Discriminant Analysis algorithm.
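    The patch-division step is simple enough to sketch: a localized facial region (e.g. the cropped forehead) is tiled into non-overlapping, equal-size patches, each of which would then be screened by the skin/non-skin classifier. The 32-pixel patch size is an illustrative assumption.

```python
# A minimal sketch of splitting a facial region into equal-size patches.
def split_into_patches(region, patch=32):
    """Yield non-overlapping, equal-size patches of a 2-D image array
    (e.g. a NumPy array holding the cropped forehead region)."""
    h, w = region.shape[:2]
    for y in range(0, h - h % patch, patch):     # drop any ragged border
        for x in range(0, w - w % patch, patch):
            yield region[y:y + patch, x:x + patch]
```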

    Enhancing person annotation for personal photo management using content and context based technologies

    Rapid technological growth and the decreasing cost of photo capture mean that we are all taking more digital photographs than ever before. However, the lack of technology for automatically organising personal photo archives has left many users with poorly annotated photos, causing great frustration when such photo collections are browsed or searched at a later time. As a result, there has recently been significant research interest in technologies for supporting effective annotation. This thesis addresses an important sub-problem of the broad annotation problem, namely "person annotation" associated with personal digital photo management. Solutions to this problem are provided using content analysis tools in combination with context data within an experimental photo management framework called "MediAssist". Readily available image metadata, such as location and date/time, are captured from digital cameras with in-built GPS functionality, and thus provide knowledge about when and where the photos were taken. This information is then used to identify the "real-world" events corresponding to certain activities in the photo capture process. The problem of enabling effective person annotation is formulated in such a way that both "within-event" and "cross-event" relationships of persons' appearances are captured. The research reported in the thesis is built upon a firm foundation of content-based analysis technologies, namely face detection, face recognition and body-patch matching, together with data fusion. Two annotation models are investigated in this thesis: progressive and non-progressive. The effectiveness of each model is evaluated against varying proportions of initial annotation, and against the type of initial annotation based on individual and combined face, body-patch and person-context information sources. The results reported in the thesis strongly validate the use of multiple information sources for person annotation, whilst emphasising the advantage of event-based photo analysis in real-life photo management systems.
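    One common way to derive "real-world" events from capture metadata is to split a time-sorted photo stream wherever the gap between consecutive timestamps exceeds a threshold, as sketched below. The three-hour threshold and this gap-based rule are illustrative assumptions, not necessarily the segmentation used in MediAssist.

```python
# A hedged sketch of grouping photos into events from capture timestamps.
from datetime import timedelta

def group_into_events(photos, gap=timedelta(hours=3)):
    """Group a non-empty, time-sorted list of (photo_id, datetime) pairs
    into events, splitting where the capture-time gap exceeds `gap`."""
    events, current = [], [photos[0]]
    for prev, curr in zip(photos, photos[1:]):
        if curr[1] - prev[1] > gap:      # large gap: close current event
            events.append(current)
            current = []
        current.append(curr)
    events.append(current)
    return events
```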

    A motion-based approach for audio-visual automatic speech recognition

    The research presented in this thesis introduces novel approaches to both visual region-of-interest extraction and visual feature extraction for use in audio-visual automatic speech recognition. In particular, the speaker's movement during speech is used to isolate the mouth region in video sequences, and motion-based features obtained from this region provide new visual features for audio-visual automatic speech recognition. The mouth region extraction approach proposed in this work is shown to give superior performance compared with existing colour-based lip segmentation methods. The new features are obtained from three separate representations of motion in the region of interest, namely the difference in luminance between successive images, block-matching-based motion vectors, and optical flow. The new visual features are found to improve visual-only and audio-visual speech recognition performance compared with commonly used appearance-feature-based methods. In addition, a novel approach is proposed for visual feature extraction from either the discrete cosine transform or discrete wavelet transform representations of the speaker's mouth region. In this work, the image transform is explored from a new viewpoint of data discrimination, in contrast to the more conventional data preservation viewpoint. The main finding is that audio-visual automatic speech recognition systems using the new features, extracted from frequency bands selected according to their discriminatory abilities, generally outperform those using features designed for data preservation. To establish the noise robustness of the new features, their performance has been studied in the presence of a range of different types of noise and at various signal-to-noise ratios. In these experiments, the audio-visual automatic speech recognition systems based on the new approaches were found to give superior performance both to audio-visual systems using appearance-based features and to audio-only speech recognition systems.
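    The data-discrimination viewpoint can be illustrated as follows: DCT coefficients of the mouth region are ranked on training data by a per-coefficient discriminability score, and only the top-scoring coefficients are kept. The Fisher ratio used here is an illustrative stand-in for the thesis's actual selection measure.

```python
# A hedged sketch of discriminability-based DCT coefficient selection.
import numpy as np
from scipy.fftpack import dct

def fisher_ratio(X, y):
    """Between-class over within-class variance, per DCT coefficient."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    between = sum((X[y == c].mean(axis=0) - mu) ** 2 for c in classes)
    within = sum(X[y == c].var(axis=0) for c in classes) + 1e-9
    return between / within

def dct_coefficients(gray):
    """Flattened 2-D DCT of a grayscale mouth-region image."""
    return dct(dct(gray, axis=0, norm='ortho'), axis=1, norm='ortho').ravel()

# Usage sketch: rank coefficients on training data, keep the top k.
# X_train = np.stack([dct_coefficients(im) for im in mouth_images])
# keep = np.argsort(fisher_ratio(X_train, labels))[::-1][:k]
# features = X_train[:, keep]
```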

    Datasets, Clues and State-of-the-Arts for Multimedia Forensics: An Extensive Review

    With the large volumes of social media data created daily and the parallel rise of realistic multimedia tampering methods, detecting and localising tampering in images and videos has become essential. This survey focuses on approaches for tampering detection in multimedia data using deep learning models. Specifically, it presents a detailed analysis of the publicly available benchmark datasets for malicious manipulation detection. It also offers a comprehensive list of tampering clues and commonly used deep learning architectures. Next, it discusses the current state-of-the-art tampering detection methods, categorizing them into meaningful types, such as deepfake detection, splice tampering detection and copy-move tampering detection, and discussing their strengths and weaknesses. Top results achieved on benchmark datasets, comparisons of deep learning approaches against traditional methods, and critical insights from recent tampering detection methods are also discussed. Lastly, the research gaps, future directions and conclusions are presented to provide an in-depth understanding of the tampering detection research arena.

    Advancing the diagnosis of dry eye syndrome: development of automated assessments of tear film lipid layer patterns

    Dry eye syndrome is a symptomatic disease which affects a wide range of the population and has a negative impact on their daily activities. Its diagnosis is a difficult task due to its multifactorial etiology, and so several clinical tests exist. One of these tests is the evaluation of the interference patterns of the tear film lipid layer. Guillon designed an instrument known as the Tearscope Plus, which allows clinicians to rapidly assess the lipid layer thickness, and also defined a grading scale composed of five categories. Classification into these five patterns is a difficult clinical task, especially with thinner lipid layers that lack color and/or morphological features. Furthermore, the subjective interpretation of experts via visual inspection may affect the classification, so a high degree of inter- and intra-observer variability can be produced. The development of a systematic, objective computerized method for analysis and classification is thus highly desirable, allowing for homogeneous diagnosis and relieving the experts from this tedious task. The proposal of this research is the design of an automatic system to assess the tear film lipid layer patterns through the interpretation of the images acquired with the Tearscope Plus. On the one hand, a global methodology is presented to assess the tear film lipid layer by automatically classifying these images into the Guillon categories. The process is carried out using texture and color models, and machine learning algorithms. This global methodology is then optimized by reducing its computational complexity: dimensionality reduction techniques are used to diminish the memory/time requirements with no degradation in performance. On the other hand, a local methodology is also presented to create tear film maps, which represent the local distribution of the lipid layer patterns over the tear film. The proposed automated assessments save time for experts and provide unbiased results which are not affected by subjective factors.
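    A minimal sketch of the global methodology might pair a texture descriptor with a trained classifier, as below. Local binary patterns and an RBF-kernel SVM are illustrative stand-ins for the concrete texture/color models and learning algorithms compared in the thesis.

```python
# A hedged sketch: texture-based classification of Tearscope images into
# one of Guillon's five lipid layer categories (assumed descriptor/classifier).
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray, points=8, radius=1):
    """Uniform LBP histogram as a compact texture descriptor."""
    lbp = local_binary_pattern(gray, points, radius, method='uniform')
    hist, _ = np.histogram(lbp, bins=points + 2, range=(0, points + 2),
                           density=True)
    return hist

def train_tear_film_classifier(images, labels):
    """Fit an SVM on LBP histograms of labelled Tearscope images
    (labels: one of Guillon's five categories)."""
    X = np.stack([lbp_histogram(im) for im in images])
    return SVC(kernel='rbf').fit(X, labels)
```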