Robust visual speech recognition using optical flow analysis and rotation invariant features
The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system that identifies visemes. The majority of existing speech recognition systems are based on audio-visual signals, have been developed for speech enhancement, and are prone to acoustic noise. Considering this problem, the aim of this research is to investigate and develop a visual-only speech recognition system suitable for noisy environments. Potential applications of such a system include lip-reading mobile phones, human-computer interfaces (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech-based computer control in noisy environments, and rehabilitation of persons who have undergone laryngectomy surgery. In the literature, several models and algorithms are available for visual feature extraction. These features are extracted from static mouth images and characterized as appearance- and shape-based features. However, these methods rarely incorporate the time-dependent information of mouth dynamics. This dissertation presents two optical flow based approaches to visual feature extraction, which capture the mouth motions in an image sequence. The motivation for using motion features is that human perception of lip-reading relies on the temporal dynamics of mouth motion. The first approach extracts features from the vertical component of the optical flow. This component is decomposed into multiple non-overlapping fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. To overcome the large variation in speaking rate, each utterance is normalized using a simple linear interpolation method.
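The first approach described above (per-block statistics of the vertical flow component, plus linear-interpolation length normalization) can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the function names and the 16-pixel block size are assumptions.

```python
import numpy as np

def block_statistics(flow_v, block=16):
    """Split the vertical optical-flow component into non-overlapping
    fixed-size blocks and compute per-block mean and variance.
    The block size of 16 is an illustrative choice."""
    h, w = flow_v.shape
    feats = []
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            b = flow_v[y:y + block, x:x + block]
            feats.extend([b.mean(), b.var()])
    return np.array(feats)

def normalize_length(frames, target_len):
    """Resample a per-frame feature sequence to a fixed number of frames
    with simple linear interpolation, compensating for variation in
    speaking rate across utterances."""
    frames = np.asarray(frames)
    src = np.linspace(0.0, 1.0, len(frames))
    dst = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(dst, src, frames[:, i])
                     for i in range(frames.shape[1])], axis=1)
```

A per-utterance feature matrix is then a stack of `block_statistics` outputs, resampled to a common length before classification.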
In the second approach, four directional motion templates based on optical flow are developed, each representing the consolidated motion information of an utterance in one of four directions (i.e., up, down, left and right). This approach is an evolution of a view-based approach known as the motion history image (MHI). One of the main issues with the MHI method is its motion overwriting problem caused by self-occlusion; the directional motion history images (DMHIs) resolve this overwriting issue. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each of the DMHIs. A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component, and from the Zernike and Hu moments, separately. For identification of visemes, a multiclass SVM approach was employed. A video speech corpus of seven subjects was used to evaluate the efficiency of the proposed lip-reading methods. The experimental results demonstrate the promising performance of the optical flow based mouth movement representations. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique. A video-based ad hoc temporal segmentation method for isolated utterances is also proposed in the thesis. It detects the start and end frames of an utterance in an image sequence using a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available data set, which contains short pauses between utterances.
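Of the two descriptors used on the DMHIs, the Hu moments admit a compact pure-NumPy sketch. The formulas below are the standard seven Hu invariants; applying them to a motion-template image is an illustration, not the thesis's exact pipeline.

```python
import numpy as np

def hu_moments(img):
    """Seven Hu invariant moments of a grayscale image (e.g. a motion
    history template). Invariant to translation, scale and rotation."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)

    def raw(p, q):  # raw image moment m_pq
        return ((x ** p) * (y ** q) * img).sum()

    m00 = raw(0, 0)
    xc, yc = raw(1, 0) / m00, raw(0, 1) / m00

    def eta(p, q):  # scale-normalized central moment
        mu = (((x - xc) ** p) * ((y - yc) ** q) * img).sum()
        return mu / m00 ** (1 + (p + q) / 2.0)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2,
        (n30 + n12) ** 2 + (n21 + n03) ** 2,
        (n30 - 3 * n12) * (n30 + n12)
        * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
        + (3 * n21 - n03) * (n21 + n03)
        * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2),
        (n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
        + 4 * n11 * (n30 + n12) * (n21 + n03),
        (3 * n21 - n03) * (n30 + n12)
        * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
        - (n30 - 3 * n12) * (n21 + n03)
        * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2),
    ])
```

The invariance properties are what make these moments useful as rotation-invariant descriptors of the consolidated motion images.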
Change blindness: eradication of gestalt strategies
Arrays of eight texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task in which there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval, and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al., 2003, Vision Research, 43, 149–164). Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this, we changed the spatial positions of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference in performance between this and the standard task [F(1,4)=2.565, p=0.185]. This may suggest two things: (i) Gestalt grouping is not used as a strategy in these tasks, and (ii) it gives further weight to the argument that objects may be stored in and retrieved from a pre-attentional store during this task.
Audio-coupled video content understanding of unconstrained video sequences
Unconstrained video understanding is a difficult task. The main aim of this thesis is to
recognise the nature of objects, activities and environment in a given video clip using
both audio and video information. Traditionally, audio and video information has not
been applied together for solving such a complex task, and for the first time we propose,
develop, implement and test a new framework of multi-modal (audio and video) data
analysis for context understanding and labelling of unconstrained videos.
The framework relies on feature selection techniques and introduces a novel algorithm
(PCFS) that is faster than the well-established SFFS algorithm. We use the framework for
studying the benefits of combining audio and video information in a number of different
problems. We begin by developing two independent content recognition modules. The
first one is based on image sequence analysis alone, and uses a range of colour, shape,
texture and statistical features from image regions with a trained classifier to recognise
the identity of objects, activities and environment present. The second module uses audio
information only, and recognises activities and environment. Both of these approaches
are preceded by detailed pre-processing to ensure that correct video segments containing
both audio and video content are present, and that the developed system can be made
robust to changes in camera movement, illumination, random object behaviour etc. For
both audio and video analysis, we use a hierarchical approach of multi-stage
classification such that difficult classification tasks can be decomposed into simpler and
smaller tasks.
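The SFFS baseline mentioned above is the well-known floating-search procedure of Pudil et al.: greedy forward inclusion followed by conditional backward exclusion steps. A minimal sketch follows; the scoring function is left abstract, and the thesis's own PCFS algorithm is deliberately not reproduced here.

```python
def sffs(features, score, k):
    """Sequential Forward Floating Selection: add the best feature, then
    conditionally drop features while doing so improves the criterion."""
    selected = []
    while len(selected) < k:
        # Forward step: add the candidate that maximizes the criterion.
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
        # Floating backward steps: remove a feature if that strictly helps.
        improved = True
        while improved and len(selected) > 2:
            improved = False
            for f in list(selected):
                reduced = [g for g in selected if g != f]
                if score(reduced) > score(selected):
                    selected = reduced
                    improved = True
                    break
    return selected
```

The backward "floating" phase is what lets SFFS escape the nesting problem of plain sequential forward selection, at the cost of many more criterion evaluations; that cost is the motivation for faster alternatives.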
When combining both modalities, we compare fusion techniques at different levels of
integration and propose a novel algorithm that combines advantages of both feature and
decision-level fusion. The analysis is evaluated on a large amount of test data comprising
unconstrained videos collected for this work. Finally, we propose a decision correction
algorithm which shows that further steps towards effectively combining multi-modal
classification information with semantic knowledge generate the best possible results.
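The two standard levels of integration being compared can be sketched as follows. This is a generic illustration of feature-level versus decision-level fusion, with assumed function names; the thesis's novel hybrid algorithm is not reproduced.

```python
import numpy as np

def feature_level_fusion(audio_feats, video_feats):
    """Early fusion: concatenate the modality feature vectors so that a
    single classifier sees the joint representation."""
    return np.concatenate([audio_feats, video_feats])

def decision_level_fusion(audio_post, video_post, w=0.5):
    """Late fusion: combine per-modality class posteriors with a weighted
    sum and pick the winning class. `w` weights the audio modality."""
    fused = w * np.asarray(audio_post) + (1 - w) * np.asarray(video_post)
    return int(np.argmax(fused)), fused
```

Early fusion lets the classifier model cross-modal correlations but inflates dimensionality; late fusion keeps the modalities' classifiers independent, which is why hybrids of the two are attractive.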
Skin texture features for face recognition
Face recognition has been deployed in a wide range of important applications, including surveillance and forensic identification. However, it remains a challenging problem, as its performance degrades severely under illumination, pose and expression variations, as well as with occlusions and aging. In this thesis, we have investigated the use of local facial skin data as a source of biometric information to improve human recognition. Skin texture features have been exploited in three major tasks: (i) improving the performance of conventional face recognition systems, (ii) building an adaptive skin-based face recognition system, and (iii) dealing with circumstances in which a full view of the face may not be available. Additionally, a fully automated scheme is presented for localizing the eyes and mouth and segmenting four facial regions: forehead, right cheek, left cheek and chin. These four regions are divided into non-overlapping patches of equal size. A novel skin/non-skin classifier is proposed for detecting patches containing only skin texture and thereby identifying the pure-skin regions. Experiments using the XM2VTS database indicate that the forehead region carries the most significant biometric information. The use of forehead texture features improves the rank-1 identification rate of the Eigenfaces system from 77.63% to 84.07%, and reaches 93.56% when this region is fused with the Kernel Direct Discriminant Analysis algorithm.
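The rank-1 identification rates quoted above are computed in the standard way: the fraction of probe images whose nearest gallery template belongs to the correct identity. A minimal sketch (function and argument names are assumptions):

```python
import numpy as np

def rank1_rate(distances, probe_ids, gallery_ids):
    """Rank-1 identification rate from a probe-by-gallery distance
    matrix: the fraction of probes whose closest gallery entry has
    the correct identity."""
    nearest = np.argmin(distances, axis=1)
    return float(np.mean(
        np.asarray(gallery_ids)[nearest] == np.asarray(probe_ids)))
```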
Enhancing person annotation for personal photo management using content and context based technologies
Rapid technological growth and the decreasing cost of photo capture mean that we are all taking more digital photographs than ever before. However, the lack of technology for automatically organising personal photo archives has left many users with poorly annotated photos, causing great frustration when such photo collections are browsed or searched at a later time. As a result, there has recently been significant research interest in technologies for supporting effective annotation.
This thesis addresses an important sub-problem of the broad annotation problem, namely "person annotation" associated with personal digital photo management. Solutions to this problem are provided using content analysis tools in combination with context data within the experimental photo management framework, called “MediAssist”. Readily available image metadata, such as location and date/time, are captured from digital cameras with in-built GPS functionality, and thus provide knowledge about when and where the photos were taken. Such information is then used to identify the "real-world" events corresponding to certain activities in the photo capture process. The
problem of enabling effective person annotation is formulated in such a way that both "within-event" and "cross-event" relationships of persons' appearances are captured.
The research reported in the thesis is built upon a firm foundation of content-based analysis technologies, namely face detection, face recognition, and body-patch matching together with data fusion.
Two annotation models are investigated in this thesis, namely progressive and non-progressive. The effectiveness of each model is evaluated against varying proportions of initial annotation, and against the type of initial annotation, based on individual and combined face, body-patch and person-context information sources. The results reported in the thesis strongly validate the use of multiple information sources for person annotation, whilst emphasising the advantage of event-based photo analysis in real-life photo management systems.
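Identifying "real-world" events from capture-time metadata is often done with a simple time-gap clustering over the sorted timestamps. The sketch below illustrates that idea only; the 3-hour gap threshold and function name are assumptions, not MediAssist's actual method.

```python
from datetime import datetime, timedelta

def segment_events(timestamps, gap=timedelta(hours=3)):
    """Group photo capture times into events: a new event starts
    whenever the gap between consecutive photos exceeds `gap`."""
    events, current = [], []
    for t in sorted(timestamps):
        if current and t - current[-1] > gap:
            events.append(current)
            current = []
        current.append(t)
    if current:
        events.append(current)
    return events
```

Within-event person relationships (the same people tend to recur within one event) can then be exploited separately from cross-event ones.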
A motion-based approach for audio-visual automatic speech recognition
The research work presented in this thesis introduces novel approaches for both visual region of interest extraction and visual feature extraction for use in audio-visual automatic speech recognition. In particular, the speaker's movement that occurs during speech is used to isolate the mouth region in video sequences, and motion-based features obtained from this region are used to provide new visual features for audio-visual automatic speech recognition. The mouth region extraction approach proposed in this work is shown to give superior performance compared with existing colour-based lip segmentation methods. The new features are obtained from three separate representations of motion in the region of interest, namely the difference in luminance between successive images, block matching based motion vectors, and optical flow. The new visual features are found to improve visual-only and audio-visual speech recognition performance when compared with the commonly used appearance feature-based methods.
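The first two of the three motion representations named above have very simple cores. The sketch below shows frame-to-frame luminance differencing and an exhaustive-search block matcher; block and search-window sizes are illustrative assumptions.

```python
import numpy as np

def luminance_difference(frames):
    """Absolute luminance difference between successive frames: the
    simplest of the three motion representations."""
    frames = np.asarray(frames, dtype=float)
    return np.abs(np.diff(frames, axis=0))

def block_match(prev, curr, y, x, block=8, search=4):
    """Exhaustive block matching: find the displacement (dy, dx) within
    a search window that best matches the block at (y, x) of `prev`
    against `curr`, by minimum sum of absolute differences."""
    ref = prev[y:y + block, x:x + block]
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if (yy < 0 or xx < 0 or
                    yy + block > curr.shape[0] or xx + block > curr.shape[1]):
                continue
            err = np.abs(curr[yy:yy + block, xx:xx + block] - ref).sum()
            if err < best_err:
                best, best_err = (dy, dx), err
    return best
```

Dense optical flow, the third representation, would typically come from a library routine rather than a few lines of code.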
In addition, a novel approach is proposed for visual feature extraction from either the
discrete cosine transform or discrete wavelet transform representations of the mouth
region of the speaker. In this work, the image transform is explored from a new
viewpoint of data discrimination; in contrast to the more conventional data
preservation viewpoint. The main findings of this work are that audio-visual
automatic speech recognition systems using the new features extracted from the
frequency bands selected according to their discriminatory abilities generally
outperform those using features designed for data preservation.
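The data-discrimination viewpoint can be illustrated with a 2-D DCT followed by a Fisher-style ranking of coefficients by their between-class versus within-class spread. This is a generic sketch under those assumptions, not the thesis's exact criterion.

```python
import numpy as np

def dct2(img):
    """Separable 2-D DCT-II built from an orthonormal DCT matrix."""
    img = np.asarray(img, dtype=float)

    def dct_matrix(n):
        k = np.arange(n)[:, None]
        c = np.cos(np.pi * (2 * np.arange(n)[None, :] + 1) * k / (2 * n))
        c *= np.sqrt(2.0 / n)
        c[0] /= np.sqrt(2.0)
        return c

    ch, cw = dct_matrix(img.shape[0]), dct_matrix(img.shape[1])
    return ch @ img @ cw.T

def select_discriminative_bands(coeff_stack, labels, k):
    """Rank each transform coefficient by a Fisher-style between/within
    class score and keep the k most discriminative ones: selection for
    discrimination rather than for data preservation."""
    coeffs = coeff_stack.reshape(len(coeff_stack), -1)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    overall = coeffs.mean(axis=0)
    between = sum(((coeffs[labels == c].mean(axis=0) - overall) ** 2)
                  for c in classes)
    within = sum(coeffs[labels == c].var(axis=0) for c in classes) + 1e-12
    return np.argsort(between / within)[::-1][:k]
```

A data-preservation scheme would instead keep the largest-energy (low-frequency) coefficients regardless of class labels.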
To establish the noise robustness of the new features proposed in this work, their performance has been studied in the presence of a range of different types of noise and at various signal-to-noise ratios. In these experiments, the audio-visual automatic speech recognition systems based on the new approaches were found to give superior performance both to audio-visual systems using appearance-based features and to audio-only speech recognition systems.
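Testing at a prescribed signal-to-noise ratio means scaling the noise so the mixture hits the target SNR before mixing. A minimal sketch (function name assumed):

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture signal + noise has the requested
    SNR in dB, then return the mixture."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(signal)]
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power is p_sig / 10^(SNR/10).
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise
```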
Datasets, Clues and State-of-the-Arts for Multimedia Forensics: An Extensive Review
With the large chunks of social media data being created daily and the
parallel rise of realistic multimedia tampering methods, detecting and
localising tampering in images and videos has become essential. This survey
focusses on approaches for tampering detection in multimedia data using deep
learning models. Specifically, it presents a detailed analysis of benchmark
datasets for malicious manipulation detection that are publicly available. It
also offers a comprehensive list of tampering clues and commonly used deep
learning architectures. Next, it discusses the current state-of-the-art
tampering detection methods, categorizing them into meaningful types such as
deepfake detection methods, splice tampering detection methods, copy-move
tampering detection methods, etc. and discussing their strengths and
weaknesses. Top results achieved on benchmark datasets, comparison of deep
learning approaches against traditional methods and critical insights from the
recent tampering detection methods are also discussed. Lastly, research
gaps, future directions and conclusions are presented to provide an in-depth
understanding of the tampering detection research arena.
Advancing the diagnosis of dry eye syndrome: development of automated assessments of tear film lipid layer patterns
Dry eye syndrome is a symptomatic disease which affects a wide range of the population,
and has a negative impact on their daily activities. Its diagnosis is a difficult task
due to its multifactorial etiology, and so there exist several clinical tests. One of
these tests is the evaluation of the interference patterns of the tear film lipid layer.
Guillon designed an instrument known as Tearscope Plus which allows clinicians to
rapidly assess the lipid layer thickness, and also defined a grading scale composed
of five categories. Classification into one of these five patterns is a difficult clinical
task, especially with thinner lipid layers which lack color and/or morphological features.
Furthermore, the subjective interpretation of the experts via visual inspection
may affect the classification, and so a high degree of inter- and also intra- observer
variability can be produced. The development of a systematic, objective computerized
method for analysis and classification is thus highly desirable, allowing for
homogeneous diagnosis and relieving the experts from this tedious task.
The proposal of this research is the design of an automatic system to assess
the tear film lipid layer patterns through the interpretation of the images acquired
with the Tearscope Plus. On the one hand, a global methodology is presented to
assess the tear film lipid layer by automatically classifying these images into the
Guillon categories. The process is carried out using texture and color models, and
machine learning algorithms. Then, this global methodology is optimized through
the reduction of its computational complexity. Dimensionality reduction techniques
are used in order to diminish the memory/time requirements with no degradation
in performance. On the other hand, a local methodology is also presented to create
tear film maps, which represent the local distribution of the lipid layer patterns over
the tear film. The different automated assessments proposed save time for experts,
and provide unbiased results which are not affected by subjective factors.
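The dimensionality reduction step that trims memory and time requirements can be illustrated with a plain PCA projection via the SVD. This is a generic sketch, not necessarily the technique chosen in the thesis.

```python
import numpy as np

def pca_reduce(X, k):
    """Project row-wise feature vectors onto the top-k principal
    components, shrinking the representation before classification."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T
```

The goal stated above is precisely this trade-off: fewer dimensions, hence lower memory/time cost, with no degradation in classification performance.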