Towards a Self-Sufficient Face Verification System
Funded for open-access publication: Universidade da Coruña/CISUG

[Abstract] The absence of a previous collaborative manual enrolment represents a significant handicap when designing a face verification system for face re-identification purposes. In this scenario, the system must learn the target identity incrementally, using data from the video stream during the operational authentication phase, so manual labelling cannot be assumed beyond the first few frames. On the other hand, even the most advanced methods trained on large-scale, unconstrained datasets suffer performance degradation when no adaptation to the specific context is performed. This work proposes an adaptive face verification system for the continuous re-identification of a target identity, within the framework of incremental unsupervised learning. Our Dynamic Ensemble of SVM is capable of incorporating non-labelled information to improve the performance of any model, even when its initial performance is modest. The proposal uses the self-training approach and is compared against other classification techniques within this same approach. Results show promising behaviour in terms of both knowledge acquisition and impostor robustness.

This work has received financial support from the Spanish government (project TIN2017-90135-R MINECO (FEDER)); from the Consellería de Cultura, Educación e Ordenación Universitaria (accreditations 2016–2019, EDG431G/01 and ED431G/08, and reference competitive groups 2017–2020, ED431C 2017/04); and from the European Regional Development Fund (ERDF). Eric López-López has received financial support from the Xunta de Galicia and the European Union (European Social Fund, ESF).
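The self-training strategy at the heart of this proposal can be sketched as follows. This is a minimal illustration, not the authors' actual De-SVM implementation: the 2-D Gaussian data, the single SVM (rather than an ensemble), and the 0.8 confidence threshold are all assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Labelled seed data, standing in for the first few manually labelled frames.
X_seed = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_seed = np.array([0] * 20 + [1] * 20)

# Unlabelled faces arriving from the video stream during operation.
X_stream = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

clf = SVC(probability=True).fit(X_seed, y_seed)

def self_training_round(clf, X_l, y_l, X_u, threshold=0.8):
    """One self-training update: the model's own confident predictions on
    unlabelled data become pseudo-labels, and the model is retrained."""
    probs = clf.predict_proba(X_u)
    confident = probs.max(axis=1) >= threshold   # keep confident predictions only
    X_new = np.vstack([X_l, X_u[confident]])
    y_new = np.concatenate([y_l, probs[confident].argmax(axis=1)])
    return SVC(probability=True).fit(X_new, y_new), X_new, y_new

clf2, X2, y2 = self_training_round(clf, X_seed, y_seed, X_stream)
```

Filtering by prediction confidence is what gives self-training its impostor robustness: uncertain samples, which are the most likely to be mislabelled, never enter the training set.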
Domain-Specific Face Synthesis for Video Face Recognition from a Single Sample Per Person
The performance of still-to-video FR systems can decline significantly
because faces captured in an unconstrained operational domain (OD) over multiple
video cameras have a different underlying data distribution compared to faces
captured under controlled conditions in the enrollment domain (ED) with a still
camera. This is particularly true when individuals are enrolled to the system
using a single reference still. To improve the robustness of these systems, it
is possible to augment the reference set by generating synthetic faces based on
the original still. However, without knowledge of the OD, many synthetic images
must be generated to account for all possible capture conditions. FR systems
may, therefore, require complex implementations and yield lower accuracy when
training on many less relevant images. This paper introduces an algorithm for
domain-specific face synthesis (DSFS) that exploits the representative
intra-class variation information available from the OD. Prior to operation, a
compact set of faces from unknown persons appearing in the OD is selected
through clustering in the captured condition space. The domain-specific
variations of these face images are projected onto the reference stills by
integrating an image-based face relighting technique inside the 3D
reconstruction framework. A compact set of synthetic faces is generated that
resemble individuals of interest under the capture conditions relevant to the
OD. In a particular implementation based on sparse representation
classification, the synthetic faces generated with the DSFS are employed to
form a cross-domain dictionary that accounts for structured sparsity.
Experimental results reveal that augmenting the reference gallery set of FR
systems using the proposed DSFS approach can provide a higher level of accuracy
compared to state-of-the-art approaches, with only a moderate increase in
computational complexity.
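The first step of DSFS, selecting a compact representative set through clustering in the capture-condition space, might be sketched as below. The two-dimensional condition descriptor (pose angle, illumination) and the cluster count are illustrative assumptions, not the paper's actual condition space.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical capture-condition descriptors for unknown faces observed in the
# operational domain: (pose angle in degrees, log illumination level).
conditions = np.vstack([
    rng.normal([0, 0], 0.5, (50, 2)),    # frontal, dim
    rng.normal([30, 2], 0.5, (50, 2)),   # profile, bright
    rng.normal([-30, 1], 0.5, (50, 2)),  # other profile, medium
])

# Cluster in condition space; each centre stands in for one capture
# condition that is actually relevant in the operational domain.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(conditions)
representatives = km.cluster_centers_
```

Each representative condition would then drive one synthetic rendering of the reference still, keeping the augmented gallery compact instead of enumerating every possible capture condition.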
Face recognition in video surveillance from a single reference sample through domain adaptation
Face recognition (FR) has received significant attention during the past decades in many applications, such as law enforcement, forensics, access control, information security and video surveillance (VS), due to its covert and non-intrusive nature. FR systems specialized for VS seek to accurately detect the presence of target individuals of interest over a distributed network of video cameras under uncontrolled capture conditions. Recognizing the faces of target individuals in such an environment is a challenging problem because the appearance of faces varies due to changes in pose, scale, illumination, occlusion, blur, etc. Computational complexity is also an important consideration because of the growing number of cameras and the processing time of state-of-the-art face detection, tracking and matching algorithms.
In this thesis, adaptive systems are proposed for accurate still-to-video FR, where a single (or very few) reference still or mug-shot is available to design a facial model of the target individual. This is a common situation in real-world watch-list screening applications due to the cost and feasibility of capturing reference stills and managing facial models over time. The limited number of reference stills can adversely affect the robustness of facial models to intra-class variations, and therefore the performance of still-to-video FR systems. Moreover, a specific challenge in still-to-video FR is the shift between the enrollment domain, where high-quality reference faces are captured under controlled conditions with still cameras, and the operational domain, where faces are captured with video cameras under uncontrolled conditions. To overcome the challenges of such single sample per person (SSPP) problems, three new systems are proposed for accurate still-to-video FR, based on multiple face representations and domain adaptation. In particular, this thesis presents three contributions, described in more detail below.
In Chapter 3, a multi-classifier framework is proposed for robust still-to-video FR based on multiple and diverse face representations of a single reference face still. During enrollment of a target individual, the single reference still is modeled using an ensemble of SVM classifiers based on different patches and face descriptors. Multiple feature extraction techniques are applied to patches isolated in the reference still to generate a diverse SVM pool that provides robustness to common nuisance factors (e.g., variations in illumination and pose). The estimation of discriminant feature subsets, classifier parameters, decision thresholds, and ensemble fusion functions is achieved using the high-quality reference still and a large number of faces captured in lower-quality video of non-target individuals in the scene. During operations, the most competent subset of SVMs is dynamically selected according to capture conditions. Finally, a head-face tracker gradually regroups faces captured from different people appearing in a scene, while each individual-specific ensemble performs face matching. The accumulation of matching scores per face track leads to robust spatio-temporal FR when accumulated ensemble scores surpass a detection threshold. Experimental results obtained with the Chokepoint and COX-S2V datasets show a significant improvement in performance w.r.t. reference systems, especially when individual-specific ensembles (1) are designed using exemplar-SVMs rather than one-class SVMs, and (2) exploit score-level fusion of local SVMs (trained using features extracted from each patch), rather than either decision-level or feature-level fusion with a global SVM (trained by concatenating features extracted from patches).
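The score-level fusion of local patch SVMs and the accumulation of matching scores along a face track can be illustrated as follows. The scores and thresholds are made up for the example; in the actual system the scores come from the trained per-patch SVMs and the detection threshold is estimated during design.

```python
import numpy as np

def fuse_patch_scores(patch_scores, weights=None):
    """Score-level fusion: combine the decision scores of the local
    per-patch SVMs for a single probe face into one ensemble score."""
    s = np.asarray(patch_scores, dtype=float)
    if weights is None:
        weights = np.ones_like(s) / s.size   # plain averaging by default
    return float(np.dot(weights, s))

def track_decision(frame_scores, detect_threshold=0.5):
    """Spatio-temporal FR: accumulate fused scores along a face track and
    flag the target when the running mean exceeds the threshold."""
    acc = np.cumsum(frame_scores) / np.arange(1, len(frame_scores) + 1)
    return bool(acc[-1] > detect_threshold), acc

fused = fuse_patch_scores([0.9, 0.7, 0.8])   # three hypothetical patch SVMs
hit, acc = track_decision([fused, 0.6, 0.75])
```

Accumulating over a track rather than deciding per frame is what makes the decision robust: a single noisy frame cannot trigger or suppress a detection on its own.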
In Chapter 4, an efficient multi-classifier system (MCS) is proposed for accurate still-to-video FR based on multiple face representations and domain adaptation (DA). An individual-specific ensemble of exemplar-SVM (e-SVM) classifiers is designed to improve robustness to intra-class variations. During enrollment of a target individual, an ensemble is used to model the single reference still, where multiple face descriptors and random feature subspaces allow the generation of a diverse pool of patch-wise classifiers. To adapt these ensembles to the operational domain, e-SVMs are trained using labeled face patches extracted from the reference still versus patches extracted from cohort and other non-target stills, mixed with unlabeled patches extracted from the corresponding face trajectories captured with surveillance cameras. During operations, the most competent classifiers per given probe face are dynamically selected and weighted based on internal criteria determined in the feature space of the e-SVMs. This chapter also investigates the impact of using different training schemes for DA, as well as the impact of the validation set of non-target faces extracted from stills and video trajectories of unknown individuals in the operational domain. The results indicate that the proposed system can surpass state-of-the-art accuracy, yet with a significantly lower computational complexity.
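A minimal sketch of dynamic classifier selection and weighting, assuming a per-probe competence score is already available for each patch-wise classifier (the thesis derives it from internal criteria in the e-SVM feature space; here both scores and competences are invented numbers):

```python
import numpy as np

def dynamic_selection(scores, competences, k=2):
    """Keep only the k most competent patch classifiers for this probe face
    and fuse their scores with competence-proportional weights."""
    idx = np.argsort(competences)[-k:]           # top-k competent classifiers
    w = competences[idx] / competences[idx].sum()
    return float(np.dot(w, scores[idx]))

scores = np.array([0.2, 0.9, 0.8])       # per-patch decision scores
competences = np.array([0.1, 0.6, 0.3])  # per-probe competence estimates
fused = dynamic_selection(scores, competences, k=2)
```

Dropping the least competent classifiers per probe is also what keeps the system efficient: only a small subset of the pool is evaluated in the fusion step.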
In Chapter 5, a deep convolutional neural network (CNN) is proposed to cope with the discrepancies between facial regions of interest (ROIs) isolated in still and video faces for robust still-to-video FR. To that end, a face-flow autoencoder CNN called FFA-CNN is trained using both still and video ROIs in a supervised, end-to-end, multi-task learning framework. A novel loss function containing a weighted combination of pixel-wise, symmetry-wise and identity-preserving losses is introduced to optimize the network parameters. The proposed FFA-CNN incorporates a reconstruction network and a fully-connected classification network, where the former reconstructs a well-illuminated frontal ROI with neutral expression from a pair of low-quality non-frontal video ROIs, and the latter compares the still and video representations to provide matching scores. Integrating the proposed weighted loss function with a supervised end-to-end training approach leads to the generation of high-quality frontal faces and to discriminative face representations that are similar for the same identity. Simulation results obtained on the challenging COX Face DB confirm that the proposed FFA-CNN achieves convincing performance compared to current state-of-the-art CNN-based FR systems.
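The weighted loss combination could look roughly like the numpy sketch below. The weight values and the cosine form of the identity-preserving term are assumptions for illustration, not the exact FFA-CNN formulation.

```python
import numpy as np

def pixel_loss(recon, target):
    """Pixel-wise loss: mean squared error between reconstruction and target."""
    return float(np.mean((recon - target) ** 2))

def symmetry_loss(recon):
    """Symmetry-wise loss: penalise left-right asymmetry of the
    reconstructed frontal face (compare against its horizontal flip)."""
    return float(np.mean((recon - recon[:, ::-1]) ** 2))

def identity_loss(emb_still, emb_video):
    """Identity-preserving loss: 1 - cosine similarity between the still
    and video embeddings of the same person."""
    a = emb_still / np.linalg.norm(emb_still)
    b = emb_video / np.linalg.norm(emb_video)
    return float(1.0 - a @ b)

def ffa_loss(recon, target, emb_still, emb_video, w=(1.0, 0.1, 0.5)):
    """Weighted combination of the three terms; the weights are illustrative."""
    return (w[0] * pixel_loss(recon, target)
            + w[1] * symmetry_loss(recon)
            + w[2] * identity_loss(emb_still, emb_video))
```

With a perfect, symmetric reconstruction and matching embeddings all three terms vanish; each term independently pushes the network toward one of the stated goals (faithful, frontal, identity-preserving output).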
Incremental Learning Through Unsupervised Adaptation in Video Face Recognition
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01

[Abstract]
In the last decade, deep learning has brought an unprecedented leap forward for
general computer-vision classification problems. One of the keys to this success is the
availability of extensive, richly annotated datasets to use as training samples.
In some sense, a deep learning network summarises this enormous amount of data
into handy vector representations. For this reason, when the differences between
training datasets and the data acquired during operation (due to factors such as
the acquisition context) are highly marked, end-to-end deep learning methods are
susceptible to performance degradation.
While the immediate solution to mitigate these problems is to resort to additional
data collection, with its corresponding annotation procedure, this solution
is far from optimal. The countless possible variations of the visual world can
turn the collection and annotation of data into an endless task. Even more so when
there are specific applications in which this additional action is difficult or simply not
possible to perform due to, among other reasons, cost-related problems or privacy
issues.
This Thesis proposes to tackle all these problems from the adaptation point of
view. Thus, the central hypothesis assumes that it is possible to use operational
data with almost no supervision to improve the performance we would achieve with
general-purpose recognition systems. To do so, and as a proof-of-concept, the field
of study of this Thesis is restricted to face recognition, a paradigmatic application
in which the context of acquisition can be especially relevant.
This work begins by examining the intrinsic differences between some of the
face recognition contexts and how they directly affect performance. To do so, we
compare different datasets, and their contexts, against each other using some of the
most advanced feature representations available to determine the actual need for
adaptation.
From this point, we move on to present the novel method that represents the central
contribution of the Thesis: the Dynamic Ensemble of SVM (De-SVM). This
method implements the adaptation capabilities by performing unsupervised incremental
learning using its own predictions as pseudo-labels for the update decision
(the self-training strategy). Experiments are performed under video surveillance
conditions, a paradigmatic example of a very specific context in which labelling
processes are particularly complicated. The core ideas of De-SVM are tested in
different face recognition sub-problems: face verification and the more complex
closed-set and open-set face recognition.
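The difference between the closed-set and open-set settings comes down to a rejection option, which might be sketched as follows; the rejection threshold is an illustrative value, not one taken from the Thesis.

```python
import numpy as np

def open_set_predict(scores, labels, reject_threshold=0.5):
    """Closed-set: return the best-scoring known identity.
    Open-set: additionally reject the probe as 'unknown' when even the
    best score falls below the threshold (i.e., a probable impostor)."""
    best = int(np.argmax(scores))
    if scores[best] < reject_threshold:
        return "unknown"
    return labels[best]
```

The rejection branch is what makes open-set recognition harder: the system must acquire knowledge about the targets from unlabelled data without ever raising its scores for impostors above the threshold.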
Experiments have shown promising behaviour in terms of both unsupervised
knowledge acquisition and robustness against impostors, surpassing the
performance achieved by state-of-the-art non-adaptive methods.

Funding and Technical Resources

For the successful development of this Thesis, it was necessary to rely on a series of indispensable means, included in the following list:
• Working material, human and financial support, provided primarily by the CITIC and
the Computer Architecture Group of the University of A Coruña and the CiTIUS
of the University of Santiago de Compostela, along with a PhD grant funded by
the Xunta de Galicia and the European Social Fund.
• Access to bibliographical material through the library of the University of A
Coruña.
• Additional funding through the following research projects:
State funding by the Ministry of Economy and Competitiveness of Spain
(project TIN2017-90135-R MINECO, FEDER)
Spatiotemporal visual analysis of human actions
In this dissertation we propose four methods for the recognition of human activities. In all four of
them, the representation of the activities is based on spatiotemporal features that are automatically
detected at areas where there is a significant amount of independent motion, that is, motion that is
due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features
throughout this dissertation. The algorithms presented, however, can be used with any kind of features,
as long as the latter are well localized and have a well-defined area of support in space and time.

We introduce the utilized spatiotemporal salient points in the first method presented in this dissertation.
By extending previous work on spatial saliency, we measure the variations in the information content of
pixel neighborhoods both in space and time, and detect the points at the locations and scales for which
this information content is locally maximized. In this way, an activity is represented as a collection of
spatiotemporal salient points. We propose an iterative linear space-time warping technique in order
to align the representations in space and time and propose to use Relevance Vector Machines (RVM)
in order to classify each example into an action category.

In the second method proposed in this dissertation, we enhance the representations acquired by the first method. More specifically,
we propose to track each detected point in time, and create representations based on sets of trajectories,
where each trajectory expresses how the information engulfed by each salient point evolves over time.
In order to deal with imperfect localization of the detected points, we augment the observation model
of the tracker with background information, acquired using a fully automatic background estimation
algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels.
In addition, we perform experiments where the tracked templates are localized on specific parts of the
body, like the hands and the head, and we further augment the tracker’s observation model using a
human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm
(LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and
RVMs for classification.

In the third method that we propose, we assume that neighboring salient
points follow a similar motion. This is in contrast to the previous method, where each salient point was
tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual
descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The
latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal
neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant
to translation and scaling in space-time; this invariance is ensured by coupling the neighborhood
dimensions to the scale at which the corresponding spatiotemporal salient points are detected. The descriptors that are
extracted across the whole dataset are subsequently clustered in order to create a codebook, which is
used in order to represent the overall motion of the subjects within small temporal windows. Finally, we use boosting in order to select the most discriminative of these windows for each class, and RVMs for classification.

The fourth and last method addresses the joint problem of localization and recognition
of human activities depicted in unsegmented image sequences. Its main contribution is the use of an
implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal
localization of characteristic ensembles of spatiotemporal features. The latter are localized around
automatically detected salient points. Evidence for the spatiotemporal localization of the activity
is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in
order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct
class-specific spatiotemporal models, which encode where in space and time each codeword ensemble
appears in the training set. During testing, each activated codeword ensemble casts probabilistic
votes concerning the spatiotemporal localization of the activity, according to the information stored
during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable
hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume
which potentially engulfs the activity, and is verified by performing action category classification with
an RVM classifier.
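The LCSS-based trajectory similarity used by the second method can be sketched as below. The matching tolerance `eps` and temporal window `delta` are illustrative values, and the dissertation uses a variant of the algorithm rather than this plain dynamic-programming form.

```python
import numpy as np

def lcss_similarity(traj_a, traj_b, eps=0.5, delta=2):
    """LCSS similarity between two trajectories: the length of the longest
    common subsequence, where two points 'match' if they lie within eps in
    space and within delta time steps of each other, normalised by the
    length of the shorter trajectory."""
    a, b = np.asarray(traj_a, float), np.asarray(traj_b, float)
    n, m = len(a), len(b)
    L = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) <= delta and np.linalg.norm(a[i - 1] - b[j - 1]) <= eps:
                L[i, j] = L[i - 1, j - 1] + 1     # points match: extend subsequence
            else:
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    return L[n, m] / min(n, m)
```

Unlike a plain Euclidean distance between trajectories, LCSS tolerates gaps and outlier points, which suits trajectories produced by an imperfect tracker.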
Advances in Monocular Exemplar-based Human Body Pose Analysis: Modeling, Detection and Tracking
This thesis contributes to the analysis of human body pose from image sequences acquired with a single camera. This topic has a wide range of potential applications in video surveillance, video games and biomedical applications. Exemplar-based techniques have been successful; however, their accuracy depends on the similarity of the camera viewpoint and scene properties between the training and test images. Given a training dataset captured with a reduced number of fixed cameras parallel to the ground, three possible scenarios with increasing levels of difficulty have been identified and analysed: 1) a static camera parallel to the ground, 2) a fixed surveillance camera with a considerably different viewing angle, and 3) a video sequence captured with a moving camera, or simply a single static image.