
    Interpretable and Generalizable Person Re-Identification with Query-Adaptive Convolution and Temporal Lifting

    For person re-identification, existing deep networks often focus on representation learning. However, without transfer learning, the learned model is fixed as is and cannot adapt to various unseen scenarios. In this paper, beyond representation learning, we consider how to formulate person image matching directly in deep feature maps. We treat image matching as finding local correspondences in feature maps, and construct query-adaptive convolution kernels on the fly to achieve local matching. In this way, the matching process and results are interpretable, and this explicit matching generalises better than representation features to unseen scenarios, such as unknown misalignments and pose or viewpoint changes. To facilitate end-to-end training of this architecture, we further build a class memory module to cache feature maps of the most recent samples of each class, so as to compute image matching losses for metric learning. Through direct cross-dataset evaluation, the proposed Query-Adaptive Convolution (QAConv) method gains large improvements over popular learning methods (about 10%+ mAP), and achieves comparable results to many transfer learning methods. Besides, a model-free temporal co-occurrence based score weighting method called TLift is proposed, which improves the performance further, achieving state-of-the-art results in cross-dataset person re-identification. Code is available at https://github.com/ShengcaiLiao/QAConv. Comment: This is the ECCV 2020 version, including the appendix.
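
    Since the abstract describes matching as correlating kernels built on the fly from the query's feature map with a gallery feature map, the snippet below is a minimal sketch of that local-matching idea. The function name, the cosine normalisation, and the bidirectional max-then-average aggregation are illustrative assumptions, not the authors' QAConv implementation.

    import numpy as np

    def qaconv_like_score(query_fmap, gallery_fmap):
        """Illustrative local-matching score in the spirit of QAConv (a sketch).

        query_fmap, gallery_fmap: (C, H, W) deep feature maps. Each query
        location acts as a 1x1 kernel that is correlated with every gallery
        location; the best local match per location is kept and averaged.
        """
        def flatten_and_normalise(fmap):
            x = fmap.reshape(fmap.shape[0], -1).T            # (H*W, C) local descriptors
            return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

        q = flatten_and_normalise(query_fmap)                # (Nq, C)
        g = flatten_and_normalise(gallery_fmap)              # (Ng, C)
        sim = q @ g.T                                        # (Nq, Ng) cosine similarities
        # Best gallery match for each query location and vice versa,
        # averaged over both directions for a symmetric matching score.
        return 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

    # Usage: score three gallery feature maps against one query.
    rng = np.random.default_rng(0)
    query = rng.standard_normal((256, 24, 8))
    gallery = [rng.standard_normal((256, 24, 8)) for _ in range(3)]
    print([round(qaconv_like_score(query, g), 4) for g in gallery])

    In the full method the matching is trained end-to-end, with the class memory module caching recent feature maps per class to compute matching losses; the sketch above only illustrates the query-adaptive local correlation itself.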

    Advanced Biometrics with Deep Learning

    Biometrics, such as fingerprint, iris, face, hand print, hand vein, speech and gait recognition, as a means of identity management, have become commonplace nowadays in various applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic, handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction and recognition based solely on biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into 4 categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voice print, and others.

    Towards a Self-Sufficient Face Verification System

    The absence of a previous collaborative manual enrolment represents a significant handicap towards designing a face verification system for face re-identification purposes. In this scenario, the system must learn the target identity incrementally, using data from the video stream during the operational authentication phase, so manual labelling cannot be assumed beyond the first few frames. On the other hand, even the most advanced methods trained on large-scale and unconstrained datasets suffer performance degradation when no adaptation to specific contexts is performed. This work proposes an adaptive face verification system for the continuous re-identification of a target identity, within the framework of incremental unsupervised learning. Our Dynamic Ensemble of SVM is capable of incorporating non-labelled information to improve the performance of any model, even when its initial performance is modest. The proposal uses the self-training approach and is compared against other classification techniques within this same paradigm. Results show promising behaviour in terms of both knowledge acquisition and impostor robustness. This work has received financial support from the Spanish government (project TIN2017-90135-R MINECO (FEDER)), from the Consellaría de Cultura, Educación e Ordenación Universitaria (accreditations 2016–2019, EDG431G/01 and ED431G/08) and reference competitive groups (2017–2020, ED431C 2017/04), and from the European Regional Development Fund (ERDF). Eric López-López has received financial support from the Xunta de Galicia and the European Union (European Social Fund, ESF).
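
    As the abstract builds on self-training, where the system's own confident predictions become pseudo-labels for incremental updates, the following is a schematic sketch of such a loop around a standard SVM. The confidence threshold, the full refit at each update, and the function names are illustrative assumptions and do not reproduce the Dynamic Ensemble of SVM itself.

    import numpy as np
    from sklearn.svm import SVC

    def self_training_stream(X_seed, y_seed, stream, conf_threshold=0.9):
        """Schematic self-training loop for target/impostor verification.

        X_seed, y_seed: the few manually labelled frames available at the start
        (1 = target identity, 0 = other/impostor; both classes must appear).
        stream: iterable of unlabelled feature vectors from the video feed.
        Confident predictions are kept as pseudo-labels and the model is refit.
        """
        X, y = list(X_seed), list(y_seed)
        clf = SVC(kernel="rbf", probability=True).fit(X, y)

        for x in stream:
            proba = clf.predict_proba([x])[0]
            label, conf = int(np.argmax(proba)), float(np.max(proba))
            if conf >= conf_threshold:        # only confident samples update the model
                X.append(list(x))
                y.append(label)
                clf.fit(X, y)                 # incremental step approximated by refitting
        return clf

    In the actual system, an ensemble of such classifiers is maintained and updated dynamically, which is what allows non-labelled information to improve even an initially modest model while remaining robust to impostors.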

    Deep Multi-View Learning for Visual Understanding

    PhD Thesis. Multi-view data is the result of an entity being perceived or represented from multiple perspectives. Many applications in visual understanding involve multi-view data; for example, the face images used to train a recognition system are usually captured by different devices from multiple angles. This thesis focuses on cross-view visual recognition problems, e.g., identifying face images of the same person across different cameras. Several representative multi-view settings, from supervised multi-view learning to the more challenging unsupervised domain adaptive (UDA) multi-view learning, are investigated, and novel multi-view learning algorithms are proposed correspondingly. More specifically, the proposed methods are based on advanced deep neural network (DNN) architectures for better handling of visual data. However, directly combining multi-view learning objectives with DNNs can cause issues, e.g., with scalability, and limit the application scenarios and model performance; corresponding novelties in the DNN methods are thus required. The thesis is organised into three parts, each focusing on one multi-view learning setting with a novel solution:

    Chapter 3: A supervised multi-view learning setting with two different views is studied. To recognise data samples across views, one strategy is to align them in a common feature space via correlation maximisation, also known as canonical correlation analysis (CCA). Deep CCA has been proposed for better performance via non-linear projection with deep neural networks. Existing deep CCA models typically decorrelate the deep feature dimensions of each view before their Euclidean distances are minimised in the common space. This feature decorrelation is achieved by enforcing an exact decorrelation constraint, which is computationally expensive due to matrix inversion or SVD operations; existing deep CCA models are therefore inefficient and have scalability issues. Furthermore, the exact decorrelation is incompatible with gradient-based deep model training and results in sub-optimal solutions. To overcome these issues, a novel deep CCA model, Soft CCA, is introduced in this thesis. Specifically, the exact decorrelation is replaced by soft decorrelation via a mini-batch based Stochastic Decorrelation Loss (SDL), which can be jointly optimised with the other training objectives (a minimal sketch of such a penalty follows this abstract). In addition, the SDL loss can be applied to other deep models beyond multi-view learning.

    Chapter 4: The supervised multi-view learning setting in which more than two views exist is studied in this chapter. Recently developed deep multi-view learning algorithms either learn a latent visual representation based on a single semantic level and/or require laborious human annotation of these factors as attributes. A novel deep neural network architecture, called Multi-Level Factorisation Net (MLFN), is proposed to automatically factorise the visual appearance into latent discriminative factors at multiple semantic levels without manual annotation. The main purpose is to force different views to share the same latent factors so that they can be aligned at all layers. Specifically, MLFN is composed of multiple stacked blocks. Each block contains multiple factor modules to model latent factors at a specific level, and factor selection modules that dynamically select the factor modules to interpret the content of each input image. The outputs of the factor selection modules also provide a compact latent factor descriptor that is complementary to the conventional deeply learned feature, and the two can be fused efficiently. The effectiveness of the proposed MLFN is demonstrated not only on large-scale cross-view recognition problems but also on general object categorisation tasks.

    Chapter 5: The last problem is a special unsupervised domain adaptation setting called unsupervised domain adaptive (UDA) multi-view learning. It contains a fully annotated dataset as the source domain and another unlabelled dataset with relevant tasks as the target domain. The main purpose is to improve performance on the unlabelled dataset using the annotated data from the other dataset. More importantly, this setting further requires that both the source and target domains be multi-view datasets with relevant tasks. Therefore, the assumption of an aligned label space across domains is inappropriate in UDA multi-view learning. For example, person re-identification (Re-ID) datasets built on different surveillance scenarios capture images of different people and should be given disjoint person identity labels. Existing methods for UDA multi-view learning align the domains either in the raw image space or in a feature embedding space. In this thesis, a different framework, multi-task learning, is adopted, with domain-specific objectives used to learn a common space that enables knowledge transfer. Conventional supervised losses can be used for the labelled source data, while the unsupervised objectives for the target domain play the key role in domain adaptation. Two novel unsupervised objectives are introduced for UDA multi-view learning, resulting in the two models below. The first model, termed the common factorised space model (CFSM), is built on the assumption that semantic latent attributes are shared between the source and target domains, since they are relevant multi-view learning tasks. Unlike existing methods based on domain alignment, CFSM emphasises transferring information across domains by discovering discriminative latent factors in the proposed common space. However, the multi-view data from the target domain is unlabelled; therefore, an unsupervised factorisation loss is derived and applied on the common space for latent factor discovery across domains. The second model also learns a shared embedding space from the multi-view data of both domains, but under a different assumption: it attempts to discover the latent correspondences of the multi-view data in the unlabelled target data. The target data's contribution comes from a clustering process, with each cluster revealing the underlying cross-view correspondences across multiple views in the target domain. To this end, a novel Stochastic Inference for Deep Clustering (SIDC) method is proposed. It reduces the self-reinforcing errors that lead to premature convergence to a sub-optimal solution by replacing the conventional deterministic cluster assignment with a stochastic one.
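
    A minimal sketch of the soft decorrelation penalty mentioned in Chapter 3 is given below: instead of enforcing exact decorrelation via matrix inversion or SVD, the off-diagonal entries of the mini-batch covariance are penalised, so the term can be optimised jointly with the CCA objective by gradient descent. The exact SDL formulation in the thesis (for instance, any running estimate across mini-batches) may differ, and euclidean_cca_loss in the usage comment is a hypothetical name.

    import torch

    def soft_decorrelation_loss(features: torch.Tensor) -> torch.Tensor:
        """Soft decorrelation penalty on one view's activations (a sketch).

        features: (batch, dim) deep features of one view. Penalises the
        off-diagonal entries of the mini-batch covariance, replacing the
        exact (inversion/SVD-based) decorrelation constraint with a
        differentiable loss.
        """
        z = features - features.mean(dim=0, keepdim=True)     # centre each dimension
        cov = (z.T @ z) / max(features.shape[0] - 1, 1)        # (dim, dim) covariance
        off_diag = cov - torch.diag(torch.diagonal(cov))       # zero out the diagonal
        return off_diag.abs().sum() / features.shape[1]

    # Jointly optimised with the distance term in the common space, e.g.
    # loss = euclidean_cca_loss(z_a, z_b) \
    #        + lam * (soft_decorrelation_loss(z_a) + soft_decorrelation_loss(z_b))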

    Incremental Learning Through Unsupervised Adaptation in Video Face Recognition

    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01. In the last decade, deep learning has brought an unprecedented leap forward for general computer vision classification problems. One of the keys to this success is the availability of extensive, richly annotated datasets to use as training samples. In some sense, a deep learning network summarises this enormous amount of data into handy vector representations. For this reason, when the differences between the training datasets and the data acquired during operation (due to factors such as the acquisition context) are highly marked, end-to-end deep learning methods are susceptible to performance degradation.
    While the immediate solution to mitigate these problems is to resort to additional data collection and its corresponding annotation procedure, this solution is far from optimal. The immeasurable possible variations of the visual world quickly turn the collection and annotation of data into an endless task, even more so when there are specific applications in which this additional action is difficult or simply impossible to perform due to, among other reasons, cost-related problems or privacy issues. This Thesis proposes to tackle all these problems from the adaptation point of view. Thus, the central hypothesis assumes that it is possible to use operational data with almost no supervision to improve the performance we would achieve with general-purpose recognition systems. To do so, and as a proof of concept, the field of study of this Thesis is restricted to face recognition, a paradigmatic application in which the context of acquisition can be especially relevant. This work begins by examining the intrinsic differences between some of the face recognition contexts and how they directly affect performance. To do so, we compare different datasets, and their contexts, against each other using some of the most advanced feature representations available, in order to determine the actual need for adaptation. From this point, we move on to present the novel method that represents the central contribution of the Thesis: the Dynamic Ensemble of SVM (De-SVM). This method implements the adaptation capability by performing unsupervised incremental learning, using its own predictions as pseudo-labels for the update decision (the self-training strategy). Experiments are performed under video surveillance conditions, a paradigmatic example of a very specific context in which labelling processes are particularly complicated. The core ideas of De-SVM are tested on different face recognition sub-problems: face verification and the more complex closed- and open-set face recognition. The experiments have shown promising behaviour in terms of both unsupervised knowledge acquisition and robustness against impostors, surpassing the performance achieved by state-of-the-art non-adaptive methods.

    Funding and Technical Resources. For the successful development of this Thesis, it was necessary to rely on a series of indispensable means, included in the following list:
    • Working material and human and financial support, primarily from the CITIC and the Computer Architecture Group of the University of A Coruña and the CiTIUS of the University of Santiago de Compostela, along with a PhD grant funded by the Xunta de Galicia and the European Social Fund.
    • Access to bibliographical material through the library of the University of A Coruña.
    • Additional funding through the following research projects: state funding by the Ministry of Economy and Competitiveness of Spain (project TIN2017-90135-R MINECO, FEDER).
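
    Since the thesis tests De-SVM on open-set recognition, where impostors must be rejected, the snippet below sketches a generic open-set decision rule over an ensemble's scores: confident accept, confident reject, or abstain, with only the confident outcomes feeding the self-training update. The thresholds, the plain score averaging, and the three-way rule are illustrative assumptions rather than the De-SVM decision procedure.

    import numpy as np

    def open_set_decision(scores, accept_thr=0.6, reject_thr=0.4):
        """Illustrative open-set decision over ensemble scores (a sketch).

        scores: per-member similarity scores for one probe face in [0, 1].
        Returns "target", "impostor", or "unknown"; only the two confident
        outcomes would be turned into pseudo-labels for the next update,
        limiting self-reinforcing errors.
        """
        s = float(np.mean(scores))
        if s >= accept_thr:
            return "target"
        if s <= reject_thr:
            return "impostor"
        return "unknown"

    print(open_set_decision([0.71, 0.65, 0.80]))   # target
    print(open_set_decision([0.20, 0.35, 0.30]))   # impostor
    print(open_set_decision([0.55, 0.45, 0.50]))   # unknown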

    Cross-class Transfer Learning for Visual Data

    PhD Thesis. Automatic analysis of visual data is a key objective of computer vision research, and visual recognition of objects in images is one of the most important steps towards understanding and gaining insights into visual data. Most existing approaches to visual recognition in the literature are based on a supervised learning paradigm; unfortunately, they require a large amount of labelled training data, which severely limits their scalability. On the other hand, recognition is instantaneous and effortless for humans: they can recognise a new object without seeing any visual samples, just by knowing its description, leveraging similarities between the description of the new object and previously learned concepts. Motivated by this human recognition ability, this thesis proposes novel approaches to the cross-class transfer learning (cross-class recognition) problem, whose goal is to learn a model from seen classes (those with labelled training samples) that can generalise to unseen classes (those with labelled testing samples only) without any training data, i.e., seen and unseen classes are disjoint. Specifically, the thesis studies and develops new methods for three variants of cross-class transfer learning:

    Chapter 3: The first variant is transductive cross-class transfer learning, meaning that a labelled training set and an unlabelled test set are available for model learning. Considering the training set as the source domain and the test set as the target domain, a typical cross-class transfer learning approach assumes that the source and target domains share a common semantic space, into which the visual feature vector extracted from an image can be embedded using an embedding function. Existing approaches learn this function from the source domain and apply it without adaptation to the target one. They are therefore prone to the domain shift problem: the embedding function is only concerned with predicting the semantic representations of the seen training classes during learning, so it may underperform when applied to the test data. In this thesis, a novel cross-class transfer learning (CCTL) method is proposed based on unsupervised domain adaptation. Specifically, a novel regularised dictionary learning framework is formulated in which the target class labels are used to regularise the learned target domain embeddings, thus effectively overcoming the projection domain shift problem.

    Chapter 4: The second variant is inductive cross-class transfer learning, that is, only the training set is assumed to be available during model learning, resulting in a harder challenge than the previous one. Nevertheless, this reflects a real-world setting in which test data only becomes available after model learning. The main problem remains the same as in the previous variant: the domain shift problem occurs when the model learned only from the training set is applied to the test set without adaptation. In this thesis, a semantic autoencoder (SAE) is proposed, building on an encoder-decoder paradigm. First, a semantic space is defined so that knowledge transfer is possible from the seen classes to the unseen classes. Then, an encoder embeds/projects a visual feature vector into the semantic space, while the decoder imposes a generative task: the projection must be able to reconstruct the original visual features. The generative task forces the encoder to preserve richer information, so the encoder learned from seen classes is able to generalise better to the new unseen classes (a minimal sketch of such a model follows this abstract).

    Chapter 5: The third variant is unsupervised cross-class transfer learning. In this variant, no supervision is available for model learning, i.e., only unlabelled training data is available, making it the hardest setting of the three. The goal, however, is the same: learning some knowledge from the training data that can be transferred to test data whose labels are completely different from those of the training data. The thesis proposes a novel approach which requires no labelled training data yet is able to capture discriminative information. The proposed model is based on a new graph regularised dictionary learning algorithm. By introducing an l1-norm graph regularisation term, instead of the conventional squared l2-norm, the model is robust against the outliers and noise typical of visual data. Importantly, the graph and the representation are learned jointly, which further alleviates the effect of data outliers. As an application, person re-identification is considered for this variant in this thesis.
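
    The following is a minimal sketch of the tied-weight linear encoder-decoder idea behind the semantic autoencoder in Chapter 4: the encoder projects visual features into the semantic space while the decoder (the transposed encoder) must reconstruct the visual features. The specific objective, the closed-form Sylvester-equation solution used here, and the variable shapes are assumptions for illustration and may differ from the thesis's formulation.

    import numpy as np
    from scipy.linalg import solve_sylvester

    def fit_semantic_autoencoder(X, S, lam=0.5):
        """Fit a linear semantic autoencoder with tied weights (a sketch).

        X: (d, N) visual features of seen-class training samples.
        S: (k, N) their semantic representations (e.g. attribute vectors).
        Minimises ||X - W.T @ S||^2 + lam * ||W @ X - S||^2 over the encoder
        W (k, d); the decoder is the tied transpose W.T. Setting the gradient
        to zero gives the Sylvester equation A @ W + W @ B = C solved below.
        """
        A = S @ S.T                    # (k, k)
        B = lam * (X @ X.T)            # (d, d)
        C = (1.0 + lam) * (S @ X.T)    # (k, d)
        return solve_sylvester(A, B, C)

    # Zero-shot style usage: embed an unseen-class test feature into the
    # semantic space, where it can be matched to unseen-class prototypes.
    rng = np.random.default_rng(0)
    X_train, S_train = rng.standard_normal((512, 200)), rng.standard_normal((85, 200))
    W = fit_semantic_autoencoder(X_train, S_train)
    x_test = rng.standard_normal((512, 1))
    print((W @ x_test).shape)          # (85, 1) semantic embedding of a test image

    The reconstruction term is what the abstract refers to as the generative task: because W must also decode, it cannot discard visual information that only matters for the unseen classes.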