
    Self-supervised Face Representation Learning

    This thesis investigates fine-tuning deep face features in a self-supervised manner for discriminative face representation learning, developing methods that automatically generate pseudo-labels for training a neural network. Solving this problem advances the state of the art in representation learning and benefits a variety of practical downstream tasks. Learning a good representation with a deep learning algorithm usually requires a large-scale dataset with manually curated labels; fortunately, there is a vast amount of video on the internet that machines can use to learn an effective representation. We present methods that learn a strong face representation from large-scale data in the form of images or videos, proposing self-supervised approaches that generate pseudo-labels from the temporal structure of the video data and similarity constraints, thereby obtaining supervision from the data itself. We aim to learn a representation that exhibits small distances between samples from the same person and large inter-person distances in feature space. Metric learning can achieve this: it comprises a pull term, pulling data points from the same class closer, and a push term, pushing data points from different classes further apart. Metric learning improves feature quality but requires some form of external supervision to label pairs as same or different. In the case of face clustering in TV series, we may obtain this supervision from face tracks and other cues. Tracking acts as a form of high-precision clustering (grouping detections within a shot) and is used to automatically generate positive and negative pairs of face images. Inspired by this, we propose two variants of discriminative approaches: the Track-supervised Siamese network (TSiam) and the Self-supervised Siamese network (SSiam).
In TSiam, we utilize the tracking supervision to obtain pairs; additionally, we include negative training pairs for singleton tracks, i.e. tracks that do not temporally co-occur with any other track. As supervision from tracking may not always be available, we propose SSiam, an effective approach that enables metric learning without any supervision by generating the required pairs automatically during training. In SSiam, we dynamically generate positive and negative pairs by sorting distances (i.e. ranking) on a subset of frames, and thus do not have to rely solely on video- or track-based supervision. Next, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses automatically discovered partitions obtained from a clustering algorithm (FINCH) as weak supervision, along with inherent video constraints, to learn discriminative face features. As annotating datasets is costly and difficult, using label-free, weak supervision obtained from a clustering algorithm as a proxy learning task is promising. Through our analysis, we show that creating positive and negative training pairs from clustering predictions helps to improve performance on video face clustering. We then propose Face Grouping on Graphs (FGG), a method for unsupervised fine-tuning of deep face feature representations. We build a graph with positive and negative edges over a set of face tracks based on the temporal structure of the video data and similarity-based constraints. Using graph neural networks, the features communicate over the edges, allowing each track's feature to exchange information with its neighbors and thus pushing each representation in a direction in feature space that groups all representations of the same person together and separates representations of different persons.
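The distance-ranking idea behind SSiam can be sketched as follows. This is a hypothetical simplification (the thesis generates pairs dynamically during training with its own selection strategy): here each frame's nearest neighbor is treated as a pseudo-positive and its farthest frame as a pseudo-negative, purely to show how ranking distances yields training pairs without labels.

```python
import numpy as np

def mine_pairs(feats):
    """Hypothetical sketch of ranking-based pair mining: sort all frames
    by Euclidean distance to each query frame, then take the nearest as a
    pseudo-positive and the farthest as a pseudo-negative."""
    # pairwise squared Euclidean distances, shape (n, n)
    d = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(axis=-1)
    neg = d.argmax(axis=1)         # farthest frame -> pseudo-negative
    np.fill_diagonal(d, np.inf)    # exclude self-matches from positives
    pos = d.argmin(axis=1)         # nearest other frame -> pseudo-positive
    return pos, neg
```

The resulting (query, positive) and (query, negative) index pairs can then feed a contrastive or Siamese objective.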
Having developed these methods to generate weak labels for face representation learning, we next propose to encode face tracks in videos into compact yet effective descriptors that complement the previous methods towards learning a more powerful face representation. Specifically, we propose Temporal Compact Bilinear Pooling (TCBP) to encode the temporal segments in videos into a compact descriptor. TCBP captures interactions between each element of the feature representation and every other element over a long-range temporal context. We integrate our previous methods TSiam, SSiam and CCL with TCBP and demonstrate that TCBP has excellent capabilities for learning a strong face representation. We further show that TCBP has exceptional transfer abilities to applications such as multimodal video clip representation, which jointly encodes images, audio, video and text, and video classification. All of these contributions are demonstrated on benchmark video clustering datasets: The Big Bang Theory, Buffy the Vampire Slayer and Harry Potter 1. We provide extensive evaluations on these datasets, achieving a significant boost in performance over the base features and in comparison to state-of-the-art results.
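Compact bilinear pooling, which TCBP builds on, approximates the outer product of a feature with itself via Count Sketch projections multiplied in the Fourier domain. The sketch below is an assumption-laden illustration of that general idea with a simple temporal average, not the thesis's TCBP formulation; the function names and dimensions are made up for the example.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dims with random hash h and random signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)   # scatter-add signed entries into hashed bins
    return y

def temporal_bilinear_descriptor(frames, d=512, seed=0):
    """Illustrative temporal compact bilinear pooling: sketch each frame
    feature twice, multiply the sketches in the Fourier domain (circular
    convolution approximates the outer product), and average over time."""
    rng = np.random.default_rng(seed)
    n = frames.shape[1]
    h1, h2 = rng.integers(0, d, n), rng.integers(0, d, n)
    s1, s2 = rng.choice([-1, 1], n), rng.choice([-1, 1], n)
    out = np.zeros(d)
    for x in frames:
        f1 = np.fft.rfft(count_sketch(x, h1, s1, d))
        f2 = np.fft.rfft(count_sketch(x, h2, s2, d))
        out += np.fft.irfft(f1 * f2, n=d)   # sketched bilinear interaction
    return out / len(frames)               # temporal average -> one descriptor
```

The key property is that the descriptor dimension `d` stays fixed regardless of how many frames the track contains, while second-order feature interactions are still (approximately) captured.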

    AFMB-Net: DeepFake Detection Network Using Heart Rate Analysis

    With advances in deepfake generation technology, it is getting increasingly difficult to detect deepfakes. Deepfakes can be used for many malpractices such as blackmail, political manipulation and social media disinformation. These can lead to widespread misinformation and can be harmful to an individual's or an institution's reputation. It has become important to identify deepfakes effectively; while many machine learning techniques exist to identify them, these methods cannot keep up with the rapidly improving GAN technology used to generate deepfakes. Our project aims to identify deepfakes using machine learning together with heart rate analysis. The heart rate identified by our model is unique to each individual and cannot be spoofed or imitated by a GAN, and is thus robust to improving GAN technology. To solve the deepfake detection problem, we employ various machine learning models along with heart rate analysis to detect deepfakes.
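A common way to recover a heart-rate signal from face video is remote photoplethysmography (rPPG): subtle periodic color changes in facial skin track the pulse. The toy sketch below is an assumption, not AFMB-Net's pipeline; it takes a per-frame mean green-channel intensity trace, restricts the spectrum to the physiological band, and reads off the dominant frequency in beats per minute.

```python
import numpy as np

def estimate_heart_rate(green_means, fps=30.0, lo=0.7, hi=4.0):
    """Toy rPPG estimate (illustrative only): band-limit the mean
    green-channel trace to 0.7-4.0 Hz (~42-240 bpm) in the Fourier
    domain and return the dominant frequency as beats per minute."""
    x = np.asarray(green_means, dtype=float)
    x = x - x.mean()                              # remove the DC component
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)  # frequency axis in Hz
    spectrum = np.abs(np.fft.rfft(x))
    band = (freqs >= lo) & (freqs <= hi)          # physiological band only
    peak = freqs[band][spectrum[band].argmax()]   # strongest pulse frequency
    return 60.0 * peak                            # Hz -> beats per minute
```

Because a GAN synthesizes faces frame by frame, it has no reason to reproduce this coherent periodic signal, which is the intuition behind using heart rate as a detection cue.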

    End-to-end learning, and audio-visual human-centric video understanding

    The field of machine learning has seen tremendous progress in the last decade, largely due to the advent of deep neural networks. When trained on large-scale labelled datasets, these machine learning algorithms can learn powerful semantic representations directly from the input data, end-to-end. End-to-end learning requires the availability of three core components: useful input data, target outputs, and an objective function for measuring how well the model's predictions match the target outputs. In this thesis, we explore and overcome a series of challenges related to assembling these three components in sufficient format and scale for end-to-end learning. The first key idea presented in this thesis is to learn representations by enabling end-to-end learning for tasks where such challenges exist. We first explore whether better representations can be learnt for the image retrieval task by directly optimising the evaluation metric, Average Precision. This is a notoriously challenging task, because such rank-based metrics are non-differentiable. We introduce a simple objective function that optimises a smoothed approximation of Average Precision, termed Smooth-AP, and demonstrate the benefits of training end-to-end over prior approaches. Secondly, we explore whether a representation can be learnt end-to-end for the task of image editing, where target data does not exist at sufficient scale. We propose a self-supervised approach that simulates target data by augmenting off-the-shelf image data, giving remarkable benefits over prior work. The second idea presented in this thesis focuses on how to use the rich multi-modal signals that are essential to human perceptual systems as input data for deep neural networks. More specifically, we explore the use of audio-visual input data for the human-centric video understanding task.
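The smoothing trick behind a Smooth-AP-style objective can be sketched numerically: the hard indicator "item j outranks item i" is replaced by a temperature-scaled sigmoid, making each positive's rank, and hence Average Precision, differentiable. This NumPy sketch is an illustration of that relaxation under assumed names and a chosen temperature, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x, tau=0.01):
    """Temperature-scaled sigmoid: a smooth stand-in for the step function."""
    return 1.0 / (1.0 + np.exp(-x / tau))

def smooth_ap(scores, labels, tau=0.01):
    """Smoothed Average Precision for one query (illustrative sketch):
    each positive's rank is the soft count of items scored above it."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[None, :] - scores[:, None]   # diff[i, j] = s_j - s_i
    sig = sigmoid(diff, tau)                   # soft "j outranks i" indicator
    np.fill_diagonal(sig, 0.0)                 # an item does not outrank itself
    rank_all = 1.0 + sig.sum(axis=1)           # soft rank among all items
    rank_pos = 1.0 + sig[:, labels].sum(axis=1)  # soft rank among positives
    return (rank_pos[labels] / rank_all[labels]).mean()
```

A training loss would then be `1 - smooth_ap(...)`, and gradients flow through the sigmoid where the hard ranking indicator would have blocked them.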
Here, we first explore whether highly optimised speaker verification representations can transfer to the domain of movies, where humans intentionally disguise their voice. We do this by collecting an audio-visual dataset of humans speaking in movies. Second, given strong identity-discriminating representations, we present two methods that harness the complementarity and redundancy between multi-modal signals in order to build robust perceptual systems for determining who is present in a scene. These methods include an automated pipeline for labelling people in unlabelled video archives, and an approach for clustering people by identity in videos.