33 research outputs found

    Learning Multimodal Structures in Computer Vision

    A phenomenon or event can be observed by various kinds of detectors or under different conditions. Each such acquisition framework is a modality of the phenomenon. Because the modalities of a multimodal phenomenon are related, no single modality can fully describe the event of interest, and the fact that several modalities report on the same event introduces new challenges compared to exploiting each modality separately. We are interested in designing new algorithmic tools for sensor fusion within the signal representation of sparse coding, a popular methodology in signal processing, machine learning, and statistics. This coding scheme is learned from data and has been demonstrated to be capable of representing many modalities, such as natural images. We consider situations where we want the support of the model not only to be sparse but also to reflect a priori knowledge about the application at hand. Our goal is to extract a discriminative representation of the multimodal data that makes its essential characteristics easy to find in the subsequent analysis step, e.g., regression and classification. More precisely, sparse coding represents signals as linear combinations of a small number of bases from a dictionary. The idea is to learn a dictionary that encodes intrinsic properties of the multimodal data in a decomposition coefficient vector with maximal discriminatory power. We carefully design a multimodal representation framework that learns discriminative feature representations by fully exploiting both the modality-shared information, i.e., the information shared by the various modalities, and the modality-specific information, i.e., the information content of each modality individually. Moreover, it automatically learns the weights for the various feature components in a data-driven scheme. In other words, the physical interpretation of our learning framework is to fully exploit the correlated characteristics of the available modalities while leveraging the modality-specific character of each modality, adjusting their corresponding weights for different parts of the feature in recognition tasks.
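
    As a concrete illustration of the sparse coding building block described above (not the thesis's multimodal, discriminative extension), here is a minimal sketch using scikit-learn; the dictionary size, sparsity penalty, and synthetic data are illustrative assumptions.

```python
# Minimal sparse coding sketch: represent signals as sparse linear
# combinations of learned dictionary atoms. All sizes and penalties
# below are illustrative assumptions, not the thesis's settings.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))           # 200 signals of dimension 64 (synthetic stand-in)

dl = DictionaryLearning(
    n_components=128,                    # overcomplete dictionary of 128 atoms
    transform_algorithm="lasso_lars",    # l1-penalized codes -> sparse support
    transform_alpha=0.1,                 # sparsity penalty (illustrative)
    max_iter=10,
    random_state=0,
)
codes = dl.fit_transform(X)              # one sparse coefficient vector per signal
D = dl.components_                       # learned dictionary, atoms as rows
approx = codes @ D                       # each signal ~ a few atoms combined
```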

    Learning Generalizable Visual Patterns Without Human Supervision

    Owing to the existence of large labeled datasets, Deep Convolutional Neural Networks have ushered in a renaissance in computer vision. However, almost all of the visual data we generate daily - several human lifetimes' worth of it - remains unlabeled and thus out of reach of today's dominant supervised learning paradigm. This thesis focuses on techniques that steer deep models towards learning generalizable visual patterns without human supervision. Our primary tool in this endeavor is the design of Self-Supervised Learning tasks, i.e., pretext tasks whose labels involve no human labor. Besides enabling learning from large amounts of unlabeled data, we demonstrate how self-supervision can capture relevant patterns that supervised learning largely misses. For example, we design learning tasks that produce deep representations capturing shape from images, motion from video, and 3D pose features from multi-view data. Notably, the design of these tasks follows a common principle: the recognition of data transformations. The strong performance of the learned representations on downstream vision tasks such as classification, segmentation, action recognition, and pose estimation validates this pretext-task design. This thesis also explores the use of Generative Adversarial Networks (GANs) for unsupervised representation learning. Besides leveraging generative adversarial learning to define image transformations for self-supervised learning tasks, we also address training instabilities of GANs through the use of noise. While unsupervised techniques can significantly reduce the burden of supervision, in the end we still rely on some annotated examples to fine-tune learned representations towards a target task. To improve learning from scarce or noisy labels, we describe a supervised learning algorithm with improved generalization in these challenging settings.
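
    To make the "recognition of data transformations" principle concrete, here is a hedged sketch of one well-known pretext task of this kind, rotation prediction; it illustrates the principle but is not necessarily one of the thesis's exact tasks, and the backbone and data are stand-ins.

```python
# Transformation-recognition pretext task (rotation prediction): the label is
# the transformation applied, so no human annotation is needed. Illustrative
# sketch only; backbone and data are synthetic stand-ins.
import torch
import torch.nn as nn

def rotate_batch(images):
    """Rotate each image by 0/90/180/270 degrees; the angle index is the label."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

backbone = nn.Sequential(                     # stand-in encoder + 4-way head
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))

images = torch.randn(8, 3, 32, 32)            # unlabeled batch
rotated, labels = rotate_batch(images)
loss = nn.functional.cross_entropy(backbone(rotated), labels)
loss.backward()                               # self-supervised gradient step
```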

    Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective

    We present a new dataset condensation framework termed Squeeze, Recover and Relabel (SRe²L) that decouples the bilevel optimization of model and synthetic data during training, in order to handle varying dataset scales, model architectures, and image resolutions for effective dataset condensation. The proposed method is flexible across diverse dataset scales and offers several advantages: arbitrary resolutions of the synthesized images, low training cost and memory consumption even with high-resolution training, and the ability to scale up to arbitrary evaluation network architectures. Extensive experiments are conducted on Tiny-ImageNet and the full ImageNet-1K dataset. Under 50 IPC, our approach achieves the highest validation accuracies of 42.5% and 60.8% on Tiny-ImageNet and ImageNet-1K, outperforming all previous state-of-the-art methods by margins of 14.5% and 32.9%, respectively. Our approach is also approximately 52× (ConvNet-4) and 16× (ResNet-18) faster than MTT, with 11.6× and 6.4× lower memory consumption during data synthesis. Our code and condensed datasets of 50 and 200 IPC with a 4K recovery budget are available at https://zeyuanyin.github.io/projects/SRe2L/. Comment: Technical report.
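
    The decoupling can be pictured with a toy, runnable sketch: the model is trained once on the real data (squeeze), synthetic images are then optimized against the frozen model only (recover), and soft labels are extracted (relabel). All models, losses, and hyperparameters here are illustrative assumptions; in particular, the paper's recovery objective is richer than the single classification term used below.

```python
# Toy sketch of the decoupled Squeeze/Recover/Relabel pipeline. Because the
# model is frozen during recovery, there is no bilevel loop over model
# training, which is what lets the approach scale. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))   # toy model
real_x = torch.randn(64, 3, 8, 8)
real_y = torch.randint(0, 10, (64,))

# 1) Squeeze: compress the dataset's knowledge into model weights.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(50):
    opt.zero_grad()
    F.cross_entropy(model(real_x), real_y).backward()
    opt.step()
for p in model.parameters():
    p.requires_grad_(False)                    # freeze after squeezing

# 2) Recover: optimize synthetic images against the frozen model only.
syn_x = torch.randn(10, 3, 8, 8, requires_grad=True)
syn_y = torch.arange(10)                       # one image per class (IPC = 1)
img_opt = torch.optim.Adam([syn_x], lr=0.05)
for _ in range(100):
    img_opt.zero_grad()
    F.cross_entropy(model(syn_x), syn_y).backward()
    img_opt.step()

# 3) Relabel: store the trained model's soft labels for the synthetic set.
with torch.no_grad():
    soft_labels = F.softmax(model(syn_x), dim=1)
```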

    The Role of Global Labels in Few-Shot Classification and How to Infer Them

    Few-shot learning is a central problem in meta-learning, where learners must quickly adapt to new tasks given limited training data. Recently, feature pre-training has become a ubiquitous component in state-of-the-art meta-learning methods and has been shown to provide significant performance improvements. However, there is limited theoretical understanding of the connection between pre-training and meta-learning. Further, pre-training requires global labels shared across tasks, which may be unavailable in practice. In this paper, we show why exploiting pre-training is theoretically advantageous for meta-learning, and in particular the critical role of global labels. This motivates us to propose Meta Label Learning (MeLa), a novel meta-learning framework that automatically infers global labels in order to obtain robust few-shot models. Empirically, we demonstrate that MeLa is competitive with existing methods and provide extensive ablation experiments to highlight its key properties.
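
    One way to picture global-label inference (an illustration of the concept, not MeLa's actual algorithm): embed each task's class prototypes and cluster them, so task-local classes that recur across tasks receive a shared global label that can then drive ordinary supervised pre-training.

```python
# Conceptual sketch: cluster task-local class prototypes to obtain global
# labels. The embeddings are synthetic stand-ins and the cluster count is an
# assumption; this illustrates the idea, not MeLa's actual procedure.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Suppose 20 few-shot tasks, each with 5 local classes, prototypes in 64-d.
prototypes = rng.normal(size=(20 * 5, 64))

n_global = 30                                  # assumed number of global classes
global_labels = KMeans(n_clusters=n_global, n_init=10,
                       random_state=0).fit_predict(prototypes)
# global_labels[i] is the inferred global class of the i-th task-local class;
# these labels can now be used for standard supervised pre-training.
```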

    EgoHumans: An Egocentric 3D Multi-Human Benchmark

    We present EgoHumans, a new multi-view multi-human video benchmark to advance the state of the art in egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks capture either a single subject or indoor-only scenarios, which limits the generalization of computer vision algorithms to real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild, with annotations supporting diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We leverage consumer-grade glasses with a wearable camera for the egocentric view, which enables us to capture dynamic activities such as playing tennis, fencing, and volleyball. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images spanning diverse scenes, with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address these limitations, we propose EgoFormer, a novel approach with a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track the human pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 on the EgoHumans dataset. Comment: Accepted to ICCV 2023 (Oral).
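
    The claim that a multi-view setup yields 3D ground truth even under occlusion rests on standard multi-view geometry: a joint seen in at least two calibrated views can be triangulated. Below is a minimal linear (DLT) triangulation sketch; the toy cameras are assumptions standing in for calibrated rigs.

```python
# Minimal DLT triangulation: recover a 3D point from >=2 calibrated views.
# Projection matrices here are toy assumptions; in a real setup they come
# from camera calibration.
import numpy as np

def triangulate(P_list, uv_list):
    """Triangulate one 3D point from projection matrices and pixel observations."""
    rows = []
    for P, (u, v) in zip(P_list, uv_list):
        rows.append(u * P[2] - P[0])           # u * (row3 . X) = row1 . X
        rows.append(v * P[2] - P[1])           # v * (row3 . X) = row2 . X
    _, _, Vt = np.linalg.svd(np.stack(rows))
    X = Vt[-1]                                  # null vector = homogeneous point
    return X[:3] / X[3]

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # shifted camera
X_true = np.array([0.2, 0.1, 5.0, 1.0])
uv = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
print(triangulate([P1, P2], uv))                # ~ [0.2, 0.1, 5.0]
```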

    Non-Negative Discriminative Data Analytics

    Due to advancements in data acquisition techniques, datasets containing samples from multiple views have become more common (Jia et al. 2019). For instance, in genomics, a lymphoma patient's dataset may include gene expression, single nucleotide polymorphism (SNP), and array comparative genomic hybridization (aCGH) measurements. Learning from multiple views of the same object generally yields a better understanding of the hidden patterns in the data than learning from a single view. Most existing multi-view learning techniques, such as canonical correlation analysis (Hotelling 1936), multi-view support vector machines (Farquhar et al. 2006), and multiple kernel learning (Zhang et al. 2016), focus on extracting the information shared among multiple datasets. However, in some real-world applications it is appealing to extract the discriminative knowledge of multiple datasets, namely discriminative data analytics. For example, consider one dataset of gene-expression measurements from cancer patients and another of gene-expression levels from healthy volunteers, where the goal is to cluster the cancer patients according to molecular subtypes. Performing a single-view analysis such as principal component analysis (PCA) on either dataset yields information related to the knowledge common to both datasets (Garte et al. 1996). To address this challenge, contrastive PCA (Abid et al. 2017) and discriminative PCA (dPCA) (Jia et al. 2019) were proposed to extract dataset-specific information often missed by PCA. Inspired by dPCA, we propose a novel discriminative multi-view learning algorithm, namely Non-negative Discriminative Analysis (DNA), to extract the information unique to one dataset (a.k.a. view) with respect to the other dataset. This boils down to solving a non-negative matrix factorization problem. Furthermore, we apply the proposed DNA framework to various real-world downstream machine learning applications such as feature selection, dimensionality reduction, classification, and clustering.
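
    For intuition about the dataset-specific extraction that cPCA and dPCA target (and that DNA builds on), here is a minimal numpy sketch of contrastive PCA: it finds directions with high variance in the target data but low variance in the background data. The contrast strength alpha and the synthetic cohorts are illustrative assumptions.

```python
# Minimal contrastive PCA sketch: eigenvectors of the contrastive covariance
# C_target - alpha * C_background highlight target-specific structure.
# alpha and the synthetic data below are illustrative assumptions.
import numpy as np

def contrastive_pca(target, background, alpha=1.0, k=2):
    t = target - target.mean(axis=0)
    b = background - background.mean(axis=0)
    C = t.T @ t / len(t) - alpha * (b.T @ b / len(b))
    w, V = np.linalg.eigh(C)                  # ascending eigenvalues
    return V[:, np.argsort(w)[::-1][:k]]      # top-k contrastive directions

rng = np.random.default_rng(0)
background = rng.normal(size=(300, 10))       # e.g., healthy volunteers
extra = np.zeros((200, 10))
extra[:, :2] = 3.0 * rng.normal(size=(200, 2))
target = rng.normal(size=(200, 10)) + extra   # e.g., patients with subtype signal
V = contrastive_pca(target, background)
projection = (target - target.mean(axis=0)) @ V   # exposes patient-specific structure
```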