11 research outputs found

    Exploring semantic inter-class relationships (SIR) for zero-shot action recognition

    © Copyright 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
    Automatically recognizing a large number of action categories in videos is of significant importance for video understanding. Most existing work has focused on designing more discriminative feature representations and has achieved promising results when enough positive samples are available. However, very little effort has been spent on recognizing a novel action without any positive exemplars, which is often the case in real settings because of the large number of action classes and the dramatic variation in users' queries. To address this issue, we propose to perform action recognition when no positive exemplars of the class are provided, a setting often known as zero-shot learning. Unlike other zero-shot learning approaches, which exploit attributes as the intermediate layer for knowledge transfer, our main contribution is SIR, which directly leverages the semantic inter-class relationships between the known and unknown actions, followed by label transfer learning. The inter-class semantic relationships are automatically measured by continuous word vectors, which are learned by the skip-gram model from a large-scale text corpus. Extensive experiments on the UCF101 dataset validate the superiority of our method over fully-supervised approaches using few positive exemplars.
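
    The idea of transferring labels through inter-class similarity can be illustrated with a minimal sketch. This is not the authors' SIR implementation: the weighting rule (a cosine-similarity-weighted average of seen-class scores), the function names, and the toy 2-D "word vectors" are all assumptions made for illustration. In practice the vectors would come from a skip-gram model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transfer_scores(seen_scores, seen_vecs, unseen_vecs):
    """Score each unseen class as a similarity-weighted average of
    seen-class classifier scores (an assumed, simplified transfer rule)."""
    result = {}
    for name, uvec in unseen_vecs.items():
        # Negative similarities are clipped to zero so they cannot flip signs.
        weights = {s: max(cosine(uvec, svec), 0.0) for s, svec in seen_vecs.items()}
        total = sum(weights.values())
        if total:
            result[name] = sum(weights[s] * seen_scores[s] for s in seen_vecs) / total
        else:
            result[name] = 0.0
    return result


# Toy, made-up 2-D vectors standing in for skip-gram embeddings of class names.
seen_vecs = {"running": [1.0, 0.0], "swimming": [0.0, 1.0]}
unseen_vecs = {"jogging": [0.9, 0.1], "diving": [0.1, 0.9]}

# Hypothetical outputs of seen-class classifiers for one test video.
seen_scores = {"running": 0.8, "swimming": 0.2}

scores = transfer_scores(seen_scores, seen_vecs, unseen_vecs)
predicted = max(scores, key=scores.get)
print(predicted)
```

    The video scores highly for "running", and "jogging" lies close to "running" in the toy embedding space, so the unseen class "jogging" receives the higher transferred score.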

    Learning joint feature adaptation for zero-shot recognition

    Zero-shot recognition (ZSR) aims to recognize target-domain data instances of unseen classes based on models learned from associated pairs of seen-class source-domain and target-domain data. One of the key challenges in ZSR is the relative scarcity of source-domain features (e.g., one feature vector per class), which do not fully account for the wide variability in target-domain instances. In this paper, we propose a novel framework for learning data-dependent feature transforms that score the similarity between an arbitrary pair of source and target data instances, in order to account for the wide variability in the target domain. Our proposed approach is based on optimizing over a parameterized family of local feature displacements that maximize the source-target adaptive similarity functions. Accordingly, we propose formulating zero-shot learning (ZSL) using latent structural SVMs to learn our similarity functions from training data. As a demonstration, we design a specific algorithm under the proposed framework involving bilinear similarity functions and regularized least squares as penalties for feature displacement. We test our approach on several benchmark datasets for ZSR and show significant improvement over the state of the art. For instance, on the aP&Y dataset we achieve 80.89% recognition accuracy, outperforming the state of the art by 11.15%.

    See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data

    Zero-shot point cloud segmentation aims to make deep models capable of recognizing novel objects in point clouds that are unseen in the training phase. Recent trends favor a pipeline that transfers knowledge from seen classes with labels to unseen classes without labels. Such methods typically align visual features with semantic features obtained from word embeddings, supervised by the annotations of seen classes. However, point clouds contain too little information to fully match the semantic features. In fact, the rich appearance information of images is a natural complement to textureless point clouds, which has not been well explored in previous literature. Motivated by this, we propose a novel multi-modal zero-shot learning method that better utilizes the complementary information of point clouds and images for more accurate visual-semantic alignment. Extensive experiments are performed on two popular benchmarks, i.e., SemanticKITTI and nuScenes, and our method outperforms current SOTA methods with 52% and 49% improvement on average for unseen-class mIoU, respectively.
    Comment: Accepted by ICCV 202

    Transductive Zero-Shot Action Recognition by Word-Vector Embedding

    The number of categories for action recognition is growing rapidly, and it has become increasingly hard to label sufficient training data for learning conventional models for all categories. Instead of collecting ever more data and labelling them exhaustively for all categories, an attractive alternative approach is "zero-shot learning" (ZSL). To that end, in this study we construct a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data. Existing ZSL studies focus primarily on still images and attribute-based semantic representations. In this work, we explore word-vectors as the shared semantic space in which to embed videos and category labels for ZSL action recognition. This is a more challenging problem than existing ZSL of still images and/or attributes, because the mapping between video spacetime features of actions and the semantic space is more complex and harder to learn for the purpose of generalising over any cross-category domain shift. To solve this generalisation problem in ZSL action recognition, we investigate a series of synergistic strategies to improve upon the standard ZSL pipeline. Most of these strategies are transductive in nature, meaning that they assume access to testing data in the training phase.
    Comment: Accepted by IJC
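
    The mapping-based pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the linear map `W` is hypothetical and hard-coded (in a real system it would be learned on seen classes, for example by regularised regression), the class vectors are made-up 2-D stand-ins for word-vectors, and classification is plain nearest-neighbour search by cosine similarity with no transductive step.

```python
import math


def project(W, x):
    """Apply a linear map W (given as a list of rows) to a visual feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(W, x, class_vecs):
    """Embed x into the semantic space, then return the nearest class name."""
    z = project(W, x)
    return max(class_vecs, key=lambda c: cosine(z, class_vecs[c]))


# Hypothetical 3-D visual features mapped to a 2-D semantic space.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]

# Made-up embeddings for two unseen action categories.
class_vecs = {"surfing": [1.0, 0.2], "archery": [0.1, 1.0]}

label_a = classify(W, [0.9, 0.1, 0.2], class_vecs)
label_b = classify(W, [0.0, 1.0, 0.0], class_vecs)
print(label_a, label_b)
```

    Because the unseen categories enter only through their embeddings in `class_vecs`, new categories can be added without any visual training data for them, which is the property the abstract emphasises.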

    On the semantic information in zero-shot action recognition

    Advisor: Dr. David Menotti. Co-advisor: Dr. Hélio Pedrini. Thesis (doctorate) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 14/04/2023. Includes references: p. 117-132. Area of concentration: Computer Science.
    Abstract: The advances of the last decade in deep learning models, together with the high availability of examples on platforms such as YouTube, were responsible for notable progress on the problem of Human Action Recognition (HAR) in videos. These advances brought the challenge of adding new classes to existing models, since including them takes time and computational resources. In addition, new classes of actions are frequently created, either through the use of new objects or through new forms of interaction between humans. This scenario motivates the Zero-Shot Action Recognition (ZSAR) problem, defined as classifying instances belonging to classes not available during the model training phase. ZSAR methods aim to learn projection functions that relate video representations to the semantic representations of the labels of known classes. It is therefore a multi-modal representation problem. In this thesis, we investigate the semantic gap problem in ZSAR: the properties of the vector spaces of the video and label representations do not coincide, and the learned projection functions are often insufficient to correct the distortions. We argue that the semantic gap derives from what we call semantic lack, which occurs on both sides of the problem (i.e., videos and labels) and is not sufficiently investigated in the literature. We present three approaches to the problem, investigating different semantic information and representation strategies for videos and labels. We show that an efficient way to represent videos is to transform them into descriptive sentences using video captioning methods. This approach makes it possible to describe scenes, objects, and spatial and temporal interactions between humans, and we show that adopting it yields models that are highly effective compared with the literature. We also propose including descriptive information about the objects present in the scenes, obtained with methods trained for object recognition. We show that the representation of class labels gives better results when it uses sentences extracted from descriptive texts collected from the Internet. Using only texts, we employ deep neural network models pre-trained on the paraphrasing task to encode the information and perform ZSAR classification with a reduced semantic gap. Finally, we show how to condition the representation of a video's frames on its corresponding textual description, producing a model capable of representing both videos and texts in a joint vector space. The approaches presented in this thesis effectively reduce the semantic gap through contributions both in added information and in forms of encoding.