
    Action Tubelet Detector for Spatio-Temporal Action Localization

    Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level that are then linked or tracked across time. In this paper, we leverage the temporal continuity of videos instead of operating at the frame level. We propose the ACtion Tubelet detector (ACT-detector), which takes a sequence of frames as input and outputs tubelets, i.e., sequences of bounding boxes with associated scores. Just as state-of-the-art object detectors rely on anchor boxes, our ACT-detector is based on anchor cuboids. We build upon the SSD framework: convolutional features are extracted for each frame, while scores and regressions are based on the temporal stacking of these features, thus exploiting information from the whole sequence. Our experimental results show that leveraging sequences of frames significantly improves detection performance over using individual frames. The gain of our tubelet detector stems from both more accurate scores and more precise localization. Our ACT-detector outperforms state-of-the-art methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in particular at high overlap thresholds.
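The temporal-stacking idea in the abstract can be sketched in a few lines: per-frame convolutional features are concatenated along the channel axis, and linear heads then score each anchor cuboid and regress one box per frame. This is an illustrative toy with random weights and made-up sizes (K, C, A are arbitrary), not the actual ACT-detector or SSD heads.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6            # frames per tubelet
C = 8            # per-frame feature channels (toy value)
H, W = 4, 4      # spatial size of the feature map
A = 3            # anchor cuboids per spatial location

# Per-frame convolutional features, one map per frame.
frame_feats = [rng.normal(size=(C, H, W)) for _ in range(K)]

# Temporal stacking: concatenate the K maps along the channel axis,
# so the prediction heads see the whole sequence at once.
stacked = np.concatenate(frame_feats, axis=0)           # (K*C, H, W)
feats_flat = stacked.reshape(K * C, H * W)              # (K*C, H*W)

# Toy linear (1x1-conv-like) heads: one confidence score per anchor
# cuboid, and 4 box offsets per frame per anchor, regressed jointly.
w_cls = rng.normal(size=(A, K * C)) * 0.1
w_reg = rng.normal(size=(A * K * 4, K * C)) * 0.1

scores = w_cls @ feats_flat                             # (A, H*W)
offsets = (w_reg @ feats_flat).reshape(A, K, 4, H * W)  # (A, K, 4, H*W)

# Each location now carries A tubelet hypotheses: one score and a
# sequence of K regressed boxes each.
print(scores.shape, offsets.shape)
```

In the real detector these heads are learned convolutions and the offsets are decoded against the anchor-cuboid geometry; only the tensor bookkeeping is shown here.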

    On the semantic information in zero-shot action recognition

    Advisor: Dr. David Menotti. Co-advisor: Dr. Hélio Pedrini. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 14/04/2023. References: p. 117-132. Concentration area: Computer Science.
    Abstract: The advancements of the last decade in deep learning models, together with the high availability of examples on platforms such as YouTube, have driven notable progress on the problem of Human Action Recognition (HAR) in videos. These advances brought the challenge of adding new classes to existing models, since including them demands time and computational resources. In addition, new action classes are frequently created, either through new objects or new forms of interaction between humans. This scenario motivates the Zero-Shot Action Recognition (ZSAR) problem, defined as classifying instances belonging to classes unavailable during the model training phase.
ZSAR methods aim to learn projection functions that associate video representations with the semantic representations of known class labels. It is therefore a multi-modal representation problem. In this thesis, we investigate the semantic gap problem in ZSAR: the vector spaces of the video and label representations do not coincide, and the learned projection functions are often insufficient to correct the distortions. We argue that the semantic gap derives from what we call semantic lack, which occurs on both sides of the problem (i.e., videos and labels) and is not sufficiently investigated in the literature. We present three approaches to the problem, investigating different semantic information and representation strategies for videos and labels. We show that an efficient way to represent videos is to transform them into descriptive sentences using video captioning methods. This approach describes scenes, objects, and the spatial and temporal interactions between humans, and yields high-performance models compared to the literature. We also propose including descriptive information about the objects present in the scenes, obtained with methods trained for object recognition. We show that representing class labels with sentences extracted from descriptive texts collected on the Internet gives better results. Using only texts, we employ deep neural network models pre-trained on the paraphrasing task to encode the information and perform ZSAR classification with a reduced semantic gap. Finally, we show how conditioning the representation of video frames on their corresponding textual description produces a model capable of representing both videos and texts in a joint vector space. The approaches presented in this thesis achieve an effective reduction of the semantic gap through contributions both in the information added and in the way it is encoded.
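The core ZSAR recipe described above, i.e. represent the video by a generated caption, represent each unseen label by a descriptive sentence, and classify by nearest neighbour in a shared embedding space, can be sketched as follows. The `embed` function is a toy hashed bag-of-words stand-in for the paraphrase-pretrained encoder, and the caption and class descriptions are hypothetical examples.

```python
import zlib
import numpy as np

def embed(sentence: str, dim: int = 64) -> np.ndarray:
    """Toy sentence embedding: hashed bag-of-words, L2-normalised.
    Stands in for the paraphrase-pretrained deep encoder."""
    v = np.zeros(dim)
    for tok in sentence.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Descriptive sentences for unseen class labels, e.g. collected from
# texts on the Internet (hypothetical examples).
class_descriptions = {
    "archery": "a person draws a bow and shoots an arrow at a target",
    "basketball": "players dribble a ball and shoot it through a hoop",
}

# A caption generated from the test video by a captioning model (stand-in).
video_caption = "a man aims a bow and releases an arrow at a target"

v = embed(video_caption)
similarity = {c: float(v @ embed(d)) for c, d in class_descriptions.items()}
prediction = max(similarity, key=similarity.get)
print(prediction)  # the class whose description is nearest in the shared space
```

No examples of the predicted class were needed at training time; only its textual description is used, which is what makes the classification zero-shot.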

    Spatio-temporal human action detection and instance segmentation in videos

    With an exponential growth in the number of video capturing devices and digital video content, automatic video understanding is now at the forefront of computer vision research. This thesis presents a series of models for automatic human action detection in videos and also addresses the space-time action instance segmentation problem. Both action detection and instance segmentation play vital roles in video understanding. Firstly, we propose a novel human action detection approach based on a frame-level deep feature representation combined with a two-pass dynamic programming approach. The method obtains a frame-level action representation by leveraging recent advances in deep learning based action recognition and object detection methods. To combine the complementary appearance and motion cues, we introduce a new fusion technique which significantly improves the detection performance. Further, we cast temporal action detection as two energy optimisation problems which are solved using the Viterbi algorithm. Exploiting a video-level representation further allows the network to learn the inter-frame temporal correspondence between action regions, and it is bound to be a more optimal solution to the action detection problem than a frame-level representation. Secondly, we propose a novel deep network architecture which learns a video-level action representation by classifying and regressing 3D region proposals spanning two successive video frames. The proposed model is end-to-end trainable and can be jointly optimised for both proposal generation and action detection objectives in a single training step. We name our new network "AMTnet" (Action Micro-Tube regression Network). We further extend the AMTnet model by incorporating optical flow features to encode motion patterns of actions. Finally, we address the problem of action instance segmentation, in which multiple concurrent actions of the same class may be segmented out of an image sequence.
By taking advantage of recent work on action foreground-background segmentation, we are able to associate each action tube with class-specific segmentations. We demonstrate the performance of our proposed models on challenging action detection benchmarks, achieving new state-of-the-art results across the board and significantly increasing detection speed at test time.
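The Viterbi-based temporal optimisation mentioned above can be illustrated with a minimal dynamic program: given per-frame action confidences, find the binary background/action labelling that maximises the summed rewards minus a fixed cost per label switch. This is a simplified one-pass sketch of the idea, not the two-pass energy formulation used in the thesis; the scores and switch cost are made up.

```python
import numpy as np

def viterbi_temporal(scores, switch_cost=2.0):
    """Binary temporal labelling (0 = background, 1 = action) that
    maximises summed per-frame rewards minus a penalty for every
    label change, solved with the Viterbi algorithm."""
    scores = np.asarray(scores, dtype=float)
    T = len(scores)
    # emission[t, s]: reward for state s at frame t; background is
    # rewarded wherever the action confidence is negative.
    emission = np.stack([-scores, scores], axis=1)
    best = emission[0].copy()
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new = np.empty(2)
        for s in (0, 1):
            stay = best[s]                      # keep the same label
            flip = best[1 - s] - switch_cost    # pay to change label
            if stay >= flip:
                new[s], back[t, s] = stay + emission[t, s], s
            else:
                new[s], back[t, s] = flip + emission[t, s], 1 - s
        best = new
    # Backtrack from the best final state to recover the labelling.
    path = [int(np.argmax(best))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Per-frame action confidences: positive inside the action instance.
frame_scores = [-1, -1, 3, 4, 3, -1, -1]
labels = viterbi_temporal(frame_scores)
print(labels)  # -> [0, 0, 1, 1, 1, 0, 0]
```

The switch cost discourages fragmented detections, so the recovered labelling is a single contiguous action segment rather than isolated high-score frames.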

    Data-driven quantitative photoacoustic tomography

    Spatial information about the 3D distribution of blood oxygen saturation (sO2) in vivo is of clinical interest as it encodes important physiological information about tissue health/pathology. Photoacoustic tomography (PAT) is a biomedical imaging modality that, in principle, can be used to acquire this information. Images are formed by illuminating the sample with a laser pulse where, after multiple scattering events, the optical energy is absorbed. A subsequent rise in temperature induces an increase in pressure (the photoacoustic initial pressure p0) that propagates to the sample surface as an acoustic wave. These acoustic waves are detected as pressure time series by sensor arrays and used to reconstruct images of the sample's p0 distribution. This encodes information about the sample's absorption distribution, and can be used to estimate sO2. However, an ill-posed nonlinear inverse problem stands in the way of acquiring estimates in vivo. Current approaches to solving this problem fall short of being widely and successfully applied to in vivo tissues due to their reliance on simplifying assumptions about the tissue, prior knowledge of its optical properties, or the formulation of a forward model accurately describing image acquisition with a specific imaging system. Here, we investigate the use of data-driven approaches (deep convolutional networks) to solve this problem. Networks only require a dataset of examples to learn a mapping from PAT data to images of the sO2 distribution. We show the results of training a 3D convolutional network to estimate the 3D sO2 distribution within model tissues from 3D multiwavelength simulated images. However, acquiring a realistic training set to enable successful in vivo application is non-trivial given the challenges associated with estimating ground truth sO2 distributions and the current limitations of simulating training data.
We suggest and test several methods to 1) acquire more realistic training data or 2) improve network performance in the absence of adequate quantities of realistic training data. For 1), we describe how training data may be acquired from an organ perfusion system and outline a possible design. Separately, we describe how training data may be generated synthetically using a variant of generative adversarial networks called ambientGANs. For 2), we show how the accuracy of networks trained with limited training data can be improved with self-training. We also demonstrate how the domain gap between training and test sets can be minimised with unsupervised domain adaptation to improve quantification accuracy. Overall, this thesis clarifies the advantages of data-driven approaches, and suggests concrete steps towards overcoming the challenges of in vivo application.
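A toy calculation makes the ill-posedness mentioned above concrete: if the per-wavelength light fluence were known (here set to 1), sO2 would follow from linear unmixing of oxy- and deoxy-haemoglobin absorption at two wavelengths; an unknown wavelength-dependent fluence biases the naive estimate, which is the confound the data-driven approaches aim to learn around. The coefficient matrix below is illustrative, not tabulated haemoglobin spectra.

```python
import numpy as np

# Illustrative (not tabulated) absorption coefficients of oxy- and
# deoxy-haemoglobin at two wavelengths; rows = wavelengths.
E = np.array([[2.0, 7.0],    # wavelength 1: [HbO2, Hb]
              [6.0, 4.0]])   # wavelength 2: [HbO2, Hb]

true_sO2 = 0.8
conc = np.array([true_sO2, 1 - true_sO2])   # [HbO2, Hb], total = 1

# Ideal case: measured p0 is proportional to absorption (fluence = 1),
# so linear unmixing recovers the concentrations exactly.
p0 = E @ conc
c_hat = np.linalg.solve(E, p0)
sO2_ideal = c_hat[0] / c_hat.sum()

# Realistic case: an unknown wavelength-dependent fluence scales p0,
# which biases the naive linear unmixing estimate of sO2.
fluence = np.array([1.0, 0.7])
p0_biased = fluence * (E @ conc)
c_bad = np.linalg.solve(E, p0_biased)
sO2_biased = c_bad[0] / c_bad.sum()

print(round(sO2_ideal, 3), round(sO2_biased, 3))  # -> 0.8 0.603
```

The 20-percentage-point error comes entirely from the unmodelled fluence, illustrating why fluence-blind unmixing fails in vivo and why the thesis pursues learned mappings instead.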