46 research outputs found

    Vision and language understanding with localized evidence

    Full text link
    Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning, and cross-modal retrieval. First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism that assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect arbitrary-length activities by coordinate regression with respect to anchors and contains a proposal stage that filters out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning than a class label. This gives rise to the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long videos and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires localizing the specified query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches that embed the whole sentence before performing late feature fusion. Furthermore, we use the query to guide the proposal network so that it generates query-related proposals. In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research.
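
    To make the question-guided spatial attention concrete, the following is a minimal sketch assuming generic CNN grid features and a pre-encoded question vector; it is not the thesis implementation, and the module name, layer sizes, and parameters are illustrative assumptions.

```python
# A minimal sketch (not the thesis code) of question-guided spatial attention:
# region features are weighted by their relevance to an encoded question, and
# the attention map itself is the spatial evidence that can be visualized.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, visual_dim, question_dim, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question embedding
        self.score = nn.Linear(hidden_dim, 1)              # scalar relevance per region

    def forward(self, regions, question):
        # regions: (batch, num_regions, visual_dim); question: (batch, question_dim)
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, num_regions)
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)    # weighted visual evidence
        return attended, weights  # weights can be reshaped to the image grid for visualization

# Example: a 14x14 grid of 2048-d CNN features and a 1024-d question vector (assumed sizes).
attn = QuestionGuidedAttention(visual_dim=2048, question_dim=1024)
attended, weights = attn(torch.randn(2, 196, 2048), torch.randn(2, 1024))
print(attended.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 196])
```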

    Deep Video Analytics of Humans: From Action Recognition to Forgery Detection

    Get PDF
    In this work, we explore a variety of techniques and applications for visual problems involving videos of humans in the contexts of activity detection, pose detection, and forgery detection. The first works discussed here address the issue of human activity detection in untrimmed video where the actions performed are spatially and temporally sparse. The video may therefore contain long sequences of frames where no actions occur, and the actions that do occur will often comprise only a very small percentage of the pixels on the screen. We address this with a two-stage architecture that first creates many coarse proposals with high recall, and then classifies and refines them to create temporally accurate activity proposals. We present two methods that follow this high-level paradigm: TRI-3D and CHUNK-3D. This work on activity detection is then extended to include results on few-shot learning. In this domain, a system must learn to perform detection given only an extremely limited set of training examples. We propose a method we call a Self-Denoising Neural Network (SDNN), which takes inspiration from Denoising Autoencoders, to solve this problem in the contexts of both activity detection and image classification. We also propose a method that performs optical character recognition on real-world images when no labels are available in the language we wish to transcribe. Specifically, we build an accurate transcription system for Hebrew street name signs when no labeled training data is available. To do this, we divide the problem into two components and address each separately: content, which refers to the characters and language structure, and style, which refers to the domain of the images (for example, real or synthetic). We train with simple synthetic Hebrew street signs to address the content component, and with labeled French street signs to address the style. We continue our analysis by proposing a method for automatic detection of facial forgeries in videos and images. This work approaches the problem of facial forgery detection by breaking the face into multiple regions and training separate classifiers for each part. The end result is a collection of high-quality facial forgery detectors that are both accurate and explainable. We exploit this explainability by providing extensive empirical analysis of our method's results. Next, we present work that focuses on multi-camera, multi-person 3D human pose estimation from video. To address this problem, we aggregate the outputs of a 2D human pose detector across cameras and actors using a novel factor graph formulation, which we optimize using the loopy belief propagation algorithm. In particular, our factor graph introduces a temporal smoothing term to create smooth transitions between poses across frames. Finally, our last proposed method covers activity detection, pose detection, and tracking in the game of Ping Pong, where we present a new dataset, dubbed SPIN, with extensive annotations. We introduce several tasks with this dataset, including predicting the future actions of players and tracking ball movements. To evaluate our performance on these tasks, we present a novel recurrent gated CNN architecture.
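
    The two-stage proposal-then-refine idea for temporally sparse activities can be sketched as below. The anchor parameterization (center offset plus log-length scale) and the actionness threshold are assumptions for illustration, not the TRI-3D or CHUNK-3D implementation.

```python
# A minimal numpy sketch of the two-stage paradigm: stage one scores many coarse temporal
# anchors for "action vs. background", stage two refines the surviving anchors with
# center/length offsets to produce temporally accurate proposals.
import numpy as np

def refine_proposals(anchors, actionness, offsets, keep_thresh=0.5):
    """anchors: (N, 2) [start, end] seconds; actionness: (N,); offsets: (N, 2) [d_center, d_log_len]."""
    keep = actionness >= keep_thresh                    # stage 1: drop likely background segments
    anchors, offsets = anchors[keep], offsets[keep]
    centers = anchors.mean(axis=1)
    lengths = anchors[:, 1] - anchors[:, 0]
    new_centers = centers + offsets[:, 0] * lengths     # stage 2: regression w.r.t. the anchors
    new_lengths = lengths * np.exp(offsets[:, 1])
    starts = np.clip(new_centers - new_lengths / 2, 0, None)
    ends = new_centers + new_lengths / 2
    return np.stack([starts, ends], axis=1), actionness[keep]

# Toy example with three anchors; in practice the scores and offsets come from a 3D CNN.
anchors = np.array([[0.0, 4.0], [10.0, 18.0], [30.0, 31.0]])
props, scores = refine_proposals(anchors,
                                 actionness=np.array([0.9, 0.2, 0.7]),
                                 offsets=np.array([[0.1, 0.2], [0.0, 0.0], [-0.2, 0.5]]))
print(props, scores)
```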

    Visual Recognition and Synthesis of Human-Object Interactions

    Full text link
    The ability to perceive and understand people's actions enables humans to efficiently communicate and collaborate in society. Endowing machines with such ability is an important step for building assistive and socially-aware robots. Despite such significance, the problem poses a great challenge and the current state of the art is still nowhere close to human-level performance. This dissertation drives progress on visual action understanding in the scope of human-object interactions (HOI), a major branch of human actions that dominates our everyday life. Specifically, we address the challenges of two important tasks: visual recognition and visual synthesis. The first part of this dissertation considers the recognition task. The main bottleneck of current research is the lack of a proper benchmark, since existing action datasets contain only a small number of categories with limited diversity. To this end, we set out to construct a large-scale benchmark for HOI recognition. We first tackle the problem of establishing the vocabulary for human-object interactions by investigating a variety of automatic approaches as well as a crowdsourcing approach that collects human-labeled categories. Given the vocabulary, we then construct a large-scale image dataset of human-object interactions by annotating web images through online crowdsourcing. The new "HICO" dataset surpasses prior datasets in both the number of images and the number of action categories by an order of magnitude. The introduction of HICO enables us to benchmark state-of-the-art recognition approaches and also sheds light on new challenges in the realm of large-scale HOI recognition. We further discover that the visual features of humans and objects, as well as their spatial relations, play a central role in the representation of interactions, and that the combination of the three can improve the recognition outcome. The second part of this dissertation considers the synthesis task, and focuses particularly on the synthesis of body motion. The central goal is: given an image of a scene, synthesize the course of an action conditioned on the observed scene. Such capability can predict possible actions afforded by the scene, and will facilitate efficient reactions in human-robot interactions. We investigate two types of synthesis tasks: semantic-driven synthesis and goal-driven synthesis. For semantic-driven synthesis, we study the forecasting of human dynamics from a static image. We propose a novel deep neural network architecture that extracts semantic information from the image and uses it to predict future body movement. For goal-driven synthesis, we study the synthesis of motion defined by human-object interactions. We focus on one particular class of interactions: a person sitting onto a chair. To ensure realistic motion from physical interactions, we leverage a physics-simulated environment that contains a humanoid and a chair model. We propose a novel reinforcement learning framework, and show that the synthesized motion can generalize to different initial human-chair configurations. At the end of this dissertation, we also contribute a new approach to temporal action localization, an essential task in video action understanding. We address the shortcomings of prior Faster R-CNN based approaches, and show state-of-the-art performance on standard benchmarks.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/150045/1/ywchao_1.pd
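
    The finding that human appearance, object appearance, and their spatial relation jointly improve HOI recognition suggests a multi-stream scorer with late fusion. The sketch below is an assumed illustration of that idea, not the benchmark model from the dissertation; the branch design, feature dimensions, and spatial encoding are placeholders.

```python
# A minimal multi-stream sketch: separate branches score the HOI category from human
# appearance, object appearance, and the human-object spatial relation, and their
# per-category scores are fused by summation (a simple late-fusion choice).
import torch
import torch.nn as nn

class HOIScorer(nn.Module):
    def __init__(self, feat_dim, num_hoi_classes, spatial_dim=8):
        super().__init__()
        self.human_branch = nn.Linear(feat_dim, num_hoi_classes)
        self.object_branch = nn.Linear(feat_dim, num_hoi_classes)
        # spatial relation encoded here as normalized box coordinates of the human-object pair
        self.spatial_branch = nn.Sequential(
            nn.Linear(spatial_dim, 64), nn.ReLU(), nn.Linear(64, num_hoi_classes))

    def forward(self, human_feat, object_feat, pair_boxes):
        # each cue contributes an additive score per HOI category
        return (self.human_branch(human_feat)
                + self.object_branch(object_feat)
                + self.spatial_branch(pair_boxes))

model = HOIScorer(feat_dim=2048, num_hoi_classes=600)  # HICO defines 600 HOI categories
scores = model(torch.randn(4, 2048), torch.randn(4, 2048), torch.randn(4, 8))
print(scores.shape)  # torch.Size([4, 600])
```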

    Towards Efficient Visual Analysis Without Extra Supervision

    Full text link
    Visual analysis has received increasing attention in the fields of computer vision and multimedia. With enough labeled training data, existing deep learning based methods can achieve promising performance. However, visual analysis faces a severe data-scarcity challenge: for some categories of interest, only very few, perhaps even no, positive examples are available, and performance drops dramatically when the number of positive samples falls short. In some real-world applications, people are also interested in recognizing concepts that do not appear in the training stage at all. Zero-shot learning and few-shot learning have been widely explored to tackle the problem of data scarcity. Although some promising results have been achieved, existing models still have some inherent limitations. 1) They lack the ability to simultaneously detect and recognize unseen objects by exploring only natural language descriptions. 2) They fail to consider that different concepts have different degrees of relevance to a certain category, and cannot mine these differences statistically for a more accurate event-concept association. 3) They remain very limited in their ability to deal with semantically unrepresentative event names, and lack coherence between visual and textual concepts. 4) They lack the ability to improve model performance by recycling the given limited annotation. To address these challenges, in this thesis we aim to develop a series of robust statistical learning models that improve the performance of visual analysis without extra supervision. In Chapter 2, we focus on how to simultaneously recognize and locate novel object instances using purely unstructured textual descriptions with no training samples. The goal is to concurrently link visual image features with the semantic label information, where the descriptions of novel concepts are presented in the form of natural language. In Chapter 3, we propose a new zero-shot event detection approach, which exploits the semantic correlation between an event and concepts. Our method learns the semantic correlation from the concept vocabulary and emphasizes the most related concepts. In Chapter 4, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection and Multimedia Event Captioning in the zero-shot setting. In Chapter 5, we present a novel improved temporal action localization model that is better able to take advantage of the limited labeled data available.
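
    The event-concept association idea from Chapter 3 can be sketched, under assumptions, as weighting pre-trained concept detector responses by the semantic similarity between the event name and each concept while keeping only the most related concepts. The function below is a hedged illustration with made-up embeddings and scores; it is not the thesis model.

```python
# A minimal sketch of zero-shot event scoring: an unseen event is scored by a weighted
# vote of concept detectors, where the weights come from the semantic similarity between
# the event embedding and each concept embedding (only the top-k concepts are kept).
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def zero_shot_event_score(concept_scores, concept_embeddings, event_embedding, top_k=3):
    sims = np.array([cosine(e, event_embedding) for e in concept_embeddings])
    sims = np.maximum(sims, 0.0)                 # ignore negatively correlated concepts
    top = np.argsort(sims)[-top_k:]              # emphasize only the most related concepts
    weights = sims[top] / (sims[top].sum() + 1e-8)
    return float(weights @ concept_scores[top])  # weighted vote of the relevant detectors

rng = np.random.default_rng(0)
concept_embeddings = rng.normal(size=(50, 300))  # e.g. word vectors of 50 vocabulary concepts
event_embedding = rng.normal(size=300)           # embedding of the unseen event name
concept_scores = rng.uniform(size=50)            # per-video responses of the concept detectors
print(zero_shot_event_score(concept_scores, concept_embeddings, event_embedding))
```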

    On the semantic information in zero-shot action recognition

    Get PDF
    Advisor: Dr. David Menotti. Co-advisor: Dr. Hélio Pedrini. Doctoral thesis - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 14/04/2023. Includes references: p. 117-132. Area of concentration: Computer Science. Abstract: The advancements of the last decade in deep learning models and the high availability of examples on platforms such as YouTube were responsible for notable progress in the problem of Human Action Recognition (HAR) in videos. These advancements brought the challenge of adding new classes to existing models, since including them takes time and computational resources.
    In addition, new classes of actions are frequently created, either by using new objects or new forms of interaction between humans. This scenario motivates the Zero-Shot Action Recognition (ZSAR) problem, defined as classifying instances belonging to classes not available in the model training phase. ZSAR methods aim to learn projection functions associating video representations with the semantic label representations of known classes. It is therefore a multi-modal representation problem. In this thesis, we investigate the semantic gap problem in ZSAR: the vector spaces of the video representations and of the label representations do not share the same properties, and the learned projection functions are often insufficient to correct the distortions. We argue that the semantic gap derives from what we call semantic lack, which occurs on both sides of the problem (i.e., videos and labels) and is not sufficiently investigated in the literature. We present three approaches to the problem, investigating different semantic information and representation strategies for videos and labels. We show that an efficient way to represent videos is to transform them into descriptive sentences using video captioning methods. This approach makes it possible to describe scenes, objects, and spatial and temporal interactions between humans, and we show that it produces high-performance models compared to the literature. We also propose including descriptive information about the objects present in the scenes using models trained for object recognition. We show that representing class labels with sentences extracted from descriptive texts collected on the Internet yields better results. Using only texts, we employ deep neural network models pre-trained on the paraphrasing task to encode the information and perform ZSAR classification with a reduced semantic gap. Finally, we show how conditioning the representation of video frames on their corresponding textual description produces a model capable of representing both videos and texts in a joint vector space. The approaches presented in this thesis achieve an effective reduction of the semantic gap through contributions both in added information and in encoding strategies.
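
    The caption-then-match step of this text-only ZSAR setup can be sketched as follows: a generated video caption and each class label description are embedded by the same sentence encoder and compared by cosine similarity. `encode_sentence` below is a hypothetical stand-in for the paraphrase-pretrained encoder described in the thesis, and the label descriptions are illustrative.

```python
# A minimal sketch of zero-shot classification by sentence matching: the video is
# represented by its generated caption, each class by a descriptive sentence, and
# the class whose description is closest in embedding space is chosen.
import numpy as np

def encode_sentence(text, dim=384):
    # stand-in embedding; a real system would use a paraphrase-pretrained sentence model
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

def classify_zero_shot(video_caption, label_descriptions):
    cap = encode_sentence(video_caption)
    best_label, best_sim = None, -np.inf
    for label, description in label_descriptions.items():
        emb = encode_sentence(description)
        sim = cap @ emb / (np.linalg.norm(cap) * np.linalg.norm(emb))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim

labels = {
    "archery": "a person draws a bow and shoots an arrow at a target",
    "baking": "a person mixes ingredients and puts a tray into an oven",
}
print(classify_zero_shot("someone aims a bow and releases an arrow", labels))
```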

    Semantic and spatio-temporal understanding for computer vision driven worker safety inspection and risk analysis

    Get PDF
    Despite decades of efforts, we are still far from eliminating construction safety risks. Recently, computer vision techniques have been applied for construction safety management on real-world residential and commercial projects; they have shown the potential to fundamentally change safety management practices and safety performance measurement. The most significant breakthroughs of this field have been achieved in the areas of safety practice observations, incident and safety performance forecasting, and vision-based construction risk assessment. However, fundamental theoretical and technical challenges have yet to be addressed in order to achieve the full potential of construction site images and videos for construction safety. This dissertation explores methods for automated semantic and spatio-temporal visual understanding of workers and equipment and how to use them to improve automatic safety inspections and risk analysis: (1) A new method is developed to improve the breadth and depth of vision-based safety compliance checking by explicitly classifying worker-tool interactions. A detection model is trained on a newly constructed image dataset for construction sites, achieving 52.9% mean average precision for 10 object categories and 89.4% average precision for detecting workers. Using this detector and the new dataset, the proposed human-object interaction recognition model achieves 79.78% precision and 77.64% recall for hard hat checking, and 79.11% precision and 75.29% recall for safety vest checking. The new model also verifies hand protection for workers when tools are being used, with 66.2% precision and 64.86% recall. The proposed model is superior to methods that rely on hand-crafted rules to recognize interactions or that reason directly on the outputs of object detectors. (2) To support systems that proactively prevent such accidents, this thesis presents a path prediction model for workers and equipment. The model leverages the extracted video frames to predict upcoming worker and equipment motion trajectories on construction sites. Specifically, the model takes 2D tracks of workers and equipment from visual data (obtained with computer vision methods for detection and tracking) and uses a Long Short-Term Memory (LSTM) encoder-decoder followed by a Mixture Density Network (MDN) to predict their locations. A multi-head prediction module is introduced to predict locations at different future times. The method is validated on the existing TrajNet dataset and on a new dataset of 105 high-definition videos recorded over 30 days from a real-world construction site. On the TrajNet dataset, the proposed model significantly outperforms Social LSTM. On the new dataset, the presented model outperforms conventional time-series models and achieves average localization errors of 7.30, 12.71, and 24.22 pixels for 10, 20, and 40 future steps, respectively. (3) A new construction worker safety analysis method is introduced that evaluates worker-level risk from site photos and videos. This method evaluates worker state, which is based on workers' body pose, their protective equipment use, their interactions with tools and materials, the construction activity being performed, and hazards in the workplace. To estimate worker state, a vision-based Object-Activity-Keypoint (OAK) recognition model is proposed that takes 36.6% less time and 40.1% less memory while keeping comparable performance to a system running individual models for each sub-task. Worker activity recognition is further improved with a spatio-temporal graph model using recognized per-frame worker activity, detected bounding boxes of tools and materials, and estimated worker poses. Finally, severity levels are predicted by a classifier trained on a dataset of images of construction workers accompanied by ground truth severity level annotations. On the test dataset, the severity level prediction model achieves 85.7% cross-validation accuracy on a bricklaying task and 86.6% cross-validation accuracy on a plastering task.
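
    The trajectory prediction component (LSTM encoder-decoder with a mixture density output) can be sketched as below. This is a simplified autoregressive variant with assumed shapes and a plain per-axis Gaussian mixture head; it is not the multi-head prediction module or the exact parameterization used in the dissertation.

```python
# A minimal PyTorch sketch: an LSTM encodes a worker's observed 2D pixel track, an LSTM
# decoder rolls forward, and a mixture density head outputs a Gaussian mixture over each
# future (x, y) location.
import torch
import torch.nn as nn

class TrajectoryMDN(nn.Module):
    def __init__(self, hidden=64, mixtures=5):
        super().__init__()
        self.mixtures = mixtures
        self.encoder = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.decoder = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        # per mixture component: weight, mean_x, mean_y, std_x, std_y
        self.head = nn.Linear(hidden, mixtures * 5)

    def forward(self, observed, future_steps):
        # observed: (batch, obs_len, 2) pixel coordinates
        _, state = self.encoder(observed)
        last = observed[:, -1:, :]                        # seed the decoder with the last point
        params = []
        for _ in range(future_steps):
            out, state = self.decoder(last, state)
            p = self.head(out[:, -1])                     # (batch, mixtures * 5)
            pi = torch.softmax(p[:, :self.mixtures], dim=-1)
            mu = p[:, self.mixtures:3 * self.mixtures].view(-1, self.mixtures, 2)
            sigma = torch.exp(p[:, 3 * self.mixtures:]).view(-1, self.mixtures, 2)
            params.append((pi, mu, sigma))
            last = mu[:, :1, :]  # roll forward with one component's mean (a simple decoding choice)
        return params

model = TrajectoryMDN()
params = model(torch.randn(4, 10, 2), future_steps=3)
print(len(params), params[0][0].shape)  # 3 future steps, mixture weights of shape (4, 5)
```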