Dilated Temporal Relational Adversarial Network for Generic Video Summarization
The large number of videos appearing every day makes it more and more
critical that key information within videos can be extracted and understood in
a very short time. Video summarization, the task of finding the smallest subset
of frames that still conveys the whole story of a given video, is thus of
great significance for improving the efficiency of video understanding. We
propose a novel Dilated Temporal Relational Generative Adversarial Network
(DTR-GAN) to achieve frame-level video summarization. Given a video, it selects
the set of key frames that contain the most meaningful and compact information.
Specifically, DTR-GAN learns a dilated temporal relational generator and a
discriminator with three-player loss in an adversarial manner. A new dilated
temporal relation (DTR) unit is introduced to enhance temporal representation
capturing. The generator uses this unit to effectively exploit global
multi-scale temporal context to select key frames and to complement the
commonly used Bi-LSTM. To ensure that summaries capture enough key video
representation from a global perspective, rather than being a trivial, randomly
shortened sequence, we present a discriminator that learns to enforce both the
information completeness and compactness of summaries via a three-player loss.
The loss includes the generated summary loss, the random summary loss, and the
real summary (ground-truth) loss, which together help regularize the learned
model to obtain useful summaries. Comprehensive experiments on three public
datasets show the effectiveness of the proposed approach.
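The abstract names the three terms of the discriminator loss but not their exact form. The following is a minimal sketch that assumes a binary cross-entropy formulation in which the discriminator should score real summaries near 1 and both generated and random summaries near 0. The function names and the equal weighting of the terms are illustrative assumptions, not the paper's definition.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def three_player_discriminator_loss(d_real, d_generated, d_random):
    """Sum of the three terms named in the abstract (assumed BCE form).

    d_real, d_generated, d_random are hypothetical discriminator scores for
    the ground-truth, generated, and randomly shortened summaries.
    """
    real_loss = bce(d_real, 1.0)            # real summary should score high
    generated_loss = bce(d_generated, 0.0)  # generated summary should score low
    random_loss = bce(d_random, 0.0)        # random summary should score low
    return real_loss + generated_loss + random_loss


# A discriminator that separates the three inputs well incurs a lower loss
# than one that outputs 0.5 for everything.
confident = three_player_discriminator_loss(0.99, 0.01, 0.01)
undecided = three_player_discriminator_loss(0.5, 0.5, 0.5)
```

Under this assumed form, the random-summary term penalizes a discriminator that accepts any shortened sequence, which is one way to read the abstract's requirement that summaries be more than a trivial truncation.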
Contextual RNN-GANs for Abstract Reasoning Diagram Generation
Understanding object motions and transformations is a core problem in computer science. Modeling sequences of evolving images may provide better representations and models of motion and may ultimately be used for forecasting or simulation. Diagrammatic Abstract Reasoning is an avenue in which diagrams evolve in complex patterns and one needs to infer the underlying pattern sequence and generate the next image in the sequence. For this, we develop a novel Contextual Generative Adversarial Network based on Recurrent Neural Networks (Context-RNN-GANs), where both the generator and the discriminator modules are based on contextual history and the adversarial discriminator guides the generator to produce realistic images for the particular time step in the image sequence. We apply the Context-RNN-GAN model (and its variants) to a novel dataset of Diagrammatic Abstract Reasoning and perform initial evaluations on a next-frame prediction task for videos. Empirically, we show that our Context-RNN-GAN model performs competitively with 10th-grade human performance, but there is still scope for interesting improvements as compared to college-grade human performance.
Deep Learning System for Activity Recognition with Motion Capture Sensors (Sistema de Aprendizaje Profundo para reconocimiento de actividades con sensores de captura de movimientos)
In this Master's thesis, a system for human activity recognition (HAR) is developed using artificial neural networks and inertial sensors. The system can distinguish between 11 activities from data that indicate the body's orientation (quaternions) coming from only 5 sensors. The tests were carried out on a public dataset known as REALDISP, which is widely used for HAR problems. The generation of motion data with neural networks is also addressed, as a complement to the HAR problem.
In addition, the thesis reviews the current state of physical activity recognition using Artificial Intelligence and, more specifically, Deep Learning, as well as the mathematical and theoretical foundations on which the design of neural networks is based, in order to justify the design decisions made. Finally, the neural networks designed are described and the results obtained are presented.
Departamento de Teoría de la Señal y Comunicaciones e Ingeniería Telemática
Máster en Ingeniería de Telecomunicació
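To make the input described above concrete: 5 sensors, each reporting a 4-component quaternion, give 20 orientation values per time step. The sketch below shows one way such data might be prepared for a neural network. The normalization step, the sliding-window segmentation, and all names are illustrative assumptions, since the abstract does not describe the preprocessing used in the thesis.

```python
import math

NUM_SENSORS = 5  # from the abstract
QUAT_DIM = 4     # a quaternion has four components (w, x, y, z)


def normalize_quaternion(q):
    """Scale a quaternion to unit length so it represents a pure rotation."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("a zero quaternion does not encode an orientation")
    return [c / norm for c in q]


def frame_features(sensor_quats):
    """Concatenate the unit quaternions of all sensors into one feature vector."""
    if len(sensor_quats) != NUM_SENSORS:
        raise ValueError("expected one quaternion per sensor")
    features = []
    for q in sensor_quats:
        features.extend(normalize_quaternion(q))
    return features


def sliding_windows(frames, size, step):
    """Split a recording into fixed-length, possibly overlapping windows."""
    return [frames[i:i + size] for i in range(0, len(frames) - size + 1, step)]


# One time step with all sensors at the identity orientation.
example = frame_features([[1.0, 0.0, 0.0, 0.0]] * NUM_SENSORS)
```

Each window produced this way would be one classification example with a label from the 11 activities; the window length and stride are free parameters here.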
Robust and Efficient Classification of Videos in the Wild
Recognizing human actions in videos is a long-standing problem in computer vision with a wide range of applications including video surveillance, content retrieval, and sports analysis. This thesis focuses on addressing efficiency and robustness of video classification in unconstrained real-world settings. The thesis work can be broadly divided into four major parts.
First, we address view-invariant action recognition. This problem is formulated within the multi-task learning framework, where the action model of each viewpoint is specified as a separate task and all tasks are trained jointly.
Second, we address large-scale action recognition in uncontrolled settings. For robustness, we augment the standard training video dataset with additional data from another modality, namely 3D skeleton sequences of human body motion. A recurrent neural network called long short-term memory (LSTM) is used to encode sequences from the 3D skeleton data. To learn another LSTM for video classification, we use a modified hybrid backpropagation through time algorithm.
Third, we address unsupervised video summarization. We formulate the problem as subset frame selection and specify a novel deep generative network to compute a video summary with the smallest representation error.
Fourth, we introduce the new problem of budget-aware semantic segmentation of videos. In this line of work, we consider two models. The first model uses a conditional random field (CRF) model and replaces the standard inference steps for feature computation with a sequential policy which intelligently selects a subset of regions and their corresponding features. The second model is a deep recurrent policy which is learned to select a subset of frames and uses a shallow convolutional neural network (CNN) to propagate the available segmentation to unlabeled frames.
This research advances the state of the art in computer vision: the approaches developed make it possible to meet the stringent runtime requirements that arise in many applications and to work in less sanitized settings.
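The third part of the thesis frames summarization as choosing a subset of frames with the smallest representation error. The sketch below illustrates that objective only. It swaps the thesis's deep generative network for a simple greedy search, and it assumes that "representation error" means the summed squared distance from each frame's feature vector to its nearest selected frame. All names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def representation_error(features, selected):
    """Total error when each frame is represented by its nearest selected frame."""
    return sum(
        min(squared_distance(f, features[s]) for s in selected)
        for f in features
    )


def greedy_summary(features, budget):
    """Greedily add the frame that most reduces the representation error."""
    selected = []
    for _ in range(min(budget, len(features))):
        candidates = [i for i in range(len(features)) if i not in selected]
        best = min(
            candidates,
            key=lambda i: representation_error(features, selected + [i]),
        )
        selected.append(best)
    return sorted(selected)


# Toy video with two visually distinct scenes (1-D features for readability).
frames = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [5.3]]
summary = greedy_summary(frames, budget=2)
print(summary)  # [1, 3]
```

With a budget of two, the greedy search picks one frame from each scene, which is the behavior the subset-selection formulation is meant to encourage. Greedy selection is not guaranteed to find the optimal subset in general.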