A Review on Deep Learning Techniques for Video Prediction
The ability to predict, anticipate, and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction has emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it has demonstrated the potential to extract meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review of deep learning methods for prediction in video sequences. We first define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied by experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper concludes by drawing some general conclusions, identifying open research challenges, and pointing out future research directions. This work has been funded by the Spanish Government PID2019-104818RB-I00 grant for the MoDeaAS project, supported with FEDER funds. This work has also been supported by two Spanish national grants for PhD studies, FPU17/00166 and ACIF/2018/197, respectively.
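To make the task formulation concrete, here is a minimal sketch of video prediction as self-supervision: a model is trained to predict the next frame from a few context frames, with the ground-truth future frame serving as a free label. The tiny ConvNet and tensor shapes below are illustrative assumptions, not any specific architecture from the review.

```python
# Minimal sketch of self-supervised next-frame prediction.
# The ConvNet is a toy stand-in, not a model from the review.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, context_frames: int = 4, channels: int = 3):
        super().__init__()
        in_ch = context_frames * channels
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, context):               # context: (B, k, C, H, W)
        b, k, c, h, w = context.shape
        return self.net(context.reshape(b, k * c, h, w))

video = torch.rand(8, 5, 3, 64, 64)           # (batch, time, C, H, W)
model = NextFramePredictor()
pred = model(video[:, :4])                    # predict frame 5 from frames 1-4
loss = nn.functional.mse_loss(pred, video[:, 4])  # the future frame is the label
loss.backward()
```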
Meta-Learning Dynamics Forecasting Using Task Inference
Current deep learning models for dynamics forecasting struggle with
generalization. They can only forecast in a specific domain and fail when
applied to systems with different parameters, external forces, or boundary
conditions. We propose a model-based meta-learning method called DyAd which can
generalize across heterogeneous domains by partitioning them into different
tasks. DyAd has two parts: an encoder which infers the time-invariant hidden
features of the task with weak supervision, and a forecaster which learns the
shared dynamics of the entire domain. The encoder adapts and controls the
forecaster during inference using adaptive instance normalization and adaptive
padding. Theoretically, we prove that the generalization error of such a
procedure is related to the task relatedness in the source domain, as well as
to the domain differences between source and target. Experimentally, we
demonstrate that our model outperforms state-of-the-art approaches on both
turbulent flow and real-world ocean data forecasting tasks.
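As a hedged illustration of the conditioning mechanism described above, the sketch below shows adaptive instance normalization (AdaIN): a task code z inferred by the encoder rescales and shifts the forecaster's normalized activations. Layer sizes and shapes are assumptions for illustration, not DyAd's actual configuration.

```python
# Sketch of AdaIN conditioning: the encoder's time-invariant task code z
# controls the forecaster by modulating its normalized feature maps.
import torch
import torch.nn as nn

class AdaIN2d(nn.Module):
    def __init__(self, num_features: int, z_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.to_scale_shift = nn.Linear(z_dim, 2 * num_features)

    def forward(self, x, z):                  # x: (B, C, H, W), z: (B, z_dim)
        scale, shift = self.to_scale_shift(z).chunk(2, dim=1)
        x = self.norm(x)
        return x * (1 + scale[..., None, None]) + shift[..., None, None]

z = torch.randn(4, 16)                        # task code from the encoder
layer = AdaIN2d(num_features=32, z_dim=16)
features = torch.randn(4, 32, 64, 64)         # forecaster activations
out = layer(features, z)                      # task-conditioned activations
```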
Neural Video Recovery for Cloud Gaming
Cloud gaming is a multi-billion dollar industry. A client in cloud gaming
sends its movement to the game server on the Internet, which renders and
transmits the resulting video back. In order to provide a good gaming
experience, a latency below 80 ms is required. This means that video rendering,
encoding, transmission, decoding, and display have to finish within that time
frame, which is especially challenging to achieve due to server overload,
network congestion, and losses. In this paper, we propose a new method for
recovering lost or corrupted video frames in cloud gaming. Unlike traditional
video frame recovery, our approach uses game states to significantly enhance
recovery accuracy and utilizes partially decoded frames to recover lost
portions. We develop a holistic system that consists of (i) efficiently
extracting game states, (ii) modifying H.264 video decoder to generate a mask
to indicate which portions of video frames need recovery, and (iii) designing a
novel neural network to recover either complete or partial video frames. Our
approach is extensively evaluated using iPhone 12 and laptop implementations,
and we demonstrate the utility of game states in game video recovery and
the effectiveness of our overall design.
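A minimal sketch of the mask-guided recovery idea, assuming a partially decoded frame, the decoder-produced mask, and an auxiliary rendering derived from game states as inputs; the small inpainting CNN and its channel layout are illustrative assumptions, not the paper's network.

```python
# Illustrative mask-guided frame recovery: stack the partial frame, the
# corruption mask, and a game-state rendering, then inpaint with a CNN.
import torch
import torch.nn as nn

class FrameRecoveryNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 (partial frame) + 1 (mask) + 3 (game-state render) input channels
        self.net = nn.Sequential(
            nn.Conv2d(7, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, partial, mask, state_render):
        x = torch.cat([partial, mask, state_render], dim=1)
        recovered = self.net(x)
        # keep correctly decoded pixels, fill only the masked regions
        return partial * (1 - mask) + recovered * mask

net = FrameRecoveryNet()
partial = torch.rand(1, 3, 128, 128)          # frame with lost macroblocks
mask = (torch.rand(1, 1, 128, 128) > 0.8).float()  # 1 = needs recovery
render = torch.rand(1, 3, 128, 128)           # cheap render from game state
frame = net(partial, mask, render)
```

Compositing through the mask keeps correctly decoded pixels untouched, so the network only has to fill the lost regions.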
When Deep Learning Meets Data Alignment: A Review on Deep Registration Networks (DRNs)
Registration is the process that computes the transformation that aligns sets
of data. Commonly, a registration process can be divided into four main steps:
target selection, feature extraction, feature matching, and transform
computation for the alignment. The accuracy of the result depends on multiple
factors, the most significant of which are the quantity of input data; the
presence of noise, outliers, and occlusions; the quality of the extracted
features; real-time requirements; and the type of transformation, especially
transformations defined by many parameters, such as non-rigid deformations.
Recent advancements in machine learning could be a turning point in these
issues, particularly with the development of deep learning (DL) techniques,
which are helping to improve multiple computer vision problems through an
abstract understanding of the input data. In this paper, a review of deep
learning-based registration methods is presented. We classify the papers using
a framework derived from the traditional registration pipeline in order to
analyse the strengths of the new learning-based proposals. Deep Registration
Networks (DRNs) attempt to solve the alignment task either by replacing part of
the traditional pipeline with a network or by solving the registration problem
end to end. The main conclusions are that 1) learning-based registration
techniques cannot always be clearly mapped onto the traditional pipeline;
2) these approaches accept more complex inputs, such as conceptual models, in
addition to traditional 3D datasets; 3) despite the generality of learning, the
current proposals are still ad hoc solutions; and 4) this is a young topic that
still requires substantial effort to reach general solutions able to cope with
the problems that affect traditional approaches.
Comment: Submitted to Pattern Recognition.
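To ground the pipeline vocabulary, the toy example below covers the last two stages, feature matching and transform computation, for the rigid case: descriptors are matched by nearest neighbour and the transform is solved in closed form with the Kabsch algorithm. The random descriptors stand in for a DRN's learned features; this is a classical sketch, not a method from the survey.

```python
# Feature matching + rigid transform estimation, the classical way.
import numpy as np

def match_features(desc_src, desc_tgt):
    """Nearest-neighbour matching in descriptor space."""
    d = np.linalg.norm(desc_src[:, None] - desc_tgt[None, :], axis=-1)
    return d.argmin(axis=1)                  # tgt index for each src point

def kabsch(src, tgt):
    """Closed-form R, t minimizing ||R @ src_i + t - tgt_i||."""
    mu_s, mu_t = src.mean(0), tgt.mean(0)
    H = (src - mu_s).T @ (tgt - mu_t)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_t - R @ mu_s

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
desc = rng.normal(size=(100, 32))            # stand-in learned descriptors
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
perm = rng.permutation(100)
tgt = (src @ R_true.T + [0.5, -0.2, 1.0])[perm]  # moved, shuffled copy
idx = match_features(desc, desc[perm])       # recover the correspondences
R, t = kabsch(src, tgt[idx])                 # recovers R_true and the shift
```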
Precision and power grip detection in egocentric hand-object interaction using machine learning
This project was carried out in Yverdon-les-Bains, Switzerland, as a collaboration between the University of Applied Sciences and Arts Western Switzerland (HEIG-VD / HES-SO) and the Centre Hospitalier Universitaire Vaudois (CHUV) in Lausanne. It focuses on the detection of grasp types from an egocentric point of view. The objective is to accurately determine the kind of grasp (power, precision, or none) performed by a user based on images captured from their perspective. The successful implementation of this grasp detection system would greatly benefit the evaluation of patients undergoing upper limb rehabilitation. Various computer vision frameworks were utilized to detect hands, interacting objects, and depth information in the images. These extracted features were then fed into deep learning models for grasp prediction. Both custom recorded datasets and open-source datasets, such as EpicKitchen and the Yale dataset, were employed for training and evaluation. In conclusion, this project achieved satisfactory results in the detection of grasp types from an egocentric viewpoint, with a 0.76 F1-macro score on the final test set. The utilization of diverse videos, including custom recordings and publicly available datasets, facilitated comprehensive training and evaluation. A robust pipeline was developed through iterative refinement, enabling the extraction of crucial features from each frame to predict grasp types accurately. Furthermore, data mixtures were proposed to enhance dataset size and improve the generalization performance of the models, which played a crucial role in the project's final stages.
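A hedged sketch of the final classification stage described above: per-frame feature vectors (standing in for the hand/object/depth features extracted by the vision frameworks) feed a classifier over the three grasp classes, evaluated with the macro-averaged F1 score the project reports. The feature layout and the random-forest choice are assumptions for illustration.

```python
# Per-frame grasp classification over {none, power, precision},
# scored with macro-F1. Features here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))              # per-frame feature vectors
y = rng.integers(0, 3, size=1000)            # 0 = none, 1 = power, 2 = precision

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:800], y[:800])
pred = clf.predict(X[800:])
print(f1_score(y[800:], pred, average="macro"))  # macro-F1, as reported
```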
Artificial intelligence approaches for materials-by-design of energetic materials: state-of-the-art, challenges, and future directions
Artificial intelligence (AI) is rapidly emerging as an enabling tool for
solving various complex materials design problems. This paper aims to review
recent advances in AI-driven materials-by-design and their applications to
energetic materials (EM). Trained with data from numerical simulations and/or
physical experiments, AI models can assimilate trends and patterns within the
design parameter space, identify optimal material designs (micro-morphologies,
combinations of materials in composites, etc.), and point to designs with
superior/targeted property and performance metrics. We review approaches
focusing on such capabilities with respect to the three main stages of
materials-by-design, namely representation learning of microstructure
morphology (i.e., shape descriptors), structure-property-performance (S-P-P)
linkage estimation, and optimization/design exploration. We provide a
perspective view of these methods in terms of their potential, practicality,
and efficacy towards the realization of materials-by-design. Specifically,
methods in the literature are evaluated in terms of their capacity to learn
from a small/limited number of data, computational complexity,
generalizability/scalability to other material species and operating
conditions, interpretability of the model predictions, and the burden of
supervision/data annotation. Finally, we suggest a few promising future
research directions for EM materials-by-design, such as meta-learning, active
learning, Bayesian learning, and semi-/weakly-supervised learning, to bridge
the gap between machine learning research and EM research.
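As a toy illustration of the S-P-P linkage stage, the sketch below regresses a performance metric on low-dimensional microstructure shape descriptors with a Gaussian process; its predictive uncertainty is also the quantity that the active/Bayesian learning directions above would exploit. Descriptors and targets are synthetic placeholders, not EM data.

```python
# Structure-property linkage as small-data regression with uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
descriptors = rng.uniform(size=(30, 4))      # e.g. void size, aspect ratio...
performance = descriptors @ np.array([0.5, -1.2, 0.8, 0.3]) \
              + 0.05 * rng.normal(size=30)   # synthetic target property

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-3)
gp.fit(descriptors, performance)
mean, std = gp.predict(rng.uniform(size=(5, 4)), return_std=True)
# 'std' can drive active learning: query the simulator where it is largest
```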
Video Transformers: A Survey
Transformer models have shown great success handling long-range interactions,
making them a promising tool for modeling video. However, they lack inductive
biases and scale quadratically with input length. These limitations are further
exacerbated by the high dimensionality introduced by the
temporal dimension. While there are surveys analyzing the advances of
Transformers for vision, none focus on an in-depth analysis of video-specific
designs. In this survey we analyze main contributions and trends of works
leveraging Transformers to model video. Specifically, we first delve into how
videos are handled at the input level. Then, we study the architectural changes
made to deal with video more efficiently, reduce redundancy, re-introduce useful
inductive biases, and capture long-term temporal dynamics. In addition, we
provide an overview of different training regimes and explore effective
self-supervised learning strategies for video. Finally, we conduct a
performance comparison on the most common benchmark for Video Transformers
(i.e., action classification), finding them to outperform 3D ConvNets even with
lower computational complexity.
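To illustrate the input handling and the scaling issue the survey discusses, the sketch below embeds a clip into spatio-temporal "tubelet" tokens and runs self-attention over them; the attention cost grows quadratically with the token count. Patch sizes and dimensions are illustrative, not any surveyed model's settings.

```python
# Tubelet embedding: cut the video into spatio-temporal patches and
# linearly embed them into tokens before (quadratic-cost) self-attention.
import torch
import torch.nn as nn

video = torch.rand(2, 3, 16, 224, 224)       # (B, C, T, H, W)
tubelet = nn.Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))
tokens = tubelet(video).flatten(2).transpose(1, 2)  # (B, N, 768)
print(tokens.shape)                          # N = 8 * 14 * 14 = 1568 tokens

attn = nn.MultiheadAttention(768, num_heads=12, batch_first=True)
out, _ = attn(tokens, tokens, tokens)        # cost is O(N^2) in the tokens
```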