    A Review on Deep Learning Techniques for Video Prediction

    The ability to predict, anticipate, and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction has emerged as a promising research direction. Defined as a self-supervised learning task, video prediction is a suitable framework for representation learning, as it has demonstrated the capacity to extract meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review of deep learning methods for prediction in video sequences. We first define the fundamentals of video prediction, the necessary background concepts, and the most commonly used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and significance in the field. The summary of the datasets and methods is accompanied by experimental results that allow the state of the art to be assessed on a quantitative basis. The paper closes by drawing general conclusions, identifying open research challenges, and pointing out future research directions. This work has been funded by the Spanish Government grant PID2019-104818RB-I00 for the MoDeaAS project, supported with FEDER funds, and by two Spanish national grants for PhD studies, FPU17/00166 and ACIF/2018/197.
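
    To make the self-supervised framing concrete, the sketch below trains a toy convolutional predictor to regress the next frame from a few context frames with an MSE loss; the architecture, layer sizes, and tensor shapes are illustrative assumptions, not a model taken from the review.

    import torch
    import torch.nn as nn

    class NextFramePredictor(nn.Module):
        """Toy predictor: maps k stacked context frames to the next frame."""
        def __init__(self, context_frames=4, channels=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(context_frames * channels, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, channels, 3, padding=1),
            )

        def forward(self, frames):                      # frames: (B, k, C, H, W)
            b, k, c, h, w = frames.shape
            return self.net(frames.reshape(b, k * c, h, w))

    model = NextFramePredictor()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    clip = torch.rand(8, 5, 3, 64, 64)                  # dummy clip: 4 context frames + 1 target
    pred = model(clip[:, :4])                            # predict frame 5 from frames 1-4
    loss = nn.functional.mse_loss(pred, clip[:, 4])      # the video itself provides the target
    loss.backward()
    optim.step()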

    Meta-Learning Dynamics Forecasting Using Task Inference

    Current deep learning models for dynamics forecasting struggle with generalization. They can only forecast within a specific domain and fail when applied to systems with different parameters, external forces, or boundary conditions. We propose a model-based meta-learning method called DyAd which can generalize across heterogeneous domains by partitioning them into different tasks. DyAd has two parts: an encoder, which infers the time-invariant hidden features of the task with weak supervision, and a forecaster, which learns the shared dynamics of the entire domain. The encoder adapts and controls the forecaster during inference using adaptive instance normalization and adaptive padding. Theoretically, we prove that the generalization error of such a procedure is related to the task relatedness in the source domain as well as the domain differences between source and target. Experimentally, we demonstrate that our model outperforms state-of-the-art approaches on both turbulent-flow and real-world ocean-data forecasting tasks.
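
    The adaptive instance normalization used to condition the forecaster can be sketched as follows: a task code z, which DyAd's encoder would infer from observed history, sets per-channel scale and shift parameters for the forecaster's feature maps. This is a minimal illustration of the mechanism, not the authors' implementation; the module name, dimensions, and random inputs are assumptions.

    import torch
    import torch.nn as nn

    class AdaIN(nn.Module):
        """Adaptive instance normalization: the task code z controls per-channel scale/shift."""
        def __init__(self, z_dim, channels):
            super().__init__()
            self.norm = nn.InstanceNorm2d(channels, affine=False)
            self.to_scale_shift = nn.Linear(z_dim, 2 * channels)

        def forward(self, x, z):                         # x: (B, C, H, W), z: (B, z_dim)
            scale, shift = self.to_scale_shift(z).chunk(2, dim=1)
            return self.norm(x) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

    x = torch.randn(2, 32, 16, 16)                       # a forecaster feature map
    z = torch.randn(2, 8)                                 # time-invariant task code (random stand-in)
    print(AdaIN(z_dim=8, channels=32)(x, z).shape)        # torch.Size([2, 32, 16, 16])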

    Neural Video Recovery for Cloud Gaming

    Cloud gaming is a multi-billion dollar industry. A client in cloud gaming sends its movements to the game server on the Internet, which renders the game and transmits the resulting video back. To provide a good gaming experience, a latency below 80 ms is required, which means that video rendering, encoding, transmission, decoding, and display have to finish within that time frame; this is especially challenging to achieve due to server overload, network congestion, and losses. In this paper, we propose a new method for recovering lost or corrupted video frames in cloud gaming. Unlike traditional video frame recovery, our approach uses game states to significantly enhance recovery accuracy and utilizes partially decoded frames to recover lost portions. We develop a holistic system that consists of (i) efficiently extracting game states, (ii) modifying the H.264 video decoder to generate a mask indicating which portions of a video frame need recovery, and (iii) designing a novel neural network to recover either complete or partial video frames. Our approach is extensively evaluated using iPhone 12 and laptop implementations, and we demonstrate the utility of game states in game video recovery and the effectiveness of our overall design.
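
    A minimal sketch of the recovery step described in (iii): a small network takes the partially decoded frame, the loss mask produced by the modified decoder, and auxiliary game-state channels, and fills in only the masked regions. The layer sizes, number of game-state channels, and names below are illustrative assumptions rather than the paper's actual architecture.

    import torch
    import torch.nn as nn

    class MaskedFrameRecovery(nn.Module):
        """Toy recovery net: inpaints lost regions of a partially decoded frame,
        conditioned on auxiliary channels standing in for game-state cues."""
        def __init__(self, aux_channels=2):
            super().__init__()
            in_ch = 3 + 1 + aux_channels                 # partial frame + loss mask + game-state cues
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1),
            )

        def forward(self, partial, mask, aux):
            pred = self.net(torch.cat([partial, mask, aux], dim=1))
            # Keep correctly decoded pixels; replace only the regions the decoder marked as lost.
            return partial * (1 - mask) + pred * mask

    partial = torch.rand(1, 3, 72, 128)                   # partially decoded frame
    mask = (torch.rand(1, 1, 72, 128) > 0.8).float()      # 1 where macroblocks were lost
    aux = torch.rand(1, 2, 72, 128)                        # stand-in for rendered game-state features
    recovered = MaskedFrameRecovery()(partial, mask, aux)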

    When Deep Learning Meets Data Alignment: A Review on Deep Registration Networks (DRNs)

    Registration is the process of computing the transformation that aligns sets of data. Commonly, a registration process can be divided into four main steps: target selection, feature extraction, feature matching, and transform computation for the alignment. The accuracy of the result depends on multiple factors, the most significant of which are the quantity of input data; the presence of noise, outliers, and occlusions; the quality of the extracted features; real-time requirements; and the type of transformation, especially those defined by many parameters, such as non-rigid deformations. Recent advances in machine learning could be a turning point on these issues, particularly the development of deep learning (DL) techniques, which are helping to improve multiple computer vision problems through an abstract understanding of the input data. In this paper, a review of deep-learning-based registration methods is presented. We classify the papers using a framework extracted from the traditional registration pipeline in order to analyze the strengths of the new learning-based proposals. Deep Registration Networks (DRNs) try to solve the alignment task either by replacing part of the traditional pipeline with a network or by solving the registration problem end to end. The main conclusions are: 1) learning-based registration techniques cannot always be clearly mapped onto the traditional pipeline; 2) these approaches allow more complex inputs, such as conceptual models, in addition to traditional 3D datasets; 3) despite the generality of learning, the current proposals are still ad hoc solutions; and 4) this is a young topic that still requires a large effort to reach general solutions able to cope with the problems that affect traditional approaches. Comment: Submitted to Pattern Recognition.
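
    For reference, the traditional pipeline that the review uses as its frame of analysis can be reduced to a few lines for the rigid case: match features (here, raw nearest-neighbour coordinates) and compute the aligning transform in closed form. This is a textbook ICP-style iteration written for illustration, not a method from the surveyed papers.

    import numpy as np
    from scipy.spatial import cKDTree

    def rigid_align(source, target):
        """One classical-pipeline iteration: nearest-neighbour matching followed by
        a closed-form (Kabsch/SVD) estimate of the rigid transform."""
        idx = cKDTree(target).query(source)[1]            # feature matching
        matched = target[idx]
        src_c, tgt_c = source.mean(0), matched.mean(0)    # transform computation
        H = (source - src_c).T @ (matched - tgt_c)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = tgt_c - R @ src_c
        return R, t

    source = np.random.rand(100, 3)
    target = source + np.array([0.01, 0.0, -0.02])        # small known translation
    R, t = rigid_align(source, target)
    print(np.abs(source @ R.T + t - target).mean())       # close to 0 if matches were correct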

    Precision and power grip detection in egocentric hand-object Interaction using machine learning

    This project was carried out in Yverdon-les-Bains, Switzerland, as a collaboration between the University of Applied Sciences and Arts Western Switzerland (HEIG-VD / HES-SO) and the Centre Hospitalier Universitaire Vaudois (CHUV) in Lausanne. It focuses on the detection of grasp types from an egocentric point of view. The objective is to accurately determine the kind of grasp (power, precision, or none) performed by a user based on images captured from their perspective. The successful implementation of this grasp detection system would greatly benefit the evaluation of patients undergoing upper limb rehabilitation. Various computer vision frameworks were used to detect hands, interacting objects, and depth information in the images. These extracted features were then fed into deep learning models for grasp prediction. Both custom recorded datasets and open-source datasets, such as EpicKitchen and the Yale dataset, were employed for training and evaluation. In conclusion, this project achieved satisfactory results in the detection of grasp types from an egocentric viewpoint, with a 0.76 macro-F1 score on the final test set. The use of diverse videos, including custom recordings and publicly available datasets, facilitated comprehensive training and evaluation. A robust pipeline was developed through iterative refinement, enabling the extraction of crucial features from each frame to predict grasp types accurately. Furthermore, data mixtures were proposed to increase dataset size and improve the generalization performance of the models, which played a crucial role in the project's final stages.
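
    The reported metric can be reproduced with standard tooling: macro-F1 averages the per-class F1 scores over the three grasp classes, so each class counts equally regardless of how often it occurs. The labels below are made-up illustrative predictions, not project data.

    from sklearn.metrics import classification_report, f1_score

    labels = ["power", "precision", "none"]
    y_true = ["power", "power", "precision", "none", "none", "precision", "power", "none"]
    y_pred = ["power", "precision", "precision", "none", "power", "precision", "power", "none"]

    # Macro averaging: compute F1 per class, then take the unweighted mean.
    print(f1_score(y_true, y_pred, labels=labels, average="macro"))
    print(classification_report(y_true, y_pred, labels=labels))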

    Artificial intelligence approaches for materials-by-design of energetic materials: state-of-the-art, challenges, and future directions

    Artificial intelligence (AI) is rapidly emerging as an enabling tool for solving various complex materials design problems. This paper reviews recent advances in AI-driven materials-by-design and their applications to energetic materials (EM). Trained with data from numerical simulations and/or physical experiments, AI models can assimilate trends and patterns within the design parameter space, identify optimal material designs (micro-morphologies, combinations of materials in composites, etc.), and point to designs with superior or targeted property and performance metrics. We review approaches focusing on such capabilities with respect to the three main stages of materials-by-design, namely representation learning of microstructure morphology (i.e., shape descriptors), structure-property-performance (S-P-P) linkage estimation, and optimization/design exploration. We provide a perspective on these methods in terms of their potential, practicality, and efficacy towards the realization of materials-by-design. Specifically, methods in the literature are evaluated in terms of their capacity to learn from a small or limited amount of data, their computational complexity, their generalizability/scalability to other material species and operating conditions, the interpretability of their predictions, and the burden of supervision or data annotation. Finally, we suggest a few promising future research directions for EM materials-by-design, such as meta-learning, active learning, Bayesian learning, and semi-/weakly-supervised learning, to bridge the gap between machine learning research and EM research.
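
    One design-exploration pattern in this space, surrogate modelling combined with active learning or Bayesian optimization, can be sketched as a loop that fits a Gaussian-process surrogate of the structure-property link and queries the candidate design with the highest upper confidence bound. The property function, descriptor dimensionality, and hyperparameters below are toy assumptions, not taken from the paper.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def property_metric(x):
        """Stand-in for an expensive simulation mapping a design descriptor to performance."""
        return float(-np.sum((x - 0.3) ** 2))

    rng = np.random.default_rng(0)
    X = rng.random((5, 2))                                 # initial designs (2-D shape descriptors)
    y = np.array([property_metric(x) for x in X])
    candidates = rng.random((500, 2))

    for _ in range(10):                                    # design-exploration loop
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X, y)
        mean, std = gp.predict(candidates, return_std=True)
        pick = candidates[np.argmax(mean + 1.0 * std)]     # upper-confidence-bound acquisition
        X = np.vstack([X, pick])
        y = np.append(y, property_metric(pick))

    print(X[np.argmax(y)], y.max())                        # best design found so far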

    Video Transformers: A Survey

    Transformer models have shown great success in handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated by the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we first delve into how videos are handled at the input level. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with lower computational complexity.
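
    The quadratic-cost point is easy to see by tokenizing a clip: a tubelet embedding turns a 16-frame 224x224 clip into on the order of 1,500 tokens, and full self-attention then builds an N x N interaction matrix over them. The embedding dimension and patch sizes below are typical values chosen for illustration, not figures from the survey.

    import torch
    import torch.nn as nn

    clip = torch.rand(1, 3, 16, 224, 224)                  # (B, C, T, H, W)
    # Tubelet embedding: non-overlapping 2x16x16 patches, each projected to a 768-d token.
    embed = nn.Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))
    tokens = embed(clip).flatten(2).transpose(1, 2)         # (B, N, 768) with N = 8*14*14 = 1568
    n = tokens.shape[1]
    print(n, n * n)                                          # 1568 tokens -> ~2.5M attention entries

    # Full (quadratic) self-attention over all spatio-temporal tokens.
    attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
    out, _ = attn(tokens, tokens, tokens)
    print(out.shape)                                         # torch.Size([1, 1568, 768])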