
    Large-scale video analysis and understanding

    University of Technology Sydney. Faculty of Engineering and Information Technology.
    Video understanding is a complex task in computer vision: it requires not only recognizing objects, persons, and scenes, but also capturing and remembering how the visual content changes over time. Rapid progress on building blocks such as image classification in recent years provides great opportunities for accurate and efficient video understanding. Building on deep convolutional neural networks and recurrent neural networks, a variety of deep learning approaches to video understanding have been studied. In this thesis, I present my research on large-scale video analysis and understanding in three major aspects: video representation learning, recognition with limited examples, and vision and language. Representations and features are the most important ingredient of vision tasks, since they are general and can serve classification, detection, and structured-prediction tasks such as vision and language. We begin with video classification from multimodal features, i.e., hand-crafted features from different streams such as vision and audio. For representation learning, we investigate aggregation methods that generate a video representation from frame features, demonstrating significant improvements over classical pooling methods. In addition, we propose a hierarchical recurrent neural network that learns the hierarchical structure of video. Going beyond supervised learning, we develop a sequence model that learns by reconstructing future and past features from the current sequence, showing that unlabeled videos can help learn good and generalizable video representations. We then explore recognition with limited examples, which tackles situations where we cannot obtain enough data to train the model; the encouraging results show that it is feasible to achieve good performance with only a few examples of the target class. Beyond the video classification task, which only outputs labels for a video, we also seek richer interaction between machine and human about visual content via natural language. We consider two major forms of vision-and-language tasks: video captioning, i.e., automatically generating a caption that describes a given video sequence, and video question answering, i.e., answering questions about the presented video sequence. Finally, I conclude the thesis with future directions for video understanding.
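
    A minimal sketch (PyTorch, assumed here since the abstract names no framework) of the kind of hierarchical recurrent aggregation described above: a frame-level RNN summarizes short chunks of frame features, and a chunk-level RNN summarizes the chunk sequence into a single video representation. All names and hyperparameters are illustrative, not the thesis code.

        # Hierarchical recurrent aggregation of frame features into a video vector.
        import torch
        import torch.nn as nn

        class HierarchicalVideoEncoder(nn.Module):
            def __init__(self, feat_dim=2048, hidden_dim=512, chunk_len=16):
                super().__init__()
                self.chunk_len = chunk_len
                self.frame_rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
                self.chunk_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

            def forward(self, frames):                 # frames: (B, T, feat_dim)
                B, T, D = frames.shape
                T = T - T % self.chunk_len             # drop the ragged tail
                chunks = frames[:, :T].reshape(B * (T // self.chunk_len),
                                               self.chunk_len, D)
                _, h = self.frame_rnn(chunks)          # summarize each chunk of frames
                chunk_feats = h[-1].reshape(B, T // self.chunk_len, -1)
                _, h = self.chunk_rnn(chunk_feats)     # summarize the chunk sequence
                return h[-1]                           # (B, hidden_dim) video vector

        video_vec = HierarchicalVideoEncoder()(torch.randn(2, 64, 2048))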

    Ti-MAE: Self-Supervised Masked Time Series Autoencoders

    Multivariate time series forecasting has become an increasingly popular topic across applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance on many long-term series forecasting tasks. However, several issues remain in existing methods. First, the training paradigm of contrastive learning is inconsistent with the downstream prediction task, leading to inaccurate predictions. Second, existing Transformer-based models, which resort to similar patterns in historical time series data to predict future values, generally induce severe distribution-shift problems and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrated distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct it at the point level. Ti-MAE adopts mask modeling (rather than contrastive learning) as the auxiliary task and bridges existing representation learning and generative Transformer-based methods, reducing the gap between upstream and downstream forecasting tasks while maintaining the utilization of the original time series data. Experiments on several public real-world datasets demonstrate that our masked-autoencoding framework learns strong representations directly from the raw data, yielding better performance in time series forecasting and classification tasks. Comment: 20 pages, 7 figures.
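
    A minimal sketch (PyTorch) of the masked-autoencoding objective described above: randomly mask time-series points, encode the visible context, and reconstruct at the point level. The zero-masking, mask ratio, and linear encoder/decoder stand-ins are assumptions for illustration; the paper uses Transformer blocks over the visible tokens.

        # Point-level masked reconstruction loss for multivariate time series.
        import torch
        import torch.nn as nn

        def masked_reconstruction_loss(x, encoder, decoder, mask_ratio=0.75):
            """x: (B, T, C) series; loss is MSE on the masked points only."""
            B, T, C = x.shape
            mask = torch.rand(B, T, 1, device=x.device) < mask_ratio  # True = hidden
            z = encoder(x.masked_fill(mask, 0.0))   # encode the visible context
            x_hat = decoder(z)                      # reconstruct every point
            return ((x_hat - x) ** 2 * mask).sum() / (mask.sum() * C).clamp(min=1)

        # Toy usage with linear stand-ins for the Transformer encoder/decoder.
        enc, dec = nn.Linear(8, 64), nn.Linear(64, 8)
        loss = masked_reconstruction_loss(torch.randn(4, 96, 8), enc, dec)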

    Efficient Offline Policy Optimization with a Learned Model

    MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, and thus incurs substantial computing time. This paper investigates several hypotheses about where MuZero Unplugged may not work well under offline RL settings: 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) learning with improperly parameterized models given the offline data; and 4) learning with a low compute budget. We propose a regularized one-step look-ahead approach to tackle these issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimate based on a one-step rollout. Policy improvement is directed toward maximizing the estimated advantage, with regularization toward the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it uses only 5.6% of the wall-clock time (i.e., 1 hour) of MuZero Unplugged (i.e., 17.8 hours) to achieve a 150% IQM normalized score with the same hardware and software stacks. Our implementation is open-sourced at https://github.com/sail-sg/rosmo. Comment: ICLR 2023.
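
    A minimal sketch (PyTorch) of the regularized one-step look-ahead idea described above: advantages are estimated from a single learned-model rollout per action rather than MCTS, and the policy moves toward high-advantage actions while a behavior term keeps it close to the logged actions. The function names and the exact form of the regularizer are assumptions for illustration, not the open-sourced ROSMO code.

        # One-step, advantage-weighted policy improvement with dataset regularization.
        import torch
        import torch.nn.functional as F

        def one_step_policy_loss(policy_logits, advantages, data_actions, alpha=0.1):
            """policy_logits, advantages: (B, A); data_actions: (B,) logged actions.
            advantages[b, a] would come from one learned-model rollout:
            r_hat(s, a) + gamma * V(s_hat_next) - V(s)."""
            target = F.softmax(advantages, dim=-1).detach()   # improvement target
            log_pi = F.log_softmax(policy_logits, dim=-1)
            improve = -(target * log_pi).sum(-1).mean()       # move toward target
            behavior = F.cross_entropy(policy_logits, data_actions)  # stay near data
            return improve + alpha * behavior

        B, A = 32, 6   # toy usage with random advantages and logged actions
        loss = one_step_policy_loss(torch.randn(B, A), torch.randn(B, A),
                                    torch.randint(0, A, (B,)))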

    DaXBench: Benchmarking Deformable Object Manipulation with Differentiable Physics

    Deformable Object Manipulation (DOM) is of significant importance to both daily and industrial applications. Recent successes with differentiable physics simulators allow learning algorithms to train a policy with analytic gradients through the environment dynamics, which significantly facilitates the development of DOM algorithms. However, existing DOM benchmarks are either single-object-based or non-differentiable. This leaves open the questions of 1) how a task-specific algorithm performs on other tasks and 2) how a differentiable-physics-based algorithm compares with non-differentiable ones in general. In this work, we present DaXBench, a differentiable DOM benchmark with wide object and task coverage. DaXBench includes 9 challenging high-fidelity simulated tasks, covering rope, cloth, and liquid manipulation at various difficulty levels. To better understand the performance of general algorithms on different DOM tasks, we conduct comprehensive experiments over representative DOM methods, ranging from planning to imitation learning and reinforcement learning. In addition, we provide careful empirical studies of existing decision-making algorithms based on differentiable physics and discuss their limitations as well as potential future directions. Comment: ICLR 2023 (Oral).
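
    A minimal sketch (PyTorch, with toy dynamics standing in for DaXBench's simulator) of what training "with analytic gradients through environment dynamics" means: because every simulation step is differentiable, the task loss on the final state can be backpropagated through the whole rollout into the policy parameters.

        # Gradient-based policy optimization through a differentiable rollout.
        import torch

        def step(state, action):
            return state + 0.1 * action            # toy differentiable dynamics

        policy = torch.nn.Linear(2, 2)
        opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
        state0, goal = torch.zeros(1, 2), torch.tensor([[1.0, -1.0]])

        for _ in range(200):
            state = state0
            for _ in range(20):                    # differentiable rollout
                state = step(state, policy(state))
            loss = ((state - goal) ** 2).sum()     # task loss on the final state
            opt.zero_grad()
            loss.backward()                        # gradient flows through all steps
            opt.step()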

    Strategies for Searching Video Content with Text Queries or Video Examples

    The large number of user-generated videos uploaded to the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos, so these videos are unsearchable by current search engines. Content-based video retrieval (CBVR) tackles this metadata-scarcity problem by directly analyzing the visual and audio streams of each video. CBVR encompasses multiple research topics, including low-level feature design, feature fusion, semantic detector training, and video search/reranking. We present novel strategies in these topics that enhance both the accuracy and speed of CBVR under different query inputs, including pure textual queries and queries by video example. Our proposed strategies were incorporated into our submission for the TRECVID 2014 Multimedia Event Detection evaluation, where our system outperformed other submissions on both text queries and video-example queries, demonstrating the effectiveness of our proposed approaches.
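
    A minimal sketch (Python/NumPy) of late score fusion, one of the CBVR ingredients listed above: each modality's detector scores every video, scores are normalized, and a weighted sum produces the final ranking. The min-max normalization and the weights are illustrative assumptions, not the TRECVID 2014 system.

        # Weighted late fusion of per-modality detector scores, then ranking.
        import numpy as np

        def fuse_and_rank(score_lists, weights):
            """score_lists: list of (N,) raw detector scores over N videos."""
            fused = np.zeros_like(score_lists[0], dtype=float)
            for scores, w in zip(score_lists, weights):
                s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
                fused += w * s                   # min-max normalize, then weight
            return np.argsort(-fused)            # video indices, best first

        visual = np.random.rand(100)             # e.g. visual detector scores
        audio = np.random.rand(100)              # e.g. audio detector scores
        ranking = fuse_and_rank([visual, audio], weights=[0.7, 0.3])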