
    Embedding Visual Hierarchy with Deep Networks for Large-Scale Visual Recognition

    In this paper, a level-wise mixture model (LMM) is developed by embedding a visual hierarchy into deep networks to support large-scale visual recognition (i.e., recognizing thousands or even tens of thousands of object classes). A Bayesian approach is used to adapt a pre-trained visual hierarchy automatically to improvements of the deep features (used for image and object-class representation) as more representative deep networks are learned over time. Our LMM model provides an end-to-end approach for jointly learning: (a) the deep networks that extract more discriminative deep features for image and object-class representation; (b) the tree classifier that recognizes large numbers of object classes hierarchically; and (c) the visual hierarchy adaptation that achieves more accurate hierarchical indexing of large numbers of object classes. By supporting joint learning of the tree classifier, the deep networks, and the visual hierarchy adaptation, our LMM algorithm provides an effective means of controlling inter-level error propagation and thus achieves better accuracy on large-scale visual recognition. Experiments are carried out on the ImageNet1K and ImageNet10K image sets, and our LMM algorithm achieves very competitive results in both accuracy and computational efficiency compared with the baseline methods.
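
    A minimal sketch of the level-wise mixture idea described above, assuming a fixed two-level hierarchy (coarse groups over fine classes) on top of a generic feature extractor; class probabilities are mixed as p(class|x) = sum_g p(g|x) p(class|x,g). The layer sizes, grouping, and loss are illustrative assumptions, and the Bayesian hierarchy adaptation is omitted.

        import torch
        import torch.nn as nn

        class TwoLevelMixture(nn.Module):
            """Hypothetical two-level tree classifier: p(class|x) = sum_g p(g|x) * p(class|x, g)."""
            def __init__(self, feat_dim, group_to_classes):
                super().__init__()
                self.group_to_classes = group_to_classes    # one list of fine-class indices per group
                self.group_head = nn.Linear(feat_dim, len(group_to_classes))
                self.class_heads = nn.ModuleList(
                    [nn.Linear(feat_dim, len(c)) for c in group_to_classes])
                self.num_classes = sum(len(c) for c in group_to_classes)

            def forward(self, feats):
                p_group = torch.softmax(self.group_head(feats), dim=1)            # p(g|x)
                p_class = feats.new_zeros(feats.size(0), self.num_classes)
                for g, classes in enumerate(self.group_to_classes):
                    p_fine = torch.softmax(self.class_heads[g](feats), dim=1)     # p(c|x,g)
                    p_class[:, classes] = p_group[:, g:g + 1] * p_fine            # level-wise mixture
                return p_class

        # toy usage: 512-d deep features, 4 fine classes split into 2 coarse groups
        model = TwoLevelMixture(512, [[0, 1], [2, 3]])
        probs = model(torch.randn(8, 512))
        loss = nn.functional.nll_loss(torch.log(probs + 1e-8), torch.randint(0, 4, (8,)))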

    Jointly Localizing and Describing Events for Dense Video Captioning

    Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. The problem is nevertheless not trivial, especially when a video contains multiple events worth mentioning, as often happens in real videos. A valid question is how to temporally localize and then describe events, a task known as "dense video captioning." In this paper, we present a novel framework for dense video captioning that unifies the localization of temporal event proposals and the sentence generation for each proposal by jointly training them in an end-to-end manner. To combine these two worlds, we integrate a new design, namely descriptiveness regression, into a single-shot detection structure to infer the descriptive complexity of each detected proposal via sentence generation. This in turn adjusts the temporal locations of each event proposal. Our model differs from existing dense video captioning methods in that we propose a joint and global optimization of detection and captioning, and the framework uniquely capitalizes on an attribute-augmented video captioning architecture. Extensive experiments are conducted on the ActivityNet Captions dataset, and our framework shows clear improvements over state-of-the-art techniques. More remarkably, we obtain a new record: a METEOR of 12.96% on the ActivityNet Captions official test set. Comment: CVPR 2018 Spotlight, Rank 1 in ActivityNet Captions Challenge 2017
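
    A rough sketch of how a descriptiveness score could sit alongside a single-shot temporal proposal head, as the abstract describes: each temporal anchor gets an eventness score, boundary offsets, and a descriptiveness score, and the two scores are fused to rank proposals. The convolutional layers, fusion rule, and tensor shapes are assumptions; the attribute-augmented captioner is omitted.

        import torch
        import torch.nn as nn

        class TemporalProposalHead(nn.Module):
            """Per-anchor outputs over a 1-D feature map: eventness, offsets, descriptiveness."""
            def __init__(self, in_channels, num_anchors):
                super().__init__()
                self.event = nn.Conv1d(in_channels, num_anchors, kernel_size=3, padding=1)
                self.offset = nn.Conv1d(in_channels, num_anchors * 2, kernel_size=3, padding=1)
                self.descr = nn.Conv1d(in_channels, num_anchors, kernel_size=3, padding=1)

            def forward(self, feats):                              # feats: (B, C, T)
                eventness = torch.sigmoid(self.event(feats))       # is an event present here?
                offsets = self.offset(feats)                       # regressed center/length shifts
                descriptiveness = torch.sigmoid(self.descr(feats)) # how describable is this span?
                score = eventness * descriptiveness                # assumed fusion for ranking
                return score, offsets

        head = TemporalProposalHead(in_channels=256, num_anchors=4)
        score, offsets = head(torch.randn(2, 256, 128))            # 128 temporal positions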

    End-to-End Video Classification with Knowledge Graphs

    Video understanding has attracted much research attention especially since the recent availability of large-scale video benchmarks. In this paper, we address the problem of multi-label video classification. We first observe that there exists a significant knowledge gap between how machines and humans learn. That is, while current machine learning approaches including deep neural networks largely focus on the representations of the given data, humans often look beyond the data at hand and leverage external knowledge to make better decisions. Towards narrowing the gap, we propose to incorporate external knowledge graphs into video classification. In particular, we unify traditional "knowledgeless" machine learning models and knowledge graphs in a novel end-to-end framework. The framework is flexible to work with most existing video classification algorithms including state-of-the-art deep models. Finally, we conduct extensive experiments on the largest public video dataset YouTube-8M. The results are promising across the board, improving mean average precision by up to 2.9%.Comment: 9 pages, 5 figure
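
    A simplified sketch of one way external label relations can be injected at prediction time: a base classifier's per-label scores are blended with scores propagated over a label-relatedness matrix derived from a knowledge graph. The blending rule and the toy matrix are assumptions, not the paper's exact end-to-end formulation.

        import numpy as np

        def propagate_scores(base_scores, relation, alpha=0.3):
            """Blend per-label scores with evidence propagated over a knowledge-graph relation matrix.

            base_scores: (num_labels,) sigmoid outputs of any video classifier
            relation:    (num_labels, num_labels) row-normalized label relatedness from the graph
            """
            propagated = relation @ base_scores            # support from related labels
            return (1 - alpha) * base_scores + alpha * propagated

        # toy example with 3 labels, where labels 0 and 1 are linked in the graph
        relation = np.array([[0.0, 1.0, 0.0],
                             [1.0, 0.0, 0.0],
                             [0.0, 0.0, 0.0]])
        refined = propagate_scores(np.array([0.9, 0.2, 0.1]), relation)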

    Design Challenges of Multi-UAV Systems in Cyber-Physical Applications: A Comprehensive Survey, and Future Directions

    Unmanned Aerial Vehicles (UAVs) have recently grown rapidly to facilitate a wide range of innovative applications that can fundamentally change the way cyber-physical systems (CPSs) are designed. CPSs are a modern generation of systems with synergic cooperation between computational and physical capabilities that can interact with humans through several new mechanisms. The main advantages of using UAVs in CPS applications are their exceptional features, including mobility, dynamism, effortless deployment, adaptive altitude, agility, adjustability, and effective appraisal of real-world functions anytime and anywhere. Furthermore, from a technology perspective, UAVs are predicted to be a vital element of the development of advanced CPSs. Therefore, in this survey, we aim to pinpoint the most fundamental and important design challenges of multi-UAV systems for CPS applications. We highlight key and versatile aspects that span the coverage and tracking of targets and infrastructure objects, energy-efficient navigation, and image analysis using machine learning for fine-grained CPS applications. Key prototypes and testbeds are also investigated to show how these practical technologies can facilitate CPS applications. We present state-of-the-art algorithms that address the design challenges with both quantitative and qualitative methods, and map these challenges to important CPS applications to draw insightful conclusions on the challenges of each application. Finally, we summarize potential new directions and ideas that could shape future research in these areas.

    Robust Visual Knowledge Transfer via EDA

    We address the problem of visual knowledge adaptation by leveraging labeled patterns from a source domain and a very limited number of labeled instances in a target domain to learn a robust classifier for visual categorization. This paper proposes a new extreme learning machine (ELM) based cross-domain network learning framework, called ELM-based Domain Adaptation (EDA). It allows us to learn a category transformation and an ELM classifier with random projection by simultaneously minimizing the l_(2,1)-norm of the network output weights and the learning error. The unlabeled target data, as useful knowledge, are also integrated as a fidelity term to guarantee stability during cross-domain learning; this term minimizes the matching error between the learned classifier and a base classifier, so that many existing classifiers can be readily incorporated as base classifiers. The network output weights can not only be determined analytically but are also transferable. Additionally, manifold regularization with a Laplacian graph is incorporated, which benefits semi-supervised learning. We further extend the model to multiple views, referred to as MvEDA. Experiments on benchmark visual datasets for video event recognition and object recognition demonstrate that our EDA methods outperform existing cross-domain learning methods. Comment: This paper has been accepted for publication in IEEE Transactions on Image Processing
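
    A bare-bones sketch of the ELM backbone that EDA builds on, assuming source features stacked with a handful of labeled target features and one-hot label matrices: a random hidden projection followed by output weights obtained in closed form. The l_(2,1)-norm, fidelity term, manifold regularizer, and multi-view extension are all omitted here; dimensions and data are placeholders.

        import numpy as np

        def elm_train(X, Y, n_hidden=500, reg=1e-2, rng=np.random.default_rng(0)):
            """Basic ELM: random hidden projection plus ridge-style analytic output weights."""
            W = rng.standard_normal((X.shape[1], n_hidden))
            b = rng.standard_normal(n_hidden)
            H = np.tanh(X @ W + b)                                  # random hidden features
            beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
            return W, b, beta

        def elm_predict(X, W, b, beta):
            return np.tanh(X @ W + b) @ beta

        # source data plus a few labeled target samples, stacked together
        Xs, Ys = np.random.randn(200, 64), np.eye(3)[np.random.randint(0, 3, 200)]
        Xt, Yt = np.random.randn(10, 64), np.eye(3)[np.random.randint(0, 3, 10)]
        W, b, beta = elm_train(np.vstack([Xs, Xt]), np.vstack([Ys, Yt]))
        pred = elm_predict(np.random.randn(5, 64), W, b, beta).argmax(axis=1)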

    Unsupervised Representation Learning by Sorting Sequences

    We present an unsupervised representation learning approach that uses videos without semantic labels. We leverage temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. Similar to comparison-based sorting algorithms, we propose to extract features from all frame pairs and aggregate them to predict the correct order. As sorting a shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task allows us to learn rich and generalizable visual representations. We validate the effectiveness of the learned representations by using our method as pre-training for high-level recognition problems. The experimental results show that our method compares favorably against state-of-the-art methods on action recognition, image classification, and object detection tasks. Comment: ICCV 2017. Project page: http://vllab1.ucmerced.edu/~hylee/OPN
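
    A compact sketch of the sorting pretext task, assuming any frame encoder: features of the shuffled frames are compared pairwise, aggregated, and classified into one of the possible permutations. The pairwise layer sizes and the tiny toy encoder are assumptions; the paper's Order Prediction Network is more elaborate.

        import itertools
        import torch
        import torch.nn as nn

        class OrderSorter(nn.Module):
            """Classify which permutation was applied to a tuple of shuffled frames."""
            def __init__(self, encoder, feat_dim, tuple_len=4):
                super().__init__()
                self.encoder = encoder                              # any frame feature extractor
                self.pair_fc = nn.Linear(feat_dim * 2, 256)         # pairwise comparison features
                n_pairs = tuple_len * (tuple_len - 1) // 2
                n_perms = len(list(itertools.permutations(range(tuple_len))))
                self.classifier = nn.Linear(256 * n_pairs, n_perms)

            def forward(self, frames):                              # frames: (B, tuple_len, C, H, W)
                feats = [self.encoder(frames[:, i]) for i in range(frames.size(1))]
                pairs = [torch.relu(self.pair_fc(torch.cat([feats[i], feats[j]], dim=1)))
                         for i, j in itertools.combinations(range(len(feats)), 2)]
                return self.classifier(torch.cat(pairs, dim=1))     # logits over permutations

        # toy usage with a tiny encoder on 32x32 RGB frames
        encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
        model = OrderSorter(encoder, feat_dim=128)
        logits = model(torch.randn(2, 4, 3, 32, 32))                # (2, 24) permutation logits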

    cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey

    The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers on computer vision, pattern recognition, and related fields. For this particular review, we focused on reading the ALL 602 conference papers presented at the CVPR2015, the premier annual computer vision event held in June 2015, in order to grasp the trends in the field. Further, we are proposing "DeepSurvey" as a mechanism embodying the entire process from the reading through all the papers, the generation of ideas, and to the writing of paper.Comment: Survey Pape

    Improving Interpretability of Deep Neural Networks with Semantic Information

    Interpretability of deep neural networks (DNNs) is essential since it enables users to understand the overall strengths and weaknesses of a model, conveys an understanding of how the model will behave in the future, and suggests how to diagnose and correct potential problems. However, it is challenging to reason about what a DNN actually does due to its opaque or black-box nature. To address this issue, we propose a novel technique to improve the interpretability of DNNs by leveraging the rich semantic information embedded in human descriptions. Concentrating on the video captioning task, we first extract a set of semantically meaningful topics from the human descriptions that cover a wide range of visual concepts, and integrate them into the model with an interpretive loss. We then propose a prediction difference maximization algorithm to interpret the learned features of each neuron. Experimental results demonstrate its effectiveness in video captioning using the interpretable features, which can also be transferred to video action recognition. By clearly understanding the learned features, users can easily revise false predictions via a human-in-the-loop procedure. Comment: To appear in CVPR 2017
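
    A minimal sketch of how an interpretive term could be attached to a main objective, assuming topics mined from human descriptions are available as multi-hot targets; the heads, weighting, and losses are illustrative assumptions, and the prediction difference maximization step is not shown.

        import torch
        import torch.nn as nn

        class TopicGuidedHead(nn.Module):
            """Shared features predict both the main task and a set of semantic topics."""
            def __init__(self, feat_dim, num_classes, num_topics):
                super().__init__()
                self.task_head = nn.Linear(feat_dim, num_classes)
                self.topic_head = nn.Linear(feat_dim, num_topics)   # topics mined from captions

            def forward(self, feats):
                return self.task_head(feats), self.topic_head(feats)

        def joint_loss(task_logits, labels, topic_logits, topic_targets, lam=0.5):
            # main objective plus the interpretive (topic-prediction) term
            task = nn.functional.cross_entropy(task_logits, labels)
            interpretive = nn.functional.binary_cross_entropy_with_logits(topic_logits, topic_targets)
            return task + lam * interpretive

        head = TopicGuidedHead(feat_dim=512, num_classes=10, num_topics=20)
        task_logits, topic_logits = head(torch.randn(4, 512))
        loss = joint_loss(task_logits, torch.randint(0, 10, (4,)),
                          topic_logits, torch.randint(0, 2, (4, 20)).float())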

    Audio-Based Activities of Daily Living (ADL) Recognition with Large-Scale Acoustic Embeddings from Online Videos

    Over the years, activity sensing and recognition has been shown to play a key enabling role in a wide range of applications, from sustainability and human-computer interaction to health care. While many recognition tasks have traditionally employed inertial sensors, acoustic-based methods offer the benefit of capturing rich contextual information, which can be useful when discriminating complex activities. Given the emergence of deep learning techniques and new, large-scale multimedia datasets, this paper revisits the opportunity of training audio-based classifiers without the onerous and time-consuming task of annotating audio data. We propose a framework for audio-based activity recognition that makes use of millions of embedding features from public online video sound clips. Based on a combination of oversampling and deep learning approaches, our framework does not require the further feature processing or outlier filtering used in prior work. We evaluated our approach in the context of Activities of Daily Living (ADL) by recognizing 15 everyday activities with 14 participants in their own homes, achieving 64.2% and 83.6% averaged within-subject accuracy for top-1 and top-3 classification, respectively. Individual class performance is also examined to further study the co-occurrence characteristics of the activities and the robustness of the framework. Comment: 18 pages, 7 figures; new version: results updated
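
    A minimal sketch of the training recipe, assuming clip-level audio embeddings (e.g., 128-d vectors of the kind released with large online-video sound datasets) are already extracted: minority classes are oversampled by replication and a small classifier is trained directly on the embeddings. The dimensions, classifier choice, and random data are placeholders, not the paper's exact pipeline.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        def oversample(X, y, rng=np.random.default_rng(0)):
            """Replicate minority-class examples until every class matches the largest one."""
            classes, counts = np.unique(y, return_counts=True)
            idx = np.concatenate([rng.choice(np.where(y == c)[0], size=counts.max(), replace=True)
                                  for c in classes])
            return X[idx], y[idx]

        # placeholder embeddings: 128-d clip-level features for 15 activity classes
        X = np.random.randn(3000, 128)
        y = np.random.randint(0, 15, 3000)
        X_bal, y_bal = oversample(X, y)
        clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50).fit(X_bal, y_bal)
        top1 = clf.score(np.random.randn(100, 128), np.random.randint(0, 15, 100))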

    Towards Storytelling from Visual Lifelogging: An Overview

    Visual lifelogging consists of acquiring images that capture the daily experiences of a user who wears a camera over a long period of time. The pictures taken offer considerable potential for knowledge mining concerning how people live their lives; hence, they open up new opportunities for many potential applications in fields including healthcare, security, leisure, and the quantified self. However, automatically building a story from a huge collection of unstructured egocentric data presents major challenges. This paper provides a thorough review of advances made so far in egocentric data analysis and, in view of the current state of the art, indicates new lines of research to move us towards storytelling from visual lifelogging. Comment: 16 pages, 11 figures, submitted to IEEE Transactions on Human-Machine Systems