
    Deceiving Google's Cloud Video Intelligence API Built for Summarizing Videos

    Despite the rapid progress of techniques for image classification, video annotation has remained a challenging task. Automated video annotation would be a breakthrough technology, enabling users to search within videos. Recently, Google introduced the Cloud Video Intelligence API for video analysis. As per the website, the system can be used to "separate signal from noise, by retrieving relevant information at the video, shot or per frame" level. A demonstration website has also been launched, which allows anyone to select a video for annotation. The API then detects the video labels (objects within the video) as well as shot labels (descriptions of the video events over time). In this paper, we examine the usability of Google's Cloud Video Intelligence API in adversarial environments. In particular, we investigate whether an adversary can subtly manipulate a video in such a way that the API returns only the adversary-desired labels. For this, we select an image that is different from the video content and insert it, periodically and at a very low rate, into the video. We found that if we insert one image every two seconds, the API is deceived into annotating the video as if it only contained the inserted image. Note that the modification to the video is hardly noticeable as, for instance, for a typical frame rate of 25, we insert only one image per 50 video frames. We also found that, by inserting one image per second, all the shot labels returned by the API are related to the inserted image. We perform the experiments on the sample videos provided by the API demonstration website and show that our attack is successful with different videos and images.
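
    The frame-insertion strategy described in the abstract can be illustrated with a minimal sketch: replace one frame every few seconds with the adversarial image. The OpenCV-based function below is only an illustration of that idea; the file names, codec, and insertion period are assumptions, not the authors' implementation.

```python
# Sketch of the periodic image-insertion manipulation described above (illustrative only).
# File names, codec, and the insertion period are placeholder assumptions.
import cv2

def insert_image_periodically(video_path, image_path, out_path, period_s=2.0):
    """Replace one frame every `period_s` seconds with the inserted image."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

    adv = cv2.resize(cv2.imread(image_path), (width, height))
    step = int(round(fps * period_s))  # e.g. one image per 50 frames at 25 fps, 2 s period

    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(adv if idx % step == 0 else frame)
        idx += 1

    cap.release()
    writer.release()
```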

    Positive and negative label propagation


    Person Identity Label Propagation in Stereo Videos


    Adaptive multi-view feature selection for human motion retrieval

    Human motion retrieval plays an important role in many applications based on motion data. In the past, many researchers tended to use a single type of visual feature as the data representation. Because different visual features describe different aspects of motion data and have dissimilar discriminative power with respect to a particular class of human motion, this leads to poor retrieval performance. Thus, it is beneficial to combine multiple visual features for motion data representation. In this article, we present an Adaptive Multi-view Feature Selection (AMFS) method for human motion retrieval. Specifically, we first use a local linear regression model to automatically learn multiple view-based Laplacian graphs that preserve the local geometric structure of the motion data. These graphs are then combined with a non-negative view-weight vector to exploit the complementary information between different features. Finally, in order to discard redundant and irrelevant feature components from the original high-dimensional feature representation, we formulate the objective function of AMFS as a general trace ratio optimization problem and design an effective algorithm to solve it. Extensive experiments on two public human motion databases, i.e., HDM05 and MSR Action3D, demonstrate the effectiveness of the proposed AMFS over state-of-the-art methods for motion data retrieval. Its scalability to large motion datasets and insensitivity to the algorithm parameters make our method widely applicable in real-world applications.
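
    A core step in this kind of multi-view method is combining per-view graph Laplacians with non-negative view weights. The sketch below shows that combination step only, using a simple kNN graph per view; the graph construction and fixed weights are simplifying assumptions, not the paper's local-linear-regression formulation or its trace-ratio solver.

```python
# Minimal sketch of combining view-specific Laplacian graphs with non-negative weights,
# in the spirit of AMFS. The kNN graphs and fixed weights are simplifying assumptions.
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

def combined_laplacian(views, weights, k=5):
    """views: list of (n_samples, d_v) feature arrays; weights: non-negative view weights."""
    L = None
    for X_v, w_v in zip(views, weights):
        W = kneighbors_graph(X_v, n_neighbors=k, mode="connectivity", include_self=False)
        W = 0.5 * (W + W.T)                       # symmetrise the kNN adjacency
        L_v = laplacian(W, normed=True).toarray() # normalised graph Laplacian for this view
        L = w_v * L_v if L is None else L + w_v * L_v
    return L

# Toy usage: two views of 100 samples with different feature dimensions.
rng = np.random.default_rng(0)
views = [rng.normal(size=(100, 30)), rng.normal(size=(100, 12))]
L = combined_laplacian(views, weights=[0.6, 0.4])
```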

    Modeling sequences and temporal networks with dynamic community structures

    In evolving complex systems such as air traffic and social organizations, collective effects emerge from the dynamic interactions of their many components. While these dynamic interactions can be represented by temporal networks with nodes and links that change over time, such networks remain highly complex. It is therefore often necessary to use methods that extract the temporal networks' large-scale dynamic community structure. However, such methods are subject to overfitting or suffer from the effects of arbitrary, a priori imposed timescales, which should instead be extracted from the data. Here we simultaneously address both problems and develop a principled, data-driven method that determines relevant timescales and identifies patterns of dynamics that take place on networks as well as shape the networks themselves. We base our method on an arbitrary-order Markov chain model with community structure, and develop a nonparametric Bayesian inference framework that identifies the simplest such model that can explain temporal interaction data.
    Comment: 15 pages, 6 figures, 2 tables
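
    The basic statistic underlying an arbitrary-order Markov chain model of a temporal network is the count of transitions from a length-k memory of past nodes to the next node. The snippet below only illustrates how such counts are gathered from an event sequence; the toy itinerary and the fixed order are assumptions, not the paper's inference framework.

```python
# Illustrative sketch: transition counts for an order-k Markov chain over a node sequence,
# the kind of sufficient statistic a model like the one above is fit to.
from collections import Counter, defaultdict

def markov_counts(sequence, order=2):
    """Count transitions from each (order)-length memory state to the next node."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - order):
        memory = tuple(sequence[i:i + order])
        counts[memory][sequence[i + order]] += 1
    return counts

# Toy air-traffic itinerary (purely illustrative data).
events = ["JFK", "LHR", "CDG", "JFK", "LHR", "FRA", "CDG"]
print(markov_counts(events, order=2))
```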

    Implicit and Explicit Concept Relations in Deep Neural Networks for Multi-Label Video/Image Annotation

    In this work we propose a DCNN (Deep Convolutional Neural Network) architecture that addresses the problem of video/image concept annotation by exploiting concept relations at two different levels. At the first level, we build on ideas from multi-task learning and propose an approach to learn concept-specific representations that are sparse, linear combinations of representations of latent concepts. By enforcing the sharing of the latent concept representations, we exploit the implicit relations between the target concepts. At the second level, we build on ideas from structured output learning and propose the introduction, at training time, of a new cost term that explicitly models the correlations between the concepts. By doing so, we explicitly model the structure in the output space (i.e., the concept labels). Both of the above are implemented using standard convolutional layers and are incorporated in a single DCNN architecture that can then be trained end-to-end with standard back-propagation. Experiments on four large-scale video and image datasets show that the proposed DCNN improves concept annotation accuracy and outperforms the related state-of-the-art methods.
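
    The two ideas in this abstract can be sketched compactly: concept scores formed as linear combinations of shared latent-concept representations, plus an auxiliary loss that pushes predicted concept co-occurrences toward the label correlations. The PyTorch sketch below is an illustration under assumed layer sizes and an assumed loss form; it is not the paper's architecture or cost term.

```python
# Minimal PyTorch sketch of (1) target-concept scores as linear combinations of shared
# latent-concept representations and (2) an extra loss aligning predicted co-occurrences
# with label correlations. Sizes and the exact loss form are assumptions.
import torch
import torch.nn as nn

class LatentConceptHead(nn.Module):
    def __init__(self, feat_dim, n_latent, n_concepts):
        super().__init__()
        self.latent = nn.Linear(feat_dim, n_latent)   # shared latent-concept representations
        self.mix = nn.Linear(n_latent, n_concepts)    # concept-specific linear combinations

    def forward(self, features):
        return self.mix(torch.relu(self.latent(features)))

def correlation_loss(logits, labels):
    """Penalise mismatch between predicted and ground-truth concept co-occurrence."""
    p = torch.sigmoid(logits)
    pred_corr = (p.T @ p) / p.shape[0]
    true_corr = (labels.T @ labels) / labels.shape[0]
    return ((pred_corr - true_corr) ** 2).mean()

# Usage: combine with a standard multi-label BCE loss during training.
head = LatentConceptHead(feat_dim=2048, n_latent=64, n_concepts=345)
x = torch.randn(8, 2048)
y = torch.randint(0, 2, (8, 345)).float()
logits = head(x)
loss = nn.functional.binary_cross_entropy_with_logits(logits, y) + 0.1 * correlation_loss(logits, y)
```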

    Unsupervised Visual and Textual Information Fusion in Multimedia Retrieval - A Graph-based Point of View

    Multimedia collections are growing more than ever in size and diversity. Effective multimedia retrieval systems are thus critical for accessing these datasets from the end-user perspective and in a scalable way. We are interested in repositories of image/text multimedia objects and we study multimodal information fusion techniques in the context of content-based multimedia information retrieval. We focus on graph-based methods, which have proven to provide state-of-the-art performance. We particularly examine two such methods: cross-media similarities and random walk based scores. From a theoretical viewpoint, we propose a unifying graph-based framework that encompasses the two aforementioned approaches. Our proposal allows us to highlight the core features one should consider when using a graph-based technique for the combination of visual and textual information. We compare cross-media and random walk based results using three different real-world datasets. From a practical standpoint, our extended empirical analysis allows us to provide insights and guidelines about the use of graph-based methods for multimodal information fusion in content-based multimedia information retrieval.
    Comment: An extended version of the paper: Visual and Textual Information Fusion in Multimedia Retrieval using Semantic Filtering and Graph based Methods, by J. Ah-Pine, G. Csurka and S. Clinchant, submitted to ACM Transactions on Information Systems
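
    To make the random-walk side of this comparison concrete, the sketch below scores documents by a restart-based random walk over a fused visual/textual similarity graph. The averaging-based fusion and the parameter values are naive assumptions for illustration, not the unifying framework proposed in the paper.

```python
# Illustrative sketch of random-walk scores on a fused visual/textual similarity graph
# (a personalized-PageRank-style iteration; averaging the two modalities is an assumption).
import numpy as np

def random_walk_scores(S_visual, S_textual, query_idx, alpha=0.85, iters=50):
    """Restart-based random walk over a combined similarity graph; returns one score per item."""
    S = 0.5 * (S_visual + S_textual)              # naive multimodal fusion of similarity matrices
    P = S / S.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
    r = np.zeros(S.shape[0]); r[query_idx] = 1.0  # restart distribution centred on the query
    scores = r.copy()
    for _ in range(iters):
        scores = alpha * (P.T @ scores) + (1 - alpha) * r
    return scores
```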

    Social and Semantic Contexts in Tourist Mobile Applications

    The ongoing growth of the World Wide Web, along with the increased possibility of accessing information through a variety of mobile devices, has definitely changed the way users acquire, create, and personalize information, pushing innovative strategies for annotating and organizing it. In this scenario, Social Annotation Systems have quickly gained huge popularity, introducing millions of metadata items on different Web resources following a bottom-up approach and generating free and democratic mechanisms of classification, namely folksonomies. Moving away from hierarchical classification schemas, folksonomies also represent a meaningful means for identifying similarities among users, resources and tags. At any rate, they suffer from several limitations, such as the lack of specialized tools devoted to managing, modifying, customizing and visualizing them, as well as the lack of explicit semantics, making it difficult for users to benefit from them effectively. Despite the appealing promises of Semantic Web technologies, which were intended to explicitly formalize the knowledge within a particular domain in a top-down manner in order to perform intelligent integration and reasoning on it, they are still far from reaching their objectives, due to difficulties in knowledge acquisition and the annotation bottleneck. The main contribution of this dissertation consists in modeling a novel conceptual framework that exploits both social and semantic contextual dimensions, focusing on the domain of tourism and cultural heritage. The primary aim of our assessment is to evaluate the overall user satisfaction and the perceived quality in use through two concrete case studies. Firstly, we concentrate our attention on contextual information and navigation, and on an authoring tool; secondly, we provide a semantic mapping of the tags of the system folksonomy, contrasted and compared to the expert users' classification, building a bridge between social and semantic knowledge according to their constantly mutual growth. The results of the user evaluations are promising, reporting a high level of agreement on the perceived quality in use of both applications and of the specific analyzed features, demonstrating that a social-semantic contextual model improves overall user satisfaction.