2,176 research outputs found

    Going Deeper for Multilingual Visual Sentiment Detection

    This technical report details several improvements to the visual concept detector banks built on images from the Multilingual Visual Sentiment Ontology (MVSO). The detector banks are trained to detect a total of 9,918 sentiment-biased visual concepts from six major languages: English, Spanish, Italian, French, German and Chinese. In the original MVSO release, adjective-noun pair (ANP) detectors were trained for the six languages using an AlexNet-styled architecture by fine-tuning from DeepSentiBank. Here, through a more extensive set of experiments, parameter tuning, and training runs, we detail and release higher accuracy models for detecting ANPs across six languages from the same image pool and setting as in the original release using a more modern architecture, GoogLeNet, providing comparable or better performance with reduced network parameter cost. In addition, since the image pool in MVSO can be corrupted by user noise from social interactions, we partitioned out a sub-corpus of MVSO images based on tag-restricted queries for higher fidelity labels. We show that, as a result of these higher fidelity labels, higher performing AlexNet-styled ANP detectors can be trained on the tag-restricted image subset than on the full corpus. We release all these newly trained models for public research use along with the list of tag-restricted images from the MVSO dataset.
    Comment: technical report, 7 pages

    On the Difficulty of Nearest Neighbor Search

    Fast approximate nearest neighbor (NN) search in large databases is becoming popular. Several powerful learning-based formulations have been proposed recently. However, not much attention has been paid to a more fundamental question: how difficult is (approximate) nearest neighbor search in a given data set? And which data properties affect the difficulty of nearest neighbor search and how? This paper introduces the first concrete measure called Relative Contrast that can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed metric spaces. Moreover, we present a theoretical analysis to prove how the difficulty measure (relative contrast) determines/affects the complexity of Locality Sensitive Hashing, a popular approximate NN search method. Relative contrast also provides an explanation for a family of heuristic hashing algorithms with good practical performance based on PCA. Finally, we show that most of the previous works in measuring NN search meaningfulness/difficulty can be derived as special asymptotic cases for dense vectors of the proposed measure.
    Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)
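
The abstract above does not spell out how Relative Contrast is computed. The following is a minimal sketch, assuming the measure is the ratio of a query's average distance to the database over its distance to the nearest neighbor, averaged over queries. The function names and the Euclidean metric are illustrative assumptions, not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relative_contrast(database, queries, dist=euclidean):
    """Assumed form: mean distance to a random point / mean NN distance.

    Raises ZeroDivisionError if every query coincides with a database point.
    """
    mean_dists, min_dists = [], []
    for q in queries:
        d = [dist(q, x) for x in database]
        mean_dists.append(sum(d) / len(d))   # average distance for this query
        min_dists.append(min(d))             # nearest-neighbor distance
    avg_mean = sum(mean_dists) / len(mean_dists)
    avg_min = sum(min_dists) / len(min_dists)
    return avg_mean / avg_min


if __name__ == "__main__":
    db = [(1.0, 0.0), (3.0, 0.0)]
    print(relative_contrast(db, [(0.0, 0.0)]))
```

Under this assumed definition the ratio is never below 1; values close to 1 would mean the nearest neighbor is barely closer than a random point, which is the hard case for search.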

    CamSwarm: Instantaneous Smartphone Camera Arrays for Collaborative Photography

    Camera arrays (CamArrays) are widely used in commercial filming projects for achieving special visual effects such as bullet time effect, but are very expensive to set up. We propose CamSwarm, a low-cost and lightweight alternative to professional CamArrays for consumer applications. It allows the construction of a collaborative photography platform from multiple mobile devices anywhere and anytime, enabling new capturing and editing experiences that a single camera cannot provide. Our system allows easy team formation; uses real-time visualization and feedback to guide camera positioning; provides a mechanism for synchronized capturing; and finally allows the user to efficiently browse and edit the captured imagery. Our user study suggests that CamSwarm is easy to use; the provided real-time guidance is helpful; and the full system achieves high quality results promising for non-professional use. A demo video is provided at https://www.youtube.com/watch?v=LgkHcvcyTTM

    Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

    We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes on the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and therefore achieve high temporal localization accuracy. Only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly superior performance compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014, when the overlap threshold for evaluation is set to 0.5.
    Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 201
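
The overlap threshold of 0.5 mentioned above can be illustrated with a short sketch. It assumes overlap is measured as temporal intersection-over-union between a predicted and a ground-truth segment, which is the usual convention but is not stated in the abstract. The function names and the `(start, end)` tuple layout are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred, gt, threshold=0.5):
    """A prediction counts as a hit when its overlap reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    print(temporal_iou((0.0, 10.0), (2.0, 10.0)))
    print(is_correct((0.0, 10.0), (5.0, 15.0)))
```

A full mAP computation would additionally rank predictions by confidence and match each ground-truth instance at most once; that bookkeeping is omitted here.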

    Event Specific Multimodal Pattern Mining with Image-Caption Pairs

    In this paper we describe a novel framework and algorithms for discovering image patch patterns from a large corpus of weakly supervised image-caption pairs generated from news events. Current pattern mining techniques attempt to find patterns that are representative and discriminative; we stipulate that our discovered patterns must also be recognizable by humans, preferably with meaningful names. We propose a new multimodal pattern mining approach that leverages the descriptive captions often accompanying news images to learn semantically meaningful image patch patterns. The multimodal patterns are then named using words mined from the associated image captions for each pattern. A novel evaluation framework is provided that demonstrates our patterns are 26.2% more semantically meaningful than those discovered by the state-of-the-art vision-only pipeline, and that we can provide tags for the discovered image patches with 54.5% accuracy with no direct supervision. Our methods also discover named patterns beyond those covered by existing image datasets like ImageNet. To the best of our knowledge this is the first algorithm developed to automatically mine image patch patterns that have strong semantic meaning specific to high-level news events, and then evaluate these patterns based on those criteria.

    Generic Instance Search and Re-identification from One Example via Attributes and Categories

    This paper aims for generic instance search from one example where the instance can be an arbitrary object like shoes, not just near-planar and one-sided instances like buildings and logos. First, we evaluate state-of-the-art instance search methods on this problem. We observe that what works for buildings loses its generality on shoes. Second, we propose to use automatically learned category-specific attributes to address the large appearance variations present in generic instance search. Searching among instances from the same category as the query, the category-specific attributes outperform existing approaches by a large margin on shoes and cars and perform on par with the state-of-the-art on buildings. Third, we treat person re-identification as a special case of generic instance search. On the popular VIPeR dataset, we reach state-of-the-art performance with the same method. Fourth, we extend our method to search objects without restriction to the specifically known category. We show that the combination of category-level information and the category-specific attributes is superior to the alternative method combining category-level information with low-level features such as Fisher vector.
    Comment: This technical report is an extended version of our previous conference paper 'Attributes and Categories for Generic Instance Search from One Example' (CVPR 2015)

    Building A Large Concept Bank for Representing Events in Video

    Concept-based video representation has proven to be effective in complex event detection. However, existing methods either manually design concepts or directly adopt concept libraries not specifically designed for events. In this paper, we propose to build Concept Bank, the largest concept library consisting of 4,876 concepts specifically designed to cover 631 real-world events. To construct the Concept Bank, we first gather a comprehensive event collection from WikiHow, a collaborative writing project that aims to build the world's largest manual for any possible How-To event. For each event, we then search Flickr and discover relevant concepts from the tags of the returned images. We train a Multiple Kernel Linear SVM for each discovered concept as a concept detector in Concept Bank. We organize the concepts into a five-layer tree structure, in which the higher-level nodes correspond to the event categories while the leaf nodes are the event-specific concepts discovered for each event. Based on such tree ontology, we develop a semantic matching method to select relevant concepts for each textual event query, and then apply the corresponding concept detectors to generate concept-based video representations. We use TRECVID Multimedia Event Detection 2013 and Columbia Consumer Video open source event definitions and videos as our test sets and show very promising results on two video event detection tasks: event modeling over concept space and zero-shot event retrieval. To the best of our knowledge, this is the largest concept library covering the largest number of real-world events.
    Comment: 25 pages, 9 figures

    Learning to Hash for Indexing Big Data - A Survey

    The explosive growth in big data has recently attracted much attention to the design of efficient indexing and search methods. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although they have elegant theoretical guarantees on the search quality in certain metric spaces, the performance of randomized hashing has been shown to be insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in the development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve, in the hash code space, the proximity of neighboring data in the original feature space. The goal of this paper is to provide readers with a systematic understanding of the insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing deep learning models. Finally, we discuss the future directions and trends of research in this area.
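
To make the data-independent baseline concrete, here is a minimal sketch of one random-projection hashing scheme (sign of a Gaussian random projection per bit). It is an illustration of the general idea only, not a method taken from the survey; the function names, the seed, and the bit count are assumptions.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions; no training data is used."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, hyperplanes):
    """One bit per hyperplane: 1 if the projection is non-negative, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, n_bits=16)
    a = hash_code([1.0, 2.0, -0.5, 0.3], planes)
    b = hash_code([1.1, 1.9, -0.4, 0.3], planes)
    print(a, hamming(a, b))
```

A learning to hash method would replace `make_hyperplanes` with projections fitted to the data distribution or to labels, while the encoding and Hamming comparison steps stay the same.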

    PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN

    We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD) to detect "subject-predicate-object" relations in an image with object relation groundtruths available only at the image level. This is motivated by the fact that it is extremely expensive to label the combinatorial relations between objects at the instance level. Compared to the extensively studied problem of Weakly Supervised Object Detection (WSOD), WSVRD is more challenging as it needs to examine a large set of region pairs, which is computationally prohibitive and more likely to get stuck in a locally optimal solution, such as one involving wrong spatial context. To this end, we present a Parallel, Pairwise Region-based, Fully Convolutional Network (PPR-FCN) for WSVRD. It uses a parallel FCN architecture that simultaneously performs pair selection and classification of single regions and region pairs for object and relation detection, while sharing almost all computation over the entire image. In particular, we propose a novel position-role-sensitive score map with pairwise RoI pooling to efficiently capture the crucial context associated with a pair of objects. We demonstrate the superiority of PPR-FCN over all baselines in solving the WSVRD challenge by using results of extensive experiments over two visual relation benchmarks.
    Comment: To appear in International Conference on Computer Vision (ICCV) 2017, Venice, Italy

    PanoSwarm: Collaborative and Synchronized Multi-Device Panoramic Photography

    Taking a picture has traditionally been a one-person task. In this paper we present a novel system that allows multiple mobile devices to work collaboratively in a synchronized fashion to capture a panorama of a highly dynamic scene, creating an entirely new photography experience that encourages social interactions and teamwork. Our system contains two components: a client app that runs on all participating devices, and a server program that monitors and communicates with each device. In a capturing session, the server collects in real time the viewfinder images of all devices and stitches them on-the-fly to create a panorama preview, which is then streamed to all devices as visual guidance. The system also allows one camera to be the host and to send direct visual instructions to others to guide camera adjustment. When ready, all devices take pictures at the same time for panorama stitching. Our preliminary study suggests that the proposed system can help users capture high quality panoramas with an enjoyable teamwork experience. A demo video of the system in action is provided at http://youtu.be/PwQ6k_ZEQSs