
    CNN Features off-the-shelf: an Astounding Baseline for Recognition

    Recent results indicate that the generic descriptors extracted from convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle a diverse range of recognition tasks: object image classification, scene recognition, fine-grained recognition, attribute detection, and image retrieval, applied to a diverse set of datasets. We selected these tasks and datasets because they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistently superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms low-memory-footprint methods except on the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in the case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
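
    A minimal sketch of the pipeline described above, assuming a pretrained torchvision AlexNet as a stand-in for the OverFeat model (not the authors' network): activations from the first fully connected layer give a 4096-dimensional descriptor, and a linear SVM is trained on top of it. The layer choice, preprocessing, and C value are illustrative assumptions, not the paper's exact settings.

        # Sketch: generic ConvNet features + linear SVM, in the spirit of the paper.
        # torchvision's AlexNet stands in for OverFeat (an assumption, not the authors' model).
        import torch
        import torchvision.models as models
        import torchvision.transforms as T
        from PIL import Image
        from sklearn.svm import LinearSVC

        model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
        # Output of the first fully connected layer (4096-d), analogous to the descriptor in the paper.
        feature_net = torch.nn.Sequential(model.features, model.avgpool,
                                          torch.nn.Flatten(),
                                          *list(model.classifier.children())[:2])

        preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                                T.Normalize(mean=[0.485, 0.456, 0.406],
                                            std=[0.229, 0.224, 0.225])])

        def descriptor(path):
            """4096-d generic image representation taken from the first FC layer."""
            x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            with torch.no_grad():
                f = feature_net(x).squeeze(0)
            return torch.nn.functional.normalize(f, dim=0).numpy()

        # Linear SVM on top of the fixed representation (train_paths / train_labels are placeholders):
        # X_train = [descriptor(p) for p in train_paths]
        # clf = LinearSVC(C=1.0).fit(X_train, train_labels)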

    Persistent Evidence of Local Image Properties in Generic ConvNets

    Supervised training of a convolutional network for object classification should make explicit any information related to the class of objects and disregard any auxiliary information associated with the capture of the image or the variation within the object class. Does this happen in practice? Although this seems to pertain to the very final layers in the network, if we look at earlier layers we find that this is not the case: surprisingly, strong spatial information is implicitly retained. This paper addresses this, in particular by exploiting the image representation at the first fully connected layer, i.e. the global image descriptor that has recently been shown to be most effective across a range of visual recognition tasks. We empirically demonstrate evidence for this finding in the context of four different tasks: 2D landmark detection, 2D object keypoint prediction, estimation of the RGB values of the input image, and recovery of the semantic label of each pixel. We base our investigation on a simple framework with ridge regression used commonly across these tasks, and show results that all support our insight. Such spatial information can be used for computing the correspondence of landmarks to good accuracy, and should potentially be useful for improving the training of convolutional nets for classification purposes.
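
    A minimal sketch of the shared probing framework, assuming the 4096-dimensional first-FC-layer descriptors have already been extracted (e.g. as in the previous entry) and that the targets are 2D landmark coordinates; the same ridge regression would apply to keypoint prediction, RGB estimation, or per-pixel labels. Variable names and the train/test split are illustrative.

        # Sketch: probing a global FC-layer descriptor for spatial information with ridge regression.
        # X: (n_images, 4096) ConvNet descriptors; Y: (n_images, 2*k) landmark (x, y) coordinates.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import train_test_split

        def probe_spatial_info(X, Y, alpha=1.0):
            """Fit a linear map from global descriptors to spatial targets and
            report the mean Euclidean error per landmark on held-out images."""
            X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
            reg = Ridge(alpha=alpha).fit(X_tr, Y_tr)
            pred = reg.predict(X_te).reshape(len(X_te), -1, 2)
            true = Y_te.reshape(len(Y_te), -1, 2)
            err = np.linalg.norm(pred - true, axis=-1).mean()
            return reg, err

        # A low held-out error indicates that spatial information persists in the
        # supposedly global representation, which is the paper's central observation.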

    Factors of Transferability for a Generic ConvNet Representation

    Evidence is mounting that ConvNets are the best representation-learning method for recognition. In the common scenario, a ConvNet is trained on a large labeled dataset and the feed-forward activations at a certain layer of the network are used as a generic representation of an input image. Recent studies have shown this form of representation to be astoundingly effective for a wide range of recognition tasks. This paper thoroughly investigates the transferability of such representations with respect to several factors, including parameters of the network's training, such as its architecture, and parameters of the feature extraction. We further show that different visual recognition tasks can be categorically ordered based on their distance from the source task. We then show interesting results indicating a clear correlation between the performance on a task and its distance from the source task, conditioned on the proposed factors. Furthermore, by optimizing these factors, we achieve state-of-the-art performance on 16 visual recognition tasks.
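
    One hedged illustration of how such a factor study might be run: sweep a single factor (here, which layer supplies the representation) and measure target-task accuracy with a fixed linear classifier. The feature dictionaries, normalization, and classifier are assumptions for the sketch, not the authors' protocol.

        # Sketch: measuring how one transferability factor (the layer used as the
        # representation) affects target-task performance. Features are assumed precomputed.
        import numpy as np
        from sklearn.svm import LinearSVC
        from sklearn.model_selection import cross_val_score

        def layer_sweep(features_by_layer, labels):
            """features_by_layer: {layer_name: (n_images, dim) array} for one target task."""
            scores = {}
            for layer, X in features_by_layer.items():
                X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)  # L2-normalize
                scores[layer] = cross_val_score(LinearSVC(C=1.0), X, labels, cv=5).mean()
            return scores

        # Repeating the sweep over several target tasks, ordered by their distance from the
        # source task (object classification), exposes the kind of correlation reported above.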

    Unsupervised Contact Learning for Humanoid Estimation and Control

    This work presents a method for contact state estimation that uses fuzzy clustering to learn contact probability for full, six-dimensional humanoid contacts. The data required for training come solely from proprioceptive sensors - end-effector contact wrench sensors and inertial measurement units (IMUs) - and the method is completely unsupervised. The resulting cluster means are used to efficiently compute the probability of contact in each of the six end-effector degrees of freedom (DoFs) independently. This clustering-based contact probability estimator is validated in a kinematics-based base state estimator, in a simulation environment with realistic added sensor noise, for locomotion over rough, low-friction terrain on which the robot is subject to foot slip and rotation. The proposed base state estimator, which utilizes these six-DoF contact probability estimates, is shown to perform considerably better than one that determines kinematic contact constraints purely from the measured normal force.
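
    A rough sketch of the clustering idea, assuming two clusters ("contact" vs. "no contact") per degree of freedom and a plain NumPy fuzzy c-means; the sensor preprocessing, fuzzifier, and the rule for picking the contact cluster are illustrative assumptions rather than the paper's implementation.

        # Sketch: unsupervised fuzzy c-means on proprioceptive samples, with cluster
        # membership reused as a per-DoF contact probability.
        import numpy as np

        def fuzzy_cmeans(x, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
            """x: (N, d) samples. Returns cluster centers (c, d) and memberships (N, c)."""
            rng = np.random.default_rng(seed)
            u = rng.dirichlet(np.ones(c), size=len(x))          # rows sum to 1
            for _ in range(iters):
                um = u ** m
                centers = um.T @ x / um.sum(axis=0)[:, None]
                d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1) + 1e-12
                u_new = 1.0 / (d ** (2.0 / (m - 1.0)))
                u_new /= u_new.sum(axis=1, keepdims=True)
                if np.abs(u_new - u).max() < tol:
                    u = u_new
                    break
                u = u_new
            return centers, u

        def contact_probability(sample, centers, m=2.0):
            """Membership of a new sample in the cluster whose center has the larger norm
            (a heuristic proxy for the 'in contact' cluster in this sketch)."""
            d = np.linalg.norm(sample[None, :] - centers, axis=-1) + 1e-12
            u = 1.0 / (d ** (2.0 / (m - 1.0)))
            u /= u.sum()
            contact_idx = int(np.argmax(np.linalg.norm(centers, axis=-1)))
            return float(u[contact_idx])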

    Nested Invariance Pooling and RBM Hashing for Image Instance Retrieval

    The goal of this work is the computation of very compact binary hashes for image instance retrieval. Our approach makes two novel contributions. The first is Nested Invariance Pooling (NIP), a method inspired by i-theory, a mathematical theory for computing group-invariant transformations with feed-forward neural networks. NIP is able to produce compact, well-performing descriptors from visual representations extracted from convolutional neural networks. We specifically incorporate scale, translation, and rotation invariances, but the scheme can be extended to arbitrary sets of transformations. We also show that using moments of increasing order throughout the nesting is important. The NIP descriptors are then hashed to the target code size (32-256 bits) with a Restricted Boltzmann Machine using a novel batch-level regularization scheme designed specifically for hashing (RBMH). A thorough empirical evaluation against the state of the art shows that the results obtained with both the NIP descriptors and the NIP+RBMH hashes are consistently outstanding across a wide range of datasets.
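
    A simplified sketch of the nesting idea, assuming a caller-supplied ConvNet descriptor function and image transformations; the groups (rotations, then scales), the moment orders, and translation handling (omitted here, since it usually comes from spatial pooling inside the feature map) are illustrative choices, not the exact NIP construction.

        # Sketch of nested invariance pooling: pool ConvNet descriptors over transformed
        # copies of the image, one transformation group at a time, with moments of
        # increasing order. Groups, orders, and the descriptor function are assumptions.
        import numpy as np

        def moment_pool(vectors, p):
            """Generalized p-th moment pooling over the group dimension (axis 0)."""
            v = np.stack(vectors, axis=0)
            return (np.abs(v) ** p).mean(axis=0) ** (1.0 / p)

        def nip_descriptor(image, descriptor_fn, rotate_fn, rescale_fn,
                           angles=(0, 90, 180, 270), scales=(0.75, 1.0, 1.25)):
            """descriptor_fn(img) -> 1-D ConvNet descriptor; rotate_fn / rescale_fn are
            image transformations supplied by the caller."""
            per_scale = []
            for s in scales:
                scaled = rescale_fn(image, s)
                # Inner pooling (order 1, i.e. an average) over the rotation group.
                rot_pooled = moment_pool([descriptor_fn(rotate_fn(scaled, a)) for a in angles], p=1)
                per_scale.append(rot_pooled)
            # Outer pooling with a higher-order moment over the scale group.
            d = moment_pool(per_scale, p=2)
            return d / (np.linalg.norm(d) + 1e-12)  # unit-normalize before hashing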

    Convolutional Network Representation for Visual Recognition

    Image representation is a key component in visual recognition systems. In a visual recognition problem, the model should be able to learn and infer the quality of certain visual semantics in the image. Therefore, it is important for the model to represent the input image in a way that the semantics of interest can be inferred easily and reliably. This thesis, written as a compilation of publications, looks into the Convolutional Network (ConvNet) representation in visual recognition problems from an empirical perspective. A Convolutional Network is a special class of Neural Network with a hierarchical structure in which the output of every layer (except the last) is the input of another. It has been shown that ConvNets are powerful tools for learning a generic representation of an image. In this body of work, we first showed that this is indeed the case, and that a ConvNet representation with a simple classifier can outperform highly tuned pipelines based on hand-crafted features. To be precise, we first trained a ConvNet on a large dataset; then, for every image in another task with a small dataset, we fed the image forward through the ConvNet and took the network's activations at a certain layer as the image representation. Transferring the knowledge from the large dataset (source task) to the small dataset (target task) proved to be effective and outperformed baselines on a variety of visual recognition tasks. We also evaluated the presence of spatial visual semantics in the ConvNet representation and observed that a ConvNet retains significant spatial information despite never having been explicitly trained to preserve low-level semantics. We then investigated the factors that affect the transferability of these representations. We studied various factors on a diverse set of visual recognition tasks and found a consistent correlation between the effect of those factors and the similarity of the target task to the source task. This intuition, alongside the experimental results, provides a guideline for improving the performance of visual recognition tasks using ConvNet features. Finally, we addressed the task of visual instance retrieval specifically, as an example of how these simple intuitions can massively increase performance on the target task.

    A BASELINE FOR VISUAL INSTANCE RETRIEVAL WITH DEEP CONVOLUTIONAL NETWORKS

    ABSTRACT: This work presents simple pipelines for visual instance retrieval that exploit image representations based on convolutional networks (ConvNets), and demonstrates that ConvNet image representations outperform other state-of-the-art image representations on six standard image retrieval datasets. ConvNet-based image features have increasingly permeated the field of computer vision and are replacing hand-crafted features in many established application domains, and much recent work has illuminated how to design and train ConvNets to maximize performance. Besides performance, another issue for visual instance retrieval is the dimensionality and memory requirement of the image representation. Usually two separate categories are considered, for which we report results: small-footprint representations, which encode each image with less than 1 kbyte, and medium-footprint representations, whose dimensionality lies between 10k and 100k. The small regime is required when the number of images is huge and memory is a bottleneck, while the medium regime is more useful when the number of images is below 50k. In our pipeline for the small regime we extract features from 576×576 images; for the medium regime we combine those features with a spatial search method. RESULTS SUMMARY: 1) To evaluate our model we used two networks, the first of which we refer to as AlexNet (Krizhevsky et al.); with that network, Oxford5k performance drops to 82.6 while Paris6k performance increases to 87.5. 2) Our pipelines are the first that work for both texture-less items (e.g. sculptures) and highly textured items (e.g. buildings) using exactly the same settings. 3) Previous methods are often specialized, learning their parameters on similar datasets, and can therefore suffer from domain shift; our pipeline, on the other hand, does not rely on the bias of the dataset but can still be specialized to a high degree (e.g. by fine-tuning the OxfordNet on a landmark dataset). In sum, the work shows that ConvNet image representations outperform other state-of-the-art image representations for visual instance retrieval if one selects the appropriate responses from a generic deep ConvNet. Our result should be viewed only as a baseline, and by no means do we claim that our method is optimal; even simple additions, such as concatenating representations from different architectures, give a boost in performance (e.g. 87.2 for Oxford5k). Acknowledgment: We would like to thank NVIDIA Co. for the generous donation of K40 GPUs.
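
    A minimal sketch of the small-footprint regime described above: one global ConvNet descriptor per image, L2-normalized and ranked by Euclidean distance. The descriptor function is a placeholder (e.g. a pooled convolutional layer computed on 576×576 inputs), and the spatial search used in the medium regime is omitted.

        # Sketch: small-footprint instance retrieval with one global descriptor per image.
        import numpy as np

        def build_index(image_paths, descriptor_fn):
            """Stack L2-normalized descriptors of the database images into an index matrix."""
            X = np.stack([descriptor_fn(p) for p in image_paths])
            X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
            return X

        def retrieve(query_path, index, descriptor_fn, top_k=10):
            q = descriptor_fn(query_path)
            q /= np.linalg.norm(q) + 1e-12
            # For unit vectors, ranking by Euclidean distance equals ranking by dot product.
            scores = index @ q
            return np.argsort(-scores)[:top_k]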

    Estimating Attention in Exhibitions Using Wearable Cameras

    This paper demonstrates a system for automatic detection of visual attention and identification of salient items at exhibitions (e.g. a museum or an auction). The method is offline and operates on video captured by a head-mounted camera. Towards the estimation of attention, we define the notions of "saliency" and "interestingness" for exhibition items. Our method combines multiple state-of-the-art techniques from different vision tasks such as tracking, image matching, and retrieval. Many experiments are conducted to evaluate multiple aspects of the method, which has proven robust to image blur, occlusion, truncation, and dimness. The experiments show strong performance on the tasks of matching items, estimating focus frames, and detecting salient and interesting items. This can be useful to commercial vendors and museum curators, helping them understand which items appeal most to visitors.
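
    A hypothetical, heavily simplified sketch of how frames from the head-mounted camera could be matched to exhibition items and turned into an attention score; the descriptors, similarity threshold, and the frame-count proxy for "saliency" are assumptions for illustration, whereas the actual system combines tracking, image matching, and retrieval as described above.

        # Hypothetical sketch: match each head-camera frame to a gallery of exhibition-item
        # images with a global descriptor, then score each item by how long it was attended.
        import numpy as np

        def match_frames(frame_descs, item_descs, min_sim=0.6):
            """frame_descs: (F, d); item_descs: (I, d); both assumed L2-normalized.
            Returns, per frame, the matched item index or -1 if nothing is similar enough."""
            sims = frame_descs @ item_descs.T
            best = sims.argmax(axis=1)
            best_sim = sims.max(axis=1)
            return np.where(best_sim >= min_sim, best, -1)

        def item_attention_seconds(matches, n_items, fps=30.0):
            """Seconds of attention per item: the proxy for 'saliency' used in this sketch."""
            counts = np.bincount(matches[matches >= 0], minlength=n_items)
            return counts / fps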