
    Hierarchical Decomposition of Large Deep Networks

    Teaching computers how to recognize people and objects from visual cues in images and videos is an interesting challenge. The computer vision and pattern recognition communities have already demonstrated the ability of intelligent algorithms to detect and classify objects under difficult conditions such as varying pose, occlusion and low image fidelity. Recent deep learning approaches in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) are built using very large and deep convolutional neural network architectures. In 2015, such architectures surpassed human performance (94.9% human vs 95.06% machine) in top-5 validation accuracy on the ImageNet dataset, and earlier this year deep learning approaches demonstrated a remarkable 96.43% accuracy. These successes have been made possible by deep architectures such as VGG, GoogLeNet, and most recently by deep residual models with as many as 152 weight layers. Training these deep models is difficult due to the compute-intensive learning of millions of parameters. To keep the parameter count manageable, very small 3x3 filters are used in the convolutional layers of very deep networks. At the same time, deep networks generalize well and transfer effectively to other datasets, including complex datasets with fewer features or images. This thesis proposes a robust approach to large scale visual recognition by introducing a framework that automatically analyses the similarity between the classes in a dataset and configures a family of smaller networks to replace a single larger network. Similar classes are grouped together and learnt by a smaller network. This divides and conquers the large classification problem: the class category is identified from its coarse label down to its fine label by deploying two or more stages of networks. In this way the proposed framework learns the natural class hierarchy and uses it effectively for classification. A comprehensive analysis of the proposed methods shows that hierarchical models outperform traditional models in accuracy and in reduced computation, and expand the ability to learn large scale visual information effectively.
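
    As an illustration of the coarse-to-fine idea, the sketch below runs a two-stage classification in which a small root model picks a coarse group and a group-specific expert picks the fine label. It is a minimal sketch with random weights standing in for trained networks; the group structure, dimensions and names are hypothetical, not the thesis's actual models.

        import numpy as np

        rng = np.random.default_rng(0)

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        # Stage 1: a small "root" model predicts one of 3 coarse groups.
        coarse_W = rng.normal(size=(3, 64))
        # Stage 2: one small "expert" model per group predicts one of 10 fine labels.
        expert_W = {g: rng.normal(size=(10, 64)) for g in range(3)}

        def classify(x):
            group = int(np.argmax(softmax(coarse_W @ x)))        # coarse label
            fine = int(np.argmax(softmax(expert_W[group] @ x)))  # fine label within the group
            return group, fine

        x = rng.normal(size=64)   # stand-in for CNN features of one image
        print(classify(x))        # e.g. (1, 4): group 1, class 4 of that group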

    Improving Efficiency in Deep Learning for Large Scale Visual Recognition

    Recently emerging large scale visual recognition methods, and in particular deep Convolutional Neural Networks (CNN), promise to revolutionize many computer-vision-based artificial intelligence applications, such as autonomous driving and online image retrieval systems. One of the main challenges in large scale visual recognition is the complexity of the corresponding algorithms. This is further exacerbated by the fact that in most real-world scenarios they need to run in real time and on platforms with limited computational resources. This dissertation focuses on improving the efficiency of such large scale visual recognition algorithms from several perspectives. First, to reduce the complexity of large scale classification to sub-linear in the number of classes, a probabilistic label tree framework is proposed. A test sample is classified by traversing the label tree from the root node. Each node in the tree is associated with a probabilistic estimate over all the labels. The tree is learned recursively with iterative maximum likelihood optimization. Compared to the hard label partitions proposed previously, the probabilistic framework performs classification more accurately with similar efficiency. Second, we explore the redundancy of parameters in Convolutional Neural Networks (CNN) and employ sparse decomposition to significantly reduce both the number of parameters and the computational complexity. Both inter-channel and intra-channel redundancy are exploited to achieve more than 90% sparsity with approximately a 1% drop in classification accuracy. We also propose an efficient CPU-based sparse matrix multiplication algorithm to reduce the actual running time of CNN models with sparse convolutional kernels. Third, we propose a multi-stage framework based on CNNs to achieve better efficiency than a single traditional CNN model. Combining a cascade model with the label tree framework, the proposed method divides the input images in both the image space and the label space, and processes each image with the CNN models that are most suitable and efficient. The average complexity of the framework is significantly reduced, while the overall accuracy remains the same as that of the single complex model.
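
    The sketch below illustrates probabilistic label-tree inference under simplifying assumptions: each node scores only its own children (the dissertation's nodes estimate probabilities over all labels), and linear scorers with random weights stand in for learned models. All names and dimensions are illustrative.

        import numpy as np

        class Node:
            def __init__(self, children=None, label=None, W=None):
                self.children = children or []   # child Nodes (empty at a leaf)
                self.label = label               # fine label, set at a leaf
                self.W = W                       # per-child linear scorer

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def classify(node, x):
            # Traverse from the root, descending into the most probable child.
            while node.children:
                p = softmax(node.W @ x)          # probability of each child
                node = node.children[int(np.argmax(p))]
            return node.label

        rng = np.random.default_rng(0)
        leaves = [Node(label=i) for i in range(4)]
        left = Node(children=leaves[:2], W=rng.normal(size=(2, 16)))
        right = Node(children=leaves[2:], W=rng.normal(size=(2, 16)))
        root = Node(children=[left, right], W=rng.normal(size=(2, 16)))
        print(classify(root, rng.normal(size=16)))  # one of the labels 0..3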

    Objects classification in still images using the region covariance descriptor

    The goal of object classification is to classify the objects in images. Classification aims at the recognition of generic classes, which is also known as Generic Object Recognition. This is quite different from Specific Object Recognition, such as recognizing a specific person or one's own car. Human beings are generally better at recognizing generic classes than specific objects, yet classification is a much harder problem for artificial systems to solve. A classification algorithm must be robust to changes in illumination, object scale and viewpoint. The algorithm also has to manage large intra-class variations and small inter-class variations. In recent literature, some classification methods use the Bag of Visual Words model. In this work the main emphasis is on the region descriptor and the representation of training images. Given a set of training images, interest points are detected with interest point detectors, and the region around an interest point is described by a descriptor. The region covariance descriptor is adopted from Porikli et al. [21], who used it for object detection and classification. Here, the region covariance descriptor is combined with the Bag of Visual Words model. We have used a different set of features for the classification task: the covariance of d features, e.g. spatial location, a Gaussian kernel with three different σ values, first-order Gaussian derivatives with two different σ values, and second-order Gaussian derivatives with four different σ values, characterizes a region of interest. An image is also represented by Bag of Visual Words obtained with both SIFT and covariance descriptors. We worked on five datasets: Caltech-4, Caltech-3, Animal, Caltech-10, and Flower (17 classes), with the first four taken from the Caltech-256 and Caltech-101 datasets. Many researchers have used the Caltech-4 dataset for the object classification task. The region covariance descriptor outperforms the SIFT descriptor on both the Caltech-4 and Caltech-3 datasets, whereas the combined representation (SIFT + Covariance) outperforms both SIFT and Covariance alone.
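
    The following sketch shows the core of a region covariance descriptor: per-pixel feature vectors over a region are summarized by their d x d covariance matrix. The feature set here (spatial location, intensity and first-order derivatives) is a simplified stand-in for the Gaussian-derivative features described above, not the thesis's exact feature set.

        import numpy as np

        def region_covariance(img, top, left, h, w):
            patch = img[top:top + h, left:left + w].astype(float)
            ys, xs = np.mgrid[0:h, 0:w]                 # per-pixel spatial location
            dy, dx = np.gradient(patch)                 # first-order derivatives
            feats = np.stack([xs, ys, patch, dx, dy])   # d = 5 features per pixel
            F = feats.reshape(5, -1)                    # d x (number of pixels)
            return np.cov(F)                            # d x d covariance descriptor

        img = np.random.default_rng(0).random((64, 64))
        C = region_covariance(img, 8, 8, 32, 32)
        print(C.shape)  # (5, 5)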

    ImageNet Large Scale Visual Recognition Challenge

    The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
    Comment: 43 pages, 16 figures. v3 includes additional comparisons with PASCAL VOC (per-category comparisons in Table 3, distribution of localization difficulty in Fig 16), a list of queries used for obtaining object detection images (Appendix C), and some additional references
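
    For concreteness, the sketch below computes the top-5 accuracy metric used in ILSVRC-style classification, where a prediction counts as correct if the true label appears among a model's five highest-scoring classes. The scores and labels are random placeholders, not challenge data.

        import numpy as np

        def top5_accuracy(scores, labels):
            # scores: (n_images, n_classes); labels: (n_images,)
            top5 = np.argsort(scores, axis=1)[:, -5:]        # 5 best classes per image
            hits = (top5 == labels[:, None]).any(axis=1)     # true label among them?
            return hits.mean()

        rng = np.random.default_rng(0)
        scores = rng.normal(size=(1000, 1000))   # 1000 images, 1000 classes
        labels = rng.integers(0, 1000, size=1000)
        print(top5_accuracy(scores, labels))     # ~0.005 for random scores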

    Multiscale Discriminant Saliency for Visual Attention

    Bottom-up saliency, an early stage of human visual attention, can be considered as a binary classification problem between center and surround classes. The discriminant power of features for this classification is measured as the mutual information between the features and the two-class distribution. The estimated discrepancy between the two feature classes depends strongly on the scale levels considered; therefore, multi-scale structure and discriminant power are integrated by employing discrete wavelet features and a Hidden Markov Tree (HMT). With wavelet coefficients and Hidden Markov Tree parameters, quad-tree-like label structures are constructed and used for maximum a posteriori (MAP) estimation of the hidden class variables at the corresponding dyadic sub-squares. A saliency value for each dyadic square at each scale level is then computed from the discriminant power principle and the MAP estimate. Finally, the saliency maps across multiple scales are integrated into the final saliency map by an information maximization rule. Both standard quantitative tools such as NSS, LCC and AUC and qualitative assessments are used to evaluate the proposed multiscale discriminant saliency method (MDIS) against the well-known information-based saliency method AIM on the Bruce database with eye-tracking data. Simulation results are presented and analyzed to verify the validity of MDIS as well as to point out its disadvantages as directions for further research.
    Comment: 16 pages, ICCSA 2013 - BIOCA session
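
    A minimal sketch of the discriminant power measure follows: the mutual information between a quantized feature and the binary center/surround class, estimated from a joint histogram. The synthetic feature below is class-dependent, so the estimate comes out positive; the paper's actual features are wavelet/HMT-based, which this sketch does not model.

        import numpy as np

        def mutual_information(feature, cls, bins=16):
            # Quantize the feature into `bins` levels, then build the joint
            # histogram of (feature bin, class label in {0, 1}).
            edges = np.histogram_bin_edges(feature, bins)
            f = np.digitize(feature, edges[1:-1])            # values in 0..bins-1
            joint = np.zeros((bins, 2))
            np.add.at(joint, (f, cls), 1.0)
            p = joint / joint.sum()                          # joint distribution
            pf = p.sum(axis=1, keepdims=True)                # feature marginal
            pc = p.sum(axis=0, keepdims=True)                # class marginal
            nz = p > 0
            return float((p[nz] * np.log2(p[nz] / (pf @ pc)[nz])).sum())

        rng = np.random.default_rng(0)
        cls = rng.integers(0, 2, size=5000)                    # 0 = surround, 1 = center
        feature = rng.normal(loc=cls.astype(float), scale=1.0) # class-dependent feature
        print(mutual_information(feature, cls))                # > 0: discriminative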