119 research outputs found

    The Discriminative Generalized Hough Transform for Localization of Highly Variable Objects and its Application for Surveillance Recordings

    This work addresses the localization of arbitrary objects in 2D images in general and the localization of persons in video surveillance recordings in particular. More precisely, it addresses the localization of specific landmarks. It evaluates the possibilities and limitations of localization approaches based on the Generalized Hough Transform (GHT), especially the Discriminative Generalized Hough Transform (DGHT). GHT-based approaches count the matching model and feature points for each candidate position, and the most likely target point position is the one with the highest count. Additionally, the DGHT comprises a statistical learning approach to generate optimal DGHT models, which has achieved good results on medical images. This work will show that the DGHT is not restricted to medical tasks but has issues with large target object variabilities, which are frequent in video surveillance tasks. Like all GHT-based approaches, the DGHT considers only the number of matching model-feature-point combinations, which means that all model points are treated independently. This work will show that model points are not independent of each other and that considering them independently results in high error rates. This drawback is analyzed and a universal solution, applicable not only to the DGHT but to all GHT-based approaches, is presented. This solution is based on an additional classifier that takes the whole set of matching model-feature-point combinations into account to estimate a confidence score. On all tested databases, this approach reduced the error rates drastically, by up to 94.9%. Furthermore, this work presents a general approach for combining multiple GHT models into a deeper model. This can be used to combine the localization results of different object landmarks such as mouth, nose, and eyes. Similar to Convolutional Neural Networks (CNNs), this splits the target object variability into multiple smaller variabilities. 
A comparison of GHT-based approaches with CNNs and a description of the advantages, disadvantages, and potential applications of both approaches will conclude this work.
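
The voting step that this abstract describes for GHT-based approaches can be illustrated with a minimal sketch. It assumes an unweighted model given as a list of offsets from the target point and ignores gradient-direction indexing and the learned model-point weights of the DGHT; the names `ght_vote` and `localize` and the toy data are hypothetical and not taken from the thesis.

```python
from collections import Counter


def ght_vote(model_offsets, feature_points):
    """Accumulate votes: every feature point, paired with every model point,
    votes for the target position (feature - offset)."""
    votes = Counter()
    for fx, fy in feature_points:
        for dx, dy in model_offsets:
            votes[(fx - dx, fy - dy)] += 1
    return votes


def localize(model_offsets, feature_points):
    """Return the position with the highest number of matching
    model-feature-point combinations, together with that count."""
    votes = ght_vote(model_offsets, feature_points)
    position, count = max(votes.items(), key=lambda kv: kv[1])
    return position, count


# Hypothetical toy data: a 3-point model placed at target (10, 20),
# plus one clutter feature point that matches nothing consistently.
model = [(1, 0), (0, 2), (-3, -1)]
target = (10, 20)
features = [(target[0] + dx, target[1] + dy) for dx, dy in model] + [(4, 4)]

best, n = localize(model, features)
print(best, n)
```

Because each vote is counted independently, the accumulator records only how many combinations agree on a position, not which ones. That is the independence assumption the abstract identifies as a source of errors.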

    Automatic Multi-Scale and Multi-Object Pedestrian and Car Detection in Digital Images Based on the Discriminative Generalized Hough Transform and Deep Convolutional Neural Networks

    Many approaches have been suggested for automatic pedestrian and car detection to cope with the large variability regarding object size, occlusion, background variability, aspect and so forth. Current state-of-the-art deep learning-based frameworks rely either on a proposal generation mechanism (e.g., "Faster R-CNN") or on the inspection of image quadrants / octants (e.g., "YOLO" or "SSD"), which are then further processed with deep convolutional neural networks (CNNs). In this thesis, the Discriminative Generalized Hough Transform (DGHT), which operates on edge images, is analyzed for application to automatic multi-scale and multi-object pedestrian and car detection in 2D digital images. The analysis motivates the use of the DGHT as an efficient proposal generation mechanism, followed by a proposal (bounding box) refinement and proposal acceptance or rejection based on a deep CNN. The impact of the different components of the resulting DGHT object detection pipeline, as well as of the amount of DGHT training data, on the detection performance is analyzed in detail. Due to the low false negative rate and the low number of candidates of the DGHT as well as the high classification accuracy of the CNN, performance competitive with the state of the art in pedestrian and car detection is obtained on the IAIR database with far fewer generated proposals than other proposal-generating algorithms, being outperformed only by YOLOv2 fine-tuned to IAIR cars. Evaluations on further databases (without retraining or adaptation) show the generalization capability of the DGHT object detection pipeline.

    Characterizing Objects in Images using Human Context

    Humans have an unmatched capability of interpreting detailed information about existent objects by just looking at an image. Particularly, they can effortlessly perform the following tasks: 1) Localizing various objects in the image and 2) Assigning functionalities to the parts of localized objects. This dissertation addresses the problem of aiding vision systems accomplish these two goals. The first part of the dissertation concerns object detection in a Hough-based framework. To this end, the independence assumption between features is addressed by grouping them in a local neighborhood. We study the complementary nature of individual and grouped features and combine them to achieve improved performance. Further, we consider the challenging case of detecting small and medium sized household objects under human-object interactions. We first evaluate appearance based star and tree models. While the tree model is slightly better, appearance based methods continue to suffer due to deficiencies caused by human interactions. To this end, we successfully incorporate automatically extracted human pose as a form of context for object detection. The second part of the dissertation addresses the tedious process of manually annotating objects to train fully supervised detectors. We observe that videos of human-object interactions with activity labels can serve as weakly annotated examples of household objects. Since such objects cannot be localized only through appearance or motion, we propose a framework that includes human centric functionality to retrieve the common object. Designed to maximize data utility by detecting multiple instances of an object per video, the framework achieves performance comparable to its fully supervised counterpart. The final part of the dissertation concerns localizing functional regions or affordances within objects by casting the problem as that of semantic image segmentation. 
To this end, we introduce a dataset involving human-object interactions with strong (i.e., pixel-level) and weak (i.e., click-point and image-level) affordance annotations. We propose a framework that utilizes both forms of weak labels and demonstrate that efforts for weak annotation can be further optimized using human context.

    Robust and real-time hand detection and tracking in monocular video

    In recent years, personal computing devices such as laptops, tablets and smartphones have become ubiquitous. Moreover, intelligent sensors are being integrated into many consumer devices such as eyeglasses, wristwatches and smart televisions. With the advent of touchscreen technology, a new human-computer interaction (HCI) paradigm arose that allows users to interface with their device in an intuitive manner. Using simple gestures, such as swipe or pinch movements, a touchscreen can be used to directly interact with a virtual environment. Nevertheless, touchscreens still form a physical barrier between the virtual interface and the real world. An increasingly popular field of research that tries to overcome this limitation, is video based gesture recognition, hand detection and hand tracking. Gesture based interaction allows the user to directly interact with the computer in a natural manner by exploring a virtual reality using nothing but his own body language. In this dissertation, we investigate how robust hand detection and tracking can be accomplished under real-time constraints. In the context of human-computer interaction, real-time is defined as both low latency and low complexity, such that a complete video frame can be processed before the next one becomes available. Furthermore, for practical applications, the algorithms should be robust to illumination changes, camera motion, and cluttered backgrounds in the scene. Finally, the system should be able to initialize automatically, and to detect and recover from tracking failure. We study a wide variety of existing algorithms, and propose significant improvements and novel methods to build a complete detection and tracking system that meets these requirements. Hand detection, hand tracking and hand segmentation are related yet technically different challenges. 
Whereas detection deals with finding an object in a static image, tracking considers temporal information and is used to follow the position of an object over time, throughout a video sequence. Hand segmentation is the task of estimating the hand contour, thereby separating the object from its background. Detection of hands in individual video frames allows us to automatically initialize our tracking algorithm, and to detect and recover from tracking failure. Human hands are highly articulated objects, consisting of finger parts that are connected by joints. As a result, the appearance of a hand can vary greatly, depending on the assumed hand pose. Traditional detection algorithms often assume that the appearance of the object of interest can be described using a rigid model and therefore cannot be used to robustly detect human hands. We therefore developed an algorithm that detects hands by exploiting their articulated nature. Instead of resorting to a template-based approach, we probabilistically model the spatial relations between different hand parts and the centroid of the hand. Detecting hand parts, such as fingertips, is much easier than detecting a complete hand. Based on our model of the spatial configuration of hand parts, the detected parts can be used to obtain an estimate of the complete hand's position. To comply with the real-time constraints, we developed techniques to speed up the process by efficiently discarding unimportant information in the image. Experimental results show that our method is competitive with the state of the art in object detection while reducing computational complexity by a factor of 1,000. Furthermore, we showed that our algorithm can also be used to detect other articulated objects such as persons or animals and is therefore not restricted to the task of hand detection. Once a hand has been detected, a tracking algorithm can be used to continuously track its position in time. 
We developed a probabilistic tracking method that can cope with uncertainty caused by image noise, incorrect detections, changing illumination, and camera motion. Furthermore, our tracking system automatically determines the number of hands in the scene, and can cope with hands entering or leaving the video canvas. We introduced several novel techniques that greatly increase tracking robustness, and that can also be applied in other domains than hand tracking. To achieve real-time processing, we investigated several techniques to reduce the search space of the problem, and deliberately employ methods that are easily parallelized on modern hardware. Experimental results indicate that our methods outperform the state-of-the-art in hand tracking, while providing a much lower computational complexity. One of the methods used by our probabilistic tracking algorithm, is optical flow estimation. Optical flow is defined as a 2D vector field describing the apparent velocities of objects in a 3D scene, projected onto the image plane. Optical flow is known to be used by many insects and birds to visually track objects and to estimate their ego-motion. However, most optical flow estimation methods described in literature are either too slow to be used in real-time applications, or are not robust to illumination changes and fast motion. We therefore developed an optical flow algorithm that can cope with large displacements, and that is illumination independent. Furthermore, we introduce a regularization technique that ensures a smooth flow-field. This regularization scheme effectively reduces the number of noisy and incorrect flow-vector estimates, while maintaining the ability to handle motion discontinuities caused by object boundaries in the scene. The above methods are combined into a hand tracking framework which can be used for interactive applications in unconstrained environments. 
To demonstrate the possibilities of gesture-based human-computer interaction, we developed a new type of computer display. This display is completely transparent, allowing multiple users to perform collaborative tasks while maintaining eye contact. Furthermore, our display produces an image that seems to float in thin air, such that users can touch the virtual image with their hands. This floating imaging display has been showcased at several national and international events and tradeshows. The research that is described in this dissertation has been evaluated thoroughly by comparing detection and tracking results with those obtained by state-of-the-art algorithms. These comparisons show that the proposed methods outperform most algorithms in terms of accuracy, while achieving a much lower computational complexity, resulting in a real-time implementation. Results are discussed in depth at the end of each chapter. This research further resulted in an international journal publication; a second journal paper that has been submitted and is under review at the time of writing this dissertation; nine international conference publications; a national conference publication; a commercial license agreement concerning the research results; two hardware prototypes of a new type of computer display; and a software demonstrator.

    Visual Tracking in Robotic Minimally Invasive Surgery

    Intra-operative imaging and robotics are some of the technologies driving forward better and more effective minimally invasive surgical procedures. To advance surgical practice and capabilities further, one of the key requirements for computationally enhanced interventions is to know how instruments and tissues move during the operation. While endoscopic video captures motion, the complex appearance and dynamic effects of surgical scenes are challenging for computer vision algorithms to handle robustly. Tackling both tissue and instrument motion estimation, this thesis proposes a combined non-rigid surface deformation estimation method to track tissue surfaces robustly, even in conditions with poor illumination. For instrument tracking, a keypoint-based 2D tracker that relies on the Generalized Hough Transform is developed to initialize a 3D tracker in order to robustly track surgical instruments through long sequences that contain complex motions. To handle appearance changes and occlusion, a patch-based adaptive weighting framework with segmentation and scale tracking is developed. It takes a tracking-by-detection approach, and a segmentation model is used to assign weights to template patches in order to suppress background information. The performance of the method is thoroughly evaluated, showing that without any offline training, the tracker works well even in complex environments. Finally, the thesis proposes a novel 2D articulated instrument pose estimation framework, which includes a detection-regression fully convolutional network and a multiple instrument parsing component. The framework achieves compelling performance and illustrates interesting properties, including transfer between different instrument types and between ex vivo and in vivo data. In summary, the thesis advances the state of the art in visual tracking for surgical applications for both tissue and instrument motion estimation. 
It contributes to developing the technological capability of full surgical scene understanding from endoscopic video.

    Visual Concept Detection in Images and Videos

    The rapidly increasing proliferation of digital images and videos leads to a situation where content-based search in multimedia databases becomes more and more important. A prerequisite for effective image and video search is to analyze and index media content automatically. Current approaches in the field of image and video retrieval focus on semantic concepts serving as an intermediate description to bridge the “semantic gap” between the data representation and the human interpretation. Due to the large complexity and variability in the appearance of visual concepts, the detection of arbitrary concepts represents a very challenging task. In this thesis, the following aspects of visual concept detection systems are addressed: First, enhanced local descriptors for mid-level feature coding are presented. Based on the observation that scale-invariant feature transform (SIFT) descriptors with different spatial extents yield large performance differences, a novel concept detection system is proposed that combines feature representations for different spatial extents using multiple kernel learning (MKL). A multi-modal video concept detection system is presented that relies on Bag-of-Words representations for visual and in particular for audio features. Furthermore, a method for the SIFT-based integration of color information, called color moment SIFT, is introduced. Comparative experimental results demonstrate the superior performance of the proposed systems on the Mediamill and on the VOC Challenge. Second, an approach is presented that systematically utilizes results of object detectors. Novel object-based features are generated based on object detection results using different pooling strategies. For videos, detection results are assembled to object sequences and a shot-based confidence score as well as further features, such as position, frame coverage or movement, are computed for each object class. 
These features are used as additional input for the support vector machine (SVM)-based concept classifiers. Thus, other related concepts can also profit from object-based features. Extensive experiments on the Mediamill, VOC and TRECVid Challenge show significant improvements in terms of retrieval performance not only for the object classes, but also in particular for a large number of indirectly related concepts. Moreover, it has been demonstrated that a few object-based features are beneficial for a large number of concept classes. On the VOC Challenge, the additional use of object-based features led to a superior performance for the image classification task of 63.8% mean average precision (AP). Furthermore, the generalization capabilities of concept models are investigated. It is shown that different source and target domains lead to a severe loss in concept detection performance. In these cross-domain settings, object-based features achieve a significant performance improvement. Since it is inefficient to run a large number of single-class object detectors, it is additionally demonstrated how a concurrent multi-class object detection system can be constructed to speed up the detection of many object classes in images. Third, a novel, purely web-supervised learning approach for modeling heterogeneous concept classes in images is proposed. Tags and annotations of multimedia data in the WWW are rich sources of information that can be employed for learning visual concepts. The presented approach is aimed at continuous long-term learning of appearance models and improving these models periodically. For this purpose, several components have been developed: a crawling component, a multi-modal clustering component for spam detection and subclass identification, a novel learning component, called “random savanna”, a validation component, an updating component, and a scalability manager. 
Only a single word describing the visual concept is required to initiate the learning process. Experimental results demonstrate the capabilities of the individual components. Finally, a generic concept detection system is applied to support interdisciplinary research efforts in the field of psychology and media science. The psychological research question addressed in the field of behavioral sciences is whether and how playing violent content in computer games may induce aggression. Therefore, novel semantic concepts, most notably “violence”, are detected in computer game videos to gain insights into the interrelationship of violent game events and the brain activity of a player. Experimental results demonstrate the excellent performance of the proposed automatic concept detection approach for such interdisciplinary research.
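
This abstract reports retrieval performance as mean average precision. As a rough illustration of that metric, the sketch below computes the non-interpolated average precision of a ranked list and averages it over concept classes. The abstract does not state the exact evaluation protocol, and challenge benchmarks often use interpolated variants, so this is an assumption; the function names and toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@k over the ranks k
    at which a relevant item (label == 1) is retrieved."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_class):
    """Average the per-class AP values; per_class is a list of (scores, labels)."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)


# Hypothetical classifier scores for two concept classes.
class_a = ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])  # relevant items at ranks 1 and 3
class_b = ([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])  # perfect ranking

ap_a = average_precision(*class_a)
ap_b = average_precision(*class_b)
m_ap = mean_average_precision([class_a, class_b])
print(ap_a, ap_b, m_ap)
```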

    3D Classification of Power Line Scene Using Airborne Lidar Data

    Failure to adequately maintain vegetation within a power line corridor has been identified as a main cause of the August 14, 2003 electric power blackout. Consequently, timely and accurate corridor mapping and monitoring are indispensable to mitigate such disasters. Moreover, airborne LiDAR (Light Detection And Ranging) has recently been introduced and is widely used in industry and academia thanks to its potential to automate data processing for scene analysis, including power line corridor mapping. However, today’s corridor mapping practice using LiDAR in industry remains an expensive manual process that is not suitable for the large-scale, rapid commercial compilation of corridor maps. Additionally, in academia only a few studies have developed algorithms capable of recognizing corridor objects in the power line scene, and these are mostly based on 2-dimensional classification. Thus, the objective of this dissertation is to develop a 3-dimensional classification system that can automatically identify key objects in the power line corridor from large-scale LiDAR data. This dissertation introduces new features for power structures, especially for the electric pylon, alongside existing features, all derived through diverse piecewise (i.e., point, line and plane) feature extraction, and then constructs a classification model pool by building individual models according to the piecewise feature sets and diverse voltage training samples using Random Forests. Finally, this dissertation proposes a Multiple Classifier System (MCS) which provides an optimal committee of models from the model pool for classification of a new incoming power line scene. The proposed MCS has been tested on a power line corridor where medium voltage transmission lines (115 kV and 230 kV) pass. 
The classification results of the MCS, obtained by optimally selecting the pre-built classification models according to the voltage type of the test corridor, demonstrate good accuracy (89.07%) and a computationally effective time cost (approximately 4 hours/km) without additional training cost.
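
The idea of picking a committee from a pre-built model pool according to the corridor's voltage type can be sketched as follows. This is only an illustration under strong assumptions: the Random Forest models are replaced by hand-written threshold rules, the committee is combined by simple majority vote, and all names, thresholds, and class labels are hypothetical rather than taken from the dissertation.

```python
from collections import Counter


def select_committee(model_pool, voltage_kv):
    """Pick the pre-built models trained for the given voltage type."""
    return [m for m in model_pool if m["voltage_kv"] == voltage_kv]


def committee_predict(committee, point):
    """Majority vote over the committee members' class predictions."""
    votes = Counter(m["predict"](point) for m in committee)
    return votes.most_common(1)[0][0]


# Hypothetical pool: one toy rule per piecewise feature set and voltage type.
model_pool = [
    {"voltage_kv": 115, "features": "point",
     "predict": lambda p: "wire" if p["height_m"] > 12 else "vegetation"},
    {"voltage_kv": 115, "features": "line",
     "predict": lambda p: "wire" if p["linearity"] > 0.8 else "vegetation"},
    {"voltage_kv": 115, "features": "plane",
     "predict": lambda p: "building" if p["planarity"] >= 0.5 else "vegetation"},
    {"voltage_kv": 230, "features": "point",
     "predict": lambda p: "wire" if p["height_m"] > 20 else "vegetation"},
]

# Hypothetical LiDAR point with precomputed features.
lidar_point = {"height_m": 15.0, "linearity": 0.9, "planarity": 0.2}

committee = select_committee(model_pool, voltage_kv=115)
label = committee_predict(committee, lidar_point)
print(len(committee), label)
```

Selecting models by voltage type reuses the existing pool for a new corridor, which is consistent with the abstract's remark that no additional training is needed.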

    Deep Learning for Detection and Segmentation in High-Content Microscopy Images

    High-content microscopy led to many advances in biology and medicine. This fast emerging technology is transforming cell biology into a big data driven science. Computer vision methods are used to automate the analysis of microscopy image data. In recent years, deep learning became popular and had major success in computer vision. Most of the available methods are developed to process natural images. Compared to natural images, microscopy images pose domain specific challenges such as small training datasets, clustered objects, and class imbalance. In this thesis, new deep learning methods for object detection and cell segmentation in microscopy images are introduced. For particle detection in fluorescence microscopy images, a deep learning method based on a domain-adapted Deconvolution Network is presented. In addition, a method for mitotic cell detection in heterogeneous histopathology images is proposed, which combines a deep residual network with Hough voting. The method is used for grading of whole-slide histology images of breast carcinoma. Moreover, a method for both particle detection and cell detection based on object centroids is introduced, which is trainable end-to-end. It comprises a novel Centroid Proposal Network, a layer for ensembling detection hypotheses over image scales and anchors, an anchor regularization scheme which favours prior anchors over regressed locations, and an improved algorithm for Non-Maximum Suppression. Furthermore, a novel loss function based on Normalized Mutual Information is proposed which can cope with strong class imbalance and is derived within a Bayesian framework. For cell segmentation, a deep neural network with increased receptive field to capture rich semantic information is introduced. Moreover, a deep neural network which combines both paradigms of multi-scale feature aggregation of Convolutional Neural Networks and iterative refinement of Recurrent Neural Networks is proposed. 
To increase the robustness of the training and improve segmentation, a novel focal loss function is presented. In addition, a framework for black-box hyperparameter optimization for biomedical image analysis pipelines is proposed. The framework has a modular architecture that separates hyperparameter sampling and hyperparameter optimization. A visualization of the loss function based on infimum projections is suggested to obtain further insights into the optimization problem. Also, a transfer learning approach is presented, which uses only one color channel for pre-training and performs fine-tuning on more color channels. Furthermore, an approach for unsupervised domain adaptation for histopathological slides is presented. Finally, Galaxy Image Analysis is presented, a platform for web-based microscopy image analysis. Galaxy Image Analysis workflows for cell segmentation in cell cultures, particle detection in mouse brain tissue, and MALDI/H&E image registration have been developed. The proposed methods were applied to challenging synthetic as well as real microscopy image data from various microscopy modalities. The proposed methods yield state-of-the-art or improved results. The methods were benchmarked in international image analysis challenges and used in various cooperation projects with biomedical researchers.
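
The abstract mentions an improved Non-Maximum Suppression algorithm for centroid-based detection but does not describe it. For orientation only, the sketch below shows the standard greedy baseline adapted to point detections: candidates are visited in order of decreasing score, and a candidate is kept only if it lies at least `min_distance` away from every detection already kept. The function name, the distance criterion, and the toy data are assumptions.

```python
import math


def centroid_nms(detections, min_distance):
    """Greedy NMS for point detections given as (x, y, score) tuples."""
    kept = []
    for x, y, score in sorted(detections, key=lambda d: d[2], reverse=True):
        # Keep the candidate only if no stronger kept detection is too close.
        if all(math.hypot(x - kx, y - ky) >= min_distance for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


# Hypothetical centroid hypotheses: three near (10, 10) and one near (30, 30).
candidates = [(10, 10, 0.9), (11, 10, 0.8), (30, 30, 0.7), (10, 12, 0.6)]
final = centroid_nms(candidates, min_distance=5)
print(final)
```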

    Lip print based authentication in physical access control Environments

    Abstract: In modern society, there is an ever-growing need to determine the identity of a person in many applications including computer security, financial transactions, borders, and forensics. Early automated methods of authentication relied mostly on possessions and knowledge. Notably, authentication methods such as passwords and access cards are based on properties that can be lost, stolen, forgotten, or disclosed. Biometric recognition addresses these shortcomings by identifying a person based on their physiological or behavioural characteristics. However, due to the diverse nature of biometric applications (e.g., from unlocking a mobile phone to crossing an international border), no biometric trait is likely to be ideal and satisfy the criteria for all applications. Therefore, it is necessary to investigate novel biometric modalities to establish the identity of individuals on occasions where techniques such as fingerprint or face recognition are unavailable. One such modality, which originates from forensic practices and has gained much attention in recent years, is the lip. This research study considers the use of computer vision methods to recognise different lip prints for achieving the task of identification. To determine whether the research problem of the study is valid, a literature review is conducted which helps identify the problem areas and the different computer vision methods that can be used for achieving lip print recognition. Accordingly, the study builds on these areas and proposes lip print identification experiments with varying models which identify individuals solely based on their lip prints, and provides guidelines for the implementation of the proposed system. Ultimately, the experiments encapsulate the broad categories of methods for achieving lip print identification. 
The implemented computer vision pipelines contain different stages including data augmentation, lip detection, pre-processing, feature extraction, feature representation and classification. Three pipelines were implemented from the proposed model: a traditional machine learning pipeline, a deep learning-based pipeline and a deep hybrid-learning-based pipeline. Different metrics reported in the literature are used to assess the performance of the prototype, such as IoU, mAP, accuracy, precision, recall, F1 score, EER, ROC curve, PR curve, and accuracy and loss curves. The first pipeline of the current study is a classical pipeline which employs a facial landmark detector (One Millisecond Face Alignment algorithm) to detect the lip, SURF for feature extraction, BoVW for feature representation and an SVM or K-NN classifier. The second pipeline makes use of the facial landmark detector and a VGG16 or ResNet50 architecture. The findings reveal that ResNet50 is the best performing method for lip print identification in the current study. The third pipeline also employs the facial landmark detector, with the ResNet50 architecture for feature extraction and an SVM classifier. The experiments are validated and benchmarked to determine how well the prototype achieves lip print identification. The benchmark results for the prototype indicate that the study accomplishes the objective of identifying individuals based on their lip prints using computer vision methods. The results also show that the use of deep learning architectures such as ResNet50 yields promising results. M.Sc. (Science)
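
Several of the metrics listed in this abstract follow directly from counts of true positives, false positives, and false negatives. The sketch below computes precision, recall, and F1 score for one class treated as positive (for example, one enrolled identity versus all others). How the study aggregates these metrics across identities is not stated in the abstract, so the one-versus-rest framing, the function name, and the toy labels are assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = target identity, 0 = any other person.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```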

    Boosting for Generic 2D/3D Object Recognition

    Generic object recognition is an important function of the human visual system. For an artificial vision system to be able to emulate human perception abilities, it should also be able to perform generic object recognition. In this thesis, we address the generic object recognition problem and present different approaches and models which tackle different aspects of this difficult problem. First, we present a model for generic 2D object recognition from complex 2D images. The model exploits only appearance-based information, in the form of a combination of texture and color cues, for binary classification of 2D object classes. Learning is accomplished in a weakly supervised manner using Boosting. However, we live in a 3D world and the ability to recognize 3D objects is very important for any vision system. Therefore, we present a model for generic recognition of 3D objects from range images. Our model makes use of a combination of simple local shape descriptors extracted from range images for recognizing 3D object categories, as shape is important information provided by range images. Moreover, we present a novel dataset for generic object recognition that provides 2D and range images of different object classes, captured using a Time-of-Flight (ToF) camera. As the surrounding world contains thousands of different object categories, recognizing many different object classes is important as well. Therefore, we extend our generic 3D object recognition model to deal with the multi-class learning and recognition task. Moreover, we extend the multi-class recognition model by introducing a novel model which uses a combination of appearance-based information extracted from 2D images and range-based (shape) information extracted from range images for multi-class generic 3D object recognition, and promising results are obtained.