297 research outputs found

    ModDrop: adaptive multi-modal gesture recognition

    We present a method for gesture detection and localisation based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at three temporal scales. Key to our technique is a training strategy which exploits: i) careful initialization of individual modalities; and ii) gradual fusion involving random dropping of separate channels (dubbed ModDrop) for learning cross-modality correlations while preserving uniqueness of each modality-specific representation. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams. Fusing multiple modalities at several spatial and temporal scales leads to a significant increase in recognition rates, allowing the model to compensate for errors of the individual classifiers as well as noise in the separate channels. Furthermore, the proposed ModDrop training technique makes the classifier robust to missing signals in one or several channels, so that it produces meaningful predictions from any number of available modalities. In addition, we demonstrate the applicability of the proposed fusion scheme to modalities of arbitrary nature by experiments on the same dataset augmented with audio. Comment: 14 pages, 7 figures
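The abstract describes ModDrop only as "random dropping of separate channels" during fusion training. The sketch below is one way to read that idea, not the authors' implementation: the function name `moddrop`, the `drop_prob` parameter, the dictionary layout, and the rule of always keeping at least one modality are all assumptions made for illustration.

```python
import random


def moddrop(features, drop_prob=0.5, rng=None):
    """Zero out whole modality channels at random (illustrative sketch).

    features:  dict mapping a modality name to its feature vector (list of floats).
    drop_prob: assumed per-modality probability of being dropped.
    At least one modality is always kept; this rule is an assumption of the sketch.
    """
    rng = rng or random.Random()
    names = list(features)
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        # Never drop every input at once: keep one modality chosen at random.
        kept = [rng.choice(names)]
    return {
        name: list(vec) if name in kept else [0.0] * len(vec)
        for name, vec in features.items()
    }


if __name__ == "__main__":
    sample = {"video": [0.2, 0.7], "depth": [0.5, 0.1], "audio": [0.9]}
    print(moddrop(sample, drop_prob=0.5, rng=random.Random(0)))
```

Applying a mask like this to each training sample forces the fused layers to cope with absent channels, which is the robustness property the abstract claims; at test time a missing modality would simply be fed as zeros.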

    Performance Analysis of Tracking on Mobile Devices using Local Binary Descriptors

    With the growing ubiquity of mobile devices, users are turning to their smartphones and tablets to perform more complex tasks than ever before. Computer vision tasks on mobile devices must be performed despite constraints on CPU performance, memory, and power consumption. One such task is object tracking, an important area of computer vision. The computational complexity of tracking algorithms makes them ideal candidates for optimization on mobile platforms. This thesis presents a mobile implementation of real-time object tracking. Currently, few tracking approaches take the resource constraints of mobile devices into consideration. Optimizing performance for mobile devices can result in better and more efficient tracking approaches for mobile applications such as augmented reality. The aim is to increase the frame rate at which an object is tracked and to reduce power consumption during tracking. For this thesis, we utilize binary descriptors, such as Binary Robust Independent Elementary Features (BRIEF), Oriented FAST and Rotated BRIEF (ORB), Binary Robust Invariant Scalable Keypoints (BRISK), and Fast Retina Keypoint (FREAK). The tracking performance of these descriptors is benchmarked on mobile devices. We consider an object tracking approach based on a dictionary of templates that involves generating keypoints of a detected object and candidate regions in subsequent frames. Descriptor matching between candidate regions in a new frame and the dictionary of templates identifies the location of the tracked object. These comparisons are often computationally intensive and require a great deal of memory and processing time. Google's Android operating system is used to implement the tracking application on a Samsung Galaxy series phone and tablet. Control of the Android camera is largely done through OpenCV's Android SDK. Power consumption is measured using the PowerTutor Android application. Other performance characteristics, such as processing time, are gathered using the Dalvik Debug Monitor Server (DDMS) tool included in the Android SDK. These metrics are used to evaluate the tracker's performance on mobile devices.
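Binary descriptors such as BRIEF, ORB, BRISK, and FREAK are bit strings, and they are conventionally compared with the Hamming distance. The abstract does not spell out the matching procedure, so the following is only a minimal brute-force sketch of matching candidate descriptors against a template dictionary; the function names and the `max_distance` threshold are hypothetical.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def match_descriptors(candidates, templates, max_distance=64):
    """Brute-force nearest-template matching (illustrative sketch).

    Returns (candidate_index, template_index, distance) for every candidate
    whose closest template lies within max_distance bits.
    max_distance is an assumed threshold, not a value from the thesis.
    """
    matches = []
    if not templates:
        return matches
    for i, cand in enumerate(candidates):
        j, dist = min(
            ((j, hamming(cand, tmpl)) for j, tmpl in enumerate(templates)),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            matches.append((i, j, dist))
    return matches


if __name__ == "__main__":
    cands = [b"\x00\x00", b"\xff\xff"]
    tmpls = [b"\xff\xfe", b"\x01\x00"]
    print(match_descriptors(cands, tmpls, max_distance=2))
```

The inner loop compares every candidate with every template, which illustrates why the abstract calls these comparisons computationally intensive on a phone or tablet.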

    Improving Bags-of-Words model for object categorization

    In the past decade, Bags-of-Words (BOW) models have become popular for the task of object recognition, owing to their good performance and simplicity. Some of the most effective recent methods for computer-based object recognition work by detecting and extracting local image features, quantizing them according to a codebook rule such as k-means clustering, and classifying them with conventional classifiers such as Support Vector Machines and Naive Bayes. In this thesis, a Spatial Object Recognition Framework is presented that consists of the four main contributions of the research. The first contribution, frequent keypoint pattern discovery, works by combining pairs and triplets of frequent keypoints in order to discover intermediate representations for object classes. Based on the same frequent keypoints principle, algorithms for locating the region-of-interest in training images are then discussed. Extensions to the successful Spatial Pyramid Matching scheme, in order to better capture spatial relationships, are then proposed. The pairs frequency histogram and shapes frequency histogram work by capturing more refined spatial information between local image features. Finally, alternative techniques to Spatial Pyramid Matching for capturing spatial information are presented. The proposed techniques, variations of binned log-polar histograms, divide the image into grids of different scale and different orientation, and thus explicitly capture the distribution of image features in both distance and orientation. Evaluations of the framework focus on several recent and popular datasets, covering image retrieval, object recognition, and object categorization. Overall, while the effectiveness of the framework is limited on some of the datasets, the proposed contributions are nevertheless powerful improvements of the BOW model.
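The baseline BOW pipeline mentioned in the abstract (quantize local features against a codebook, then classify the resulting histogram) can be sketched as follows. This is a generic illustration rather than the thesis framework: the codebook is assumed to be given already (for example from k-means), and the function names are hypothetical.

```python
def nearest_codeword(feature, codebook):
    """Index of the codeword closest to the feature (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, codebook[k])),
    )


def bow_histogram(features, codebook):
    """Normalized histogram of codeword assignments for one image (sketch)."""
    hist = [0] * len(codebook)
    for feature in features:
        hist[nearest_codeword(feature, codebook)] += 1
    total = sum(hist)
    # An image without features yields an all-zero histogram.
    return [h / total for h in hist] if total else [0.0] * len(codebook)


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (10.0, 10.0)]  # assumed, e.g. k-means centroids
    features = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (11.0, 10.0)]
    print(bow_histogram(features, codebook))
```

The histogram discards where each feature occurred in the image, which is the limitation that spatial extensions such as Spatial Pyramid Matching and the log-polar histograms described above are meant to address.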

    MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis

    Identity document recognition is an important sub-field of document analysis, which deals with the tasks of robust document detection, type identification, and text field recognition, as well as identity fraud prevention and document authenticity validation, given photos, scans, or video frames of an identity document capture. A significant amount of research has been published on this topic in recent years; however, a chief difficulty for such research is the scarcity of datasets, due to the subject matter being protected by security requirements. The few datasets of identity documents that are available lack diversity of document types, capturing conditions, or variability of document field values. In addition, the published datasets were typically designed only for a subset of document recognition problems, not for complex identity document analysis. In this paper, we present the MIDV-2020 dataset, which consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents, each with unique text field values and unique artificially generated faces, with rich annotation. For the presented benchmark dataset, baselines are provided for such tasks as document location and identification, text field recognition, and face detection. With 72409 annotated images in total, at the date of publication the proposed dataset is the largest publicly available identity document dataset with variable artificially generated data, and we believe that it will prove invaluable for the advancement of the field of document analysis and recognition. The dataset is available for download at ftp://smartengines.com/midv-2020 and http://l3i-share.univ-lr.fr

    Text Extraction From Natural Scene: Methodology And Application

    With the popularity of the Internet and smart mobile devices, there is an increasing demand for techniques and applications of image/video-based analytics and information retrieval. Most of these applications can benefit from text information extraction in natural scenes. However, scene text extraction is a challenging problem, due to the cluttered background of natural scenes and the multiple patterns of scene text itself. To solve these problems, this dissertation proposes a framework for scene text extraction. Scene text extraction in our framework is divided into two components, detection and recognition. Scene text detection finds the regions containing text in camera-captured images/videos. Text layout analysis based on gradient and color analysis is performed to extract candidate text strings from the cluttered background of a natural scene. Then text structural analysis is performed to design effective text structural features for distinguishing text from non-text outliers among the candidate text strings. Scene text recognition transforms image-based text in detected regions into readable text codes. The most basic and significant step in text recognition is scene text character (STC) prediction, which is multi-class classification among a set of text character categories. We design robust and discriminative feature representations for STC structure by integrating multiple feature descriptors, coding/pooling schemes, and learning models. Experimental results on benchmark datasets demonstrate the effectiveness and robustness of our proposed framework, which obtains better performance than previously published methods. Our proposed scene text extraction framework is applied to four scenarios: 1) reading printed labels on grocery packages for hand-held object recognition; 2) combining with car detection to localize license plates in camera-captured natural scene images; 3) reading indicative signage for assistive navigation in indoor environments; and 4) combining with object tracking to perform scene text extraction in video-based natural scenes. The proposed prototype systems and associated evaluation results show that our framework is able to solve the challenges in real applications.

    Three-dimensional Laser-based Classification in Outdoor Environments

    Robotics research strives to deploy autonomous systems in populated environments, such as inner-city traffic. Autonomous cars need reliable collision avoidance, but also object recognition to distinguish different classes of traffic participants. For both tasks, fast three-dimensional laser range sensors generating multiple accurate laser range scans per second, each consisting of a vast number of laser points, are often employed. In this thesis, we investigate and develop classification algorithms that allow us to automatically assign semantic labels to laser scans. We mainly face two challenges: (1) we have to ensure consistent and correct classification results and (2) we must efficiently process a vast number of laser points per scan. In consideration of these challenges, we cover both stages of classification: the feature extraction from laser range scans and the classification model that maps from the features to semantic labels. As for the feature extraction, we contribute by thoroughly evaluating important state-of-the-art histogram descriptors. We investigate critical parameters of the descriptors and experimentally show for the first time that the classification performance can be significantly improved using a large support radius and a global reference frame. As for learning the classification model, we contribute new algorithms that improve the classification efficiency and accuracy. Our first approach aims at deriving a consistent point-wise interpretation of the whole laser range scan. By combining efficient similarity-preserving hashing and multiple linear classifiers, we considerably improve the consistency of label assignments, requiring only minimal computational overhead compared to a single linear classifier. In the last part of the thesis, we aim at classifying objects represented by segments. We propose a novel hierarchical segmentation approach comprising multiple stages and a novel mixture classification model of multiple bag-of-words vocabularies. We demonstrate superior performance of both approaches compared to their single-component counterparts using challenging real-world datasets.