
    Local Binary Patterns in Focal-Plane Processing. Analysis and Applications

    Feature extraction is the part of pattern recognition in which sensor data are transformed into a form more suitable for the machine to interpret. The purpose of this step is also to reduce the amount of information passed to later stages of the system while preserving the information essential for discriminating the data into different classes. For instance, in image analysis the raw image intensities are vulnerable to various environmental effects, such as lighting changes, and feature extraction can serve to detect features that are invariant to certain types of illumination change. Finally, classification makes decisions based on the previously transformed data. The main focus of this thesis is on developing new methods for embedded feature extraction based on local non-parametric image descriptors. Feature analysis is also carried out for the selected image features. Low-level Local Binary Pattern (LBP) features play the main role in the analysis. In the embedded domain, a pattern recognition system must usually meet strict performance constraints, such as high speed, compact size and low power consumption. The characteristics of the final system can be seen as a trade-off between these metrics, which is largely determined by decisions made during the implementation phase. The implementation alternatives of LBP-based feature extraction are explored in the embedded domain in the context of focal-plane vision processors. In particular, the thesis demonstrates LBP extraction with the MIPA4k massively parallel focal-plane processor IC. Higher-level processing is also incorporated, by means of a framework for implementing a single-chip face recognition system. Furthermore, a new method for determining optical flow based on LBPs, designed in particular for the embedded domain, is presented. Inspired by some of the principles observed through the feature analysis of Local Binary Patterns, an extension to the well-known non-parametric rank transform is proposed, and its performance is evaluated in face recognition experiments on a standard dataset. Finally, an a priori model in which LBPs are seen as combinations of n-tuples is presented.
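    Since the basic LBP operator is central to the analysis above, a minimal sketch may help fix ideas. The following Python/NumPy function (names are illustrative, not taken from the thesis) computes the canonical 3x3 LBP code: each of the eight neighbours is thresholded against the centre pixel and the resulting bits are packed into one byte, which makes the code invariant to any monotonic grey-level change.

```python
# Minimal sketch of the canonical 3x3 LBP operator; illustrative only.
import numpy as np

def lbp_8(image):
    """Return the 8-bit LBP code for every interior pixel of a grey image."""
    img = np.asarray(image, dtype=np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # The 8 neighbours, enumerated clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Threshold against the centre and set the corresponding bit.
        codes |= (neighbour >= center).astype(np.uint8) << np.uint8(bit)
    return codes
```

    A histogram of these codes over an image region is the standard LBP feature; the parallel per-pixel comparisons are what make the operator a natural fit for the focal-plane processing discussed in the thesis.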

    Multi-dimensional local binary pattern texture descriptors and their application for medical image analysis

    Texture can be broadly stated as the spatial variation of image intensities. Texture analysis and classification is a well-researched area owing to its importance to many computer vision applications. Consequently, much research has focussed on deriving powerful and efficient texture descriptors. Local binary patterns (LBP) and their variants are simple yet powerful texture descriptors. LBP features describe the texture neighbourhood of a pixel using simple comparison operators, and are often calculated over varying neighbourhood radii to provide multi-resolution texture descriptions. A comprehensive evaluation of different LBP variants on a common benchmark dataset is missing from the literature. This thesis presents the performance of different LBP variants on texture classification and retrieval tasks. The results show that multi-scale local binary pattern variance (LBPV) gives the best performance over eight benchmarked datasets. Furthermore, improvements to the Dominant LBP (D-LBP), by ranking dominant patterns over the complete training set, and to the Compound LBP (CM-LBP), by considering 16-bit binary codes, are suggested and shown to outperform their original counterparts. The main contribution of the thesis is the introduction of multi-dimensional LBP features, which preserve the relationships between different scales by building a multi-dimensional histogram. The results on benchmarked classification and retrieval datasets clearly show that the multi-dimensional LBP (MD-LBP) improves on conventional multi-scale LBP. The same principle is applied to LBPV (MD-LBPV), again leading to improved performance. The proposed variants result in relatively large feature lengths, which is addressed using three different feature-length reduction techniques. Principal component analysis (PCA) is shown to give the best performance when the feature length is reduced to match that of conventional multi-scale LBP. The proposed multi-dimensional LBP variants are applied to medical image analysis. The first application is nailfold capillary (NC) image classification, where MD-LBPV performs best; for the second application, HEp-2 cell classification, MD-LBP performs best. Overall, the proposed texture descriptors give improved texture classification accuracy.
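    To make the multi-dimensional histogram idea concrete, the hedged sketch below builds a joint histogram over per-pixel LBP codes from several radii, instead of concatenating one histogram per radius; this joint indexing is what preserves the cross-scale relationships. The code maps and bin count are assumed inputs (e.g. 10 rotation-invariant uniform codes for 8 neighbours); the thesis' exact binning may differ.

```python
# Hedged sketch of a multi-dimensional (joint) multi-scale LBP histogram.
import numpy as np

def md_lbp_histogram(codes_per_scale, n_bins):
    """codes_per_scale: list of same-shaped integer LBP code maps, one per radius.
    n_bins: number of distinct codes per scale."""
    # Each pixel contributes one D-dimensional sample: its code at every scale.
    samples = np.stack([c.ravel() for c in codes_per_scale], axis=1)
    d = samples.shape[1]
    hist, _ = np.histogramdd(samples, bins=[n_bins] * d,
                             range=[(0, n_bins)] * d)
    hist = hist.ravel()
    return hist / hist.sum()  # normalised joint histogram of length n_bins ** d
```

    The joint feature length grows as n_bins ** n_scales rather than n_bins * n_scales, which is why the thesis pairs these variants with feature-length reduction techniques such as PCA.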

    Scene Image Classification and Retrieval

    Scene image classification and retrieval not only have a great impact on scene image management, but can also offer immeasurable assistance to other computer vision problems, such as image completion, human activity analysis and object recognition. Intuitively, scene identification is correlated with the recognition of objects or image regions, which suggests applying local features to scene categorization. Even though the adoption of local features in these tasks has yielded promising results, a global perception of scene images is also well supported by cognitive science studies. Since a global description of a scene imposes less computational burden, it is favoured by some scholars despite its lower discriminative capacity. Recent studies on global scene descriptors have even yielded classification performance that rivals results obtained by local approaches. The primary objective of this work is to tackle two limitations of existing global scene features: representational ineffectiveness and computational complexity. The thesis proposes two global scene features that represent finer scene structures and reduce the dimensionality of feature vectors. Experimental results show that the proposed scene features exceed the performance of existing methods. The thesis is roughly divided into two parts. The first three chapters give an overview of scene image classification and retrieval methods, with special attention to the most effective global scene features. In chapter 4, a novel scene descriptor, called ARP-GIST, is proposed and evaluated against existing methods to show its ability to detect finer scene structures. In chapter 5, a low-dimensional scene feature, GIST-LBP, is proposed. In conjunction with a block ranking approach, the GIST-LBP feature is tested on a standard scene dataset to demonstrate its state-of-the-art performance.
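    As a rough illustration of how a hybrid global descriptor in the spirit of GIST-LBP could be assembled, the sketch below concatenates a precomputed GIST vector with per-block LBP histograms. The grid size, bin count and normalisation are assumptions for illustration, not the thesis' actual design, and the block ranking step is omitted.

```python
# Hypothetical sketch of a GIST + block-wise LBP scene descriptor.
import numpy as np

def gist_lbp_descriptor(gist_vector, lbp_codes, grid=(4, 4), n_bins=256):
    """Concatenate a global GIST vector with per-block LBP histograms."""
    h, w = lbp_codes.shape
    block_hists = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = lbp_codes[i * h // grid[0]:(i + 1) * h // grid[0],
                              j * w // grid[1]:(j + 1) * w // grid[1]]
            hist = np.bincount(block.ravel(), minlength=n_bins).astype(float)
            block_hists.append(hist / max(hist.sum(), 1.0))  # per-block normalise
    return np.concatenate([np.asarray(gist_vector, dtype=float)] + block_hists)
```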

    Identification, synchronisation and composition of user-generated videos

    Joint doctorate (cotutela) between Universitat Politècnica de Catalunya and Queen Mary University of London. The increasing availability of smartphones is enabling people to capture videos of their experience when attending events such as concerts, sports competitions and public rallies. Smartphones are equipped with inertial sensors, which could be beneficial for event understanding. The captured User-Generated Videos (UGVs) are made available on media sharing websites. Searching and mining UGVs of the same event are challenging due to inconsistent tags or incorrect timestamps. A UGV recorded from a fixed location contains monotonic content and unintentional camera motions, which may make it less interesting to play back. In this thesis, we propose identification, synchronisation and video composition frameworks for UGVs. We propose a framework for the automatic identification and synchronisation of unedited multi-camera UGVs within a database. The proposed framework analyses the sound to match and cluster UGVs that capture the same spatio-temporal event, and estimates their relative time-shift to temporally align them. We design a novel descriptor derived from the pairwise matching of audio chroma features of UGVs. The descriptor facilitates the definition of a classification threshold for automatic query-by-example event identification. We contribute a database of 263 multi-camera UGVs of 48 real-world events. We evaluate the proposed framework on this database and compare it with state-of-the-art methods. Experimental results show the effectiveness of the proposed approach in the presence of audio degradations (channel noise, ambient noise, reverberation). Moreover, we present an automatic audio- and visual-based camera selection framework for composing an uninterrupted recording from synchronised multi-camera UGVs of the same event. We design an automatic audio-based cut-point selection method that provides a common reference for audio and video segmentation. To filter out low-quality video segments, spatial and spatio-temporal quality assessments are computed. The framework combines segments of UGVs using a rank-based camera selection strategy that considers visual quality scores and view diversity. The proposed framework is validated on a dataset of 13 events (93 UGVs) through subjective tests and compared with state-of-the-art methods. Suitable cut-point selection, specific visual quality assessments and rank-based camera selection contribute to the superiority of the proposed framework over existing methods. Finally, we contribute a method for camera motion detection using the gyroscope for UGVs captured with smartphones, and design a gyro-based quality score for video composition. The gyroscope measures the angular velocity of the smartphone, which can be used for camera motion analysis. We evaluate the proposed camera motion detection method on a dataset of 24 multi-modal UGVs captured by us, and compare it with existing visual and inertial sensor-based methods. By designing a gyro-based score to quantify the goodness of multi-camera UGVs, we develop a gyro-based video composition framework. The gyro-based score substitutes for the spatial and spatio-temporal scores and reduces the computational complexity.
We contribute a multi-modal dataset of 3 events (12 UGVs), which is used to validate the proposed gyro-based video composition framework.
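    The synchronisation step above can be pictured with a small sketch: given chroma features for two UGVs of the same event, the relative time-shift is the lag that maximises their average frame-wise similarity. This is a hedged illustration assuming precomputed 12 x T chroma matrices; the thesis' pairwise-matching descriptor is more elaborate than this plain cross-correlation.

```python
# Hedged sketch: estimate the time-shift between two recordings by
# cross-correlating their audio chroma features (12 x T matrices).
import numpy as np

def estimate_shift(chroma_a, chroma_b):
    """Return the lag (in frames) that best aligns chroma_b to chroma_a."""
    # L2-normalise columns so each overlapping frame contributes a cosine score.
    a = chroma_a / (np.linalg.norm(chroma_a, axis=0, keepdims=True) + 1e-9)
    b = chroma_b / (np.linalg.norm(chroma_b, axis=0, keepdims=True) + 1e-9)
    best_lag, best_score = 0, -np.inf
    for lag in range(-b.shape[1] + 1, a.shape[1]):
        ia, ib = max(lag, 0), max(-lag, 0)         # start of the overlap
        n = min(a.shape[1] - ia, b.shape[1] - ib)  # overlap length in frames
        if n <= 0:
            continue
        score = np.sum(a[:, ia:ia + n] * b[:, ib:ib + n]) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

    Averaging the score over the overlap length avoids biasing the estimate towards long overlaps; a threshold on the best score can then serve as the query-by-example classification criterion described above.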

    Learned image representations for visual recognition


    High-Level Facade Image Interpretation using Marked Point Processes

    In this thesis, we address facade image interpretation as one essential ingredient for the generation of highly detailed, semantically meaningful, three-dimensional city models. Given a single rectified facade image, we detect relevant facade objects such as windows, entrances, and balconies, yielding a description of the image in terms of accurate positions and sizes of these objects. Urban digital three-dimensional reconstruction and documentation is an active area of research with several potential applications, e.g., digital mapping for navigation, urban planning, emergency management, disaster control or the entertainment industry. A detailed building model which is not just a geometric object enriched with texture allows for semantic queries such as the number of floors or the location of balconies and entrances. Facade image interpretation is one essential step towards such models. In this thesis, we propose to interpret facade images by combining evidence for the occurrence of individual object classes, which we derive from data, with prior knowledge which guides the image interpretation in its entirety. We present a three-step procedure which generates features suited to describing the relevant objects, learns a representation suited to object detection, and enables image interpretation using the results of object detection while incorporating prior knowledge about typical configurations of facade objects, which we learn from training data. According to these three sub-tasks, our major achievements are as follows. We propose a novel method for facade image interpretation based on a marked point process. To this end, we develop a model for the description of typical configurations of facade objects and propose an image interpretation system which combines evidence derived from data with prior knowledge about typical facade configurations. In order to generate evidence from data, we propose a feature type which we call shapelets. They are scale invariant and highly distinctive for facade objects. Segments of lines, arcs, and ellipses serve as basic features for the generation of shapelets. To this end, we propose a novel line simplification approach which approximates given pixel chains by a sequence of straight line, circular, and elliptical arc segments. Among other things, it is based on an adaptation of the Douglas-Peucker algorithm which uses arc segments instead of straight line segments as basic geometric elements. We evaluate each step separately. We show the effects of polyline segmentation and simplification on several images, achieving results comparable to or even better than a state-of-the-art algorithm. Using shapelets, we obtain reasonable classification performance on a challenging dataset including intra-class variations, clutter, and scale changes, which demonstrates their high distinctiveness for facade objects. Finally, we show promising results for the facade interpretation system on several datasets and provide a qualitative evaluation which demonstrates its capability for complete and accurate detection of facade objects.
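    For reference, the classical Douglas-Peucker algorithm that the proposed line simplification adapts is compact enough to sketch; the thesis' adaptation replaces the straight chord with circular and elliptical arc segments, which is not shown here.

```python
# Classical Douglas-Peucker polyline simplification (the baseline the
# thesis adapts by using arc segments as basic geometric elements).
import numpy as np

def douglas_peucker(points, eps):
    """Simplify a pixel chain (N x 2 array) to within tolerance eps."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    if norm == 0.0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # Perpendicular distance of every point from the chord start-end.
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / norm
    i = int(np.argmax(dists))
    if dists[i] <= eps:
        return np.array([start, end])            # chord is close enough
    left = douglas_peucker(points[:i + 1], eps)  # recurse on both halves
    right = douglas_peucker(points[i:], eps)
    return np.vstack([left[:-1], right])         # drop the duplicated pivot
```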

    Large-scale interactive exploratory visual search

    Large-scale visual search has been one of the challenging issues in the era of big data. It demands techniques that are not only highly effective and efficient but also allow users to conveniently express their information needs and refine their intent. In this thesis, we focus on developing an exploratory framework for large-scale visual search. We also develop a number of enabling techniques, including compact visual content representation for scalable search, near-duplicate video shot detection, and action-based event detection. We propose a novel scheme for extremely low bit-rate visual search, which transmits compressed visual words, consisting of a vocabulary-tree histogram and descriptor orientations, rather than raw descriptors. Compact representation of video data is achieved by identifying keyframes of a video, which also helps users comprehend visual content efficiently. We propose a novel Bag-of-Importance model for static video summarization. Near-duplicate detection is one of the key issues for large-scale visual search, since there exist a large number of nearly identical images and videos. We propose an improved near-duplicate video shot detection approach for more effective shot representation. Event detection has been one of the solutions for bridging the semantic gap in visual search. We focus in particular on human-action-centred event detection and propose an enhanced sparse coding scheme to model human actions. Our approach significantly reduces computational cost while achieving recognition accuracy highly comparable to state-of-the-art methods. Finally, we propose an integrated solution addressing the prime challenges of large-scale interactive visual search. The proposed system is also one of the first attempts at exploratory visual search, providing users with more robust results and a better exploratory experience.
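    The low bit-rate idea above can be sketched briefly: each local descriptor is quantised to a leaf of a vocabulary tree by greedy traversal, and only the sparse histogram of word counts is transmitted instead of the raw descriptors. The tree representation below is an illustrative assumption (the thesis' scheme also encodes descriptor orientations, omitted here).

```python
# Hedged sketch: quantise descriptors through a vocabulary tree and keep
# only the sparse word histogram for transmission.
import numpy as np

def quantise(descriptor, centroids, children, root=0):
    """Greedily descend the tree to the nearest leaf and return its word id.

    centroids: dict node_id -> centroid vector (np.ndarray)
    children:  dict node_id -> list of child node ids ([] at leaves)
    """
    node = root
    while children[node]:
        node = min(children[node],
                   key=lambda c: np.linalg.norm(descriptor - centroids[c]))
    return node

def word_histogram(descriptors, centroids, children):
    """Sparse {word id: count} histogram for one image's descriptors."""
    hist = {}
    for d in descriptors:
        w = quantise(d, centroids, children)
        hist[w] = hist.get(w, 0) + 1
    return hist
```

    Because each descriptor reduces to a single leaf id, the histogram is far cheaper to transmit than the descriptors themselves, which is the point of the extremely low bit-rate scheme.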

    Facial expression recognition in the wild : from individual to group

    The progress in computing technology has increased the demand for smart systems capable of understanding human affect and emotional manifestations. One of the crucial factors in designing systems equipped with such intelligence is to have accurate automatic Facial Expression Recognition (FER) methods. In computer vision, automatic facial expression analysis has been an active field of research for over two decades, yet many questions remain unanswered. The research presented in this thesis addresses some of the key issues of FER in challenging conditions: 1) creating a facial expressions database representing real-world conditions; 2) devising Head Pose Normalisation (HPN) methods which are independent of facial part locations; 3) creating automatic methods for analysing the mood of a group of people. The central hypothesis of the thesis is that extracting close-to-real-world data from movies and performing facial expression analysis on them is a stepping stone towards the analysis of faces in real-world, unconstrained conditions. A temporal facial expressions database, Acted Facial Expressions in the Wild (AFEW), is proposed. The database is constructed and labelled using a semi-automatic process based on keyword search of closed-caption subtitles. Currently, AFEW is the largest facial expressions database representing challenging conditions available to the research community. To provide researchers with a common platform for evaluating and extending their state-of-the-art FER methods, the first Emotion Recognition in the Wild (EmotiW) challenge, based on AFEW, is proposed. An image-only facial expressions database, Static Facial Expressions in the Wild (SFEW), extracted from AFEW, is also proposed. Furthermore, the thesis focuses on HPN for real-world images. Earlier methods were based on fiducial points; however, as fiducial point detection is an open problem for real-world images, such HPN can be error-prone. An HPN method based on response maps generated from part detectors is proposed. The proposed shape-constrained method requires neither fiducial points nor head pose information, which makes it suitable for real-world images. Data from movies and the internet, representing real-world conditions, poses another major challenge: the presence of multiple subjects. This defines another focus of this thesis, where a novel approach for modelling the perceived mood of a group of people in an image is presented. A new database is constructed from Flickr based on keywords related to social events. Three models are proposed: the averaging-based Group Expression Model (GEM), the Weighted Group Expression Model (GEM_w) and the Augmented Group Expression Model (GEM_LDA). GEM_w is based on social contextual attributes, which are used as weights on each person's contribution towards the overall group mood. GEM_LDA is based on a topic model and feature augmentation. The proposed framework is applied to group candid shot selection and event summarisation. The application of the Structural SIMilarity (SSIM) index metric is explored for finding similar facial expressions, and the resulting framework is applied to creating image albums based on facial expressions and to finding corresponding expressions for training facial performance transfer algorithms.
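    The weighted model (GEM_w) reduces to a simple idea that can be sketched: the group mood is a weighted mean of per-face expression intensities, with weights derived from social contextual attributes. The sketch below uses relative face size as a stand-in attribute; the attributes actually used in the thesis may differ.

```python
# Hedged sketch of a weighted Group Expression Model (GEM_w)-style score.
import numpy as np

def group_mood(face_intensities, face_sizes):
    """Weighted mean of per-face expression scores.

    face_intensities: expression intensity per detected face.
    face_sizes: contextual weight per face (here, face area as a proxy
    for prominence in the group shot; an assumed attribute).
    """
    w = np.asarray(face_sizes, dtype=float)
    w /= w.sum()                  # normalise weights to sum to one
    s = np.asarray(face_intensities, dtype=float)
    return float(np.dot(w, s))
```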