94 research outputs found

    Human object annotation for surveillance video forensics

    A system that can automatically annotate surveillance video in a manner useful for locating a person with a given description of clothing is presented. Each human is annotated based on two appearance features: the primary colors of the clothes and the presence of text/logos on the clothes. The annotation occurs after a robust foreground extraction stage employing a modified Gaussian mixture model-based approach. The proposed pipeline includes a preprocessing stage in which the color appearance of an image is improved using a color constancy algorithm. To annotate the color information of human clothes, we use the color histogram in HSV space and find its local maxima to extract the dominant colors for different parts of a segmented human object. To detect text/logos on clothes, we begin by extracting connected components of enhanced horizontal, vertical, and diagonal edges in the frames. These candidate regions are classified as text or non-text on the basis of their local energy-based shape histogram features. Further, to detect humans, a novel technique is proposed that uses contourlet transform-based local binary pattern (CLBP) features: we extract a uniform, direction-invariant LBP descriptor from the contourlet-transformed high-pass subimages of the vertical and diagonal directional bands. In the final stage, the extracted CLBP descriptors are classified by a trained support vector machine. Experimental results illustrate the superiority of our method on large-scale surveillance video data.
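
    As a rough illustration of the dominant-color step described above, the sketch below builds a hue histogram in HSV space and keeps its local maxima as the dominant colors of a segmented clothing region. The bin count and peak threshold are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

def dominant_hues(region_bgr, bins=36, min_share=0.10):
    """Return hue-bin centers whose histogram peaks cover at least
    `min_share` of the region's pixels (both thresholds are assumed)."""
    hsv = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).ravel()
    if hist.sum() == 0:
        return []
    hist /= hist.sum()
    width = 180 // bins
    peaks = []
    for i, v in enumerate(hist):
        left, right = hist[(i - 1) % bins], hist[(i + 1) % bins]
        if v >= left and v >= right and v >= min_share:  # circular local maximum
            peaks.append(i * width + width // 2)         # bin center, OpenCV hue units
    return peaks
```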

    Video content analysis for intelligent forensics

    The networks of surveillance cameras installed in public places and private territories continuously record video data with the aim of detecting and preventing unlawful activities. This enhances the importance of video content analysis applications, either for real-time (i.e. analytic) or post-event (i.e. forensic) analysis. This thesis focuses on four key aspects of video content analysis: 1. moving object detection and recognition; 2. correction of colours in video frames and recognition of the colours of moving objects; 3. make and model recognition of vehicles and identification of their type; 4. detection and recognition of text information in outdoor scenes. To address the first issue, a framework is presented in the first part of the thesis that efficiently detects and recognizes moving objects in videos. The framework targets the problem of object detection in the presence of complex backgrounds. The object detection part of the framework relies on a background modelling technique and a novel post-processing step in which the contours of the foreground regions (i.e. moving objects) are refined by classifying edge segments as belonging either to the background or to the foreground region. Further, a novel feature descriptor is devised for the classification of moving objects into humans, vehicles and background; it captures the texture information present in the silhouettes of foreground objects. To address the second issue, a framework for the correction and recognition of the true colours of objects in videos is presented, with novel noise reduction, colour enhancement and colour recognition stages. The colour recognition stage makes use of temporal information to reliably recognize the true colours of moving objects across multiple frames. The proposed framework is specifically designed to perform robustly on videos of poor quality caused by surrounding illumination, camera sensor imperfections and artefacts due to high compression. In the third part of the thesis, a framework for vehicle make and model recognition and type identification is presented. As part of this work, a novel feature representation technique for the distinctive representation of vehicle images has emerged. It uses dense feature description and a mid-level feature encoding scheme to capture the texture in the frontal view of vehicles, and it is insensitive to minor in-plane rotation and skew within the image. The framework can be extended to any number of vehicle classes without re-training. Another important contribution of this work is the publication of a comprehensive, up-to-date dataset of vehicle images to support future research in this domain. The problem of text detection and recognition in images is addressed in the last part of the thesis. A novel technique is proposed that exploits the colour information in the image to identify text regions. Apart from detection, the colour information is also used to segment characters from words. The identified characters are recognized using shape features and supervised learning. Finally, a lexicon-based alignment procedure is adopted to finalize the recognition of strings present in word images. Extensive experiments have been conducted on benchmark datasets to analyse the performance of the proposed algorithms. The results show that the proposed moving object detection and recognition technique surpassed well-known baseline techniques. The proposed framework for the correction and recognition of object colours in video frames achieved all the aforementioned goals. The performance analysis of the vehicle make and model recognition framework on multiple datasets has shown the strength and reliability of the technique in various scenarios. Finally, the experimental results for the text detection and recognition framework on benchmark datasets have revealed the potential of the proposed scheme for accurate detection and recognition of text in the wild.
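
    As a loose illustration of the background-modelling step in the first framework, the sketch below uses OpenCV's stock Gaussian-mixture background subtractor as a stand-in for the thesis's modified background model and contour-refinement post-processing; the video filename, threshold and minimum blob area are assumptions for the sketch.

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")  # hypothetical input file
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)          # per-pixel foreground/shadow labels
    # Drop shadows (labelled 127 by MOG2), keep confident foreground (255),
    # then take the contours of the remaining blobs as moving-object candidates.
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    movers = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
    print(len(movers), "moving-object candidates in this frame")

cap.release()
```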

    Extraction of Text from Images and Videos

    Ph.D. (Doctor of Philosophy)

    Text detection and recognition from natural images

    Text detection and recognition from images has numerous practical applications in document analysis, such as assistance for visually impaired people; recognition of vehicle license plates; evaluation of articles containing tables, street signs, maps, and diagrams; keyword-based image exploration; document retrieval; recognition of parts within industrial automation; content-based extraction; object recognition; address block location; and text-based video indexing. This research exploited the advantages of artificial intelligence (AI) to detect and recognise text from natural images, using both machine learning and deep learning. We conducted an in-depth literature review of current detection and recognition methods to identify the existing challenges: differences in text resulting from disparities in alignment, style, size, and orientation, combined with low image contrast and complex backgrounds, make automatic text extraction a considerably challenging task. As a result, state-of-the-art approaches obtain low detection rates (often less than 80%) and recognition rates (often less than 60%), which has led to the development of new approaches. The aim of the study was to develop a robust method for text detection and recognition from natural images with high accuracy and recall, which served as the target of the experiments. This method should detect all the text in scene images, despite the specific features associated with the text pattern. Furthermore, we aimed to solve the two main problems of detecting and recognising arbitrarily shaped text (horizontal, multi-oriented, and curved) in low-resolution scenes, at various scales and sizes. We propose a methodology that handles text detection through a novel feature combination and selection scheme for classifying text/non-text regions. Text-region candidates are extracted from grey-scale images using the MSER technique, and a machine learning-based method is then applied to refine and validate the initial detection. The effectiveness of features based on the aspect ratio, GLCM, LBP, and HOG descriptors was investigated, and text-region classifiers (MLP, SVM, and RF) were trained using selections of these features and their combinations. The publicly available ICDAR 2003 and ICDAR 2011 datasets were used to evaluate the proposed method, which achieved state-of-the-art performance on both, with significant improvements in Precision, Recall, and F-measure; the F-measure for ICDAR 2003 and ICDAR 2011 was 81% and 84%, respectively. The results showed that a suitable feature combination and selection approach can significantly increase the accuracy of the algorithms. A new dataset is proposed to fill the gap in character-level annotation and in the availability of multi-oriented and curved text. It was created particularly for deep learning methods, which require a large, fully annotated, and varied range of training data: it includes 2,100 images annotated at the character and word levels, yielding 38,500 samples of English characters and 12,500 words. Furthermore, an augmentation tool is proposed to support the dataset. The lack of such a tool for object detection motivated its development; it updates the positions of bounding boxes after transformations are applied to the images, which increases the number of samples in the dataset and reduces annotation time, since no manual re-annotation is required. The final part of the thesis presents a novel approach for text spotting: an end-to-end character detection and recognition framework designed using an improved SSD convolutional neural network, in which layers are added to the SSD network and the aspect ratio of characters is taken into account because it differs from that of other objects. Compared with the other methods considered, the proposed method can detect and recognise characters by training the end-to-end model completely. The method performed best on the proposed dataset, at 90.34; its F-measure on ICDAR 2015, ICDAR 2013, and SVT was 84.5, 91.9, and 54.8, respectively, achieving the second-best accuracy on ICDAR 2013. The proposed method can spot arbitrarily shaped (horizontal, oriented, and curved) scene text.
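
    As a rough sketch of the detection stage described above, the snippet below uses MSER to propose text-region candidates on a grey-scale image and a pre-trained classifier (hypothetical here, e.g. an sklearn SVM) to label each candidate's HOG descriptor as text or non-text. The patch size and HOG parameters are illustrative assumptions, and the thesis's full feature set (aspect ratio, GLCM, LBP, HOG) is not reproduced.

```python
import cv2
from skimage.feature import hog

def text_candidates(gray, classifier):
    """Return bounding boxes of MSER regions that `classifier` labels as text."""
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)           # candidate regions + bboxes
    kept = []
    for (x, y, w, h) in boxes:
        patch = cv2.resize(gray[y:y + h, x:x + w], (32, 32))
        feat = hog(patch, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        if classifier.predict(feat.reshape(1, -1))[0] == 1:  # 1 = text (assumed label)
            kept.append((x, y, w, h))
    return kept
```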

    Trademark image retrieval by local features

    The challenge of abstract trademark image retrieval as a test of machine vision algorithms has attracted considerable research interest in the past decade. Current operational trademark retrieval systems involve manual annotation of the images (the current 'gold standard'); accordingly, they require a substantial amount of time and labour and are therefore expensive to operate. This thesis focuses on the development of algorithms that mimic aspects of human visual perception in order to retrieve similar abstract trademark images automatically. A significant category of trademark images is highly stylised, comprising a collection of distinctive graphical elements that often include geometric shapes. Therefore, in order to compare the similarity of such images, the principal aim of this research has been to develop a method for solving the partial matching and shape perception problem. Few useful techniques exist for partial shape matching in the context of trademark retrieval, because existing techniques tend not to support multicomponent retrieval. When this work was initiated, most trademark image retrieval systems represented images by means of global features, which are not suited to solving the partial matching problem. Instead, the author has investigated the use of local image features as a means of finding similarities between trademark images that only partially match in terms of their subcomponents. During the course of this work, it was established that the Harris and Chabat detectors could potentially perform well enough to serve as the basis for local feature extraction in trademark image retrieval. Early findings indicated that the well-established SIFT (Scale Invariant Feature Transform) local features, based on the Harris detector, could serve as an adequate underlying local representation for matching trademark images. Few researchers have used mechanisms based on human perception for trademark image retrieval, implying that the shape representations utilised in the past do not necessarily reflect the shapes contained in these images as characterised by human perception. In response, a practical approach to trademark image retrieval by perceptual grouping has been developed, based on defining meta-features that are calculated from the spatial configurations of SIFT local image features. This new technique measures certain visual properties of the appearance of images containing multiple graphical elements and supports perceptual grouping by exploiting the non-accidental properties of their configuration. Our validation experiments indicated that we were indeed able to capture and quantify the differences in the global arrangement of sub-components evident when comparing stylised images in terms of their visual appearance properties. Such visual appearance properties, measured using 17 of the proposed meta-features, include relative sub-component proximity, similarity, rotation and symmetry. Similar work on meta-features, based on the above Gestalt proximity, similarity, and simplicity groupings of local features, had not been reported in the computer vision literature at the time this work was undertaken. We adopted relevance feedback to allow the visual appearance properties of relevant and non-relevant images returned in response to a query to be determined by example. Since limited training data is available when constructing a relevance classifier from user-supplied relevance feedback, the intrinsically non-parametric machine learning algorithm ID3 (Iterative Dichotomiser 3) was selected to construct decision trees by means of dynamic rule induction. We believe that the above approach to capturing high-level visual concepts, encoded by means of meta-features specified by example through relevance feedback and decision tree classification, supports flexible trademark image retrieval and is wholly novel. The retrieval performance of the above system was compared with that of two other state-of-the-art trademark image retrieval systems: Artisan, developed by Eakins (Eakins et al., 1998), and a system developed by Jiang (Jiang et al., 2006). Using relevance feedback, our system achieves higher average normalised precision than either of these systems. However, while our trademark image query and database set is based on an image dataset used by Eakins, we employed different numbers of images, and it was not possible to access the same query set and image database used in the evaluation of Jiang's system. Despite these differences in evaluation methodology, our approach would appear to have the potential to improve retrieval effectiveness.
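
    As a toy illustration of the meta-feature idea, the sketch below extracts SIFT keypoints and computes one illustrative spatial-configuration measure (mean pairwise keypoint distance, normalised by the image diagonal); the 17 meta-features actually proposed in the thesis are not reproduced here.

```python
import cv2
import numpy as np

def proximity_metafeature(image_gray):
    """Mean pairwise SIFT-keypoint distance, normalised by the image diagonal."""
    sift = cv2.SIFT_create()
    keypoints = sift.detect(image_gray, None)
    pts = np.array([kp.pt for kp in keypoints])
    if len(pts) < 2:
        return 0.0
    diag = np.hypot(*image_gray.shape)  # image diagonal, for scale invariance
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    n = len(pts)
    # The zero diagonal contributes nothing to the sum, so dividing by the
    # number of ordered distinct pairs gives the mean pairwise distance.
    return dists.sum() / (n * (n - 1)) / diag
```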

    Active vision in robot cognition

    Doctoral thesis, Informatics Engineering, Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2016. As technology and our understanding of the human brain evolve, the idea of creating robots that behave and learn like humans attracts more and more attention. However, although our knowledge and computational power are constantly growing, we still have much to learn before we can create such machines. Nonetheless, that does not mean we cannot try to validate our knowledge by creating biologically inspired models that mimic some of our brain processes and use them in robotics applications. In this thesis several biologically inspired models for vision are presented: a keypoint descriptor based on cortical cell responses that yields binary codes which can be used to represent specific image regions; and a stereo vision model based on cortical cell responses and visual saliency based on color, disparity and motion. Active vision is achieved by combining these vision modules with an attractor dynamics approach for head pan control. Although biologically inspired models are usually very heavy in terms of processing power, these models were designed to be lightweight so that they can be tested for real-time robot navigation, object recognition and vision steering. The developed vision modules were tested on a child-sized robot, which uses only visual information to navigate, detect obstacles and recognize objects in real time. The biologically inspired visual system is integrated with a cognitive architecture, which combines vision with short- and long-term memory for simultaneous localization and mapping (SLAM). Motor control for navigation is also done using attractor dynamics.
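
    As a minimal sketch of the attractor-dynamics idea used for head pan control, the snippet below attracts the pan angle phi toward a target direction psi via the standard dynamics dphi/dt = -lambda * sin(phi - psi), integrated with forward Euler; the gain, time step and exact dynamics are illustrative assumptions rather than the thesis's formulation.

```python
import math

def pan_step(phi, psi_target, lam=2.0, dt=0.05):
    """One forward-Euler update of the pan angle toward the target direction."""
    return phi + dt * (-lam * math.sin(phi - psi_target))

# The fixed point phi = psi is stable, so repeated updates converge to it.
phi = 0.0
for _ in range(100):
    phi = pan_step(phi, math.pi / 4)
print(round(phi, 3))  # ~0.785 rad, i.e. the target pi/4
```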

    Multimedia Retrieval


    Interfaces and methods for visual search in images

    The main goal of this work was to develop an application with a GUI that gives common users access to complex image processing algorithms, as well as quick and efficient result browsing, because the recent, rapid evolution of visual search techniques and hardware, and their increased accessibility, has sparked interest in studying what is possible today. Along with this evolution there has been a huge increase in the number of images being generated, making it necessary to develop new interfaces to visualize them, yet there has not been significant progress in this area recently. Because of this, and because the implemented algorithms can be applied to image collections, different interfaces were also studied and developed in this work. The starting point was a literature review that served to determine which methods would be implemented and as inspiration for the development of four interfaces: a grid of thumbnails, a grid of variable-size thumbnails, a pile of images and a spiral. The performance and the quality of the results of the methods were evaluated. The visualizations were evaluated in a user test with 9 participants, who were asked to perform broad/specific search tasks. Regarding performance, every method was tested on CPU and, when supported, on GPU, in four different hardware configurations. It was found that the performance satisfies the application's needs, especially when using a GPU. The quality of the results of some methods did not match the values announced by their authors in the original publications, but was sufficient to fulfil their purpose. The user tests indicated that the visualizations ordered from fastest to slowest are Regular Grid > Variable Size Grid > Spiral > Pile, with no significant difference between the grids. The Regular Grid achieved the best SUS score, followed by the Variable Size Grid, Spiral and Pile. The visualizations ordered from most to least useful are Variable Size Grid > Regular Grid > Pile > Spiral. The key aspects were the time required to locate the objects, the difficulty of spotting them and the intuitiveness. Regarding precision, recall and F-measure, no significant differences were found between the two types of tasks.
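
    As a small illustration of one of the four interfaces, the sketch below places thumbnail centres along an Archimedean spiral, with the most relevant result at the centre; the spacing constants are illustrative assumptions rather than the application's actual layout parameters.

```python
import math

def spiral_positions(n, step=60.0, turn_gap=80.0):
    """Return n (x, y) thumbnail centres along an Archimedean spiral r = a*theta."""
    a = turn_gap / (2 * math.pi)  # radial growth per radian (gap between turns)
    positions, theta = [], 0.0
    for _ in range(n):
        r = a * theta
        positions.append((r * math.cos(theta), r * math.sin(theta)))
        # Advance theta so consecutive centres are roughly `step` apart
        # along the arc (arc length is approximately r * dtheta).
        theta += step / max(r, step)
    return positions

print(spiral_positions(5))
```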

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.

    Video Sequence Alignment

    The task of aligning multiple audio-visual sequences with similar content requires careful synchronisation in both the spatial and temporal domains. It is challenging due to a broad range of content variations, background clutter, occlusions, and other factors. This thesis is concerned with aligning video content by characterising the spatial and temporal information embedded in a high-dimensional space. To that end a three-stage framework is developed, involving space-time representation of video clips with local linear coding, followed by their alignment in a manifold-embedded space. The first two stages present a video representation technique based on local feature extraction and linear coding methods. Firstly, the scale invariant feature transform (SIFT) is extended to extract interest points not only from the spatial plane but also from planes along the space-time axis. Locality-constrained coding is then incorporated to project each descriptor into a local coordinate system produced by a pooling technique. Human action classification benchmarks are adopted to evaluate these two stages, comparing their performance against existing techniques. The results show that the space-time extension of SIFT with a linear coding scheme outperforms most state-of-the-art approaches on the action classification task, owing to its ability to represent complex events in video sequences. The final stage presents a manifold learning algorithm with spatio-temporal constraints that embeds a video clip in a lower-dimensional space while preserving the intrinsic geometry of the data. The similarities observed between frame sequences are captured by defining two types of correlation graphs: an intra-correlation graph within a single video sequence and an inter-correlation graph between two sequences. Video retrieval and ranking tasks are designed to evaluate the manifold learning stage. The experimental outcome shows that the approach outperforms conventional techniques in identifying similar video content and capturing the spatio-temporal correlations between sequences.
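
    As a compact sketch of the locality-constrained coding step, the snippet below projects a descriptor onto its k nearest codebook entries by solving the small regularised least-squares problem with a sum-to-one constraint used in standard LLC formulations; the codebook, k and regulariser are assumptions for the sketch.

```python
import numpy as np

def llc_code(x, codebook, k=5, reg=1e-4):
    """Code descriptor x over `codebook` (one codeword per row) using its
    k nearest codewords and a sum-to-one constrained least-squares fit."""
    dists = np.linalg.norm(codebook - x, axis=1)
    nn = np.argsort(dists)[:k]            # indices of the k nearest codewords
    z = codebook[nn] - x                  # shift the neighbourhood to the origin
    C = z @ z.T + reg * np.eye(k)         # regularised local covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                          # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))
    code[nn] = w
    return code

# Example: code a random 128-D SIFT-like descriptor over a random codebook.
rng = np.random.default_rng(0)
print(llc_code(rng.normal(size=128), rng.normal(size=(256, 128)))[:10])
```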