
    Layered Interpretation of Street View Images

    We propose a layered street view model that encodes both depth and semantic information in street view images for autonomous driving. Recently, stixels, stix-mantics, and tiered scene labeling methods have been proposed to model street view images. We propose a 4-layer street view model, a compact representation compared with the recently proposed stix-mantics model. Our layers encode semantic classes such as ground, pedestrians, vehicles, buildings, and sky in addition to depth. The only input to our algorithm is a pair of stereo images. We use a deep neural network to extract appearance features for the semantic classes, and a simple, efficient inference algorithm to jointly estimate both the semantic classes and the layered depth values. Our method outperforms competing approaches on the Daimler urban scene segmentation dataset. The algorithm is massively parallelizable, allowing a GPU implementation with a processing speed of about 9 fps. Comment: The paper will be presented at the 2015 Robotics: Science and Systems Conference (RSS).
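    The abstract does not give the inference details, so the following is only an illustrative sketch of the tiered, per-column labeling idea such models build on: for one image column, choose the boundaries between sky, building, object, and ground layers that minimise the summed per-pixel class costs. The brute-force search, the four-class ordering, and all names are assumptions for illustration, not the paper's actual algorithm.

    import numpy as np

    def tiered_column_labels(class_cost):
        """Label one image column with a 4-layer tiered model (sky, building,
        object, ground, from top to bottom) by choosing the three boundary rows
        that minimise the summed per-pixel class costs.

        class_cost: (H, 4) array; class_cost[y, k] is the cost of giving row y
        label k, with k = 0 sky, 1 building, 2 object, 3 ground."""
        H = class_cost.shape[0]
        # Prefix sums let us evaluate the cost of any contiguous run in O(1).
        prefix = np.vstack([np.zeros((1, 4)), np.cumsum(class_cost, axis=0)])

        def run_cost(y0, y1, k):  # cost of labelling rows [y0, y1) with class k
            return prefix[y1, k] - prefix[y0, k]

        best_cost, best_bounds = np.inf, None
        for b0 in range(H + 1):              # sky / building boundary
            for b1 in range(b0, H + 1):      # building / object boundary
                for b2 in range(b1, H + 1):  # object / ground boundary
                    c = (run_cost(0, b0, 0) + run_cost(b0, b1, 1) +
                         run_cost(b1, b2, 2) + run_cost(b2, H, 3))
                    if c < best_cost:
                        best_cost, best_bounds = c, (b0, b1, b2)
        return best_bounds, best_cost

    A real implementation would replace the brute force with dynamic programming and include the stereo depth term; because each column is processed independently, the approach is naturally parallelizable, consistent with the GPU speed the abstract reports.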

    A Tree-Based Context Model for Object Recognition


    Automated Semantic Content Extraction from Images

    In this study, an automatic semantic segmentation and object recognition methodology is implemented that bridges the semantic gap between low-level features of image content and high-level conceptual meaning. Semantically understanding an image is essential for modeling autonomous robots, targeting customers in marketing, or reverse engineering of building information modeling in the construction industry. To achieve an understanding of a room from a single image, we propose a new object recognition framework with four major components: segmentation, scene detection, conceptual cueing, and object recognition. The new segmentation methodology developed in this research extends Felzenszwalb's cost function to include new surface index and depth features as well as color, texture, and normal features, overcoming issues of occlusion and shadowing commonly found in images. Adding depth provides new features for the object recognition stage and achieves high accuracy compared with the current state of the art. The goal was to develop an approach to capture and label perceptually important regions, which often reflect the global representation and understanding of the image. We developed a system that uses contextual and common-sense information to improve object recognition and scene detection, and fuses the information from scenes and objects to reduce the level of uncertainty. In addition to improving segmentation, scene detection, and object recognition, this study can be used in applications that require physically parsing an image into objects, surfaces, and their relations. The applications include robotics, social networking, intelligence and anti-terrorism efforts, criminal investigations and security, marketing, and building information modeling in the construction industry. In this dissertation, a structural framework (ontology) is developed that generates text descriptions based on an understanding of the objects, structures, and attributes of an image.
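    As a rough illustration of what extending a Felzenszwalb-style graph segmentation with depth and surface cues might look like, the sketch below combines colour, depth, and surface-normal differences into a single edge weight. The feature weights and the exact combination are assumptions, not the dissertation's actual cost function.

    import numpy as np

    def edge_weight(pix_a, pix_b, w_color=1.0, w_depth=1.0, w_normal=1.0):
        """Dissimilarity between two neighbouring pixels, each given as a dict
        with 'color' (RGB vector), 'depth' (scalar) and 'normal' (unit 3-vector).
        Larger weights make region boundaries more likely along that cue."""
        d_color = np.linalg.norm(np.asarray(pix_a['color'], float) -
                                 np.asarray(pix_b['color'], float))
        d_depth = abs(pix_a['depth'] - pix_b['depth'])
        d_normal = 1.0 - float(np.dot(pix_a['normal'], pix_b['normal']))
        return w_color * d_color + w_depth * d_depth + w_normal * d_normal

    Such weights would then feed the usual graph-based merge criterion, so that regions separated by a depth discontinuity are kept apart even when their colours match, which is one way occlusion and shadowing artefacts can be mitigated.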

    Rich probabilistic models for semantic labeling

    The aim of this monograph is to explore the methods and applications of semantic labeling. Our contributions to this rapidly developing topic concern particular aspects of modeling and inference in probabilistic models and their applications in the interdisciplinary areas of computer vision, medical image processing, and remote sensing.

    Semantic multimedia modelling & interpretation for annotation

    The emergence of multimedia-enabled devices, particularly the incorporation of cameras in mobile phones, and the rapid advances in low-cost storage have drastically increased the rate of multimedia data production. Given this ubiquity of digital images and videos, the research community has focused on how to use and manage them effectively. Stored in enormous multimedia corpora, digital data need to be retrieved and organized intelligently, drawing on the rich semantics involved. Using these image and video collections demands proficient image and video annotation and retrieval techniques. Recently, the multimedia research community has progressively shifted its emphasis to the personalization of these media. The main impediment in image and video analysis is the semantic gap: the discrepancy between a user's high-level interpretation of an image or video and its low-level computational interpretation. Content-based image and video annotation systems are particularly susceptible to the semantic gap because they rely on low-level visual features to describe semantically rich image and video content. Visual similarity, however, is not semantic similarity, so an alternative route around this dilemma is needed. The semantic gap can be narrowed by incorporating high-level and user-generated information into the annotation. High-level descriptions of images and videos are better at capturing the semantic meaning of multimedia content, but collecting this information is not always feasible. It is commonly agreed that the problem of high-level semantic annotation of multimedia is still far from solved. This dissertation puts forward approaches for intelligent multimedia semantic extraction for high-level annotation, and aims to bridge the gap between visual features and semantics. It proposes a framework for annotation enhancement and refinement for object/concept-annotated image and video datasets. The overall theme is to first purify the datasets of noisy keywords and then expand the concepts lexically and with common-sense knowledge to fill the vocabulary and lexical gaps and achieve high-level semantics for the corpus. The dissertation also explores a novel approach for high-level semantic (HLS) propagation through image corpora. HLS propagation takes advantage of semantic intensity (SI), the dominance factor of a concept in an image, together with annotation-based semantic similarity between images. Since an image is a combination of various concepts, some of which are more dominant than others, the semantic similarity of two images is based on their SI values and on the semantic similarity between their concepts. Moreover, HLS propagation exploits clustering techniques to group similar images, so that a single effort by a human expert to assign a high-level semantic label to a randomly selected image is propagated to the other images in the cluster. The investigation has been carried out on the LabelMe image and LabelMe video datasets. Experiments show that the proposed approaches noticeably narrow the semantic gap and that our proposed system outperforms traditional systems.
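    The abstract does not give the exact SI formulation; purely as an illustration, the sketch below scores annotation-based similarity between two images whose concepts carry semantic-intensity weights, using a cosine-style overlap. Function and variable names are hypothetical.

    def si_similarity(concepts_a, concepts_b):
        """Annotation-based similarity between two images, each given as a dict
        mapping a concept label to its semantic-intensity (SI) weight, i.e. how
        dominant that concept is in the image."""
        shared = set(concepts_a) & set(concepts_b)
        overlap = sum(concepts_a[c] * concepts_b[c] for c in shared)
        norm = (sum(v * v for v in concepts_a.values()) ** 0.5 *
                sum(v * v for v in concepts_b.values()) ** 0.5)
        return overlap / norm if norm else 0.0

    # Example: a beach image dominated by 'sea' versus one dominated by 'sand'.
    print(si_similarity({'sea': 0.7, 'sand': 0.2, 'person': 0.1},
                        {'sand': 0.6, 'sea': 0.3, 'umbrella': 0.1}))

    Similarities of this kind could then drive the clustering step, so that a label assigned by an expert to one image propagates to the other images in its cluster.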

    Audio-coupled video content understanding of unconstrained video sequences

    Unconstrained video understanding is a difficult task. The main aim of this thesis is to recognise the nature of objects, activities, and environment in a given video clip using both audio and video information. Traditionally, audio and video information have not been applied together to solve such a complex task, and for the first time we propose, develop, implement, and test a new framework for multi-modal (audio and video) data analysis for context understanding and labelling of unconstrained videos. The framework relies on feature selection techniques and introduces a novel algorithm (PCFS) that is faster than the well-established SFFS algorithm. We use the framework to study the benefits of combining audio and video information in a number of different problems. We begin by developing two independent content recognition modules. The first is based on image sequence analysis alone and uses a range of colour, shape, texture, and statistical features from image regions with a trained classifier to recognise the identity of the objects, activities, and environment present. The second module uses audio information only, and recognises activities and environment. Both approaches are preceded by detailed pre-processing to ensure that valid video segments containing both audio and video content are present, and that the developed system is robust to changes in camera movement, illumination, random object behaviour, etc. For both audio and video analysis, we use a hierarchical approach of multi-stage classification, so that difficult classification tasks can be decomposed into simpler and smaller ones. When combining both modalities, we compare fusion techniques at different levels of integration and propose a novel algorithm that combines the advantages of both feature-level and decision-level fusion. The analysis is evaluated on a large amount of test data comprising unconstrained videos collected for this work. Finally, we propose a decision correction algorithm which shows that further steps towards combining multi-modal classification information with semantic knowledge generate the best possible results.
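    As a minimal sketch of the decision-level side of the fusion the thesis compares (its own algorithm also draws on feature-level fusion), weighted averaging of the two modules' class posteriors could look like the following; the weights and names are placeholders, not values from the thesis.

    import numpy as np

    def fuse_decisions(p_audio, p_video, w_audio=0.5, w_video=0.5):
        """Late (decision-level) fusion of per-class posteriors produced by the
        audio-only and video-only recognition modules; returns the fused label
        index and the fused score vector."""
        fused = (w_audio * np.asarray(p_audio, float) +
                 w_video * np.asarray(p_video, float))
        return int(np.argmax(fused)), fused

    # Example: the audio module is unsure, the video module favours class 2.
    label, scores = fuse_decisions([0.40, 0.35, 0.25], [0.10, 0.20, 0.70])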

    SEMANTIC ANALYSIS AND UNDERSTANDING OF HUMAN BEHAVIOUR IN VIDEO STREAMING

    This thesis investigates the semantic analysis of human behaviour captured in video streams, from both the theoretical and the technological points of view. Video analysis based on semantic content is still an open issue for the computer vision research community, especially when real-time analysis of complex scenes is concerned. Automated video analysis can be described and performed at different abstraction levels, from pixel analysis up to human behaviour understanding. Similarly, the organisation of computer vision systems is often hierarchical, with low-level image processing techniques feeding into tracking algorithms and then into higher-level scene analysis and/or behaviour analysis modules. Each level of this hierarchy has its open issues, the main ones being:
    - motion and object detection: dynamic background modelling, ghosts, sudden changes in illumination conditions;
    - object tracking: modelling and estimating the dynamics of moving objects, presence of occlusions;
    - human behaviour identification: human behaviour patterns are characterised by ambiguity, inconsistency and time-variance.
    Researchers have proposed various approaches that partially address some aspects of the above issues from the perspective of semantic analysis and understanding of video streams. Much progress has been achieved, but usually not in a comprehensive way and often without reference to the actual operating conditions. A popular class of approaches enhances the quality of the semantic analysis by exploiting background knowledge about the scene and/or the human behaviour, narrowing the huge variety of possible behavioural patterns by focusing on a specific narrow domain. In general, the main drawback of existing approaches to semantic analysis of human behaviour, even in narrow domains, is inefficiency due to the high computational complexity of the models representing the dynamics of the moving objects and the patterns of human behaviour. In this perspective, this thesis explores an innovative, original approach to human behaviour analysis and understanding based on the syntactical, symbolic analysis of images and video streams described by means of strings of symbols. A symbol is associated with each area of the analysed scene; when a moving object enters an area, the corresponding symbol is appended to the string describing the motion. This approach characterises the motion of a moving object as a word composed of symbols; by studying and classifying these words we can categorise and understand the various behaviours. The main advantage of this approach is the simplicity of the scene and motion descriptions, so that the behaviour analysis has limited computational complexity, owing to the intrinsic nature of the representations and of the operations used to manipulate them. Besides, the structure of the representations is well suited to parallel processing, allowing the analysis to be sped up when appropriate hardware architectures are used. The theoretical background, the original theoretical results underlying this approach, the human behaviour analysis methodology, possible implementations, and the related performance are presented and discussed in the thesis. To show the effectiveness of the proposed approach, a demonstrative system has been implemented and applied to a real indoor environment with valuable results.
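    A minimal sketch of the symbolic motion encoding described above, assuming axis-aligned scene areas and a 2-D trajectory; the region shapes, symbols, and names are illustrative, not taken from the thesis.

    def trajectory_to_word(trajectory, regions):
        """Encode an object's motion as a string of region symbols.  'regions'
        maps a symbol to an axis-aligned box (x0, y0, x1, y1); 'trajectory' is a
        sequence of (x, y) positions.  Consecutive duplicates are collapsed so
        the word records region transitions."""
        word = []
        for x, y in trajectory:
            for symbol, (x0, y0, x1, y1) in regions.items():
                if x0 <= x <= x1 and y0 <= y <= y1:
                    if not word or word[-1] != symbol:
                        word.append(symbol)
                    break
        return ''.join(word)

    # Example: a corridor split into three zones A, B, C.
    regions = {'A': (0, 0, 10, 10), 'B': (10, 0, 20, 10), 'C': (20, 0, 30, 10)}
    print(trajectory_to_word([(2, 5), (8, 5), (12, 5), (25, 5)], regions))  # 'ABC'

    Words produced this way can then be classified with standard string or grammar techniques, which is what keeps the behaviour analysis computationally light.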
    Furthermore, this thesis proposes an innovative method to improve the overall performance of the object tracking algorithm. The method uses two cameras recording the same scene from different points of view, without introducing any constraint on the cameras' positions. The image fusion task is performed by solving the correspondence problem only for a few relevant points, which reduces the problem of partial occlusions in crowded scenes. Since this method works at a level lower than that of the semantic analysis, it can also be applied in other systems for human behaviour analysis and can be seen as an optional way to improve the semantic analysis by mitigating partial occlusions.