
    Visual Saliency Analysis, Prediction, and Visualization: A Deep Learning Perspective

    In recent years, considerable success has been achieved in the prediction of human eye fixations. Several studies have employed deep learning to reach high prediction accuracy. These studies rely on deep networks pre-trained for object classification, exploiting them either through transfer learning or by using the pre-trained weights to initialize a saliency model. The use of such pre-trained networks is motivated by the relatively small human-fixation datasets available for training a deep learning model. Another, less frequently addressed problem is that the computational cost of such deep learning models requires expensive hardware. In this dissertation, two approaches are proposed to tackle the aforementioned problems. The first approach, codenamed DeepFeat, incorporates the deep features of convolutional neural networks pre-trained for object and scene classification. It is the first approach to use deep features without further learning. The performance of the DeepFeat model is extensively evaluated over a variety of datasets using a variety of implementations. The second approach is a deep learning saliency model, codenamed ClassNet. Two main differences separate ClassNet from other deep learning saliency models: it is the only deep learning saliency model that learns its weights from scratch, and it treats the prediction of human fixations as a classification problem, whereas other deep learning saliency models treat it as a regression problem or as a classification of a regression problem.
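The idea of building a saliency map from pre-trained features without any further learning can be sketched as follows. This is a minimal, hypothetical illustration, not the DeepFeat implementation: it assumes feature maps from several convolutional layers have already been extracted and resized to a common spatial resolution, and it simply averages and normalizes them.

```python
import numpy as np

def saliency_from_features(feature_maps):
    """Combine pre-trained CNN feature maps into a saliency map
    with no further learning (training-free sketch).

    feature_maps: list of arrays of shape (C, H, W), e.g. activations
    from several convolutional layers, resized to a common H x W.
    Returns a saliency map of shape (H, W) with values in [0, 1].
    """
    maps = []
    for fm in feature_maps:
        # Collapse the channel dimension: strong activations anywhere
        # in the feature hierarchy contribute to saliency.
        m = fm.mean(axis=0)
        # Normalize each layer's map to [0, 1] before combining.
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)
        maps.append(m)
    # Average the per-layer maps into a single saliency map.
    s = np.mean(maps, axis=0)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)
```

In practice the feature maps would come from a network pre-trained for object or scene classification; the sketch only shows the learning-free combination step.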

    Matching and Segmentation for Multimedia Data

    With the development of society, both industry and academia pay increasing attention to multimedia systems, which handle image/video, audio, and text data comprehensively and simultaneously. In this thesis, we focus on multi-modality data understanding, combining the two subjects of Computer Vision (CV) and Natural Language Processing (NLP). Such a task is widely used in many real-world scenarios, including criminal search from witness language descriptions, robotic navigation with language instructions in the smart industry, terrorist tracking, missing-person identification, and so on. However, such a multi-modality system still faces many challenges that limit its performance and ability in real-life situations, including the domain gap between the modalities of vision and language and the demand for high-quality datasets. To better analyze and handle these challenges, this thesis focuses on two fundamental tasks: matching and segmentation. Image-Text Matching (ITM) aims to retrieve the texts (images) that describe the most relevant content for a given image (text) query. Because of the semantic gap between the linguistic and visual domains, aligning and comparing feature representations for languages and images remains challenging. To overcome this limitation, we propose a new framework for the image-text matching task that uses an auxiliary captioning step to enhance the image feature, fusing it with the text feature of the captioning output. As a downstream application of ITM, language-person search is a specific case in which language descriptions are provided to retrieve person images; it also suffers from the domain gap between linguistic and visual data. To handle this problem, we propose a transformer-based language-person search framework in which matching is conducted between words and image regions for better image-text interaction.
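The caption-fusion idea for ITM can be sketched at the feature level. Everything below is a hypothetical illustration: the feature vectors, the mixing weight `alpha`, and the function name are assumptions, and cosine similarity stands in for whatever scoring the full framework uses.

```python
import numpy as np

def fuse_and_rank(img_feat, caption_feat, text_feats, alpha=0.7):
    """Rank candidate texts for an image query (ITM sketch).

    The image feature is fused with the text feature of an
    auto-generated caption before similarity scoring; alpha is a
    hypothetical mixing weight. All features are 1-D vectors.
    Returns candidate indices from most to least relevant.
    """
    def l2(v):
        # Normalize to unit length so dot products act as cosine similarity.
        return v / (np.linalg.norm(v) + 1e-8)

    # Fuse the visual feature with the caption's text feature, pulling
    # the image representation toward the linguistic domain.
    fused = l2(alpha * l2(img_feat) + (1 - alpha) * l2(caption_feat))
    scores = [float(np.dot(fused, l2(t))) for t in text_feats]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Because the caption feature lives in the same domain as the candidate texts, the fusion narrows the semantic gap before comparison; that is the intuition the sketch tries to make concrete.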
However, collecting a large amount of training data via human annotation is neither cheap nor reliable. We therefore study the one-shot person re-identification (Re-ID) task, which aims to match people given one labeled reference image per person, whereas previous methods require a large number of ground-truth labels. We propose progressive sample mining and representation learning to make better use of the limited labels in the one-shot Re-ID task. Referring Expression Segmentation (RES) aims to localize and segment the target according to a given language expression. Existing methods jointly consider the localization and segmentation steps, relying on fused visual and linguistic features for both. We argue that the conflict between finding the object and generating the mask limits RES performance. To solve this problem, we propose a parallel position-kernel-segmentation pipeline that better isolates, and then combines, the localization and segmentation steps. In our pipeline, linguistic information does not directly contaminate the visual features used for segmentation. Specifically, the localization step localizes the target object in the image based on the referring expression, and the visual kernel obtained from the localization step then guides the segmentation step. This pipeline also enables us to train RES in a weakly supervised way, where pixel-level segmentation labels are replaced by click annotations on center and corner points. The position head is trained fully supervised with the click annotations, while the segmentation head is trained with weakly supervised segmentation losses. This thesis focuses on the key limitations of multimedia systems, and the experiments show that the proposed frameworks are effective for their respective tasks. The experiments are easy to reproduce with clear details, and source code is provided for future work on these tasks.
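The kernel-guided segmentation step can be sketched as a dynamic 1x1 convolution: the localization branch produces a per-channel kernel, which is then applied to purely visual features, so language never touches the segmentation features directly. This is an assumed minimal formulation, not the thesis's actual architecture; the shapes, threshold, and function name are illustrative.

```python
import numpy as np

def kernel_guided_segmentation(visual_feat, kernel, threshold=0.5):
    """Sketch of the parallel position-kernel-segmentation idea.

    visual_feat: (C, H, W) purely visual feature map.
    kernel: (C,) dynamic kernel produced by the localization step
    (the only place the referring expression enters the pipeline).
    Returns a binary mask of shape (H, W).
    """
    # A 1x1 convolution is a channel-weighted sum at every spatial
    # position: contract the kernel against the channel axis.
    logits = np.tensordot(kernel, visual_feat, axes=([0], [0]))  # (H, W)
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return (probs > threshold).astype(np.uint8)
```

The design point the sketch captures is the isolation: swapping the expression changes only the kernel, while the visual features feeding the segmentation step stay language-free.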

    Advances in Deep Learning Towards Fire Emergency Application: Novel Architectures, Techniques and Applications of Neural Networks

    Paper IV is not yet published. For copyright reasons, Papers IV and VI were excluded from the dissertation. Deep learning has been successfully used in various applications, and recently there has been increasing interest in applying it to emergency management. However, many significant challenges still limit the use of deep learning in this application domain. In this thesis, we address some of these challenges and propose novel deep learning methods and architectures. The challenges we address fall into three areas of emergency management: detection of the emergency (fire), analysis of the situation without human intervention, and evacuation planning. We have used the computer vision tasks of image classification and semantic segmentation, as well as sound recognition, for detection and analysis. For evacuation planning, we have used deep reinforcement learning.

    Word Knowledge and Word Usage

    Word storage and processing define a multi-factorial domain of scientific inquiry whose thorough investigation goes well beyond the boundaries of traditional disciplinary taxonomies and requires the synergic integration of a wide range of methods, techniques, and empirical and experimental findings. The present book approaches a few central issues concerning the organization, structure, and functioning of the Mental Lexicon by asking domain experts to look at common, central topics from complementary standpoints and to discuss the advantages of developing converging perspectives. The book explores the connections between computational and algorithmic models of the mental lexicon, word frequency distributions and information-theoretic measures of word families, statistical correlations across psycholinguistic and cognitive evidence, principles of machine learning, and integrative brain models of word storage and processing. The main goal of the book is to map out the landscape of future research in this area, foster the development of interdisciplinary curricula, and help single-domain specialists understand and address issues and questions as they are raised in other disciplines.

    A Jeffrey divergence based irregular pyramid method for pre-attentive visual segmentation

    No full text

    Object Recognition

    Vision-based object recognition tasks are very familiar in our everyday activities, such as driving a car in the correct lane, and we perform them effortlessly in real time. In recent decades, with the advancement of computer technology, researchers and application developers have been trying to mimic the human capability of visual recognition. Such a capability would allow machines to free humans from tedious or dangerous jobs.