
    Box-level Segmentation Supervised Deep Neural Networks for Accurate and Real-time Multispectral Pedestrian Detection

    Effective fusion of complementary information captured by multi-modal sensors (visible and infrared cameras) enables robust pedestrian detection under various surveillance situations (e.g. daytime and nighttime). In this paper, we present a novel box-level segmentation supervised learning framework for accurate and real-time multispectral pedestrian detection that incorporates features extracted in the visible and infrared channels. Specifically, our method takes pairs of aligned visible and infrared images with easily obtained bounding box annotations as input and estimates accurate prediction maps to highlight the existence of pedestrians. It offers two major advantages over existing anchor-box-based multispectral detection methods. First, it avoids the hyperparameter-setting problem that arises when training anchor-box-based detectors and obtains more accurate detection results, especially for small and occluded pedestrian instances. Second, it generates accurate detection results from small-size input images, improving computational efficiency for real-time autonomous driving applications. Experimental results on the KAIST multispectral dataset show that our proposed method outperforms state-of-the-art approaches in terms of both accuracy and speed.
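
    A minimal sketch of the supervision idea described above, assuming a PyTorch setup: bounding-box annotations are rasterized into a binary prediction-map target that a segmentation-style detector can regress against. The function name and tensor shapes are illustrative assumptions, not the authors' code.

        import torch

        def boxes_to_target_map(boxes, height, width):
            """Rasterize (x1, y1, x2, y2) pedestrian boxes, in pixels,
            into a binary target map of shape (1, height, width)."""
            target = torch.zeros(1, height, width)
            for x1, y1, x2, y2 in boxes:
                target[:, int(y1):int(y2), int(x1):int(x2)] = 1.0
            return target

        # Example: two pedestrians annotated on a 512x640 frame.
        target = boxes_to_target_map([(40, 100, 90, 260), (300, 120, 350, 270)], 512, 640)
        print(target.sum())  # number of "pedestrian" pixels in the supervision map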

    MDF-Net for Abnormality Detection by Fusing X-Rays with Clinical Data

    This study investigates the effects of including patients' clinical information on the performance of deep learning (DL) classifiers for disease localization in chest X-ray images. Although current classifiers achieve high performance using chest X-ray images alone, our interviews with radiologists indicate that clinical data is highly informative and essential for interpreting images and making proper diagnoses. In this work, we propose a novel architecture consisting of two fusion methods that enable the model to simultaneously process patients' clinical data (structured data) and chest X-rays (image data). Since these data modalities are in different dimensional spaces, we propose a spatial arrangement strategy, spatialization, to facilitate the multimodal learning process in a Mask R-CNN model. We performed an extensive experimental evaluation using MIMIC-Eye, a dataset comprising three modalities: MIMIC-CXR (chest X-ray images), MIMIC-IV-ED (patients' clinical data), and REFLACX (annotations of disease locations in chest X-rays). Results show that incorporating patients' clinical data in a DL model together with the proposed fusion methods improves disease localization in chest X-rays by 12% in terms of Average Precision compared to a standard Mask R-CNN using only chest X-rays. Further ablation studies also emphasize the importance of multimodal DL architectures and the incorporation of patients' clinical data in disease localization. The architecture proposed in this work is publicly available to promote the scientific reproducibility of our study (https://github.com/ChihchengHsieh/multimodal-abnormalities-detection).
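
    A minimal sketch of a spatialization-style fusion, assuming a PyTorch pipeline: the clinical feature vector is projected and broadcast over the spatial grid so it can be concatenated with CNN feature maps. This is one plausible reading of the strategy, not the paper's exact implementation; module and argument names are hypothetical.

        import torch
        import torch.nn as nn

        class SpatializeAndFuse(nn.Module):
            def __init__(self, clin_dim, img_channels):
                super().__init__()
                self.proj = nn.Linear(clin_dim, img_channels)  # embed clinical data

            def forward(self, img_feats, clin):
                # img_feats: (B, C, H, W); clin: (B, clin_dim)
                b, c, h, w = img_feats.shape
                clin_map = self.proj(clin).view(b, c, 1, 1).expand(b, c, h, w)
                return torch.cat([img_feats, clin_map], dim=1)  # (B, 2C, H, W)

        fuse = SpatializeAndFuse(clin_dim=16, img_channels=256)
        out = fuse(torch.randn(2, 256, 32, 32), torch.randn(2, 16))
        print(out.shape)  # torch.Size([2, 512, 32, 32])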

    Tell me what to track: visual object tracking and retrieval by natural language descriptions

    Natural Language (NL) descriptions can be one of the most convenient ways to initialize a visual tracker. NL descriptions can also provide information about longer-term invariances, helping the tracker cope better with typical visual tracking challenges, e.g. occlusion, motion blur, etc. However, deriving a formulation that combines the strengths of appearance-based tracking with the NL modality is not straightforward. In this thesis, we use deep neural networks to learn a joint representation of language and vision that can perform various tasks, such as visual tracking by NL, tracked-object retrieval by NL, and spatio-temporal video grounding by NL.

    First, we study the Single Object Tracking (SOT) by NL descriptions task, which requires spatial localization of a target object in a video sequence. We propose two novel approaches. The first is a tracking-by-detection approach, which performs object detection in the video sequence via similarity matching between potential objects' pooled visual representations and NL descriptions. The second uses a novel Siamese Natural Language Region Proposal Network with a depth-wise cross correlation operation to replace the visual template with a language template in Siamese trackers, e.g. SiamFC, SiamRPN++, etc., and achieves state-of-the-art results on standard SOT-by-NL benchmarks.

    Second, based on experimental results and findings from the SOT by NL task, we propose the Tracked-object Retrieval by NL (TRNL) descriptions task and collect the CityFlow-NL Benchmark for it. CityFlow-NL contains more than 6,500 precise NL descriptions of tracked vehicle targets, making it the first densely annotated dataset of tracked objects paired with NL descriptions. To highlight the novelty of our dataset, we propose two models for the retrieval-by-NL task: a single-stream model based on cross-modality similarity matching and a quad-stream retrieval model that models the similarity between language features and visual features, including local visual features, frame-level features, motions, and relationships between visually similar targets. We release the CityFlow-NL Benchmark together with our models as challenges in the 5th and 6th AI City Challenges.

    Lastly, we focus on the most challenging yet practical task of Spatio-Temporal Video Grounding (STVG), which aims to spatially and temporally localize a target in videos with NL descriptions. We propose new evaluation protocols for the STVG task to adapt to the new challenges of CityFlow-NL that are not well represented in prior STVG benchmarks. Three intuitive and novel approaches to the STVG task are proposed and studied in this thesis: a Multi-Object Tracking (MOT) + retrieval-by-NL approach, an SOT-by-NL-based approach, and a direct localization approach that uses a transformer network to learn a joint representation from both the NL and vision modalities.
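
    A minimal sketch of the depth-wise cross correlation operation mentioned above, assuming PyTorch: a (mocked) language-template kernel correlates channel-wise with search-region features, so each template channel matches only its own feature channel. Shapes and names are illustrative assumptions, not the thesis implementation.

        import torch
        import torch.nn.functional as F

        def depthwise_xcorr(search, template):
            """search: (B, C, H, W); template: (B, C, h, w) -> (B, C, H-h+1, W-w+1)."""
            b, c, h, w = search.shape
            out = F.conv2d(search.reshape(1, b * c, h, w),
                           template.reshape(b * c, 1, *template.shape[2:]),
                           groups=b * c)  # one filter per channel
            return out.reshape(b, c, out.shape[2], out.shape[3])

        # A language template (e.g. a pooled sentence embedding reshaped into a
        # kernel) stands in for the usual visual exemplar crop.
        resp = depthwise_xcorr(torch.randn(2, 256, 31, 31), torch.randn(2, 256, 7, 7))
        print(resp.shape)  # torch.Size([2, 256, 25, 25])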

    A Comprehensive Study of Real-Time Object Detection Networks Across Multiple Domains: A Survey

    Deep neural network based object detectors are continuously evolving and are used in a multitude of applications, each with its own set of requirements. While safety-critical applications need high accuracy and reliability, low-latency tasks need resource and energy-efficient networks. Real-time detectors, which are a necessity in high-impact real-world applications, are continuously proposed, but they overemphasize accuracy and speed improvements while neglecting other capabilities such as versatility, robustness, and resource and energy efficiency. A reference benchmark for existing networks does not exist, nor does a standard evaluation guideline for designing new networks, which results in ambiguous and inconsistent comparisons. We thus conduct a comprehensive study of multiple real-time detectors (anchor-, keypoint-, and transformer-based) on a wide range of datasets and report results on an extensive set of metrics. We also study the impact of variables such as image size, anchor dimensions, confidence thresholds, and architecture layers on the overall performance. We analyze the robustness of detection networks against distribution shifts, natural corruptions, and adversarial attacks, and we provide a calibration analysis to gauge the reliability of the predictions. Finally, to highlight the real-world impact, we conduct two unique case studies on autonomous driving and healthcare applications. To further gauge the capability of networks in critical real-time applications, we report the performance after deploying the detection networks on edge devices. Our extensive empirical study can act as a guideline for the industrial community to make an informed choice among existing networks. We also hope to inspire the research community towards a new direction in the design and evaluation of networks that focuses on a bigger, holistic picture for far-reaching impact. Published in Transactions on Machine Learning Research (TMLR) with Survey Certification.
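
    The latency side of such a study comes down to disciplined timing: warm-up iterations, repeated timed forward passes, and a mean throughput figure. A minimal sketch, assuming PyTorch and torchvision are available; the Faster R-CNN model here is a stand-in, not necessarily one of the surveyed detectors.

        import time
        import torch
        from torchvision.models.detection import fasterrcnn_resnet50_fpn

        model = fasterrcnn_resnet50_fpn(weights=None).eval()
        image = [torch.randn(3, 480, 640)]  # one 480x640 RGB frame

        with torch.no_grad():
            for _ in range(3):                 # warm-up to amortize lazy init
                model(image)
            runs = 10
            start = time.perf_counter()
            for _ in range(runs):              # timed inference passes
                model(image)
            elapsed = time.perf_counter() - start

        print(f"{runs / elapsed:.2f} FPS at 480x640")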

    An Evaluation of Deep Learning-Based Object Identification

    Identification of instances of semantic objects of a particular class, which has entered everyday life through applications such as autonomous driving and security monitoring, is one of the most crucial and challenging areas of computer vision. Recent developments in deep learning networks for detection have improved object detector accuracy. To provide a detailed review of the current state of object detection pipelines, we begin by analyzing the methodologies employed by classical detection models and describing the benchmark datasets used in this study. We then examine one- and two-stage detectors in detail before summarizing several object detection approaches. In addition, we survey both established and emerging applications, covering more than a single branch of object detection. Finally, we discuss how various object detection algorithms can be combined into a system that is both efficient and effective, and we identify emerging trends to guide the use of the most recent algorithms and further research.
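
    As a concrete anchor for the pipeline discussion above, the sketch below implements greedy non-maximum suppression (NMS), the post-processing step shared by one- and two-stage detectors. It is a plain-Python illustration, not code from any of the reviewed systems; real pipelines use batched library implementations.

        def iou(a, b):
            """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            union = area(a) + area(b) - inter
            return inter / union if union > 0 else 0.0

        def nms(boxes, scores, thresh=0.5):
            """Greedily keep the highest-scoring box, drop overlapping ones."""
            order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
            keep = []
            for i in order:
                if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
                    keep.append(i)
            return keep

        print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)],
                  [0.9, 0.8, 0.7]))  # [0, 2]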