10 research outputs found
Box-level Segmentation Supervised Deep Neural Networks for Accurate and Real-time Multispectral Pedestrian Detection
Effective fusion of complementary information captured by multi-modal sensors
(visible and infrared cameras) enables robust pedestrian detection under
various surveillance situations (e.g. daytime and nighttime). In this paper, we
present a novel box-level segmentation supervised learning framework for
accurate and real-time multispectral pedestrian detection by incorporating
features extracted in visible and infrared channels. Specifically, our method
takes pairs of aligned visible and infrared images with easily obtained
bounding box annotations as input and estimates accurate prediction maps to
highlight the existence of pedestrians. It offers two major advantages over the
existing anchor box based multispectral detection methods. Firstly, it
overcomes the hyperparameter-setting problem that arises during the training
phase of anchor-box-based detectors and can obtain more accurate detection results,
especially for small and occluded pedestrian instances. Secondly, it is capable
of generating accurate detection results using small-size input images, leading
to improvement of computational efficiency for real-time autonomous driving
applications. Experimental results on KAIST multispectral dataset show that our
proposed method outperforms state-of-the-art approaches in terms of both
accuracy and speed.
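As a rough illustration of the two-stream idea described above, the sketch below fuses visible and infrared feature maps by channel-wise concatenation followed by a 1x1 convolution (written as a matrix product over the channel axis). The shapes and the fusion operator are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def fuse_modalities(feat_vis, feat_ir, w):
    """Channel-wise concatenation of the two streams followed by a
    1x1 convolution, implemented as a matrix product over channels.

    feat_vis, feat_ir: (C, H, W) feature maps from each stream.
    w: (C_out, 2*C) fusion weights (hypothetical, randomly initialized here).
    """
    stacked = np.concatenate([feat_vis, feat_ir], axis=0)   # (2C, H, W)
    c2, h, wd = stacked.shape
    fused = w @ stacked.reshape(c2, h * wd)                 # (C_out, H*W)
    return fused.reshape(-1, h, wd)

# Toy example: two 4-channel 8x8 feature maps fused into 4 output channels.
rng = np.random.default_rng(0)
vis = rng.standard_normal((4, 8, 8))
ir = rng.standard_normal((4, 8, 8))
weights = rng.standard_normal((4, 8))
out = fuse_modalities(vis, ir, weights)
print(out.shape)  # (4, 8, 8)
```

In a real network the 1x1 fusion would be a learned layer inside the backbone; the point here is only that both modalities end up in one tensor that downstream layers can consume.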
MDF-Net for Abnormality Detection by Fusing X-Rays with Clinical Data
This study investigates the effects of including patients' clinical
information on the performance of deep learning (DL) classifiers for disease
location in chest X-ray images. Although current classifiers achieve high
performance using chest X-ray images alone, our interviews with radiologists
indicate that clinical data is highly informative and essential for
interpreting images and making proper diagnoses.
In this work, we propose a novel architecture consisting of two fusion
methods that enable the model to simultaneously process patients' clinical data
(structured data) and chest X-rays (image data). Since these data modalities
are in different dimensional spaces, we propose a spatial arrangement strategy,
spatialization, to facilitate the multimodal learning process in a Mask R-CNN
model. We performed an extensive experimental evaluation using MIMIC-Eye, a
dataset comprising three modalities: MIMIC-CXR (chest X-ray images), MIMIC IV-ED
(patients' clinical data), and REFLACX (annotations of disease locations in
chest X-rays).
Results show that incorporating patients' clinical data in a DL model
together with the proposed fusion methods improves the disease localization in
chest X-rays by 12% in terms of Average Precision compared to a standard Mask
R-CNN using only chest X-rays. Further ablation studies also emphasize the
importance of multimodal DL architectures and the incorporation of patients'
clinical data in disease localization. The architecture proposed in this work
is publicly available to promote the scientific reproducibility of our study
(https://github.com/ChihchengHsieh/multimodal-abnormalities-detection).
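The spatialization idea described above can be sketched as broadcasting the clinical feature vector across the spatial grid of the image features, so both modalities live in one tensor that a Mask R-CNN-style backbone can consume. The feature dimensions and clinical fields below are hypothetical placeholders, not taken from MDF-Net.

```python
import numpy as np

def spatialize(clinical_vec, h, w):
    """Broadcast a 1-D clinical feature vector into a (D, H, W) tensor
    so it can be concatenated channel-wise with CNN image features."""
    d = clinical_vec.shape[0]
    return np.broadcast_to(clinical_vec[:, None, None], (d, h, w)).copy()

image_feats = np.zeros((256, 14, 14))          # hypothetical backbone output
clinical = np.array([63.0, 1.0, 120.0, 37.2])  # e.g. age, sex, bp, temperature
fused = np.concatenate([image_feats, spatialize(clinical, 14, 14)], axis=0)
print(fused.shape)  # (260, 14, 14)
```

Each spatial location now carries the same copy of the clinical data, which lets ordinary convolutions condition on it without any architectural change to the image path.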
Tell me what to track: visual object tracking and retrieval by natural language descriptions
Natural Language (NL) descriptions can be one of the most convenient ways to initialize a visual tracker. NL descriptions can also help provide information for longer-term invariance, thus helping the tracker cope better with typical visual tracking challenges, e.g. occlusion, motion blur, etc. However, deriving a formulation to combine the strengths of appearance-based tracking with the NL modality is not straightforward. In this thesis, we use deep neural networks to learn a joint representation of language and vision that can perform various tasks, such as visual tracking by NL, tracked-object retrieval by NL, and spatio-temporal video grounding by NL.
First, we study the Single Object Tracking (SOT) by NL descriptions task, which requires spatial localizations of a target object in a video sequence. We propose two novel approaches. The first is a tracking-by-detection approach, which performs object detection in the video sequence via similarity matching between potential objects' pooled visual representations and NL descriptions. The second approach uses a novel Siamese Natural Language Region Proposal Network with a depth-wise cross correlation operation to replace the visual template with a language template in Siamese trackers, e.g. SiamFC, SiamRPN++, etc., and achieves state-of-the-art on standard single object tracking by NL benchmarks.
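The depth-wise cross correlation mentioned above can be illustrated with a naive NumPy loop: each channel of a (language-derived) template is correlated with the matching channel of the search-region features, producing one response map per channel. This is a didactic sketch of the operation itself, not the SiamRPN++-style implementation.

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Per-channel (depth-wise) cross correlation.

    search: (C, H, W) search-region features.
    template: (C, h, w) template features with h <= H and w <= W.
    Returns a (C, H-h+1, W-w+1) response map, one channel per input channel.
    """
    c, H, W = search.shape
    _, h, w = template.shape
    out = np.empty((c, H - h + 1, W - w + 1))
    for ch in range(c):                      # no cross-channel mixing:
        for i in range(H - h + 1):           # this is what "depth-wise" means
            for j in range(W - w + 1):
                out[ch, i, j] = np.sum(search[ch, i:i+h, j:j+w] * template[ch])
    return out

resp = depthwise_xcorr(np.ones((2, 6, 6)), np.ones((2, 3, 3)))
print(resp.shape)  # (2, 4, 4); every entry is 9.0 for these all-ones inputs
```

In the thesis' setting, the template channels would come from an NL encoder rather than a cropped image patch, which is exactly the substitution the Siamese Natural Language Region Proposal Network makes.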
Second, based on experimental results and findings from the SOT by NL task, we propose the Tracked-object Retrieval by NL (TRNL) descriptions task and collect the CityFlow-NL Benchmark for it. CityFlow-NL contains more than 6,500 precise NL descriptions of tracked vehicle targets, making it the first densely annotated dataset of tracked objects paired with NL descriptions. To highlight the novelty of our dataset, we propose two models for the retrieval by NL task: a single-stream model based on cross-modality similarity matching and a quad-stream retrieval model that models the similarity between language features and visual features, including local visual features, frame-level features, motions, and relationships between visually similar targets. We release the CityFlow-NL Benchmark together with our models as challenges in the 5th and 6th AI City Challenge.
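A minimal sketch of cross-modality similarity matching, assuming both the NL description and each candidate track have already been embedded into a shared vector space (the embeddings below are toy values, not outputs of the thesis' models):

```python
import numpy as np

def retrieve(lang_emb, track_embs):
    """Rank candidate tracks by cosine similarity to an NL embedding.

    lang_emb: (D,) query embedding; track_embs: (N, D) candidate embeddings.
    Returns (indices sorted best-first, raw similarity scores).
    """
    lang = lang_emb / np.linalg.norm(lang_emb)
    tracks = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    sims = tracks @ lang                    # cosine similarity per candidate
    return np.argsort(-sims), sims

query = np.array([1.0, 0.0, 1.0])
candidates = np.array([[1.0, 0.1, 0.9],    # nearly parallel to the query
                       [0.0, 1.0, 0.0],    # orthogonal
                       [-1.0, 0.0, -1.0]]) # opposite direction
ranking, scores = retrieve(query, candidates)
print(ranking[0])  # 0 -- the best-matching track
```

The quad-stream model described above would replace the single `track_embs` matrix with several feature streams (local, frame-level, motion, relational) whose similarities are combined.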
Lastly, we focus on the most challenging yet practical task of Spatio-Temporal Video Grounding (STVG), which aims to spatially and temporally localize a target in videos with NL descriptions. We propose new evaluation protocols for the STVG task to adapt to the new challenges of CityFlow-NL that are not well-represented in prior STVG benchmarks. Three intuitive and novel approaches to the STVG task are proposed and studied in this thesis: a Multi-Object Tracking (MOT) + retrieval by NL approach, a Single Object Tracking (SOT) by NL based approach, and a direct localization approach that uses a transformer network to learn a joint representation from both the NL and vision modalities.
A Comprehensive Study of Real-Time Object Detection Networks Across Multiple Domains: A Survey
Deep neural network based object detectors are continuously evolving and are
used in a multitude of applications, each having its own set of requirements.
While safety-critical applications need high accuracy and reliability,
low-latency tasks need resource and energy-efficient networks. Real-time
detectors, which are a necessity in high-impact real-world applications, are
continuously proposed, but they overemphasize the improvements in accuracy and
speed while other capabilities such as versatility, robustness, resource and
energy efficiency are omitted. A reference benchmark for existing networks does
not exist, nor does a standard evaluation guideline for designing new networks,
which results in ambiguous and inconsistent comparisons. We, thus, conduct a
comprehensive study on multiple real-time detectors (anchor-, keypoint-, and
transformer-based) on a wide range of datasets and report results on an
extensive set of metrics. We also study the impact of variables such as image
size, anchor dimensions, confidence thresholds, and architecture layers on the
overall performance. We analyze the robustness of detection networks against
distribution shifts, natural corruptions, and adversarial attacks. Also, we
provide a calibration analysis to gauge the reliability of the predictions.
Finally, to highlight the real-world impact, we conduct two unique case
studies, on autonomous driving and healthcare applications. To further gauge
the capability of networks in critical real-time applications, we report the
performance after deploying the detection networks on edge devices. Our
extensive empirical study can act as a guideline for the industrial community
to make an informed choice on the existing networks. We also hope to inspire
the research community towards a new direction in the design and evaluation of
networks that focuses on a bigger and holistic overview for a far-reaching
impact.
Comment: Published in Transactions on Machine Learning Research (TMLR) with Survey Certification.
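Most of the accuracy metrics such a benchmark reports (e.g. mAP at various thresholds) reduce to box overlap. A minimal Intersection-over-Union implementation, using the common (x1, y1, x2, y2) corner convention, is:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2),
    the overlap measure underlying detection metrics such as mAP."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # 0 when boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~= 0.1429
```

A detection counts as a true positive when its IoU with a ground-truth box exceeds a threshold (commonly 0.5); sweeping that threshold and the confidence cutoff yields the precision-recall curves the survey's metrics are built on.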
An Evaluation of Deep Learning-Based Object Identification
Identification of instances of semantic objects of a particular class, a capability now embedded in daily life through applications such as autonomous driving and security monitoring, is one of the most crucial and challenging areas of computer vision. Recent developments in deep learning networks for detection have improved object detector accuracy. To provide a detailed review of the current state of object detection pipelines, we begin by analyzing the methodologies employed by classical detection models and presenting the benchmark datasets used in this study. We then examine one- and two-stage detectors in detail before summarizing several object detection approaches. In addition, we survey both established and emerging applications rather than examining only a single branch of object detection. Finally, we discuss how various object detection algorithms can be combined into a system that is both efficient and effective, and we identify a number of emerging trends to guide the use of the most recent algorithms and future research.