Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetics
under the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of the cold weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed up techniques, and the recent state of the art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc., and makes an in-depth
analysis of their challenges as well as technical improvements in recent years.
Comment: This work has been submitted to the IEEE TPAMI for possible publicatio
ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection
Effective feature fusion of multispectral images plays a crucial role in
multispectral object detection. Previous studies have demonstrated the
effectiveness of feature fusion using convolutional neural networks, but these
methods are sensitive to image misalignment because of their inherent deficiency in
local-range feature interaction, which degrades performance. To
address this issue, a novel feature fusion framework of dual cross-attention
transformers is proposed to model global feature interaction and capture
complementary information across modalities simultaneously. This framework
enhances the discriminability of object features through the query-guided
cross-attention mechanism, leading to improved performance. However, stacking
multiple transformer blocks for feature enhancement incurs a large number of
parameters and high spatial complexity. To handle this, inspired by the human
process of reviewing knowledge, an iterative interaction mechanism is proposed
to share parameters among block-wise multimodal transformers, reducing model
complexity and computation cost. The proposed method is general and can be
integrated effectively into different detection frameworks and used with different
backbones. Experimental results on KAIST, FLIR, and VEDAI datasets show that
the proposed method achieves superior performance and faster inference, making
it suitable for various practical scenarios. Code will be available at
https://github.com/chanchanchan97/ICAFusion.
Comment: submitted to Pattern Recognition Journal, minor revisio
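As a rough illustration of the parameter-sharing idea in the ICAFusion abstract above, the following NumPy sketch applies one pair of cross-attention blocks repeatedly instead of stacking new blocks. It is not the authors' implementation: the function names (`cross_attention`, `iterative_fusion`, `make_params`), the single-head attention, the residual connection, and the random weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head cross-attention: queries come from one modality,
    keys and values from the other. A residual connection is assumed."""
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return query_feats + softmax(scores) @ v


def iterative_fusion(rgb, thermal, params, num_iters=3):
    """Reuse the SAME weights on every pass (hypothetical sketch of
    iterative interaction with shared parameters)."""
    for _ in range(num_iters):
        new_rgb = cross_attention(rgb, thermal, *params["rgb"])
        new_thermal = cross_attention(thermal, rgb, *params["thermal"])
        rgb, thermal = new_rgb, new_thermal
    return rgb, thermal


def make_params(dim, seed=0):
    """Random projection matrices, for illustration only."""
    rng = np.random.default_rng(seed)

    def mats():
        return tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

    return {"rgb": mats(), "thermal": mats()}
```

Because the same weight matrices are reused on every pass, the parameter count in this sketch stays fixed as `num_iters` grows; only the number of passes changes.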
Real-time Aerial Detection and Reasoning on Embedded-UAVs
We present a unified pipeline architecture for a real-time detection system
on an embedded system for UAVs. Neural architectures have been the industry
standard for computer vision. However, most existing works focus solely on
concatenating deeper layers to achieve higher accuracy with run-time
performance as the trade-off. This pipeline of networks can exploit the
domain-specific knowledge on aerial pedestrian detection and activity
recognition for the emerging UAV applications of autonomous surveying and
activity reporting. In particular, our pipeline architecture operates in a
time-sensitive manner, detects pedestrians from various
aerial orientations with high accuracy, uses a novel attention map for multi-activity
recognition, and jointly refines its detections with temporal information.
Numerically, we demonstrate our model's accuracy and fast inference speed on
embedded systems. We empirically deployed our prototype hardware with full live
feeds in a real-world open-field environment.
Comment: In TGR
Sensor fusion in driving assistance systems
International Mention in the doctoral degree
Life in developed and developing countries is highly dependent on road and
urban motor transport. This activity involves a high cost for its active and passive
users in terms of pollution and accidents, which are largely attributable to
the human factor. New developments in safety and driving assistance, called
Advanced Driving Assistance Systems (ADAS), are intended to improve
transport safety and, in the medium term, lead to autonomous driving.
ADAS, like human driving, rely on sensors that provide information
about the environment, and sensor reliability is as crucial for ADAS
applications as sensing abilities are for human driving.
One way to improve sensor reliability is Sensor
Fusion: developing novel strategies for modelling the driving environment with
several sensors and obtaining enhanced information from the combination
of the available data.
This thesis aims to offer a novel solution for obstacle detection
and classification in automotive applications using sensor fusion with two
sensors widely available on the market: the visible-spectrum camera and the laser
scanner. Cameras and laser scanners are commonly used in the scientific
literature, increasingly affordable, and ready to be deployed in real-world
applications. The proposed solution detects and classifies
some obstacles commonly present on the road, such as pedestrians and cyclists.
Novel approaches for detection and classification have been explored in this
thesis, ranging from classification of point-cloud clusters obtained from the laser
scanner to domain adaptation techniques for creating synthetic image datasets, and including
intelligent cluster extraction and ground detection and removal in point clouds.
Official Doctoral Programme in Electrical, Electronic and Automation Engineering
President: Cristina Olaverri Monreal. Secretary: Arturo de la Escalera Hueso. Member: José Eugenio Naranjo Hernánde
Object Tracking and Mensuration in Surveillance Videos
This thesis focuses on tracking and mensuration in surveillance videos. The
first part of the thesis discusses several object tracking approaches based on the
different properties of tracking targets. For airborne videos, where the targets are
usually small and of low resolution, an approach that builds motion models for the
foreground and background is proposed, in which the foreground target is simplified as a
rigid object. For relatively high-resolution targets, non-rigid models are applied.
An active contour-based algorithm is introduced. The algorithm
decomposes tracking into three parts: estimating the affine transform parameters
between successive frames using particle filters; detecting the contour deformation using
a probabilistic deformation map; and regulating the deformation by projecting the
updated model onto a trained shape subspace. The active appearance Markov chain
(AAMC) integrates a statistical model of shape, appearance and motion. In the
AAMC model, a Markov chain represents the switching of motion phases (poses),
and several pairwise active appearance model (P-AAM) components characterize the
shape, appearance and motion information for different motion phases. The second
part of the thesis covers video mensuration, in which we propose a height-measuring
algorithm that requires less human supervision and offers more flexibility and improved
robustness. From videos acquired by an uncalibrated stationary camera, we first
recover the vanishing line and the vertical point of the scene. We then apply a single
view mensuration algorithm to each of the frames to obtain height measurements.
Finally, using the LMedS as the cost function and the Robbins-Monro stochastic
approximation (RMSA) technique to obtain the optimal estimate
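For readers unfamiliar with LMedS (least median of squares), the sketch below shows why it is a robust cost for combining per-frame height measurements. It is only an illustration: the abstract pairs the cost with Robbins-Monro stochastic approximation, whereas this sketch swaps in a brute-force search over the measurements themselves, and the function names and sample heights are hypothetical.

```python
import statistics


def lmeds_cost(estimate, measurements):
    """Least-median-of-squares cost: the median of the squared residuals.
    Unlike a mean of squares, a minority of gross outliers barely moves it."""
    return statistics.median((m - estimate) ** 2 for m in measurements)


def lmeds_estimate(measurements):
    """Brute-force minimiser (an assumption, not the RMSA procedure):
    try each measurement as a candidate and keep the lowest-cost one."""
    return min(measurements, key=lambda c: lmeds_cost(c, measurements))


# Hypothetical per-frame height measurements in centimetres,
# with two gross outliers from bad frames.
heights_cm = [170, 171, 169, 172, 170, 300, 20]
robust_height = lmeds_estimate(heights_cm)
mean_height = sum(heights_cm) / len(heights_cm)
```

In this toy data the arithmetic mean is pulled away from the cluster of consistent measurements by the two outliers, while the LMedS estimate stays inside that cluster.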
Amodal Instance Segmentation and Multi-Object Tracking with Deep Pixel Embedding
This thesis extends the representational output of semantic instance segmentation by explicitly including both visible and occluded parts. A fully convolutional network is trained to produce consistent pixel-level embeddings across two layers such that, when clustered, the results convey the full spatial extent and depth ordering of each instance. Results demonstrate that the network can accurately estimate complete masks in the presence of occlusion and outperform leading top-down bounding-box approaches.
The model is further extended to produce consistent pixel-level embeddings across two consecutive image frames from a video to simultaneously perform amodal instance segmentation and multi-object tracking. Neither a post-processing tracker nor the Hungarian algorithm is needed to perform multi-object tracking. The advantages and disadvantages of such a bounding-box-free approach are studied thoroughly. Experiments show that the proposed method outperforms the state-of-the-art bounding-box-based approach on tracking animated moving objects.
Advisors: Eric T. Psota and Lance C. Pére
From pixels to people : recovering location, shape and pose of humans in images
Humans are at the centre of a significant amount of research in computer vision. Endowing machines with the ability to perceive people from visual data is an immense scientific challenge with a high degree of direct practical relevance. Success in automatic perception can be measured at different levels of abstraction, and this will depend on which intelligent behaviour we are trying to replicate: the ability to localise persons in an image or in the environment, understanding how persons are moving at the skeleton and at the surface level, interpreting their interactions with the environment including with other people, and perhaps even anticipating future actions. In this thesis we tackle different sub-problems of the broad research area referred to as "looking at people", aiming to perceive humans in images at different levels of granularity. We start with bounding box-level pedestrian detection: We present a retrospective analysis of methods published in the decade preceding our work, identifying various strands of research that have advanced the state of the art. With quantitative experiments, we demonstrate the critical role of developing better feature representations and having the right training distribution. We then contribute two methods based on the insights derived from our analysis: one that combines the strongest aspects of past detectors and another that focuses purely on learning representations. The latter method outperforms more complicated approaches, especially those based on hand-crafted features. We conclude our work on pedestrian detection with a forward-looking analysis that maps out potential avenues for future research. We then turn to pixel-level methods: Perceiving humans requires us to both separate them precisely from the background and identify their surroundings. To this end, we introduce Cityscapes, a large-scale dataset for street scene understanding.
This has since established itself as a go-to benchmark for segmentation and detection. We additionally develop methods that relax the requirement for expensive pixel-level annotations, focusing on the task of boundary detection, i.e. identifying the outlines of relevant objects and surfaces. Next, we make the jump from pixels to 3D surfaces, from localising and labelling to fine-grained spatial understanding. We contribute a method for recovering 3D human shape and pose, which marries the advantages of learning-based and model-based approaches. We conclude the thesis with a detailed discussion of benchmarking practices in computer vision. Among other things, we argue that the design of future datasets should be driven by the general goal of combinatorial robustness besides task-specific considerations.
Proceedings of the 2019 Joint Workshop of Fraunhofer IOSB and Institute for Anthropomatics, Vision and Fusion Laboratory
The annual workshop of Fraunhofer IOSB and the Chair for Interactive Real-Time Systems (Lehrstuhl für Interaktive Echtzeitsysteme) of the Karlsruhe Institute of Technology took place again in 2019. The doctoral students of both institutions presented the progress of their research in the areas of machine learning, machine vision, metrology, network security, and usage control. The ideas of this workshop are collected in this book in the form of technical reports.