A Survey on Joint Object Detection and Pose Estimation using Monocular Vision
In this survey we present a complete landscape of joint object detection and
pose estimation methods that use monocular vision. Descriptions of traditional
approaches that involve descriptors or models and various estimation methods
have been provided. These descriptors or models include chordiograms,
shape-aware deformable parts model, bag of boundaries, distance transform
templates, natural 3D markers and facet features whereas the estimation methods
include iterative clustering estimation, probabilistic networks and iterative
genetic matching. Hybrid approaches that use handcrafted feature extraction
followed by estimation by deep learning methods have been outlined. We have
investigated and compared, wherever possible, pure deep-learning-based
approaches (single-stage and multi-stage) for this problem. Comprehensive
details of the various accuracy measures and metrics have been illustrated. For
the purpose of giving a clear overview, the characteristics of relevant
datasets are discussed. The trends that prevailed from the infancy of this
problem until now have also been highlighted.
Comment: Accepted at the International Joint Conference on Computer Vision and
Pattern Recognition (CCVPR) 201
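One of the traditional descriptor families the survey names, distance transform templates, can be illustrated compactly. The following is a hedged, minimal sketch of chamfer-style template matching, not any surveyed paper's actual implementation: a multi-source BFS computes the distance from every pixel to the nearest edge, and a template is scored by the average distance of its (shifted) points.

```python
from collections import deque

def distance_transform(edges, h, w):
    """Multi-source BFS: 4-connected (Manhattan) distance from each cell
    to the nearest edge pixel in a binary edge map."""
    INF = float("inf")
    dist = [[INF] * w for _ in range(h)]
    q = deque()
    for (r, c) in edges:
        dist[r][c] = 0
        q.append((r, c))
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and dist[nr][nc] == INF:
                dist[nr][nc] = dist[r][c] + 1
                q.append((nr, nc))
    return dist

def chamfer_score(template, dist, offset):
    """Average distance from each shifted template point to the nearest
    image edge; lower means a better match."""
    orow, ocol = offset
    return sum(dist[r + orow][c + ocol] for r, c in template) / len(template)

# Toy edge map: a vertical segment at column 3.
H, W = 8, 8
edge_pixels = [(r, 3) for r in range(2, 6)]
dt = distance_transform(edge_pixels, H, W)

# The same segment as a template, anchored at the origin.
tmpl = [(r, 0) for r in range(4)]
print(chamfer_score(tmpl, dt, (2, 3)))  # exact overlap -> 0.0
print(chamfer_score(tmpl, dt, (2, 5)))  # shifted 2 columns -> 2.0
```

In practice the distance transform is computed once per image, so sliding the template over many offsets stays cheap.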
A comparative evaluation of interest point detectors and local descriptors for visual SLAM
In this paper we compare the behavior of different interest point detectors and descriptors under the
conditions needed to be used as landmarks in vision-based simultaneous localization and mapping (SLAM).
We evaluate the repeatability of the detectors, as well as the invariance and distinctiveness of the descriptors,
under different perceptual conditions using sequences of images representing planar objects as well as 3D scenes.
We believe that this information will be useful when selecting an appropriate interest point detector and descriptor for visual SLAM.
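The repeatability criterion evaluated for the detectors can be sketched in a few lines. This is a hedged illustration under a simplifying assumption (a pure-translation ground-truth motion between views), not the paper's exact evaluation protocol: a keypoint counts as repeated if, after mapping it with the known motion, another detection lies within a pixel tolerance.

```python
import math

def repeatability(kps_a, kps_b, shift, eps=2.0):
    """Fraction of keypoints detected in image A that reappear in image B
    within eps pixels, after mapping A's points by the known ground-truth
    motion (modeled here as a pure translation for simplicity)."""
    dx, dy = shift
    hits = 0
    for (x, y) in kps_a:
        mapped = (x + dx, y + dy)
        if any(math.dist(mapped, kp) <= eps for kp in kps_b):
            hits += 1
    return hits / len(kps_a)

# Keypoints from two views of the same scene, offset by (5, 0);
# one of A's four points is not re-detected in B.
a = [(10, 10), (20, 15), (30, 40), (50, 50)]
b = [(15, 10), (25, 15), (35, 41)]
print(repeatability(a, b, shift=(5, 0)))  # -> 0.75
```

For real image pairs the ground-truth mapping would be a homography (planar objects) or derived from known 3D scene geometry, as in the sequences described above.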
Developing Predictive Models of Driver Behaviour for the Design of Advanced Driving Assistance Systems
Worldwide, injuries in vehicle accidents have been on the rise in recent
years, mainly due to driver error. The main objective of this research is to
develop a predictive system for driving maneuvers by analyzing the cognitive
behavior (cephalo-ocular) and the driving behavior of the driver (how the vehicle
is being driven). Advanced Driving Assistance Systems (ADAS) include
different driving functions, such as vehicle parking, lane departure warning,
blind spot detection, and so on. While much research has been performed on
developing automated co-driver systems, little attention has been paid to the
fact that the driver plays an important role in driving events. Therefore, it
is crucial to monitor events and factors that directly concern the driver. As
a goal, we perform a quantitative and qualitative analysis of driver behavior
to find its relationship with driver intentionality and driving-related actions.
We have designed and developed an instrumented vehicle (RoadLAB) that is
able to record several synchronized streams of data, including the surrounding
environment of the driver, vehicle functions and driver cephalo-ocular behavior,
such as gaze/head information. We subsequently analyze and study the
behavior of several drivers to find out if there is a meaningful relation between
driver behavior and the next driving maneuver.
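The kind of maneuver prediction described here can be sketched as a toy classifier over cephalo-ocular features. Everything below is hypothetical and purely illustrative: the feature pairs (gaze yaw, head yaw), the labels, and the 1-nearest-neighbour rule are stand-ins, not RoadLAB's actual feature set or model.

```python
import math

# Hypothetical training examples: (gaze yaw in degrees, head yaw in degrees)
# paired with the maneuver the driver performed next. Purely illustrative;
# a real system would use far richer, synchronized feature streams.
training = [
    ((-35.0, -20.0), "left_turn"),
    ((-30.0, -15.0), "left_turn"),
    ((30.0, 18.0), "right_turn"),
    ((38.0, 22.0), "right_turn"),
    ((0.0, 2.0), "go_straight"),
    ((-2.0, 0.0), "go_straight"),
]

def predict_next_maneuver(features, examples=training):
    """1-nearest-neighbour prediction of the next maneuver from
    cephalo-ocular features (Euclidean distance in feature space)."""
    return min(examples, key=lambda ex: math.dist(features, ex[0]))[1]

print(predict_next_maneuver((-28.0, -17.0)))  # -> left_turn
print(predict_next_maneuver((1.0, -1.0)))     # -> go_straight
```

The point of the sketch is only the pipeline shape: synchronized driver-behavior features in, a predicted next maneuver out.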
A 3D Omnidirectional Sensor For Mobile Robot Applications
International audience
Spatio-temporal action localization with Deep Learning
Master's dissertation in Informatics Engineering (Engenharia Informática).
Systems that detect and identify human activities are known as human action recognition
systems. In the video-based approach, human activity is classified into four categories,
depending on the complexity of the steps and the number of body parts involved in the
action: gestures, actions, interactions, and activities. Capturing valuable and
discriminative features is challenging for video human action recognition because of the
variations of the human body. Consequently, deep learning techniques have provided
practical applications in multiple fields of signal processing, usually surpassing
traditional signal processing on a large scale.
Recently, several applications, namely surveillance, human-computer interaction, and
content-based video retrieval, have studied the detection and recognition of violence. In recent
years there has been a rapid growth in the production and consumption of a wide variety of
video data due to the popularization of high quality and relatively low-price video devices.
Smartphones and digital cameras contributed greatly to this trend. At the same time,
about 300 hours of video are uploaded to YouTube every minute. Along with the growing
production of video data, new technologies such as video captioning, video question
answering, and video-based activity/event detection are emerging every day. From the video input data,
the detection of human activity indicates which activity is contained in the video and locates
the regions in the video where the activity occurs.
This dissertation conducted an experiment to identify and detect violence with spatial
action localization, adapting a public dataset for the purpose. The idea was to take an
annotated dataset for general action recognition and adapt it to violence detection only.
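Spatial action localization of the kind run in this experiment is typically scored per frame with intersection-over-union (IoU) between a predicted box and the annotated one. A minimal sketch follows; the 0.5 acceptance threshold is a common convention in the literature, not necessarily the one used in this dissertation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted "violence" box vs. the annotated one in a single frame.
pred, gt = (10, 10, 50, 50), (20, 20, 60, 60)
score = iou(pred, gt)
print(round(score, 3))   # overlap ratio -> 0.391
print(score >= 0.5)      # counted as a correct localization? -> False
```

Aggregating this per-frame decision over a whole clip gives the spatio-temporal localization metrics commonly reported for such datasets.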
Computational intelligence approaches to robotics, automation, and control [Volume guest editors]
No abstract available
Confidence Estimation in Image-Based Localization
Image-based localization aims at estimating the camera position and orientation, briefly referred to as the camera pose, from a given image. Estimating the camera pose is needed in several applications, such as augmented reality, odometry, and self-driving cars. A main challenge is to develop an algorithm for large, varying environments, such as buildings or whole cities. During the past decade several algorithms have tackled this challenge and, despite the promising results, the task is far from being solved. Several applications, however, need a reliable pose estimate; in odometry applications, for example, the camera pose is used to correct the drift error accumulated by inertial sensor measurements. It is therefore important to be able to assess the confidence of an estimated pose and to discriminate between correct and incorrect poses within a prefixed error threshold. A common approach is to use the number of inliers produced in the RANSAC loop to evaluate how good an estimate is; in particular, this is used to choose the best pose for a given image from a set of candidates. This metric, however, is not very robust, especially for indoor scenes, which present many repetitive patterns, such as long textureless walls or similar objects. Although other metrics have been proposed, they aim at improving the accuracy of the algorithm by grading candidate poses for the same query image; they can thus recognize the best pose within a given set but cannot be used to grade the overall confidence of the final pose. In this thesis, we formalize confidence estimation as a binary classification problem and investigate how to quantify the confidence of an estimated camera pose. In contrast to previous work, this new research question takes place after the whole visual localization pipeline and can also compare poses from different query images.
In addition to the number of inliers, other factors, such as the spatial distribution of inliers, are considered. A neural network is then used to generate a novel robust metric, able to evaluate the confidence for different query images. The proposed method is benchmarked using InLoc, a challenging dataset for indoor pose estimation. It is also shown that the proposed confidence metric is independent of the dataset used for training and can be applied to different datasets and pipelines.
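The inlier-count baseline that this thesis argues is insufficient can be made concrete with a tiny RANSAC loop. This is a hedged sketch on 2D line fitting only; the thesis works with 6-DoF camera poses, and its learned neural-network metric is not shown here.

```python
import random

def ransac_line(points, iters=200, thresh=0.5, seed=0):
    """Fit y = m*x + b by RANSAC; return (model, inlier_count).
    The inlier count is the naive confidence proxy discussed above."""
    rng = random.Random(seed)
    best_model, best_inliers = None, 0
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical sample; skip for this simple model
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        inliers = sum(abs(y - (m * x + b)) < thresh for x, y in points)
        if inliers > best_inliers:
            best_model, best_inliers = (m, b), inliers
    return best_model, best_inliers

# 20 points on y = 2x + 1 plus 5 gross outliers.
pts = [(x, 2 * x + 1) for x in range(20)] + \
      [(3, 40), (7, -9), (11, 80), (15, 0), (18, 55)]
model, inliers = ransac_line(pts)
print(model, inliers)  # recovers (2.0, 1.0) with 20 inliers
```

The thesis's point is that this single count can be misleading across scenes; for example, repetitive indoor structure can yield many inliers for a wrong pose, which is why additional cues such as the spatial distribution of inliers are fed to a learned classifier.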