Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
This paper presents a self-supervised method for visual detection of the
active speaker in a multi-person spoken interaction scenario. Active speaker
detection is a fundamental prerequisite for any artificial cognitive system
attempting to acquire language in social settings. The proposed method is
intended to complement the acoustic detection of the active speaker, thus
improving the system robustness in noisy conditions. The method can detect an
arbitrary number of possibly overlapping active speakers based exclusively on
visual information about their faces. Furthermore, the method does not rely on
external annotations, thus complying with cognitive development. Instead, the
method uses information from the auditory modality to support learning in the
visual domain. This paper reports an extensive evaluation of the proposed
method using a large multi-person face-to-face interaction dataset. The results
show good performance in a speaker-dependent setting. However, in a
speaker-independent setting the proposed method yields significantly lower
performance. We believe that the proposed method represents an essential
component of any artificial cognitive system or robotic platform engaging in
social interactions. Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
Perception systems for robust autonomous navigation in natural environments
2022 Spring. Includes bibliographical references. As assistive robotics continues to develop thanks to rapid advances in artificial intelligence, smart sensors, the Internet of Things, and robotics, the industry has begun introducing robots to perform various functions that make humans' lives more comfortable and enjoyable. While the principal purpose of deploying robots has been productivity enhancement, their usability has widely expanded. Examples include assisting people with disabilities (e.g., Toyota's Human Support Robot), providing driverless transportation (e.g., Waymo's driverless cars), and helping with tedious house chores (e.g., iRobot). The challenge in these applications is that the robots have to function appropriately in continuously changing environments and harsh real-world conditions, deal with significant amounts of noise and uncertainty, and operate autonomously without the intervention or supervision of an expert. To meet these challenges, a robust perception system is vital. This dissertation sheds light on the perception component of autonomous mobile robots, highlights their major capabilities, and analyzes the factors that affect their performance. In short, the approaches developed in this dissertation cover the following four topics: (1) learning the detection and identification of objects in the environment in which the robot is operating, (2) estimating the 6D pose of objects of interest to the robot, (3) studying the importance of the tracking information in the motion prediction module, and (4) analyzing the performance of three motion prediction methods, comparing them, and highlighting their strengths and weaknesses. All techniques developed in this dissertation have been implemented and evaluated on popular public benchmarks.
Extensive experiments have been conducted to analyze and validate the properties of the developed methods and to demonstrate this dissertation's conclusions on the robustness, performance, and utility of the proposed approaches for intelligent mobile robots.
Effect of Attentional Capture and Cross-Modal Interference in Multisensory Cognitive Processing
Despite considerable research, the effects of common types of noise on verbal and spatial information processing are still relatively unknown. Three experiments, using convenience sampling, were conducted to investigate the effect of auditory interference on the cognitive performance of 24 adult men and women during the Stroop test, object recognition and spatial location tasks, and tasks involving the perception of object size, shape, and spatial location. The data were analyzed using univariate analysis of variance and 1-way multivariate analysis of variance. The Experiment 1 findings indicated that reaction time performance for gender and age group was affected by auditory interference between experimental conditions, and that recognition accuracy was affected only by experimental condition. The Experiment 2a results showed that reaction time performance for recognizing object features was affected by auditory interference between age groups, and recognition accuracy by experimental condition. The Experiment 2b results demonstrated that reaction time performance for detecting the spatial location of objects was affected by auditory interference between age groups. In addition, reaction time was affected by the type of interference and spatial location. Further, recognition accuracy was affected by interference condition and spatial location. The Experiment 3 findings suggested that reaction time performance for assessing part-whole relationships was affected by auditory interference between age groups. Further, recognition accuracy was affected by interference condition between experimental groups. This study may create social change by affecting the design of learning and workplace environments, the neurological correlates of auditory and visual stimuli, and the pathologies of adults such as attention deficit hyperactivity disorder.
FETNet: Feature exchange transformer network for RGB-D object detection
In RGB-D object detection, the inherent difference between the RGB and depth modalities makes it challenging to leverage sensed photometric and depth information simultaneously. To address this issue, we propose a Feature Exchange Transformer Network (FETNet), which consists of two components: the Feature Exchange Module (FEM) and the Multi-modal Vision Transformer (MViT). Specifically, we propose the FEM to exchange part of the channels between RGB and depth features at each backbone stage, which facilitates the information flow and bridges the gap between the two modalities. Inspired by the success of the Vision Transformer (ViT), we develop the variant MViT to effectively fuse multi-modal features and exploit the attention between the RGB and depth features. Unlike previous methods, which are developed from a specific RGB detection algorithm, our proposal is generic. Extensive experiments show that, when the proposed modules are integrated into mainstream RGB object detection methods, their RGB-D counterparts obtain significant performance gains. Moreover, our FETNet surpasses state-of-the-art RGB-D detectors by 7.0% mAP on SUN RGB-D and 1.7% mAP on NYU Depth v2, which further demonstrates the effectiveness of the proposed method.
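The channel exchange described for the FEM can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `exchange_channels`, the `ratio` parameter, and the choice to swap the leading channels are all assumptions made for illustration, since the abstract does not say which channels are exchanged or how many. Feature maps are modelled as plain Python lists of channels.

```python
def exchange_channels(rgb, depth, ratio=0.5):
    """Swap a leading fraction of channels between two feature maps.

    Hypothetical illustration: `rgb` and `depth` are lists of channels of
    equal length; `ratio` (an assumed parameter) is the fraction exchanged.
    """
    if len(rgb) != len(depth):
        raise ValueError("feature maps must have the same number of channels")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    k = int(len(rgb) * ratio)  # number of channels to exchange
    # Each output keeps its own trailing channels and receives the other
    # modality's leading channels.
    new_rgb = depth[:k] + rgb[k:]
    new_depth = rgb[:k] + depth[k:]
    return new_rgb, new_depth


rgb_feat = ["r0", "r1", "r2", "r3"]
depth_feat = ["d0", "d1", "d2", "d3"]
mixed_rgb, mixed_depth = exchange_channels(rgb_feat, depth_feat, ratio=0.5)
print(mixed_rgb)    # ['d0', 'd1', 'r2', 'r3']
print(mixed_depth)  # ['r0', 'r1', 'd2', 'd3']
```

After the exchange, each branch carries channels from both modalities, which is one plausible reading of how the module lets information flow between the RGB and depth backbones.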
3D objects and scenes classification, recognition, segmentation, and reconstruction using 3D point cloud data: A review
Three-dimensional (3D) point cloud analysis has become an attractive
subject in realistic imaging and machine vision due to its simplicity,
flexibility and powerful capacity for visualization. Indeed, the
representation of scenes and buildings using 3D shapes and formats has been
leveraged in many applications, including automatic driving and scene and object
reconstruction. Nevertheless, working with this emerging type of data has
been a challenging task for object representation, scene recognition,
segmentation, and reconstruction. In this regard, a significant effort has
recently been devoted to developing novel strategies, using different
techniques such as deep learning models. To that end, we present in this paper
a comprehensive review of existing tasks on 3D point clouds: a well-defined
taxonomy of existing techniques is built based on the nature of the adopted
algorithms, application scenarios, and main objectives. Various tasks performed
on 3D point cloud data are investigated, including object and scene
detection, recognition, segmentation and reconstruction. In addition, we
introduce a list of the datasets used, discuss the respective evaluation metrics and
compare the performance of existing solutions to better characterize the
state of the art and identify their limitations and strengths. Lastly, we
elaborate on current challenges facing this technology and on future
trends attracting considerable interest, which could be a starting point for
upcoming research studies.
A Study of Attention-Free and Attentional Methods for LiDAR and 4D Radar Object Detection in Self-Driving Applications
In this thesis, we re-examine the problem of 3D object detection in the context of self-driving cars with the first publicly released View of Delft (VoD) dataset [1] containing 4D radar sensor data. 4D radar is a novel sensor that provides velocity and Radar Cross Section (RCS) information in addition to position for its point cloud. State-of-the-art architectures such as 3DETR [2] and IASSD [3] were used as baselines. Several attention-free methods, such as point cloud concatenation, feature propagation and feature fusion with an MLP, as well as attentional methods utilizing cross-attention, were tested to determine how best to combine LiDAR and radar in a multimodal detection architecture that outperforms the baseline architectures trained on either modality alone. Our findings indicate that while attention-free methods did not consistently surpass the baseline performance across all classes, they did lead to notable performance gains for specific classes. Furthermore, we found that attentional methods faced challenges due to the sparsity of radar point clouds and duplicated features, which limited the efficacy of the cross-attention mechanism. These findings highlight potential avenues for future research to refine and improve upon attentional methods in the context of 3D object detection.
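Point cloud concatenation, the simplest attention-free method mentioned above, can be sketched as follows. This is an illustrative assumption rather than the thesis's actual pipeline: the function name `concat_point_clouds`, the zero-padding of the missing LiDAR fields, and the extra `is_radar` flag are hypothetical choices; the abstract only states that radar points carry velocity and RCS in addition to position.

```python
def concat_point_clouds(lidar_points, radar_points):
    """Merge LiDAR and 4D radar points into one list with a shared layout.

    Hypothetical illustration.
    lidar_points: iterable of (x, y, z)
    radar_points: iterable of (x, y, z, velocity, rcs)
    Returns rows of (x, y, z, velocity, rcs, is_radar).
    """
    fused = []
    for x, y, z in lidar_points:
        # LiDAR has no velocity or RCS here, so those fields are zero-padded
        # (an assumed convention).
        fused.append((x, y, z, 0.0, 0.0, 0.0))
    for x, y, z, velocity, rcs in radar_points:
        # The final 1.0 marks the point as coming from the radar.
        fused.append((x, y, z, velocity, rcs, 1.0))
    return fused


lidar = [(1.0, 2.0, 3.0)]
radar = [(4.0, 5.0, 6.0, -1.5, 10.0)]
for row in concat_point_clouds(lidar, radar):
    print(row)
```

A single detector can then consume the merged list as one point cloud, with the sensor flag letting it distinguish the two sources.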
LSTA: Long Short-Term Attention for Egocentric Action Recognition
Egocentric activity recognition is one of the most challenging tasks in video
analysis. It requires a fine-grained discrimination of small objects and their
manipulation. While some methods are based on strong supervision and attention
mechanisms, they are either annotation-intensive or do not take spatio-temporal
patterns into account. In this paper we propose LSTA as a mechanism to focus on
features from spatially relevant parts while attention is tracked smoothly
across the video sequence. We demonstrate the effectiveness of LSTA on
egocentric activity recognition with an end-to-end trainable two-stream
architecture, achieving state-of-the-art performance on four standard
benchmarks. Comment: Accepted to CVPR 201