Tracking Objects as Points
Tracking has traditionally been the art of following interest points through
space and time. This changed with the rise of powerful deep networks. Nowadays,
tracking is dominated by pipelines that perform object detection followed by
temporal association, also known as tracking-by-detection. In this paper, we
present a simultaneous detection and tracking algorithm that is simpler,
faster, and more accurate than the state of the art. Our tracker, CenterTrack,
applies a detection model to a pair of images and detections from the prior
frame. Given this minimal input, CenterTrack localizes objects and predicts
their associations with the previous frame. That's it. CenterTrack is simple,
online (no peeking into the future), and real-time. It achieves 67.3% MOTA on
the MOT17 challenge at 22 FPS and 89.4% MOTA on the KITTI tracking benchmark at
15 FPS, setting a new state of the art on both datasets. CenterTrack is easily
extended to monocular 3D tracking by regressing additional 3D attributes. Using
monocular video input, it achieves 28.3% AMOTA@0.2 on the newly released
nuScenes 3D tracking benchmark, substantially outperforming the monocular
baseline on this benchmark while running at 28 FPS.
Comment: ECCV 2020 Camera-ready version. Updated track rebirth results. Code available at https://github.com/xingyizhou/CenterTrac
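To make the association step concrete, the sketch below shows one way the described greedy, offset-based matching could look; the function names, the distance threshold, and the sign convention of the predicted offset are illustrative assumptions, not the released CenterTrack code.

# Illustrative sketch (not the authors' implementation): greedily associate
# current detections with prior-frame detections using predicted center offsets.
import numpy as np

def associate(curr_centers, pred_offsets, prev_centers, prev_ids,
              dist_thresh=32.0, next_id=0):
    """curr_centers, pred_offsets: (N, 2); prev_centers: (M, 2); prev_ids: length-M list."""
    assigned, used = [], set()
    # Displaced centers estimate where each current object was in the previous
    # frame (the sign convention here is an assumption).
    displaced = curr_centers + pred_offsets
    for c in displaced:
        d = np.linalg.norm(prev_centers - c, axis=1) if len(prev_centers) else None
        j = int(np.argmin(d)) if d is not None else -1
        if j >= 0 and d[j] < dist_thresh and j not in used:
            assigned.append(prev_ids[j])   # continue an existing track
            used.add(j)
        else:
            assigned.append(next_id)       # otherwise spawn a new track
            next_id += 1
    return assigned, next_id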
Behind every domain there is a shift: adapting distortion-aware vision transformers for panoramic semantic segmentation
In this paper, we address panoramic semantic segmentation which is under-explored due to two critical challenges: (1) image
distortions and object deformations on panoramas; (2) lack of semantic annotations in the 360° imagery. To tackle these problems, first,
we propose the upgraded Transformer for Panoramic Semantic Segmentation, i.e., Trans4PASS+, equipped with Deformable Patch
Embedding (DPE) and Deformable MLP (DMLPv2) modules for handling object deformations and image distortions whenever (before
or after adaptation) and wherever (shallow or deep levels). Second, we enhance the Mutual Prototypical Adaptation (MPA) strategy
via pseudo-label rectification for unsupervised domain adaptive panoramic segmentation. Third, aside from Pinhole-to-Panoramic
(PIN2PAN) adaptation, we create a new dataset (SynPASS) with 9,080 panoramic images, facilitating a Synthetic-to-Real (SYN2REAL)
adaptation scheme in 360° imagery. Extensive experiments are conducted, which cover indoor and outdoor scenarios, and each of
them is investigated with PIN2PAN and SYN2REAL regimens. Trans4PASS+ achieves state-of-the-art performances on four domain
adaptive panoramic semantic segmentation benchmarks. Code is available at https://github.com/jamycheung/Trans4PASS
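The deformable patch embedding idea can be pictured with a rough PyTorch-style sketch: a small convolution predicts per-patch offsets, the input is resampled at the shifted positions, and the warped result is projected to tokens. Module names, the offset range, and the bilinear resampling choice are assumptions for illustration, not the released Trans4PASS+ code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformablePatchEmbed(nn.Module):
    """Rough sketch of a deformable patch embedding (illustrative only)."""
    def __init__(self, in_ch=3, embed_dim=64, patch=4, max_offset=0.1):
        super().__init__()
        self.offset_pred = nn.Conv2d(in_ch, 2, kernel_size=patch, stride=patch)
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        self.max_offset = max_offset  # offsets bounded to a fraction of the image (assumption)

    def forward(self, x):
        b, _, h, w = x.shape
        # Per-patch 2D offsets in normalized coordinates, upsampled to full resolution.
        offsets = torch.tanh(self.offset_pred(x)) * self.max_offset
        offsets = F.interpolate(offsets, size=(h, w), mode="bilinear", align_corners=True)
        # Regular sampling grid in [-1, 1], shifted by the predicted offsets.
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0) + offsets.permute(0, 2, 3, 1)
        warped = F.grid_sample(x, grid, align_corners=True)  # deformation-aware resampling
        return self.proj(warped)  # (B, embed_dim, H/patch, W/patch)

tokens = DeformablePatchEmbed()(torch.randn(1, 3, 64, 64))  # -> (1, 64, 16, 16)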
The CLEAR 2007 Evaluation
This paper is a summary of the 2007 CLEAR Evaluation on the Classification of Events, Activities, and Relationships which took place in early 2007 and culminated with a two-day workshop held in May 2007. CLEAR is an international effort to evaluate systems for the perception of people, their activities, and interactions. In its second year, CLEAR has developed a following from the computer vision and speech communities, spawning a more multimodal perspective of research evaluation. This paper describes the evaluation tasks, including metrics and databases used, and discusses the results achieved. The CLEAR 2007 tasks comprise person, face, and vehicle tracking, head pose estimation, as well as acoustic scene analysis. These include subtasks performed in the visual, acoustic and audio-visual domains for meeting room and surveillance data.
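For reference, the multiple object tracking scores used throughout the CLEAR evaluations (and quoted as MOTA elsewhere in this listing) are commonly defined as follows; this is the standard CLEAR MOT formulation rather than a quotation from the paper:

\mathrm{MOTA} = 1 - \frac{\sum_t (\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t)}{\sum_t \mathrm{GT}_t}, \qquad \mathrm{MOTP} = \frac{\sum_{i,t} d_{i,t}}{\sum_t c_t}

where FN_t, FP_t and IDSW_t are the misses, false positives and identity switches in frame t, GT_t is the number of ground-truth objects in frame t, d_{i,t} is the distance between matched object i and its hypothesis, and c_t is the number of matches in frame t.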
Goal-recognition-based adaptive brain-computer interface for navigating immersive robotic systems
Objective. This work proposes principled strategies for self-adaptation in EEG-based brain-computer interfaces (BCIs) as a way out of the bandwidth bottleneck resulting from the considerable mismatch between the low-bandwidth interface and the bandwidth-hungry application, and as a way to enable fluent and intuitive interaction in embodiment systems. The main focus lies on inferring the hidden target goals of users while they navigate a remote environment, as a basis for possible adaptations. Approach. To reason about possible user goals, a general user-agnostic Bayesian update rule is devised and applied recursively upon the arrival of evidence, i.e. user input and user gaze. Experiments were conducted with healthy subjects in robotic embodiment settings to evaluate the proposed method. These experiments varied along three factors: the type of robot/environment (simulated or physical), the type of interface (keyboard or BCI), and the way goal recognition (GR) is used to guide a simple shared control (SC) driving scheme. Main results. Our results show that the proposed GR algorithm is able to track and infer the hidden user goals with relatively high precision and recall. Further, the realized SC driving scheme benefits from the output of the GR system and reduces the user effort needed to accomplish the assigned tasks. Although the BCI requires higher effort than the keyboard conditions, most subjects were able to complete the assigned tasks, and the proposed GR system is additionally shown to handle the uncertainty in user input during SSVEP-based interaction. The SC application of the belief vector indicates that the benefits of the GR module are more pronounced for BCIs than for the keyboard interface. Significance. Being based on intuitive heuristics that model the behavior of the general population during navigation tasks, the proposed GR method can be used without prior tuning for individual users. The proposed methods can easily be integrated into more advanced SC schemes and/or strategies for automatic BCI self-adaptation.
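The recursive goal inference can be illustrated with a minimal sketch of a Bayesian belief update over candidate goals; the placeholder likelihoods below stand in for the paper's user-agnostic heuristics and are not its actual model.

import numpy as np

def update_belief(belief, likelihoods):
    """belief: prior P(goal), shape (G,); likelihoods: P(evidence | goal), shape (G,)."""
    posterior = belief * likelihoods
    total = posterior.sum()
    return belief if total == 0 else posterior / total  # keep prior if evidence rules out all goals

# Example: three candidate goals, updated after two pieces of evidence
# (e.g. a steering command, then a gaze fixation). Numbers are made up.
belief = np.full(3, 1.0 / 3.0)
for lik in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]):
    belief = update_belief(belief, np.asarray(lik))
print(belief)  # the belief concentrates on the goal that best explains the evidence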
MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer Vision
Prior to the deep learning era, shape was commonly used to describe objects.
Nowadays, state-of-the-art (SOTA) algorithms in medical imaging are
predominantly diverging from computer vision, where voxel grids, meshes, point
clouds, and implicit surface models are used. This is seen from numerous
shape-related publications in premier vision conferences as well as the growing
popularity of ShapeNet (about 51,300 models) and Princeton ModelNet (127,915
models). For the medical domain, we present a large collection of anatomical
shapes (e.g., bones, organs, vessels) and 3D models of surgical instruments,
called MedShapeNet, created to facilitate the translation of data-driven vision
algorithms to medical applications and to adapt SOTA vision algorithms to
medical problems. As a unique feature, we directly model the majority of shapes
on the imaging data of real patients. As of today, MedShapeNet includes 23
datasets with more than 100,000 shapes that are paired with annotations (ground
truth). Our data is freely accessible via a web interface and a Python
application programming interface (API) and can be used for discriminative,
reconstructive, and variational benchmarks as well as various applications in
virtual, augmented, or mixed reality, and 3D printing. As examples, we present
use cases in the fields of classification of brain tumors, facial and skull
reconstructions, multi-class anatomy completion, education, and 3D printing. In
the future, we will extend the data and improve the interfaces. The project pages
are: https://medshapenet.ikim.nrw/ and
https://github.com/Jianningli/medshapenet-feedback
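As a hypothetical illustration of feeding such shapes into a discriminative benchmark, the snippet below converts a locally downloaded mesh into a normalized point cloud with trimesh; the file path is a placeholder and this is not the MedShapeNet Python API.

import numpy as np
import trimesh

mesh = trimesh.load("downloads/example_organ.stl")  # placeholder path to a downloaded shape
points = np.asarray(mesh.sample(2048))              # 2048 points sampled on the surface, shape (2048, 3)
points -= points.mean(axis=0)                       # center the shape
points /= np.abs(points).max()                      # scale into the unit cube
print(points.shape)                                 # ready as input for a point-cloud classifier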
Deducing the visual focus of attention from head pose estimation in dynamic multi-view meeting scenarios
This paper presents our work on recognizing the visual focus of attention in dynamic meeting scenarios. We collected a new dataset of meetings in which acting participants followed a predefined script of events to enforce focus shifts of the remaining, unaware meeting members. Including the whole room, a total of 35 potential focus targets were annotated, some of which were moved or introduced spontaneously during the meeting. On this dynamic dataset, we present a new approach to deduce the visual focus by means of head orientation as a first clue and show that our system recognizes the correct visual target in over 57% of all frames, compared to 47% when mapping head pose to the first-best intersecting focus target directly.
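The first-best-intersection baseline mentioned above can be sketched as picking, per frame, the target whose direction is angularly closest to the head's viewing direction; the geometry below is a simplified illustration, not the paper's implementation.

import numpy as np

def nearest_target(head_pos, view_dir, target_positions):
    """head_pos: (3,), view_dir: unit vector (3,), target_positions: (T, 3) -> target index."""
    to_targets = target_positions - head_pos
    to_targets /= np.linalg.norm(to_targets, axis=1, keepdims=True)
    cosines = to_targets @ view_dir     # cosine of the angle between view direction and each target
    return int(np.argmax(cosines))      # smallest angle = largest cosine

head = np.array([0.0, 0.0, 1.2])
view = np.array([1.0, 0.0, 0.0])
targets = np.array([[2.0, 0.1, 1.2], [0.0, 2.0, 1.2]])
print(nearest_target(head, view, targets))  # -> 0, the target the head is oriented towards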
Person re-identification by deep learning attribute-complementary information
Automatic person re-identification (re-id) across camera boundaries is a challenging problem. Approaches have to be robust against many factors which influence the visual appearance of a person but are not relevant to the person's identity. Examples of such factors are pose, camera angle, and lighting conditions. Person attributes are semantic, high-level information that is invariant across many such influences and often highly relevant to a person's identity. In this work we develop a re-id approach which leverages the information contained in automatically detected attributes. We train an attribute classifier on separate data and include its responses in the training process of our person re-id model, which is based on convolutional neural networks (CNNs). This allows us to learn a person representation which contains information complementary to that contained within the attributes. Our approach is able to identify the attributes which perform most reliably for re-id and focus on them accordingly. We demonstrate the performance improvement gained through use of the attribute information on multiple large-scale datasets and report insights into which attributes are most relevant for person re-id.
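One way to picture attribute-complementary training is sketched below in PyTorch: the responses of an attribute classifier are concatenated with the CNN appearance features before the re-id embedding. Layer sizes and the concatenation-based fusion are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class AttributeComplementaryReID(nn.Module):
    """Illustrative fusion of attribute responses with CNN features (not the paper's model)."""
    def __init__(self, feat_dim=2048, num_attributes=30, num_ids=751):
        super().__init__()
        self.attr_head = nn.Linear(feat_dim, num_attributes)    # attribute classifier (trained on separate data)
        self.embed = nn.Linear(feat_dim + num_attributes, 256)  # complementary re-id embedding
        self.id_head = nn.Linear(256, num_ids)                  # identity classifier used during training

    def forward(self, feats):                                   # feats: (B, feat_dim) CNN features
        attrs = torch.sigmoid(self.attr_head(feats))            # attribute responses in [0, 1]
        emb = self.embed(torch.cat([feats, attrs], dim=1))      # descriptor used for matching
        return emb, self.id_head(emb), attrs

emb, id_logits, attrs = AttributeComplementaryReID()(torch.randn(4, 2048))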
3D user-perspective, voxel-based estimation of visual focus of attention in dynamic meeting scenarios
In this paper we present a new framework for the online estimation of people's visual focus of attention from their head poses in dynamic meeting scenarios. We describe a voxel-based approach to reconstruct the scene composition from an observer's perspective, in order to integrate occlusion handling and visibility verification. The observer's perspective is simulated with live head pose tracking over four far-field views from the room's upper corners. We integrate motion and speech activity as further scene observations in a Bayesian Surprise framework to model prior attractors of attention within the situation's context. As evaluations on a dedicated dataset of 10 meeting videos show, this allows us to predict a meeting participant's focus of attention correctly in up to 72.2% of all frames.
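Bayesian Surprise is commonly quantified as the Kullback-Leibler divergence between the belief after an observation and the belief before it; the snippet below is a generic illustration of that quantity, not the paper's attractor model.

import numpy as np

def bayesian_surprise(prior, posterior, eps=1e-12):
    """KL(posterior || prior) between two discrete beliefs over focus targets."""
    prior = np.asarray(prior, dtype=float) + eps
    posterior = np.asarray(posterior, dtype=float) + eps
    prior, posterior = prior / prior.sum(), posterior / posterior.sum()
    return float(np.sum(posterior * np.log(posterior / prior)))

# A target that suddenly becomes much more likely (e.g. a new speaker) yields
# high surprise and therefore acts as an attractor of attention.
print(bayesian_surprise([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]))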