
    Detect to Track and Track to Detect

    Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.
    Comment: ICCV 2017. Code and models: https://github.com/feichtenhofer/Detect-Track Results: https://www.robots.ox.ac.uk/~vgg/research/detect-track
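    The video-level output described in contribution (iii) comes from linking per-frame detections via the regressed (tracked) boxes. Below is a minimal Python sketch of that linking idea, assuming illustrative names (Detection, iou, link_detections) and a greedy score-plus-overlap association rather than the paper's exact optimal-path formulation.

```python
# A minimal sketch (not the authors' exact formulation) of linking per-frame
# detections into video-level tracks using detection scores plus the overlap
# between a regressed (tracked) box and candidate boxes in the next frame.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2) in the current frame
    score: float        # class confidence from the frame-level detector
    tracked_box: tuple  # box regressed into the next frame by the track branch

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_detections(frames: List[List[Detection]], iou_thresh=0.5):
    """Greedily extend tracks frame by frame, scoring each candidate link by
    the sum of the two detection confidences plus the IoU between the
    regressed box and the candidate detection (a simplification of a
    Viterbi-style optimal-path search)."""
    tracks = [[d] for d in frames[0]]
    for dets in frames[1:]:
        used = set()
        for track in tracks:
            prev = track[-1]
            best, best_score = None, 0.0
            for j, d in enumerate(dets):
                if j in used:
                    continue
                overlap = iou(prev.tracked_box, d.box)
                link_score = prev.score + d.score + overlap
                if overlap >= iou_thresh and link_score > best_score:
                    best, best_score = j, link_score
            if best is not None:
                track.append(dets[best])
                used.add(best)
        # detections not linked to any existing track start new tracks
        tracks.extend([d] for j, d in enumerate(dets) if j not in used)
    return tracks
```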

    Video analytics on the MLK Smart Corridor testbed

    With the predicted boom of urban populations in the next 30 years, many new challenges in urban transportation will surface. In an effort to mitigate these, the Center for Urban Informatics and Progress (CUIP) has been established along with its testbed. One opportunity this testbed provides is the ability to utilize computer vision and video analytics to anonymously gather data on how citizens traverse the city. This thesis discusses an approach to real-time object tracking that serves as a basis for further analytics such as traffic flow data collection and near-miss detection. The proposed video analytics platform will aid citizens with their day-to-day commute through the corridor by deriving real-time data based on actual behavior seen in the citizens' commute. Furthermore, since the testbed is ever-expanding in both hardware and size, the algorithms and software proposed in this thesis are designed to prioritize scalability.
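    As a concrete illustration of the traffic-flow analytics mentioned above, the sketch below shows one simple way tracked object centroids could be turned into a flow count by detecting crossings of a virtual counting line. The Track structure, counting-line position, and counting rule are assumptions for illustration, not the thesis' actual pipeline.

```python
# An illustrative sketch of deriving a traffic-flow count from per-frame
# track positions: a track is counted once when its centroid crosses a
# hypothetical horizontal counting line between consecutive frames.
from dataclasses import dataclass, field
from typing import List, Tuple

COUNT_LINE_Y = 400  # assumed counting-line position in pixel coordinates

@dataclass
class Track:
    track_id: int
    centroids: List[Tuple[float, float]] = field(default_factory=list)
    counted: bool = False

def update_flow_count(tracks: List[Track], flow_count: int) -> int:
    """Increment the flow count once per track when the track's centroid
    moves from one side of the counting line to the other."""
    for t in tracks:
        if t.counted or len(t.centroids) < 2:
            continue
        (_, y_prev), (_, y_curr) = t.centroids[-2], t.centroids[-1]
        if (y_prev - COUNT_LINE_Y) * (y_curr - COUNT_LINE_Y) <= 0:
            flow_count += 1
            t.counted = True
    return flow_count
```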

    3D PersonVLAD: Learning Deep Global Representations for Video-based Person Re-identification

    In this paper, we introduce a global video representation for video-based person re-identification (re-ID) that aggregates local 3D features across the entire video extent. Most of the existing methods rely on 2D convolutional networks (ConvNets) to extract frame-wise deep features which are pooled temporally to generate the video-level representations. However, 2D ConvNets lose temporal input information immediately after the convolution, and a separate temporal pooling is limited in capturing human motion in shorter sequences. To this end, we present a global video representation (3D PersonVLAD), complementary to 3D ConvNets, as a novel layer to capture the appearance and motion dynamics in full-length videos. However, encoding each video frame in its entirety and computing an aggregate global representation across all frames is tremendously challenging due to occlusions and misalignments. To resolve this, our proposed network is further augmented with a 3D part alignment module to learn local features through a soft-attention module. These attended features are statistically aggregated to yield identity-discriminative representations. Our global 3D features are demonstrated to achieve state-of-the-art results on three benchmark datasets: MARS \cite{MARS}, iLIDS-VID \cite{VideoRanking}, and PRID 2011.
    Comment: Accepted to appear in IEEE Transactions on Neural Networks and Learning Systems
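    For readers unfamiliar with VLAD-style aggregation, the following is a rough PyTorch sketch of a NetVLAD-like layer applied to 3D ConvNet feature maps, in the spirit of the global representation described above. The layer sizes, normalization choices, and the omission of the part-alignment and soft-attention modules are simplifications and assumptions, not the paper's implementation.

```python
# A rough sketch of a NetVLAD-style aggregation layer over 3D ConvNet
# feature maps: each spatio-temporal location is softly assigned to learned
# cluster centres, and assignment-weighted residuals are summed into a
# fixed-length video descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLAD3D(nn.Module):
    def __init__(self, feature_dim=512, num_clusters=32):
        super().__init__()
        self.num_clusters = num_clusters
        # 1x1x1 convolution producing per-location soft cluster assignments
        self.assign = nn.Conv3d(feature_dim, num_clusters, kernel_size=1)
        # learnable cluster centres
        self.centroids = nn.Parameter(torch.randn(num_clusters, feature_dim))

    def forward(self, x):
        # x: (batch, feature_dim, T, H, W) from a 3D ConvNet backbone
        b, d = x.shape[:2]
        soft = F.softmax(self.assign(x), dim=1)            # (b, K, T, H, W)
        soft = soft.flatten(2)                              # (b, K, N)
        feats = x.flatten(2)                                # (b, d, N)
        # residuals of every local feature to every cluster centre,
        # weighted by the soft assignment and summed over all locations
        residuals = feats.unsqueeze(1) - self.centroids.view(1, -1, d, 1)
        vlad = (soft.unsqueeze(2) * residuals).sum(-1)      # (b, K, d)
        vlad = F.normalize(vlad, dim=2)                     # intra-normalization
        return F.normalize(vlad.flatten(1), dim=1)          # (b, K*d) descriptor
```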

    AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

    This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
    Comment: To appear in CVPR 2018. Check the dataset page https://research.google.com/ava/ for details
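    To make the annotation characteristics above concrete, here is a hypothetical sketch of how one AVA-style label could be represented: a person box at a keyframe with one or more atomic action labels and a track id linking the same person across consecutive segments. All field names and example values are illustrative assumptions, not the released schema.

```python
# A hypothetical representation of one spatio-temporal atomic-action label:
# a person box at a given keyframe time, possibly carrying several action
# labels, plus an id that links the person across segments.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AtomicActionAnnotation:
    video_id: str                                   # which 15-minute clip
    timestamp: float                                # keyframe time in seconds
    person_box: Tuple[float, float, float, float]   # normalized (x1, y1, x2, y2)
    action_labels: List[int]                        # one or more of the 80 atomic actions
    person_track_id: int                            # links the person across segments

# e.g. one person performing two atomic actions at the same keyframe
example = AtomicActionAnnotation(
    video_id="clip_0001",           # hypothetical clip id
    timestamp=912.0,
    person_box=(0.21, 0.10, 0.48, 0.95),
    action_labels=[12, 64],         # hypothetical label ids
    person_track_id=3,
)
```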