32 research outputs found
How to Train Your Dragon: Tamed Warping Network for Semantic Video Segmentation
Real-time semantic segmentation on high-resolution videos is challenging due
to the strict requirements of speed. Recent approaches have utilized the
inter-frame continuity to reduce redundant computation by warping the feature
maps across adjacent frames, greatly speeding up the inference phase. However,
their accuracy drops significantly owing to the imprecise motion estimation and
error accumulation. In this paper, we propose to introduce a simple and
effective correction stage right after the warping stage to form a framework
named Tamed Warping Network (TWNet), aiming to improve the accuracy and
robustness of warping-based models. The experimental results on the Cityscapes
dataset show that with the correction, the accuracy (mIoU) significantly
increases from 67.3% to 71.6%, and the speed edges down from 65.5 FPS to 61.8
FPS. For non-rigid categories such as "human" and "object", the IoU
improvements are even larger, exceeding 18 percentage points.
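The warp-then-correct idea can be illustrated with a minimal sketch: bilinear warping of a feature map along a flow field, followed by a hypothetical correction step that blends in features computed cheaply from the current frame. The function names and the blending scheme are illustrative assumptions, not TWNet's actual architecture.

```python
import numpy as np

def warp_features(feat, flow):
    """Warp a feature map across frames with a per-pixel flow field.

    feat: (H, W, C) feature map from the previous frame.
    flow: (H, W, 2) backward flow (dx, dy) giving, for each target pixel,
          its source location in the previous frame.
    Uses bilinear sampling; out-of-bounds samples are clamped to the border.
    """
    H, W, _ = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sx = np.clip(xs + flow[..., 0], 0, W - 1)   # source x per target pixel
    sy = np.clip(ys + flow[..., 1], 0, H - 1)   # source y per target pixel
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def correct(warped, current_frame_feat, weight):
    """Hypothetical correction stage: blend the warped features with
    features computed cheaply from the current frame, so that warping
    errors do not accumulate unchecked across frames."""
    return (1 - weight) * warped + weight * current_frame_feat
```

With zero flow, `warp_features` is an identity, which makes the bilinear sampling easy to sanity-check.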
Video Semantic Segmentation Network with Low Latency Based on Deep Learning
Recently, advances in deep learning algorithms have yielded fascinating results in computer vision, enabling machines to perform activities that formerly required human vision and reasoning. Classification, object identification, and semantic segmentation have all seen substantial advancements in deep learning architecture in the last few years, and there has been major progress in semantic segmentation of both still images and videos. In practical uses such as autonomous vehicles, semantic video segmentation remains difficult due to high performance standards, the high computational cost of convolutional neural networks (CNNs), and the significant need for low latency. An effective machine-learning environment is developed to meet the performance and latency challenges outlined above. Using deep learning architectures such as SegNet and FlowNet2.0 on the CamVid dataset, this environment performs pixel-wise semantic segmentation of video while maintaining low latency; it is therefore well suited for real-world applications, since it takes advantage of both the SegNet and FlowNet topologies. A decision network determines whether an image frame should be processed by the segmentation network or the optical flow network based on a predicted confidence score. In conjunction with adaptive scheduling of key frames, this decision-making technique helps to speed up the procedure. Using the ResNet50 SegNet model, a mean Intersection over Union (mIoU) of 54.27 percent at an average of 19.57 frames per second was observed. With the decision network and adaptive key-frame sequencing, FlowNet2.0 increased the frames processed per second (fps) to 30.19 on GPU, with a mean IoU of 47.65 percent; this resulted from the GPU being utilized 47.65 percent of the time.
This improvement in performance demonstrates that the speed of the video semantic segmentation network has increased without sacrificing quality.
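The confidence-gated scheduling described above can be sketched as follows. Here `segment`, `propagate`, and `confidence` are hypothetical stand-ins for the SegNet forward pass, the FlowNet-based label propagation, and the decision network's predicted score; the threshold and maximum key-frame gap are illustrative assumptions.

```python
import numpy as np

def segment(frame):
    # Stand-in for a full segmentation-network pass (expensive path).
    return np.zeros(frame.shape[:2], dtype=int)

def propagate(prev_labels, frame):
    # Stand-in for optical-flow-based label propagation (cheap path).
    return prev_labels

def confidence(prev_labels, frame):
    # Stand-in for the decision network's predicted confidence that
    # flow propagation will be accurate enough for this frame.
    return 0.9

def run_video(frames, threshold=0.8, max_gap=5):
    """Confidence-gated scheduling with adaptive key frames: re-run the
    segmentation network when confidence drops below the threshold, or
    when max_gap frames have passed since the last key frame."""
    labels, since_key, out = None, 0, []
    for frame in frames:
        if (labels is None or since_key >= max_gap
                or confidence(labels, frame) < threshold):
            labels, since_key = segment(frame), 0          # key frame
        else:
            labels, since_key = propagate(labels, frame), since_key + 1
        out.append(labels)
    return out
```

The design point is that the expensive path runs only when the cheap path is predicted to fail, which is what trades a small accuracy loss for the reported gain in throughput.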
Dynamic Face Video Segmentation via Reinforcement Learning
For real-time semantic video segmentation, most recent works utilised a
dynamic framework with a key scheduler to make online key/non-key decisions.
Some works used a fixed key scheduling policy, while others proposed adaptive
key scheduling methods based on heuristic strategies, both of which may lead to
suboptimal global performance. To overcome this limitation, we model the online
key decision process in dynamic video segmentation as a deep reinforcement
learning problem and learn an efficient and effective scheduling policy from
expert information about decision history and from the process of maximising
global return. Moreover, we study the application of dynamic video segmentation
on face videos, a field that has not been investigated before. By evaluating on
the 300VW dataset, we show that the performance of our reinforcement key
scheduler outperforms that of various baselines in terms of both effective key
selections and running speed. Further results on the Cityscapes dataset
demonstrate that our proposed method can also generalise to other scenarios. To
the best of our knowledge, this is the first work to use reinforcement learning
for online key-frame decision in dynamic video segmentation, and also the first
work on its application on face videos.
Comment: CVPR 2020. 300VW with segmentation labels is available at:
https://github.com/mapleandfire/300VW-Mas
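The online key/non-key decision process can be cast as a toy MDP and solved with tabular Q-learning. The abstract's method uses deep reinforcement learning with expert decision history, so the state space, reward values, and algorithm below are deliberately simplified illustrative assumptions, not the paper's formulation.

```python
import random

# Toy MDP sketch of the online key-decision problem: the state is the
# number of frames since the last key frame; action 1 runs the full
# segmentation network (costly but accurate), action 0 propagates
# (cheap, but accuracy decays with distance from the key frame).
# All reward values here are illustrative, not from the paper.
def reward(state, action):
    if action == 1:
        return 1.0 - 0.5                      # accuracy minus compute cost
    return max(0.0, 1.0 - 0.2 * state)        # decaying propagation accuracy

def q_learning(episodes=2000, max_state=5, eps=0.1, alpha=0.1, gamma=0.9):
    """Learn a scheduling policy that maximises the discounted return,
    i.e. a global criterion rather than a per-frame heuristic."""
    q = {(s, a): 0.0 for s in range(max_state + 1) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        for _ in range(20):                   # one short episode of frames
            a = (random.choice((0, 1)) if random.random() < eps
                 else max((0, 1), key=lambda x: q[(s, x)]))
            r = reward(s, a)
            s2 = 0 if a == 1 else min(s + 1, max_state)
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, 0)], q[(s2, 1)])
                                  - q[(s, a)])
            s = s2
    return q
```

The learned table induces a policy (key when `q[(s, 1)] > q[(s, 0)]`) that keys more often as the propagated result ages, which is the behaviour a heuristic scheduler can only approximate.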
Detect or Track: Towards Cost-Effective Video Object Detection/Tracking
State-of-the-art object detectors and trackers are developing fast. Trackers
are in general more efficient than detectors but bear the risk of drifting. A
question is hence raised -- how to improve the accuracy of video object
detection/tracking by utilizing the existing detectors and trackers within a
given time budget? A baseline is frame skipping -- detecting every N-th frame
and tracking for the frames in between. This baseline, however, is suboptimal
since the detection frequency should depend on the tracking quality. To this
end, we propose a scheduler network, which determines to detect or track at a
certain frame, as a generalization of Siamese trackers. Although being
light-weight and simple in structure, the scheduler network is more effective
than the frame skipping baselines and flow-based approaches, as validated on
ImageNet VID dataset in video object detection/tracking.
Comment: Accepted to AAAI 201
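The frame-skipping baseline the abstract argues against can be sketched directly; `detect` and `track` are hypothetical stand-ins for a real detector and tracker. The point of the sketch is that the detection schedule is fixed in advance and ignores tracking quality, which is exactly what the proposed scheduler network replaces.

```python
def detect(frame):
    # Stand-in for an expensive detector pass; returns labelled boxes.
    return [("obj", (0, 0, 10, 10))]

def track(boxes, frame):
    # Stand-in for a cheap tracker that propagates boxes to this frame.
    return boxes

def frame_skipping(frames, n=10):
    """Baseline: run the detector on every n-th frame and track in
    between, regardless of how much the tracker has drifted."""
    boxes, out = None, []
    for i, frame in enumerate(frames):
        boxes = detect(frame) if i % n == 0 else track(boxes, frame)
        out.append(boxes)
    return out
```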
Fair Latency-Aware Metric for real-time video segmentation networks
As supervised semantic segmentation reaches satisfactory results, many
recent papers have focused on making segmentation network architectures faster,
smaller and more efficient. In particular, studies often aim to reach the point
at which they can claim to be "real-time". Achieving this goal is especially
relevant in the context of real-time video operations for autonomous vehicles
and robots, or medical imaging during surgery.
The common metric used for assessing these methods is so far the same as the
ones used for image segmentation without time constraint: mean Intersection
over Union (mIoU). In this paper, we argue that this metric is not relevant
enough for real-time video as it does not take into account the processing time
(latency) of the network. We propose a similar but more relevant metric,
called FLAME, for video-segmentation networks, which compares the network's
output segmentation with the ground-truth segmentation of the video frame that
is current at the time the network finishes processing.
We perform experiments to compare a few networks using this metric and
propose a simple addition to network training that enhances results according to
that metric.
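The idea behind such a latency-aware score can be sketched as follows, assuming latency is expressed as a whole number of frames: the prediction computed from frame t is scored against the ground truth of frame t + latency, the frame actually on screen when processing finishes. The exact definition of FLAME may differ; this is only an illustration of the principle.

```python
import numpy as np

def miou(pred, gt, num_classes):
    # Standard mean Intersection over Union over the classes present.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def latency_aware_score(preds, gts, latency_frames, num_classes):
    """Score each prediction against the ground truth of the frame that
    is current when the network finishes: a slow network is penalised
    because the scene has moved on by the time its output arrives."""
    scores = [miou(preds[t], gts[t + latency_frames], num_classes)
              for t in range(len(preds) - latency_frames)]
    return float(np.mean(scores))
```

On a static scene the latency penalty vanishes and the score reduces to ordinary mIoU, which is the intended limiting behaviour.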