8 research outputs found
Full Reference Screen Content Image Quality Assessment by Fusing Multi-level Structure Similarity
Screen content images (SCIs) usually comprise various content types with
sharp edges, in which artifacts and distortions can be readily sensed by
vanilla structure similarity measurement in a full-reference manner.
Nonetheless, almost all current SOTA structure similarity metrics are
"locally" formulated in a single-level manner, while the real human visual
system (HVS) follows a multi-level manner, and this mismatch can prevent
these metrics from achieving trustworthy quality assessment. To ameliorate
this, this paper advocates a novel solution that measures structure
similarity "globally" from the perspective of sparse representation. To
perform multi-level quality assessment in accordance with the real HVS, the
above-mentioned global metric is integrated with the conventional local ones
via a newly devised selective deep fusion network. To validate its
effectiveness, we have compared our method with 12 SOTA methods on two widely
used large-scale public SCI datasets, and the quantitative results indicate
that our method yields significantly higher consistency with subjective
quality scores than the currently leading works. Both the source code and
data are publicly available to encourage adoption and to facilitate further
advancement and validation.
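For reference, the "local" single-level structure similarity that this work generalizes is typically computed patch-wise in the spirit of the classic SSIM index. Below is a minimal sketch of such a local measurement (standard SSIM constants; not the authors' implementation):

```python
import numpy as np

def local_ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-level, local structure similarity of two grayscale patches.

    x, y: 2-D arrays of identical shape (e.g., 8x8 patches), range [0, 255].
    Returns a scalar; 1.0 means the patches are structurally identical.
    """
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

A conventional full-reference metric averages this score over a sliding window; that locality is precisely what the paper argues mismatches the multi-level HVS.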
Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object Detection
Current mainstream methods formulate their video saliency from two
independent branches, i.e., the spatial and temporal branches. As a
complementary component, the main task of the temporal branch is to
intermittently focus the spatial branch on regions with salient movements.
Thus, even though the overall video saliency quality depends heavily on the
spatial branch, the performance of the temporal branch still matters, and the
key to improving the overall video saliency is to boost the performance of
both branches efficiently. In this paper, we propose a novel spatiotemporal
network that achieves such improvement in a fully interactive fashion. We
integrate a lightweight temporal model into the spatial branch to coarsely
locate spatially salient regions that are correlated with trustworthy salient
movements, while the spatial branch itself recurrently refines the temporal
model in a multi-scale manner, as sketched below. In this way, the spatial
and temporal branches interact with each other, achieving mutual performance
improvement. Our method is easy to implement yet effective, achieving
high-quality video saliency detection at a real-time speed of 50 FPS.
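The abstract does not detail the architecture; the sketch below only illustrates the described interaction pattern (a lightweight temporal model focusing the spatial branch, and the spatial branch refining the temporal cue in return). All module names, shapes, and the gating rule are assumptions:

```python
import torch
import torch.nn as nn

class InteractiveSTBlock(nn.Module):
    """Illustrative mutual spatial-temporal interaction at one scale."""

    def __init__(self, channels=64):
        super().__init__()
        # Lightweight temporal model over features of two adjacent frames.
        self.temporal = nn.Conv2d(2 * channels, channels, 3, padding=1)
        # Spatial refinement of the temporal cue.
        self.refine = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, feat_t, feat_prev):
        # Temporal cue: coarse motion-saliency gate from adjacent frames.
        motion = torch.sigmoid(self.temporal(torch.cat([feat_t, feat_prev], 1)))
        # Temporal -> spatial: focus spatial features on salient movements.
        focused = feat_t * (1.0 + motion)
        # Spatial -> temporal: the spatial branch refines the motion cue;
        # in the paper this refinement recurs across multiple scales.
        motion = torch.sigmoid(self.refine(torch.cat([focused, motion], 1)))
        return focused, motion
```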
A Novel Video Salient Object Detection Method via Semi-supervised Motion Quality Perception
Previous video salient object detection (VSOD) approaches have mainly focused
on designing fancy networks to achieve their performance improvements. However,
with the slow-down in development of deep learning techniques recently, it may
become more and more difficult to anticipate another breakthrough via fancy
networks solely. To this end, this paper proposes a universal learning scheme
to get a further 3\% performance improvement for all state-of-the-art (SOTA)
methods. The major highlight of our method is that we resort the "motion
quality"---a brand new concept, to select a sub-group of video frames from the
original testing set to construct a new training set. The selected frames in
this new training set should all contain high-quality motions, in which the
salient objects will have large probability to be successfully detected by the
"target SOTA method"---the one we want to improve. Consequently, we can achieve
a significant performance improvement by using this new training set to start a
new round of network training. During this new round training, the VSOD results
of the target SOTA method will be applied as the pseudo training objectives.
Our novel learning scheme is simple yet effective, and its semi-supervised
methodology may have large potential to inspire the VSOD community in the
future
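The abstract does not specify how motion quality is scored; assuming a hypothetical score_motion_quality function, the selection-plus-pseudo-label scheme could look like this sketch:

```python
def build_pseudo_training_set(frames, target_sota_predict,
                              score_motion_quality, quality_threshold=0.8):
    """Select high-motion-quality frames and pair them with the target SOTA
    method's predictions as pseudo training objectives (illustrative only).

    frames: list of (frame, previous_frame) pairs from the original test set.
    target_sota_predict: callable mapping a frame to a saliency map.
    score_motion_quality: hypothetical scorer in [0, 1]; the abstract does
        not define it, so both its name and threshold are assumptions.
    """
    new_training_set = []
    for frame, prev_frame in frames:
        if score_motion_quality(frame, prev_frame) >= quality_threshold:
            pseudo_label = target_sota_predict(frame)  # pseudo objective
            new_training_set.append((frame, pseudo_label))
    return new_training_set  # used to start a new round of network training
```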
Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter
Real human attention is an interactive activity between the visual system
and the brain, using both low-level visual stimuli and high-level semantic
information. Previous image salient object detection (SOD) works conduct
their saliency predictions in a multi-task manner, i.e., performing
pixel-wise saliency regression and segmentation-like saliency refinement at
the same time, which degrades the ability of their feature backbones to
reveal semantic information. However, given an image, we tend to pay more
attention to regions that are semantically salient, even when these regions
are not perceptually the most salient ones at first glance. In this paper, we
divide the SOD problem into two sequential tasks: 1) we propose a
lightweight, weakly supervised deep network to coarsely locate semantically
salient regions first; 2) then, as a post-processing procedure, we
selectively fuse multiple off-the-shelf deep models on these semantically
salient regions for pixel-wise saliency refinement. In sharp contrast to
state-of-the-art (SOTA) methods that focus on learning pixel-wise saliency in
a "single image" mainly using perceptual cues, our method investigates the
"object-level semantic ranks between multiple images", a methodology that is
more consistent with the real human attention mechanism. Our method is simple
yet effective, and it is the first attempt to treat salient object detection
mainly as an object-level semantic re-ranking problem.
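As an illustration of the two sequential stages (not the authors' released code), the following sketch assumes a weakly supervised semantic_locator and a simple agreement-based fusion rule, both of which are placeholders:

```python
import numpy as np

def rerank_then_refine(image, semantic_locator, refiners):
    """Stage 1: coarse object-level semantic localization.
    Stage 2: selective fusion of off-the-shelf models as pixel refinement.

    semantic_locator: weakly supervised model -> coarse mask in [0, 1].
    refiners: off-the-shelf pixel-wise saliency models to be fused.
    """
    # Stage 1: coarsely locate the semantically salient regions.
    coarse = semantic_locator(image)                    # H x W, in [0, 1]
    # Stage 2: fuse off-the-shelf predictions within those regions.
    preds = np.stack([m(image) for m in refiners])      # K x H x W
    # Assumed fusion rule: weight each model by its agreement with the
    # coarse semantic mask (the paper's actual rule may differ).
    weights = np.array([(p * coarse).sum() / (p.sum() + 1e-8) for p in preds])
    weights /= weights.sum() + 1e-8
    return np.tensordot(weights, preds, axes=1) * coarse
```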
Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection
Existing RGB-D salient object detection methods treat depth information as an
independent component complementing the RGB part, and widely follow a
bi-stream parallel network architecture. To selectively fuse the CNN features
extracted from both RGB and depth into a final result, the state-of-the-art
(SOTA) bi-stream networks usually consist of two independent subbranches:
one subbranch for RGB saliency and the other for depth saliency. However, the
depth saliency is persistently inferior to the RGB saliency because the RGB
component is intrinsically more informative than the depth component. The
bi-stream architecture thus easily biases the subsequent fusion procedure
toward the RGB subbranch, leading to a performance bottleneck. In this paper,
we propose a novel data-level recombination strategy that fuses RGB with D
(depth) before deep feature extraction: we cyclically convert the original
4-dimensional RGB-D into DGB, RDB, and RGD. Then, a newly designed
lightweight triple-stream network is applied over these newly formulated data
to achieve an optimal channel-wise complementary fusion status between RGB
and D, achieving a new SOTA performance.
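The data-level recombination itself is straightforward to express. A minimal sketch of the described cyclic conversion, assuming channel order (R, G, B, D):

```python
import numpy as np

def recombine_rgbd(rgbd):
    """Cyclically recombine a 4-channel RGB-D image into DGB, RDB, and RGD
    before any deep feature extraction (data-level fusion).

    rgbd: H x W x 4 array with channels ordered (R, G, B, D).
    Returns three H x W x 3 arrays, one per recombination.
    """
    r, g, b, d = [rgbd[..., i] for i in range(4)]
    dgb = np.stack([d, g, b], axis=-1)  # depth replaces the R channel
    rdb = np.stack([r, d, b], axis=-1)  # depth replaces the G channel
    rgd = np.stack([r, g, d], axis=-1)  # depth replaces the B channel
    return dgb, rdb, rgd  # one input per stream of the triple-stream network
```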
Depth Quality Aware Salient Object Detection
Existing fusion-based RGB-D salient object detection methods usually adopt a
bi-stream structure to strike a fusion trade-off between RGB and depth (D).
However, depth quality varies from scene to scene, while the SOTA bi-stream
approaches are depth-quality unaware; this makes it difficult to achieve a
complementary fusion status between RGB and D and leads to poor fusion
results when facing low-quality D. Thus, this paper attempts to integrate a
novel depth-quality-aware subnet into the classic bi-stream structure, aiming
to assess depth quality before conducting the selective RGB-D fusion.
Compared with the SOTA bi-stream methods, the major highlight of our method
is its ability to lessen the importance of low-quality, no-contribution, or
even negative-contribution D regions during RGB-D fusion, achieving a much
improved complementary status between RGB and D.
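As a sketch of the idea (not the paper's actual subnet), depth-quality awareness can be read as a learned gate on the depth stream; the quality_subnet below is an assumed placeholder module:

```python
import torch

def quality_aware_fusion(rgb_feat, d_feat, quality_subnet):
    """Illustrative depth-quality-aware fusion of two feature streams.

    rgb_feat, d_feat: N x C x H x W feature maps from the two streams.
    quality_subnet: assumed module predicting a per-location depth-quality
        map from the concatenated features (2C input channels).
    """
    # Assess depth quality before the selective RGB-D fusion.
    q = torch.sigmoid(quality_subnet(torch.cat([rgb_feat, d_feat], dim=1)))
    # Low-quality D regions (q -> 0) contribute little to the fusion;
    # high-quality regions (q -> 1) are fused on equal footing with RGB.
    return rgb_feat + q * d_feat
```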
A Plug-and-play Scheme to Adapt Image Saliency Deep Model for Video Data
With the rapid development of deep learning techniques, image saliency deep
models trained solely with spatial information have occasionally achieved
detection performance on video data comparable to that of models trained with
both spatial and temporal information. However, because they take temporal
information into less consideration, image saliency deep models may become
fragile in video sequences dominated by temporal information. Thus, the most
recent video saliency detection approaches adopt a network architecture that
starts with a spatial deep model followed by an elaborately designed temporal
deep model. However, such methods easily encounter a performance bottleneck
arising from the single-stream learning methodology, so the overall detection
performance is largely determined by the spatial deep model. In sharp
contrast to the current mainstream methods, this paper proposes a novel
plug-and-play scheme that weakly retrains a pretrained image saliency deep
model for video data using newly sensed and coded temporal information. The
retrained image saliency deep model thus maintains temporal saliency
awareness, achieving much improved detection performance. Moreover, our
method is simple yet effective in adapting any off-the-shelf pretrained image
saliency deep model to obtain high-quality video saliency detection.
Additionally, both the data and source code of our method are publicly
available.
Comment: 12 pages, 10 figures; this paper is currently in peer review at
IEEE TCSVT.
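The abstract leaves the "sensed and coded" temporal information unspecified; the sketch below assumes a hypothetical flow-based temporal_saliency cue and illustrates only the weak-retraining loop:

```python
def adapt_image_model_to_video(image_model, video_frames, temporal_saliency,
                               fine_tune, blend=0.5):
    """Weakly retrain a pretrained image saliency model for video data
    (an illustrative sketch of the plug-and-play idea, not the released code).

    image_model: any off-the-shelf pretrained image saliency deep model.
    temporal_saliency: hypothetical per-frame temporal cue, e.g. flow based.
    fine_tune: routine that retrains image_model on (frame, target) pairs.
    """
    pseudo_pairs = []
    for t in range(1, len(video_frames)):
        frame = video_frames[t]
        spatial = image_model(frame)
        temporal = temporal_saliency(video_frames[t - 1], frame)
        # Blend the spatial prediction with the coded temporal cue into a
        # weak retraining target (the blend weight is an assumption).
        target = (1 - blend) * spatial + blend * temporal
        pseudo_pairs.append((frame, target))
    fine_tune(image_model, pseudo_pairs)  # model gains temporal awareness
    return image_model
```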
Decomposition into Low-rank plus Additive Matrices for Background/Foreground Separation: A Review for a Comparative Evaluation with a Large-Scale Dataset
Recent research on problem formulations based on decomposition into low-rank
plus sparse matrices provides a suitable framework for separating moving
objects from the background. The most representative formulation is Robust
Principal Component Analysis (RPCA) solved via Principal Component Pursuit
(PCP), which decomposes a data matrix into a low-rank matrix plus a sparse
matrix.
However, similar robust implicit or explicit decompositions can be made in the
following problem formulations: Robust Non-negative Matrix Factorization
(RNMF), Robust Matrix Completion (RMC), Robust Subspace Recovery (RSR), Robust
Subspace Tracking (RST), and Robust Low-Rank Minimization (RLRM). The main
goal of these similar problem formulations is to obtain, explicitly or
implicitly, a decomposition into a low-rank matrix plus additive matrices. In
this context,
this work aims to initiate a rigorous and comprehensive review of the similar
problem formulations in robust subspace learning and tracking based on
decomposition into low-rank plus additive matrices for testing and ranking
existing algorithms for background/foreground separation. For this, we first
provide a preliminary review of the recent developments in the different
problem formulations, which allows us to define a unified view that we call
Decomposition into Low-rank plus Additive Matrices (DLAM). Then, we carefully
examine each method in each robust subspace learning/tracking framework,
covering its decomposition, loss function, optimization problem, and solver.
Furthermore, we investigate whether incremental algorithms and real-time
implementations can be achieved for background/foreground separation. Finally,
experimental results on a large-scale dataset called Background Models
Challenge (BMC 2012) show the comparative performance of 32 different robust
subspace learning/tracking methods.
Comment: 121 pages, 5 figures; submitted to Computer Science Review.
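For concreteness, the representative RPCA/PCP formulation reviewed here minimizes ||L||_* + lambda*||S||_1 subject to L + S = M. A minimal inexact-ALM sketch with standard default parameters (illustrative only, not one of the 32 benchmarked implementations):

```python
import numpy as np

def rpca_pcp(M, max_iter=500, tol=1e-7):
    """Robust PCA via Principal Component Pursuit, solved with a basic
    inexact augmented Lagrange multiplier scheme (illustrative defaults).

    M: m x n data matrix (e.g., each column a vectorized video frame).
    Returns (L, S): low-rank background and sparse foreground parts.
    """
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))            # standard PCP weight
    mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    Y = np.zeros_like(M)                      # Lagrange multipliers
    S = np.zeros_like(M)
    norm_M = np.linalg.norm(M, 'fro') + 1e-12
    for _ in range(max_iter):
        # L-step: singular value thresholding of (M - S + Y/mu).
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        sig = np.maximum(sig - 1.0 / mu, 0.0)
        L = (U * sig) @ Vt
        # S-step: entrywise soft thresholding with lambda/mu.
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual update and convergence check on the residual M - L - S.
        Z = M - L - S
        Y += mu * Z
        if np.linalg.norm(Z, 'fro') / norm_M < tol:
            break
    return L, S
```

In background/foreground separation, L recovers the (approximately low-rank) background across frames and S the sparse moving objects, which is the decomposition all the DLAM variants generalize.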