Good Features to Correlate for Visual Tracking
In recent years, correlation filters have shown dominant and spectacular results in visual object tracking. The types of features employed in this family of trackers significantly affect tracking performance. The ultimate goal is to utilize robust features that are invariant to any kind of appearance change of the object, while predicting the object location as accurately as in the case of no appearance change. With the emergence of deep learning based methods, the study of learning features for specific tasks has accelerated. For instance, discriminative visual tracking methods based on deep architectures have been studied with promising performance. Nevertheless, correlation filter based (CFB) trackers confine themselves to using pre-trained networks that were trained for the object classification problem. To this end, in this manuscript, the problem of learning deep fully convolutional features for CFB visual tracking is formulated. In order to learn the proposed model, a novel and efficient backpropagation algorithm is presented based on the loss function of the network. The proposed learning framework enables the network model to be flexible for a custom design. Moreover, it alleviates the dependency on networks trained for classification. Extensive performance analysis shows the efficacy of the proposed custom design in the CFB tracking framework. By fine-tuning the convolutional parts of a state-of-the-art network and integrating this model into a CFB tracker, namely the top-performing tracker of VOT2016, an 18% increase is achieved in terms of expected average overlap, and tracking failures are decreased by 25%, while maintaining superiority over state-of-the-art methods on the OTB-2013 and OTB-2015 tracking datasets.
Comment: Accepted version, IEEE Transactions on Image Processing
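The core operation shared by CFB trackers can be sketched as a closed-form ridge regression in the Fourier domain (a MOSSE-style filter). This is a minimal illustration of the tracking framework the abstract builds on, not the manuscript's learned deep features; all sizes and names below are illustrative:

```python
import numpy as np

def train_filter(patch, target_response, reg=1e-2):
    """Closed-form correlation filter in the Fourier domain (MOSSE-style)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target_response)
    # H* = G F* / (F F* + reg): per-frequency ridge regression
    return (G * np.conj(F)) / (F * np.conj(F) + reg)

def detect(patch, H_conj):
    """Correlate a new patch with the learned filter and locate the peak."""
    F = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(H_conj * F))
    return np.unravel_index(np.argmax(response), response.shape)

# Toy example: a Gaussian target response centred on a bright blob
size = 32
y, x = np.mgrid[:size, :size]
patch = np.exp(-((x - 10) ** 2 + (y - 12) ** 2) / 8.0)
target = np.exp(-((x - 10) ** 2 + (y - 12) ** 2) / 4.0)
H = train_filter(patch, target)
peak = detect(patch, H)   # (row, col) of the response maximum
```

The element-wise division is what makes training and detection run in a few FFTs per frame, which is the efficiency that made this family of trackers popular.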
Efficient MRF Energy Propagation for Video Segmentation via Bilateral Filters
Segmentation of an object from a video is a challenging task in multimedia applications. Depending on the application, automatic or interactive methods are desired; however, regardless of the application type, efficient computation of video object segmentation is crucial for time-critical applications; specifically, mobile and interactive applications require near real-time efficiency. In this paper, we address the problem of video segmentation from the perspective of efficiency. We initially redefine the problem of video object segmentation as the propagation of MRF energies along the temporal domain. For this purpose, a novel and efficient method is proposed to propagate MRF energies throughout the frames via bilateral filters without using any global texture, color, or shape model. The recently presented bi-exponential filter is utilized for efficiency, and a novel technique is also developed to dynamically solve graph-cuts for varying, non-lattice graphs in a general linear filtering scenario. These improvements are evaluated in both automatic and interactive video segmentation scenarios. Moreover, in addition to efficiency, segmentation quality is tested both quantitatively and qualitatively. Indeed, for some challenging examples, significant time efficiency is observed without loss of segmentation quality.
Comment: IEEE Transactions on Multimedia (Volume: 16, Issue: 5, Aug. 2014)
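The filtering primitive mentioned above can be illustrated with a minimal 1-D sketch: a causal and an anti-causal recursive exponential pass whose average approximates a symmetric exponential kernel in O(n), independently of the kernel width. This is only the generic bi-exponential idea; the paper's bilateral range weighting and MRF energy propagation are not reproduced here:

```python
import numpy as np

def bi_exponential(signal, lam=0.5):
    """Two-pass recursive exponential smoothing: one causal and one
    anti-causal sweep, averaged to approximate a symmetric exponential
    kernel. Cost is O(n) regardless of the effective kernel width."""
    n = len(signal)
    fwd = np.empty(n)
    bwd = np.empty(n)
    fwd[0] = signal[0]
    for i in range(1, n):                    # causal pass
        fwd[i] = lam * signal[i] + (1 - lam) * fwd[i - 1]
    bwd[-1] = signal[-1]
    for i in range(n - 2, -1, -1):           # anti-causal pass
        bwd[i] = lam * signal[i] + (1 - lam) * bwd[i + 1]
    return 0.5 * (fwd + bwd)

noisy = np.array([0., 0., 1., 0., 0., 5., 0., 0., 1., 0.])
smooth = bi_exponential(noisy, lam=0.6)      # spikes spread and shrink
```

The constant-time-per-sample recursion is what makes this filter attractive for the near real-time budgets the abstract targets.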
Special issue on advances in three-dimensional television and video: Guest editorial
Cataloged from PDF version of article
Estimation of depth fields suitable for video compression using 3-D structures and motion of objects
Intensity prediction along motion trajectories removes temporal redundancy considerably in video compression algorithms. In three-dimensional (3-D) object-based video coding, both 3-D motion and depth values are required for temporal prediction. The required 3-D motion parameters for each object are found by the correspondence-based E-matrix method. The estimation of the correspondences (the two-dimensional (2-D) motion field) between the frames and the segmentation of the scene into objects are achieved simultaneously by minimizing a Gibbs energy. The depth field is estimated by jointly minimizing a defined distortion and bit-rate criterion using the 3-D motion parameters. The resulting depth field is efficient in the rate-distortion sense. Bit-rate values corresponding to the lossless encoding of the resultant depth fields are obtained using predictive coding; prediction errors are encoded by a Lempel–Ziv algorithm. The results are satisfactory for real-life video scenes.
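The lossless depth-coding step described above, predictive coding followed by Lempel–Ziv encoding, can be sketched as follows. A simple left-neighbour predictor and zlib (a DEFLATE/LZ77-family codec) stand in for the paper's actual predictor and coder, which the abstract does not specify:

```python
import zlib
import numpy as np

def encode_depth(depth):
    """Lossless depth coding: left-neighbour prediction residuals,
    then a Lempel-Ziv family codec (zlib/DEFLATE)."""
    d = depth.astype(np.int16)
    residual = d.copy()
    residual[:, 1:] = d[:, 1:] - d[:, :-1]   # predict each sample from its left neighbour
    return zlib.compress(residual.tobytes())

def decode_depth(blob, shape):
    """Invert the codec: decompress, then undo the prediction by a cumulative sum."""
    residual = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(shape)
    return np.cumsum(residual, axis=1).astype(np.int16)

depth = np.tile(np.arange(64, dtype=np.int16), (64, 1))  # smooth ramp: highly predictable
blob = encode_depth(depth)
restored = decode_depth(blob, depth.shape)
```

On smooth depth fields the residuals are nearly constant, so the Lempel–Ziv stage compresses them far below the raw size, which is exactly the property the rate-distortion-optimized depth fields are designed to exploit.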
3-D motion estimation of rigid objects for video coding applications using an improved iterative version of the E-matrix method
As an alternative to current two-dimensional (2-D) motion models, a robust three-dimensional (3-D) motion estimation method is proposed to be utilized in object-based video coding applications. Since the popular E-matrix method is well known for its susceptibility to input errors, a performance indicator, which tests the validity of the estimated 3-D motion parameters both explicitly and implicitly, is defined. This indicator is utilized within the RANSAC method to obtain a robust set of 2-D motion correspondences, which leads to better 3-D motion parameters for each object. The experimental results support the superiority of the proposed method over direct application of the E-matrix method.
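The RANSAC-with-validity-indicator idea can be sketched on a toy problem. A line fit stands in for E-matrix estimation, and the per-trial inlier count plays the role of the validity indicator that accepts or rejects a candidate model; all thresholds and names are illustrative:

```python
import numpy as np

def ransac_fit(x, y, n_iter=200, tol=0.1, seed=0):
    """Generic RANSAC loop: fit a minimal-sample model, score it by its
    support among all correspondences, keep the best-supported one."""
    rng = np.random.default_rng(seed)
    best_model, best_support = None, -1
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        a = (y[j] - y[i]) / (x[j] - x[i])     # minimal-sample model (a line,
        b = y[i] - a * x[i]                   # standing in for an E-matrix)
        inliers = np.abs(y - (a * x + b)) < tol
        if inliers.sum() > best_support:      # validity test: keep the model
            best_support = inliers.sum()      # supported by most correspondences
            best_model = (a, b)
    return best_model

x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0
y[::10] += np.random.default_rng(1).normal(5, 1, size=5)  # gross outliers, like bad matches
a, b = ransac_fit(x, y)
```

Gross errors in the correspondences, the failure mode the abstract highlights, simply never enter the consensus set, so they cannot corrupt the final parameters.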
Generalizable Embeddings with Cross-batch Metric Learning
Global average pooling (GAP) is a popular component in deep metric learning
(DML) for aggregating features. Its effectiveness is often attributed to
treating each feature vector as a distinct semantic entity and GAP as a
combination of them. Albeit substantiated, the algorithmic implications of this explanation for learning generalizable entities that represent unseen classes, a crucial DML goal, remain unclear. To address this, we formulate GAP as a convex
combination of learnable prototypes. We then show that the prototype learning
can be expressed as a recursive process fitting a linear predictor to a batch
of samples. Building on that perspective, we consider two batches of disjoint
classes at each iteration and regularize the learning by expressing the samples
of a batch with the prototypes that are fitted to the other batch. We validate
our approach on 4 popular DML benchmarks.
Comment: © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
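The identity the formulation builds on, GAP as a convex combination of the local feature vectors, can be checked numerically. Uniform weights are used here; replacing them with learnable prototype-dependent weights is the paper's contribution and is not reproduced:

```python
import numpy as np

# A CxHxW feature map: HW local feature vectors of dimension C
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
fmap = rng.normal(size=(C, H, W))

gap = fmap.mean(axis=(1, 2))             # global average pooling

# The same result written as a convex combination of the HW local vectors
vectors = fmap.reshape(C, H * W)         # columns = local feature vectors
weights = np.full(H * W, 1.0 / (H * W))  # uniform, non-negative, sums to 1
combo = vectors @ weights

assert np.allclose(gap, combo)           # GAP is the uniform convex combination
```

Viewing the weights as free parameters tied to prototypes is what turns this pooling identity into a learnable, regularizable component.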
Rate-Distortion Efficient Piecewise Planar 3D Scene Representation from 2-D Images
In any practical application of 2-D-to-3-D conversion that involves storage and transmission, representation efficiency has an undisputable importance that is not reflected in the attention the topic has received. In order to address this problem, a novel algorithm, which yields efficient 3-D representations in the rate-distortion sense, is proposed. The algorithm utilizes two views of a scene to build a mesh-based representation incrementally, by adding new vertices while minimizing a distortion measure. The experimental results indicate that, in scenes that can be approximated by planes, the proposed algorithm is superior to the dense depth map and, in some practical situations, to block motion vector-based representations in the rate-distortion sense.
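The incremental refinement loop described above can be sketched in a 1-D analogue: keep a set of vertices, interpolate between them, and repeatedly add a vertex where the approximation error is largest, so each unit of rate (a vertex) buys the largest distortion reduction. This is only an illustration of the greedy rate-distortion trade-off, not the paper's two-view mesh algorithm:

```python
import numpy as np

def greedy_refine(depth, max_vertices=6):
    """1-D analogue of incremental mesh refinement: linear interpolation
    between vertices, with new vertices inserted greedily at the point of
    maximum squared error."""
    n = len(depth)
    verts = [0, n - 1]
    history = []                             # distortion after each refinement step
    while len(verts) < max_vertices:
        xs = np.sort(verts)
        approx = np.interp(np.arange(n), xs, depth[xs])
        err = (depth - approx) ** 2
        history.append(float(err.sum()))
        worst = int(np.argmax(err))
        if worst in verts:                   # nothing left to refine
            break
        verts.append(worst)
    return sorted(verts), history

# A piecewise-planar-like profile: two ramps meeting at a crease
profile = np.concatenate([np.linspace(0, 10, 20), np.linspace(10, 2, 20)])
verts, hist = greedy_refine(profile)
```

For piecewise-planar content a handful of vertices drives the distortion to almost zero, which is the intuition behind the claimed superiority over dense depth maps in the rate-distortion sense.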
Semantic multimodal analysis of digital multimedia data (Sayısal çoğulortam verisinin anlamsal çokkipli analizi)
TÜBİTAK EEEAG Project, 30.09.200
Object-based 3-d motion and structure analysis for video coding applications
Ankara: Department of Electrical and Electronics Engineering and the Institute of Engineering and Sciences of Bilkent University, 1997. Thesis (Ph.D.), Bilkent University, 1997. Includes bibliographical references (leaves 102-115).
Novel 3-D motion analysis tools, which can be used in object-based video codecs, are proposed. In these tools, the movements of the objects, which are observed through 2-D video frames, are modeled in 3-D space. Segmentation of 2-D frames into objects and 2-D dense motion vectors for each object are necessary as inputs for the proposed 3-D analysis. 2-D motion-based object segmentation is obtained by a Gibbs formulation; the initialization is achieved by using a fast graph-theory based region segmentation algorithm, which is further improved to utilize the motion information. Moreover, the same Gibbs formulation gives the needed dense 2-D motion vector field. The formulations for the 3-D motion models are given for both rigid and non-rigid moving objects. Deformable motion is modeled by a Markov random field which permits elastic relations between neighbors, whereas rigid 3-D motion parameters are estimated using the E-matrix method. Some improvements on the E-matrix method are proposed to make the algorithm more robust to gross errors, such as those caused by incorrect segmentation of 2-D correspondences between frames. Two algorithms are proposed to obtain dense depth estimates that are robust to input errors and suitable for encoding, respectively. While the former of these two algorithms simply gives a MAP estimate, the latter uses rate-distortion theory. Finally, the 3-D motion models are further utilized for occlusion detection and motion-compensated temporal interpolation, and it is observed that for both applications the 3-D motion models have superiority over their 2-D counterparts. Simulation results on artificial and real data show the advantages of the 3-D motion models in object-based video coding algorithms.
Alatan, A. Aydın. Ph.D.
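The Gibbs formulation used for segmentation above can be illustrated on a toy 1-D labeling problem, minimized here by iterated conditional modes (ICM): a data term per site plus a Potts smoothness prior over neighbours. This is a generic sketch of Gibbs energy minimization, not the thesis's actual motion-based energy:

```python
import numpy as np

def icm_segment(obs, n_labels=2, beta=1.0, n_sweeps=5):
    """Iterated Conditional Modes on a 1-D Gibbs energy:
    data term |obs - label| plus a Potts prior that charges beta
    for every disagreeing neighbour pair."""
    labels = np.round(obs).astype(int).clip(0, n_labels - 1)
    for _ in range(n_sweeps):
        for i in range(len(obs)):
            costs = []
            for lab in range(n_labels):
                data = abs(obs[i] - lab)
                smooth = 0.0
                if i > 0:
                    smooth += beta * (lab != labels[i - 1])
                if i < len(obs) - 1:
                    smooth += beta * (lab != labels[i + 1])
                costs.append(data + smooth)
            labels[i] = int(np.argmin(costs))  # greedy local minimum per site
    return labels

# Noisy step signal with one flipped sample the prior should correct
obs = np.array([0., 0.1, 0.9, 0.05, 0., 1., 0.95, 1.1, 0.9, 1.])
seg = icm_segment(obs, beta=1.0)
```

The smoothness term is what lets the energy absorb an isolated outlier sample instead of producing a spurious one-pixel segment.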
E-VFIA : Event-Based Video Frame Interpolation with Attention
Video frame interpolation (VFI) is a fundamental vision task that aims to synthesize several frames between two consecutive original video images. Most algorithms aim to accomplish VFI using only keyframes, which is an ill-posed problem since the keyframes usually do not provide precise information about the trajectories of the objects in the scene. On the other hand, event-based cameras provide more precise information between the keyframes of a video. Some recent state-of-the-art event-based methods approach this problem by utilizing event data for better optical flow estimation, interpolating video frames by warping. Nonetheless, those methods heavily suffer from the ghosting effect. On the other hand, some kernel-based VFI methods that use only frames as input have shown that deformable convolutions, when backed by transformers, can be a reliable way of dealing with long-range dependencies. We propose event-based video frame interpolation with attention (E-VFIA) as a lightweight kernel-based method. E-VFIA fuses event information with standard video frames by deformable convolutions to generate high-quality interpolated frames. The proposed method represents events with high temporal resolution and uses a multi-head self-attention mechanism to better encode event-based information, while being less vulnerable to blurring and ghosting artifacts, thus generating crisper frames. The simulation results show that the proposed technique outperforms current state-of-the-art methods (both frame- and event-based) with a significantly smaller model size.
Comment: Submitted to 2023 IEEE International Conference on Robotics and Automation (ICRA 2023)
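The multi-head self-attention mechanism mentioned above can be sketched in plain NumPy over a small token sequence, where the tokens would correspond to per-time-bin event features. The actual E-VFIA architecture (event voxelization, deformable convolutions, fusion with frames) is not reproduced; all shapes and weights below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, n_heads):
    """Plain multi-head self-attention: project to queries/keys/values,
    attend within each head's channel slice, concatenate head outputs."""
    t, d = x.shape
    dh = d // n_heads
    q, k, v = x @ wq, x @ wk, x @ wv
    out = np.empty_like(x)
    for h in range(n_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(dh)  # (t, t) attention logits
        out[:, sl] = softmax(scores, axis=-1) @ v[:, sl]
    return out

rng = np.random.default_rng(0)
t, d, heads = 6, 8, 2           # 6 time bins of 8-dim event features
x = rng.normal(size=(t, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
y = multi_head_self_attention(x, wq, wk, wv, heads)
```

Each output token is a convex combination of value vectors from all time bins, which is how attention encodes dependencies across the whole event window rather than a local neighbourhood.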