Incremental refinement of image salient-point detection
Low-level image analysis systems typically detect "points of interest", i.e., areas of natural images that contain corners or edges. Most of the robust and computationally efficient detectors proposed for this task use the autocorrelation matrix of the localized image derivatives. Although the performance of such detectors and their suitability for particular applications have been studied in the relevant literature, their behavior under limited input source (image) precision or limited computational or energy resources is largely unknown. All existing frameworks assume that the input image is readily available for processing and that sufficient computational and energy resources exist to complete the computation. Nevertheless, recent advances in incremental image sensors and compressed sensing, as well as the demand for low-complexity scene analysis in sensor networks, now challenge these assumptions. In this paper, we investigate an approach that computes salient points of images incrementally, i.e., the salient-point detector can operate on a coarsely quantized input image representation and successively refine the result (the derived salient points) as the image precision is refined by the sensor. This has the advantage that both the image sensing and the salient-point detection can be terminated at any input image precision (e.g., a bound set by the sensing equipment, by the available computation, or by the salient-point accuracy required by the application), with the salient points obtained under that precision readily available. We focus on the popular detector proposed by Harris and Stephens and demonstrate how such an approach can operate when the image samples are refined in a bitwise manner, i.e., the image bitplanes are received one by one from the image sensor. We estimate the energy required for image sensing as well as the computation required for salient-point detection based on stochastic source modeling. The computation and energy required by the proposed incremental refinement approach are compared against a conventional salient-point detector realization that operates directly on each source precision and cannot refine its result. Our experiments demonstrate the feasibility of incremental approaches for salient-point detection in various classes of natural images. In addition, a first comparison between the results obtained by the intermediate detectors is presented, together with a novel application for adaptive low-energy image sensing based on points of saliency.
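To make the detector under discussion concrete, the following is a minimal sketch of the Harris-Stephens response computed from the structure tensor (autocorrelation matrix) of image derivatives, together with a naive bitplane-by-bitplane loop that simply re-runs the detector as bitplanes accumulate. The loop is an assumption for illustration only and is not the paper's incremental refinement scheme; the function names and the NumPy/SciPy setup are likewise assumptions.

```python
# Minimal sketch of the Harris-Stephens response, assuming a NumPy/SciPy setup.
# The incremental aspect is only mimicked by re-running the detector as image
# bitplanes accumulate; this is NOT the authors' refinement scheme.
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def harris_response(img, sigma=1.0, k=0.04):
    """Harris response R = det(M) - k * trace(M)^2 from the structure tensor M."""
    Ix = sobel(img.astype(np.float64), axis=1)
    Iy = sobel(img.astype(np.float64), axis=0)
    # Smoothed products of derivatives form the local autocorrelation matrix.
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    det = Ixx * Iyy - Ixy ** 2
    trace = Ixx + Iyy
    return det - k * trace ** 2

def refine_bitplanes(img_u8, max_bits=8):
    """Yield salient-point maps as bitplanes arrive MSB-first (coarse to fine)."""
    acc = np.zeros_like(img_u8, dtype=np.uint8)
    for b in range(7, 7 - max_bits, -1):
        acc |= (img_u8 & (1 << b))          # add the next most significant bitplane
        R = harris_response(acc)
        yield R > 0.01 * R.max()            # simple thresholding of the response
```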
Prompting Visual-Language Models for Dynamic Facial Expression Recognition
This paper presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour related to the classes (facial expressions) we are interested in recognising; those descriptions are generated using large language models, like ChatGPT. This, in contrast to works that use only the class names, more accurately captures the relationship between them. Alongside the textual description, we introduce a learnable token which helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that our DFER-CLIP achieves state-of-the-art results compared with current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks. Code is publicly available at https://github.com/zengqunzhao/DFER-CLIP
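To illustrate the prompting idea in isolation, here is a minimal sketch, assuming the Hugging Face transformers CLIP implementation and hypothetical LLM-generated expression descriptions, that scores mean-pooled frame embeddings against the text embeddings of those descriptions. It omits the temporal Transformer encoders and the learnable context token, and it is not the code from the repository above.

```python
# Illustrative sketch (not the authors' code): zero-shot scoring of a video clip
# against description prompts for each expression class, assuming the Hugging Face
# `transformers` CLIP implementation; descriptions below are hypothetical examples.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical descriptions of each expression class (e.g., produced by an LLM).
descriptions = {
    "happiness": "a face with raised lip corners and crinkled eyes",
    "sadness": "a face with lowered lip corners and drooping eyelids",
    "anger": "a face with lowered brows and tightened lips",
}

@torch.no_grad()
def score_clip(frames, descriptions):
    """frames: list of PIL images sampled from the video clip."""
    text_in = processor(text=list(descriptions.values()),
                        return_tensors="pt", padding=True)
    img_in = processor(images=frames, return_tensors="pt")
    text_emb = model.get_text_features(**text_in)
    img_emb = model.get_image_features(**img_in).mean(dim=0, keepdim=True)  # mean-pool frames
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ text_emb.T).squeeze(0)
    return dict(zip(descriptions.keys(), sims.tolist()))
```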
Linear Maximum Margin Classifier for Learning from Uncertain Data
In this paper, we propose a maximum margin classifier that deals with uncertainty in data input. More specifically, we reformulate the SVM framework such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix, the latter modeling the uncertainty. We address the classification problem and define a cost function that is the expected value of the classical SVM cost when data samples are drawn from the multi-dimensional Gaussian distributions that form the set of training examples. Our formulation approximates the classical SVM formulation when the training examples are isotropic Gaussians with variance tending to zero. We arrive at a convex optimization problem, which we solve efficiently in the primal form using a stochastic gradient descent approach. The resulting classifier, which we name SVM with Gaussian Sample Uncertainty (SVM-GSU), is tested on synthetic data and five publicly available and popular datasets, namely the MNIST, WDBC, DEAP, TV News Channel Commercial Detection, and TRECVID MED datasets. Experimental results verify the effectiveness of the proposed method.
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence. (c) 2017 IEEE. DOI: 10.1109/TPAMI.2017.2772235. Author's accepted version. The final publication is available at http://ieeexplore.ieee.org/document/8103808
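As a rough illustration of the objective, the sketch below trains a linear classifier with stochastic gradient descent on a Monte Carlo estimate of the expected hinge loss when each training example is a Gaussian with a given mean and covariance. The paper derives a closed-form expectation and exact primal updates; the sampling-based approximation, function name, and hyperparameters here are assumptions for illustration only.

```python
# Minimal Monte Carlo sketch of a linear max-margin classifier trained on the
# *expected* hinge loss when each example is a Gaussian (mean, covariance).
# This only approximates the objective; the paper's closed-form expectation and
# exact primal SGD updates are not reproduced here.
import numpy as np

def train_svm_gsu_mc(means, covs, labels, lam=1e-2, lr=1e-2,
                     epochs=50, n_draws=16, seed=0):
    """means: (n, d), covs: (n, d, d), labels: (n,) in {-1, +1}."""
    rng = np.random.default_rng(seed)
    d = means.shape[1]
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(labels)):
            # Draw samples from the i-th example's Gaussian to estimate E[hinge loss].
            xs = rng.multivariate_normal(means[i], covs[i], size=n_draws)
            margins = labels[i] * (xs @ w + b)
            mask = margins < 1.0                       # samples inside the margin
            grad_w = lam * w - labels[i] * xs[mask].sum(axis=0) / n_draws
            grad_b = -labels[i] * mask.sum() / n_draws
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```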
Universal Foreground Segmentation Based on Deep Feature Fusion Network for Multi-Scene Videos
Foreground/background (fg/bg) classification is an important first step for several video analysis tasks such as people counting, activity recognition, and anomaly detection. As is the case for several other Computer Vision problems, the advent of deep Convolutional Neural Network (CNN) methods has led to major improvements in this field. However, despite their success, CNN-based methods have difficulties in coping with multi-scene videos, where the scene changes multiple times along the time sequence. In this paper, we propose a deep-feature-fusion-network-based foreground segmentation method (DFFnetSeg), which is robust both to scene changes and to unseen scenes compared with competitive state-of-the-art methods. At the heart of DFFnetSeg lies a fusion network that takes as input deep features extracted from the current frame, a previous frame, and a reference frame, and produces as output a segmentation mask of background and foreground objects. We show the advantages of using a fusion network and the three-frame group in dealing with the unseen-scene and bootstrap challenges. In addition, we show that a simple reference-frame updating strategy enables DFFnetSeg to remain robust to sudden scene changes inside video sequences, and we present a motion-map-based post-processing method which further reduces false positives. Experimental results on the test dataset generated from CDnet2014 and Lasiesta demonstrate the advantages of the DFFnetSeg method.
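To make the fusion idea concrete, below is an illustrative PyTorch sketch of a small fusion head that concatenates deep features from the current, previous, and reference frames and predicts a per-pixel foreground probability. The layer choices, channel sizes, and class name are assumptions and do not reproduce the paper's DFFnetSeg architecture.

```python
# Illustrative PyTorch sketch (not the paper's architecture): a fusion head that
# takes deep features of a current, previous, and reference frame and predicts a
# per-pixel foreground/background mask; channel sizes are assumptions.
import torch
import torch.nn as nn

class FusionSegHead(nn.Module):
    def __init__(self, feat_channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * feat_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1),   # foreground logit per pixel
        )

    def forward(self, feat_cur, feat_prev, feat_ref):
        # Concatenate the three feature maps along the channel dimension and fuse.
        x = torch.cat([feat_cur, feat_prev, feat_ref], dim=1)
        logits = self.fuse(x)
        return torch.sigmoid(logits)            # probability of foreground

# Usage sketch: features would come from a shared CNN backbone applied to each frame.
# head = FusionSegHead()
# mask = head(f_cur, f_prev, f_ref)             # each f_*: (B, 256, H, W)
```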