128 research outputs found
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), \aka natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.Comment: 29 pages, 32 figures, 9 table
Active Learning for Regression with Aggregated Outputs
Due to the privacy protection or the difficulty of data collection, we cannot
observe individual outputs for each instance, but we can observe aggregated
outputs that are summed over multiple instances in a set in some real-world
applications. To reduce the labeling cost for training regression models for
such aggregated data, we propose an active learning method that sequentially
selects sets to be labeled to improve the predictive performance with fewer
labeled sets. For the selection measurement, the proposed method uses the
mutual information, which quantifies the reduction of the uncertainty of the
model parameters by observing the aggregated output. With Bayesian linear basis
functions for modeling outputs given an input, which include approximated
Gaussian processes and neural networks, we can efficiently calculate the mutual
information in a closed form. With the experiments using various datasets, we
demonstrate that the proposed method achieves better predictive performance
with fewer labeled sets than existing methods
Deep Learning for Face Anti-Spoofing: A Survey
Face anti-spoofing (FAS) has lately attracted increasing attention due to its
vital role in securing face recognition systems from presentation attacks
(PAs). As more and more realistic PAs with novel types spring up, traditional
FAS methods based on handcrafted features become unreliable due to their
limited representation capacity. With the emergence of large-scale academic
datasets in the recent decade, deep learning based FAS achieves remarkable
performance and dominates this area. However, existing reviews in this field
mainly focus on the handcrafted features, which are outdated and uninspiring
for the progress of FAS community. In this paper, to stimulate future research,
we present the first comprehensive review of recent advances in deep learning
based FAS. It covers several novel and insightful components: 1) besides
supervision with binary label (e.g., '0' for bonafide vs. '1' for PAs), we also
investigate recent methods with pixel-wise supervision (e.g., pseudo depth
map); 2) in addition to traditional intra-dataset evaluation, we collect and
analyze the latest methods specially designed for domain generalization and
open-set FAS; and 3) besides commercial RGB camera, we summarize the deep
learning applications under multi-modal (e.g., depth and infrared) or
specialized (e.g., light field and flash) sensors. We conclude this survey by
emphasizing current open issues and highlighting potential prospects.Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
As the most fundamental tasks of computer vision, object detection and
segmentation have made tremendous progress in the deep learning era. Due to the
expensive manual labeling, the annotated categories in existing datasets are
often small-scale and pre-defined, i.e., state-of-the-art detectors and
segmentors fail to generalize beyond the closed-vocabulary. To resolve this
limitation, the last few years have witnessed increasing attention toward
Open-Vocabulary Detection (OVD) and Segmentation (OVS). In this survey, we
provide a comprehensive review on the past and recent development of OVD and
OVS. To this end, we develop a taxonomy according to the type of task and
methodology. We find that the permission and usage of weak supervision signals
can well discriminate different methodologies, including: visual-semantic space
mapping, novel visual feature synthesis, region-aware training,
pseudo-labeling, knowledge distillation-based, and transfer learning-based. The
proposed taxonomy is universal across different tasks, covering object
detection, semantic/instance/panoptic segmentation, 3D scene and video
understanding. In each category, its main principles, key challenges,
development routes, strengths, and weaknesses are thoroughly discussed. In
addition, we benchmark each task along with the vital components of each
method. Finally, several promising directions are provided to stimulate future
research
Recommended from our members
Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing
Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where they tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in linguistic knowledge captured in their parameters. The objective of this thesis is to address these challenges focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs.
To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge leveraging native speakers’ intuitions about verb meaning to support development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting their linguistic capacity.ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909
- …