11,226 research outputs found
Survey on video anomaly detection in dynamic scenes with moving cameras
The increasing popularity of compact and inexpensive cameras, e.g., dash
cameras, body cameras, and cameras equipped on robots, has sparked a growing
interest in detecting anomalies within dynamic scenes recorded by moving
cameras. However, existing reviews primarily concentrate on Video Anomaly
Detection (VAD) methods assuming static cameras. The VAD literature with moving
cameras remains fragmented, lacking comprehensive reviews to date. To address
this gap, we endeavor to present the first comprehensive survey on Moving
Camera Video Anomaly Detection (MC-VAD). We delve into the research papers
related to MC-VAD, critically assessing their limitations and highlighting
associated challenges. Our exploration encompasses three application domains:
security, urban transportation, and marine environments, which in turn cover
six specific tasks. We compile an extensive list of 25 publicly available
datasets spanning four distinct environments: underwater, water surface,
ground, and aerial. We summarize the types of anomalies these datasets
correspond to or contain, and present five main categories of approaches for
detecting such anomalies. Lastly, we identify future research directions and
discuss novel contributions that could advance the field of MC-VAD. With this
survey, we aim to offer a valuable reference for researchers and practitioners
striving to develop and advance state-of-the-art MC-VAD methods.
Comment: Under review
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework that highlights the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task,
emphasizing the hypotheses assumed and, thus, the constraints imposed on the
type of video that each technique is able to address. Making these hypotheses
and constraints explicit renders the framework particularly useful for
selecting a method for a given application. Another advantage of the proposed
organization is that it allows categorizing the newest approaches seamlessly
alongside traditional ones, while providing an insightful perspective on the
evolution of the action recognition task up to now. That perspective is the
basis for the discussion at the end of the paper, where we also present the
main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables
Vision Language Models in Autonomous Driving and Intelligent Transportation Systems
The applications of Vision-Language Models (VLMs) in the fields of Autonomous
Driving (AD) and Intelligent Transportation Systems (ITS) have attracted
widespread attention due to their outstanding performance and the ability to
leverage Large Language Models (LLMs). By integrating language data, vehicles
and transportation systems can deeply understand real-world
environments, improving driving safety and efficiency. In this work, we present
a comprehensive survey of the advances in language models in this domain,
encompassing current models and datasets. Additionally, we explore the
potential applications and emerging research directions. Finally, we thoroughly
discuss the challenges and research gaps. The paper aims to provide researchers
with an overview of current work and future trends of VLMs in AD and ITS.
Human-Object Interaction Prediction in Videos through Gaze Following
Understanding the human-object interactions (HOIs) from a video is essential
to fully comprehend a visual scene. This line of research has been addressed by
detecting HOIs from images and lately from videos. However, the video-based HOI
anticipation task in the third-person view remains understudied. In this paper,
we design a framework to detect current HOIs and anticipate future HOIs in
videos. We propose to leverage human gaze information since people often fixate
on an object before interacting with it. These gaze features together with the
scene contexts and the visual appearances of human-object pairs are fused
through a spatio-temporal transformer. To evaluate the model in the HOI
anticipation task in a multi-person scenario, we propose a set of person-wise
multi-label metrics. Our model is trained and validated on the VidHOI dataset,
which contains videos capturing daily life and is currently the largest video
HOI dataset. Experimental results in the HOI detection task show that our
approach outperforms the baseline by a large relative margin of 36.3%. Moreover,
we conduct an extensive ablation study to demonstrate the effectiveness of our
modifications and extensions to the spatio-temporal transformer. Our code is
publicly available at https://github.com/nizhf/hoi-prediction-gaze-transformer.
Comment: Accepted by CVIU https://doi.org/10.1016/j.cviu.2023.10374
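As a rough sketch of the fusion step described above (the feature dimensions, token layout, and mean-pooling readout are illustrative assumptions, not the authors' implementation), per-frame gaze, scene-context, and human-object appearance features can be concatenated as token sequences and encoded jointly:

    # Minimal sketch of fusing gaze, scene-context, and human-object appearance
    # features with a transformer encoder (PyTorch); all sizes are assumptions.
    import torch
    import torch.nn as nn

    class SpatioTemporalFusion(nn.Module):
        def __init__(self, feat_dim=256, n_heads=4, n_layers=2, n_classes=50):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=feat_dim, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.classifier = nn.Linear(feat_dim, n_classes)

        def forward(self, gaze, context, appearance):
            # each input: (batch, time, feat_dim); stack along the token axis
            tokens = torch.cat([gaze, context, appearance], dim=1)
            fused = self.encoder(tokens)
            # pool over tokens; one logit per interaction class (multi-label)
            return self.classifier(fused.mean(dim=1))

    model = SpatioTemporalFusion()
    gaze = torch.randn(2, 8, 256)       # 2 clips, 8 frames, 256-d features
    ctx, app = torch.randn(2, 8, 256), torch.randn(2, 8, 256)
    logits = model(gaze, ctx, app)      # (2, 50); sigmoid gives HOI scores

A sigmoid over the logits yields one score per interaction class, matching the multi-label setting the abstract's person-wise metrics evaluate.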
An Outlook into the Future of Egocentric Vision
What will the future be? We wonder! In this survey, we explore the gap
between current research in egocentric vision and the ever-anticipated future,
where wearable computing, with outward-facing cameras and digital overlays, is
expected to be integrated into our everyday lives. To understand this gap, the
article starts by envisaging the future through character-based stories,
showcasing through examples the limitations of current technology. We then
provide a mapping between this future and previously defined research tasks.
For each task, we survey its seminal works, current state-of-the-art
methodologies and available datasets, then reflect on shortcomings that limit
its applicability to future research. Note that this survey focuses on software
models for egocentric vision, independent of any specific hardware. The paper
concludes with recommendations for areas of immediate exploration so as to
unlock our path to the future always-on, personalised and life-enhancing
egocentric vision.
Comment: We invite comments, suggestions and corrections here:
https://openreview.net/forum?id=V3974SUk1
Sensing, interpreting, and anticipating human social behaviour in the real world
Low-level nonverbal social signals like glances, utterances, facial expressions and body language are central to human communicative situations and have been shown to be connected to important high-level constructs, such as emotions, turn-taking, rapport, or leadership. A prerequisite for the creation of social machines that are able to support humans in, e.g., education, psychotherapy, or human resources is the ability to automatically sense, interpret, and anticipate human nonverbal behaviour. While promising results have been shown in controlled settings, automatically analysing unconstrained situations, e.g., in daily-life settings, remains challenging. Furthermore, anticipation of nonverbal behaviour in social situations is still largely unexplored. The goal of this thesis is to move closer to the vision of social machines in the real world. It makes fundamental contributions along the three dimensions of sensing, interpreting and anticipating nonverbal behaviour in social interactions.

First, robust recognition of low-level nonverbal behaviour lays the groundwork for all further analysis steps. Advancing human visual behaviour sensing is especially relevant as the current state of the art is still not satisfactory in many daily-life situations. While many social interactions take place in groups, current methods for unsupervised eye contact detection can only handle dyadic interactions. We propose a novel unsupervised method for multi-person eye contact detection by exploiting the connection between gaze and speaking turns. Furthermore, we make use of mobile device engagement to address the problem of calibration drift that occurs in daily-life usage of mobile eye trackers.

Second, we improve the interpretation of social signals in terms of higher-level social behaviours. In particular, we propose the first dataset and method for emotion recognition from bodily expressions of freely moving, unaugmented dyads. Furthermore, we are the first to study low rapport detection in group interactions, as well as investigating a cross-dataset evaluation setting for the emergent leadership detection task.

Third, human visual behaviour is special because it functions as a social signal and also determines what a person is seeing at a given moment in time. Being able to anticipate human gaze opens up the possibility for machines to more seamlessly share attention with humans, or to intervene in a timely manner if humans are about to overlook important aspects of the environment. We are the first to propose methods for the anticipation of eye contact in dyadic conversations, as well as in the context of mobile device interactions during daily life, thereby paving the way for interfaces that are able to proactively intervene and support interacting humans.

As nonverbal signals, gaze, facial expressions, body language, and prosody play a central role in human communication. Numerous studies have linked them to important constructs such as emotions, turn-taking, leadership, and the quality of the relationship between two people. For machines to effectively support humans during their daily social lives, automatic methods for sensing, interpreting, and anticipating nonverbal behaviour are necessary. Although prior research has achieved encouraging results in controlled studies, the automatic analysis of nonverbal behaviour in less controlled situations remains a challenge. Moreover, the anticipation of nonverbal behaviour in social situations has hardly been investigated. The goal of this thesis is to bring the vision of automatically understanding social situations a step closer to reality.

This work makes important contributions to the automatic sensing of human gaze behaviour in everyday situations. Although many social interactions take place in groups, unsupervised methods for eye contact detection have so far existed only for dyadic interactions. We present a new approach to eye contact detection in groups that works without manual annotations by exploiting the statistical connection between gaze and speaking behaviour. Daily activities are challenging for mobile eye tracking devices, since shifts of these devices can degrade their calibration. In this work, we use user engagement with mobile devices to correct for the effect of such shifts. Beyond sensing, this work also improves the interpretation of social signals. We publish the first dataset and the first method for emotion recognition in dyadic interactions without the use of specialised equipment. Furthermore, we present the first study on the automatic detection of low rapport in group interactions, and conduct the first cross-dataset evaluation for the detection of emergent leadership. Finally, we present the first approaches for anticipating gaze behaviour in social interactions. Gaze behaviour has the special property of serving both as a social signal and as the means of directing visual perception. The ability to anticipate gaze behaviour therefore offers machines the opportunity to integrate more seamlessly into social interactions and to warn people when they are about to overlook important aspects of their environment. We present methods for anticipating gaze behaviour in the context of interaction with mobile devices during daily activities, as well as during dyadic interactions via video telephony.
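The eye contact method above rests on the link between gaze and speaking turns. As a rough illustration of that idea only (the listener-looks-at-speaker heuristic, the clustering choice, and all names below are assumptions, not the thesis implementation), one could cluster a listener's gaze directions and label each cluster with the person whose speaking turns co-occur with it:

    # Illustrative sketch: weak labels for multi-person eye contact detection.
    # Assumption (not from the thesis): listeners tend to look at the current
    # speaker, so a gaze cluster that co-occurs with person p speaking is
    # interpreted as "looking at p".
    import numpy as np
    from sklearn.cluster import KMeans

    def label_gaze_clusters(gaze_dirs, speaker_ids, n_people):
        """gaze_dirs: (T, 2) array of a listener's gaze angles per frame;
        speaker_ids: (T,) int array of who is speaking at each frame."""
        clusters = KMeans(n_clusters=n_people, n_init=10).fit_predict(gaze_dirs)
        mapping = {}
        for c in range(n_people):
            # assign cluster c to the person who speaks most often while
            # the listener's gaze falls into that cluster
            counts = np.bincount(speaker_ids[clusters == c], minlength=n_people)
            mapping[c] = int(counts.argmax())
        return clusters, mapping  # frame t: eye contact with mapping[clusters[t]]

No manual annotations are needed in this scheme: the speaking turns, which are comparatively easy to detect from audio, act as the supervisory signal.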