DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization
Since American Sign Language (ASL) has no standard written form, Deaf signers
frequently share videos in order to communicate in their native language.
However, since both hands and face convey critical linguistic information in
signed languages, sign language videos cannot preserve signer privacy. While
signers have expressed interest, for a variety of applications, in sign
language video anonymization that would effectively preserve linguistic
content, attempts to develop such technology have had limited success, given
the complexity of hand movements and facial expressions. Existing approaches
rely predominantly on precise pose estimations of the signer in video footage
and often require sign language video datasets for training. These requirements
prevent them from processing videos 'in the wild,' in part because of the
limited diversity present in current sign language video datasets. To address
these limitations, our research introduces DiffSLVA, a novel methodology that
utilizes pre-trained large-scale diffusion models for zero-shot text-guided
sign language video anonymization. We incorporate ControlNet, which leverages
low-level image features such as HED (Holistically-Nested Edge Detection)
edges, to circumvent the need for pose estimation. Additionally, we develop a
specialized module dedicated to capturing facial expressions, which are
critical for conveying essential linguistic information in signed languages. We
then combine the above methods to achieve anonymization that better preserves
the essential linguistic content of the original signer. This innovative
methodology makes possible, for the first time, sign language video
anonymization that could be used for real-world applications, which would offer
significant benefits to the Deaf and Hard-of-Hearing communities. We
demonstrate the effectiveness of our approach with a series of signer
anonymization experiments.
Comment: Project webpage: https://github.com/Jeffery9707/DiffSLV
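The key idea of conditioning the diffusion model on low-level edge features rather than pose can be illustrated with a minimal sketch. Here a plain Sobel filter stands in for the HED edge detector the paper actually feeds to ControlNet; both are low-level structural cues extracted directly from pixels, with no pose estimation of the signer.

```python
def edge_map(frame):
    """Return a normalized Sobel gradient-magnitude map for a 2D
    grayscale frame (a simple stand-in for HED edges: low-level
    structure that can guide ControlNet without pose estimation)."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    h, w = len(frame), len(frame[0])

    def px(i, j):  # replicate border pixels at the frame edges
        return frame[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    mag = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            gx = sum(kx[a][b] * px(i + a - 1, j + b - 1)
                     for a in range(3) for b in range(3))
            gy = sum(ky[a][b] * px(i + a - 1, j + b - 1)
                     for a in range(3) for b in range(3))
            mag[i][j] = (gx * gx + gy * gy) ** 0.5
    peak = max(max(row) for row in mag)
    if peak > 0:
        mag = [[v / peak for v in row] for row in mag]
    return mag

# A frame split into dark and bright halves has its strongest response
# exactly at the boundary between them.
frame = [[0] * 4 + [255] * 4 for _ in range(8)]
edges = edge_map(frame)
```

In the actual pipeline, a map like this (produced per video frame by HED) is the conditioning input that lets the pre-trained diffusion model repaint the signer's appearance while keeping hand and face structure intact.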
Exploiting Out-of-band Motion Sensor Data to De-anonymize Virtual Reality Users
Virtual Reality (VR) is an exciting new consumer technology which offers an
immersive audio-visual experience to users through which they can navigate and
interact with a digitally represented 3D space (i.e., a virtual world) using a
headset device. By (visually) transporting users from the real or physical
world to exciting and realistic virtual spaces, VR systems can enable
true-to-life and more interactive versions of traditional applications such as
gaming, remote conferencing, social networking and virtual tourism. However, as
with any new consumer technology, VR applications also present significant
user-privacy challenges. This paper studies a new type of privacy attack
targeting VR users by connecting their activities visible in the virtual world
(enabled by some VR application/service) to their physical state sensed in the
real world. Specifically, this paper analyzes the feasibility of carrying out a
de-anonymization or identification attack on VR users by correlating visually
observed movements of users' avatars in the virtual world with some auxiliary
data (e.g., motion sensor data from mobile/wearable devices held by users)
representing their context/state in the physical world. To enable this attack,
this paper proposes a novel framework which first employs a learning-based
activity classification approach to translate the disparate visual movement
data and motion sensor data into an activity-vector to ease comparison,
followed by a filtering and identity ranking phase outputting an ordered list
of potential identities corresponding to the target visual movement data.
Extensive empirical evaluation of the proposed framework, under a comprehensive
set of experimental settings, demonstrates the feasibility of such a
de-anonymization attack.
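The framework's final filtering-and-ranking phase can be sketched as follows. The sketch assumes both the avatar's visually observed movements and each candidate's motion-sensor stream have already been classified into a common activity-vector form (e.g. per-activity frequencies), which is precisely what makes the two disparate modalities comparable; the similarity measure and toy data are illustrative, not the paper's exact design.

```python
import math

def rank_identities(visual_vec, sensor_vecs):
    """Rank candidate identities against a target visual activity
    vector by cosine similarity, best match first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    scores = {ident: cosine(visual_vec, v) for ident, v in sensor_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: the avatar's activity profile (shares of time spent
# walking / waving / idle) matches user "bob" most closely.
observed = [0.7, 0.1, 0.2]
candidates = {
    "alice": [0.1, 0.8, 0.1],
    "bob":   [0.6, 0.2, 0.2],
    "carol": [0.3, 0.3, 0.4],
}
ranking = rank_identities(observed, candidates)
```

The ordered list returned here corresponds to the attack's output: the adversary inspects the top-ranked identities as the most likely matches for the observed avatar.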
Privacy-Respecting Smart Video Surveillance Based on Usage Control Enforcement
This research introduces a conceptual framework for enforcing privacy-related restrictions in smart video surveillance systems based on danger levels and the types of incidents to be handled. It increases the selectivity of surveillance by restricting data processing to individuals associated with incidents under investigation. Constraints are enforced by usage control, which is instantiated for video surveillance for the first time and enables tailoring such systems to comply with data protection law.
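A minimal usage-control check in the spirit of this framework might look like the sketch below: processing is permitted only for individuals associated with an incident under active investigation whose danger level meets a configured threshold. All field names and the policy shape are illustrative assumptions, not the paper's actual policy model.

```python
def may_process(subject_id, incident, danger_threshold=2):
    """Decide whether video data on a subject may be processed under a
    simple usage-control policy: the incident must be under active
    investigation, sufficiently dangerous, and the subject must be
    associated with it."""
    return (
        incident["under_investigation"]
        and incident["danger_level"] >= danger_threshold
        and subject_id in incident["associated_subjects"]
    )

# Illustrative incident record (hypothetical schema).
incident = {
    "under_investigation": True,
    "danger_level": 3,
    "associated_subjects": {"person-17", "person-42"},
}
```

In a real deployment the enforcement point would sit between the camera feed and any analytics module, denying decryption or analysis of footage for subjects the policy does not cover.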
Recent Advances in Digital Image and Video Forensics, Anti-forensics and Counter Anti-forensics
Image and video forensics have recently gained increasing attention due to
the proliferation of manipulated images and videos, especially on social media
platforms, such as Twitter and Instagram, which spread disinformation and fake
news. This survey explores image and video identification and forgery detection
covering both manipulated digital media and generative media. However, media
forgery detection techniques are susceptible to anti-forensics; on the other
hand, such anti-forensics techniques can themselves be detected. We therefore
further cover both anti-forensics and counter anti-forensics techniques in
image and video. Finally, we conclude this survey by highlighting some open
problems in this domain.
Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting
In this paper, we introduce a neural rendering pipeline for transferring the
facial expressions, head pose, and body movements of one person in a source
video to another in a target video. We apply our method to the challenging case
of Sign Language videos: given a source video of a sign language user, we can
faithfully transfer the performed manual (e.g., handshape, palm orientation,
movement, location) and non-manual (e.g., eye gaze, facial expressions, mouth
patterns, head, and body movements) signs to a target video in a
photo-realistic manner. Our method can be used for Sign Language Anonymization,
Sign Language Production (synthesis module), as well as for reenacting other
types of full body activities (dancing, acting performance, exercising, etc.).
We conduct detailed qualitative and quantitative evaluations and comparisons,
which demonstrate the particularly promising and realistic results that we
obtain and the advantages of our method over existing approaches.
Comment: Accepted at AI4CC Workshop at CVPR 202
Modeling temporal visual salience for human action recognition enabled visual anonymity preservation
This paper proposes a novel approach for visually anonymizing video clips while retaining the ability to perform machine-based analysis of the clips, such as human action recognition. Visual anonymization is achieved by a novel method for generating the anonymization silhouette through modeling frame-wise temporal visual salience. These temporal salience-based silhouettes are then analysed by extracting the proposed histograms of gradients in salience (HOG-S) to learn the action representation in the visually anonymized domain. Since the anonymization maps are based on gray-scale temporal salience maps, only the moving body parts involved in the action receive large gray values, forming highly anonymized silhouettes; this yields the highest mean anonymity score (MAS), the fewest identifiable visual appearance attributes, and high human-perceived utility in action recognition. In terms of machine-based human action recognition, the proposed HOG-S features achieve the highest accuracy rate in the anonymized domain compared to existing anonymization methods. Overall, the proposed holistic human action recognition method, i.e., temporal salience modeling followed by HOG-S feature extraction, achieves the best human action recognition accuracy rates on the DHA, KTH, UIUC1, UCF Sports and HMDB51 datasets, with improvements of 3%, 1.6%, 0.8%, 1.3% and 16.7%, respectively. The proposed method outperforms both feature-based and deep learning-based existing approaches.
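The two core steps, a temporal salience map from inter-frame differences and a salience-gradient histogram, can be sketched in a few lines. This is a deliberately simplified reading: the paper's salience model is more elaborate, and its HOG-S is computed cell-wise rather than as the single global histogram used here.

```python
import math

def temporal_salience(prev, curr):
    """Per-pixel temporal salience as absolute inter-frame difference:
    only moving body parts receive large (bright) values, which is what
    makes the resulting silhouette anonymized."""
    return [[abs(c - p) for p, c in zip(pr, cr)] for pr, cr in zip(prev, curr)]

def hog_s(sal, bins=8):
    """Simplified histogram of gradients in salience (HOG-S): gradient
    orientations of the salience map, weighted by gradient magnitude,
    pooled into one normalized global histogram."""
    h, w = len(sal), len(sal[0])
    hist = [0.0] * bins
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = sal[i][j + 1] - sal[i][j - 1]
            gy = sal[i + 1][j] - sal[i - 1][j]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            ang = math.atan2(gy, gx) % (2 * math.pi)
            hist[min(int(ang / (2 * math.pi) * bins), bins - 1)] += mag
    total = sum(hist)
    return [v / total for v in hist] if total else hist

# A small bright blob moving one pixel to the right produces salience
# at both its old and new positions and a nonzero HOG-S feature.
prev = [[0] * 6 for _ in range(6)]
curr = [[0] * 6 for _ in range(6)]
prev[2][2] = 255
curr[2][3] = 255
sal = temporal_salience(prev, curr)
feat = hog_s(sal)
```

The normalized histogram `feat` is the kind of motion-shaped feature a classifier would then consume in the anonymized domain.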
Deep Learning for Crowd Anomaly Detection
Today, public areas across the globe are monitored by an increasing number of surveillance cameras. This widespread usage has produced an ever-growing volume of data that cannot realistically be examined in real time. Therefore, efforts to understand crowd dynamics have brought attention to automatic systems for the detection of anomalies in crowds. This thesis explores the methods used across the literature for this purpose, with a focus on those fusing dense optical flow into a feature extraction stage for the crowd anomaly detection problem. To this end, five different deep learning architectures are trained using optical flow maps estimated by three deep learning-based techniques. More specifically, a 2D convolutional network, a 3D convolutional network, an LSTM-based convolutional recurrent network, a pre-trained variant of the latter, and a ConvLSTM-based autoencoder are trained using both regular frames and optical flow maps estimated by LiteFlowNet3, RAFT, and GMA on the UCSD Pedestrian 1 dataset. The experimental results have shown that, while prone to overfitting, the use of optical flow maps may improve the performance of supervised spatio-temporal architectures.
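One common way such a system turns a dense optical-flow field into an anomaly decision can be sketched as below. The flow field is taken as a plain input here (a real system would estimate it with LiteFlowNet3, RAFT, or GMA), and the mean-magnitude feature plus mean-plus-k-sigma threshold are illustrative simplifications of the learned models the thesis evaluates.

```python
def flow_magnitude(flow):
    """Mean motion magnitude of a dense optical-flow field, given as
    rows of (u, v) per-pixel displacement vectors."""
    mags = [(u * u + v * v) ** 0.5 for row in flow for (u, v) in row]
    return sum(mags) / len(mags)

def anomaly_threshold(normal_scores, k=3.0):
    """Calibrate a mean + k*stddev threshold on scores from normal
    training frames, mimicking the usual setup on UCSD Pedestrian 1,
    where only non-anomalous frames are seen at training time."""
    n = len(normal_scores)
    mean = sum(normal_scores) / n
    var = sum((s - mean) ** 2 for s in normal_scores) / n
    return mean + k * var ** 0.5

# Normal frames show gentle pedestrian motion; a sudden fast motion
# (e.g. a vehicle on the walkway) spikes the mean flow magnitude.
normal = [1.0, 1.1, 0.9, 1.0, 1.05]
thr = anomaly_threshold(normal)
calm = flow_magnitude([[(0.8, 0.6)], [(0.6, 0.8)]])
burst = flow_magnitude([[(6.0, 8.0)], [(8.0, 6.0)]])
```

A frame is flagged anomalous when its score exceeds the calibrated threshold; the deep architectures in the thesis replace this hand-crafted score with learned spatio-temporal features.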
Multimodal Visual Sensing: Automated Estimation of Engagement
Many modern applications of artificial intelligence involve, to some extent, an understanding of human attention, activity, intention, and competence from multimodal visual data. Nonverbal behavioral cues detected using computer vision and machine learning methods include valuable information for understanding human behaviors, including attention and engagement.
The use of such automated methods in educational settings has a tremendous potential for good. Beneficial uses include classroom analytics to measure teaching quality and the development of interventions to improve teaching based on these analytics, as well as presentation analysis to help students deliver their messages persuasively and effectively.
This dissertation presents a general framework based on multimodal visual sensing to analyze engagement and related tasks from visual modalities.
While the majority of engagement literature in affective and social computing focuses on computer-based learning and educational games, we investigate automated engagement estimation in the classroom using different nonverbal behavioral cues and develop methods to extract attentional and emotional features. Furthermore, we validate the efficiency of the proposed approaches on real-world data collected from videotaped classes at universities and secondary schools. In addition to learning activities, we perform behavior analysis on students giving short scientific presentations using multimodal cues, including face, body, and voice features.
Besides engagement and presentation competence, we approach human behavior understanding from a broader perspective by studying the analysis of joint attention in a group of people, teachers' perception using egocentric camera view and mobile eye trackers, and automated anonymization of audiovisual data in classroom studies.
Educational analytics present valuable opportunities to improve learning and teaching. The work in this dissertation suggests a computational framework for estimating student engagement and presentation competence, together with supportive computer vision problems.
ASL video Corpora & Sign Bank: resources available through the American Sign Language Linguistic Research Project (ASLLRP)
The American Sign Language Linguistic Research Project (ASLLRP) provides Internet access to
high-quality ASL video data, generally including front and side views and a close-up of the face.
The manual and non-manual components of the signing have been linguistically annotated using
SignStream®. The recently expanded video corpora can be browsed and searched through the Data
Access Interface (DAI 2) we have designed; it is possible to carry out complex searches. The data
from our corpora can also be downloaded; annotations are available in an XML export format. We
have also developed the ASLLRP Sign Bank, which contains almost 6,000 entries for lexical
signs, each with a distinct English-based gloss, comprising a total of 41,830 examples of lexical signs (in
addition to about 300 gestures, over 1,000 fingerspelled signs, and 475 classifier examples). The Sign Bank is
likewise accessible and searchable on the Internet; it can also be accessed from within SignStream®
(software to facilitate linguistic annotation and analysis of visual language data) to make
annotations more accurate and efficient. Here we describe the available resources. These data have
been used for many types of research in linguistics and in computer-based sign language recognition
from video; examples of such research are provided in the latter part of this article.
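Downloaded annotations in the XML export format could be consumed along the lines of the sketch below. The element and attribute names here are purely illustrative; the real SignStream® export schema should be taken from the ASLLRP documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of an annotation export: one utterance with
# two glossed signs and their start/end frame numbers.
SAMPLE = """\
<utterance id="u1">
  <sign gloss="BOOK" start="120" end="168"/>
  <sign gloss="READ" start="170" end="231"/>
</utterance>
"""

def load_glosses(xml_text):
    """Extract (gloss, start_frame, end_frame) tuples from an
    annotation export: the kind of preprocessing a sign-recognition
    pipeline would run over downloaded corpus files."""
    root = ET.fromstring(xml_text)
    return [
        (s.get("gloss"), int(s.get("start")), int(s.get("end")))
        for s in root.iter("sign")
    ]

glosses = load_glosses(SAMPLE)
```

Pairing tuples like these with the corresponding video frames is the typical first step when using the corpora for computer-based sign language recognition.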