    DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization

    Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.Comment: Project webpage: https://github.com/Jeffery9707/DiffSLV

    Exploiting Out-of-band Motion Sensor Data to De-anonymize Virtual Reality Users

    Virtual Reality (VR) is an exciting new consumer technology which offers an immersive audio-visual experience to users through which they can navigate and interact with a digitally represented 3D space (i.e., a virtual world) using a headset device. By (visually) transporting users from the real or physical world to exciting and realistic virtual spaces, VR systems can enable true-to-life and more interactive versions of traditional applications such as gaming, remote conferencing, social networking and virtual tourism. However, as with any new consumer technology, VR applications also present significant user-privacy challenges. This paper studies a new type of privacy attack targeting VR users by connecting their activities visible in the virtual world (enabled by some VR application/service) to their physical state sensed in the real world. Specifically, this paper analyzes the feasibility of carrying out a de-anonymization or identification attack on VR users by correlating visually observed movements of users' avatars in the virtual world with some auxiliary data (e.g., motion sensor data from mobile/wearable devices held by users) representing their context/state in the physical world. To enable this attack, this paper proposes a novel framework which first employs a learning-based activity classification approach to translate the disparate visual movement data and motion sensor data into an activity-vector to ease comparison, followed by a filtering and identity ranking phase outputting an ordered list of potential identities corresponding to the target visual movement data. Extensive empirical evaluation of the proposed framework, under a comprehensive set of experimental settings, demonstrates the feasibility of such a de-anonymization attack.
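
    The filtering and identity ranking phase described above can be illustrated with a minimal sketch. It assumes that the avatar movement and each candidate's motion sensor data have already been classified into activity vectors of equal length, one label per time window. The activity labels, the helper names `match_fraction` and `rank_identities`, and the match-fraction score are hypothetical choices made for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Tuple


def match_fraction(a: List[str], b: List[str]) -> float:
    """Fraction of time windows in which two activity vectors agree."""
    if len(a) != len(b) or not a:
        raise ValueError("activity vectors must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rank_identities(
    target: List[str],
    candidates: Dict[str, List[str]],
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Filter out weak matches, then order the remaining identities by score.

    `target` is the activity vector derived from the observed avatar;
    `candidates` maps an identity to the activity vector derived from
    that person's sensor data (both are assumed inputs).
    """
    scored = [(name, match_fraction(target, vec)) for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= min_score]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

    Calling `rank_identities(target, candidates, min_score=0.5)` returns an ordered list of `(identity, score)` pairs, which mirrors the "ordered list of potential identities" mentioned in the abstract. A real system would likely use a more robust similarity measure than exact label agreement.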

    Privacy-Respecting Smart Video Surveillance Based on Usage Control Enforcement

    This research introduces a conceptual framework for enforcing privacy-related restrictions in smart video surveillance systems based on danger levels and incident types to be handled. It increases the selectivity of surveillance by restricting data processing to individuals associated with incidents under investigation. Constraints are enforced by usage control, which is instantiated for video surveillance for the first time and enables tailoring such systems to comply with data protection law.

    Recent Advances in Digital Image and Video Forensics, Anti-forensics and Counter Anti-forensics

    Image and video forensics have recently gained increasing attention due to the proliferation of manipulated images and videos, especially on social media platforms such as Twitter and Instagram, where they spread disinformation and fake news. This survey explores image and video identification and forgery detection, covering both manipulated digital media and generative media. However, media forgery detection techniques are susceptible to anti-forensics; on the other hand, such anti-forensics techniques can themselves be detected. We therefore further cover both anti-forensics and counter anti-forensics techniques for images and video. Finally, we conclude this survey by highlighting some open problems in this domain.

    Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting

    In this paper, we introduce a neural rendering pipeline for transferring the facial expressions, head pose, and body movements of one person in a source video to another in a target video. We apply our method to the challenging case of Sign Language videos: given a source video of a sign language user, we can faithfully transfer the performed manual (e.g., handshape, palm orientation, movement, location) and non-manual (e.g., eye gaze, facial expressions, mouth patterns, head, and body movements) signs to a target video in a photo-realistic manner. Our method can be used for Sign Language Anonymization, Sign Language Production (synthesis module), as well as for reenacting other types of full body activities (dancing, acting performance, exercising, etc.). We conduct detailed qualitative and quantitative evaluations and comparisons, which demonstrate the particularly promising and realistic results that we obtain and the advantages of our method over existing approaches. Comment: Accepted at AI4CC Workshop at CVPR 202

    Modeling temporal visual salience for human action recognition enabled visual anonymity preservation

    This paper proposes a novel approach for visually anonymizing video clips while retaining the ability to perform machine-based analysis of the clips, such as human action recognition. Visual anonymization is achieved with a novel method that generates an anonymization silhouette by modeling frame-wise temporal visual salience. These temporal salience-based silhouettes are then analysed by extracting the proposed histograms of gradients in salience (HOG-S) to learn the action representation in the visually anonymized domain. Since the anonymization maps are based on temporal salience maps represented in gray scale, only the moving body parts related to the motion of the action are represented with larger gray values, forming highly anonymized silhouettes. This results in the highest mean anonymity score (MAS), the least identifiable visual appearance attributes, and high human-perceived utility in action recognition. In terms of machine-based human action recognition, the proposed HOG-S features achieve the highest accuracy rate in the anonymized domain compared with existing anonymization methods. Overall, the proposed holistic human action recognition method, i.e., temporal salience modeling followed by HOG-S feature extraction, achieves the best human action recognition accuracy rates on the DHA, KTH, UIUC1, UCF Sports and HMDB51 datasets, with improvements of 3%, 1.6%, 0.8%, 1.3% and 16.7%, respectively. The proposed method outperforms both feature-based and deep learning-based existing approaches.
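
    The two stages named above, temporal salience modeling followed by a histogram of gradients computed on the salience map, can be sketched roughly as follows. This is an illustration under strong assumptions, not the authors' HOG-S: salience is approximated here by normalized absolute frame differences, and the descriptor is a single global, magnitude-weighted orientation histogram. The function names and the choice of 9 bins are hypothetical.

```python
import numpy as np


def temporal_salience(frames: np.ndarray) -> np.ndarray:
    """Approximate frame-wise temporal salience by frame differencing.

    frames: array of shape (T, H, W) holding grayscale frames.
    Returns an array of shape (T-1, H, W) scaled to [0, 1] per frame,
    so moving regions receive larger gray values.
    """
    diff = np.abs(np.diff(frames.astype(np.float64), axis=0))
    peak = diff.max(axis=(1, 2), keepdims=True)
    # Leave frames without any motion at zero instead of dividing by zero.
    return np.divide(diff, peak, out=np.zeros_like(diff), where=peak > 0)


def gradient_orientation_histogram(salience_map: np.ndarray, bins: int = 9) -> np.ndarray:
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    gy, gx = np.gradient(salience_map)  # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)  # fold angles into [0, pi]
    hist, _ = np.histogram(orientation, bins=bins, range=(0.0, np.pi), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

    Applying `gradient_orientation_histogram` to each map returned by `temporal_salience` yields one feature vector per frame pair, which could then be fed to any standard classifier. Block-wise histograms, as in conventional HOG, would preserve more spatial layout than this global version.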

    Deep Learning for Crowd Anomaly Detection

    Today, public areas across the globe are monitored by an increasing number of surveillance cameras. This widespread usage produces an ever-growing volume of data that cannot realistically be examined in real time. Efforts to understand crowd dynamics have therefore turned to automatic systems for detecting anomalies in crowds. This thesis explores the methods used across the literature for this purpose, with a focus on those that incorporate dense optical flow in a feature extraction stage for the crowd anomaly detection problem. To this end, five different deep learning architectures are trained using optical flow maps estimated by three deep learning-based techniques. More specifically, a 2D convolutional network, a 3D convolutional network, an LSTM-based convolutional recurrent network, a pre-trained variant of the latter, and a ConvLSTM-based autoencoder are trained using both regular frames and optical flow maps estimated by LiteFlowNet3, RAFT, and GMA on the UCSD Pedestrian 1 dataset. The experimental results show that, while prone to overfitting, the use of optical flow maps may improve the performance of supervised spatio-temporal architectures.

    Multimodal Visual Sensing: Automated Estimation of Engagement

    Many modern applications of artificial intelligence involve, to some extent, an understanding of human attention, activity, intention, and competence from multimodal visual data. Nonverbal behavioral cues detected using computer vision and machine learning methods include valuable information for understanding human behaviors, including attention and engagement. The use of such automated methods in educational settings has tremendous potential for good. Beneficial uses include classroom analytics to measure teaching quality and the development of interventions to improve teaching based on these analytics, as well as presentation analysis to help students deliver their messages persuasively and effectively. This dissertation presents a general framework based on multimodal visual sensing to analyze engagement and related tasks from visual modalities. While the majority of engagement literature in affective and social computing focuses on computer-based learning and educational games, we investigate automated engagement estimation in the classroom using different nonverbal behavioral cues and develop methods to extract attentional and emotional features. Furthermore, we validate the efficiency of the proposed approaches on real-world data collected from videotaped classes at university and secondary school. In addition to learning activities, we perform behavior analysis on students giving short scientific presentations using multimodal cues, including face, body, and voice features. Besides engagement and presentation competence, we approach human behavior understanding from a broader perspective by studying the analysis of joint attention in a group of people, teachers' perception using egocentric camera view and mobile eye trackers, and automated anonymization of audiovisual data in classroom studies. Educational analytics present valuable opportunities to improve learning and teaching. The work in this dissertation suggests a computational framework for estimating student engagement and presentation competence, together with supporting computer vision problems.

    ASL video Corpora & Sign Bank: resources available through the American Sign Language Linguistic Research Project (ASLLRP)

    The American Sign Language Linguistic Research Project (ASLLRP) provides Internet access to high-quality ASL video data, generally including front and side views and a close-up of the face. The manual and non-manual components of the signing have been linguistically annotated using SignStream®. The recently expanded video corpora can be browsed and searched through the Data Access Interface (DAI 2) we have designed; it is possible to carry out complex searches. The data from our corpora can also be downloaded; annotations are available in an XML export format. We have also developed the ASLLRP Sign Bank, which contains almost 6,000 sign entries for lexical signs, with distinct English-based glosses, with a total of 41,830 examples of lexical signs (in addition to about 300 gestures, over 1,000 fingerspelled signs, and 475 classifier examples). The Sign Bank is likewise accessible and searchable on the Internet; it can also be accessed from within SignStream® (software to facilitate linguistic annotation and analysis of visual language data) to make annotations more accurate and efficient. Here we describe the available resources. These data have been used for many types of research in linguistics and in computer-based sign language recognition from video; examples of such research are provided in the latter part of this article. Published version.
