
    Temporal Sentence Grounding in Streaming Videos

    This paper tackles a novel task: Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source and typically must be processed on the fly in applications such as surveillance and live-stream analysis. TSGSV is therefore challenging: the model must infer without future frames and process long histories of frames effectively, challenges that earlier methods do not address. To address these challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames relevant to the query. We conduct extensive experiments on the ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of the proposed methods, and a systematic ablation study confirms their effectiveness.
    Comment: Accepted by ACM MM 202
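    To make the compressor idea concrete, the following is a minimal PyTorch sketch of one way a language-guided feature compressor could work: score each streamed frame against the query embedding and keep only the top-scoring ones. The function name, dimensions, and keep ratio are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def compress_stream(frame_feats, query_feat, keep_ratio=0.25):
        """Illustrative language-guided compressor: score each historical
        frame by its similarity to the sentence query and keep the top-k,
        preserving temporal order.

        frame_feats: (T, D) features of the frames seen so far
        query_feat:  (D,)   pooled sentence-query embedding
        """
        scores = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)
        k = max(1, int(keep_ratio * frame_feats.size(0)))
        keep = scores.topk(k).indices.sort().values  # restore temporal order
        return frame_feats[keep], scores

    # Usage: retain the 25% of past frames most relevant to the query.
    feats = torch.randn(400, 512)   # 400 streamed frame features
    query = torch.randn(512)        # encoded sentence query
    kept, scores = compress_stream(feats, query)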

    2023-2024 Catalog

    The 2023-2024 Governors State University Undergraduate and Graduate Catalog is a comprehensive listing of current information regarding: Degree Requirements, Course Offerings, and Undergraduate and Graduate Rules and Regulations.

    Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization

    Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using only video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually and thereby exploiting only limited contextual information. As a result, the model lacks a comprehensive understanding (e.g., appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective by exploring and exploiting cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances from weak labels alone, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, we propose an end-to-end framework comprising a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, helping to learn a more structured and compact embedding space and thus reducing ambiguity in classification learning. Second, the GKSA module efficiently summarizes and propagates representative cross-video action knowledge in a learnable manner to promote holistic understanding of action patterns, which in turn allows the generation of high-confidence pseudo-labels for self-learning, alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms state-of-the-art methods and can easily be plugged into other WSTAL methods.
    Comment: Submitted to TCSVT. 14 pages and 7 figures
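    As a concrete illustration of memory-guided contrastive learning, here is a minimal InfoNCE-style sketch in PyTorch that pulls segment features toward a per-class memory slot and pushes them away from the others; the memory layout and temperature are our assumptions, not the RMGCL module itself.

    import torch
    import torch.nn.functional as F

    def memory_contrastive_loss(feats, labels, memory, tau=0.07):
        """Sketch of cross-video contrastive learning: each segment
        feature is attracted to the memory slot of its video-level class
        and repelled from the remaining class slots.

        feats:  (B, D) L2-normalized segment features
        labels: (B,)   video-level class indices
        memory: (C, D) L2-normalized per-class memory bank
        """
        logits = feats @ memory.t() / tau   # (B, C) similarity logits
        return F.cross_entropy(logits, labels)

    # Usage with a randomly initialized memory bank:
    B, D, C = 32, 256, 20
    feats = F.normalize(torch.randn(B, D), dim=-1)
    labels = torch.randint(0, C, (B,))
    memory = F.normalize(torch.randn(C, D), dim=-1)
    loss = memory_contrastive_loss(feats, labels, memory)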

    A multi-user process interface system for a process control computer

    This thesis describes a system that implements a distributed multi-user process interface, allowing the PDP-11/23 computer in the Electrical Engineering department at UCT to be used for process control. Use of the system is shared between postgraduate students doing research and undergraduates doing real-time control projects. The interface may be used concurrently by several users, and access is controlled in such a way as to prevent users' programs from interfering with one another. The process interface hardware is a GEC Micro-Media system, a stand-alone process interface that communicates with a host (the PDP-11/23) via a serial line. Hardware to drive a 600-metre serial link at 9600 baud between the PDP-11/23 and the Media interface was designed and built. The software system on the host, written in RTL/2, holds all data from the interface in a resident common database and continually updates it. Applications programs access the interface indirectly by reading and writing to the database, for which purpose a library of user interface routines is provided. To allow future expansion and modification of the Media interface, software (also written in RTL/2) was developed for an LSI-11 minicomputer interfaced to the Media bus; it emulates the operation of the GEC proprietary Micro-Media software, and a program to download it into the LSI-11 was written. A suite of diagnostic programs enables testing of the system hardware and software at various levels. To ease testing, teaching, and applications programming, a general-purpose package for the simulation of analogue systems was developed, as well as graphics routines for use with a Tektronix 4010 plotting terminal. A real-time computing project for a class of undergraduates was run in 1983; this project made extensive use of the system and demonstrated its viability.

    Fairness Testing: A Comprehensive Survey and Analysis of Trends

    Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to fairness testing of ML software, and this paper offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test) and testing components (i.e., what to test). Furthermore, we analyze the research focus, trends, and promising directions in the realm of fairness testing. We also identify widely adopted datasets and open-source tools for fairness testing.
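    The survey does not prescribe a single test oracle, but a common one in fairness testing checks a group metric such as demographic parity. A minimal sketch (metric choice and threshold are illustrative):

    def demographic_parity_gap(y_pred, group):
        """Absolute difference in positive-prediction rates between two
        protected groups; a common oracle in fairness testing.

        y_pred: iterable of 0/1 predictions
        group:  iterable of 0/1 protected-attribute values, aligned with y_pred
        """
        rates = {}
        for g in (0, 1):
            preds = [p for p, gg in zip(y_pred, group) if gg == g]
            rates[g] = sum(preds) / len(preds)
        return abs(rates[0] - rates[1])

    # A fairness test then asserts the gap stays under a chosen threshold.
    assert demographic_parity_gap([1, 0, 1, 1], [0, 0, 1, 1]) <= 0.5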

    Knowledge Distillation and Continual Learning for Optimized Deep Neural Networks

    Over the past few years, deep learning (DL) has achieved state-of-the-art performance on various human tasks such as speech generation, language translation, image segmentation, and object detection. While traditional machine learning models require hand-crafted features, deep learning algorithms can automatically extract discriminative features and learn complex knowledge from large datasets. This powerful learning ability makes deep learning models attractive to both academia and big corporations. Despite their popularity, deep learning methods still have two main limitations: large memory consumption and catastrophic knowledge forgetting. First, DL algorithms use very deep neural networks (DNNs) with many billions of parameters, which have a large model size and slow inference speed. This restricts the application of DNNs on resource-constrained devices such as mobile phones and autonomous vehicles. Second, DNNs are known to suffer from catastrophic forgetting: when incrementally learning new tasks, the model's performance on old tasks drops significantly. The ability to accommodate new knowledge while retaining previously learned knowledge is called continual learning. Since the real-world environments in which a model operates are always evolving, a robust neural network needs this continual learning ability to adapt to new changes.
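    The memory-consumption limitation is commonly attacked with knowledge distillation, where a small student mimics a large teacher. As a reference point for the techniques in this thesis, here is a minimal sketch of the classic soft-target distillation loss (temperature and weighting are illustrative choices, not the thesis's settings):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
        """Blend of a soft-target KL term (knowledge transferred from the
        teacher's softened predictions) and ordinary cross-entropy on the
        true labels."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, targets)
        return alpha * soft + (1 - alpha) * hard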

    Learning Scene Flow With Skeleton Guidance For 3D Action Recognition

    Among the existing modalities for 3D action recognition, 3D flow has been poorly examined, although it conveys rich motion cues for human actions. Presumably, its susceptibility to noise renders it intractable, challenging the learning process within deep models. This work demonstrates the use of 3D flow sequences in a deep spatiotemporal model and further proposes an incremental two-level spatial attention mechanism, guided by the skeleton domain, that emphasizes motion features close to the body joint areas according to their informativeness. To this end, an extended deep skeleton model is also introduced to learn the most discriminative action motion dynamics and estimate an informativeness score for each joint. Subsequently, a late fusion scheme is adopted between the two models to learn high-level cross-modal correlations. Experimental results on NTU RGB+D, currently the largest and most challenging dataset, demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art results.
    Comment: 18 pages, 3 figures, 3 tables, conference
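    One way to read the joint-informativeness weighting is as a soft spatial mask: Gaussian bumps centered on the body joints, scaled by each joint's learned score, applied multiplicatively to the flow feature map. The sketch below makes those layout assumptions explicit; it is not the paper's exact mechanism.

    import torch

    def skeleton_attention(flow_feats, joint_xy, joint_scores, sigma=0.1):
        """Illustrative skeleton-guided spatial attention.

        flow_feats:   (C, H, W) 3D-flow feature map
        joint_xy:     (J, 2)    joint coordinates normalized to [0, 1]
        joint_scores: (J,)      per-joint informativeness in [0, 1]
        """
        C, H, W = flow_feats.shape
        ys = torch.linspace(0, 1, H).view(H, 1)
        xs = torch.linspace(0, 1, W).view(1, W)
        attn = torch.zeros(H, W)
        for (jx, jy), s in zip(joint_xy, joint_scores):
            d2 = (xs - jx) ** 2 + (ys - jy) ** 2   # squared distance to joint
            attn += s * torch.exp(-d2 / (2 * sigma ** 2))
        attn = attn / attn.max().clamp(min=1e-6)   # normalize to [0, 1]
        return flow_feats * attn                   # broadcast over channels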

    Domain-Agnostic Multi-Modal Video Retrieval

    The rapid proliferation of multimedia content has necessitated the development of efficient video retrieval systems. Multi-modal video retrieval is a non-trivial task involving retrieval of relevant information across different modalities, such as text, audio, and video. Traditional approaches to multi-modal retrieval often rely on domain-specific techniques and models, limiting their generalizability across domains. This thesis develops a domain-agnostic approach to multi-modal video retrieval, enabling effective retrieval irrespective of the specific domain or data modality. The research explores techniques such as transfer learning, where pre-trained models from different domains are fine-tuned using domain-agnostic strategies. Additionally, attention mechanisms and fusion techniques are investigated to leverage cross-modal interactions and capture relevant information from diverse modalities. An important aspect of the research is finding robust methods for audio-video integration, as each modality individually provides retrieval cues for the text query. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the text and audio-video features. The proposed approach is quantitatively evaluated on video benchmark datasets such as MSR-VTT and YouCook2. The results show that the approach not only holds its own against state-of-the-art methods but also outperforms them in certain scenarios, with a notable 6% improvement in R@5 and R@10 metrics in the best-performing cases. Qualitative evaluations further illuminate the utility of audio, especially where there is a direct word match between text and audio, exemplified by queries like "A man is calling his colleagues" aligning with video audio containing the word "colleague". In essence, the findings of this research pave the way for a versatile and integrated solution for multi-modal retrieval, with potential applications spanning a wide range of domains.
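    A minimal sketch of the mutual-information objective described above: a symmetric InfoNCE loss between text embeddings and fused audio-video embeddings. The gated fusion module and temperature are our illustrative assumptions, not the thesis's architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedAVFusion(nn.Module):
        """Illustrative gated fusion of audio and video embeddings."""
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, a, v):
            g = torch.sigmoid(self.gate(torch.cat([a, v], dim=-1)))
            return F.normalize(g * a + (1 - g) * v, dim=-1)

    def infonce_loss(text, av, tau=0.05):
        """Symmetric InfoNCE: matched (text, audio-video) pairs lie on the
        diagonal; maximizing their agreement raises a lower bound on the
        mutual information between the two views."""
        logits = text @ av.t() / tau
        labels = torch.arange(text.size(0))
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))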

    KARTAL: Web Application Vulnerability Hunting Using Large Language Models

    Broken Access Control is the most serious web application security risk, as published by the Open Worldwide Application Security Project (OWASP). This category contains highly complex vulnerabilities such as Broken Object Level Authorization (BOLA) and Exposure of Sensitive Information. Finding such critical vulnerabilities in large software systems requires intelligent and automated tools. State-of-the-art (SOTA) research, including hybrid application security testing tools, algorithmic brute-forcers, and artificial intelligence, has shown great promise in detection. Nevertheless, there exists a gap in research for reliably identifying logical and context-dependent Broken Access Control vulnerabilities. We propose KARTAL, a novel method for web application vulnerability detection using a Large Language Model (LLM). It consists of three components: a Fuzzer, a Prompter, and a Detector. The Fuzzer is responsible for methodically collecting application behaviour. The Prompter processes the data from the Fuzzer and formulates a prompt. The Detector uses an LLM that we have fine-tuned to detect vulnerabilities. In this study, we investigate the performance, key factors, and limitations of the proposed method. We experiment with fine-tuning three types of decoder-only pre-trained transformers for detecting two sophisticated vulnerabilities. Our best model attained an accuracy of 87.19%, with an F1 score of 0.82. Using hardware acceleration on a consumer-grade laptop, our fastest model can make up to 539 predictions per second. Experiments on varying the training sample size demonstrated the strong learning capabilities of our model: every 400 samples added to training resulted in an average MCC score improvement of 19.58%. Furthermore, the dynamic properties of KARTAL enable inference-time adaptation to the application domain, resulting in reduced false positives.
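    The three components suggest a simple pipeline: fuzz, turn each observation into a prompt, and classify with the fine-tuned LLM. The sketch below is illustrative only; the prompt template, label set, and classify() stub are our assumptions, not KARTAL's actual artifacts.

    def prompter(request, response):
        """Turn one fuzzed request/response pair into a detector prompt."""
        return (
            "Decide whether the following web application behaviour shows "
            "Broken Object Level Authorization or sensitive-data exposure.\n"
            f"Request: {request}\nResponse: {response}\nLabel:"
        )

    def detector(prompt, classify):
        """classify() stands in for the fine-tuned decoder-only LLM; it
        maps a prompt to a vulnerability label."""
        return classify(prompt)

    # Usage over behaviour collected by a fuzzer (hypothetical data):
    fuzzed = [("GET /api/users/42 as user 7",
               "200 OK, body contains another user's record")]
    for req, resp in fuzzed:
        print(detector(prompter(req, resp), classify=lambda p: "BOLA"))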

    Novel neural architectures & algorithms for efficient inference

    In the last decade, the machine learning community embraced deep neural networks (DNNs) wholeheartedly with the advent of architectures such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers. These models power applications such as ChatGPT and Imagen, and have achieved state-of-the-art (SOTA) performance on many vision, speech, and language modeling tasks. However, SOTA performance comes at a cost: large model size, compute-intensive training, increased inference latency, and high working memory. This thesis aims to improve the resource efficiency of neural architectures, i.e., to significantly reduce the computational, storage, and energy consumption of a DNN without any significant loss in performance. Toward this goal, we explore novel neural architectures as well as training algorithms that allow low-capacity models to achieve near-SOTA performance. We divide this thesis into two dimensions: Efficient Low Complexity Models and Input Hardness Adaptive Models. Along the first dimension, Efficient Low Complexity Models, we improve DNN performance by addressing instabilities in existing architectures and training methods. We propose novel neural architectures inspired by ordinary differential equations (ODEs) to reinforce input signals and attend to salient feature regions. In addition, we show that carefully designed training schemes improve the performance of existing neural networks. We divide this exploration into two parts. (a) Efficient Low Complexity RNNs: we improve RNN resource efficiency by addressing poor gradients, noise amplification, and issues with backpropagation-through-time (BPTT) training. First, we improve RNNs by solving ODEs that eliminate vanishing and exploding gradients during training; to do so, we present Incremental Recurrent Neural Networks (iRNNs) that keep track of increments in the equilibrium surface. Next, we propose Time Adaptive RNNs that mitigate the noise propagation issue in RNNs by modulating the time constants in the ODE-based transition function. We empirically demonstrate the superiority of ODE-based neural architectures over existing RNNs. Finally, we propose the Forward Propagation Through Time (FPTT) algorithm for training RNNs and show that FPTT yields significant gains over the more conventional BPTT scheme. (b) Efficient Low Complexity CNNs: we improve CNN architectures by reducing their resource usage. CNNs require greater depth to generate high-level features, resulting in computationally expensive models. We design a novel residual block, the Global layer, that constrains the input and output features by approximately solving partial differential equations (PDEs); it yields better receptive fields than traditional convolutional blocks and thus results in shallower networks. Further, we reduce the model footprint by enforcing a novel inductive bias that formulates the output of a residual block as a spatial interpolation between high-compute anchor pixels and low-compute cheaper pixels, yielding spatially interpolated convolutional blocks (SI-CNNs) with better compute-performance trade-offs. Finally, we propose an algorithm that enforces various distributional constraints during training in order to achieve better generalization; we refer to this scheme as distributionally constrained learning (DCL).
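    To give a flavour of the ODE view of recurrence: instead of overwriting the hidden state, the state accumulates a small increment each step, a forward-Euler discretization of an ODE. The cell below is a minimal sketch under that reading, not the thesis's exact iRNN update.

    import torch
    import torch.nn as nn

    class EulerRNNCell(nn.Module):
        """Sketch of an ODE-flavoured recurrent cell: h <- h + dt * f(x, h),
        a forward-Euler step, rather than the plain overwrite h <- f(x, h).
        The additive update tempers vanishing/exploding gradients."""
        def __init__(self, input_dim, hidden_dim, dt=0.1):
            super().__init__()
            self.f = nn.Linear(input_dim + hidden_dim, hidden_dim)
            self.dt = dt

        def forward(self, x, h):
            dh = torch.tanh(self.f(torch.cat([x, h], dim=-1)))
            return h + self.dt * dh

    # Unroll over a sequence of shape (time, batch, features):
    cell = EulerRNNCell(16, 32)
    h = torch.zeros(8, 32)
    for x_t in torch.randn(20, 8, 16):
        h = cell(x_t, h)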
Along the second dimension, Input Hardness Adaptive Models, we introduce the notion of the hardness of an input relative to an architecture. In the first dimension, a neural network allocates the same resources, such as compute, storage, and working memory, to every input, inherently assuming that all examples are equally hard for the model. Here we challenge this assumption, reasoning that some inputs are relatively easy for a network to predict compared with others. Input hardness enables us to create selective classifiers, wherein a low-capacity network handles simple inputs while abstaining from predictions on complex inputs. Next, we create hybrid models that route the hard inputs from the low-capacity abstaining network to a high-capacity expert model, and we design various architectures that adhere to this hybrid inference style. Further, input hardness enables us to selectively distill the knowledge of a high-capacity model into a low-capacity model by cleverly discarding hard inputs during the distillation procedure. Finally, we conclude the thesis by sketching out several interesting future research directions that emerge as extensions of the ideas explored in this work.
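    The hybrid-inference idea in miniature: a cheap model answers the inputs it is confident about and defers the rest to an expert. Using maximum softmax probability as the hardness proxy is our simplification; the thesis's notion of input hardness is more general.

    import torch

    @torch.no_grad()
    def hybrid_predict(x, cheap, expert, threshold=0.9):
        """Sketch of hardness-adaptive inference: the low-capacity model
        predicts on confident (easy) inputs and abstains on the rest,
        which are routed to the high-capacity expert."""
        probs = torch.softmax(cheap(x), dim=-1)
        conf, preds = probs.max(dim=-1)
        hard = conf < threshold                 # abstain on these inputs
        if hard.any():
            preds[hard] = expert(x[hard]).argmax(dim=-1)
        return preds, hard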