6 research outputs found

    Towards AI enabled automated tracking of multiple boxers

    Continuous tracking of boxers across multiple training sessions helps quantify traits required for the well-known ten-point-must system. However, continuous tracking of multiple athletes across multiple training sessions remains a challenge, because it is difficult to precisely segment bout boundaries in a recorded video stream. Furthermore, re-identification of the same athlete across different periods, or even within the same bout, remains a challenge. The difficulty is compounded when the footage comes from a single fixed camera mounted in top view. This work summarizes our progress in creating such a system with an economical single fixed top-view camera. Specifically, we describe an improved algorithm for bout transition detection and for in-bout continuous player identification without erroneous ID updates or ID switches. On our custom-collected data of ~11 hours (athlete count: 45, bouts: 189), our transition detection algorithm achieves 90% accuracy and continuous ID tracking achieves IDU=0, IDS=0.
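    As a rough illustration only (not the authors' algorithm, which the abstract does not detail), the sketch below assigns persistent IDs to per-frame athlete centroids via gated nearest-centroid matching, so an existing ID is never handed to a distant detection, and flags a bout transition when fewer than two athletes are visible; the Python names and the distance threshold are assumptions.

    # Minimal sketch (not the authors' algorithm): persistent ID assignment for
    # top-view athlete centroids via gated nearest-centroid matching, plus a
    # naive bout-transition cue. Detections are assumed to be (x, y) tuples.
    import math

    MAX_MATCH_DIST = 80.0  # pixels; assumed gate so a distant detection never steals an ID

    class SimpleTracker:
        def __init__(self):
            self.tracks = {}   # track id -> last known (x, y)
            self.next_id = 0

        def update(self, detections):
            """Greedily match detections to existing tracks by distance."""
            assigned = {}
            unmatched = list(detections)
            for tid, (tx, ty) in list(self.tracks.items()):
                if not unmatched:
                    break
                dist, best = min(
                    ((math.hypot(d[0] - tx, d[1] - ty), d) for d in unmatched),
                    key=lambda pair: pair[0],
                )
                if dist <= MAX_MATCH_DIST:
                    assigned[tid] = best
                    self.tracks[tid] = best
                    unmatched.remove(best)
            # leftover detections start fresh IDs, so existing IDs are never reassigned
            for det in unmatched:
                assigned[self.next_id] = det
                self.tracks[self.next_id] = det
                self.next_id += 1
            return assigned

    def is_bout_transition(num_athletes_in_ring, active_threshold=2):
        """Naive transition cue: fewer than two athletes visible in the ring."""
        return num_athletes_in_ring < active_threshold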

    Modified Deep Pattern Classifier on Indonesian Traditional Dance Spatio-Temporal Data

    Traditional dances, such as those of Indonesia, have complex and unique patterns that require accurate classification for cultural preservation and documentation. However, traditional dance classification methods often rely on manual analysis and subjective judgment, which leads to inconsistencies and limitations. This research explores a modified deep pattern classifier for traditional dance movements in videos, including Gambyong, Remo, and Topeng, using a Convolutional Neural Network (CNN). The model's performance is evaluated on a spatio-temporal test dataset of Indonesian traditional dance videos. The videos are processed through frame-level segmentation, enabling the CNN to capture nuances in posture, footwork, and facial expressions exhibited by the dancers. The resulting confusion matrix then enables the calculation of performance metrics such as accuracy, precision, sensitivity, and F1-score. The results show a high accuracy of 97.5%, indicating reliable classification of the dataset. Furthermore, future research directions are suggested, including investigating advanced CNN architectures, incorporating temporal information through recurrent neural networks, exploring transfer learning techniques, and integrating user feedback for iterative refinement of the model. The proposed method has the potential to advance dance analysis and find applications in dance education, choreography, and cultural preservation.
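    The evaluation step described above reduces to standard confusion-matrix arithmetic; the sketch below computes overall accuracy and per-class precision, sensitivity, and F1 for the three dance classes. The counts in the matrix are placeholders, not the paper's results.

    # Minimal sketch: metrics from a 3-class confusion matrix (placeholder counts).
    import numpy as np

    classes = ["Gambyong", "Remo", "Topeng"]
    # rows = true class, columns = predicted class (hypothetical counts)
    cm = np.array([
        [38,  1,  1],
        [ 1, 39,  0],
        [ 0,  1, 39],
    ])

    accuracy = np.trace(cm) / cm.sum()
    for i, name in enumerate(classes):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp
        fn = cm[i, :].sum() - tp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        print(f"{name}: precision={precision:.3f} sensitivity={recall:.3f} F1={f1:.3f}")
    print(f"overall accuracy={accuracy:.3f}")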

    Self-supervised learning to detect key frames in videos

    Detecting key frames in videos is a common problem in many applications such as video classification, action recognition and video summarization. These tasks can be performed more efficiently using only a handful of key frames rather than the full video. Existing key frame detection approaches are mostly designed for supervised learning and require manual labelling of key frames in a large corpus of training data to train the models. Labelling requires human annotators from different backgrounds to annotate key frames in videos, which is not only expensive and time-consuming but also prone to subjective errors and inconsistencies between labellers. To overcome these problems, we propose an automatic self-supervised method for detecting key frames in a video. Our method comprises a two-stream ConvNet and a novel automatic annotation architecture able to reliably annotate key frames in a video for self-supervised learning of the ConvNet. The proposed ConvNet learns deep appearance and motion features to detect frames that are unique. The trained network is then able to detect key frames in test videos. Extensive experiments on the UCF101 human action and VSUMM video summarization datasets demonstrate the effectiveness of our proposed method.
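    To make the two-stream (appearance plus motion) idea concrete, here is a minimal non-learned sketch that scores each frame by how far its appearance and frame-difference features lie from the average frame and keeps the top-k as candidate key frames. It stands in for the paper's ConvNet and automatic annotation architecture; the file name, resolution, and scoring rule are assumptions.

    # Minimal sketch (not the paper's ConvNet): rank frames by appearance + motion
    # "uniqueness" and keep the top-k as pseudo key frames.
    import cv2
    import numpy as np

    def frame_features(path, size=(64, 64)):
        """Return per-frame appearance vectors and motion (frame-difference) vectors."""
        cap = cv2.VideoCapture(path)
        appearance, motion, prev = [], [], None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2GRAY).astype(np.float32)
            appearance.append(gray.ravel())
            motion.append((gray - prev).ravel() if prev is not None
                          else np.zeros(size[0] * size[1], np.float32))
            prev = gray
        cap.release()
        return np.array(appearance), np.array(motion)

    def key_frame_indices(appearance, motion, k=10):
        """Rank frames by distance from the average frame in both streams."""
        feats = np.hstack([appearance, motion])
        centre = feats.mean(axis=0)
        scores = np.linalg.norm(feats - centre, axis=1)
        return np.argsort(scores)[-k:][::-1]

    # usage (hypothetical file):
    # app, mot = frame_features("ucf101_clip.avi")
    # print(key_frame_indices(app, mot, k=5))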

    Deep Learning for Semantic Video Understanding

    The field of computer vision has long strived to extract understanding from images and video sequences. The recent flood of video data along with massive increases in computing power have provided the perfect environment for advanced research on extracting intelligence from video data. Video data is ubiquitous, occurring in numerous everyday activities such as surveillance, traffic, movies, sports, etc. This massive amount of video needs to be analyzed and processed efficiently to extract semantic features towards video understanding. Such capabilities could benefit surveillance, video analytics and visually challenged people. While watching a long video, humans have the uncanny ability to bypass unnecessary information and concentrate on the important events. These key events can be used as a higher-level description or summary of a long video. Inspired by the human visual cortex, this research affords computers similar abilities using neural networks. Useful or interesting events are first extracted from a video, and then deep learning methodologies are used to extract natural language summaries for each video sequence. Previous approaches to video description have either been domain specific or used a template-based approach that fills in detected objects, verbs, or actions to constitute a grammatically correct sentence. This work involves exploiting temporal contextual information for sentence generation while working on wide-domain datasets. Current state-of-the-art video description methodologies are well suited for small video clips, whereas this research can also be applied to long sequences of video. This work proposes methods to generate visual summaries of long videos, and in addition proposes techniques to annotate and generate textual summaries of the videos using recurrent networks. End-to-end video summarization depends heavily on abstractive summarization of video descriptions. State-of-the-art joint neural language and attention models have been used to generate textual summaries. Interesting segments of long video are extracted based on image quality as well as cinematographic and consumer preferences. This novel approach will be a stepping stone for a variety of innovative applications such as video retrieval, automatic summarization for visually impaired persons, automatic movie review generation, and video question and answering systems.
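    One ingredient mentioned above, selecting segments by image quality, can be sketched with a simple sharpness score (variance of the Laplacian) averaged over fixed-length segments; the actual system also weighs cinematographic and consumer preferences, and the file name below is hypothetical.

    # Minimal sketch: per-segment image-quality (sharpness) scores for a video.
    import cv2
    import numpy as np

    def segment_quality_scores(path, segment_len=30):
        """Average per-frame sharpness over fixed-length segments (in frames)."""
        cap = cv2.VideoCapture(path)
        sharpness = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            sharpness.append(cv2.Laplacian(gray, cv2.CV_64F).var())
        cap.release()
        return [
            float(np.mean(sharpness[i:i + segment_len]))
            for i in range(0, len(sharpness), segment_len)
        ]  # higher = sharper, candidate "interesting" segments

    # usage (hypothetical file): top five sharpest segments
    # scores = segment_quality_scores("movie.mp4")
    # print(sorted(enumerate(scores), key=lambda s: -s[1])[:5])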

    Multi-task Learning for Visual Perception in Automated Driving

    Every year, 1.2 million people die, and up to 50 million people are injured in accidents worldwide. Automated driving can significantly reduce that number. Automated driving also has several economic and societal benefits that include convenient and efficient transportation, enhanced mobility for the disabled and elderly population, etc. Visual perception is the ability to perceive the environment, which is a critical component in the decision-making that builds safer automated driving. Recent progress in computer vision and deep learning, paired with high-quality sensors like cameras and LiDARs, has fueled mature visual perception solutions. The main bottleneck for these solutions is the limited processing power available to build real-time applications. This bottleneck often leads to a trade-off between performance and run-time efficiency. To address these bottlenecks, we focus on: 1) building optimized architectures for different visual perception tasks like semantic segmentation, panoptic segmentation, etc. using convolutional neural networks that have high performance and low computational complexity, 2) using multi-task learning to overcome computational bottlenecks by sharing the initial convolutional layers between different tasks while developing advanced learning strategies that achieve balanced learning between tasks.
    PhD, College of Engineering & Computer Science, University of Michigan-Dearborn
    http://deepblue.lib.umich.edu/bitstream/2027.42/167355/1/Sumanth Chennupati Final Dissertation.pd
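    A minimal sketch of the shared-encoder idea described above, assuming two illustrative dense-prediction heads (semantic segmentation and depth) and static loss weights; the dissertation's actual architectures and loss-balancing strategies are more elaborate.

    # Minimal sketch (not the dissertation's architecture): a shared convolutional
    # encoder with two task heads, trained with a weighted sum of task losses.
    # All layer sizes and the choice of tasks are illustrative assumptions.
    import torch
    import torch.nn as nn

    class MultiTaskNet(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            # shared initial convolutional layers, as described in the abstract
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # task-specific heads
            self.seg_head = nn.Conv2d(64, num_classes, 1)   # per-pixel class logits
            self.depth_head = nn.Conv2d(64, 1, 1)           # per-pixel depth

        def forward(self, x):
            feats = self.encoder(x)
            return self.seg_head(feats), self.depth_head(feats)

    def multitask_loss(seg_logits, seg_target, depth_pred, depth_target,
                       w_seg=1.0, w_depth=1.0):
        """Static loss weighting; the dissertation studies more balanced strategies."""
        seg_loss = nn.functional.cross_entropy(seg_logits, seg_target)
        depth_loss = nn.functional.l1_loss(depth_pred, depth_target)
        return w_seg * seg_loss + w_depth * depth_loss

    # usage on random tensors:
    # net = MultiTaskNet()
    # seg, depth = net(torch.randn(2, 3, 128, 128))  # seg: (2, 10, 32, 32), depth: (2, 1, 32, 32)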