206 research outputs found

    Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification

    Full text link
    In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. To reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose the Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparably transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the best single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer, while requiring only 11.82% of its parameters and 0.81% of its FLOPs.
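
    The clip segmentation step that DIViTA relies on can be illustrated with a simple shot-boundary heuristic. The sketch below is not the authors' implementation; it splits a trailer into shots by thresholding color-histogram distances between consecutive frames, and the function names, bin count, and threshold are illustrative assumptions.

    import numpy as np

    def color_histogram(frame, bins=32):
        # Per-channel color histogram of an HxWx3 uint8 frame, L1-normalized.
        hists = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
        h = np.concatenate(hists).astype(np.float64)
        return h / (h.sum() + 1e-8)

    def segment_into_shots(frames, threshold=0.4):
        # Split a trailer (sequence of frames) into (start, end) shot spans by
        # thresholding the L1 distance between consecutive frame histograms.
        shots, start = [], 0
        prev = color_histogram(frames[0])
        for i in range(1, len(frames)):
            cur = color_histogram(frames[i])
            if 0.5 * np.abs(cur - prev).sum() > threshold:  # large jump -> shot boundary
                shots.append((start, i))
                start = i
            prev = cur
        shots.append((start, len(frames)))
        return shots

    Each resulting clip could then be fed to a pretrained image or video backbone; in DIViTA the point of such segmentation is that clips are internally coherent, which narrows the gap to the pretraining data.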

    Efficient Techniques for Management and Delivery of Video Data

    Get PDF
    The rapid advances in electronic imaging, storage, data compression, telecommunications, and networking technology have resulted in the vast creation and use of digital videos in many important applications such as digital libraries, distance learning, public information systems, electronic commerce, movie-on-demand, etc. This brings about the need for management as well as delivery of video data. Organizing and managing video data, however, is much more complex than managing conventional text data due to their semantically rich and unstructured contents. Also, the enormous size of video files requires high communication bandwidth for data delivery. In this dissertation, I present the following techniques for video data management and delivery. Decomposing video into meaningful pieces (i.e., shots) is a fundamental step in handling the complicated contents of video data. Content-based video parsing techniques are presented and analyzed. In order to reduce the computation cost substantially, a non-sequential approach to shot boundary detection is investigated. Efficient browsing and indexing of video data are essential for video data management. Non-linear browsing and cost-effective indexing schemes for video data based on their contents are described and evaluated. Delivering long videos over limited bandwidth while satisfying varied user requests is challenging, so to reduce the demand on bandwidth, a hybrid of two effective approaches, periodic broadcast and scheduled multicast, is discussed and simulated. The current techniques related to the above problems are discussed thoroughly to explain their advantages and disadvantages and to motivate the new, improved schemes. Extensive experiments and simulations, as well as conceptual analyses, are provided to compare the introduced techniques with existing ones. The results indicate that they outperform recent techniques by a significant margin. I conclude the dissertation with a discussion of future research directions.
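
    The bandwidth-reduction idea behind periodic broadcast can be illustrated with geometric segment sizing, in which a long video is cut into progressively larger pieces that loop on dedicated channels. This is a minimal sketch, not the dissertation's hybrid periodic-broadcast/scheduled-multicast scheme; the pyramid-style growth ratio, channel count, and function names are assumptions.

    def pyramid_segments(video_len_min, n_channels, ratio=2.0):
        # Split a video into n_channels segments whose lengths grow geometrically.
        # Each segment is broadcast in a loop on its own channel, so a client's
        # worst-case start-up delay equals the length of the first (shortest) segment.
        weights = [ratio ** i for i in range(n_channels)]
        unit = video_len_min / sum(weights)
        return [w * unit for w in weights]

    if __name__ == "__main__":
        segs = pyramid_segments(video_len_min=120, n_channels=5)
        print("segment lengths (min):", [round(s, 2) for s in segs])  # ~[3.87, 7.74, 15.48, 30.97, 61.94]
        print("worst-case wait (min):", round(segs[0], 2))            # ~3.87

    With five channels, a two-hour video keeps the worst-case start-up delay under four minutes, which is why such schemes scale well for popular titles; scheduled multicast then handles the less popular requests on demand.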

    Feature based dynamic intra-video indexing

    Get PDF
    A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy. With the advent of digital imagery and its widespread application in all walks of life, video has become an important component of modern communication. Video content ranging from broadcast news, sports, personal videos, surveillance, movies, and entertainment is increasing exponentially in quantity, and it is becoming a challenge to retrieve content of interest from the corpora. This has led to an increased interest among researchers in investigating concepts of video structure analysis, feature extraction, content annotation, tagging, video indexing, querying, and retrieval to fulfil these requirements. However, most previous work is confined to specific domains and constrained by quality, processing, and storage capabilities. This thesis presents a novel framework agglomerating the established approaches, from feature extraction to browsing, in one content-based video retrieval system. The proposed framework significantly fills the identified gap while satisfying the imposed constraints of processing, storage, quality, and retrieval time. The output entails a framework, methodology, and prototype application that allow the user to efficiently and effectively retrieve content of interest, such as age, gender, and activity, by specifying the relevant query. Experiments have shown plausible results, with an average precision and recall of 0.91 and 0.92 respectively for face detection using a Haar-wavelet-based approach. Precision for age ranges from 0.82 to 0.91 and recall from 0.78 to 0.84. Gender recognition gives better precision for males (0.89) than for females, while recall is higher for females (0.92). The activity of the subject has been detected using the Hough transform and classified using a Hidden Markov Model. A comprehensive dataset to support similar studies has also been developed as part of the research process. A Graphical User Interface (GUI) providing a friendly and intuitive interface has been integrated into the developed system to facilitate the retrieval process. A comparison using the intraclass correlation coefficient (ICC) shows that the performance of the system closely resembles that of the human annotator. The performance has been optimised for time and error rate.
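
    The face-detection and evaluation steps reported above can be sketched with OpenCV's Haar cascade detector, one common realization of a Haar-feature approach. This is not the thesis's exact pipeline; the cascade file, detector parameters, and evaluation helper are illustrative assumptions.

    import cv2

    def detect_faces(image_path, scale_factor=1.1, min_neighbors=5):
        # Return bounding boxes (x, y, w, h) of frontal faces found in the image.
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
        return cascade.detectMultiScale(gray, scaleFactor=scale_factor,
                                        minNeighbors=min_neighbors)

    def precision_recall(num_true_pos, num_detections, num_ground_truth):
        # Precision/recall of the kind reported per attribute above (e.g. 0.91 / 0.92).
        precision = num_true_pos / num_detections if num_detections else 0.0
        recall = num_true_pos / num_ground_truth if num_ground_truth else 0.0
        return precision, recall

    Detected face regions would then feed the downstream age and gender classifiers, and the precision/recall helper mirrors how per-attribute figures such as those quoted in the abstract are computed against annotated ground truth.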