Keyword Based Keyframe Extraction in Online Video Collections
Keyframe extraction methods aim to find the most significant frames in a video sequence according to specific criteria. In this paper we propose a new method to search a video database for frames that are
related to a given keyword, and to extract the best ones according to a proposed quality factor. We first apply a speech-to-text algorithm to extract automatic captions from all the videos in a domain-specific
database. Then we select only those sequences (clips) whose captions include a given keyword, thus discarding a lot of information that is useless for our purposes. Each retrieved clip is then divided into shots
using a video segmentation method based on SURF descriptors and keypoints. The caption sentence is
projected onto the segmented clip, and we select the shot that includes the input keyword. The
selected shot is further inspected to find good-quality, stable parts, and the frame that maximizes a
quality metric is selected as the best and most significant frame. We compare the proposed algorithm with another keyframe extraction method based on local features, in terms of Significance and Quality.
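The keyword filtering and best-frame selection steps described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Clip` type, the function names, and the per-frame quality scores are all assumptions, and the speech-to-text and SURF-based shot segmentation stages are taken as already done.

```python
import re
from dataclasses import dataclass


@dataclass
class Clip:
    """A captioned video segment (hypothetical structure, not from the paper)."""
    start: float   # seconds
    end: float     # seconds
    caption: str   # text assumed to come from a speech-to-text step


def clips_with_keyword(clips, keyword):
    """Keep only clips whose caption contains the keyword as a whole word."""
    kw = keyword.lower()
    return [c for c in clips if kw in re.findall(r"\w+", c.caption.lower())]


def best_frame(clip, frame_quality):
    """Return the timestamp of the highest-quality frame inside the clip.

    frame_quality is a list of (timestamp, quality) pairs; the quality
    values stand in for the paper's quality metric, which is not defined here.
    """
    inside = [(t, q) for t, q in frame_quality if clip.start <= t < clip.end]
    if not inside:
        return None
    return max(inside, key=lambda tq: tq[1])[0]


def keyframes_for_keyword(clips, frame_quality, keyword):
    """Return (clip start, keyframe timestamp) for every matching clip."""
    result = []
    for clip in clips_with_keyword(clips, keyword):
        t = best_frame(clip, frame_quality)
        if t is not None:
            result.append((clip.start, t))
    return result


if __name__ == "__main__":
    clips = [
        Clip(0.0, 5.0, "The engine starts"),
        Clip(5.0, 10.0, "Now we change the tyre"),
        Clip(10.0, 15.0, "Engine check complete."),
    ]
    quality = [(1.0, 0.2), (3.0, 0.9), (6.0, 0.5), (11.0, 0.4), (14.0, 0.7)]
    print(keyframes_for_keyword(clips, quality, "engine"))
```

Matching on whole words rather than substrings avoids retrieving a clip for "engineer" when the keyword is "engine"; whether the original method does the same is not stated in the abstract.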
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
Text-to-video is a rapidly growing research area that aims to generate a
semantically, identically, and temporally coherent sequence of frames that
accurately aligns with the input text prompt. This study focuses on zero-shot
text-to-video generation with data and cost efficiency in mind. To generate a
semantically coherent video, exhibiting a rich portrayal of temporal semantics such
as the whole process of flower blooming rather than a set of "moving images",
we propose a novel Free-Bloom pipeline that harnesses large language models
(LLMs) as the director to generate a semantically coherent prompt sequence, while
pre-trained latent diffusion models (LDMs) serve as the animator to generate
high-fidelity frames. Furthermore, to ensure temporal and identical coherence while
maintaining semantic coherence, we propose a series of annotative modifications
for adapting LDMs in the reverse process, including joint noise sampling,
step-aware attention shift, and dual-path interpolation. Without any video data
or training requirements, Free-Bloom generates vivid and high-quality videos,
awe-inspiring in generating complex scenes with semantically meaningful frame
sequences. In addition, Free-Bloom is naturally compatible with LDM-based
extensions.
Comment: NeurIPS 2023; Project available at:
https://github.com/SooLab/Free-Bloo
Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective
This paper takes a problem-oriented perspective and presents a comprehensive
review of transfer learning methods, both shallow and deep, for cross-dataset
visual recognition. Specifically, it categorises the cross-dataset recognition
into seventeen problems based on a set of carefully chosen data and label
attributes. Such a problem-oriented taxonomy has allowed us to examine how
different transfer learning approaches tackle each problem and how well each
problem has been researched to date. The comprehensive problem-oriented review
of the advances in transfer learning with respect to the problem has not only
revealed the challenges in transfer learning for visual recognition, but also
the problems (e.g. eight of the seventeen problems) that have been scarcely
studied. This survey not only presents an up-to-date technical review for
researchers, but also a systematic approach and a reference for a machine
learning practitioner to categorise a real problem and to look up a
possible solution accordingly.
The COST292 experimental framework for TRECVID 2007
In this paper, we give an overview of the four tasks submitted to TRECVID 2007 by COST292. In the shot boundary (SB) detection task, four SB detectors have been developed and the results are merged using two merging algorithms. The framework developed for the high-level feature extraction task comprises four systems. The first system transforms a set of low-level descriptors into the semantic space using
Latent Semantic Analysis and utilises neural networks for feature detection. The second system uses a Bayesian classifier trained with a “bag of subregions”. The third system uses a multi-modal classifier based on SVMs and several descriptors. The fourth system uses two image classifiers based on ant colony optimisation and particle swarm optimisation respectively. The system submitted to the search task is
an interactive retrieval application combining retrieval functionalities in various modalities with a user interface supporting automatic and interactive search over all queries submitted. Finally, the rushes task submission is based on a video summarisation and browsing system comprising two different interest curve algorithms and three features.
Computationally Efficient Algorithm for Detecting Moving Objects with Moving Background
Moving object detection has been a constant topic of research for more than a decade, and the research community has witnessed various significant contributions that mitigate the problems of real-time moving object detection. In our prior studies, we addressed such issues using various sophisticated techniques that yielded superior results. However, a lightweight algorithm is needed that performs moving object detection while fully retaining detection accuracy. This paper presents a very simple algorithm that uses a visual descriptor to extract dynamic features during fast frame transitions. The proposed algorithm is compared with one of the most significant recent works on the same problem with respect to precision and recall, along with an analysis of its processing time.
Segmentation and Classification of Multimodal Imagery
Segmentation and classification are two important computer vision tasks that transform input data into a compact representation that allows fast and efficient analysis. Several challenges exist in generating accurate segmentation or classification results. In a video, for example, objects often change appearance and are partially occluded, making it difficult to delineate an object from its surroundings. This thesis proposes video segmentation and aerial image classification algorithms to address some of these problems and provide accurate results.
We developed a gradient-driven three-dimensional segmentation technique that partitions a video into spatiotemporal objects. The algorithm utilizes the local gradient computed at each pixel location together with the global boundary map acquired through deep learning methods to generate initial pixel groups by traversing from low to high gradient regions. A local clustering method is then employed to refine these initial pixel groups. The refined sub-volumes in the homogeneous regions of the video are selected as initial seeds and iteratively combined with adjacent groups based on intensity similarities. The volume growth is terminated at the color boundaries of the video. The over-segments obtained from the above steps are then merged hierarchically by a multivariate approach, yielding a final segmentation map for each frame. In addition, we implemented a streaming version of the above algorithm that requires less memory. The results illustrate that our proposed methodology compares favorably, on a qualitative and quantitative level, in segmentation quality and computational efficiency with the latest state-of-the-art techniques.
We also developed a convolutional neural network (CNN)-based method to efficiently combine information from multisensor remotely sensed images for pixel-wise semantic classification. The CNN features obtained from multiple spectral bands are fused at the initial layers of deep neural networks as opposed to the final layers. The early fusion architecture has fewer parameters and thereby reduces the computational time and GPU memory during training and inference. We also introduce a composite architecture that fuses features throughout the network. The methods were validated on four different datasets: ISPRS Potsdam, Vaihingen, IEEE Zeebruges, and a Sentinel-1 and Sentinel-2 dataset. For the Sentinel-1 and Sentinel-2 dataset, we obtain the ground truth labels for three classes from OpenStreetMap. Results on all the images show that early fusion, specifically after layer three of the network, achieves results similar to or better than a decision-level fusion mechanism. The performance of the proposed architecture is also on par with the state-of-the-art results.
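The claim that early fusion needs fewer parameters than decision-level fusion can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: a plain stack of 3x3 convolutions, hypothetical band counts and layer widths, channel concatenation as the fusion operator, and no classifier head. None of these details come from the thesis.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def stack_params(c_in, widths, k=3):
    """Total parameters of consecutive convolutions with the given output widths."""
    total = 0
    for c_out in widths:
        total += conv_params(c_in, c_out, k)
        c_in = c_out
    return total


def fusion_params(band_counts, widths, fuse_after, k=3):
    """Parameter count when sensor branches are concatenated after `fuse_after` layers.

    fuse_after = 0 stacks all bands at the input (earliest fusion);
    fuse_after = len(widths) keeps one full network per sensor
    (decision-level fusion, classifier head ignored).
    """
    branch_widths = widths[:fuse_after]
    trunk_widths = widths[fuse_after:]
    # One separate branch per sensor up to the fusion point.
    total = sum(stack_params(b, branch_widths, k) for b in band_counts)
    if trunk_widths:
        if fuse_after == 0:
            trunk_in = sum(band_counts)
        else:
            trunk_in = len(band_counts) * widths[fuse_after - 1]
        # Shared trunk that processes the concatenated features.
        total += stack_params(trunk_in, trunk_widths, k)
    return total


if __name__ == "__main__":
    bands = [2, 4]               # hypothetical: two sensors with 2 and 4 bands
    widths = [16, 32, 64, 128]   # hypothetical layer widths
    for depth in range(len(widths) + 1):
        print(depth, fusion_params(bands, widths, depth))
```

With these assumed sizes, fusing at the input costs about half the parameters of keeping two full networks, because every layer after the fusion point is shared rather than duplicated.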
ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain
Transformer design is the de facto standard for natural language processing
tasks. The success of the transformer design in natural language processing has
lately piqued the interest of researchers in the domain of computer vision.
When compared to Convolutional Neural Networks (CNNs), Vision Transformers
(ViTs) are becoming more popular and dominant solutions for many vision
problems. Transformer-based models outperform other types of networks, such as
convolutional and recurrent neural networks, in a range of visual benchmarks.
We evaluate various vision transformer models in this work by grouping them
by task and examining their benefits and drawbacks. ViTs can
overcome several possible difficulties with convolutional neural networks
(CNNs). The goal of this survey is to show the first use of ViTs in CV. In the
first phase, we categorize various CV applications where ViTs are appropriate.
Image classification, object identification, image segmentation, video
transformer, image denoising, and NAS are all CV applications. Our next step
will be to analyze the state-of-the-art in each area and identify the models
that are currently available. In addition, we outline numerous open research
difficulties as well as prospective research possibilities.
Comment: ICCD-2023. arXiv admin note: substantial text overlap with
arXiv:2208.04309 by other author
Virtual Reality Games for Motor Rehabilitation
This paper presents a fuzzy logic based method to track user satisfaction without the need for devices to monitor users' physiological conditions. User satisfaction is the key to any product's acceptance; computer applications and video games provide a unique opportunity to provide a tailored environment for each user to better suit their needs. We have implemented a non-adaptive fuzzy logic model of emotion, based on the emotional component of the Fuzzy Logic Adaptive Model of Emotion (FLAME) proposed by El-Nasr, to estimate player emotion in UnrealTournament 2004. In this paper we describe the implementation of this system and present the results of one of several play tests. Our research contradicts the current literature that suggests physiological measurements are needed. We show that it is possible to use a software-only method to estimate user emotion.
Novel perspectives and approaches to video summarization
The increasing volume of videos requires efficient and effective techniques to index and structure videos. Video summarization is such a technique that extracts the essential information from a video, so that tasks such as comprehension by users and video content analysis can be conducted more effectively and efficiently. The research presented in this thesis investigates three novel perspectives of the video summarization problem and provides approaches to such perspectives. Our first perspective is to employ local keypoints to perform keyframe selection. Two criteria, namely Coverage and Redundancy, are introduced to guide the keyframe selection process in order to identify those keyframes representing maximum video content and sharing minimum redundancy. To efficiently deal with long videos, a top-down strategy is proposed, which splits the summarization problem into two sub-problems: scene identification and scene summarization. Our second perspective is to formulate the task of video summarization as the problem of sparse dictionary reconstruction. Our method utilizes the true sparse constraint L0 norm, instead of the relaxed constraint L2,1 norm, such that keyframes are directly selected as a sparse dictionary that can reconstruct the video frames. In addition, a Percentage Of Reconstruction (POR) criterion is proposed to intuitively guide users in selecting an appropriate length of the summary. Furthermore, an L2,0 constrained sparse dictionary selection model is proposed to further verify the effectiveness of sparse dictionary reconstruction for video summarization. Lastly, we further investigate the multi-modal perspective of multimedia content summarization and enrichment. There are abundant images and videos on the Web, so it is highly desirable to effectively organize such resources for textual content enrichment. With the support of web-scale images, our proposed system, namely StoryImaging, is capable of enriching arbitrary textual stories with visual content.
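The first perspective, keyframe selection guided by Coverage and Redundancy, can be illustrated with a small greedy sketch. The definitions used here are assumptions made for illustration only: each frame is a set of matched keypoint IDs, coverage is the fraction of all keypoints seen in the selected frames, and redundancy is the mean pairwise Jaccard overlap. The thesis may define both criteria differently.

```python
from itertools import combinations


def greedy_keyframes(frames, target_coverage=0.9):
    """Greedily pick frames until the selected set covers enough keypoints.

    frames: list of sets of keypoint IDs (assumed already matched across frames).
    Returns the indices of the selected frames in selection order.
    """
    universe = set().union(*frames)
    covered = set()
    selected = []
    while universe and len(covered) / len(universe) < target_coverage:
        # Pick the frame contributing the most uncovered keypoints;
        # on ties, max() keeps the lowest index.
        best = max(range(len(frames)), key=lambda i: len(frames[i] - covered))
        if not frames[best] - covered:
            break  # nothing new can be added
        selected.append(best)
        covered |= frames[best]
    return selected


def coverage(frames, selected):
    """Fraction of all keypoints that appear in the selected frames."""
    universe = set().union(*frames)
    if not universe:
        return 0.0
    covered = set().union(*(frames[i] for i in selected))
    return len(covered) / len(universe)


def redundancy(frames, selected):
    """Mean pairwise Jaccard overlap between the selected frames."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    overlaps = [
        len(frames[i] & frames[j]) / len(frames[i] | frames[j]) for i, j in pairs
    ]
    return sum(overlaps) / len(pairs)


if __name__ == "__main__":
    frames = [{1, 2, 3}, {2, 3}, {4, 5}, {1, 2, 3, 4}, {6}]
    picked = greedy_keyframes(frames, target_coverage=0.8)
    print(picked, coverage(frames, picked), redundancy(frames, picked))
```

Choosing the frame with the largest number of uncovered keypoints raises coverage quickly while implicitly penalizing frames that mostly repeat what is already selected, which keeps redundancy low without a separate weighting term.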