User-centred video abstraction
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. The rapid growth of digital video content in recent years has created a need for technologies that can produce condensed but semantically rich versions of an input video stream effectively. Consequently, video summarisation is becoming increasingly popular in the multimedia community, and numerous video abstraction approaches have been proposed. These techniques can be divided into two major categories, automatic and semi-automatic, according to the level of human intervention required in the summarisation process. Fully automated methods mainly apply mathematical and statistical algorithms to low-level visual, aural and textual features in order to extract the most significant segments of the original video. However, the effectiveness of such techniques is restricted by a number of factors, such as domain dependency, computational expense and the inability to understand the semantics of videos from low-level features. The second category of techniques attempts to improve the quality of summaries by involving humans in the abstraction process to bridge the semantic gap. Nonetheless, a single user's subjectivity and external factors such as distraction can degrade the performance of this group of approaches. Accordingly, in this thesis we have focused on the development of three effective user-centred video summarisation techniques that can be applied to different video categories and generate satisfactory results. In our first proposed approach, a novel mechanism for user-centred video summarisation is presented for scenarios in which multiple actors take part in the summarisation process, in order to minimise the negative effects of relying on a single user.
Based on our recommended algorithm, the video frames were initially scored by a group of video annotators "on the fly". These assigned scores were then averaged to generate a single saliency score for each video frame and, finally, the highest-scored video frames, together with the corresponding audio and textual content, were extracted for inclusion in the final summary. The effectiveness of our approach was assessed by comparing the video summaries it generated against the results obtained from three existing automatic summarisation tools that adopt different modalities for abstraction purposes. The experimental results indicated that our proposed method is capable of delivering remarkable outcomes in terms of Overall Satisfaction and Precision with an acceptable Recall rate, indicating the usefulness of involving user input in the video summarisation process. In an attempt to provide a better user experience, we have proposed a personalised video summarisation method able to customise the generated summaries in accordance with the viewers' preferences. Accordingly, the end-user's priority levels towards different video scenes were captured and used to update the average scores previously assigned by the video annotators. Our earlier proposed summarisation method was then adopted to extract the most significant audio-visual content of the video. Experimental results indicated the capability of this approach to deliver superior outcomes compared with our previously proposed method and the three other automatic summarisation tools. Finally, we have attempted to reduce the level of audience involvement required for personalisation by proposing a new method for producing personalised video summaries. Accordingly, SIFT visual features were adopted to identify the semantic categories of the video scenes.
By fusing this retrieved data with pre-built user profiles, personalised video abstracts can be created. Experimental results showed the effectiveness of this method in delivering superior outcomes compared with our previously recommended algorithm and the three other automatic summarisation techniques.
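The score-averaging step described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name, the input layout (one list of per-frame scores per annotator) and the fixed summary length are all assumptions made for illustration, and the extraction of the matching audio and text is omitted.

```python
from statistics import mean


def summarise_by_average_score(annotator_scores, summary_length):
    """Average per-frame annotator scores and pick the top-scoring frames.

    annotator_scores: list of lists, one inner list per annotator, holding
    one score per video frame (hypothetical input layout).
    summary_length: number of frames to keep (hypothetical parameter).
    """
    num_frames = len(annotator_scores[0])
    if any(len(scores) != num_frames for scores in annotator_scores):
        raise ValueError("every annotator must score every frame")

    # One saliency score per frame: the mean of all annotators' scores.
    saliency = [mean(frame_scores) for frame_scores in zip(*annotator_scores)]

    # Rank frame indices by saliency, highest first (ties keep frame order).
    ranked = sorted(range(num_frames), key=lambda i: saliency[i], reverse=True)

    # Return the selected frames in their original temporal order.
    return saliency, sorted(ranked[:summary_length])


scores = [[1, 5, 3, 4], [2, 4, 3, 5], [3, 3, 3, 3]]
saliency, selected = summarise_by_average_score(scores, summary_length=2)
print(saliency, selected)
```

Averaging is what dampens the effect of any single annotator's subjectivity; the personalised variant described above would adjust these averages with the end-user's priorities before ranking.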
Utilization of multimodal interaction signals for automatic summarisation of academic presentations
Multimedia archives are expanding rapidly. For these, there exists a shortage of retrieval and summarisation techniques for accessing and browsing content where the main information exists in the audio stream. This thesis describes an investigation into the development of novel feature extraction and summarisation techniques for audio-visual recordings of academic presentations.
We report on the development of a multimodal dataset of academic presentations. This dataset is labelled by human annotators for the concepts of presentation ratings, audience engagement levels, speaker emphasis, and audience comprehension. We investigate the automatic classification of speaker ratings and audience engagement by extracting audio-visual features from video of the presenter and audience and training classifiers to predict speaker ratings and engagement levels. Following this, we investigate automatic identification of areas of emphasised speech. By analysing all human annotated areas of emphasised speech, minimum speech pitch and gesticulation are identified as indicating emphasised speech when occurring together.
Investigations are conducted into the speaker's potential to be comprehended by the audience. Following crowdsourced annotation of comprehension levels during academic presentations, a set of audio-visual features considered most likely to affect comprehension levels are extracted. Classifiers are trained on these features and comprehension levels could be predicted over a 7-class scale to an accuracy of 49%, and over a binary distribution to an accuracy of 85%.
Presentation summaries are built by segmenting speech transcripts into phrases, and using keywords extracted from the transcripts in conjunction with extracted paralinguistic features. The highest-ranking segments are then extracted to build presentation summaries. Summaries are evaluated by performing eye-tracking experiments as participants watch presentation videos. Participants were found to be consistently more engaged for presentation summaries than for full presentations. Summaries were also found to contain a higher concentration of new information than full presentations.
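The segment-ranking idea in this abstract can be sketched as follows. The abstract does not say how keyword and paralinguistic evidence are combined, so the weighted sum, the `emphasis` field, the `keyword_weight` parameter and the function name are all assumptions made for illustration rather than the thesis's method.

```python
def rank_segments(segments, keywords, top_n, keyword_weight=0.5):
    """Return the indices of the top_n segments, in presentation order.

    segments: list of dicts with 'text' (str) and 'emphasis' (float in
    [0, 1]), a hypothetical stand-in for the paralinguistic features.
    keywords: iterable of keywords extracted from the transcript.
    """
    keyword_set = {k.lower() for k in keywords}
    scored = []
    for index, segment in enumerate(segments):
        words = segment["text"].lower().split()
        hits = sum(1 for w in words if w.strip(".,;:!?") in keyword_set)
        # Fraction of words in the segment that are keywords.
        keyword_score = hits / len(words) if words else 0.0
        # Assumed combination rule: a simple weighted sum of the two cues.
        score = (keyword_weight * keyword_score
                 + (1 - keyword_weight) * segment["emphasis"])
        scored.append((score, index))

    best = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]
    return sorted(index for _, index in best)


segments = [
    {"text": "Thank you all for coming", "emphasis": 0.1},
    {"text": "Saliency drives the summary", "emphasis": 0.8},
    {"text": "Any questions", "emphasis": 0.2},
    {"text": "The summary uses saliency and keywords", "emphasis": 0.5},
]
print(rank_segments(segments, ["saliency", "summary", "keywords"], top_n=2))
```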
A Motion-Driven Approach for Fine-Grained Temporal Segmentation of User-Generated Videos
This paper presents an algorithm for the temporal segmentation of user-generated videos into visually coherent parts that correspond to individual video capturing activities. The latter include camera pan and tilt, change in focal length and camera displacement. The proposed approach identifies the aforementioned activities by extracting and evaluating the region-level spatio-temporal distribution of the optical flow over sequences of neighbouring video frames. The performance of the algorithm was evaluated with the help of a newly constructed ground-truth dataset, against several state-of-the-art techniques and variations of them. Extensive evaluation indicates the competitiveness of the proposed approach in terms of detection accuracy, and highlights its suitability for analysing large collections of data in a time-efficient manner.
A review of computer vision-based approaches for physical rehabilitation and assessment
The computer vision community has extensively researched the area of human motion analysis, which primarily focuses on pose estimation, activity recognition, pose or gesture recognition and so on. However, for many applications, such as monitoring the functional rehabilitation of patients with musculoskeletal or physical impairments, the requirement is to comparatively evaluate human motion. In this survey, we capture important literature from the past two decades on vision-based monitoring and physical rehabilitation that focuses on comparative evaluation of human motion, and discuss the state of current research in this area. Unlike other reviews in this area, which are written from a clinical perspective, this article presents research from a computer vision application perspective. We propose our own taxonomy of computer vision-based rehabilitation and assessment research, which is further divided into sub-categories to capture the novelties of each piece of research. The review discusses the challenges of this domain, which arise from the wide range of human motion abnormalities and the difficulty of assessing those abnormalities automatically. Finally, suggestions on the future direction of research are offered.
Video Fragmentation and Reverse Search on the Web
This chapter is focused on methods and tools for video fragmentation and reverse search on the web. These technologies can assist journalists when they are dealing with fake news (which nowadays is rapidly spread via social media platforms) that relies on the reuse of a previously posted video from a past event with the intention to mislead viewers about a contemporary event. The fragmentation of a video into visually and temporally coherent parts and the extraction of a representative keyframe for each defined fragment enables the provision of a complete and concise keyframe-based summary of the video. Contrary to straightforward approaches that sample video frames with a constant step, the summary generated through video fragmentation and keyframe extraction is considerably more effective for discovering the video content and performing a fragment-level search for the video on the web. This chapter starts by explaining the nature and characteristics of this type of reuse-based fake news in its introductory part, and continues with an overview of existing approaches for temporal fragmentation of single-shot videos into sub-shots (the most appropriate level of temporal granularity when dealing with user-generated videos) and tools for performing reverse search of a video on the web. Subsequently, it describes two state-of-the-art methods for video sub-shot fragmentation, one relying on the assessment of the visual coherence over sequences of frames and another based on the identification of camera activity during the video recording, and presents the InVID web application that enables the fine-grained (fragment-level) reverse search for near-duplicates of a given video on the web.
In the sequel, the chapter reports the findings of a series of experimental evaluations regarding the efficiency of the above-mentioned technologies, which indicate their competence to generate a concise and complete keyframe-based summary of the video content, and the use of this fragment-level representation for fine-grained reverse video search on the web. Finally, it draws conclusions about the effectiveness of the presented technologies and outlines our future plans for further advancing them
LifeLogging: personal big data
We have recently observed a convergence of technologies to foster the emergence of lifelogging as a mainstream activity. Computer storage has become significantly cheaper, and advancements in sensing technology allow for the efficient sensing of personal activities, locations and the environment. This is best seen in the growing popularity of the quantified self movement, in which life activities are tracked using wearable sensors in the hope of better understanding human performance in a variety of tasks. This review aims to provide a comprehensive summary of lifelogging, covering its research history, current technologies, and applications. Thus far, most lifelogging research has focused predominantly on visual lifelogging in order to capture the details of life activities, hence we maintain this focus in this review. However, we also reflect on the challenges lifelogging poses to an information retrieval scientist. This review is a suitable reference for those seeking an information retrieval scientist's perspective on lifelogging and the quantified self.
Attention Driven Solutions for Robust Digital Watermarking Within Media
As digital technologies have dramatically expanded within the last decade, content recognition now plays a major role within the control of media. Of the systems currently available, digital watermarking provides a robust, maintainable solution to enhance media security. The two main properties of digital watermarking, imperceptibility and robustness, are complementary to each other, but by employing visual attention based mechanisms within the watermarking framework, highly robust watermarking solutions are obtainable while also maintaining high media quality. This thesis firstly provides suitable bottom-up saliency models for raw image and video. The image and video saliency algorithms are estimated directly from within the wavelet domain for enhanced compatibility with the watermarking framework. By combining colour, orientation and intensity contrasts for the image model and globally compensated object motion in the video model, novel wavelet-based visual saliency algorithms are provided. The work extends these saliency models into a unique visual attention-based watermarking scheme by increasing the watermark weighting parameter within visually uninteresting regions. An increase in watermark robustness of up to 40% against various filtering attacks, JPEG2000 and H.264/AVC compression is obtained while maintaining the media quality, as verified by various objective and subjective evaluation tools. As most video sequences are stored in an encoded format, this thesis studies watermarking schemes within the compressed domain. Firstly, the work provides a compressed domain saliency model formulated directly within the HEVC codec, utilizing various coding decisions such as block partition size, residual magnitude, intra frame angular prediction mode and motion vector difference magnitude. Large computational savings, of 50% or greater, are obtained compared with existing methodologies, as the saliency maps are generated from partially decoded bitstreams.
Finally, the saliency maps formulated within the compressed HEVC domain are studied within the watermarking framework. A joint encoder and a frame domain watermarking scheme are both proposed by embedding data into the quantised transform residual data or wavelet coefficients, respectively, which exhibit low visual salience
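The idea of raising the watermark weighting in visually uninteresting regions can be illustrated with a small Python sketch. The additive embedding rule, the binary saliency threshold and every name and default value below are assumptions chosen for illustration; the abstract does not specify the thesis's actual embedding rule or parameter values.

```python
def embed_watermark(coefficients, saliency, watermark,
                    alpha_low=0.5, alpha_high=2.0, threshold=0.5):
    """Additively embed a watermark with saliency-dependent strength.

    coefficients: host coefficients (e.g. wavelet coefficients), as floats.
    saliency: per-coefficient saliency in [0, 1] (hypothetical scale).
    watermark: per-coefficient watermark symbols, e.g. +1 or -1.
    Low-saliency positions get the stronger weight alpha_high, so the more
    robust embedding lands where viewers are less likely to look.
    """
    if not (len(coefficients) == len(saliency) == len(watermark)):
        raise ValueError("inputs must have the same length")

    marked = []
    for value, sal, symbol in zip(coefficients, saliency, watermark):
        # Assumed rule: a hard threshold switches between two strengths.
        alpha = alpha_high if sal < threshold else alpha_low
        marked.append(value + alpha * symbol)
    return marked


print(embed_watermark([10.0, 10.0], [0.9, 0.1], [1, -1]))
```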