4,700 research outputs found
Aesthetics assessment of videos through visual descriptors and automatic polarity annotation
In a world where new technologies are increasingly tied to multimedia information, developing tools that make this kind of data easy to handle has become an essential task, one that has attracted scientific interest in recent years. Among the lines of research that have recently begun to develop, the study of subjective characteristics of audiovisual material from objective data is of particular interest because it can be applied to classification and recommendation systems. This document presents research focused on models that automatically predict the satisfaction or interest that a video, specifically a car advertisement, arouses in the YouTube users who watch it, based on the video's low-level descriptors. A novel aspect of this work is a solution to this type of problem based on a procedure that obtains the video labels automatically through unsupervised learning techniques.
To this end, a set of car advertisements was collected together with the metadata associated with each video, which is provided by users and reflects the satisfaction they perceive when watching the videos on YouTube. This metadata made it possible to design three cluster-analysis strategies for annotating the videos automatically, each using a different set of metadata according to how users provide it. In addition, a set of visual descriptors was extracted from each video using image and video processing techniques and then used to train a machine learning system, which made it possible to study the relevance and usefulness of this descriptor set for predicting the aesthetic value of the videos as perceived by users. Grado en Ingeniería de Sistemas Audiovisuale
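The idea of labelling videos by clustering user metadata can be illustrated with a minimal sketch. It is not the method of the work above: it assumes a single hypothetical metadata value per video (the share of likes among all ratings) and uses a plain two-cluster k-means in one dimension; the function name `two_means_1d` and the sample values are invented for illustration.

```python
def two_means_1d(values, max_iterations=100):
    """Split 1-D values into a low (0) and a high (1) cluster with k-means, k=2."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        raise ValueError("need at least two distinct values to form two clusters")
    centres = [lowest, highest]  # start from the extremes so neither cluster is empty
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest centre (ties go to the low cluster).
        labels = [0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1 for v in values]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for cluster in (0, 1):
            members = [v for v, label in zip(values, labels) if label == cluster]
            new_centres.append(sum(members) / len(members))
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


# Hypothetical like ratios for six videos (assumed data, not from the study).
like_ratios = [0.95, 0.90, 0.92, 0.40, 0.35, 0.50]
polarity, centres = two_means_1d(like_ratios)
print(polarity)
```

Label 1 marks the cluster with the higher like ratio, which could then serve as the "positive" class when training a classifier on visual descriptors.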
Highly efficient low-level feature extraction for video representation and retrieval.
PhD. Witnessing the omnipresence of digital video media, the research community has
raised the question of its meaningful use and management. Stored in immense
multimedia databases, digital videos need to be retrieved and structured in an
intelligent way, relying on the content and the rich semantics involved. Current
Content Based Video Indexing and Retrieval systems face the problem of the semantic
gap between the simplicity of the available visual features and the richness of user
semantics.
This work focuses on the issues of efficiency and scalability in video indexing and
retrieval to facilitate a video representation model capable of semantic annotation. A
highly efficient algorithm for temporal analysis and key-frame extraction is developed.
It is based on the prediction information extracted directly from the compressed domain
features and the robust scalable analysis in the temporal domain. Furthermore,
a hierarchical quantisation of the colour features in the descriptor space is presented.
Derived from the extracted set of low-level features, a video representation model that
enables semantic annotation and contextual genre classification is designed.
Results demonstrate the efficiency and robustness of the temporal analysis algorithm
that runs in real time maintaining the high precision and recall of the detection task.
Adaptive key-frame extraction and summarisation achieve a good overview of the
visual content, while the colour quantisation algorithm efficiently creates a hierarchical
set of descriptors. Finally, the video representation model, supported by the genre
classification algorithm, achieves excellent results in an automatic annotation system by
linking the video clips with a limited lexicon of related keywords.
A comprehensive survey of multi-view video summarization
There has been an exponential growth in the amount of visual data acquired on a daily basis from single or multi-view surveillance camera networks. This massive amount of data requires efficient mechanisms such as video summarization to ensure that only significant data are reported and the redundancy is reduced. Multi-view video summarization (MVS) is a less redundant and more concise way of providing information from the video content of all the cameras in the form of either keyframes or video segments. This paper presents an overview of the existing strategies proposed for MVS, including their advantages and drawbacks. Our survey covers the generic steps in MVS, such as the pre-processing of video data, feature extraction, and post-processing followed by summary generation. We also describe the datasets that are available for the evaluation of MVS. Finally, we examine the major current issues related to MVS and put forward recommendations for future research. © 2020 Elsevier Ltd. All rights reserved.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A2B5B01070067).
Hussain, T.; Muhammad, K.; Ding, W.; Lloret, J.; Baik, SW.; De Albuquerque, VHC. (2021). A comprehensive survey of multi-view video summarization. Pattern Recognition. 109:1-15. https://doi.org/10.1016/j.patcog.2020.107567
Deep Learning for Video Object Segmentation: A Review
As one of the fundamental problems in the field of video understanding, video object segmentation aims at segmenting objects of interest throughout the given video sequence. Recently, with the advancements of deep learning techniques, deep neural networks have shown outstanding performance improvements in many computer vision applications, with video object segmentation being one of the most intensively investigated. In this paper, we present a systematic review of the deep learning-based video segmentation literature, highlighting the pros and cons of each category of approaches. Concretely, we start by introducing the definition, background concepts and basic ideas of algorithms in this field. Subsequently, we summarise the datasets for training and testing a video object segmentation algorithm, as well as common challenges and evaluation metrics. Next, previous works are grouped and reviewed based on how they extract and use spatial and temporal features, and their architectures, contributions and differences are elaborated. Finally, the quantitative and qualitative results of several representative methods on a dataset with many remaining challenges are provided and analysed, followed by further discussion of future research directions. This article is expected to serve as a tutorial and source of reference for learners intending to quickly grasp the current progress in this research area and for practitioners interested in applying video object segmentation methods to their problems. A public website is built to collect and track the related works in this field: https://github.com/gaomingqi/VOS-Review
Classification videos reveal the visual information driving complex real-world speeded decisions
Humans can rapidly discriminate complex scenarios as they unfold in real time, for example during law enforcement or, more prosaically, driving and sport. Such decision-making improves with experience, as new sources of information are exploited. For example, sports experts are able to predict the outcome of their opponent’s next action (e.g. a tennis stroke) based on kinematic cues “read” from preparatory body movements. Here, we explore the use of psychophysical classification-image techniques to reveal how participants interpret complex scenarios. We used sport as a test case, filming tennis players serving and hitting ground strokes, each with two possible directions. These videos were presented to novices and club-level amateurs, running from 0.8 seconds before to 0.2 seconds after racquet-ball contact. During practice, participants anticipated shot direction under a time limit targeting 90% accuracy. Participants then viewed videos through Gaussian windows ("bubbles") placed at random in the temporal, spatial or spatiotemporal domains. Comparing bubbles from correct and incorrect trials revealed how information from different regions contributed toward a correct response. Temporally, only later frames of the videos supported accurate responding (from ~0.05 seconds before ball contact to 0.1+ seconds afterwards). Spatially, information was accrued from the ball’s trajectory and from the opponent’s head. Spatiotemporal bubbles again highlighted ball trajectory information, but seemed susceptible to an attentional cuing artefact, which may caution against their wider use. Overall, bubbles proved effective in revealing regions of information accrual, and could thus be applied to help understand choice behavior in a range of ecologically valid situations
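The comparison of bubbles from correct and incorrect trials can be sketched for the temporal domain as follows. This is a simplified illustration under stated assumptions, not the authors' analysis: each trial is assumed to show the video through one Gaussian window over frame indices, and the result is the mean window of correct trials minus the mean window of incorrect trials. The function names and the toy trial data are hypothetical.

```python
import math


def gaussian_bubble(n_frames, centre, sigma):
    """Visibility of each frame (0..1) under a Gaussian window centred on `centre`."""
    return [math.exp(-0.5 * ((t - centre) / sigma) ** 2) for t in range(n_frames)]


def temporal_classification_profile(masks, correct):
    """Mean mask of correct trials minus mean mask of incorrect trials, per frame."""
    hits = [m for m, ok in zip(masks, correct) if ok]
    misses = [m for m, ok in zip(masks, correct) if not ok]
    if not hits or not misses:
        raise ValueError("need at least one correct and one incorrect trial")
    n_frames = len(masks[0])
    mean_hit = [sum(m[t] for m in hits) / len(hits) for t in range(n_frames)]
    mean_miss = [sum(m[t] for m in misses) / len(misses) for t in range(n_frames)]
    return [h - m for h, m in zip(mean_hit, mean_miss)]


# Toy data (assumed): responses were correct when late frames were visible.
masks = [gaussian_bubble(10, centre, 1.0) for centre in (8, 8, 2, 2)]
correct = [True, True, False, False]
profile = temporal_classification_profile(masks, correct)
most_informative_frame = profile.index(max(profile))
print(most_informative_frame)
```

Frames with large positive values are those whose visibility went along with correct responses; a real analysis would also need a statistical test to decide which values differ reliably from zero.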
Tracking technical refinement in elite performers: The good, the better, and the ugly
This study extends coaching research examining the practical implementation of technical refinement in elite-level golfers. In doing so, we provide an initial check of precepts pertaining to the Five-A Model and examine the dynamics between coaching, psychomotor, biomechanical and psychological inputs to the process. Three case studies of golfers attempting refinements to their already well-established techniques are reported. Kinematic data were supplemented with intra-individual movement variability and self-perceptions of mental effort as measures of tracking behaviour and motor control. Results showed different levels of success in refining technique and in the subsequent ability to return to executing under largely subconscious control. In one case, the technique was refined as intended but without a consistent reduction of conscious attention; in another, both were achieved; in the third, neither was achieved. Implications of these studies are discussed with reference to the interdisciplinary nature of the process and the importance of its initial and final stages.
A high speed Tri-Vision system for automotive applications
Purpose: Cameras are excellent ways of non-invasively monitoring the interior and exterior of vehicles. In particular, high speed stereovision and multivision systems are important for transport applications such as driver eye tracking or collision avoidance. This paper addresses the synchronisation problem which arises when multivision camera systems are used to capture the high speed motion common in such applications.
Methods: An experimental, high-speed tri-vision camera system intended for real-time driver eye-blink and saccade measurement was designed, developed, implemented and tested using prototype, ultra-high dynamic range, automotive-grade image sensors specifically developed by E2V (formerly Atmel) Grenoble SA as part of the European FP6 project – sensation (advanced sensor development for attention stress, vigilance and sleep/wakefulness monitoring).
Results: The developed system can sustain frame rates of 59.8 Hz at the full stereovision resolution of 1280 × 480 but this can reach 750 Hz when a 10 k pixel Region of Interest (ROI) is used, with a maximum global shutter speed of 1/48000 s and a shutter efficiency of 99.7%. The data can be reliably transmitted uncompressed over standard copper Camera-Link® cables over 5 metres. The synchronisation error between the left and right stereo images is less than 100 ps and this has been verified both electrically and optically. Synchronisation is automatically established at boot-up and maintained during resolution changes. A third camera in the set can be configured independently. The dynamic range of the 10-bit sensors exceeds 123 dB with a spectral sensitivity extending well into the infra-red range.
Conclusion: The system was subjected to a comprehensive testing protocol, which confirms that the salient requirements for the driver monitoring application are adequately met and, in some respects, exceeded. The synchronisation technique presented may also benefit several other automotive stereovision applications including near and far-field obstacle detection and collision avoidance, road condition monitoring and others. Partially funded by the EU FP6 through the IST-507231 SENSATION project. peer-reviewe
User-centred video abstraction
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. The rapid growth of digital video content in recent years has created a need for technologies that can effectively produce condensed but semantically rich versions of an input video stream. Consequently, Video Summarisation is becoming increasingly popular in the multimedia community, and numerous video abstraction approaches have been proposed. These techniques can be divided into two major categories, automatic and semi-automatic, according to the level of human intervention required in the summarisation process. The fully automated methods mainly apply mathematical and statistical algorithms to low-level visual, aural and textual features in order to extract the most significant segments of the original video. However, the effectiveness of this type of technique is restricted by a number of factors such as domain dependency, computational expense and the inability to understand the semantics of videos from low-level features. The second category of techniques attempts to improve the quality of summaries by involving humans in the abstraction process to bridge the semantic gap. Nonetheless, a single user's subjectivity and other external factors such as distraction can degrade the performance of this group of approaches. Accordingly, in this thesis we have focused on the development of three effective user-centred video summarisation techniques that can be applied to different video categories and generate satisfactory results. Our first approach is a novel mechanism for user-centred video summarisation in which multiple actors take part in the summarisation process in order to minimise the negative effects of relying on a single user.
In this algorithm, the video frames were first scored by a group of video annotators 'on the fly'. The assigned scores were then averaged to generate a single saliency score for each video frame and, finally, the highest-scoring video frames, together with the corresponding audio and textual content, were extracted for inclusion in the final summary. The effectiveness of our approach has been assessed by comparing the video summaries it generates against the results obtained from three existing automatic summarisation tools that adopt different modalities for abstraction purposes. The experimental results indicated that our proposed method delivers remarkable outcomes in terms of Overall Satisfaction and Precision with an acceptable Recall rate, indicating the usefulness of involving user input in the video summarisation process. To provide a better user experience, we have proposed a personalised video summarisation method that customises the generated summaries in accordance with the viewers' preferences. The end-user's priority levels towards different video scenes were captured and used to update the average scores previously assigned by the video annotators. Our earlier summarisation method was then adopted to extract the most significant audio-visual content of the video. Experimental results indicated that this approach delivers superior outcomes compared with our previously proposed method and the three other automatic summarisation tools. Finally, we have attempted to reduce the level of audience involvement required for personalisation by proposing a new method for producing personalised video summaries, in which SIFT visual features were adopted to identify the semantic categories of the video scenes.
By fusing this retrieved data with pre-built user profiles, personalised video abstracts can be created. Experimental results showed the effectiveness of this method in delivering superior outcomes compared to our previously proposed algorithm and the three other automatic summarisation techniques.
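The first approach described in this abstract (several annotators score each frame, the scores are averaged into one saliency value, and the highest-scoring frames form the summary) can be sketched as below. This is a minimal illustration under assumptions: the score scale, the fixed summary length `k`, and the function name `summarise_by_mean_score` are hypothetical, and audio and text extraction are omitted.

```python
from statistics import mean


def summarise_by_mean_score(scores_by_annotator, k):
    """Return (indices of the k most salient frames in temporal order, saliency per frame)."""
    n_frames = len(scores_by_annotator[0])
    if any(len(scores) != n_frames for scores in scores_by_annotator):
        raise ValueError("every annotator must score every frame")
    # One saliency value per frame: the mean of all annotators' scores for it.
    saliency = [mean(frame_scores) for frame_scores in zip(*scores_by_annotator)]
    # Rank frames by saliency, highest first; ties keep their temporal order.
    ranked = sorted(range(n_frames), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k]), saliency


# Three hypothetical annotators scoring four frames on an assumed 1-5 scale.
scores = [
    [1, 5, 3, 4],
    [2, 4, 3, 5],
    [1, 5, 2, 5],
]
selected, saliency = summarise_by_mean_score(scores, k=2)
print(selected)
```

Averaging over several annotators is what dampens the subjectivity of any single user; the personalised variant described above could be approximated by adjusting `saliency` with per-viewer weights before ranking.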
An investigation into feature effectiveness for multimedia hyperlinking
The growing amount of archival multimedia content available online is creating more opportunities for users who are interested in exploratory search behaviour such as browsing. The user experience with online collections could therefore be improved by enabling navigation and recommendation within multimedia archives, which can be supported by allowing a user to follow a set of hyperlinks created within or across documents. The main goal of this study is to compare the performance of different multimedia features for automatic hyperlink generation. In our work we construct multimedia hyperlinks by indexing and searching textual and visual features extracted from the blip.tv dataset. A user-driven evaluation strategy is then proposed using the Amazon Mechanical Turk (AMT) crowdsourcing platform, since we believe that AMT workers represent a good example of "real world" users. We conclude that textual features exhibit better performance than visual features for multimedia hyperlink construction. In general, a combination of ASR transcripts and metadata provides the best results.