9 research outputs found

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.

    Semantic Annotation for Retrieval of Visual Resources

    Visual material plays an ever greater role in our culture, and also in science and education. Searching large collections of visual material nevertheless remains a laborious process. It costs an end user a great deal of time and effort to find exactly the one image needed. Efficient search methods are therefore required to make the growing collections searchable and to keep them that way. Laura Hollink investigates the problems of searching for visual material, and possible solutions to them, in three diverse collections: paintings, photographs of organic cells, and news broadcasts. Schreiber, A.T. [Promotor]; Wielinga, B.J. [Promotor]; Worring, M. [Copromotor]

    Semantic multimedia modelling & interpretation for annotation

    The emergence of multimedia-enabled devices, particularly the incorporation of cameras in mobile phones, together with rapid advances in low-cost storage, has drastically increased the rate of multimedia data production. Faced with this ubiquity of digital images and videos, the research community has raised the question of how to use and manage them effectively. Stored in vast multimedia corpora, digital data need to be retrieved and organized intelligently, drawing on the rich semantics involved. Making use of these image and video collections demands proficient annotation and retrieval techniques. Recently, the multimedia research community has been progressively shifting its emphasis to the personalization of these media. The main impediment in image and video analysis is the semantic gap: the discrepancy between a user’s high-level interpretation of an image or video and its low-level computational interpretation. Content-based image and video annotation systems are particularly susceptible to the semantic gap because they rely on low-level visual features to describe semantically rich content. Visual similarity, however, is not semantic similarity, so an alternative way out of this dilemma is needed. The semantic gap can be narrowed by including high-level and user-generated information in the annotation. High-level descriptions of images and videos are better able to capture the semantic meaning of multimedia content, but it is not always feasible to collect this information. It is commonly agreed that the problem of high-level semantic annotation of multimedia is still far from solved. This dissertation puts forward approaches to intelligent multimedia semantic extraction for high-level annotation, with the aim of bridging the gap between visual features and semantics. 
It proposes a framework for enhancing and refining the annotations of object/concept-annotated image and video datasets. The overall idea is first to purge noisy keywords from the datasets and then to expand the concepts using lexical and commonsense knowledge, filling the vocabulary and lexical gaps so as to achieve high-level semantics for the corpus. The dissertation also explores a novel approach to high-level semantic (HLS) propagation through image corpora. HLS propagation takes advantage of semantic intensity (SI), the dominance factor of a concept in an image, and of annotation-based semantic similarity between images. An image is a combination of various concepts, some of which are more dominant than others, and the semantic similarity of a pair of images is based on the SI and on the semantic similarity of their concepts. Moreover, HLS propagation exploits clustering techniques to group similar images: a human expert assigns high-level semantics to a single randomly selected image, and these are propagated to the other images in the cluster. The approaches were evaluated on the LabelMe image and LabelMe video datasets. Experiments show that the proposed approaches make a noticeable improvement towards bridging the semantic gap and that the proposed system outperforms traditional systems.

    Multimedia Retrieval

    Get PDF


    Get PDF
    The now ubiquitous and effortless digital data capture and processing capabilities offered by the majority of devices have led to an unprecedented penetration of multimedia content into our everyday life. To make the most of this phenomenon, the rapidly increasing volume and usage of digitised content requires constant re-evaluation and adaptation of multimedia methodologies, in order to meet the relentless change of requirements from both the user and system perspectives. Advances in Multimedia provides readers with an overview of the ever-growing field of multimedia by bringing together research studies and surveys from different subfields that highlight these aspects. The main topics of the book include multimedia management in peer-to-peer structures and wireless networks, security characteristics in multimedia, semantic gap bridging for multimedia content, and novel multimedia applications.

    Audio-visual football video analysis, from structure detection to attention analysis

    Sports video is an important video genre, and content-based sports video analysis attracts great interest from both industry and academia. A sports video is characterised by repetitive temporal structures, relatively plain contents, and strong spatio-temporal variations, such as quick camera switches and swift local motions. Specific techniques are needed for content-based sports video analysis to exploit these characteristics. For an efficient and effective sports video analysis system, there are three fundamental questions: (1) what are the key stories in sports videos; (2) what arouses viewers’ interest; and (3) how to identify game highlights. This thesis is developed around these questions. We approach them from two different perspectives and present three research contributions: replay detection, attack temporal structure decomposition, and attention-based highlight identification. Replay segments convey the most important contents in sports videos, so detecting replay segments is an efficient way to collect game highlights. However, replay is an artefact of editing, which improves with advances in video editing tools. The composition of a replay is complex: it includes logo transitions, slow motion, viewpoint switches and normal-speed video clips. Since logo transition clips are pervasive in the game collections of FIFA World Cup 2002, FIFA World Cup 2006 and UEFA Championship 2006, we take logo transition detection as an effective replacement for replay detection. A two-pass system was developed, comprising a five-layer AdaBoost classifier and logo template matching throughout an entire video. The five-layer AdaBoost classifier uses shot duration, average game pitch ratio, average motion, sequential colour histogram and shot frequency between two neighbouring logo transitions to filter out logo transition candidates. 
Subsequently, a logo template is constructed and employed to find all logo transition sequences. The precision and recall of this system in replay detection are both 100% on a five-game evaluation collection. An attack structure is a team competition for a score. Hence, this structure is a conceptually fundamental unit of a football video as well as of other sports videos. We review the literature on content-based temporal structures, such as the play-break structure, and develop a three-step system for automatic attack structure decomposition. Four content-based shot classes, namely play, focus, replay and break, were identified from low-level visual features. A four-state hidden Markov model was trained to simulate the transition processes among these shot classes. Since attack structures are the longest repetitive temporal unit in a sports video, a suffix tree is proposed to find the longest repetitive substring in the label sequence of shot class transitions. The occurrences of this substring are regarded as the kernel of an attack hidden Markov process. The decomposition of attack structures therefore becomes a boundary likelihood comparison between two Markov chains. Highlights are what attract notice, and attention is a psychological measurement of “notice”. A brief survey is presented of the psychological background of attention, attention estimation from visual and auditory signals, and multi-modality attention fusion. We propose two attention models for sports video analysis: the role-based attention model and the multiresolution autoregressive framework. The role-based attention model is based on the structure of perception while watching video. This model removes reflection bias among modality salient signals and combines these signals by reflectors. The multiresolution autoregressive framework (MAR) treats salient signals as a group of smooth random processes, which follow a similar trend but are filled with noise. 
This framework tries to estimate a noise-free signal from these coarse noisy observations by multiresolution analysis. Related algorithms are developed, such as event segmentation on a MAR tree and real-time event detection. Experiments show that these attention-based approaches can find goal events with high precision. Moreover, the results of MAR-based highlight detection on the final games of FIFA 2002 and 2006 are highly similar to highlights labelled professionally by the BBC and FIFA.
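
The step in the abstract above that finds the longest repetitive substring in the shot-class label sequence can be illustrated with a minimal sketch. It is not the thesis implementation: a simple sorted-suffix comparison stands in for the suffix tree, and the single-letter label encoding (P, F, R, B for play, focus, replay, break) and the function name are assumptions made for illustration only.

```python
def longest_repeated_run(labels):
    """Return (run, starts): the longest run of labels that occurs at
    least twice, and every position where it starts.

    Sorted suffixes are compared pairwise instead of building a suffix
    tree; this is slower but short and easy to check. Occurrences are
    allowed to overlap.
    """
    n = len(labels)
    # Sort suffix start positions by the suffix they denote.
    suffixes = sorted(range(n), key=lambda i: labels[i:])
    best_len, best_start = 0, 0
    # The longest repeat is a common prefix of two adjacent sorted suffixes.
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while a + k < n and b + k < n and labels[a + k] == labels[b + k]:
            k += 1
        if k > best_len:
            best_len, best_start = k, a
    run = labels[best_start:best_start + best_len]
    if best_len == 0:
        return run, []
    starts = [i for i in range(n - best_len + 1)
              if labels[i:i + best_len] == run]
    return run, starts


# Hypothetical shot-class sequence: play, focus, replay, break, ...
run, starts = longest_repeated_run("PFRBPFRBB")
print(run, starts)
```

In this toy sequence the repeated run is `PFRB`, starting at positions 0 and 4; in the thesis, the occurrences of such a substring serve as the kernel of the attack hidden Markov process.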

    Abstract The MediaMill TRECVID 2005 Semantic Video Search Engine Draft Version

    The UvA-MediaMill team participated in four tasks. For the detection of camera work (runid: A CAM) we investigate the benefit of using a tessellation of detectors in combination with supervised learning over a standard approach using global image information. Experiments indicate that average precision results increase drastically, especially for pan (+51%) and tilt (+28%). For concept detection we propose a generic approach using our semantic pathfinder. The most important novelty compared to last year’s system is the improved visual analysis using proto-concepts based on Wiccest features. In addition, the path selection mechanism was extended. Based on the semantic pathfinder architecture we are currently able to detect an unprecedented lexicon of 101 semantic concepts in a generic fashion. We performed a large set of experiments (runid: B vA). The results show that an optimal strategy for generic multimedia analysis is one that learns from the training set, on a per-concept basis, which tactic to follow. Experiments also indicate that our visual analysis approach is highly promising. The lexicon of 101 semantic concepts forms the basis for our search experiments (runid: B 2 A-MM). We participated in automatic, manual (using only visual information), and interactive search. The lexicon-driven retrieval paradigm aids substantially in all search tasks. When coupled with interaction, exploiting several novel browsing schemes of our semantic video search engine, results are excellent. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants. We exploited the technology developed for the above tasks to explore the BBC rushes. The most intriguing result is that, of the lexicon of 101 visual-only models trained on news data, 25 concepts also perform reasonably well on BBC data.
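
Because the abstract above reports its results as average precision and mean average precision, a minimal sketch of the usual non-interpolated definitions may help. Whether this matches the exact evaluation protocol used in the paper (for example, the depth of the ranked list) is an assumption, and the function names are illustrative.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Non-interpolated average precision of one ranked result list.

    ranked_relevance: booleans in rank order, True where the item at
    that rank is relevant.
    total_relevant: number of relevant items in the collection; if
    omitted, the number of relevant items in the list is used.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0


def mean_average_precision(runs):
    """Mean of the per-topic average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(r) for r in runs) / len(runs)


# Two hypothetical search topics with made-up relevance judgements.
topics = [[True, False, True], [False, True]]
print(mean_average_precision(topics))
```

For the first hypothetical topic, the relevant items sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.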

    Quality-aware Content Adaptation in Digital Video Streaming

    User-generated video has attracted a lot of attention due to the success of video sharing sites such as YouTube and of online social networks. Recently, a shift towards live consumption of these videos has become observable. The content is captured and instantly shared over the Internet using mobile devices such as smartphones. Large-scale platforms such as YouTube.Live, YouNow or Facebook.Live have arisen that enable users’ smartphones to livestream to the public. These platforms distribute tens of thousands of low-resolution videos to remote viewers in parallel. Nonetheless, the providers are unable to guarantee efficient collection and distribution of high-quality video streams. As a result, the user experience is often degraded, and the infrastructure installations needed are huge. Efficient methods are required to cope with the increasing demand for these video streams, and an understanding is needed of how to capture, process and distribute the videos so as to guarantee a high-quality experience for viewers. This thesis addresses the quality awareness of user-generated videos by leveraging the concept of content adaptation. Two types of content adaptation, adaptive video streaming and video composition, are discussed. Then, a novel approach is presented for the scenario of a live upload from mobile devices, the processing of the video streams and their distribution. This thesis demonstrates that content adaptation applied to each step of this scenario, from upload to consumption, can significantly improve the quality for the viewer. At the same time, if content adaptation is planned wisely, the data traffic can be reduced while keeping the quality for viewers high. The first contribution of this thesis is a better understanding of the perceived quality of user-generated video and its influencing factors. 
Subjective studies are performed to understand what affects human perception, leading to first-of-their-kind quality models. The developed quality models are used for the second contribution of this work: novel quality assessment algorithms. A unique attribute of these algorithms is the use of multiple features from different sensors. Whereas classical video quality assessment algorithms focus on visual information, the proposed algorithms reduce the runtime by an order of magnitude by using data from other sensors in video capturing devices. Still, the scalability of quality assessment is limited when the algorithms are executed on a single server. This is solved with the proposed placement and selection component. It allows quality assessment tasks to be distributed to mobile devices and thus increases the scalability of existing approaches by up to 33.71% when using the resources of only 15 mobile devices. These three contributions are required to provide a real-time understanding of the perceived quality of the video streams produced on mobile devices. The upload of video streams is the fourth contribution of this work. It relies on content and mechanism adaptation. The thesis introduces the first prototypically evaluated adaptive video upload protocol (LiViU), which transcodes multiple video representations in real time and copes with changing network conditions. In addition, a mechanism adaptation is integrated into LiViU to react to changing application scenarios, such as streaming high-quality videos to remote viewers or distributing video with minimal delay to close-by recipients. A second type of content adaptation is discussed in the fifth contribution of this work. An automatic video composition application is presented which enables live composition from multiple user-generated video streams. 
The proposed application is the first of its kind, allowing the in-time composition of high-quality video streams by inspecting the quality of individual video streams, recording locations and cinematographic rules. As a last contribution, the content-aware adaptive distribution of video streams to mobile devices is introduced with the Video Adaptation Service (VAS). The VAS analyzes the streamed video content to understand which adaptations are most beneficial for a viewer. It maximizes the perceived quality for each video stream individually and at the same time tries to produce as little data traffic as possible, achieving a data traffic reduction of more than 80%.
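
The adaptive video streaming mentioned in the abstract above rests on choosing, among several encoded representations of a stream, the one that fits current network conditions. The sketch below shows that general idea only. It is not LiViU or the VAS, and the representation ladder, the safety factor and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    name: str          # hypothetical label, e.g. "480p"
    bitrate_kbps: int  # encoded bitrate of this representation


def pick_representation(ladder, throughput_kbps, safety=0.8):
    """Pick the highest-bitrate representation that fits the budget.

    The budget is the measured throughput scaled by a safety factor
    (an assumed heuristic). If nothing fits, the lowest-bitrate
    representation is returned so that playback can continue.
    """
    if not ladder:
        raise ValueError("ladder must contain at least one representation")
    budget = throughput_kbps * safety
    ordered = sorted(ladder, key=lambda r: r.bitrate_kbps)
    chosen = ordered[0]
    for rep in ordered:
        if rep.bitrate_kbps <= budget:
            chosen = rep
    return chosen


# Hypothetical bitrate ladder; the values are made up.
ladder = [
    Representation("240p", 400),
    Representation("480p", 1000),
    Representation("720p", 2500),
]
print(pick_representation(ladder, throughput_kbps=1500).name)
```

With a measured throughput of 1500 kbps and the assumed safety factor of 0.8, the budget is 1200 kbps, so the 480p representation is selected.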