VideoAnalysis4ALL: An On-line Tool for the Automatic Fragmentation and Concept-based Annotation, and the Interactive Exploration of Videos.
This paper presents the VideoAnalysis4ALL tool that supports the automatic fragmentation and concept-based annotation of videos, and the exploration of the annotated video fragments through an interactive user interface. The developed web application decomposes the video into two different granularities, namely shots and scenes, and annotates each fragment by evaluating the existence of a number (several hundreds) of high-level visual concepts in the keyframes extracted from these fragments. Through this analysis the tool enables the identification and labeling of semantically coherent video fragments, while its user interfaces allow the discovery of these fragments with the help of human-interpretable concepts. The integrated state-of-the-art video analysis technologies perform very well and, by exploiting the processing capabilities of multi-thread / multi-core architectures, reduce the time required for analysis to approximately one third of the video's duration, thus making the analysis three times faster than real-time processing.
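The shot-level fragmentation described above is commonly driven by detecting abrupt changes between consecutive frames. As a minimal sketch (not the tool's actual algorithm), frames can be compared via colour-histogram differences, with a cut declared when the difference exceeds a threshold; the histograms and threshold below are illustrative assumptions:

```python
# Toy sketch of shot-boundary detection via colour-histogram differences.
# The frame histograms and threshold are invented for illustration.

def hist_diff(h1, h2):
    """Sum of absolute bin-wise differences between two normalised histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_shot_boundaries(histograms, threshold=0.5):
    """Return frame indices where consecutive histograms differ sharply."""
    return [i for i in range(1, len(histograms))
            if hist_diff(histograms[i - 1], histograms[i]) > threshold]

# Two frames of one shot, then an abrupt cut to different visual content.
frames = [
    [0.8, 0.1, 0.1],
    [0.75, 0.15, 0.1],
    [0.1, 0.1, 0.8],  # content changes sharply here
]
print(detect_shot_boundaries(frames))  # [2]
```

Scene-level segmentation would then group consecutive shots by higher-level similarity, but that step is beyond this sketch.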
Improving Small Object Proposals for Company Logo Detection
Many modern approaches for object detection are two-staged pipelines. The
first stage identifies regions of interest which are then classified in the
second stage. Faster R-CNN is such an approach for object detection which
combines both stages into a single pipeline. In this paper we apply Faster
R-CNN to the task of company logo detection. Motivated by its weak performance
on small object instances, we examine in detail both the proposal and the
classification stage with respect to a wide range of object sizes. We
investigate the influence of feature map resolution on the performance of those
stages.
Based on theoretical considerations, we introduce an improved scheme for
generating anchor proposals and propose a modification to Faster R-CNN which
leverages higher-resolution feature maps for small objects. We evaluate our
approach on the FlickrLogos dataset improving the RPN performance from 0.52 to
(MABO) and the detection performance from 0.52 to 0.67 (mAP). Comment: 8 pages, ICMR 201
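The anchor-generation idea underlying region proposal networks can be sketched as follows: one anchor box per feature-map cell, scale, and aspect ratio. This is a generic illustration, not the paper's improved scheme; strides, scales, and ratios are invented. It shows why a higher-resolution feature map (smaller stride) yields a denser set of small anchors, which is what helps cover small logo instances:

```python
# Generic sketch of dense anchor generation over a feature map.
# All numeric parameters are illustrative assumptions.
import math

def generate_anchors(fmap_h, fmap_w, stride, scales, ratios):
    """One anchor box (cx, cy, w, h) per cell x scale x aspect ratio."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * math.sqrt(r)   # wider for r > 1
                    h = s / math.sqrt(r)   # taller for r < 1
                    anchors.append((cx, cy, w, h))
    return anchors

# Halving the stride doubles the sampling density in each dimension,
# so small scales can be placed much more densely.
coarse = generate_anchors(4, 4, stride=16, scales=[128, 256], ratios=[1.0])
fine   = generate_anchors(8, 8, stride=8,  scales=[32, 64],  ratios=[1.0])
print(len(coarse), len(fine))  # 32 128
```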
3D facial video retrieval and management for decision support in speech and language therapy
3D video is introducing great changes in many health-related areas. The realism of such information provides health professionals with strong evidence-analysis tools to facilitate clinical decision processes. Speech and language therapy aims to help subjects correct several disorders. The assessment of the patient by the speech and language therapist (SLT) requires several visual and audio analysis procedures that can interfere with the patient's production of speech. In this context, the main contribution of this paper is a 3D video system to improve health information management processes in speech and language therapy. The 3D video retrieval and management system supports multimodal health records and provides SLTs with tools to support their work in many ways: (i) it allows SLTs to easily maintain a database of patients' orofacial and speech exercises; (ii) it supports three-dimensional orofacial measurement and analysis in a non-intrusive way; and (iii) it enables searching for patient speech exercises by similar facial characteristics, using facial image analysis techniques. The second contribution is a dataset with 3D videos of patients performing orofacial speech exercises. The whole system was evaluated successfully in a user study involving 22 SLTs. The user study illustrated the importance of retrieval by similar orofacial speech exercise.
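Retrieval by similar facial characteristics, as in point (iii), can be pictured as nearest-neighbour search over facial measurement vectors. The sketch below is hypothetical: the feature names, vectors, and plain Euclidean ranking are invented for illustration and do not reproduce the paper's facial image analysis techniques:

```python
# Hypothetical sketch: rank exercise videos by distance between facial
# measurement vectors. All identifiers and values are invented.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, records):
    """records: list of (video_id, feature_vector); nearest first."""
    return sorted(records, key=lambda r: euclidean(query, r[1]))

exercises = [
    ("patient_a_ex1", [0.42, 0.31, 0.77]),  # e.g. mouth opening, lip spread, ...
    ("patient_b_ex1", [0.10, 0.90, 0.20]),
    ("patient_c_ex2", [0.40, 0.30, 0.80]),
]
query = [0.42, 0.31, 0.78]
best = rank_by_similarity(query, exercises)[0][0]
print(best)  # patient_a_ex1
```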
Estimating the information gap between textual and visual representations
Photos, drawings, figures, etc. supplement textual information in
various kinds of media, for example, in web news or scientific publications. In this respect, the
intended effect of an image can be quite different, e.g., providing additional information,
focusing on certain details of surrounding text, or simply being a general illustration of a
topic. As a consequence, the semantic correlation between information of different modalities can
vary noticeably, too. Moreover, cross-modal interrelations are often hard to describe in a precise
way. The variety of possible interrelations of textual and graphical information, and the question of how
they can be described and automatically estimated, have not yet been addressed by previous
work. In this paper, we present several contributions to close this gap. First, we introduce two
measures to describe cross-modal interrelations: cross-modal mutual information (CMI) and semantic
correlation (SC). Second, a novel approach relying on deep learning is suggested to estimate CMI
and SC of textual and visual information. Third, three diverse datasets are leveraged to learn an
appropriate deep neural network model for the demanding task. The system has been evaluated on a
challenging test set, and the experimental results demonstrate the feasibility of the approach.
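The core mechanism behind estimating such cross-modal scores is to embed both modalities into a shared vector space and compare the embeddings. As a minimal stand-in for the deep network described above, the sketch below scores two hand-written toy vectors with cosine similarity; the vectors are assumptions, not learned embeddings:

```python
# Minimal sketch: score cross-modal similarity of a text embedding and an
# image embedding with cosine similarity. Toy vectors stand in for the
# outputs of a learned deep model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

text_vec  = [0.9, 0.1, 0.3]   # stand-in for a learned text embedding
image_vec = [0.8, 0.2, 0.4]   # stand-in for a learned image embedding
score = cosine(text_vec, image_vec)
print(round(score, 3))
```

A high score suggests the two modalities describe related content; a score near zero suggests little shared information.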
Light curve analysis from Kepler spacecraft collected data
The authors would like to thank CNPq-Brazil and the University of St Andrews for their kind support. Although scarce, previous work on the application of machine learning and data mining techniques to large corpora of astronomical data has produced promising results. For example, on the task of detecting so-called Kepler objects of interest (KOIs), a range of different 'off-the-shelf' classifiers has demonstrated outstanding performance. These rather preliminary research efforts motivate further exploration of this data domain. In the present work we focus on the analysis of threshold crossing events (TCEs) extracted from photometric data acquired by the Kepler spacecraft. We show that the task of classifying TCEs as being caused by actual planetary transits, as opposed to confounding astrophysical phenomena, is significantly more challenging than that of KOI detection, with different classifiers exhibiting vastly different performance. Nevertheless, the best-performing classifier type, the random forest, achieved excellent accuracy, predicting correctly in approximately 96% of cases. Our results and analysis should illuminate further efforts into the development of more sophisticated automatic techniques, and encourage additional work in the area.
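A random forest classifies by majority vote over an ensemble of decision trees. The stdlib sketch below illustrates only that voting mechanism with hand-fixed decision stumps on invented TCE-like features (transit depth, duration); a real experiment would train something like scikit-learn's `RandomForestClassifier` on actual photometric features:

```python
# Illustrative stand-in for a random forest: an ensemble of fixed decision
# stumps that majority-votes. Thresholds and features are invented.

def stump(feature_idx, threshold):
    """A one-split classifier: 1 (planet transit) if feature > threshold."""
    return lambda x: 1 if x[feature_idx] > threshold else 0

# In a real forest these splits are learned on bootstrapped training data.
ensemble = [stump(0, 0.5), stump(1, 0.3), stump(0, 0.7)]

def predict(x):
    votes = sum(clf(x) for clf in ensemble)
    return 1 if votes > len(ensemble) / 2 else 0

candidate = [0.8, 0.6]     # deep, long event: every stump votes "transit"
print(predict(candidate))  # 1
```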
Characterization and classification of semantic image-text relations
The beneficial, complementary nature of visual and textual information to convey information is widely known, for example, in entertainment, news, advertisements, science, or education. While the complex interplay of image and text to form semantic meaning has been thoroughly studied in linguistics and communication sciences for several decades, computer vision and multimedia research has more or less remained on the surface of the problem. An exception is previous work that introduced the two metrics Cross-Modal Mutual Information and Semantic Correlation in order to model complex image-text relations. In this paper, we motivate the necessity of an additional metric called Status in order to cover complex image-text relations more completely. This set of metrics enables us to derive a novel categorization of eight semantic image-text classes based on three dimensions. In addition, we demonstrate how to automatically gather and augment a dataset for these classes from the Web. Further, we present a deep learning system to automatically predict each of the three metrics, as well as a system to directly predict the eight image-text classes. Experimental results show the feasibility of the approach, whereby the predict-all approach outperforms the cascaded approach of the metric classifiers.
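As a back-of-the-envelope illustration of how three dimensions can span eight classes: if each of the three metrics is treated as binary, the combinations number 2³ = 8. This is only a counting sketch; the paper's actual class definitions are richer than simple binarization:

```python
# Counting sketch: three binarised metrics yield 2**3 = 8 combinations.
# The paper's real taxonomy is not a plain binarization of the metrics.
from itertools import product

metrics = ("CMI", "SC", "Status")
classes = list(product([0, 1], repeat=len(metrics)))
print(len(classes))  # 8
```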
MediaEval 2018: Predicting Media Memorability Task
In this paper, we present the Predicting Media Memorability task, which is
proposed as part of the MediaEval 2018 Benchmarking Initiative for Multimedia
Evaluation. Participants are expected to design systems that automatically
predict memorability scores for videos, which reflect the probability of a
video being remembered. In contrast to previous work in image memorability
prediction, where memorability was measured a few minutes after memorization,
the proposed dataset comes with short-term and long-term memorability
annotations. All task characteristics are described, namely: the task's
challenges and breakthrough, the released data set and ground truth, the
required participant runs, and the evaluation metrics.
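Memorability prediction systems of this kind are typically evaluated by how well the predicted ranking of videos matches the ground-truth ranking, commonly via Spearman's rank correlation. The stdlib sketch below computes it for tie-free score lists; the score values are invented:

```python
# Spearman's rank correlation for tie-free lists (no tie handling).
# Score values below are invented for illustration.

def rank(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def spearman(pred, truth):
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(pred), rank(truth)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

pred  = [0.9, 0.4, 0.7, 0.2]   # predicted memorability scores
truth = [0.8, 0.3, 0.6, 0.1]   # same ordering -> perfect rank correlation
print(spearman(pred, truth))   # 1.0
```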
Multimodal news analytics using measures of cross-modal entity and context consistency
The World Wide Web has become a popular source to gather information and news. Multimodal information, e.g., text supplemented with photographs, is typically used to convey the news more effectively or to attract attention. The photographs can be decorative or depict additional details, but might also contain misleading information. The quantification of the cross-modal consistency of entity representations can assist human assessors' evaluation of the overall multimodal message. In some cases such measures might give hints to detect fake news, which is an increasingly important topic in today's society. In this paper, we present a multimodal approach to quantify the entity coherence between image and text in real-world news. Named entity linking is applied to extract persons, locations, and events from news texts. Several measures are suggested to calculate the cross-modal similarity of the entities in text and photograph by exploiting state-of-the-art computer vision approaches. In contrast to previous work, our system automatically acquires example data from the Web and is applicable to real-world news. Moreover, an approach that quantifies contextual image-text relations is introduced. The feasibility is demonstrated on two datasets that cover different languages, topics, and domains.
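At its simplest, entity consistency can be pictured as the overlap between the entities linked in the text and those recognised in the photograph. The sketch below uses plain Jaccard overlap with invented entity lists; the paper itself relies on learned visual verification rather than set intersection:

```python
# Simplified sketch of cross-modal entity consistency as set overlap.
# Entity lists are invented; the actual system uses learned visual
# verification of persons, locations, and events.

def entity_consistency(text_entities, image_entities):
    """Jaccard overlap between entity sets as a rough consistency score."""
    text_set, image_set = set(text_entities), set(image_entities)
    if not text_set and not image_set:
        return 1.0
    return len(text_set & image_set) / len(text_set | image_set)

text_entities  = ["Angela Merkel", "Berlin", "G20 summit"]  # linked in text
image_entities = ["Angela Merkel", "Berlin"]                # verified in photo
print(round(entity_consistency(text_entities, image_entities), 2))  # 0.67
```

A low score would flag a potential mismatch between what the article says and what the photograph shows.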