
    Temporal multimodal video and lifelog retrieval

    The past decades have seen exponential growth in both the consumption and production of data, with multimedia such as images and videos contributing significantly to that growth. The widespread proliferation of smartphones has given everyday users the ability to consume and produce such content easily. As the complexity and diversity of multimedia data have grown, so has the need for more complex retrieval models that address the information needs of users. Finding relevant multimedia content is central in many scenarios, from internet search engines and medical retrieval to querying one's personal multimedia archive, also called a lifelog. Traditional retrieval models have often focused on queries targeting small units of retrieval, yet users usually remember temporal context and expect results to reflect it. However, there is little research into supporting these information needs in interactive multimedia retrieval. In this thesis, we aim to close this research gap by making several contributions to multimedia retrieval, focusing on two scenarios: video and lifelog retrieval. We provide a retrieval model for complex information needs with temporal components, including a data model for multimedia retrieval, a query model for complex information needs, and a modular and adaptable query execution model that includes novel algorithms for result fusion. The concepts and models are implemented in vitrivr, an open-source multimodal multimedia retrieval system, which covers all aspects from extraction to query formulation and browsing. vitrivr has proven its usefulness in evaluation campaigns and is now used in two large-scale interdisciplinary research projects. We show the feasibility and effectiveness of our contributions in two ways: first, through results from user-centric evaluations that pit different user-system combinations against one another; second, through a system-centric evaluation in which we create a new dataset for temporal information needs in video and lifelog retrieval and use it to quantitatively evaluate our models. The results show significant benefits for systems that enable users to specify more complex information needs with temporal components. Participation in interactive retrieval evaluation campaigns over multiple years provides insight into possible future developments and challenges of such campaigns.
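    As a concrete illustration of temporal result fusion for an ordered two-part query, the following Python sketch scores pairs of segments from the same video in which the second segment starts after the first one ends, within a time window. It is a minimal, assumed formulation for illustration only; the data structure, the max_gap parameter, and the averaging scheme are not taken from vitrivr's actual fusion algorithms.

        from dataclasses import dataclass

        @dataclass
        class ScoredSegment:
            video_id: str
            start: float   # seconds
            end: float
            score: float   # relevance for one sub-query, in [0, 1]

        def temporal_fuse(first, second, max_gap=30.0):
            """Fuse result lists for an ordered two-part temporal query.

            A candidate is a pair of segments from the same video where the
            second segment starts after the first one ends, within max_gap
            seconds. The fused score is the mean of the two segment scores.
            """
            fused = []
            for a in first:
                for b in second:
                    if a.video_id != b.video_id:
                        continue
                    gap = b.start - a.end
                    if 0.0 <= gap <= max_gap:
                        fused.append((a, b, (a.score + b.score) / 2.0))
            return sorted(fused, key=lambda t: t[2], reverse=True)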

    Multimodal Video Analysis and Modeling

    From recalling long-forgotten experiences based on a familiar scent or a piece of music, to lip-reading-aided conversation in noisy environments or travel sickness caused by a mismatch between the signals from vision and the vestibular system, human perception manifests countless examples of the subtle and effortless joint use of the multiple senses provided to us by evolution. Emulating such multisensory (or multimodal, i.e., comprising multiple types of input modes or modalities) processing computationally offers tools for more effective, efficient, or robust accomplishment of many multimedia tasks using evidence from multiple input modalities. Information from the modalities can also be analyzed for patterns and connections across them, opening up interesting applications not feasible with a single modality, such as predicting some aspects of one modality based on another. In this dissertation, multimodal analysis techniques are applied to selected video tasks with accompanying modalities. More specifically, all the tasks involve some type of analysis of videos recorded by non-professional videographers using mobile devices.

    Fusion of information from multiple modalities is applied to recording-environment classification from video and audio, as well as to sport-type classification from a set of multi-device videos, the corresponding audio, and recording-device motion sensor data. The environment classification combines support vector machine (SVM) classifiers trained on various global visual low-level features with audio-event-histogram-based environment classification using k-nearest neighbors (k-NN). Rule-based fusion schemes with genetic algorithm (GA)-optimized modality weights are compared to training an SVM classifier to perform the multimodal fusion. A comprehensive selection of fusion strategies is compared for the task of classifying the sport type of a set of recordings from a common event. These include fusion prior to, simultaneously with, and after classification; various approaches for using modality quality estimates; and fusing soft confidence scores as well as crisp single-class predictions. Additionally, different strategies are examined for aggregating the decisions of single videos into a collective prediction for the set of videos recorded concurrently with multiple devices. In both tasks, multimodal analysis shows a clear advantage over classifying the modalities separately.

    Another part of the work investigates cross-modal pattern analysis and audio-based video editing. This study examines the feasibility of automatically timing the shot cuts of multi-camera concert recordings according to music-related cutting patterns learnt from professional concert videos. Cut timing is a crucial part of the automated creation of multi-camera mashups, where shots from multiple recording devices at a common event are alternated with the aim of mimicking a professionally produced video. In the framework, separate statistical models are formed for typical patterns of beat-quantized cuts in short segments, differences in beats between consecutive cuts, and the relative deviation of cuts from exact beat times. Based on music meter and audio change-point analysis of a new recording, the models can be used to synthesize cut times. In a user study, the proposed framework clearly outperforms a baseline automatic method with comparably advanced audio analysis and wins 48.2% of comparisons against hand-edited videos.
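    To make the fusion step concrete, the sketch below shows weighted late fusion of per-modality class probabilities in Python, where the modality weights could in principle be tuned by a genetic algorithm as in the rule-based schemes above. The function name, the two example modalities, and the specific weights are illustrative assumptions rather than the dissertation's implementation.

        import numpy as np

        def late_fuse(prob_per_modality, weights):
            """Weighted late fusion of per-modality class probabilities.

            prob_per_modality: list of arrays, each shaped (n_classes,),
                e.g. SVM probabilities from visual features and k-NN
                probabilities from audio event histograms.
            weights: modality weights (e.g. tuned by a genetic algorithm),
                normalised here so they sum to one.
            """
            w = np.asarray(weights, dtype=float)
            w = w / w.sum()
            probs = np.stack(prob_per_modality)       # (n_modalities, n_classes)
            fused = (w[:, None] * probs).sum(axis=0)  # (n_classes,)
            return int(np.argmax(fused)), fused

        # Example: two modalities, three environment classes.
        visual = np.array([0.2, 0.5, 0.3])
        audio = np.array([0.6, 0.3, 0.1])
        label, fused = late_fuse([visual, audio], weights=[0.4, 0.6])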

    Learning from limited labeled data - Zero-Shot and Few-Shot Learning

    Human beings have the remarkable ability to recognize novel visual concepts after observing only a few or zero examples of them. Deep learning, however, often requires a large amount of labeled data to achieve good performance. Labeled instances are expensive, difficult, or even infeasible to obtain, because the distribution of training instances among labels naturally exhibits a long tail. It is therefore of great interest to investigate how to learn efficiently from limited labeled data. This thesis concerns an important subfield of learning from limited labeled data, namely low-shot learning. The setting assumes the availability of many labeled examples from known classes, and the goal is to learn novel classes from only a few (few-shot learning) or zero (zero-shot learning) training examples. To this end, we have developed a series of multi-modal learning approaches to facilitate knowledge transfer from known classes to novel classes for a wide range of visual recognition tasks, including image classification, semantic image segmentation, and video action recognition. More specifically, this thesis makes the following contributions. First, as there is no agreed-upon zero-shot image classification benchmark, we define a new benchmark by unifying both the evaluation protocols and the data splits of publicly available datasets. Second, to tackle the scarcity of labeled data, we propose feature generation frameworks that synthesize data in the visual feature space for novel classes. Third, we extend zero-shot learning and few-shot learning to the semantic segmentation task and propose a challenging benchmark for it. We show that incorporating semantic information into a semantic segmentation network is effective in segmenting novel classes. Finally, we develop better video representations for the few-shot video classification task and leverage weakly labeled videos through an efficient retrieval method.
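    The core idea behind such feature generation frameworks can be sketched as a conditional generator that maps a class embedding plus noise to a synthetic visual feature, on which an ordinary classifier can then be trained for novel classes. The PyTorch sketch below is an assumed, simplified illustration; the layer sizes, activations, and training procedure (typically adversarial) are not the thesis's exact model.

        import torch
        import torch.nn as nn

        class FeatureGenerator(nn.Module):
            """Toy conditional generator: class embedding + noise -> visual feature.

            Dimensions are illustrative; real frameworks of this kind are
            trained adversarially against real features of known classes.
            """
            def __init__(self, emb_dim=300, noise_dim=128, feat_dim=2048):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(emb_dim + noise_dim, 4096),
                    nn.LeakyReLU(0.2),
                    nn.Linear(4096, feat_dim),
                    nn.ReLU(),  # visual (e.g. ResNet) features are non-negative
                )

            def forward(self, class_emb, noise):
                return self.net(torch.cat([class_emb, noise], dim=1))

        # Synthesize 100 features for one novel class, then train any
        # off-the-shelf classifier on the synthetic features.
        gen = FeatureGenerator()
        emb = torch.randn(100, 300)  # in practice, the class embedding repeated
        z = torch.randn(100, 128)
        synthetic_features = gen(emb, z)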

    Integrating Deep Learning with Correlation-based Multimedia Semantic Concept Detection

    Rapid advances in technology have made the explosive growth of multimedia data possible and available to the public. Multimedia data can be defined as a collection of data composed of various data types and different representations. Because multimedia data carries rich information, it has been widely adopted in different domains, such as surveillance event detection, medical abnormality detection, and many others. To fulfil the various requirements of different applications, it is important to effectively classify multimedia data into semantic concepts across multiple domains. In this dissertation, a correlation-based multimedia semantic concept detection framework is seamlessly integrated with deep learning techniques. The framework aims to explore implicit and explicit correlations among features and concepts while adopting different Convolutional Neural Network (CNN) architectures accordingly. First, the Feature Correlation Maximum Spanning Tree (FC-MST) is proposed to remove redundant and irrelevant features based on the correlations between the features and the positive concepts. FC-MST identifies the effective features and determines the dimension of the initial layer in the CNNs. Second, the Negative-based Sampling method is proposed to alleviate the data imbalance issue by keeping only the representative negative instances in the training process. To adjust to different sizes of training data, the number of iterations for the CNN is determined adaptively and automatically. Finally, an Indirect Association Rule Mining (IARM) approach and a correlation-based re-ranking method are proposed to reveal implicit relationships from the correlations among concepts, which are utilized together with the classification scores to enhance the re-ranking process. The framework is evaluated using two benchmark multimedia data sets, TRECVID and NUS-WIDE, which contain large amounts of multimedia data and various semantic concepts.
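    The general flavour of correlation-based feature selection over a maximum spanning tree can be illustrated as follows: build a graph over features weighted by pairwise correlation, keep its maximum spanning tree as a redundancy-free backbone, and rank features by their correlation with the positive concept. This is a deliberately simplified sketch in the spirit of FC-MST, not the dissertation's algorithm; the relevance ranking and the keep parameter are assumptions.

        import numpy as np
        import networkx as nx

        def correlation_mst_select(features, relevance, keep=10):
            """Simplified correlation-based feature selection.

            features:  array (n_samples, n_features)
            relevance: array (n_features,), correlation of each feature with
                       the positive concept, used to rank features after the
                       spanning tree summarises feature-feature redundancy.
            """
            corr = np.abs(np.corrcoef(features, rowvar=False))  # feature-feature correlations
            n = corr.shape[0]
            g = nx.Graph()
            g.add_weighted_edges_from(
                (i, j, corr[i, j]) for i in range(n) for j in range(i + 1, n)
            )
            mst = nx.maximum_spanning_tree(g)  # strongest-correlation backbone
            # Keep the most concept-relevant features among the tree's nodes.
            ranked = sorted(mst.nodes, key=lambda i: relevance[i], reverse=True)
            return ranked[:keep]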

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmarking initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.

    Representations and representation learning for image aesthetics prediction and image enhancement

    With the continual improvement of cell phone cameras and of the connectivity of mobile devices, we have seen an exponential increase in the number of images captured, stored, and shared on social media. For example, as of July 1st, 2017, Instagram had over 715 million registered users who had posted just shy of 35 billion images, representing approximately a seven-fold increase in users and a nine-fold increase in photos on Instagram since 2012. Whether images are stored on personal computers or reside on social networks (e.g. Instagram, Flickr), their sheer number calls for methods to determine various image properties, such as object presence or appeal, for the purpose of automatic image management and curation. One of the central problems in consumer photography is determining the aesthetic appeal of an image, which motivates us to explore questions related to understanding aesthetic preferences, image enhancement, and the possibility of using such models on devices with constrained resources. In this dissertation, we present our work on exploring representations and representation learning approaches for aesthetic inference, composition ranking, and their application to image enhancement. First, we discuss early representations, which mainly consisted of expert features, and their potential to enhance Convolutional Neural Networks (CNNs). Second, we discuss the ability of resource-constrained CNNs, and the effect of different architecture choices (input size and layer depth), in solving various aesthetic inference tasks: binary classification, regression, and image cropping. We show that, if trained for fine-grained aesthetics inference, such models can rival the cropping performance of other aesthetics-based croppers, but fall short of models trained for composition ranking. Lastly, we discuss our work on exploring and identifying the design choices in training composition ranking functions, with the goal of using them for image composition enhancement.
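    A composition ranking function of this kind is typically trained with a pairwise objective: the network should assign a higher score to the better-composed of two crops. The PyTorch sketch below illustrates that idea with a deliberately tiny scoring network; the architecture, margin, and random inputs are assumptions for illustration, not the dissertation's models.

        import torch
        import torch.nn as nn

        class CropScorer(nn.Module):
            """Tiny stand-in for a resource-constrained CNN that scores a crop."""
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1),
                )
                self.head = nn.Linear(16, 1)

            def forward(self, x):
                return self.head(self.features(x).flatten(1)).squeeze(1)

        # Pairwise margin ranking: the preferred crop should out-score the other.
        scorer = CropScorer()
        better = torch.rand(8, 3, 64, 64)  # crops judged better composed
        worse = torch.rand(8, 3, 64, 64)   # crops judged worse composed
        loss = nn.MarginRankingLoss(margin=0.5)(
            scorer(better), scorer(worse), torch.ones(8)
        )
        loss.backward()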

    Soft Biometric Analysis: Multi-Person and Real-Time Pedestrian Attribute Recognition in Crowded Urban Environments

    Traditionally, recognition systems were based only on hard human biometrics. However, ubiquitous CCTV cameras have raised the desire to analyze human biometrics from far distances, without the attendance of people in the acquisition process. High-resolution face close-shots are rarely available at far distances, so face-based systems cannot provide reliable results in surveillance applications. Human soft biometrics such as body and clothing attributes are believed to be more effective in analyzing human data collected by security cameras. This thesis contributes to human soft biometric analysis in uncontrolled environments and mainly focuses on two tasks: Pedestrian Attribute Recognition (PAR) and person re-identification (re-id). We first review the literature of both tasks and highlight the history of advancements, recent developments, and the existing benchmarks. The difficulties of PAR and person re-id are due to significant distances between intra-class samples, which originate from variations in several factors such as body pose, illumination, background, occlusion, and data resolution. Recent state-of-the-art approaches present end-to-end models that can extract discriminative and comprehensive feature representations from people. The correlation between different regions of the body and dealing with limited learning data are also the objectives of many recent works. Moreover, class imbalance and correlation between human attributes are specific challenges associated with the PAR problem.

    We collect a large surveillance dataset to train a novel gender recognition model suitable for uncontrolled environments. We propose a deep residual network that extracts several pose-wise patches from samples and obtains a comprehensive feature representation. In the next step, we develop a model for recognizing multiple attributes at once. Considering the correlation between human semantic attributes and the class imbalance, we use a multi-task model and a weighted loss function, respectively. We also propose a multiplication layer on top of the backbone feature extraction layers to exclude background features from the final representation of samples and to draw the model's attention to the foreground area.

    We address the problem of person re-id by implicitly defining the receptive fields of deep learning classification frameworks. The receptive fields of deep learning models determine the most significant regions of the input data for providing correct decisions. Therefore, we synthesize a set of learning data in which the destructive regions (e.g., background) in each pair of instances are interchanged. A segmentation module determines the destructive and useful regions in each sample, and the label of a synthesized instance is inherited from the sample that contributed the useful regions to the synthesized image. The synthesized learning data are then used in the learning phase and help the model rapidly learn that identity and background regions are not correlated. Meanwhile, the proposed solution can be seen as a data augmentation approach that fully preserves the label information and is compatible with other data augmentation techniques.

    When re-id methods are learned in scenarios where the target person appears with identical garments in the gallery, the visual appearance of clothes is given the most importance in the final feature representation. Cloth-based representations are not reliable in long-term re-id settings, as people may change their clothes. Therefore, solutions that ignore clothing cues and focus on identity-relevant features are in demand. We transform the original data such that the identity-relevant information of people (e.g., face and body shape) is removed, while the identity-unrelated cues (i.e., the color and texture of clothes) remain unchanged. A model learned on the synthesized dataset predicts the identity-unrelated cues (short-term features). We therefore train a second model, coupled with the first, that learns embeddings of the original data such that the similarity between the embeddings of the original and synthesized data is minimized. This way, the second model predicts based on the identity-related (long-term) representation of people.

    To evaluate the performance of the proposed models, we use PAR and person re-id datasets, namely BIODI, PETA, RAP, Market-1501, MSMTV2, PRCC, LTCC, and MIT, and compare our experimental results with state-of-the-art methods in the field. In conclusion, the data collected from surveillance cameras have low resolution, such that the extraction of hard biometric features is not possible and face-based approaches produce poor results. In contrast, soft biometrics are robust to variations in data quality. We therefore propose approaches for both PAR and person re-id that learn discriminative features from each instance, and we evaluate our proposed solutions on several publicly available benchmarks. This thesis was prepared at the University of Beira Interior, IT Instituto de Telecomunicações, Soft Computing and Image Analysis Laboratory (SOCIA Lab), Covilhã Delegation, and was submitted to the University of Beira Interior for defense in a public examination session.
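    The background-interchange idea for re-id training data can be illustrated with a short numpy sketch: given two equally sized samples and their person masks, the background regions are swapped while each synthesized image inherits the label of the person it still contains. The function below is an assumed illustration of that augmentation, not the thesis's implementation.

        import numpy as np

        def swap_backgrounds(img_a, mask_a, label_a, img_b, mask_b, label_b):
            """Exchange background regions between two equally sized samples.

            img_*:  uint8 arrays (H, W, 3); mask_*: boolean arrays (H, W),
            True for the person (useful region), False for the background
            (destructive region). Each synthesized image keeps the label of
            the identity whose foreground it contains.
            """
            a_on_b = np.where(mask_a[..., None], img_a, img_b)  # person A, background B
            b_on_a = np.where(mask_b[..., None], img_b, img_a)  # person B, background A
            return (a_on_b, label_a), (b_on_a, label_b)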

    Exploring Sparse, Unstructured Video Collections of Places

    The abundance of mobile devices and digital cameras with video capture makes it easy to obtain large collections of video clips that contain the same location, environment, or event. However, such an unstructured collection is difficult to comprehend and explore. We propose a system that analyses collections of unstructured but related video data to create a Videoscape: a data structure that enables interactive exploration of video collections by visually navigating, spatially and/or temporally, between different clips. We automatically identify transition opportunities, or portals. From these portals, we construct the Videoscape, a graph whose edges are video clips and whose nodes are portals between clips. Once structured, the videos can be interactively explored by walking the graph or via a geographic map. Given this system, we gauge preference for different video transition styles in a user study and derive heuristics that automatically choose an appropriate transition style. We evaluate our system in three further user studies, which allow us to conclude that Videoscapes provide significant benefits over related methods. Our system leads to previously unseen ways of interactive spatio-temporal exploration of casually captured videos, which we demonstrate on several video collections.
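    The Videoscape described above maps naturally onto a multigraph in which nodes are portals and edges are the clips that connect them; a walk through the graph is then a spatio-temporal tour. The networkx sketch below is an assumed, minimal illustration of that data structure with made-up portal and clip names, not the authors' implementation.

        import networkx as nx

        # Minimal Videoscape-style structure: nodes are portals (visual
        # transition opportunities), edges are the video clips that connect
        # them. Two portals may be linked by several clips, hence MultiGraph.
        videoscape = nx.MultiGraph()
        videoscape.add_edge("portal_fountain", "portal_gate", clip="clip_017.mp4")
        videoscape.add_edge("portal_gate", "portal_bridge", clip="clip_042.mp4")
        videoscape.add_edge("portal_fountain", "portal_bridge", clip="clip_101.mp4")

        # One possible exploration: the clips along a path between two portals.
        path = nx.shortest_path(videoscape, "portal_fountain", "portal_bridge")
        clips = [videoscape[u][v][0]["clip"] for u, v in zip(path, path[1:])]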

    Irish Machine Vision and Image Processing Conference Proceedings 2017
