
    Multimodal Data Analytics and Fusion for Data Science

    Advances in technology have rapidly accumulated a zettabyte of "new" data every two years. This huge amount of data has a powerful impact on various areas of science and engineering and generates enormous research opportunities, which calls for the design and development of advanced approaches in data analytics. Given such demands, data science has become an emerging hot topic in both industry and academia, ranging from basic business solutions, technological innovations, and multidisciplinary research to political decisions, urban planning, and policymaking. Within the scope of this dissertation, a multimodal data analytics and fusion framework is proposed for data-driven knowledge discovery and cross-modality semantic concept detection. The proposed framework can explore useful knowledge hidden in different formats of data and incorporate representation learning from data in multiple modalities, especially for disaster information management. First, a Feature Affinity-based Multiple Correspondence Analysis (FA-MCA) method is presented to analyze the correlations among low-level features from different feature sets, and an MCA-based Neural Network (MCA-NN) is proposed to capture the high-level features from individual FA-MCA models and seamlessly integrate the semantic data representations for video concept detection. Next, a genetic algorithm-based approach is presented for deep neural network selection. Furthermore, the improved genetic algorithm is integrated with deep neural networks to generate populations that produce optimal deep representation learning models. Then, a multimodal deep representation learning framework is proposed to efficiently incorporate the semantic representations from data in multiple modalities. Finally, fusion strategies are applied to accommodate multiple modalities. In this framework, cross-modal mapping strategies are also proposed to organize the features into a better structure and improve the overall performance.
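    The abstract's "genetic algorithm-based approach for deep neural network selection" can be illustrated with a minimal sketch: a population of candidate network configurations is evolved through selection, crossover, and mutation. The encoding (a list of hidden-layer widths), the fitness callback, and all hyper-parameters below are illustrative assumptions, not the dissertation's actual FA-MCA/GA method.

```python
# Minimal sketch of genetic-algorithm search over network configurations
# (illustrative assumptions only; not the dissertation's actual method).
import random

def random_config():
    # A candidate network is encoded as a list of hidden-layer widths.
    return [random.choice([32, 64, 128, 256]) for _ in range(random.randint(1, 4))]

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def mutate(config, rate=0.2):
    return [random.choice([32, 64, 128, 256]) if random.random() < rate else w
            for w in config]

def evolve(evaluate, pop_size=10, generations=5):
    # `evaluate` would train/validate a model built from a config and
    # return a score; it is passed in so the sketch stays framework-agnostic.
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=evaluate, reverse=True)
        parents = scored[: pop_size // 2]              # keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=evaluate)

# Toy fitness: prefer roughly three layers of width near 128.
best = evolve(lambda cfg: -abs(len(cfg) - 3) - sum(abs(w - 128) for w in cfg) / 1000)
print(best)
```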

    Speech-based recognition of self-reported and observed emotion in a dimensional space

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and looking at how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus, which contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings, which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a two-dimensional arousal-valence space. The results of these recognizers show that self-reported emotion is much harder to recognize than observed emotion, and that averaging ratings from multiple observers improves performance.
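    As a concrete illustration of the regression setup described here, one Support Vector Regressor can be trained per emotion dimension on pre-extracted feature vectors. The toy data below only stands in for the TNO-Gaming Corpus and its acoustic/textual features; it is a minimal sketch, not the paper's actual pipeline.

```python
# Sketch: two SVRs predicting points in a 2-D arousal-valence space.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))          # e.g. 40 acoustic features per utterance
arousal = rng.uniform(-1, 1, size=200)  # continuous ratings in [-1, 1]
valence = rng.uniform(-1, 1, size=200)

# One regressor per emotion dimension, each with feature standardization.
arousal_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
valence_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
arousal_model.fit(X, arousal)
valence_model.fit(X, valence)

# Each new utterance maps to a point in the arousal-valence plane.
x_new = rng.normal(size=(1, 40))
print(arousal_model.predict(x_new), valence_model.predict(x_new))
```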

    Spatio-Temporal Multimedia Big Data Analytics Using Deep Neural Networks

    With the proliferation of online services and mobile technologies, the world has stepped into a multimedia big data era, where new opportunities and challenges arise from highly diverse multimedia data together with huge amounts of social data. Nowadays, multimedia data consisting of audio, text, image, and video has grown tremendously. With such an increase in the amount of multimedia data, the main question is how one can analyze this high volume and variety of data in an efficient and effective way. A vast amount of research work has been done in the multimedia area, targeting different aspects of big data analytics, such as the capture, storage, indexing, mining, and retrieval of multimedia big data. However, there is insufficient research that provides a comprehensive framework for multimedia big data analytics and management. To address the major challenges in this area, a new framework is proposed based on deep neural networks for multimedia semantic concept detection, with a focus on spatio-temporal information analysis and rare event detection. The proposed framework is able to discover patterns and knowledge in multimedia data using both static deep data representations and temporal semantics. Specifically, it is designed to handle data with skewed distributions. The proposed framework includes the following components: (1) a synthetic data generation component based on simulation and adversarial networks for data augmentation and deep learning training, (2) an automatic sampling model to overcome the imbalanced data issue in multimedia data, (3) a deep representation learning model leveraging novel deep learning techniques to generate the most discriminative static features from multimedia data, (4) an automatic hyper-parameter learning component for faster training and convergence of the learning models, (5) a spatio-temporal deep learning model to analyze dynamic features from multimedia data, and finally (6) a multimodal deep learning fusion model to integrate different data modalities. The whole framework has been evaluated using various large-scale multimedia datasets, including a newly collected disaster-events video dataset and other public datasets.
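    Component (2), handling skewed distributions, can be pictured with a small re-sampling sketch: drawing training indices with inverse class-frequency weights so rare events are seen as often as common ones. This generic oversampler is an illustrative stand-in, not the framework's actual automatic sampling model.

```python
# Sketch: inverse-frequency oversampling for imbalanced (rare-event) data.
import numpy as np

def balanced_sample(labels, rng=None):
    """Return indices drawn so each class is picked with roughly equal probability."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts))
    # Weight each sample by the inverse frequency of its class.
    weights = np.array([1.0 / freq[y] for y in labels])
    weights /= weights.sum()
    return rng.choice(len(labels), size=len(labels), replace=True, p=weights)

# Example: a rare-event class with only 5 positives out of 100.
labels = np.array([1] * 5 + [0] * 95)
idx = balanced_sample(labels)
print(np.bincount(labels[idx]))  # roughly balanced after re-sampling
```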

    A Fusion-Based Framework for Wireless Multimedia Sensor Networks in Surveillance Applications

    Multimedia sensors enable monitoring applications to obtain more accurate and detailed information. However, developing efficient and lightweight solutions for managing data traffic over wireless multimedia sensor networks (WMSNs) has become vital because of the excessive volume of data produced by multimedia sensors. Motivated by this, this paper proposes a fusion-based WMSN framework that reduces the amount of data to be transmitted over the network through intra-node processing. The framework explores three main issues: 1) the design of a wireless multimedia sensor (WMS) node that detects objects using machine learning techniques; 2) a method for increasing accuracy while reducing the amount of information transmitted by the WMS nodes to the base station; and 3) a new cluster-based routing algorithm for WMSNs that consumes less power than currently used algorithms. In this context, a WMS node is designed and implemented using commercially available components. In order to reduce the amount of information to be transmitted to the base station and thereby extend the lifetime of a WMSN, a method for detecting and classifying objects on three different layers has been developed. A new energy-efficient cluster-based routing algorithm is developed to transfer the collected information to the sink. The proposed framework and the cluster-based routing algorithm are applied to our WMS nodes and tested experimentally. The results of the experiments clearly demonstrate the feasibility of the proposed WMSN architecture in real-world surveillance applications.
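    The general shape of energy-aware cluster-based routing can be sketched as follows: elect the highest-energy nodes as cluster heads, then attach every other node to its nearest head so only heads forward aggregated data toward the sink. The node model, energy values, and selection rule here are simplifying assumptions (roughly LEACH-style), not the paper's actual algorithm.

```python
# Sketch: energy-aware cluster-head election and cluster assignment.
import random

class Node:
    def __init__(self, node_id, x, y, energy=1.0):
        self.id, self.x, self.y, self.energy = node_id, x, y, energy

def elect_cluster_heads(nodes, fraction=0.1):
    """Pick the highest-energy nodes as cluster heads."""
    k = max(1, int(len(nodes) * fraction))
    return sorted(nodes, key=lambda n: n.energy, reverse=True)[:k]

def assign_to_clusters(nodes, heads):
    """Each remaining node joins its nearest cluster head."""
    dist = lambda a, b: ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5
    clusters = {h.id: [] for h in heads}
    for n in nodes:
        if n not in heads:
            nearest = min(heads, key=lambda h: dist(n, h))
            clusters[nearest.id].append(n.id)
    return clusters

random.seed(0)
nodes = [Node(i, random.uniform(0, 100), random.uniform(0, 100),
              energy=random.uniform(0.2, 1.0)) for i in range(30)]
heads = elect_cluster_heads(nodes)
print(assign_to_clusters(nodes, heads))
```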

    Combining heterogeneous sources in an interactive multimedia content retrieval model

    Interactive multimodal information retrieval (IMIR) systems increase the capabilities of traditional search systems by adding the ability to retrieve information of different types (modes) and from different sources. This article describes a formal model for interactive multimodal information retrieval that includes formal, general definitions of each component of an IMIR system. A use case focused on sports information retrieval validates the model through a prototype that implements a subset of the model's features. Adaptive techniques applied to the retrieval functionality of IMIR systems have been defined by analysing past interactions using decision trees, neural networks, and clustering techniques. The model includes a strategy for selecting sources and combining the results obtained from every source. After modifying the prototype's source-selection strategy, the system is re-evaluated using classification techniques. This work was partially supported by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).
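    One common way to combine results from heterogeneous sources, as the model requires, is weighted score fusion: normalize each source's scores and sum them per document. This CombSUM-style sketch is a generic illustration, not the prototype's actual combination strategy.

```python
# Sketch: weighted CombSUM-style fusion of ranked results from several sources.
def fuse_results(source_results, source_weights=None):
    """source_results: {source_name: {doc_id: score}} -> fused {doc_id: score}."""
    fused = {}
    for source, results in source_results.items():
        weight = (source_weights or {}).get(source, 1.0)
        # Normalize each source's scores to [0, 1] before summing.
        lo, hi = min(results.values()), max(results.values())
        span = (hi - lo) or 1.0
        for doc_id, score in results.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * (score - lo) / span
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

# Example: text and video sources score overlapping documents differently.
print(fuse_results(
    {"text":  {"d1": 3.2, "d2": 1.1, "d3": 0.4},
     "video": {"d2": 0.9, "d3": 0.8}},
    source_weights={"text": 1.0, "video": 0.5},
))
```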

    Multimedia Big Data Analytics and Fusion for Data Science

    Title from PDF of title page, viewed May 24, 2023. Dissertation advisor: Shu-Ching Chen. Vita. Includes bibliographical references (pages 178-212). Dissertation (Ph.D.)--Department of Computer Science and Electrical Engineering, University of Missouri--Kansas City, 2023.
    Big data is becoming increasingly prevalent in people's everyday lives due to the enormous quantity of data generated from social and economic activities worldwide. As a result, extensive research has been undertaken to support the big data revolution. However, as data grows in volume, traditional data analytic methods face various challenges, especially when raw data comes in multiple forms and formats. This dissertation proposes a multimodal big data analytics and fusion framework that addresses several challenges in data science for handling and learning from multimodal big data. The proposed framework addresses issues that arise during a standard data science project workflow, including data fusion, spatio-temporal deep feature extraction, and model training optimization strategy. First, a hierarchical graph fusion network is presented to capture inter-modality correlations. The network hierarchy models the modality-wise combinations with gradually increasing complexity to explore all n-modality interactions. Next, an adaptive spatio-temporal graph network is proposed to capture the hidden patterns in spatio-temporal data. It exploits local and global node correlations by improving the pre-defined graph Laplacian, and it automatically generates the graph adjacency matrix with a data-driven method. In addition, a dynamic multi-task learning method is introduced to optimize the model training process by dynamically adjusting the loss weights assigned to each task. It systematically monitors the sample-level prediction errors, the task-level weight-parameter change rate, and the iteration-level total loss to adjust the weight balance among tasks. The proposed framework has been evaluated on various datasets, including disaster event videos, social media, traffic flow, and other public datasets.
    Contents: Introduction -- Related work -- Overview of the framework -- Dynamic multi-task learning -- Hierarchical graph fusion -- Spatio-temporal graph network -- Conclusions and future work.
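    The dynamic multi-task loss weighting described above can be illustrated with a small update rule: tasks whose loss is decreasing more slowly receive larger weights in the combined objective. The specific rule below (softmax over recent loss ratios, similar to Dynamic Weight Averaging) is an illustrative assumption, not the dissertation's exact scheme.

```python
# Sketch: dynamic loss weighting across tasks (DWA-style, illustrative only).
import numpy as np

def update_task_weights(prev_losses, curr_losses, temperature=2.0):
    """Give more weight to tasks whose loss is decreasing more slowly."""
    ratios = np.asarray(curr_losses) / np.asarray(prev_losses)
    exp = np.exp(ratios / temperature)
    return len(curr_losses) * exp / exp.sum()   # weights sum to the task count

# Example: task 0 improves quickly, task 1 stalls -> task 1 gains weight.
w = update_task_weights(prev_losses=[1.0, 1.0], curr_losses=[0.5, 0.95])
print(w)                                        # approx. [0.89, 1.11]
total_loss = w[0] * 0.5 + w[1] * 0.95           # weighted combined objective
```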

    Improved depth recovery in consumer depth cameras via disparity space fusion within cross-spectral stereo.

    We address the issue of improving depth coverage in consumer depth cameras based on the combined use of cross-spectral stereo and near-infrared structured light sensing. Specifically, we show that fusing disparity over these modalities within the disparity space image, prior to disparity optimization, facilitates the recovery of scene depth information in regions where structured light sensing fails. We show that this joint approach, leveraging disparity information from both structured light and cross-spectral sensing, facilitates the joint recovery of global scene depth comprising both texture-less object depth, where conventional stereo otherwise fails, and highly reflective object depth, where structured light (and similar) active sensing commonly fails. The proposed solution is illustrated using dense gradient feature matching and is shown to outperform prior approaches that use late-stage fused cross-spectral stereo depth as a facet of improved sensing for consumer depth cameras.
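    The core idea, fusing evidence inside the disparity space image rather than merging two finished depth maps, can be sketched as a confidence-weighted blend of two matching-cost volumes followed by a winner-takes-all step. The weighting and the toy volumes below are assumptions for illustration, not the paper's dense-gradient-feature pipeline.

```python
# Sketch: confidence-weighted fusion of two disparity-space cost volumes.
import numpy as np

def fuse_dsi(cost_stereo, cost_sl, conf_stereo, conf_sl):
    """Blend two HxWxD matching-cost volumes using per-pixel confidences."""
    conf_stereo = conf_stereo[..., None]        # broadcast over disparity axis
    conf_sl = conf_sl[..., None]
    total = conf_stereo + conf_sl + 1e-6
    return (conf_stereo * cost_stereo + conf_sl * cost_sl) / total

rng = np.random.default_rng(0)
H, W, D = 4, 5, 16                              # tiny volume for illustration
cost_stereo = rng.random((H, W, D))             # cross-spectral stereo costs
cost_sl = rng.random((H, W, D))                 # structured-light costs
conf_stereo = rng.random((H, W))                # low where texture is missing
conf_sl = rng.random((H, W))                    # low on reflective surfaces

fused = fuse_dsi(cost_stereo, cost_sl, conf_stereo, conf_sl)
disparity = fused.argmin(axis=-1)               # winner-takes-all per pixel
print(disparity.shape)                          # (4, 5)
```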

    Bridging Vision and Language over Time with Neural Cross-modal Embeddings

    Giving computers the ability to understand multimedia content is one of the goals of Artificial Intelligence systems. While humans excel at this task, it remains a challenge for machines, requiring a bridge between vision and language, which inherently have heterogeneous computational representations. Cross-modal embeddings tackle this challenge by learning a common space that unifies these representations. However, to grasp the semantics of an image, one must look beyond the pixels and consider its semantic and temporal context, which are defined by the image's textual descriptions and its time dimension, respectively. External causes (e.g. emerging events) change the way humans interpret and describe the same visual element over time, leading to the evolution of visual-textual correlations. In this thesis we investigate models that capture patterns of visual and textual interactions over time by incorporating time into cross-modal embeddings: 1) in a relative manner, where pairwise temporal correlations are used to aid data structuring, yielding a model that provides better visual-textual correspondences on dynamic corpora; and 2) in a diachronic manner, where the temporal dimension is fully preserved, thus capturing the evolution of visual-textual correlations under a principled approach that jointly models vision, language, and time. Rich insights stemming from data evolution were extracted from a large-scale dataset spanning 20 years. Additionally, towards improving the effectiveness of these embedding learning models, we propose a novel loss function that increases the expressiveness of the standard triplet loss by making it adaptive to the data at hand. With our adaptive triplet loss, in which triplet-specific constraints are inferred and scheduled, we achieved state-of-the-art performance on the standard cross-modal retrieval task.
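    To make the adaptive triplet-loss idea concrete, the sketch below computes a triplet loss whose margin adapts to the batch at hand. The specific adaptation rule (margin scaled by the batch's mean negative distance) is an assumption for illustration, not the thesis's actual constraint-inference and scheduling scheme.

```python
# Sketch: a triplet loss with a batch-adaptive margin for cross-modal embeddings.
import numpy as np

def adaptive_triplet_loss(anchor, positive, negative, base_margin=0.2):
    """anchor/positive/negative: (N, d) embeddings from the two modalities."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    # Scale the margin by how spread out the negatives are in this batch.
    margin = base_margin * d_neg.mean()
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 64))                  # image embeddings (anchors)
txt_pos = img + 0.1 * rng.normal(size=(8, 64))  # matching captions
txt_neg = rng.normal(size=(8, 64))              # non-matching captions
print(adaptive_triplet_loss(img, txt_pos, txt_neg))
```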