398,229 research outputs found

    Characterizing and Subsetting Big Data Workloads

    Full text link
    Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates hese challenges. In this paper, we first use Principle Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principle components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.Comment: 11 pages, 6 figures, 2014 IEEE International Symposium on Workload Characterizatio

    Structuring User-Generated Content on Social Media with Multimodal Aspect-Based Sentiment Analysis

    Full text link
    People post their opinions and experiences on social media, yielding rich databases of end users' sentiments. This paper shows to what extent machine learning can analyze and structure these databases. An automated data analysis pipeline is deployed to provide insights into user-generated content for researchers in other domains. First, the domain expert can select an image and a term of interest. Then, the pipeline uses image retrieval to find all images showing similar contents and applies aspect-based sentiment analysis to outline users' opinions about the selected term. As part of an interdisciplinary project between architecture and computer science researchers, an empirical study of Hamburg's Elbphilharmonie was conveyed on 300 thousand posts from the platform Flickr with the hashtag 'hamburg'. Image retrieval methods generated a subset of slightly more than 1.5 thousand images displaying the Elbphilharmonie. We found that these posts mainly convey a neutral or positive sentiment towards it. With this pipeline, we suggest a new big data analysis method that offers new insights into end-users opinions, e.g., for architecture domain experts.Comment: 9 pages, 5 figures, short paper version to be published at 9th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT2022

    Real-time probabilistic reasoning system using Lambda architecture

    Get PDF
    Thesis (MTech (Information Technology))--Cape Peninsula University of Technology, 2019The proliferation of data from sources like social media, and sensor devices has become overwhelming for traditional data storage and analysis technologies to handle. This has prompted a radical improvement in data management techniques, tools and technologies to meet the increasing demand for effective collection, storage and curation of large data set. Most of the technologies are open-source. Big data is usually described as very large dataset. However, a major feature of big data is its velocity. Data flow in as continuous stream and require to be actioned in real-time to enable meaningful, relevant value. Although there is an explosion of technologies to handle big data, they are usually targeted at processing large dataset (historic) and real-time big data independently. Thus, the need for a unified framework to handle high volume dataset and real-time big data. This resulted in the development of models such as the Lambda architecture. Effective decision-making requires processing of historic data as well as real-time data. Some decision-making involves complex processes, depending on the likelihood of events. To handle uncertainty, probabilistic systems were designed. Probabilistic systems use probabilistic models developed with probability theories such as hidden Markov models with inference algorithms to process data and produce probabilistic scores. However, development of these models requires extensive knowledge of statistics and machine learning, making it an uphill task to model real-life circumstances. A new research area called probabilistic programming has been introduced to alleviate this bottleneck. This research proposes the combination of modern open-source big data technologies with probabilistic programming and Lambda architecture on easy-to-get hardware to develop a highly fault-tolerant, and scalable processing tool to process both historic and real-time big data in real-time; a common solution. This system will empower decision makers with the capacity to make better informed resolutions especially in the face of uncertainty. The outcome of this research will be a technology product, built and assessed using experimental evaluation methods. This research will utilize the Design Science Research (DSR) methodology as it describes guidelines for the effective and rigorous construction and evaluation of an artefact. Probabilistic programming in the big data domain is still at its infancy, however, the developed artefact demonstrated an important potential of probabilistic programming combined with Lambda architecture in the processing of big data

    Statistical analysis driven optimized deep learning system for intrusion detection

    Get PDF
    Attackers have developed ever more sophisticated and intelligent ways to hack information and communication technology systems. The extent of damage an individual hacker can carry out upon infiltrating a system is well understood. A potentially catastrophic scenario can be envisaged where a nation-state intercepting encrypted financial data gets hacked. Thus, intelligent cybersecurity systems have become inevitably important for improved protection against malicious threats. However, as malware attacks continue to dramatically increase in volume and complexity, it has become ever more challenging for traditional analytic tools to detect and mitigate threat. Furthermore, a huge amount of data produced by large networks has made the recognition task even more complicated and challenging. In this work, we propose an innovative statistical analysis driven optimized deep learning system for intrusion detection. The proposed intrusion detection system (IDS) extracts optimized and more correlated features using big data visualization and statistical analysis methods (human-in-the-loop), followed by a deep autoencoder for potential threat detection. Specifically, a pre-processing module eliminates the outliers and converts categorical variables into one-hot-encoded vectors. The feature extraction module discard features with null values and selects the most significant features as input to the deep autoencoder model (trained in a greedy-wise manner). The NSL-KDD dataset from the Canadian Institute for Cybersecurity is used as a benchmark to evaluate the feasibility and effectiveness of the proposed architecture. Simulation results demonstrate the potential of our proposed system and its outperformance as compared to existing state-of-the-art methods and recently published novel approaches. Ongoing work includes further optimization and real-time evaluation of our proposed IDS.Comment: To appear in the 9th International Conference on Brain Inspired Cognitive Systems (BICS 2018

    Maximizing Insight from Modern Economic Analysis

    Full text link
    The last decade has seen a growing trend of economists exploring how to extract different economic insight from "big data" sources such as the Web. As economists move towards this model of analysis, their traditional workflow starts to become infeasible. The amount of noisy data from which to draw insights presents data management challenges for economists and limits their ability to discover meaningful information. This leads to economists needing to invest a great deal of energy in training to be data scientists (a catch-all role that has grown to describe the usage of statistics, data mining, and data management in the big data age), with little time being spent on applying their domain knowledge to the problem at hand. We envision an ideal workflow that generates accurate and reliable results, where results are generated in near-interactive time, and systems handle the "heavy lifting" required for working with big data. This dissertation presents several systems and methodologies that bring economists closer to this ideal workflow, helping them address many of the challenges faced in transitioning to working with big data sources like the Web. To help users generate accurate and reliable results, we present approaches to identifying relevant predictors in nowcasting applications, as well as methods for identifying potentially invalid nowcasting models and their inputs. We show how a streamlined workflow, combined with pruning and shared computation, can help handle the heavy lifting of big data analysis, allowing users to generate results in near-interactive time. We also present a novel user model and architecture for helping users avoid undesirable bias when doing data preparation: users interactively define constraints for transformation code and the data that the code produces, and an explain-and-repair system satisfies these constraints as best it can, also providing an explanation for any problems along the way. These systems combined represent a unified effort to streamline the transition for economists to this new big data workflow.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/144007/1/dol_1.pd

    Vision Based Activity Recognition Using Machine Learning and Deep Learning Architecture

    Get PDF
    Human Activity recognition, with wide application in fields like video surveillance, sports, human interaction, elderly care has shown great influence in upbringing the standard of life of people. With the constant development of new architecture, models, and an increase in the computational capability of the system, the adoption of machine learning and deep learning for activity recognition has shown great improvement with high performance in recent years. My research goal in this thesis is to design and compare machine learning and deep learning models for activity recognition through videos collected from different media in the field of sports. Human activity recognition (HAR) mostly is to recognize the action performed by a human through the data collected from different sources automatically. Based on the literature review, most data collected for analysis is based on time series data collected through different sensors and video-based data collected through the camera. So firstly, our research analyzes and compare different machine learning and deep learning architecture with sensor-based data collected from an accelerometer of a smartphone place at different position of the human body. Without any hand-crafted feature extraction methods, we found that deep learning architecture outperforms most of the machine learning architecture and the use of multiple sensors has higher accuracy than a dataset collected from a single sensor. Secondly, as collecting data from sensors in real-time is not feasible in all the fields such as sports, we study the activity recognition by using the video dataset. For this, we used two state-of-the-art deep learning architectures previously trained on the big, annotated dataset using transfer learning methods for activity recognition in three different sports-related publicly available datasets. Extending the study to the different activities performed on a single sport, and to avoid the current trend of using special cameras and expensive set up around the court for data collection, we developed our video dataset using sports coverage of basketball games broadcasted through broadcasting media. The detailed analysis and experiments based on different criteria such as range of shots taken, scoring activities is presented for 8 different activities using state-of-art deep learning architecture for video classification

    The medical science DMZ: a network design pattern for data-intensive medical science

    Get PDF
    Abstract: Objective We describe a detailed solution for maintaining high-capacity, data-intensive network flows (eg, 10, 40, 100 Gbps+) in a scientific, medical context while still adhering to security and privacy laws and regulations. Materials and Methods High-end networking, packet-filter firewalls, network intrusion-detection systems. Results We describe a “Medical Science DMZ” concept as an option for secure, high-volume transport of large, sensitive datasets between research institutions over national research networks, and give 3 detailed descriptions of implemented Medical Science DMZs. Discussion The exponentially increasing amounts of “omics” data, high-quality imaging, and other rapidly growing clinical datasets have resulted in the rise of biomedical research “Big Data.” The storage, analysis, and network resources required to process these data and integrate them into patient diagnoses and treatments have grown to scales that strain the capabilities of academic health centers. Some data are not generated locally and cannot be sustained locally, and shared data repositories such as those provided by the National Library of Medicine, the National Cancer Institute, and international partners such as the European Bioinformatics Institute are rapidly growing. The ability to store and compute using these data must therefore be addressed by a combination of local, national, and industry resources that exchange large datasets. Maintaining data-intensive flows that comply with the Health Insurance Portability and Accountability Act (HIPAA) and other regulations presents a new challenge for biomedical research. We describe a strategy that marries performance and security by borrowing from and redefining the concept of a Science DMZ, a framework that is used in physical sciences and engineering research to manage high-capacity data flows. Conclusion By implementing a Medical Science DMZ architecture, biomedical researchers can leverage the scale provided by high-performance computer and cloud storage facilities and national high-speed research networks while preserving privacy and meeting regulatory requirements

    Fog and Cloud Computing Assisted IoT Model Based Personal Emergency Monitoring and Diseases Prediction Services

    Get PDF
    Along with the rapid development of modern high-tech and the change of people's awareness of healthy life, the demand for personal healthcare services is gradually increasing. The rapid progress of information and communication technology and medical and bio technology not only improves personal healthcare services, but also brings the fact that the human being has entered the era of longevity. At present, there are many researches focused on various wearable sensing devices and implant devices and Internet of Things in order to capture personal daily life health information more conveniently and effectively, and significant results have been obtained, such as fog computing. To provide personal healthcare services, the fog and cloud computing is an effective solution for sharing health information. The health big data analysis model can provide personal health situation reports on a daily basis, and the gene sequencing can provide hereditary disease prediction. However, the injury mortality and emergency diseases since long ago caused death and great pain for the family. And there are no effective rescue methods to save precious lives and no methods to predict the disease morbidity likelihood. The purpose of this research is to capture personal daily health information based on sensors and monitoring emergency situations with the help of fog computing and mobile applications, and disease prediction based on cloud computing and big data analysis. Through the comparison of test results it was proved that the proposed emergency monitoring based on fog and cloud computing and the diseases prediction model based on big data analysis not only gain more of the rescue time than the traditional emergency treatment method, but they also accumulate lots of different personal healthcare related experience. The Taian 960 hospital of PLA and the Yanbian Hospital as IM testbed were joined to provide emergency monitoring tests, and to ensure the CVD and CVA morbidity likelihood medical big data analysis, the people around Taian city participated in personal health tests. Through the project, the five network layers architecture and integrated MAPE-K Model based EMDPS platform not only made the cooperation between hospitals feasible to deal with emergency situations, but also the Internet medicine for the disease prediction was built
    • …
    corecore