
    Parallel RDF generation from heterogeneous big data

    To unlock the value of increasingly available data in high volumes, we need flexible ways to integrate data across different sources. While semantic integration can be provided through RDF generation, current generators scale insufficiently with data volume because they are bound by memory constraints. We therefore developed the RMLStreamer, a generator that parallelizes the ingestion and mapping tasks of RDF generation across multiple instances. In this paper, we analyze which aspects of RDF generation are parallelizable and introduce an approach for parallel RDF generation. We describe how we implemented the proposed approach in the RMLStreamer, and how its scaling behavior compares to that of other RDF generators. Through parallel ingestion, the RMLStreamer ingests data at a 50% faster rate than existing generators.
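    The following sketch (not the RMLStreamer itself, which is built on a stream-processing engine) illustrates why ingestion and mapping parallelize well: each input record can be mapped to triples independently, so records can simply be partitioned across worker processes. The record_to_triples function and its vocabulary are hypothetical stand-ins for applying mapping rules.

```python
# Minimal sketch of parallel RDF generation: not the RMLStreamer itself.
# record_to_triples is a hypothetical stand-in for applying mapping
# rules (e.g. RML) to one input record.
from multiprocessing import Pool

def record_to_triples(record: dict) -> list[str]:
    # Hypothetical mapping: one input record -> N-Triples lines.
    subj = f"<http://example.org/person/{record['id']}>"
    return [f"{subj} <http://xmlns.com/foaf/0.1/name> \"{record['name']}\" ."]

def generate(records: list[dict], workers: int = 4) -> list[str]:
    # Mapping is independent per record, so records can be
    # partitioned across worker processes and mapped in parallel.
    with Pool(workers) as pool:
        triples = []
        for chunk in pool.imap_unordered(record_to_triples, records, chunksize=256):
            triples.extend(chunk)
        return triples

if __name__ == "__main__":
    data = [{"id": i, "name": f"user{i}"} for i in range(10_000)]
    print(len(generate(data)), "triples generated")
```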

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL appears to be the technology that fulfills its requirements. However, every big data application has different data characteristics, and thus its data fits a different data model. This paper presents a feature and use-case analysis and comparison of the four main data models, namely document-oriented, key-value, graph, and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings, and possible use cases of the available big data file formats for Hadoop, which is the foundation of most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.
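    To make the four data models concrete, the illustration below shapes the same hypothetical order record for each model; all keys, names, and example systems are illustrative and not drawn from the paper.

```python
# Illustrative only: the same logical record shaped for the four NoSQL
# data models compared in the paper. All names and keys are hypothetical.

# Document-oriented (e.g. MongoDB): one self-contained nested document.
document = {"_id": "order-42", "customer": {"name": "Ada"},
            "items": [{"sku": "A1", "qty": 2}]}

# Key-value (e.g. Redis): an opaque value stored under a composite key.
key_value = {"order:42": '{"customer":"Ada","items":[["A1",2]]}'}

# Wide-column (e.g. Cassandra): a row key plus column families of columns.
wide_column = {"order-42": {"customer": {"name": "Ada"},
                            "items": {"A1:qty": 2}}}

# Graph (e.g. Neo4j): nodes and labeled edges, suited to traversal queries.
nodes = [("order-42", "Order"), ("ada", "Customer"), ("A1", "Product")]
edges = [("ada", "PLACED", "order-42"), ("order-42", "CONTAINS", "A1")]
```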

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework but target different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.

    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
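    As a reminder of how little the model asks of the application, here is a minimal single-process word-count sketch of the map, shuffle, and reduce phases; a real framework distributes these steps across a cluster and adds scheduling and fault tolerance.

```python
# Minimal single-process sketch of the MapReduce programming model
# (word count). A real framework runs map/reduce tasks on a cluster
# and handles data distribution, scheduling, and fault tolerance.
from collections import defaultdict

def map_phase(doc: str):
    # map: emit (key, value) pairs -- here, (word, 1).
    for word in doc.split():
        yield word.lower(), 1

def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    # reduce: fold all values for one key into a single result.
    return word, sum(counts)

def mapreduce(docs: list[str]) -> dict[str, int]:
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_phase(doc):   # map
            groups[key].append(value)       # shuffle: group values by key
    return dict(reduce_phase(k, v) for k, v in groups.items())  # reduce

print(mapreduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```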

    Mobile Edge Computing Empowers Internet of Things

    In this paper, we propose a Mobile Edge Internet of Things (MEIoT) architecture by leveraging fiber-wireless access technology, the cloudlet concept, and the software-defined networking framework. The MEIoT architecture brings computing and storage resources close to Internet of Things (IoT) devices in order to speed up IoT data sharing and analytics. Specifically, the IoT devices belonging to the same user are associated with a specific proxy Virtual Machine (VM) in a nearby cloudlet, which stores and analyzes their data in real time. Moreover, we introduce semantic and social IoT technology in the context of MEIoT to solve the interoperability and inefficient access control problems in the IoT system. In addition, we propose two dynamic proxy VM migration methods, one minimizing the end-to-end delay between proxy VMs and their IoT devices, the other minimizing the total on-grid energy consumption of the cloudlets. The performance of the proposed methods is validated via extensive simulations.
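    As a toy illustration of the delay-driven migration decision (not the paper's algorithm), the sketch below picks the cloudlet with the lowest average delay to a user's devices and migrates only when the saving outweighs an assumed migration cost; all values and the cost threshold are hypothetical.

```python
# Toy illustration of delay-driven proxy VM placement: choose the
# cloudlet minimizing average end-to-end delay to the user's IoT
# devices, migrating only if the gain exceeds a hypothetical cost.
def best_cloudlet(delays: dict[str, float]) -> str:
    # delays: cloudlet id -> average delay (ms) to this user's devices.
    return min(delays, key=delays.get)

def should_migrate(current: str, delays: dict[str, float],
                   cost_ms: float = 5.0) -> str:
    target = best_cloudlet(delays)
    # Migrate only if the delay saving exceeds the migration penalty.
    if delays[current] - delays[target] > cost_ms:
        return target
    return current

delays = {"cloudlet-a": 18.0, "cloudlet-b": 7.5, "cloudlet-c": 11.0}
print(should_migrate("cloudlet-a", delays))  # cloudlet-b
```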

    Structuring visual exploratory analysis of skill demand

    The analysis of increasingly large and diverse data for meaningful interpretation and question answering is handicapped by human cognitive limitations. Consequently, semi-automatic abstraction of complex data within structured information spaces becomes increasingly important if its knowledge content is to support intuitive, exploratory discovery. Exploration of skill demand is an area where regularly updated, multi-dimensional data may be exploited to assess capability within the workforce to manage the demands of the modern, technology- and data-driven economy. The knowledge derived may be employed by skilled practitioners in defining career pathways, to identify where, when, and how to update their skill sets in line with advancing technology and changing work demands. The same knowledge may also be used to identify the combination of skills essential in recruiting for new roles. To address the challenges inherent in exploring the complex, heterogeneous, dynamic data that feeds into such applications, we investigate the use of an ontology to guide the structuring of the information space, allowing individuals and institutions to interactively explore and interpret the dynamic skill demand landscape for their specific needs. As a test case, we consider the relatively new and highly dynamic field of Data Science, where insightful, exploratory data analysis and knowledge discovery are critical. We employ context-driven and task-centred scenarios to explore our research questions and to guide iterative design, development, and formative evaluation of our ontology-driven, visual exploratory discovery and analysis approach, measuring where it adds value to users' analytical activity. Our findings reinforce the potential of our approach and point to future paths to build on.
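    As a hedged miniature of ontology-guided structuring (not the authors' ontology), the sketch below encodes a few skill relations in RDF with rdflib and retrieves the skills linked to a role; the vocabulary and URIs are invented for the example.

```python
# Hypothetical miniature of an ontology-structured skill space, using
# rdflib. The vocabulary (ex:Role, ex:requiresSkill) is invented here,
# not taken from the paper's ontology.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/skills#")
g = Graph()
g.add((EX.DataScientist, RDF.type, EX.Role))
g.add((EX.DataScientist, EX.requiresSkill, EX.MachineLearning))
g.add((EX.DataScientist, EX.requiresSkill, EX.DataVisualisation))

# Which skills does the Data Scientist role demand?
q = """
SELECT ?skill WHERE {
  <http://example.org/skills#DataScientist>
      <http://example.org/skills#requiresSkill> ?skill .
}
"""
for row in g.query(q):
    print(row.skill)
```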

    Improving lifecycle query in integrated toolchains using linked data and MQTT-based data warehousing

    The development of increasingly complex IoT systems requires large engineering environments. These environments generally consist of tools from different vendors that are not necessarily well integrated with each other. In order to automate various analyses, queries across resources from multiple tools have to be executed in parallel with the engineering activities. In this paper, we identify the necessary requirements for such a query capability and evaluate different architectures against these requirements. We propose an improved lifecycle query architecture that builds upon the existing Tracked Resource Set (TRS) protocol and complements it with the MQTT messaging protocol in order to keep the data in the warehouse updated in real time. As part of a case study focusing on the development of an IoT automated warehouse, this architecture was implemented for a toolchain integrated using RESTful microservices and linked data.

    Comment: 12 pages, workshop
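    A minimal sketch of the MQTT side of such a warehouse updater is shown below, using the paho-mqtt client (1.x callback API). The topic name, JSON payload shape, and broker address are assumptions, and applying a TRS change event to the warehouse is reduced to a placeholder.

```python
# Sketch of keeping a lifecycle-data warehouse updated in near real time
# via MQTT (paho-mqtt 1.x callback API). The topic, payload shape, and
# broker address are assumptions; warehouse updating is a placeholder.
import json
import paho.mqtt.client as mqtt

WAREHOUSE = {}  # resource URI -> latest representation (stand-in store)

def apply_change(event: dict) -> None:
    # Placeholder for updating the warehouse from a TRS change event.
    WAREHOUSE[event["uri"]] = event.get("resource")

def on_connect(client, userdata, flags, rc):
    client.subscribe("trs/change-events")  # hypothetical topic name

def on_message(client, userdata, msg):
    apply_change(json.loads(msg.payload))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.example.org", 1883, 60)  # hypothetical broker
client.loop_forever()
```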