
    Impliance: A Next Generation Information Management Appliance

    "…ably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems?" In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely the ideas of: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data - unstructured as well as structured - in a uniform way; (c) achieving scale-out by exploiting simple, massively parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises.
    Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works, and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA
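    The paper itself publishes no code, but idea (b) - managing structured records and free text through one interface - can be pictured with a small sketch: every item, whether a database row or a text document, is flattened into tokens held in a single inverted index, so one keyword-style query spans both kinds of data. The index layout, item names, and query semantics below are illustrative assumptions, not Impliance's actual design.

```python
from collections import defaultdict

# One inverted index over both structured rows and unstructured text:
# term -> set of item ids.  An illustrative toy, not Impliance's design.
index = defaultdict(set)
items = {}

def ingest(item_id, item):
    """Store an item and index it uniformly.

    Structured data arrives as a dict of fields; unstructured data as a string.
    Both are flattened to lowercase tokens before indexing.
    """
    items[item_id] = item
    if isinstance(item, dict):                      # structured record
        tokens = [str(v) for v in item.values()]
    else:                                           # unstructured text
        tokens = item.split()
    for token in tokens:
        index[token.lower()].add(item_id)

def query(*terms):
    """Return ids of items matching all terms, regardless of data type."""
    result = None
    for term in terms:
        hits = index.get(term.lower(), set())
        result = hits if result is None else result & hits
    return sorted(result or set())

ingest("row-1", {"customer": "Acme", "city": "Berlin", "amount": 1200})
ingest("doc-1", "Support ticket from Acme about invoice 1200")
print(query("acme", "1200"))  # ['doc-1', 'row-1'] - one query over both kinds of data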

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
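    As a concrete illustration of the programming model this survey builds on, the sketch below expresses the canonical word-count job as user-supplied map and reduce functions, with a minimal single-process driver standing in for the framework's shuffle phase. The driver is a toy assumption for illustration only, not part of any system surveyed here.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

# User-supplied map function: one input record -> zero or more (key, value) pairs.
def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    for word in line.split():
        yield word.lower(), 1

# User-supplied reduce function: one key plus all its values -> one output pair.
def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    return word, sum(counts)

# Toy in-process "framework": applies map, groups by key (the shuffle), applies reduce.
def run_job(records: Iterable[str]) -> dict:
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_job(lines))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

    A real MapReduce framework distributes exactly these two user functions across a cluster and handles the grouping, scheduling, and fault tolerance that the toy driver glosses over.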

    Distributed and scalable parsing solution for telecom network data

    The growing usage of mobile devices and the introduction of 5G networks have increased the significance of network data for the telecom business. The success of telecom organizations can depend on employing efficient data engineering techniques to transform raw network data into useful information through analytics and machine learning (ML). Elisa Oyj, a Finnish telecommunications company, receives massive amounts of network data from network equipment manufactured by various vendors. The effectiveness of data analytics depends on efficient data engineering processes. This thesis presents a scalable data parsing solution that leverages Spark, a distributed programming framework, to parallelize the parsing routines of an existing parsing solution. We design and deploy this solution as a component of the organization's data engineering pipeline to enable the automation of data-centric operations. Experimental results indicate that the efficiency of the proposed solution depends heavily on the distribution of individual file sizes. The proposed parsing solution demonstrates reliability, scalability, and speed during empirical evaluation, processing 24 hours of network data within 3 hours. The main outcome of the project is an optimized setup with the minimum number of data partitions needed to ensure zero failures and thus the minimum execution time. A smaller execution time leads to lower costs for the continuously running infrastructure provisioned on the cloud
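    The thesis implementation is not reproduced here, but the core idea - distributing an existing per-file parsing routine over a Spark cluster and controlling the number of partitions to balance a skewed file-size distribution - can be sketched roughly as below. The file listing, the parse_file placeholder, the storage paths, and the partition count are assumptions made for illustration, not the thesis's actual code.

```python
from pyspark.sql import Row, SparkSession

def parse_file(path: str):
    """Placeholder for the existing single-node parsing routine.

    In the real pipeline this would decode one vendor-specific network-data
    file into structured records; here it just returns a stub record.
    """
    return [Row(source_file=path, parsed=True)]

spark = SparkSession.builder.appName("network-data-parsing").getOrCreate()
sc = spark.sparkContext

# Paths of the raw files to parse, e.g. listed from object storage (assumed names).
file_paths = [f"s3a://bucket/raw/file-{i:04d}.bin" for i in range(1000)]

# The number of partitions is the tuning knob discussed in the abstract: too few
# partitions concentrate large files on single executors and cause failures,
# too many add scheduling overhead.
num_partitions = 64

parsed = (
    sc.parallelize(file_paths, num_partitions)   # distribute the file paths
      .flatMap(parse_file)                       # run the existing parser on each file
)

df = spark.createDataFrame(parsed)               # hand parsed records to Spark SQL
df.write.mode("overwrite").parquet("s3a://bucket/parsed/")  # persist for analytics
spark.stop()
```

    Choosing num_partitions is the optimization the abstract refers to: the smallest partition count that still avoids executor failures minimizes both execution time and cloud cost.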

    Management and Security of IoT systems using Microservices

    Devices that assist the user with some task or help them make an informed decision are called smart devices. A network of such devices connected to the internet is collectively called the Internet of Things (IoT). The applications of IoT are expanding exponentially and are becoming a part of our day-to-day lives. The rise of IoT has led to new security and management issues. In this project, we propose a solution for some major problems faced by IoT devices, including the complexity caused by heterogeneous platforms and the lack of IoT device monitoring for security and fault tolerance. We aim to solve these issues with a microservice architecture. We build a data pipeline in which IoT devices send data through Kafka, a messaging platform, and we monitor the devices using the collected data by building real-time dashboards and a machine learning model that gives better insights into the data. As a proof of concept, we test the proposed solution on a heterogeneous cluster that includes Raspberry Pis and IoT devices from different vendors. We validate our design by presenting some simple experimental results
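    The report's pipeline code is not included here, but the device-side half of the described pipeline - an IoT node periodically publishing readings to a Kafka topic that the dashboards and the ML model consume - might look roughly like the following sketch using the kafka-python client. The broker address, topic name, and reading format are assumptions for illustration.

```python
import json
import time

from kafka import KafkaProducer  # kafka-python client

# Broker address and topic name are illustrative assumptions.
producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def read_sensor() -> dict:
    """Placeholder for a real sensor read on the Raspberry Pi / IoT device."""
    return {"device_id": "rpi-01", "temperature_c": 21.7, "timestamp": time.time()}

while True:
    reading = read_sensor()
    # Each reading becomes one message on a shared telemetry topic; the
    # monitoring dashboards and the ML model consume from the same topic.
    producer.send("iot-telemetry", value=reading)
    producer.flush()
    time.sleep(10)
```

    Keeping every device on one uniform message format like this is what lets heterogeneous hardware feed the same downstream monitoring microservices.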

    Enhancing Big Data Warehousing and Analytics for Spatio-Temporal Massive Data

    The increasing amount of data generated by earth observation missions like Copernicus and NASA Earth Data, and by climate stations, is overwhelming. Every day, terabytes of data are collected from these sources for different environmental applications. This massive amount of data must therefore be effectively managed and processed to support decision-makers. In this paper, we propose an information system based on a low-latency spatio-temporal data warehouse that aims to improve drought monitoring analytics and to support the decision-making process. The proposed framework consists of four main modules: (1) data collection, (2) data preprocessing, (3) data loading and storage, and (4) visualization and interpretation. The data are heterogeneous and collected from multiple sources, such as remote sensing sensors, biophysical sensors, and climate sensors, which allows us to study drought along different dimensions. Experiments were carried out on a real case of drought monitoring in China between 2000 and 2020
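    To make the four-module structure concrete, the skeleton below strings together collection, preprocessing, loading, and visualization steps for one batch of observations. The sample fields, the simple NDVI-anomaly indicator, and the parquet storage call are simplified assumptions for illustration, not the authors' actual implementation.

```python
import pandas as pd

# (1) Data collection: placeholder for pulling remote-sensing, biophysical,
# and climate observations from their respective sources.
def collect() -> pd.DataFrame:
    return pd.DataFrame({
        "region": ["A", "A", "B", "B"],
        "month": ["2020-06", "2020-07", "2020-06", "2020-07"],
        "ndvi": [0.41, 0.28, 0.55, 0.52],      # vegetation index (illustrative values)
        "rainfall_mm": [12.0, 3.5, 40.0, 35.0],
    })

# (2) Preprocessing: drop gaps and derive a simple drought indicator (here an
# NDVI anomaly against each region's mean - an assumption, not the paper's index).
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()
    df["ndvi_anomaly"] = df["ndvi"] - df.groupby("region")["ndvi"].transform("mean")
    return df

# (3) Loading and storage: stand-in for writing to the spatio-temporal warehouse
# (requires a parquet engine such as pyarrow).
def load(df: pd.DataFrame) -> None:
    df.to_parquet("drought_facts.parquet", index=False)

# (4) Visualization and interpretation: summarise the indicator per region and month.
def report(df: pd.DataFrame) -> None:
    print(df.pivot(index="month", columns="region", values="ndvi_anomaly"))

if __name__ == "__main__":
    batch = preprocess(collect())
    load(batch)
    report(batch)
```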

    Security Implications of Adopting a New Data Storage and Access Model in Big Data and Cloud Computing

    This article examines the security implications of using cloud computing and Big Data. It employs a mixed methodology of qualitative and quantitative research and takes a critical realist epistemological approach. The objective is to identify the components of a theory for predicting and explaining [1, 4] the security implications associated with adopting the services provided by cloud computing and Big Data. The integration of various information sources and the widespread use of computing across diverse fields have resulted in a significant increase in data volume, scale, quantity, and diversity. Consequently, data management, storage, retrieval, and access have undergone significant changes. The latest developments in IT have brought forth novel technologies such as cloud computing and Big Data. Big Data comprises technologies that rely on NoSQL (Not only SQL) databases, which allow data volumes, numbers, and types to grow on a large scale. The new NoSQL systems are seen as solutions for meeting the scalability requirements of large IT firms. Multiple open-source and commercial pay-as-you-go NoSQL offerings are available
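    As a small illustration of the schema flexibility that the article attributes to NoSQL systems, the snippet below stores two differently shaped records in the same MongoDB collection. The connection string, database, collection, and field names are assumptions for the example, not drawn from the article.

```python
from pymongo import MongoClient

# Connection details are illustrative assumptions.
client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Schema-less storage: documents in one collection need not share fields,
# which is what lets data volume and variety grow without up-front schema design.
events.insert_one({"user": "alice", "action": "login", "ip": "10.0.0.5"})
events.insert_one({"sensor": "s-17", "reading": 42.1, "tags": ["lab", "raw"]})

# A query only constrains the fields it names; documents of other shapes are ignored.
for doc in events.find({"action": "login"}):
    print(doc["user"], doc["ip"])
```

    This flexibility is exactly what raises the access-control and data-governance questions the article sets out to theorize, since records of very different sensitivity can end up side by side in one store.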