
    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL appears to be the technology that best fulfils its requirements. However, every big data application has different data characteristics, and its data therefore fits a different data model. This paper presents a feature and use-case analysis and comparison of the four main data models, namely document-oriented, key-value, graph, and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings, and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.
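
    To make the four data models concrete, here is a minimal sketch of how the same records might be shaped under each model. It is plain Python with no database drivers; the field names and the example systems named in the comments are illustrative assumptions, not taken from the paper.

        # Hypothetical illustration: one "user" record shaped for each of the
        # four NoSQL data models compared in the paper.

        # Document-oriented (e.g. MongoDB): nested, self-describing document.
        document = {
            "user_id": "u42",
            "name": "Ada",
            "orders": [{"order_id": "o1", "total": 19.99}],
        }

        # Key-value (e.g. Redis): opaque value behind a composite key.
        key_value = {"user:u42": '{"name": "Ada"}'}

        # Wide-column (e.g. Cassandra, HBase): row key plus column families.
        wide_column = {
            "u42": {
                "profile": {"name": "Ada"},
                "orders": {"o1:total": 19.99},
            }
        }

        # Graph (e.g. Neo4j): explicit nodes and typed edges.
        nodes = [("u42", {"kind": "user"}), ("o1", {"kind": "order"})]
        edges = [("u42", "PLACED", "o1")]

        print(document["orders"][0]["total"])   # 19.99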

    Change Management Systems for Seamless Evolution in Data Centers

    Revenue for data centers today depends heavily on the satisfaction of their enterprise customers. These customers often require various features to migrate their businesses and operations to the cloud. Thus, clouds today introduce new features at a swift pace to onboard new customers and to meet the needs of existing ones. This pace of innovation continues to grow super-linearly; e.g., Amazon deployed 1400 new features in 2017. However, such a rapid pace of evolution adds complexity both for users and for the cloud. Clouds struggle to keep up with the deployment speed, and users struggle to learn which features they need and how to use them. The pace of these evolutions has brought us to a tipping point: we can no longer use rules of thumb to deploy new features, and customers need help identifying which features they need. We have built two systems, Janus and Cherrypick, to address these problems. Janus helps data center operators roll out new changes to the data center network. It automatically adapts to the data center's topology, routing, traffic, and failure settings, and it reduces the risk of new deployments because operators can pick deployment strategies that are less likely to impact users' performance. Cherrypick finds near-optimal cloud configurations for big data analytics. It lets users search through the new machine types that clouds constantly introduce and find ones with near-optimal performance that meet their budget, and it adapts to new big-data frameworks and applications as well. As the pace of cloud innovation increases, it is critical to have tools that allow operators to deploy new changes as well as tools that enable users to achieve good performance at low cost. The tools and algorithms discussed in this thesis help accomplish these goals.
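
    The abstract does not describe Cherrypick's search algorithm, so the following is only a minimal sketch of the problem it solves: pick the cheapest configuration that meets a runtime budget. The machine types, prices, and the fake runtime model are all invented for illustration; the real system searches adaptively with far fewer trial runs than this exhaustive toy version.

        # Toy version of the configuration-search problem Cherrypick addresses.
        # All numbers below are invented for illustration.

        configs = [
            {"type": "m4.large",  "nodes": 8, "price_per_hr": 0.10},
            {"type": "m4.xlarge", "nodes": 4, "price_per_hr": 0.20},
            {"type": "c4.xlarge", "nodes": 4, "price_per_hr": 0.23},
        ]

        def run_job(cfg):
            """Stand-in for actually running the job and timing it (hours)."""
            return 2.0 / (cfg["nodes"] ** 0.8)  # fake runtime model

        def cost(cfg):
            """Total dollar cost of one run on this configuration."""
            return run_job(cfg) * cfg["nodes"] * cfg["price_per_hr"]

        budget_hours = 1.0
        feasible = [c for c in configs if run_job(c) <= budget_hours]
        best = min(feasible, key=cost)
        print(best["type"], round(cost(best), 3))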

    Revisiting Ralph Sprague’s Framework for Developing Decision Support Systems

    Ralph H. Sprague Jr. was a leader in the MIS field who helped develop the conceptual foundation for decision support systems (DSS). In this paper, I pay homage to Sprague and his DSS contributions. I take a personal perspective based on my years of working with Sprague. I explore the history of DSS and its evolution. I also present and discuss Sprague's DSS development framework, with its dialog, data, and models (DDM) paradigm and characteristics. At its core, the development framework remains valid in today's world of business intelligence and big data analytics. I present and discuss a contemporary reference architecture for business intelligence and analytics (BI/A) in the context of Sprague's DSS development framework. The practice of decision support continues to evolve and can be described by a maturity model with DSS, enterprise data warehousing, real-time data warehousing, big data analytics, and the emerging cognitive generation as successive generations. I use a DSS perspective to describe, and give examples of, what the forthcoming cognitive generation will bring.

    Requirements engineering: foundation for software quality


    Implementation of a data virtualization layer applied to insurance data

    This work focuses on the introduction of a data virtualization layer that reads and consolidates data from heterogeneous sources (a Hadoop system, a data mart, and a data warehouse) and provides a single point of data access for all data consumers.
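
    The abstract gives the architecture but not the implementation, so the sketch below is only a minimal illustration of the idea: one query interface that fans out to pluggable backends and consolidates the results. The adapter classes, table names, and returned rows are hypothetical stand-ins for the sources the abstract mentions.

        # Minimal sketch of a data virtualization layer: a single point of
        # access that routes queries to heterogeneous backends.

        class Source:
            def query(self, table):          # each backend knows its own access path
                raise NotImplementedError

        class HadoopSource(Source):
            def query(self, table):
                return [{"src": "hadoop", "table": table}]

        class WarehouseSource(Source):
            def query(self, table):
                return [{"src": "warehouse", "table": table}]

        class VirtualizationLayer:
            """Routes by table name and consolidates rows from all backends."""
            def __init__(self, routes):
                self.routes = routes         # table name -> list of backends

            def query(self, table):
                rows = []
                for source in self.routes.get(table, []):
                    rows.extend(source.query(table))
                return rows

        layer = VirtualizationLayer({"claims": [HadoopSource(), WarehouseSource()]})
        print(layer.query("claims"))  # rows consolidated from both backends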

    Analyzing data in the Internet of Things

    The Internet of Things (IoT) is growing fast. According to the International Data Corporation (IDC), more than 28 billion things will be connected to the Internet by 2020, from smartwatches and other wearables to smart cities, smart homes, and smart cars. This O'Reilly report dives into the IoT industry through a series of illuminating talks and case studies presented at the 2015 Strata + Hadoop World Conferences in San Jose, New York, and Singapore. Among the topics in this report, you'll explore the use of sensors to generate predictions, using data to create predictive maintenance applications, and modeling the smart and connected city of the future with Kafka and Spark. Case studies include:
    • Using Spark Streaming for proactive maintenance and accident prevention in railway equipment
    • Monitoring subway and expressway traffic in Singapore using telco data
    • Managing emergency vehicles through situation awareness of traffic and weather in the smart city pilot in Oulu, Finland
    • Capturing and routing device-based health data to reduce cardiovascular disease
    • Using data analytics to reduce human space flight risk in NASA's Orion program
    The report concludes with a discussion of ethics related to algorithms that control things in the IoT. You'll explore decisions related to IoT data, as well as opportunities to influence the moral implications involved in using the IoT.
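
    Several of these case studies rest on the same Kafka-plus-Spark pattern, so here is a minimal sketch of it: read sensor events from a Kafka topic with Spark Structured Streaming and flag readings above a threshold. The broker address, topic name, schema, and threshold are all assumptions for illustration, and it requires PySpark with the Kafka connector package on the classpath; it is not code from the report.

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, from_json
        from pyspark.sql.types import StructType, StringType, DoubleType

        spark = SparkSession.builder.appName("iot-monitor").getOrCreate()

        # Assumed event schema for illustration.
        schema = (StructType()
                  .add("device_id", StringType())
                  .add("temperature", DoubleType()))

        events = (spark.readStream
                  .format("kafka")
                  .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
                  .option("subscribe", "sensor-events")              # assumed topic
                  .load()
                  .select(from_json(col("value").cast("string"), schema).alias("e"))
                  .select("e.*"))

        # Flag overheating devices; the threshold is an assumption.
        alerts = events.filter(col("temperature") > 80.0)

        query = (alerts.writeStream
                 .format("console")
                 .outputMode("append")
                 .start())
        query.awaitTermination()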

    Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand

    Big Data frameworks have received tremendous attention from industry and academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer efficient solutions for analysing large-scale datasets on a Hadoop cluster. Spark has become one of the most popular large-scale data processing engines because of its speed, low-latency in-memory computation, and advanced analytics. Spark's computational performance depends heavily on the selection of suitable parameters, and configuring these parameters is a challenging task. Although Spark provides default parameters and can deploy applications without much effort, a significant drawback of the defaults is that they are not always best for cluster performance. A major limitation of existing models for Spark performance prediction is that they require either large input data or time-consuming system configuration. An analytical model could therefore be a better solution for predicting performance and for establishing appropriate job configurations. This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed based on serial boundaries for a certain arrangement of executors and size of the data. To evaluate cluster performance, various HiBench workloads were used, and each workload's empirical data were fitted to the models for performance prediction analysis. The developed models were benchmarked against existing models such as Amdahl's law, Gustafson's law, ERNEST, and machine learning. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and that they outperform the accuracy of machine learning models when extrapolating predictions.
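
    The thesis's own 2D-Plate and Fully-Connected Node models are not reproduced in the abstract; as an illustration of the baseline family it benchmarks against, the sketch below fits Amdahl's law, T(n) = T1 * (s + (1 - s) / n), to runtime measurements and extrapolates to a larger executor count. The measurements are invented and the fit requires numpy and scipy.

        import numpy as np
        from scipy.optimize import curve_fit

        def amdahl(n, t1, s):
            """Runtime on n executors given single-executor time t1
            and serial fraction s."""
            return t1 * (s + (1.0 - s) / n)

        executors = np.array([1, 2, 4, 8])
        runtimes  = np.array([100.0, 55.0, 32.0, 21.0])   # invented measurements

        # Least-squares fit of (t1, s) to the measured runtimes.
        (t1, s), _ = curve_fit(amdahl, executors, runtimes, p0=[100.0, 0.1])

        print(f"serial fraction ~ {s:.2f}")
        print(f"predicted runtime on 16 executors ~ {amdahl(16, t1, s):.1f}s")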