545 research outputs found

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Full text link
    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies for optimized solution to a specific real world problem, big data system are not an exception to any such rule. As far as the storage aspect of any big data system is concerned, the primary facet in this regard is a storage infrastructure and NoSQL seems to be the right technology that fulfills its requirements. However, every big data application has variable data characteristics and thus, the corresponding data fits into a different data model. This paper presents feature and use case analysis and comparison of the four main data models namely document oriented, key value, graph and wide column. Moreover, a feature analysis of 80 NoSQL solutions has been provided, elaborating on the criteria and points that a developer must consider while making a possible choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings forth second facet of big data storage, big data file formats, into picture. The second half of the research paper compares the advantages, shortcomings and possible use cases of available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage and its challenges and future prospects have also been discussed

    An approach to solve a Small File problem in Hadoop by using Dynamic Merging and Indexing Scheme

    Get PDF
    Size of the data application in now’s enterprises has been spreading at a excessive frequent from last few donkey’s years. Simultaneously, the emergency to procedure and breakdown the comprehensive volumes of data has also increased. Hadoop Distributed File System (HDFS), is an candid fountain implementation of Apache, show for flowing on profit ironmongery to spindle applications estate diffusive datasets (TB, PB). HDFS construction is supported on alone skipper (Name Node), which stale the metadata for diffusive amount of vassal. To get highest ability, Name Node supply all of the metadata in its RAM. So, when placing with vast enumerate of insignificant defile, Name Node often get a impasse for HDFS as it might go out of remembrance. Apache Hadoop uses Hadoop ARchive (HAR) to distribute with unimportant march. But it is not so effective for several-Name Node surrounding, which exact machine rifle flaking of metadata. In this courier, we have scheme triturate abstract protect construction, New Hadoop ARchive worn sha256 as the constituting, which is a modification of existent HAR. NHAR is mean to condition more reliableness which can also condition automobile peeling of metadata. Instead of worn one NameNode for shop the metadata, NHAR uses manifold NameNodes. Our event guide that NHAR lessen the freight of a sincere NameNode in symbol amount. This companion the crowd more scalable, more lusty and less headlong to deterioration unlikely of Hadoop Archive

    MOLNs: A cloud platform for interactive, reproducible and scalable spatial stochastic computational experiments in systems biology using PyURDME

    Full text link
    Computational experiments using spatial stochastic simulations have led to important new biological insights, but they require specialized tools, a complex software stack, as well as large and scalable compute and data analysis resources due to the large computational cost associated with Monte Carlo computational workflows. The complexity of setting up and managing a large-scale distributed computation environment to support productive and reproducible modeling can be prohibitive for practitioners in systems biology. This results in a barrier to the adoption of spatial stochastic simulation tools, effectively limiting the type of biological questions addressed by quantitative modeling. In this paper, we present PyURDME, a new, user-friendly spatial modeling and simulation package, and MOLNs, a cloud computing appliance for distributed simulation of stochastic reaction-diffusion models. MOLNs is based on IPython and provides an interactive programming platform for development of sharable and reproducible distributed parallel computational experiments

    Research on High-performance and Scalable Data Access in Parallel Big Data Computing

    Get PDF
    To facilitate big data processing, many dedicated data-intensive storage systems such as Google File System(GFS), Hadoop Distributed File System(HDFS) and Quantcast File System(QFS) have been developed. Currently, the Hadoop Distributed File System(HDFS) [20] is the state-of-art and most popular open-source distributed file system for big data processing. It is widely deployed as the bedrock for many big data processing systems/frameworks, such as the script-based pig system, MPI-based parallel programs, graph processing systems and scala/java-based Spark frameworks. These systems/applications employ parallel processes/executors to speed up data processing within scale-out clusters. Job or task schedulers in parallel big data applications such as mpiBLAST and ParaView can maximize the usage of computing resources such as memory and CPU by tracking resource consumption/availability for task assignment. However, since these schedulers do not take the distributed I/O resources and global data distribution into consideration, the data requests from parallel processes/executors in big data processing will unfortunately be served in an imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes are caused because a). unlike conventional parallel file system using striping policies to evenly distribute data among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as chunk or block file, with several copies based on a relative random policy, which can result in an uneven data distribution among storage nodes; b). based on the data retrieval policy in HDFS, the more data a storage node contains, the higher the probability that the storage node could be selected to serve the data. Therefore, on the nodes serving multiple chunk files, the data requests from different processes/executors will compete for shared resources such as hard disk head and network bandwidth. Because of this, the makespan of the entire program could be significantly prolonged and the overall I/O performance will degrade. The first part of my dissertation seeks to address aspects of these problems by creating an I/O middleware system and designing matching-based algorithms to optimize data access in parallel big data processing. To address the problem of remote data movement, we develop an I/O middleware system, called SLAM, which allows MPI-based analysis and visualization programs to benefit from locality read, i.e, each MPI process can access its required data from a local or nearby storage node. This can greatly improve the execution performance by reducing the amount of data movement over network. Furthermore, to address the problem of imbalanced data access, we propose a method called Opass, which models the data read requests that are issued by parallel applications to cluster nodes as a graph data structure where edges weights encode the demands of load capacity. We then employ matching-based algorithms to map processes to data to achieve data access in a balanced fashion. The final part of my dissertation focuses on optimizing sub-dataset analyses in parallel big data processing. Our proposed methods can benefit different analysis applications with various computational requirements and the experiments on different cluster testbeds show their applicability and scalability

    Concurrent software architectures for exploratory data analysis

    Get PDF
    Decades ago, increased volume of data made manual analysis obsolete and prompted the use of computational tools with interactive user interfaces and rich palette of data visualizations. Yet their classic, desktop-based architectures can no longer cope with the ever-growing size and complexity of data. Next-generation systems for explorative data analysis will be developed on client–server architectures, which already run concurrent software for data analytics but are not tailored to for an engaged, interactive analysis of data and models. In explorative data analysis, the key is the responsiveness of the system and prompt construction of interactive visualizations that can guide the users to uncover interesting data patterns. In this study, we review the current software architectures for distributed data analysis and propose a list of features to be included in the next generation frameworks for exploratory data analysis. The new generation of tools for explorative data analysis will need to address integrated data storage and processing, fast prototyping of data analysis pipelines supported by machine-proposed analysis workflows, preemptive analysis of data, interactivity, and user interfaces for intelligent data visualizations. The systems will rely on a mixture of concurrent software architectures to meet the challenge of seamless integration of explorative data interfaces at client site with management of concurrent data mining procedures on the servers

    Optimization of Real-World MapReduce Applications With Flame-MR: Practical Use Cases

    Get PDF
    [Abstract] Apache Hadoop is a widely used MapReduce framework for storing and processing large amounts of data. However, it presents some performance issues that hinder its utilization in many practical use cases. Although existing alternatives like Spark or Hama can outperform Hadoop, they require to rewrite the source code of the applications due to API incompatibilities. This paper studies the use of Flame-MR, an in-memory processing architecture for MapReduce applications, to improve the performance of real-world use cases in a transparent way while keeping application compatibility. Flame-MR adapts to the characteristics of the workloads, managing efficiently the use of custom data formats and iterative computations, while also reducing workload imbalance. The experimental evaluation, conducted in high performance clusters and the Microsoft Azure cloud, shows a clear outperformance of Flame-MR over Hadoop. In most cases, Flame-MR reduces the execution times by more than a half

    Concurrent software architectures for exploratory data analysis

    Get PDF
    Decades ago, increased volume of data made manual analysis obsolete and prompted the use of computational tools with interactive user interfaces and rich palette of data visualizations. Yet their classic, desktop-based architectures can no longer cope with the ever-growing size and complexity of data. Next-generation systems for explorative data analysis will be developed on client–server architectures, which already run concurrent software for data analytics but are not tailored to for an engaged, interactive analysis of data and models. In explorative data analysis, the key is the responsiveness of the system and prompt construction of interactive visualizations that can guide the users to uncover interesting data patterns. In this study, we review the current software architectures for distributed data analysis and propose a list of features to be included in the next generation frameworks for exploratory data analysis. The new generation of tools for explorative data analysis will need to address integrated data storage and processing, fast prototyping of data analysis pipelines supported by machine-proposed analysis workflows, preemptive analysis of data, interactivity, and user interfaces for intelligent data visualizations. The systems will rely on a mixture of concurrent software architectures to meet the challenge of seamless integration of explorative data interfaces at client site with management of concurrent data mining procedures on the servers
    corecore