
    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions in big data systems include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL appears to be the right technology to fulfill its requirements. However, every big data application has its own data characteristics, and consequently its data fits a different data model. This paper presents a feature and use-case analysis and comparison of the four main data models, namely document oriented, key value, graph, and wide column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider while making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings, and possible use cases of the available big data file formats for Hadoop, which is the foundation of most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.
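
    To make the four data models concrete, the following minimal Python sketch (not from the paper; the record fields and values are invented for illustration) shows how one and the same user record might be shaped under the document-oriented, key-value, wide-column, and graph models.

        # One user record expressed under the four main NoSQL data models.
        # All field names and values here are hypothetical examples.

        # Document oriented (e.g. MongoDB-style): a self-contained nested document.
        document = {"_id": "u42", "name": "Ada", "address": {"city": "Paris"}}

        # Key value (e.g. Redis-style): an opaque value stored behind a single key.
        key_value = {"user:u42": '{"name": "Ada", "city": "Paris"}'}

        # Wide column (e.g. Cassandra/HBase-style): row key -> column family -> columns.
        wide_column = {"u42": {"profile": {"name": "Ada"},
                               "location": {"city": "Paris"}}}

        # Graph (e.g. Neo4j-style): nodes plus explicit, queryable relationships.
        graph = {
            "nodes": [{"id": "u42", "label": "User", "name": "Ada"},
                      {"id": "c1", "label": "City", "name": "Paris"}],
            "edges": [{"from": "u42", "to": "c1", "type": "LIVES_IN"}],
        }

    Which shape fits best depends on the access pattern: the document and wide-column forms favor retrieving a whole record by key, while the graph form makes relationship traversals first-class.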

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations, which many research efforts have tackled in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that are based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
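
    To illustrate the programming model the survey is built around, here is a minimal in-process word-count sketch in Python (not from the article): the user supplies only a map and a reduce function, and the small driver stands in for the shuffle/sort machinery that a real MapReduce runtime distributes across a cluster.

        from collections import defaultdict

        # User-supplied map function: one input record -> list of (key, value) pairs.
        def map_fn(line):
            return [(word, 1) for word in line.split()]

        # User-supplied reduce function: (key, all values for that key) -> result.
        def reduce_fn(word, counts):
            return word, sum(counts)

        def mapreduce(records, map_fn, reduce_fn):
            # Shuffle/sort phase: group all intermediate values by key.
            groups = defaultdict(list)
            for record in records:
                for key, value in map_fn(record):
                    groups[key].append(value)
            # Reduce phase: one call per distinct key.
            return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

        if __name__ == "__main__":
            lines = ["the quick brown fox", "the lazy dog"]
            print(mapreduce(lines, map_fn, reduce_fn))
            # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}

    Everything the framework hides from the application (data distribution, scheduling, fault tolerance) lives in the `mapreduce` driver here; the application code is only the two pure functions.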

    MPI-Vector-IO: Parallel I/O and Partitioning for Geospatial Vector Data

    In recent times, geospatial datasets have been growing in size, complexity, and heterogeneity. High-performance systems are needed to analyze such data and produce actionable insights efficiently. For polygonal (a.k.a. vector) datasets, operations such as I/O, data partitioning, communication, and load balancing become challenging in a cluster environment. In this work, we present MPI-Vector-IO, a parallel I/O library that we have designed using MPI-IO specifically for partitioning and reading irregular vector data formats such as Well-Known Text. It makes MPI aware of spatial data and spatial primitives, and provides support for spatial data types embedded within collective computation and communication using the MPI message-passing library. These abstractions, along with parallel I/O support, are useful for parallel Geographic Information System (GIS) application development on HPC platforms.
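
    The core idea, partitioning a large WKT file into byte ranges and reading them collectively, can be sketched with mpi4py. This is a simplified illustration of the approach, not MPI-Vector-IO's actual code (the real library additionally aligns splits to record boundaries and builds spatial MPI datatypes), and the input file name is hypothetical.

        # Sketch: each rank collectively reads its byte-range partition of a WKT file.
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, nprocs = comm.Get_rank(), comm.Get_size()

        fh = MPI.File.Open(comm, "polygons.wkt", MPI.MODE_RDONLY)  # hypothetical input
        fsize = fh.Get_size()

        # Even byte-range partitioning; the last rank absorbs the remainder.
        chunk = fsize // nprocs
        offset = rank * chunk
        length = fsize - offset if rank == nprocs - 1 else chunk

        buf = bytearray(length)
        fh.Read_at_all(offset, buf)      # collective read: all ranks participate
        fh.Close()

        # Keep only whole lines; geometries cut at a split boundary would need a
        # small exchange with neighboring ranks, omitted here.
        lines = buf.decode(errors="ignore").splitlines()
        print(f"rank {rank}: read {length} bytes, ~{len(lines)} WKT records")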

    A Survey Paper on Implementing Service Oriented Architecture for Data Mining

    A web service exposes an object or component on the web so that distributed applications on different platforms can communicate through a series of protocols. Web services provide a set of standard type systems, rules, techniques, and Internet service-oriented applications for communication between different platforms, different programming languages, and different types of systems, in order to achieve interoperability. This survey paper discusses the application of web services to data mining; we build a data mining model based on web services, and going forward it is possible to build a new data mining solution for security according to the prototype of a dynamic web-service-based data mining process system. DOI: 10.17762/ijritcc2321-8169.15079

    Parsing Large XES Files for Discovering Process Models: A Big Data Problem

    Process mining is a group of techniques for retrieving de-facto models from system traces. Discovery algorithms can obtain mathematical models by exploiting the information contained in lists of events of activities. Completeness of the traces is relevant to the accuracy of the final results, and noiseless traces are the ideal scenario. The performance of the algorithms is significantly reduced if the log files are not processed efficiently. XES is a logical model for process logs stored in data-centric XML files. In real processes, the sizes of the logs grow exponentially, so parsing XES files becomes a big data problem in real scenarios with dense traces. Lazy parsers and DOM models are not appropriate for scenarios with large volumes of data. We discuss this problem and how to use indexing techniques for retrieving information that is useful for process mining. An XES compression scheme is also discussed for reducing the index construction time.
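
    To show why streaming beats DOM-style parsing for large XES logs, here is a minimal Python sketch (an illustration of incremental parsing, not the paper's indexing scheme) that walks a log trace by trace and frees each trace as soon as it has been consumed. The concept:name key is the standard XES event-name attribute; the file name log.xes is hypothetical.

        import xml.etree.ElementTree as ET

        def iter_traces(path):
            """Yield one trace at a time as a list of event-name strings."""
            events = []
            for _, elem in ET.iterparse(path, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]          # strip XML namespace
                if tag == "event":
                    for attr in elem:
                        if attr.get("key") == "concept:name":
                            events.append(attr.get("value"))
                elif tag == "trace":
                    yield events
                    events = []
                    elem.clear()                           # free memory as we go

        for i, trace in enumerate(iter_traces("log.xes")):  # hypothetical file
            print(i, trace)

    Because each trace element is cleared after it is yielded, memory use stays roughly proportional to the longest trace rather than to the whole log, which is what makes the approach viable for dense, exponentially growing XES files.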

    Implementing Service Oriented Architecture for Data Mining

    With web technology, the data on the Internet has become increasingly large and complex, yet not all of it is needed by users, and the data available on the web is not always useful or knowledgeable information. Hence, web data mining is necessary to meet this demand. Web data mining can extract unstructured, undiscovered, and potentially useful information and knowledge from the incomplete, noisy, ambiguous, random, practical application-related data of the WWW. It is a newly emerging commercial information/data mining technology. Its main characteristic is to extract key data that supports business decision making from business databases through extraction, conversion, analysis, and other transaction models. A web service is deployed on the web as an object or component to achieve a distributed application software platform through a series of protocols. The web service platform provides a set of standard type systems, rules, techniques, and Internet service-oriented applications for communication between different platforms, different programming languages, and different types of systems to achieve interoperability. This paper gives an actual and practical application of web services for data mining: we build a data mining model based on web services, and going forward it is possible to implement a new data mining solution for security configuration. This has been achieved with the use of a prototype of a dynamic web-service-based data mining system. DOI: 10.17762/ijritcc2321-8169.15079
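
    To make the architecture concrete, the sketch below exposes a trivial data mining operation (frequent-item counting) as a JSON-over-HTTP web service using only the Python standard library. The endpoint path, port, and payload shape are hypothetical illustrations of the service-oriented pattern, not the paper's prototype.

        import json
        from collections import Counter
        from http.server import BaseHTTPRequestHandler, HTTPServer

        class MiningHandler(BaseHTTPRequestHandler):
            def do_POST(self):
                if self.path != "/mine/frequent-items":   # hypothetical endpoint
                    self.send_error(404)
                    return
                length = int(self.headers.get("Content-Length", 0))
                # Request body: a JSON list of transactions, e.g. [["a","b"],["a"]].
                records = json.loads(self.rfile.read(length))
                counts = Counter(item for record in records for item in record)
                body = json.dumps(counts.most_common()).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)

        if __name__ == "__main__":
            HTTPServer(("localhost", 8080), MiningHandler).serve_forever()

    A client on any platform or in any language can POST a JSON list of transactions and receive item frequencies back, which is precisely the interoperability property the paper attributes to web services.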