294 research outputs found

    The Family of MapReduce and Large Scale Data Processing Systems

    Full text link
    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author

    Concept Based Semantic Search Engine

    Get PDF
    In the current day and age, search engines are the most relied on and critical ways to find out information on the World Wide Web (W3). With the ushering in of Big Data, traditional search engines are becoming inept and inadequate at dishing out relevant pages. It has become increasingly difficult to locate meaningful results from the mind boggling list of returns typical of returned search queries. Keywords, often times, alone cannot capture the intended concept with high precision. These and associated issues with the current search engines call for a more powerful and holistic search engine capability. The current project presents a new approach to resolving this widely relevant problem - a concept based search engine. It is known that a collection of concepts naturally forms a polyhedron. Combinatorial topology is, thus, used to manipulate the polyhedron of concepts that are mined from W3. Based on this triangulated polyhedron, the concepts are clustered together based on primitive concepts that are geometrically, simplexes of maximal dimensions. Such clustering is different from conventional clustering since the proposed model may have overlapping. Based on such clustering, the search results can then be categorized and users allowed to select a category more apt to their needs. The results displayed are based on aforementioned categorization thereby leading to more sharply gathered and, thus, semantically related relevant information

    Temporal RDF(S) Data Storage and Query with HBase

    Get PDF
    Resource Description Framework (RDF) is a metadata model recommended by World Wide Web Consortium (W3C) for describing the Web resources. With the arrival of the era of Big Data, very large amounts of RDF data are continuously being created and need to be stored for management. The traditional centralized RDF storage models cannot meet the need of largescale RDF data storage. Meanwhile, the importance of temporal information management and processing has been acknowledged by academia and industry. In this paper, we propose a storage model to store temporal RDF based on HBase. The proposed storage model applies the built-in time mechanism of HBase. Our experiments on LUBM dataset with temporal information added show that our storage model can store large temporal RDF data and obtain good query efficiency

    P-LUPOSDATE: Using Precomputed Bloom Filters to Speed Up SPARQL Processing in the Cloud

    Get PDF
    Increasingly data on the Web is stored in the form of Semantic Web data. Because of today's information overload, it becomes very important to store and query these big datasets in a scalable way and hence in a distributed fashion. Cloud Computing offers such a distributed environment with dynamic reallocation of computing and storing resources based on needs. In this work we introduce a scalable distributed Semantic Web database in the Cloud. In order to reduce the number of (unnecessary) intermediate results early, we apply bloom filters. Instead of computing bloom filters, a time-consuming task during query processing as it has been done traditionally, we precompute the bloom filters as much as possible and store them in the indices besides the data. The experimental results with data sets up to 1 billion triples show that our approach speeds up query processing significantly and sometimes even reduces the processing time to less than half

    SDSF : social-networking trust based distributed data storage and co-operative information fusion.

    Get PDF
    As of 2014, about 2.5 quintillion bytes of data are created each day, and 90% of the data in the world was created in the last two years alone. The storage of this data can be on external hard drives, on unused space in peer-to-peer (P2P) networks or using the more currently popular approach of storing in the Cloud. When the users store their data in the Cloud, the entire data is exposed to the administrators of the services who can view and possibly misuse the data. With the growing popularity and usage of Cloud storage services like Google Drive, Dropbox etc., the concerns of privacy and security are increasing. Searching for content or documents, from this distributed stored data, given the rate of data generation, is a big challenge. Information fusion is used to extract information based on the query of the user, and combine the data and learn useful information. This problem is challenging if the data sources are distributed and heterogeneous in nature where the trustworthiness of the documents may be varied. This thesis proposes two innovative solutions to resolve both of these problems. Firstly, to remedy the situation of security and privacy of stored data, we propose an innovative Social-based Distributed Data Storage and Trust based co-operative Information Fusion Framework (SDSF). The main objective is to create a framework that assists in providing a secure storage system while not overloading a single system using a P2P like approach. This framework allows the users to share storage resources among friends and acquaintances without compromising the security or privacy and enjoying all the benefits that the Cloud storage offers. The system fragments the data and encodes it to securely store it on the unused storage capacity of the data owner\u27s friends\u27 resources. The system thus gives a centralized control to the user over the selection of peers to store the data. Secondly, to retrieve the stored distributed data, the proposed system performs the fusion also from distributed sources. The technique uses several algorithms to ensure the correctness of the query that is used to retrieve and combine the data to improve the information fusion accuracy and efficiency for combining the heterogeneous, distributed and massive data on the Cloud for time critical operations. We demonstrate that the retrieved documents are genuine when the trust scores are also used while retrieving the data sources. The thesis makes several research contributions. First, we implement Social Storage using erasure coding. Erasure coding fragments the data, encodes it, and through introduction of redundancy resolves issues resulting from devices failures. Second, we exploit the inherent concept of trust that is embedded in social networks to determine the nodes and build a secure net-work where the fragmented data should be stored since the social network consists of a network of friends, family and acquaintances. The trust between the friends, and availability of the devices allows the user to make an informed choice about where the information should be stored using `k\u27 optimal paths. Thirdly, for the purpose of retrieval of this distributed stored data, we propose information fusion on distributed data using a combination of Enhanced N-grams (to ensure correctness of the query), Semantic Machine Learning (to extract the documents based on the context and not just bag of words and also considering the trust score) and Map Reduce (NSM) Algorithms. Lastly we evaluate the performance of distributed storage of SDSF using era- sure coding and identify the social storage providers based on trust and evaluate their trustworthiness. We also evaluate the performance of our information fusion algorithms in distributed storage systems. Thus, the system using SDSF framework, implements the beneficial features of P2P networks and Cloud storage while avoiding the pitfalls of these systems. The multi-layered encrypting ensures that all other users, including the system administrators cannot decode the stored data. The application of NSM algorithm improves the effectiveness of fusion since large number of genuine documents are retrieved for fusion
    corecore