
    Distributed k-core view materialization and maintenance for large dynamic graphs

    In graph theory, the k-core is a key metric used to identify subgraphs of high cohesion, also known as the 'dense' regions of a graph. As real-world graphs such as social networks grow in size, their contents become richer, and their topologies change dynamically, the challenge is not only to materialize k-core subgraphs once but also to maintain them under continuous updates. Adding to the challenge, real-world data sets are outgrowing the capacity of a single server and its main memory. These challenges motivated us to propose a new set of distributed algorithms for k-core view construction and maintenance on a horizontally scaling storage and computing platform. Our algorithms execute against the partitioned graph data in parallel and exploit k-core properties to aggressively prune unnecessary computation. Experimental evaluation demonstrated orders-of-magnitude speedup and the advantages of maintaining k-core views incrementally and in batch windows over complete reconstruction. Our algorithms thus enable practitioners to create and maintain many k-core views on different topics in rich social network content simultaneously.
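
For readers unfamiliar with the term, the k-core of a graph is the largest subgraph in which every vertex has at least k neighbours inside that subgraph. The following is a minimal single-machine sketch of the standard "peeling" construction, not the distributed algorithm the abstract proposes; the function name `k_core` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def k_core(edges, k):
    """Return the vertex set of the k-core of an undirected graph.

    `edges` is an iterable of (u, v) pairs. This is the textbook peeling
    procedure run on one machine; it is only an illustration of the
    definition, not the distributed method described in the abstract.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Start with every vertex whose degree is already below k.
    stack = [v for v in adj if len(adj[v]) < k]
    removed = set()

    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        # Removing v lowers the degree of each surviving neighbour,
        # which may push that neighbour below k as well.
        for w in adj[v]:
            if w in removed:
                continue
            adj[w].discard(v)
            if len(adj[w]) < k:
                stack.append(w)
        adj[v].clear()

    return {v for v in adj if v not in removed}
```

On a triangle `a-b-c` with one extra vertex `d` attached to `c`, the 2-core keeps the triangle and drops `d`, because `d` has only one neighbour.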

    Distributed Iterative Graph Processing Using NoSQL with Data Locality

    A tremendous amount of data is generated every day from a wide range of sources such as social networks, sensors, and application logs. Graph data is one type that represents valuable relationships between entities. Analytics on large graphs has become an essential part of business processes and scientific studies because the connections between entities yield deep and meaningful insights into the related domain. However, optimal processing of large-scale iterative graph computations is very challenging due to issues such as fault tolerance, high memory requirements, parallelization, and scalability. Most contemporary systems focus either on keeping the entire graph in memory and minimizing disk access, or on processing the graph entirely on a single node with a centralized disk system. GraphMap is a state-of-the-art scalable and efficient out-of-core, disk-based iterative graph processing system that focuses on using secondary storage and optimizing I/O access. In this thesis, we investigate two new extensions to the existing out-of-core NoSQL-based distributed iterative graph processing system: 1) intra-worker data locality and 2) mincut-based partitioning. We design an additional data locality layer that moves the computation towards the data rather than the other way around. This locality implementation demonstrates a significant performance improvement of up to 39%. Similarly, we use the mincut-based graph partitioning technique to distribute the graph data uniformly across the workers for parallelization so that the inter-worker communication volume is minimized. Through extensive experiments, we also show that the mincut-based graph partitioning technique can lead to improper parallelization due to sub-optimal load balancing.
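
The tension the abstract notes between a small edge cut and good load balance can be made concrete with two simple measurements. The sketch below is an illustration only; the function name `partition_quality`, the vertex-count notion of load, and the dictionary-based assignment format are assumptions, not part of GraphMap or the thesis.

```python
def partition_quality(edges, assignment, num_workers):
    """Measure a vertex partitioning of a graph.

    `edges` is a list of (u, v) pairs and `assignment` maps each vertex
    to a worker id in range(num_workers). Returns (cut, imbalance):
      cut       - number of edges whose endpoints sit on different workers,
                  a proxy for inter-worker communication volume
      imbalance - largest worker load divided by the average load
                  (1.0 means perfectly balanced vertex counts)
    """
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])

    loads = [0] * num_workers
    for worker in assignment.values():
        loads[worker] += 1

    average = sum(loads) / num_workers
    imbalance = max(loads) / average
    return cut, imbalance
```

For a four-vertex path split evenly across two workers, one edge is cut and the load is balanced. Placing all four vertices on one worker cuts no edges at all but doubles the imbalance, which is the kind of outcome a partitioner that only minimizes the cut can drift towards.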

    BetterLife 2.0: large-scale social intelligence reasoning on cloud

    This paper presents the design of the BetterLife 2.0 framework, which facilitates the implementation of large-scale social intelligence applications in a cloud environment. We argue that more and more mobile social applications in pervasive computing need to be implemented this way, given the large volume of user-generated activity on social networking websites. We adopted the Case-based Reasoning (CBR) technique to provide logical reasoning and outlined design considerations for porting a typical CBR framework, jCOLIBRI2, to the cloud using Hadoop's various services (HDFS, HBase). These services allow efficient case base management (e.g. case insertion) and distribution of computationally intensive jobs, speeding up the reasoning process more than 5 times. With the scalability of MapReduce, we can improve the recommendation service with social network analysis that needs to handle millions of users' social activities. © 2010 IEEE. The 2nd IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2010), Indianapolis, IN, 30 November-3 December 2010. In Proceedings of the 2nd CloudCom, 2010, p. 529-53

    Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

    The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources: web applications, mobile phones, sensors and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.

    Deferred lightweight indexing for log-structured key-value stores

    The recent shift towards write-intensive workloads on big data (e.g., financial trading, social user-generated data streams) has driven the proliferation of log-structured key-value stores, represented by Google's BigTable [1], Apache HBase [2] and Cassandra [3]. While providing key-based data access with a Put/Get interface, these key-value stores do not support value-based access methods, which significantly limits their applicability in modern web and database applications. In this paper, we present DELI, a DEferred Lightweight Indexing scheme on log-structured key-value stores. To index intensively updated big data in real time, DELI aims to make index maintenance as lightweight as possible. The key idea is to apply an append-only design for online index maintenance and to collect index garbage at carefully chosen times. DELI optimizes the performance of index garbage collection by tightly coupling its execution with a native routine process called compaction. DELI's system design is fault-tolerant and generic (to most key-value stores). We implemented a prototype of DELI based on HBase without internal code modification. Our experiments show that DELI offers a significant performance advantage for write-intensive index maintenance.
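
The general idea of append-only index maintenance with deferred garbage collection can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not DELI's actual design or HBase's API: the class `DeferredIndexStore`, its version counter, and the read-time liveness check are all hypothetical choices made for illustration.

```python
class DeferredIndexStore:
    """Toy key-value store with an append-only secondary (value) index.

    Writes never read or delete old index entries; they only append.
    Stale entries are filtered out at lookup time and physically removed
    later by compact(), which stands in for a store's compaction pass.
    """

    def __init__(self):
        self.base = {}    # key -> (value, version)
        self.index = []   # append-only log of (value, key, version)
        self.clock = 0    # monotonically increasing write version

    def put(self, key, value):
        # Lightweight write path: one base update plus one index append.
        self.clock += 1
        self.base[key] = (value, self.clock)
        self.index.append((value, key, self.clock))

    def _is_live(self, entry):
        value, key, version = entry
        return self.base.get(key) == (value, version)

    def lookup(self, value):
        # Value-based access: skip index entries made stale by later writes.
        return sorted(
            key for (v, key, ver) in self.index
            if v == value and self._is_live((v, key, ver))
        )

    def compact(self):
        # Deferred garbage collection: drop stale entries in one batch
        # and report how many were reclaimed.
        before = len(self.index)
        self.index = [e for e in self.index if self._is_live(e)]
        return before - len(self.index)
```

After writing `u1 -> red`, `u2 -> red`, and then overwriting `u1 -> blue`, the index log holds three entries, a lookup for `red` returns only `u2`, and a later `compact()` call reclaims the single stale entry.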

    Data management techniques

    Data storage and management is projected to become one of the key challenges in achieving ultrascale computing, for several reasons. First, data is expected to grow exponentially in the coming years, which implies that disruptive technologies will be needed to store large amounts of data and, more importantly, to access it in a timely manner. Second, improvements in computing elements and their scalability are shifting application execution from CPU bound to I/O bound. This creates the additional challenge of significantly improving data access so that it keeps pace with computation time, thus preventing high-performance computing (HPC) resources from being underutilized during long periods of I/O activity. Third, the boundary between the two initially separate worlds of HPC is blurring: on one hand, simulations that are CPU bound, and on the other hand, analytics that mainly perform huge data scans to discover information and are I/O bound. Simulations and analytics now need to work cooperatively and share the same I/O infrastructure.