581 research outputs found

    Real-time analytics on large dynamic graphs

    Get PDF
    In today's fast-paced and interconnected digital world, the data generated by an increasing number of applications is being modeled as dynamic graphs. The graph structure encodes relationships among data items, while the structural changes to the graphs as well as the continuous stream of information produced by the entities in these graphs make them dynamic in nature. Examples include social networks where users post status updates, images, videos, etc.; phone call networks where nodes may send text messages or place phone calls; road traffic networks where the traffic behavior of the road segments changes constantly, and so on. There is a tremendous value in storing, managing, and analyzing such dynamic graphs and deriving meaningful insights in real-time. However, a majority of the work in graph analytics assumes a static setting, and there is a lack of systematic study of the various dynamic scenarios, the complexity they impose on the analysis tasks, and the challenges in building efficient systems that can support such tasks at a large scale. In this dissertation, I design a unified streaming graph data management framework, and develop prototype systems to support increasingly complex tasks on dynamic graphs. In the first part, I focus on the management and querying of distributed graph data. I develop a hybrid replication policy that monitors the read-write frequencies of the nodes to decide dynamically what data to replicate, and whether to do eager or lazy replication in order to minimize network communication and support low-latency querying. In the second part, I study parallel execution of continuous neighborhood-driven aggregates, where each node aggregates the information generated in its neighborhoods. I build my system around the notion of an aggregation overlay graph, a pre-compiled data structure that enables sharing of partial aggregates across different queries, and also allows partial pre-computation of the aggregates to minimize the query latencies and increase throughput. Finally, I extend the framework to support continuous detection and analysis of activity-based subgraphs, where subgraphs could be specified using both graph structure as well as activity conditions on the nodes. The query specification tasks in my system are expressed using a set of active structural primitives, which allows the query evaluator to use a set of novel optimization techniques, thereby achieving high throughput. Overall, in this dissertation, I define and investigate a set of novel tasks on dynamic graphs, design scalable optimization techniques, build prototype systems, and show the effectiveness of the proposed techniques through extensive evaluation using large-scale real and synthetic datasets

    Structured Peer-to-Peer Overlays for NATed Churn Intensive Networks

    Get PDF
    The wide-spread coverage and ubiquitous presence of mobile networks has propelled the usage and adoption of mobile phones to an unprecedented level around the globe. The computing capabilities of these mobile phones have improved considerably, supporting a vast range of third party applications. Simultaneously, Peer-to-Peer (P2P) overlay networks have experienced a tremendous growth in terms of usage as well as popularity in recent years particularly in fixed wired networks. In particular, Distributed Hash Table (DHT) based Structured P2P overlay networks offer major advantages to users of mobile devices and networks such as scalable, fault tolerant and self-managing infrastructure which does not exhibit single points of failure. Integrating P2P overlays on the mobile network seems a logical progression; considering the popularities of both technologies. However, it imposes several challenges that need to be handled, such as the limited hardware capabilities of mobile phones and churn (i.e. the frequent join and leave of nodes within a network) intensive mobile networks offering limited yet expensive bandwidth availability. This thesis investigates the feasibility of extending P2P to mobile networks so that users can take advantage of both these technologies: P2P and mobile networks. This thesis utilises OverSim, a P2P simulator, to experiment with the performance of various P2P overlays, considering high churn and bandwidth consumption which are the two most crucial constraints of mobile networks. The experiment results show that Kademlia and EpiChord are the two most appropriate P2P overlays that could be implemented in mobile networks. Furthermore, Network Address Translation (NAT) is a major barrier to the adoption of P2P overlays in mobile networks. Integrating NAT traversal approaches with P2P overlays is a crucial step for P2P overlays to operate successfully on mobile networks. This thesis presents a general approach of NAT traversal for ring based overlays without the use of a single dedicated server which is then implemented in OverSim. Several experiments have been performed under NATs to determine the suitability of the chosen P2P overlays under NATed environments. The results show that the performance of these overlays is comparable in terms of successful lookups in both NATed and non-NATed environments; with Kademlia and EpiChord exhibiting the best performance. The presence of NATs and also the level of churn in a network influence the routing techniques used in P2P overlays. Recursive routing is more resilient to IP connectivity restrictions posed by NATs but not very robust in high churn environments, whereas iterative routing is more suitable to high churn networks, but difficult to use in NATed environments. Kademlia supports both these routing schemes whereas EpiChord only supports the iterating routing. This undermines the usefulness of EpiChord in NATed environments. In order to harness the advantages of both routing schemes, this thesis presents an adaptive routing scheme, called Churn Aware Routing Protocol (ChARP), combining recursive and iterative lookups where nodes can switch between recursive and iterative routing depending on their lifetimes. The proposed approach has been implemented in OverSim and several experiments have been carried out. The experiment results indicate an improved performance which in turn validates the applicability and suitability of ChARP in NATed environments

    Distributed resource discovery: architectures and applications in mobile networks

    Get PDF
    As the amount of digital information and services increases, it becomes increasingly important to be able to locate the desired content. The purpose of a resource discovery system is to allow available resources (information or services) to be located using a user-defined search criterion. This work studies distributed resource discovery systems that guarantee all existing resources to be found and allow a wide range of complex queries. Our goal is to allocate the load uniformly between the participating nodes, or alternatively to concentrate the load in the nodes with the highest available capacity. The first part of the work examines the performance of various existing unstructured architectures and proposes new architectures that provide features especially valuable in mobile networks. To reduce the network traffic, we use indexing, which is particularly useful in scenarios, where searches are frequent compared to resource modifications. The ratio between the search and update frequencies determines the optimal level of indexing. Based on this observation, we develop an architecture that adjusts itself to changing network conditions and search behavior while maintaining optimal indexing. We also propose an architecture based on large-scale indexing that we later apply to resource sharing within a user group. Furthermore, we propose an architecture that relieves the topology constraints of the Parallel Index Clustering architecture. The performance of the architectures is evaluated using simulation. In the second part of the work we apply the architectures to two types of mobile networks: cellular networks and ad hoc networks. In the cellular network, we first consider scenarios where multiple commercial operators provide a resource sharing service, and then a scenario where the users share resources without operator support. We evaluate the feasibility of the mobile peer-to-peer concept using user opinion surveys and technical performance studies. Based on user input we develop access control and group management algorithms for peer-to-peer networks. The technical evaluation is performed using prototype implementations. In particular, we examine whether the Session Initiation Protocol can be used for signaling in peer-to-peer networks. Finally, we study resource discovery in an ad hoc network. We observe that in an ad hoc network consisting of consumer devices, the capacity and mobility among nodes vary widely. We utilize this property in order to allocate the load to the high-capacity nodes, which serve lower-capacity nodes. We propose two methods for constructing a virtual backbone connecting the nodes

    The Family of MapReduce and Large Scale Data Processing Systems

    Full text link
    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author

    Distributed D3: A web-based distributed data visualisation framework for Big Data

    Get PDF
    The influx of Big Data has created an ever-growing need for analytic tools targeting towards the acquisition of insights and knowledge from large datasets. Visual perception as a fundamental tool used by humans to retrieve information from the outside world around us has its unique ability to distinguish patterns pre-attentively. Visual analytics via data visualisations is therefore a very powerful tool and has become ever more important in this era. Data-Driven Documents (D3.js) is a versatile and popular web-based data visualisation library that has tended to be the standard toolkit for visualising data in recent years. However, the library is technically inherent and limited in capability by the single thread model of a single browser window in a single machine, and therefore not able to deal with large datasets. The main objective of this thesis is to overcome this limitation and address possible challenges by developing the Distributed D3 framework that employs distributed mechanism to enable the possibility of delivering web-based visualisations for large-scale data, which also allows to effectively utilise the graphical computational resources of the modern visualisation environments. As a result, the first contribution is that the integrated version of Distributed D3 framework has been developed for the Data Observatory. The work proves the concept of Distributed D3 is feasible in reality and also enables developers to collaborate on large-scale data visualisations by using it on the Data Observatory. The second contribution is that the Distributed D3 has been optimised by investigating the potential bottlenecks for large-scale data visualisation applications. The work finds the key performance bottlenecks of the framework and shows an improvement of the overall performance by 35.7% after optimisations, which improves the scalability and usability of Distributed D3 for large-scale data visualisation applications. The third contribution is that the generic version of Distributed D3 framework has been developed for the customised environments. The work improves the usability and flexibility of the framework and makes it ready to be published in the open-source community for further improvements and usages.Open Acces

    A schema-based peer-to-peer infrastructure for digital library networks

    Get PDF
    [no abstract

    The INCF Digital Atlasing Program: Report on Digital Atlasing Standards in the Rodent Brain

    Get PDF
    The goal of the INCF Digital Atlasing Program is to provide the vision and direction necessary to make the rapidly growing collection of multidimensional data of the rodent brain (images, gene expression, etc.) widely accessible and usable to the international research community. This Digital Brain Atlasing Standards Task Force was formed in May 2008 to investigate the state of rodent brain digital atlasing, and formulate standards, guidelines, and policy recommendations.

Our first objective has been the preparation of a detailed document that includes the vision and specific description of an infrastructure, systems and methods capable of serving the scientific goals of the community, as well as practical issues for achieving
the goals. This report builds on the 1st INCF Workshop on Mouse and Rat Brain Digital Atlasing Systems (Boline et al., 2007, _Nature Preceedings_, doi:10.1038/npre.2007.1046.1) and includes a more detailed analysis of both the current state and desired state of digital atlasing along with specific recommendations for achieving these goals

    Dynamic data placement and discovery in wide-area networks

    Get PDF
    The workloads of online services and applications such as social networks, sensor data platforms and web search engines have become increasingly global and dynamic, setting new challenges to providing users with low latency access to data. To achieve this, these services typically leverage a multi-site wide-area networked infrastructure. Data access latency in such an infrastructure depends on the network paths between users and data, which is determined by the data placement and discovery strategies. Current strategies are static, which offer low latencies upon deployment but worse performance under a dynamic workload. We propose dynamic data placement and discovery strategies for wide-area networked infrastructures, which adapt to the data access workload. We achieve this with data activity correlation (DAC), an application-agnostic approach for determining the correlations between data items based on access pattern similarities. By dynamically clustering data according to DAC, network traffic in clusters is kept local. We utilise DAC as a key component in reducing access latencies for two application scenarios, emphasising different aspects of the problem: The first scenario assumes the fixed placement of data at sites, and thus focusses on data discovery. This is the case for a global sensor discovery platform, which aims to provide low latency discovery of sensor metadata. We present a self-organising hierarchical infrastructure consisting of multiple DAC clusters, maintained with an online and distributed split-and-merge algorithm. This reduces the number of sites visited, and thus latency, during discovery for a variety of workloads. The second scenario focusses on data placement. This is the case for global online services that leverage a multi-data centre deployment to provide users with low latency access to data. We present a geo-dynamic partitioning middleware, which maintains DAC clusters with an online elastic partition algorithm. It supports the geo-aware placement of partitions across data centres according to the workload. This provides globally distributed users with low latency access to data for static and dynamic workloads.Open Acces
    • …
    corecore