581 research outputs found
Real-time analytics on large dynamic graphs
In today's fast-paced and interconnected digital world, the data generated by an increasing number of applications is being modeled as dynamic graphs. The graph structure encodes relationships among data items, while the structural changes to the graphs as well as the continuous stream of information produced by the entities in these graphs make them dynamic in nature. Examples include social networks where users post status updates, images, videos, etc.; phone call networks where nodes may send text messages or place phone calls; road traffic networks where the traffic behavior of the road segments changes constantly, and so on. There is a tremendous value in storing, managing, and analyzing such dynamic graphs and deriving meaningful insights in real-time. However, a majority of the work in graph analytics assumes a static setting, and there is a lack of systematic study of the various dynamic scenarios, the complexity they impose on the analysis tasks, and the challenges in building efficient systems that can support such tasks at a large scale.
In this dissertation, I design a unified streaming graph data management framework, and develop prototype systems to support increasingly complex tasks on dynamic graphs. In the first part, I focus on the management and querying of distributed graph data. I develop a hybrid replication policy that monitors the read-write frequencies of the nodes to decide dynamically what data to replicate, and whether to do eager or lazy replication in order to minimize network communication and support low-latency querying. In the second part, I study parallel execution of continuous neighborhood-driven aggregates, where each node aggregates the information generated in its neighborhoods. I build my system around the notion of an aggregation overlay graph, a pre-compiled data structure that enables sharing of partial aggregates across different queries, and also allows partial pre-computation of the aggregates to minimize the query latencies and increase throughput. Finally, I extend the framework to support continuous detection and analysis of activity-based subgraphs, where subgraphs could be specified using both graph structure as well as activity conditions on the nodes. The query specification tasks in my system are expressed using a set of active structural primitives, which allows the query evaluator to use a set of novel optimization techniques, thereby achieving high throughput.
Overall, in this dissertation, I define and investigate a set of novel tasks on dynamic graphs, design scalable optimization techniques, build prototype systems, and show the effectiveness of the proposed techniques through extensive evaluation using large-scale real and synthetic datasets
Structured Peer-to-Peer Overlays for NATed Churn Intensive Networks
The wide-spread coverage and ubiquitous presence of mobile networks has propelled the usage and adoption of mobile phones to an unprecedented level around the globe. The computing capabilities of these mobile phones have improved considerably, supporting a vast range of third party applications. Simultaneously, Peer-to-Peer (P2P) overlay networks have experienced a tremendous growth in terms of usage as well as popularity in recent years particularly in fixed wired networks. In particular, Distributed Hash Table (DHT) based Structured P2P overlay networks offer major advantages to users of mobile devices and networks such as scalable, fault tolerant and self-managing infrastructure which does not exhibit single points of failure. Integrating P2P overlays on the mobile network seems a logical progression; considering the popularities of both technologies. However, it imposes several challenges that need to be handled, such as the limited hardware capabilities of mobile phones and churn (i.e. the frequent join and leave of nodes within a network) intensive mobile networks offering limited yet expensive bandwidth availability. This thesis investigates the feasibility of extending P2P to mobile networks so that users can take advantage of both these technologies: P2P and mobile networks.
This thesis utilises OverSim, a P2P simulator, to experiment with the performance of various P2P overlays, considering high churn and bandwidth consumption which are the two most crucial constraints of mobile networks. The experiment results show that Kademlia and EpiChord are the two most appropriate P2P overlays that could be implemented in mobile networks. Furthermore, Network Address Translation (NAT) is a major barrier to the adoption of P2P overlays in mobile networks. Integrating NAT traversal approaches with P2P overlays is a crucial step for P2P overlays to operate successfully on mobile networks. This thesis presents a general approach of NAT traversal for ring based overlays without the use of a single dedicated server which is then implemented in OverSim. Several experiments have been performed under NATs to determine the suitability of the chosen P2P overlays under NATed environments. The results show that the performance of these overlays is comparable in terms of successful lookups in both NATed and non-NATed environments; with Kademlia and EpiChord exhibiting the best performance.
The presence of NATs and also the level of churn in a network influence the routing techniques used in P2P overlays. Recursive routing is more resilient to IP connectivity restrictions posed by NATs but not very robust in high churn environments, whereas iterative routing is more suitable to high churn networks, but difficult to use in NATed environments. Kademlia supports both these routing schemes whereas EpiChord only supports the iterating routing. This undermines the usefulness of EpiChord in NATed environments.
In order to harness the advantages of both routing schemes, this thesis presents an adaptive routing scheme, called Churn Aware Routing Protocol (ChARP), combining recursive and iterative lookups where nodes can switch between recursive and iterative routing depending on their lifetimes. The proposed approach has been implemented in OverSim and several experiments have been carried out. The experiment results indicate an improved performance which in turn validates the applicability and suitability of ChARP in NATed environments
Distributed resource discovery: architectures and applications in mobile networks
As the amount of digital information and services increases, it becomes increasingly important to be able to locate the desired content. The purpose of a resource discovery system is to allow available resources (information or services) to be located using a user-defined search criterion. This work studies distributed resource discovery systems that guarantee all existing resources to be found and allow a wide range of complex queries. Our goal is to allocate the load uniformly between the participating nodes, or alternatively to concentrate the load in the nodes with the highest available capacity.
The first part of the work examines the performance of various existing unstructured architectures and proposes new architectures that provide features especially valuable in mobile networks. To reduce the network traffic, we use indexing, which is particularly useful in scenarios, where searches are frequent compared to resource modifications. The ratio between the search and update frequencies determines the optimal level of indexing. Based on this observation, we develop an architecture that adjusts itself to changing network conditions and search behavior while maintaining optimal indexing. We also propose an architecture based on large-scale indexing that we later apply to resource sharing within a user group. Furthermore, we propose an architecture that relieves the topology constraints of the Parallel Index Clustering architecture. The performance of the architectures is evaluated using simulation.
In the second part of the work we apply the architectures to two types of mobile networks: cellular networks and ad hoc networks. In the cellular network, we first consider scenarios where multiple commercial operators provide a resource sharing service, and then a scenario where the users share resources without operator support. We evaluate the feasibility of the mobile peer-to-peer concept using user opinion surveys and technical performance studies. Based on user input we develop access control and group management algorithms for peer-to-peer networks. The technical evaluation is performed using prototype implementations. In particular, we examine whether the Session Initiation Protocol can be used for signaling in peer-to-peer networks. Finally, we study resource discovery in an ad hoc network. We observe that in an ad hoc network consisting of consumer devices, the capacity and mobility among nodes vary widely. We utilize this property in order to allocate the load to the high-capacity nodes, which serve lower-capacity nodes. We propose two methods for constructing a virtual backbone connecting the nodes
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
Recommended from our members
A secure and scalable communication framework for inter-cloud services
A lot of contemporary cloud computing platforms offer Infrastructure-as-a-Service provisioning model, which offers to deliver basic virtualized computing resources like storage, hardware, and networking as on-demand and dynamic services. However, a single cloud service provider does not have limitless resources to offer to its users, and increasingly users are demanding the features of extensibility and inter-operability with other cloud service providers. This has increased the complexity of the cloud ecosystem and resulted in the emergence of the concept of an Inter-Cloud environment where a cloud computing platform can use the infrastructure resources of other cloud computing platforms to offer a greater value and flexibility to its users. However, there are no common models or standards in existence that allows the users of the cloud service providers to provision even some basic services across multiple cloud service providers seamlessly, although admittedly it is not due to any inherent incompatibility or proprietary nature of the foundation technologies on which these cloud computing platforms are built. Therefore, there is a justified need of investigating models and frameworks which allow the users of the cloud computing technologies to benefit from the added values of the emerging Inter-Cloud environment. In this dissertation, we present a novel security model and protocols that aims to cover one of the most important gaps in a subsection of this field, that is, the problem domain of provisioning secure communication within the context of a multi-provider Inter-Cloud environment. Our model offers a secure communication framework that enables a user of multiple cloud service providers to provision a dynamic application-level secure virtual private network on top of the participating cloud service providers. We accomplish this by taking leverage of the scalability, robustness, and flexibility of peer-to-peer overlays and distributed hash tables, in addition to novel usage of applied cryptography techniques to design secure and efficient admission control and resource discovery protocols. The peer-to-peer approach helps us in eliminating the problems of manual configurations, key management, and peer churn that are encountered when
setting up the secure communication channels dynamically, whereas the secure admission control and secure resource discovery protocols plug the security gaps that are commonly found in the peer-to-peer overlays. In addition to the design and architecture of our research contributions, we also present the details of a prototype implementation containing all of the elements of our research, as well as showcase our experimental results detailing the performance, scalability, and overheads of our approach, that have been carried out on actual (as
opposed to simulated) multiple commercial and non-commercial cloud computing platforms. These results demonstrate that our architecture incurs minimal latency and throughput overheads for the Inter-Cloud VPN connections among the virtual machines of a service deployed on multiple cloud platforms, which are 5% and 10% respectively. Our results also show that our admission control scheme is approximately 82% more efficient and our secure resource discovery scheme is about 72% more efficient than a standard PKI-based (Public Key Infrastructure) scheme
Distributed D3: A web-based distributed data visualisation framework for Big Data
The influx of Big Data has created an ever-growing need for analytic tools targeting towards the acquisition of insights and knowledge from large datasets. Visual perception as a fundamental tool used by humans to retrieve information from the outside world around us has its unique ability to distinguish patterns pre-attentively. Visual analytics via data visualisations is therefore a very powerful tool and has become ever more important in this era. Data-Driven Documents (D3.js) is a versatile and popular web-based data visualisation library that has tended to be the standard toolkit for visualising data in recent years. However, the library is technically inherent and limited in capability by the single thread model of a single browser window in a single machine, and therefore not able to deal with large datasets. The main objective of this thesis is to overcome this limitation and address possible challenges by developing the Distributed D3 framework that employs distributed mechanism to enable the possibility of delivering web-based visualisations for large-scale data, which also allows to effectively utilise the graphical computational resources of the modern visualisation environments. As a result, the first contribution is that the integrated version of Distributed D3 framework has been developed for the Data Observatory. The work proves the concept of Distributed D3 is feasible in reality and also enables developers to collaborate on large-scale data visualisations by using it on the Data Observatory. The second contribution is that the Distributed D3 has been optimised by investigating the potential bottlenecks for large-scale data visualisation applications. The work finds the key performance bottlenecks of the framework and shows an improvement of the overall performance by 35.7% after optimisations, which improves the scalability and usability of Distributed D3 for large-scale data visualisation applications. The third contribution is that the generic version of Distributed D3 framework has been developed for the customised environments. The work improves the usability and flexibility of the framework and makes it ready to be published in the open-source community for further improvements and usages.Open Acces
The INCF Digital Atlasing Program: Report on Digital Atlasing Standards in the Rodent Brain
The goal of the INCF Digital Atlasing Program is to provide the vision and direction necessary to make the rapidly growing collection of multidimensional data of the rodent brain (images, gene expression, etc.) widely accessible and usable to the international research community. This Digital Brain Atlasing Standards Task Force was formed in May 2008 to investigate the state of rodent brain digital atlasing, and formulate standards, guidelines, and policy recommendations.

Our first objective has been the preparation of a detailed document that includes the vision and specific description of an infrastructure, systems and methods capable of serving the scientific goals of the community, as well as practical issues for achieving
the goals. This report builds on the 1st INCF Workshop on Mouse and Rat Brain Digital Atlasing Systems (Boline et al., 2007, _Nature Preceedings_, doi:10.1038/npre.2007.1046.1) and includes a more detailed analysis of both the current state and desired state of digital atlasing along with specific recommendations for achieving these goals
Dynamic data placement and discovery in wide-area networks
The workloads of online services and applications such as social networks, sensor data platforms and web search engines have become increasingly global and dynamic, setting new challenges to providing users with low latency access to data. To achieve this, these services typically leverage a multi-site wide-area networked infrastructure. Data access latency in such an infrastructure depends on the network paths between users and data, which is determined by the data placement and discovery strategies. Current strategies are static, which offer low latencies upon deployment but worse performance under a dynamic workload.
We propose dynamic data placement and discovery strategies for wide-area networked infrastructures, which adapt to the data access workload. We achieve this with data activity correlation (DAC), an application-agnostic approach for determining the correlations between data items based on access pattern similarities. By dynamically clustering data according to DAC, network traffic in clusters is kept local. We utilise DAC as a key component in reducing access latencies for two application scenarios, emphasising different aspects of the problem:
The first scenario assumes the fixed placement of data at sites, and thus focusses on data discovery. This is the case for a global sensor discovery platform, which aims to provide low latency discovery of sensor metadata. We present a self-organising hierarchical infrastructure consisting of multiple DAC clusters, maintained with an online and distributed split-and-merge algorithm. This reduces the number of sites visited, and thus latency, during discovery for a variety of workloads.
The second scenario focusses on data placement. This is the case for global online services that leverage a multi-data centre deployment to provide users with low latency access to data. We present a geo-dynamic partitioning middleware, which maintains DAC clusters with an online elastic partition algorithm. It supports the geo-aware placement of partitions across data centres according to the workload. This provides globally distributed users with low latency access to data for static and dynamic workloads.Open Acces
- …