7 research outputs found

    An Efficient Holistic Data Distribution and Storage Solution for Online Social Networks

    Get PDF
    In the past few years, Online Social Networks (OSNs) have dramatically spread over the world. Facebook [4], one of the largest worldwide OSNs, has 1.35 billion users, 82.2% of whom are outside the US [36]. The browsing and posting interactions (text content) between OSN users lead to user data reads (visits) and writes (updates) in OSN datacenters, and Facebook now serves a billion reads and tens of millions of writes per second [37]. Besides that, Facebook has become one of the top Internet traļ¬ƒc sources [36] by sharing tremendous number of large multimedia ļ¬les including photos and videos. The servers in datacenters have limited resources (e.g. bandwidth) to supply latency eļ¬ƒcient service for multimedia ļ¬le sharing among the rapid growing users worldwide. Most online applications operate under soft real-time constraints (e.g., ā‰¤ 300 ms latency) for good user experience, and its service latency is negatively proportional to its income. Thus, the service latency is a very important requirement for Quality of Service (QoS) to the OSN as a web service, since it is relevant to the OSNā€™s revenue and user experience. Also, to increase OSN revenue, OSN service providers need to constrain capital investment, operation costs, and the resource (bandwidth) usage costs. Therefore, it is critical for the OSN to supply a guaranteed QoS for both text and multimedia contents to users while minimizing its costs. To achieve this goal, in this dissertation, we address three problems. i) Data distribution among datacenters: how to allocate data (text contents) among data servers with low service latency and minimized inter-datacenter network load; ii) Eļ¬ƒcient multimedia ļ¬le sharing: how to facilitate the servers in datacenters to eļ¬ƒciently share multimedia ļ¬les among users; iii) Cost minimized data allocation among cloud storages: how to save the infrastructure (datacenters) capital investment and operation costs by leveraging commercial cloud storage services. Data distribution among datacenters. To serve the text content, the new OSN model, which deploys datacenters globally, helps reduce service latency to worldwide distributed users and release the load of the existing datacenters. However, it causes higher inter-datacenter communica-tion load. In the OSN, each datacenter has a full copy of all data, and the master datacenter updates all other datacenters, generating tremendous load in this new model. The distributed data storage, which only stores a userā€™s data to his/her geographically closest datacenters, simply mitigates the problem. However, frequent interactions between distant users lead to frequent inter-datacenter com-munication and hence long service latencies. Therefore, the OSNs need a data allocation algorithm among datacenters with minimized network load and low service latency. Eļ¬ƒcient multimedia ļ¬le sharing. To serve multimedia ļ¬le sharing with rapid growing user population, the ļ¬le distribution method should be scalable and cost eļ¬ƒcient, e.g. minimiza-tion of bandwidth usage of the centralized servers. The P2P networks have been widely used for ļ¬le sharing among a large amount of users [58, 131], and meet both scalable and cost eļ¬ƒcient re-quirements. However, without fully utilizing the altruism and trust among friends in the OSNs, current P2P assisted ļ¬le sharing systems depend on strangers or anonymous users to distribute ļ¬les that degrades their performance due to user selļ¬sh and malicious behaviors. Therefore, the OSNs need a cost eļ¬ƒcient and trustworthy P2P-assisted ļ¬le sharing system to serve multimedia content distribution. Cost minimized data allocation among cloud storages. The new trend of OSNs needs to build worldwide datacenters, which introduce a large amount of capital investment and maintenance costs. In order to save the capital expenditures to build and maintain the hardware infrastructures, the OSNs can leverage the storage services from multiple Cloud Service Providers (CSPs) with existing worldwide distributed datacenters [30, 125, 126]. These datacenters provide diļ¬€erent Get/Put latencies and unit prices for resource utilization and reservation. Thus, when se-lecting diļ¬€erent CSPsā€™ datacenters, an OSN as a cloud customer of a globally distributed application faces two challenges: i) how to allocate data to worldwide datacenters to satisfy application SLA (service level agreement) requirements including both data retrieval latency and availability, and ii) how to allocate data and reserve resources in datacenters belonging to diļ¬€erent CSPs to minimize the payment cost. Therefore, the OSNs need a data allocation system distributing data among CSPsā€™ datacenters with cost minimization and SLA guarantee. In all, the OSN needs an eļ¬ƒcient holistic data distribution and storage solution to minimize its network load and cost to supply a guaranteed QoS for both text and multimedia contents. In this dissertation, we propose methods to solve each of the aforementioned challenges in OSNs. Firstly, we verify the beneļ¬ts of the new trend of OSNs and present OSN typical properties that lay the basis of our design. We then propose Selective Data replication mechanism in Distributed Datacenters (SD3) to allocate user data among geographical distributed datacenters. In SD3,a datacenter jointly considers update rate and visit rate to select user data for replication, and further atomizes a userā€™s diļ¬€erent types of data (e.g., status update, friend post) for replication, making sure that a replica always reduces inter-datacenter communication. Secondly, we analyze a BitTorrent ļ¬le sharing trace, which proves the necessity of proximity-and interest-aware clustering. Based on the trace study and OSN properties, to address the second problem, we propose a SoCial Network integrated P2P ļ¬le sharing system for enhanced Eļ¬ƒciency and Trustworthiness (SOCNET) to fully and cooperatively leverage the common-interest, geographically-close and trust properties of OSN friends. SOCNET uses a hierarchical distributed hash table (DHT) to cluster common-interest nodes, and then further clusters geographically close nodes into a subcluster, and connects the nodes in a subcluster with social links. Thus, when queries travel along trustable social links, they also gain higher probability of being successfully resolved by proximity-close nodes, simultaneously enhancing eļ¬ƒciency and trustworthiness. Thirdly, to handle the third problem, we model the cost minimization problem under the SLA constraints using integer programming. According to the system model, we propose an Eco-nomical and SLA-guaranteed cloud Storage Service (ES3), which ļ¬nds a data allocation and resource reservation schedule with cost minimization and SLA guarantee. ES3 incorporates (1) a data al-location and reservation algorithm, which allocates each data item to a datacenter and determines the reservation amount on datacenters by leveraging all the pricing policies; (2) a genetic algorithm based data allocation adjustment approach, which makes data Get/Put rates stable in each data-center to maximize the reservation beneļ¬t; and (3) a dynamic request redirection algorithm, which dynamically redirects a data request from an over-utilized datacenter to an under-utilized datacenter with suļ¬ƒcient reserved resource when the request rate varies greatly to further reduce the payment. Finally, we conducted trace driven experiments on a distributed testbed, PlanetLab, and real commercial cloud storage (Amazon S3, Windows Azure Storage and Google Cloud Storage) to demonstrate the eļ¬ƒciency and eļ¬€ectiveness of our proposed systems in comparison with other systems. The results show that our systems outperform others in the network savings and data distribution eļ¬ƒciency

    High Performance Network Evaluation and Testing

    Get PDF

    On Improving Efficiency of Data-Intensive Applications in Geo-Distributed Environments

    Get PDF
    Distributed systems are pervasively demanded and adopted in nowadays for processing data-intensive workloads since they greatly accelerate large-scale data processing with scalable parallelism and improved data locality. Traditional distributed systems initially targeted computing clusters but have since evolved to data centers with multiple clusters. These systems are mostly built on top of homogeneous, tightly integrated resources connected in high-speed local-area networks (LANs), and typically require data to be ingested to a central data center for processing. Today, with enormous volumes of data continuously generated from geographically distributed locations, direct adoption of such systems is prohibitively inefficient due to the limited system scalability and high cost for centralizing the geo-distributed data over the wide-area networks (WANs). More commonly, it becomes a trend to build geo-distributed systems wherein data processing jobs are performed on top of geo-distributed, heterogeneous resources in proximity to the data at vastly distributed geo-locations. However, critical challenges and mechanisms for efficient execution of data-intensive applications in such geo-distributed environments are unclear by far. The goal of this dissertation is to identify such challenges and mechanisms, by extensively using the research principles and methodology of conventional distributed systems to investigate the geo-distributed environment, and by developing new techniques to tackle these challenges and run data-intensive applications with efficiency at scale. The contributions of this dissertation are threefold. Firstly, the dissertation shows that the high level of resource heterogeneity exhibited in the geo-distributed environment undermines the scalability of geo-distributed systems. Virtualization-based resource abstraction mechanisms have been introduced to abstract the hardware, network, and OS resources throughout the system, to mitigate the underlying resource heterogeneity and enhance the system scalability. Secondly, the dissertation reveals the overwhelming performance and monetary cost incurred by indulgent data sharing over the WANs in geo-distributed systems. Network optimization approaches, including linear- programming-based global optimization, greedy bin-packing heuristics, and TCP enhancement, are developed to optimize the network resource utilization and circumvent unnecessary expenses imposed on data sharing in WANs. Lastly, the dissertation highlights the importance of data locality for data-intensive applications running in the geo-distributed environment. Novel data caching and locality-aware scheduling techniques are devised to improve the data locality.Doctor of Philosoph

    The terminator : an AI-based framework to handle dependability threats in large-scale distributed systems

    Get PDF
    With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by the scale of the resources they possess, containing a large number of sophisticated HW components that are tightly integrated. This scale and design complexity inherently contribute to sources of uncertainties, i.e., there are dependability threats that perturb the system during application execution. During system execution, these HPC systems generate a massive amount of log messages that capture the health status of the various components. Several previous works have leveraged those systemsā€™ logs for dependability purposes, such as failure prediction, with varying results. In this work, three novel AI-based techniques are proposed to address two major dependability problems, those of (i) error detection and (ii) failure prediction. The proposed error detection technique leverages the sentiments embedded in log messages in a novel way, making the approach HPC system-independent, i.e., the technique can be used to detect errors in any HPC system. On the other hand, two novel self-supervised transformer neural networks are developed for failure prediction, thereby obviating the need for labels, which are notoriously difficult to obtain in HPC systems. The first transformer technique, called Clairvoyant, accurately predicts the location of the failure, while the second technique, called Time Machine, extends Clairvoyant by also accurately predicting the lead time to failure (LTTF). Time Machine addresses the typical regression problem of LTTF as a novel multi-class classification problem, using a novel oversampling method for online time-based task training. Results from six real-world HPC clustersā€™ datasets show that our approaches significantly outperform the state-of-the-art methods on various metrics
    corecore