Low latency fast data computation scheme for map reduce based clusters
MapReduce-based clusters are an emerging paradigm for big data analytics, used to scale up and speed up the classification, investigation, and processing of huge, massive, and complex data sets. One of the fundamental issues in processing data on MapReduce clusters is dealing with resource heterogeneity, especially when there is data inter-dependency among the tasks. Secondly, MapReduce runs a job in many phases; intermediate data traffic and its migration time become a major bottleneck for jobs that produce huge volumes of intermediate data in the shuffle phase. Further, accounting for the factors behind the critical issue of straggling is necessary, because straggling introduces unnecessary delays and poses a serious constraint on the overall performance of the system. Thus, this research aims to provide a low latency fast data computation scheme that introduces three algorithms to handle interdependent task computation among heterogeneous resources, reduce intermediate data traffic along with its migration time, and monitor and model job straggling factors. This research has developed a Low Latency and Computational Cost based Tasks Scheduling (LLCC-TS) algorithm for interdependent tasks on heterogeneous resources that accounts for priority to provide cost-effective resource utilization and reduced makespan. Furthermore, an Aggregation and Partition based Accelerated Intermediate Data Migration (APAIDM) algorithm has been presented to reduce intermediate data traffic and data migration time in the shuffle phase by using aggregators and a custom partitioner. Moreover, a MapReduce Total Execution Time Prediction (MTETP) scheme for MapReduce job computation, taking into account the factors that affect job computation time, has been produced using a machine learning technique (linear regression) in order to monitor job straggling and minimize latency.
The LLCC-TS algorithm has 66.13%, 22.23%, 43.53%, and 44.74% performance improvement rates over the FIFO, improved max-min, SJF, and MOS algorithms, respectively, for the makespan time of scheduling interdependent tasks. The APAIDM algorithm scored 66.62% and 48.4% performance improvements in reducing data migration time over the hash basic and conventional aggregation algorithms, respectively. Moreover, the MTETP technique improves accuracy in predicting total job execution time by 20.42% over the improved HP technique. Thus, the combination of the three algorithms mentioned above provides a low latency fast data computation scheme for MapReduce-based clusters.
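The MTETP scheme's use of linear regression to predict total job execution time can be sketched in a few lines. The features below (input size, map/reduce task counts, shuffle data size) and all numbers are illustrative assumptions, not values from the thesis; the point is only to show how an ordinary-least-squares model maps job characteristics to a runtime estimate.

```python
import numpy as np

# Hypothetical training data: one row per previously observed MapReduce job.
# Features (assumed for illustration): input size in GB, number of map tasks,
# number of reduce tasks, intermediate (shuffle) data size in GB.
X = np.array([
    [10.0,   40,  8,  4.0],
    [25.0,   95, 16, 11.0],
    [50.0,  210, 32, 22.0],
    [80.0,  300, 48, 36.0],
    [120.0, 500, 64, 55.0],
])
y = np.array([95.0, 230.0, 470.0, 760.0, 1150.0])  # observed runtimes (s)

# Fit ordinary least squares with an intercept term.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_runtime(features):
    """Predict total job execution time (seconds) for a new job."""
    return float(np.append(np.asarray(features, dtype=float), 1.0) @ coef)

estimate = predict_runtime([60.0, 240, 40, 27.0])
```

A scheduler could compare such an estimate against a task's observed progress to flag likely stragglers early, which is the monitoring role MTETP plays in the scheme.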
Achieving Accurate Predictions of Future Events Under Hardware Heterogeneity
Heterogeneous hardware is becoming increasingly available in modern systems, while research breakthroughs reinforce the expectation that heterogeneity will keep increasing in the future. Significant gains can be achieved via appropriate utilization of heterogeneity, in terms of performance and power consumption; however, poor utilization can have a detrimental effect. Intelligent scheduling and resource management is a crucial challenge we need to overcome in order to harvest the full potential of heterogeneous hardware. As systems become larger and include greater levels of hardware diversity, the importance of intelligent scheduling and resource management is further accentuated.
This dissertation presents techniques that aid the process of scheduling and resource management in the presence of heterogeneous hardware by accurately predicting upcoming runtime events. With a proactive and accurate view of the near future, schedulers can utilize the underlying hardware more efficiently and take full advantage of the available benefits.
By adapting a majority element heuristic, this dissertation significantly improves the accuracy of predicting memory addresses about to be accessed, while reducing prediction-related costs by a factor of ten thousand compared to previously proposed predictive approaches. Coupled with novel microarchitectural modifications, accurate address predictions are shown to improve the performance of heterogeneous memory architectures.
Machine learning-based performance predictors are further presented, capable of predicting a program's performance when executed on a given general-purpose core. Trained to model the subtleties of the interaction between hardware and software, these predictors can generate highly accurate predictions even for cores with varied Instruction Set Architectures. Utilizing these performance predictions for job scheduling is shown to improve overall system performance.
The trained predictors are further examined and interpreted in order to visualize the correlations between features picked up and amplified during training.
Finally, this dissertation demonstrates that scheduling algorithms cannot guarantee deriving an optimal schedule during realistic execution scenarios due to the underlying hardware heterogeneity, the wide range of runtime requirements of software, and prediction error from performance predictors. In response, deep neural networks are trained to select one scheduling approach from a list of options with varied overheads and correctness guarantees. The scheduling approach chosen is the one most likely to return the highest-performance schedule with the lowest overhead, given a particular instance of the job-to-core assignment problem.
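The "majority element heuristic" for address prediction can be illustrated with the classic Boyer-Moore majority-vote algorithm applied to a stream of address deltas (strides). This is a minimal sketch under that assumption, not the dissertation's actual mechanism: the predictor keeps a single candidate stride in O(1) space, which is what makes this family of heuristics so cheap compared to table-based predictors.

```python
class StridePredictor:
    """Predict the next memory address by tracking the majority stride
    (address delta) with the Boyer-Moore majority-vote heuristic.
    Illustrative sketch only; the real predictor differs in detail."""

    def __init__(self):
        self.last_addr = None
        self.candidate = None  # current majority-candidate stride
        self.count = 0

    def observe(self, addr):
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Boyer-Moore vote over the stream of observed strides.
            if self.count == 0:
                self.candidate, self.count = stride, 1
            elif stride == self.candidate:
                self.count += 1
            else:
                self.count -= 1
        self.last_addr = addr

    def predict_next(self):
        if self.last_addr is None or self.candidate is None:
            return None
        return self.last_addr + self.candidate

p = StridePredictor()
# A mostly regular access stream (stride 0x40) with one irregular jump.
for a in [0x1000, 0x1040, 0x1080, 0x2000, 0x2040]:
    p.observe(a)
```

Because the dominant stride (0x40) outvotes the single irregular jump, the predictor's next guess is `0x2080`, i.e., the last address plus the majority stride.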
Data-centric serverless cloud architecture
Serverless has become a dominant cloud architecture thanks to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers compose their cloud services as a set of functions while providers take responsibility for scaling each function's resources according to traffic changes. Hence, the provider needs to timely spawn, or tear down, function instances (i.e., HTTP servers with user-provided handlers), which cannot hold state across function invocations.
Performance of a modern serverless cloud is bound by data movement. Serverless architecture separates compute resources and data management to allow function instances to run on any node in a cloud datacenter. This flexibility comes at the cost of having to move function initialization state across the entire datacenter when spawning new instances on demand. Furthermore, to facilitate scaling, cloud providers restrict the serverless programming model to stateless functions (which cannot hold or share state across different functions), which lack efficient support for cross-function communication.
This thesis consists of the following four research contributions, which pave the way for a data-centric serverless cloud architecture. First, we introduce STeLLAR, an open-source serverless benchmarking framework, which enables an accurate performance characterization of serverless deployments. Using STeLLAR, we study three leading serverless clouds and identify that all of them follow the same conceptual architecture, comprising three essential subsystems: the worker fleet, the scheduler, and the storage. Our analysis quantifies the aspect of the data movement problem that is related to moving state from the storage to workers when spawning function instances ("cold-start" delays). We also study two state-of-the-art production methods of cross-function communication, which involve either the storage subsystem or, when the data is transmitted as part of invocation HTTP requests (i.e., inline), the scheduler subsystem.
Second, we introduce vHive, an open-source ecosystem for serverless benchmarking and experimentation, with the goal of enabling researchers to study and innovate across the entire serverless stack. In contrast to incomplete academic prototypes and the proprietary infrastructure of the leading commercial clouds, vHive is representative of the leading clouds while comprising only fully open-source, production-grade components, such as the Kubernetes orchestrator and the AWS Firecracker hypervisor. To demonstrate vHive's utility, we analyze cold-start delays, revealing that the high cold-start latency of function instances is attributable to frequent page faults as the function's state is brought from disk into guest memory one page at a time. Our analysis further reveals that serverless functions operate over stable working sets, even across function invocations.
Third, to reduce the cold-start delays of serverless functions, we introduce a novel snapshotting mechanism that records and prefetches their memory working sets. This mechanism, called REAP, is implemented in userspace and consists of two phases. During the first invocation of a function, all accessed memory pages are recorded and their contents are stored compactly as part of the function snapshot. Starting from the second cold invocation, the contents of the recorded pages are retrieved from storage and installed in the guest memory before the new function instance starts to process the invocation, avoiding the majority of page faults and hence significantly accelerating the function's cold starts.
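The two-phase record-and-prefetch idea can be modeled in a few lines. Everything below (page numbers, the dictionary "backing store", the class names) is a toy assumption used to show why prefetching the recorded working set eliminates the demand-paging faults of later cold starts; it is not REAP's implementation.

```python
# Toy model of a REAP-style record-and-prefetch snapshot.
PAGE_SIZE = 4096

class Snapshot:
    def __init__(self):
        self.recorded_pages = {}  # page number -> page contents

class Instance:
    """A function instance whose guest memory is demand-paged from storage."""
    def __init__(self, snapshot, backing_store):
        self.snapshot = snapshot
        self.store = backing_store
        self.memory = {}
        self.page_faults = 0

    def prefetch(self):
        # Later cold starts: install the recorded working set into guest
        # memory before the invocation begins.
        self.memory.update(self.snapshot.recorded_pages)

    def access(self, page, record=False):
        if page not in self.memory:
            self.page_faults += 1            # fetched one page at a time
            self.memory[page] = self.store[page]
        if record:                           # first invocation: record pages
            self.snapshot.recorded_pages[page] = self.memory[page]
        return self.memory[page]

store = {n: b"\x00" * PAGE_SIZE for n in range(64)}
working_set = [3, 7, 7, 12, 3, 40]
snap = Snapshot()

first = Instance(snap, store)                # first (recording) invocation
for page in working_set:
    first.access(page, record=True)

second = Instance(snap, store)               # later cold start with prefetch
second.prefetch()
for page in working_set:
    second.access(page)
```

The first instance faults once per unique page, while the prefetched instance faults not at all on the same (stable) working set, which is the effect the thesis exploits.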
Finally, to accelerate cross-function data communication, we propose Expedited Data Transfers (XDT), an API-preserving high-performance data communication method for serverless. In production clouds, functions transmit intermediate data to other functions either inline or through a third-party storage service. The former approach is restricted to small transfer sizes, while the latter supports arbitrary transfers but suffers from performance and cost overheads. XDT enables direct function-to-function transfers in a way that is fully compatible with the existing autoscaling infrastructure. With XDT, a trusted component of the sender function buffers the payload in its memory and sends a secure reference to the receiver, which is picked by the load balancer and autoscaler based on the current load. Using the reference, the receiver instance pulls the transmitted data directly from the sender's memory, obviating the need for intermediary storage.
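The XDT reference-passing pattern can be sketched as follows. The class name, the in-process "transfer", and the use of a UUID as a stand-in for a secure reference are all illustrative assumptions; real XDT involves a trusted per-instance component and network transfers between machines.

```python
import uuid

class SenderBuffer:
    """Trusted component that buffers outgoing payloads in sender memory
    and hands out opaque references (XDT-style sketch, not the real API)."""

    def __init__(self):
        self._payloads = {}

    def put(self, payload: bytes) -> str:
        ref = str(uuid.uuid4())        # stand-in for a secure reference
        self._payloads[ref] = payload
        return ref

    def pull(self, ref: str) -> bytes:
        # The receiver pulls directly from sender memory and the slot is
        # freed; no third-party storage service is involved.
        return self._payloads.pop(ref)

# Sender side: buffer the (arbitrarily large) payload and transmit only the
# small reference, e.g. inside the invocation request the autoscaler routes.
buf = SenderBuffer()
payload = b"intermediate data" * 1000
ref = buf.put(payload)

# Receiver side: whichever instance the load balancer picked uses the
# reference to fetch the data from the sender.
received = buf.pull(ref)
```

Because only the fixed-size reference travels through the invocation path, the approach keeps inline-style compatibility with autoscaling while supporting large transfers.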
Preserving and sharing born-digital and hybrid objects from and across the National Collection
This report is one of a set of outputs from the Arts and Humanities Research Council funded project 'Preserving and sharing born-digital and hybrid objects from and across the National Collection'. It has been designed to provide an extensive account of the project's research activities and findings, to be useful to museum, heritage, and preservation professionals, as well as to scholars interested in born-digital materials.
The aims of the project were to instigate a conversation and build confidence across the museum sector to support the collecting of born-digital objects, and to lay the foundations for future research in the field. The research gathers the expertise of professionals from different backgrounds and has an international ambition; however, institutions addressing this type of collection tend to be concentrated in a few countries across Europe, Australia, and North America.
The research's methodology includes desk-based research, the focused investigation of four case studies, interviews, and workshops. The analysis of the data collected has supported the articulation of a set of themes and key ideas that provide the grounding for policy, research, and practice-related recommendations.
The report understands the challenges of collecting born-digital objects as going beyond the merely technical realm of obsolescence and broken dependencies, to address issues of legality, visibility, and accountability. It discusses the multi-layered and complex authorship of many born-digital objects associated with communities or corporate ownership, and expands on the potential of collaborative approaches to collection stewardship.
Censorship Citadels: Geography and the Social Control of Girls
This qualitative study examines the way in which local attempts to censor certain books reflect a greater community agenda of controlling young female behavior, specifically sexual and violent behavior. To abet my argument, I draw on Erikson's and Durkheim's theories on boundary maintenance, Gusfield's symbolic crusades, an intersectional feminist perspective, and scholarship on new forms of religious fundamentalism. Using data on frequently challenged books collected by the American Library Association, I identify the top three cities with populations over 100,000 that issued the greatest number of challenges between 2000 and 2009 ("Censorship Citadels") and compare these to three cities of similar size that challenged only one or zero titles. I document the changes in percent white, percent foreign-born, percent homeownership, and rates of poverty in each city, in addition to examining visible boundary breaches by girls for each of the three Censorship Citadels and their comparison cities. Visible boundary breaches by girls include 1) higher rates of births to minor girls, 2) no required notification or permission from parents for a minor's abortion, 3) higher likelihood the school distributes contraceptives, and 4) more newspaper articles covering girls' violence. Lastly, I undertake a content analysis of the books challenged by the Censorship Citadels (N=119) and the comparison cities (N=1) and theorize about the relationship between the books' contents and the community's perceived threats from visible norm breaking by girls. I suggest that cities experiencing more demographic changes during the decade and cities housing more megachurches are cities that attempt more social control of girls through frequent book challenges.
Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015. As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required.
To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems.
In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed necessary social context information for efficient queries.
The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms.
In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, traditional synchronization methods become expensive and severely impact the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms.
Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
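The synchronization strategy for parallel stream clustering can be made concrete with a toy comparison. The dimensionality, the dense-versus-delta message formats, and the update values below are all illustrative assumptions; the point is only that when an update round touches few dimensions of a sparse, high-dimensional centroid, broadcasting the changed entries is dramatically smaller than broadcasting the full centroid.

```python
# Sketch: broadcast incremental changes instead of whole centroids.
DIMENSIONS = 100_000

def dense_message(centroid):
    """Naive synchronization: ship one value per dimension."""
    return list(centroid)

def delta_message(old, new):
    """Incremental synchronization: ship only changed (dim, value) pairs."""
    return {d: v for d, (u, v) in enumerate(zip(old, new)) if u != v}

old = [0.0] * DIMENSIONS
new = list(old)
for d in (5, 17, 4_242):       # a sparse update touches only 3 dimensions
    new[d] = 1.0

full = dense_message(new)      # 100,000 values on the wire
delta = delta_message(old, new)  # 3 entries on the wire
```

Each worker applies the received deltas to its local copy of the centroids, so all replicas converge while the per-round communication cost scales with the update sparsity rather than the dimensionality.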
Using Workload Prediction and Federation to Increase Cloud Utilization
The wide-spread adoption of cloud computing has changed how large-scale computing infrastructure is built and managed. Infrastructure-as-a-Service (IaaS) clouds consolidate different separate workloads onto a shared platform and provide a consistent quality of service by overprovisioning capacity. This additional capacity, however, remains idle for extended periods of time and represents a drag on system efficiency.
The smaller scale of private IaaS clouds compared to public clouds exacerbates overprovisioning inefficiencies, as opportunities for workload consolidation in private clouds are limited. Federation and cycle-harvesting capabilities from computational grids help to improve efficiency, but to date have seen only limited adoption in the cloud due to a fundamental mismatch between the usage models of grids and clouds. Computational grids provide high throughput of queued batch jobs on a best-effort basis and enforce user priorities through dynamic job preemption, while IaaS clouds provide immediate feedback to user requests and make ahead-of-time guarantees about resource availability.
We present a novel method to enable workload federation across IaaS clouds that overcomes this mismatch between grid and cloud usage models and improves system efficiency while also offering availability guarantees. We develop a new method for faster-than-realtime simulation of IaaS clouds to make predictions about system utilization, and leverage this method to estimate the future availability of preemptible resources in the cloud. We then use these estimates to perform careful admission control and provide ahead-of-time bounds on the preemption probability of federated jobs executing on preemptible resources. Finally, we build an end-to-end prototype that addresses practical issues of workload federation and evaluates the prototype's efficacy using real-world traces from big data and compute-intensive production workloads.
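The admission-control step described above can be sketched as a simple probability calculation. The hourly survival forecast, the function name, and the numbers are illustrative assumptions standing in for the thesis's faster-than-realtime simulator output; the sketch only shows how such forecasts yield an ahead-of-time bound on preemption probability.

```python
# Sketch of preemption-aware admission control for federated batch jobs.
def admit(job_runtime_h, preemption_bound, idle_forecast):
    """Admit a federated job onto preemptible capacity only if its cumulative
    preemption probability over the whole runtime stays within the bound
    promised to the user."""
    p_survive = 1.0
    for hour in range(job_runtime_h):
        # idle_forecast[h]: predicted probability that the capacity remains
        # idle (not reclaimed by on-demand traffic) during hour h.
        p_survive *= idle_forecast[hour]
    return (1.0 - p_survive) <= preemption_bound

# Hypothetical simulator output: capacity is very likely idle for the next
# three hours, then demand is expected to pick up.
forecast = [0.99, 0.98, 0.97, 0.60, 0.50]

short_job = admit(3, 0.10, forecast)   # ~5.9% preemption risk: admitted
long_job = admit(5, 0.10, forecast)    # risk far above the bound: rejected
```

Keeping admission decisions tied to an explicit bound is what lets the system make cloud-style availability guarantees while still harvesting idle capacity grid-style.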
A Commercial Law for Software Contracting
Since the 1980s, software has been at the core of most modern organizations, most products, and most services. Part II of this Article examines how the U.C.C. evolved as the primary source of law for the first generation of computer contracts during the mainframe computer era. Part III examines how courts have overextended U.C.C. Article 2, as the main source of law for software licensing, to its limits. Part IV argues that the ALI and the NCCUSL should propose a new Article 2B for software licensing. Part V recommends a new Article 2C for "software as a service".