94 research outputs found

    Low latency fast data computation scheme for map reduce based clusters

    MapReduce-based clusters are an emerging paradigm for big data analytics, used to scale up and speed up the classification, investigation, and processing of massive, complex data sets. One fundamental issue in processing data on MapReduce clusters is resource heterogeneity, especially when there is data inter-dependency among tasks. Second, MapReduce runs a job in many phases; intermediate data traffic and its migration time become a major bottleneck for jobs that produce large volumes of intermediate data in the shuffle phase. Further, the factors that cause straggling must be monitored, because straggling introduces unnecessary delays and seriously constrains overall system performance. This research therefore aims to provide a low latency fast data computation scheme comprising three algorithms that handle interdependent task computation on heterogeneous resources, reduce intermediate data traffic and its migration time, and monitor and model job straggling factors. It develops a Low Latency and Computational Cost based Tasks Scheduling (LLCC-TS) algorithm that schedules interdependent tasks on heterogeneous resources by taking priority into account, providing cost-effective resource utilization and reduced makespan. Furthermore, an Aggregation and Partition based Accelerated Intermediate Data Migration (APAIDM) algorithm is presented to reduce intermediate data traffic and data migration time in the shuffle phase by using aggregators and a custom partitioner. Finally, a MapReduce Total Execution Time Prediction (MTETP) scheme, which includes the factors that affect job computation time, is built using a machine learning technique (linear regression) in order to monitor job straggling and minimize latency. 
The LLCC-TS algorithm improves the makespan of scheduling interdependent tasks by 66.13%, 22.23%, 43.53%, and 44.74% over the FIFO, improved max-min, SJF, and MOS algorithms, respectively. The APAIDM algorithm reduces data migration time by 66.62% and 48.4% relative to the hash basic and conventional aggregation algorithms, respectively. Moreover, the MTETP technique predicts the total job execution time with 20.42% higher accuracy than the improved HP technique. Thus, the combination of the three algorithms provides a low latency fast data computation scheme for MapReduce-based clusters.
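
The idea of cutting shuffle traffic with aggregators and a custom partitioner can be illustrated with a small sketch. This is not the APAIDM algorithm from the abstract; it is a generic word-count example in which the helper names (`map_words`, `aggregate`, `partition`, `shuffle`, `reduce_counts`) and the CRC32-based partitioner are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
import zlib


def map_words(split):
    """Map phase: emit one (word, 1) pair per token in the input split."""
    return [(word, 1) for word in split.split()]


def aggregate(pairs):
    """Map-side aggregation: sum the values per key before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(key, num_reducers):
    """Hypothetical custom partitioner: stable CRC32 hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group one mapper's output into per-reducer buckets."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


def reduce_counts(mapper_outputs, num_reducers):
    """Reduce phase: sum the values that arrive for each key."""
    totals = Counter()
    for pairs in mapper_outputs:
        for bucket in shuffle(pairs, num_reducers).values():
            for key, value in bucket:
                totals[key] += value
    return dict(totals)


splits = ["a b a c a", "b b a c c"]
raw = [map_words(s) for s in splits]          # no aggregation
combined = [aggregate(p) for p in raw]        # aggregated on the map side

raw_records = sum(len(p) for p in raw)            # records sent without aggregation
combined_records = sum(len(p) for p in combined)  # records sent with aggregation
```

In this toy input, aggregation sends 6 records through the shuffle instead of 10, while `reduce_counts` returns the same totals for both `raw` and `combined`.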

    Data-centric serverless cloud architecture

    Serverless has become a new dominant cloud architecture thanks to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers compose their cloud services as a set of functions while providers take responsibility for scaling each function’s resources according to traffic changes. Hence, the provider needs to spawn, or tear down, function instances (i.e., HTTP servers with user-provided handlers) in a timely manner; these instances cannot hold state across function invocations. Performance of a modern serverless cloud is bound by data movement. Serverless architecture separates compute resources and data management to allow function instances to run on any node in a cloud datacenter. This flexibility comes at the cost of having to move function initialization state across the entire datacenter when spawning new instances on demand. Furthermore, to facilitate scaling, cloud providers restrict the serverless programming model to stateless functions (which cannot hold or share state across different functions), which lack efficient support for cross-function communication. This thesis consists of the following four research contributions that pave the way for a data-centric serverless cloud architecture. First, we introduce STeLLAR, an open-source serverless benchmarking framework, which enables an accurate performance characterization of serverless deployments. Using STeLLAR, we study three leading serverless clouds and identify that all of them follow the same conceptual architecture that comprises three essential subsystems, namely the worker fleet, the scheduler, and the storage. Our analysis quantifies the aspect of the data movement problem that is related to moving state from the storage to workers when spawning function instances (“cold-start” delays). 
Also, we study two state-of-the-art production methods of cross-function communication that involve either the storage subsystem or, if the data is transmitted as part of invocation HTTP requests (i.e., inline), the scheduler subsystem. Second, we introduce vHive, an open-source ecosystem for serverless benchmarking and experimentation, with the goal of enabling researchers to study and innovate across the entire serverless stack. In contrast to the incomplete academic prototypes and proprietary infrastructure of the leading commercial clouds, vHive is representative of the leading clouds and comprises only fully open-source production-grade components, such as the Kubernetes orchestrator and AWS Firecracker hypervisor technologies. To demonstrate vHive’s utility, we analyze the cold-start delays, revealing that the high cold-start latency of function instances is attributable to frequent page faults as the function’s state is brought from disk into guest memory one page at a time. Our analysis further reveals that serverless functions operate over stable working sets - even across function invocations. Third, to reduce the cold-start delays of serverless functions, we introduce a novel snapshotting mechanism that records and prefetches their memory working sets. This mechanism, called REAP, is implemented in userspace and consists of two phases. During the first invocation of a function, all accessed memory pages are recorded and their contents are stored compactly as a part of the function snapshot. Starting from the second cold invocation, the contents of the recorded pages are retrieved from storage and installed in the guest memory before the new function instance starts to process the invocation, which avoids the majority of page faults and hence significantly accelerates the function’s cold starts. 
Finally, to accelerate the cross-function data communication, we propose Expedited Data Transfers (XDT), an API-preserving high-performance data communication method for serverless. In production clouds, functions transmit intermediate data to other functions either inline or through a third-party storage service. The former approach is restricted to small transfer sizes; the latter supports arbitrary transfers but suffers from performance and cost overheads. XDT enables direct function-to-function transfers in a way that is fully compatible with the existing autoscaling infrastructure. With XDT, a trusted component of the sender function buffers the payload in its memory and sends a secure reference to the receiver, which is picked by the load balancer and autoscaler based on the current load. Using the reference, the receiver instance pulls the transmitted data directly from the sender’s memory, obviating the need for intermediary storage.
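
The reference-based transfer described at the end of this abstract can be pictured with a toy, in-process sketch. It is not XDT's implementation or API: `SenderBuffer`, `pick_receiver`, the random-token reference, and the least-loaded selection rule are all hypothetical stand-ins, and no networking or security is modelled.

```python
import secrets


class SenderBuffer:
    """Hypothetical stand-in for the trusted sender-side component."""

    def __init__(self):
        self._payloads = {}

    def put(self, payload: bytes) -> str:
        """Buffer the payload in memory and return an unguessable reference."""
        ref = secrets.token_hex(16)
        self._payloads[ref] = payload
        return ref

    def pull(self, ref: str) -> bytes:
        """Hand the payload to the receiver and free the buffer entry."""
        return self._payloads.pop(ref)

    def pending(self) -> int:
        """Number of payloads still buffered."""
        return len(self._payloads)


def pick_receiver(instances):
    """Stand-in for the load balancer: choose the least-loaded instance."""
    return min(instances, key=lambda inst: inst["load"])


# The sender buffers the data; only the small reference travels with the invocation.
buffer = SenderBuffer()
payload = b"intermediate results" * 1000
ref = buffer.put(payload)

# The receiver is chosen after the send, based on current load.
receiver = pick_receiver([{"name": "r1", "load": 0.9}, {"name": "r2", "load": 0.2}])

# The chosen receiver pulls the payload directly from the sender's buffer.
received = buffer.pull(ref)
```

The point of the sketch is the ordering: the payload stays with the sender until a receiver has been selected, so the choice of receiver remains with the load balancer.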

    Preserving and sharing born-digital and hybrid objects from and across the National Collection

    This report is one of a set of outputs from the Arts and Humanities Research Council funded project ‘Preserving and sharing born-digital and hybrid objects from and across the National Collection’. It has been designed to provide an extensive account of the project research activities and findings, to be useful to museum, heritage, and preservation professionals, as well as to scholars interested in born-digital materials. The aims of the project were to instigate a conversation and build confidence across the museum sector to support the collecting of born-digital objects, and to lay the foundations for future research in the field. The research gathers the expertise of professionals from different backgrounds, and has an international ambition; however, institutions addressing this type of collections tend to be concentrated in a few countries across Europe, Australia and North America. The research’s methodology includes: desk-based research, the focused investigation of four case studies, interviews and workshops. The analysis of the data collected has supported the articulation of a set of themes and key ideas that provide the grounding for the expression of policy, research and practice-related recommendations. The report understands the challenges of collecting born-digital objects as going beyond the mere technical realm of obsolescence and broken dependencies, to address issues of legality, visibility and accountability. It discusses the multi-layered and complex authorship of many born-digital objects associated with communities or corporate ownership, and expands on the potential of collaborative approaches to collection stewardship

    Censorship Citadels: Geography and the Social Control of Girls

    This qualitative study examines the way in which local attempts to censor certain books reflect a greater community agenda of controlling young female behavior, specifically sexual and violent behavior. To abet my argument, I draw on Erikson’s and Durkheim’s theories on boundary maintenance, Gusfield’s symbolic crusades, an intersectional feminist perspective, and scholarship on new forms of religious fundamentalism. Using data on frequently challenged books collected by the American Library Association, I identify the top three cities with populations over 100,000 that issued the greatest number of challenges between 2000 and 2009 (“Censorship Citadels”) and compare these to three cities of similar size that only challenged one or zero titles. I document the changes in percent white, percent foreign-born, percent homeownership, and rates of poverty in each city, in addition to examining visible boundary breaches by girls for each of the three Censorship Citadels and their comparison cities. Visible boundary breaches by girls include 1) higher rates of births to minor girls, 2) no required notification or permission from parents for a minor’s abortion, 3) higher likelihood the school distributes contraceptives and 4) more newspaper articles covering girls’ violence. Lastly, I undertake a content analysis of the books challenged by the Censorship Citadels (N=119) and the comparison cities (N=1) and theorize about the relationship between the books’ contents and the community’s perceived threats from visible norm breaking by girls. I suggest that cities experiencing more demographic changes during the decade and cities housing more megachurches are cities that attempt more social control of girls through frequent book challenges

    Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

    Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015. As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required. To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed necessary social context information for efficient queries. The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms. 
In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, the traditional synchronization method becomes expensive and severely impacts the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms. Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
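
The strategy of broadcasting incremental changes instead of whole centroids can be sketched as follows. This is a minimal illustration under assumptions, not the dissertation's algorithm: centroids are modelled as sparse dictionaries from dimension name to weight, and `sparse_delta` and `apply_delta` are hypothetical helper names.

```python
def sparse_delta(old, new):
    """Return only the dimensions whose value changed between two centroids."""
    delta = {}
    for dim in old.keys() | new.keys():
        diff = new.get(dim, 0.0) - old.get(dim, 0.0)
        if diff != 0.0:
            delta[dim] = diff
    return delta


def apply_delta(centroid, delta):
    """Apply a received delta to a local copy of the centroid."""
    updated = dict(centroid)
    for dim, diff in delta.items():
        value = updated.get(dim, 0.0) + diff
        if value == 0.0:
            updated.pop(dim, None)  # keep the representation sparse
        else:
            updated[dim] = value
    return updated


# A worker updates its centroid after processing a batch of stream items.
old = {"cloud": 2.0, "data": 1.0, "stream": 4.0}
new = {"cloud": 2.0, "data": 1.5, "stream": 4.0, "batch": 0.5}

delta = sparse_delta(old, new)        # only the changed dimensions are sent
synced = apply_delta(old, delta)      # a peer reconstructs the new centroid
```

Here the delta carries two entries while the full centroid has four; with very high-dimensional sparse vectors the gap between the two would typically be much larger.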

    Using Workload Prediction and Federation to Increase Cloud Utilization

    The widespread adoption of cloud computing has changed how large-scale computing infrastructure is built and managed. Infrastructure-as-a-Service (IaaS) clouds consolidate different separate workloads onto a shared platform and provide a consistent quality of service by overprovisioning capacity. This additional capacity, however, remains idle for extended periods of time and represents a drag on system efficiency. The smaller scale of private IaaS clouds compared to public clouds exacerbates overprovisioning inefficiencies as opportunities for workload consolidation in private clouds are limited. Federation and cycle harvesting capabilities from computational grids help to improve efficiency, but to date have seen only limited adoption in the cloud due to a fundamental mismatch between the usage models of grids and clouds. Computational grids provide high throughput of queued batch jobs on a best-effort basis and enforce user priorities through dynamic job preemption, while IaaS clouds provide immediate feedback to user requests and make ahead-of-time guarantees about resource availability. We present a novel method to enable workload federation across IaaS clouds that overcomes this mismatch between grid and cloud usage models and improves system efficiency while also offering availability guarantees. We develop a new method for faster-than-realtime simulation of IaaS clouds to make predictions about system utilization and leverage this method to estimate the future availability of preemptible resources in the cloud. We then use these estimates to perform careful admission control and provide ahead-of-time bounds on the preemption probability of federated jobs executing on preemptible resources. Finally, we build an end-to-end prototype that addresses practical issues of workload federation and evaluate the prototype's efficacy using real-world traces from big data and compute-intensive production workloads.
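
Admission control driven by simulated availability can be sketched in a few lines. This is an illustrative assumption about how such a check might look, not the method from the abstract: each simulated future is a list of predicted free-capacity values over the job's lifetime, and `preemption_probability` and `admit` are hypothetical names.

```python
def preemption_probability(predicted_free, demand):
    """Fraction of simulated futures in which free capacity dips below demand."""
    preempted = sum(1 for trace in predicted_free if min(trace) < demand)
    return preempted / len(predicted_free)


def admit(predicted_free, demand, max_probability):
    """Admit a federated job only if its estimated preemption risk is within the bound."""
    return preemption_probability(predicted_free, demand) <= max_probability


# Four simulated futures of free capacity (in arbitrary slots) over three time steps.
traces = [
    [8, 6, 7],
    [5, 3, 6],
    [9, 9, 8],
    [4, 6, 7],
]

small_job_risk = preemption_probability(traces, demand=4)
large_job_risk = preemption_probability(traces, demand=6)
```

With a bound of 0.3, `admit(traces, 4, 0.3)` accepts the smaller job and `admit(traces, 6, 0.3)` rejects the larger one, since the larger job would be preempted in half of the simulated futures.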

    A Commercial Law for Software Contracting

    Since the 1980s, software has been at the core of most modern organizations, most products and most services. Part II of this Article examines how the U.C.C. evolved as the primary source of law for the first generation of computer contracts during the mainframe computer era. Part III examines how courts have overextended U.C.C. Article 2, as the main source of law for software licensing, to its limits. Part IV argues that the ALI and the NCCUSL should propose a new Article 2B for software licensing. Part V recommends a new Article 2C for “software as a service.”

    The Murray Ledger and Times, April 5, 2000

