Low latency fast data computation scheme for map reduce based clusters
MapReduce-based clusters are an emerging paradigm for big data analytics, used to scale up and speed up the classification, investigation, and processing of huge, massive, and complex data sets. One of the fundamental issues in processing data on MapReduce clusters is dealing with resource heterogeneity, especially when there is data inter-dependency among the tasks. Secondly, MapReduce runs a job in many phases; intermediate data traffic and its migration time become a major bottleneck for jobs that produce huge volumes of intermediate data in the shuffle phase. Further, accounting for the factors behind the critical issue of straggling is necessary, because straggling introduces unnecessary delays and poses a serious constraint on the overall performance of the system. Thus, this research aims to provide a low latency fast data computation scheme that introduces three algorithms to handle interdependent task computation among heterogeneous resources, reduce intermediate data traffic along with its migration time, and monitor and model job straggling factors. This research has developed a Low Latency and Computational Cost based Tasks Scheduling (LLCC-TS) algorithm for interdependent tasks on heterogeneous resources that accounts for priority to provide cost-effective resource utilization and reduced makespan. Furthermore, an Aggregation and Partition based Accelerated Intermediate Data Migration (APAIDM) algorithm has been presented to reduce intermediate data traffic and data migration time in the shuffle phase by using aggregators and a custom partitioner. Moreover, a MapReduce Total Execution Time Prediction (MTETP) scheme for MapReduce job computation, taking into account the factors that affect job computation time, has been produced using a machine learning technique (linear regression) in order to monitor job straggling and minimize latency.
The LLCC-TS algorithm has 66.13%, 22.23%, 43.53%, and 44.74% performance improvement rates over the FIFO, improved max-min, SJF, and MOS algorithms, respectively, for the makespan time of scheduling interdependent tasks. The APAIDM algorithm scored 66.62% and 48.4% performance improvements in reducing data migration time over the hash basic and conventional aggregation algorithms, respectively. Moreover, the MTETP technique improves accuracy in predicting total job execution time by 20.42% over the improved HP technique. Thus, the combination of the three algorithms mentioned above provides a low latency fast data computation scheme for MapReduce-based clusters.
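The MTETP scheme's use of linear regression to predict total job execution time can be sketched in a few lines. The features below (input size, map/reduce task counts, shuffle data size) and all numbers are illustrative assumptions, not values from the thesis; the point is only to show how an ordinary-least-squares model maps job characteristics to a runtime estimate.

```python
import numpy as np

# Hypothetical training data: one row per previously observed MapReduce job.
# Features (assumed for illustration): input size in GB, number of map tasks,
# number of reduce tasks, intermediate (shuffle) data size in GB.
X = np.array([
    [10.0,   40,  8,  4.0],
    [25.0,   95, 16, 11.0],
    [50.0,  210, 32, 22.0],
    [80.0,  300, 48, 36.0],
    [120.0, 500, 64, 55.0],
])
y = np.array([95.0, 230.0, 470.0, 760.0, 1150.0])  # observed runtimes (s)

# Fit ordinary least squares with an intercept term.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_runtime(features):
    """Predict total job execution time (seconds) for a new job."""
    return float(np.append(np.asarray(features, dtype=float), 1.0) @ coef)

estimate = predict_runtime([60.0, 240, 40, 27.0])
```

A scheduler could compare such an estimate against a task's observed progress to flag likely stragglers early, which is the monitoring role MTETP plays in the scheme.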
Achieving Accurate Predictions of Future Events Under Hardware Heterogeneity
Heterogeneous hardware is becoming increasingly available in modern systems, while research breakthroughs reinforce the expectation that heterogeneity will keep increasing in the future. Significant gains can be achieved via appropriate utilization of heterogeneity, in terms of performance and power consumption; however, poor utilization can have a detrimental effect. Intelligent scheduling and resource management is a crucial challenge we need to overcome in order to harvest the full potential of heterogeneous hardware. As systems become larger and include greater levels of hardware diversity, the importance of intelligent scheduling and resource management is further accentuated.
This dissertation presents techniques that aid the process of scheduling and resource management in the presence of heterogeneous hardware by accurately predicting upcoming runtime events. With a proactive and accurate view of the near future, schedulers can utilize the underlying hardware more efficiently and take full advantage of the available benefits.
By adapting a majority element heuristic, this dissertation significantly improves the accuracy of predicting memory addresses about to be accessed, while reducing prediction-related costs by a factor of ten thousand compared to previously proposed predictive approaches. Coupled with novel microarchitectural modifications, accurate address predictions are shown to improve the performance of heterogeneous memory architectures.
Machine learning-based performance predictors are further presented, capable of predicting a program's performance when executed on a given general-purpose core. Trained to model the subtleties of the interaction between hardware and software, these predictors can generate highly accurate predictions even for cores with varied Instruction Set Architectures. Utilizing these performance predictions for job scheduling is shown to improve overall system performance.
The trained predictors are further examined and interpreted in order to visualize the correlations between features picked up and amplified during training.
Finally, this dissertation demonstrates that scheduling algorithms cannot guarantee deriving an optimal schedule during realistic execution scenarios due to the underlying hardware heterogeneity, the wide range of runtime requirements of software, and prediction error from performance predictors. In response, deep neural networks are trained to select one scheduling approach from a list of options with varied overheads and correctness guarantees. The scheduling approach chosen is the one most likely to return the highest-performance schedule with the lowest overhead, given a particular instance of the job-to-core assignment problem.
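The "majority element heuristic" for address prediction can be illustrated with the classic Boyer-Moore majority-vote algorithm applied to a stream of address deltas (strides). This is a minimal sketch under that assumption, not the dissertation's actual mechanism: the predictor keeps a single candidate stride in O(1) space, which is what makes this family of heuristics so cheap compared to table-based predictors.

```python
class StridePredictor:
    """Predict the next memory address by tracking the majority stride
    (address delta) with the Boyer-Moore majority-vote heuristic.
    Illustrative sketch only; the real predictor differs in detail."""

    def __init__(self):
        self.last_addr = None
        self.candidate = None  # current majority-candidate stride
        self.count = 0

    def observe(self, addr):
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Boyer-Moore vote over the stream of observed strides.
            if self.count == 0:
                self.candidate, self.count = stride, 1
            elif stride == self.candidate:
                self.count += 1
            else:
                self.count -= 1
        self.last_addr = addr

    def predict_next(self):
        if self.last_addr is None or self.candidate is None:
            return None
        return self.last_addr + self.candidate

p = StridePredictor()
# A mostly regular access stream (stride 0x40) with one irregular jump.
for a in [0x1000, 0x1040, 0x1080, 0x2000, 0x2040]:
    p.observe(a)
```

Because the dominant stride (0x40) outvotes the single irregular jump, the predictor's next guess is `0x2080`, i.e., the last address plus the majority stride.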
Data-centric serverless cloud architecture
Serverless has become a dominant cloud architecture thanks to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers compose their cloud services as a set of functions while providers take responsibility for scaling each function's resources according to traffic changes. Hence, the provider needs to timely spawn, or tear down, function instances (i.e., HTTP servers with user-provided handlers), which cannot hold state across function invocations.
Performance of a modern serverless cloud is bound by data movement. Serverless architecture separates compute resources and data management to allow function instances to run on any node in a cloud datacenter. This flexibility comes at the cost of having to move function initialization state across the entire datacenter when spawning new instances on demand. Furthermore, to facilitate scaling, cloud providers restrict the serverless programming model to stateless functions (which cannot hold or share state across different functions), which lack efficient support for cross-function communication.
This thesis consists of the following four research contributions, which pave the way for a data-centric serverless cloud architecture. First, we introduce STeLLAR, an open-source serverless benchmarking framework, which enables an accurate performance characterization of serverless deployments. Using STeLLAR, we study three leading serverless clouds and identify that all of them follow the same conceptual architecture, comprising three essential subsystems: the worker fleet, the scheduler, and the storage. Our analysis quantifies the aspect of the data movement problem that is related to moving state from the storage to workers when spawning function instances ("cold-start" delays). We also study two state-of-the-art production methods of cross-function communication, which involve either the storage subsystem or, when the data is transmitted as part of invocation HTTP requests (i.e., inline), the scheduler subsystem.
Second, we introduce vHive, an open-source ecosystem for serverless benchmarking and experimentation, with the goal of enabling researchers to study and innovate across the entire serverless stack. In contrast to incomplete academic prototypes and the proprietary infrastructure of the leading commercial clouds, vHive is representative of the leading clouds while comprising only fully open-source, production-grade components, such as the Kubernetes orchestrator and the AWS Firecracker hypervisor. To demonstrate vHive's utility, we analyze cold-start delays, revealing that the high cold-start latency of function instances is attributable to frequent page faults as the function's state is brought from disk into guest memory one page at a time. Our analysis further reveals that serverless functions operate over stable working sets, even across function invocations.
Third, to reduce the cold-start delays of serverless functions, we introduce a novel snapshotting mechanism that records and prefetches their memory working sets. This mechanism, called REAP, is implemented in userspace and consists of two phases. During the first invocation of a function, all accessed memory pages are recorded and their contents are stored compactly as part of the function snapshot. Starting from the second cold invocation, the contents of the recorded pages are retrieved from storage and installed in the guest memory before the new function instance starts to process the invocation, avoiding the majority of page faults and hence significantly accelerating the function's cold starts.
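The two-phase record-and-prefetch idea can be modeled in a few lines. Everything below (page numbers, the dictionary "backing store", the class names) is a toy assumption used to show why prefetching the recorded working set eliminates the demand-paging faults of later cold starts; it is not REAP's implementation.

```python
# Toy model of a REAP-style record-and-prefetch snapshot.
PAGE_SIZE = 4096

class Snapshot:
    def __init__(self):
        self.recorded_pages = {}  # page number -> page contents

class Instance:
    """A function instance whose guest memory is demand-paged from storage."""
    def __init__(self, snapshot, backing_store):
        self.snapshot = snapshot
        self.store = backing_store
        self.memory = {}
        self.page_faults = 0

    def prefetch(self):
        # Later cold starts: install the recorded working set into guest
        # memory before the invocation begins.
        self.memory.update(self.snapshot.recorded_pages)

    def access(self, page, record=False):
        if page not in self.memory:
            self.page_faults += 1            # fetched one page at a time
            self.memory[page] = self.store[page]
        if record:                           # first invocation: record pages
            self.snapshot.recorded_pages[page] = self.memory[page]
        return self.memory[page]

store = {n: b"\x00" * PAGE_SIZE for n in range(64)}
working_set = [3, 7, 7, 12, 3, 40]
snap = Snapshot()

first = Instance(snap, store)                # first (recording) invocation
for page in working_set:
    first.access(page, record=True)

second = Instance(snap, store)               # later cold start with prefetch
second.prefetch()
for page in working_set:
    second.access(page)
```

The first instance faults once per unique page, while the prefetched instance faults not at all on the same (stable) working set, which is the effect the thesis exploits.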
Finally, to accelerate cross-function data communication, we propose Expedited Data Transfers (XDT), an API-preserving high-performance data communication method for serverless. In production clouds, functions transmit intermediate data to other functions either inline or through a third-party storage service. The former approach is restricted to small transfer sizes, while the latter supports arbitrary transfers but suffers from performance and cost overheads. XDT enables direct function-to-function transfers in a way that is fully compatible with the existing autoscaling infrastructure. With XDT, a trusted component of the sender function buffers the payload in its memory and sends a secure reference to the receiver, which is picked by the load balancer and autoscaler based on the current load. Using the reference, the receiver instance pulls the transmitted data directly from the sender's memory, obviating the need for intermediary storage.
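The XDT reference-passing pattern can be sketched as follows. The class name, the in-process "transfer", and the use of a UUID as a stand-in for a secure reference are all illustrative assumptions; real XDT involves a trusted per-instance component and network transfers between machines.

```python
import uuid

class SenderBuffer:
    """Trusted component that buffers outgoing payloads in sender memory
    and hands out opaque references (XDT-style sketch, not the real API)."""

    def __init__(self):
        self._payloads = {}

    def put(self, payload: bytes) -> str:
        ref = str(uuid.uuid4())        # stand-in for a secure reference
        self._payloads[ref] = payload
        return ref

    def pull(self, ref: str) -> bytes:
        # The receiver pulls directly from sender memory and the slot is
        # freed; no third-party storage service is involved.
        return self._payloads.pop(ref)

# Sender side: buffer the (arbitrarily large) payload and transmit only the
# small reference, e.g. inside the invocation request the autoscaler routes.
buf = SenderBuffer()
payload = b"intermediate data" * 1000
ref = buf.put(payload)

# Receiver side: whichever instance the load balancer picked uses the
# reference to fetch the data from the sender.
received = buf.pull(ref)
```

Because only the fixed-size reference travels through the invocation path, the approach keeps inline-style compatibility with autoscaling while supporting large transfers.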
Preserving and sharing born-digital and hybrid objects from and across the National Collection
This report is one of a set of outputs from the Arts and Humanities Research Council funded project 'Preserving and sharing born-digital and hybrid objects from and across the National Collection'. It has been designed to provide an extensive account of the project's research activities and findings, to be useful to museum, heritage, and preservation professionals, as well as to scholars interested in born-digital materials.
The aims of the project were to instigate a conversation and build confidence across the museum sector to support the collecting of born-digital objects, and to lay the foundations for future research in the field. The research gathers the expertise of professionals from different backgrounds and has an international ambition; however, institutions addressing this type of collection tend to be concentrated in a few countries across Europe, Australia, and North America.
The research's methodology includes desk-based research, the focused investigation of four case studies, interviews, and workshops. The analysis of the data collected has supported the articulation of a set of themes and key ideas that provide the grounding for policy, research, and practice-related recommendations.
The report understands the challenges of collecting born-digital objects as going beyond the merely technical realm of obsolescence and broken dependencies, to address issues of legality, visibility, and accountability. It discusses the multi-layered and complex authorship of many born-digital objects associated with communities or corporate ownership, and expands on the potential of collaborative approaches to collection stewardship.
Censorship Citadels: Geography and the Social Control of Girls
This qualitative study examines the way in which local attempts to censor certain books reflect a greater community agenda of controlling young female behavior, specifically sexual and violent behavior. To abet my argument, I draw on Erikson's and Durkheim's theories on boundary maintenance, Gusfield's symbolic crusades, an intersectional feminist perspective, and scholarship on new forms of religious fundamentalism. Using data on frequently challenged books collected by the American Library Association, I identify the top three cities with populations over 100,000 that issued the greatest number of challenges between 2000 and 2009 ("Censorship Citadels") and compare these to three cities of similar size that challenged only one or zero titles. I document the changes in percent white, percent foreign-born, percent homeownership, and rates of poverty in each city, in addition to examining visible boundary breaches by girls for each of the three Censorship Citadels and their comparison cities. Visible boundary breaches by girls include 1) higher rates of births to minor girls, 2) no required notification or permission from parents for a minor's abortion, 3) higher likelihood the school distributes contraceptives, and 4) more newspaper articles covering girls' violence. Lastly, I undertake a content analysis of the books challenged by the Censorship Citadels (N=119) and the comparison cities (N=1) and theorize about the relationship between the books' contents and the community's perceived threats from visible norm breaking by girls. I suggest that cities experiencing more demographic changes during the decade and cities housing more megachurches are cities that attempt more social control of girls through frequent book challenges.
Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015. As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required.
To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems.
In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed necessary social context information for efficient queries.
The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms.
In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, traditional synchronization methods become expensive and severely impact the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms.
Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
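The synchronization strategy for parallel stream clustering can be made concrete with a toy comparison. The dimensionality, the dense-versus-delta message formats, and the update values below are all illustrative assumptions; the point is only that when an update round touches few dimensions of a sparse, high-dimensional centroid, broadcasting the changed entries is dramatically smaller than broadcasting the full centroid.

```python
# Sketch: broadcast incremental changes instead of whole centroids.
DIMENSIONS = 100_000

def dense_message(centroid):
    """Naive synchronization: ship one value per dimension."""
    return list(centroid)

def delta_message(old, new):
    """Incremental synchronization: ship only changed (dim, value) pairs."""
    return {d: v for d, (u, v) in enumerate(zip(old, new)) if u != v}

old = [0.0] * DIMENSIONS
new = list(old)
for d in (5, 17, 4_242):       # a sparse update touches only 3 dimensions
    new[d] = 1.0

full = dense_message(new)      # 100,000 values on the wire
delta = delta_message(old, new)  # 3 entries on the wire
```

Each worker applies the received deltas to its local copy of the centroids, so all replicas converge while the per-round communication cost scales with the update sparsity rather than the dimensionality.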
Using Workload Prediction and Federation to Increase Cloud Utilization
The wide-spread adoption of cloud computing has changed how large-scale computing infrastructure is built and managed. Infrastructure-as-a-Service (IaaS) clouds consolidate different separate workloads onto a shared platform and provide a consistent quality of service by overprovisioning capacity. This additional capacity, however, remains idle for extended periods of time and represents a drag on system efficiency.
The smaller scale of private IaaS clouds compared to public clouds exacerbates overprovisioning inefficiencies, as opportunities for workload consolidation in private clouds are limited. Federation and cycle-harvesting capabilities from computational grids help to improve efficiency, but to date have seen only limited adoption in the cloud due to a fundamental mismatch between the usage models of grids and clouds. Computational grids provide high throughput of queued batch jobs on a best-effort basis and enforce user priorities through dynamic job preemption, while IaaS clouds provide immediate feedback to user requests and make ahead-of-time guarantees about resource availability.
We present a novel method to enable workload federation across IaaS clouds that overcomes this mismatch between grid and cloud usage models and improves system efficiency while also offering availability guarantees. We develop a new method for faster-than-realtime simulation of IaaS clouds to make predictions about system utilization, and leverage this method to estimate the future availability of preemptible resources in the cloud. We then use these estimates to perform careful admission control and provide ahead-of-time bounds on the preemption probability of federated jobs executing on preemptible resources. Finally, we build an end-to-end prototype that addresses practical issues of workload federation and evaluates the prototype's efficacy using real-world traces from big data and compute-intensive production workloads.
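The admission-control step described above can be sketched as a simple probability calculation. The hourly survival forecast, the function name, and the numbers are illustrative assumptions standing in for the thesis's faster-than-realtime simulator output; the sketch only shows how such forecasts yield an ahead-of-time bound on preemption probability.

```python
# Sketch of preemption-aware admission control for federated batch jobs.
def admit(job_runtime_h, preemption_bound, idle_forecast):
    """Admit a federated job onto preemptible capacity only if its cumulative
    preemption probability over the whole runtime stays within the bound
    promised to the user."""
    p_survive = 1.0
    for hour in range(job_runtime_h):
        # idle_forecast[h]: predicted probability that the capacity remains
        # idle (not reclaimed by on-demand traffic) during hour h.
        p_survive *= idle_forecast[hour]
    return (1.0 - p_survive) <= preemption_bound

# Hypothetical simulator output: capacity is very likely idle for the next
# three hours, then demand is expected to pick up.
forecast = [0.99, 0.98, 0.97, 0.60, 0.50]

short_job = admit(3, 0.10, forecast)   # ~5.9% preemption risk: admitted
long_job = admit(5, 0.10, forecast)    # risk far above the bound: rejected
```

Keeping admission decisions tied to an explicit bound is what lets the system make cloud-style availability guarantees while still harvesting idle capacity grid-style.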
A Commercial Law for Software Contracting
Since the 1980s, software has been at the core of most modern organizations, most products, and most services. Part II of this Article examines how the U.C.C. evolved as the primary source of law for the first generation of computer contracts during the mainframe computer era. Part III examines how courts have overextended U.C.C. Article 2, as the main source of law for software licensing, to its limits. Part IV argues that the ALI and the NCCUSL should propose a new Article 2B for software licensing. Part V recommends a new Article 2C for "software as a service".