1,304 research outputs found
Recommended from our members
A Platform for Scalable Low-Latency Analytics using MapReduce
Today, the ability to process big data has become crucial to the information needs of many enterprise businesses, scientific applications, and governments. Recently, there have been increasing needs of processing data that is not only big but also fast . Here fast data refers to high-speed real-time and near real-time data streams, such as Twitter feeds, search query streams, click streams, impressions, and system logs. To handle both historical data and real-time data, many companies have to maintain multiple systems. However, recent real-world case studies show that maintaining multiple systems cause not only code duplication, but also intensive manual work to partition the analytics workloads and determine which data is processed by which system. These issues point to the need for a general, unified data processing framework to support analytical queries with different latency requirements.
This thesis takes a further step towards building a general, unified system for big and fast data analytics. In order to build such a system, I propose to build on existing solutions on data parallelism and extend them with two new features: incremental processing and stream processing with latency constraints. This thesis starts with Hadoop, the most popular open-source MapReduce implementation, which provides proven scalability based on data parallelism. I answer the following questions: (1) Is Hadoop able to support incremental processing? (2) What are the necessary architecture changes in order to support incremental processing? (3) What are the additional design features needed to support stream processing with latency constraints? The thesis includes three parts that answer each of the questions.
The first part of the thesis validates whether the existing MapReduce implementations can support incremental processing. Incremental processing means that computation is performed as soon as the relevant data becomes available. My extensive benchmark study of Hadoop-based MapReduce systems shows that the widely-used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental computation. I further propose a cost model, and optimize the Hadoop system configuration based on the model. The benchmark results over the optimized system verify that the barrier to incremental computation is intrinsic, and cannot be removed by tuning system parameters.
In the second part of the thesis, I employ various purely hash-based techniques to enable fast in-memory incremental processing in MapReduce, and frequent key based techniques to extend such processing to workloads that require memory more than available. I evaluate my Hadoop-based prototype equipped with all proposed techniques. The results show that the hash techniques allow the reduce progress to keep up with the map progress with up to 3 orders of magnitude reduction of internal disk spills, and enable results to be returned early.
The third part of the thesis aims to support stream processing with latency constraints based on the incremental processing platform resulted from the second part. I perform a benchmark study to understand the sources of latency. I then propose a number of necessary architecture changes to support stream processing, and augment the platform with new latency-aware model-driven resource planning and latency-aware runtime scheduling techniques to meet user-specified latency constraints while maximizing throughput. Experiments using real-world workloads show that the techniques reduce the latency from tens or hundreds of seconds to sub-second, with 2x-5x increase in throughput. The new platform offers 1-2 orders of magnitude improvements over Storm, a commercial-grade distributed stream system, and Spark Streaming, a state-of-the-art academic prototype, when considering both latency and throughput
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research
MRFS: A Multi-Resource Fair Scheduling Algorithm in Heterogeneous Cloud Computing
Task scheduling in cloud computing is considered as a significant issue that has attracted much attention over the last decade. In cloud environments, users expose considerable interest in submitting tasks on multiple Resource types. Subsequently, finding an optimal and most efficient server to host users’ tasks seems a fundamental concern. Several attempts have suggested various algorithms, employing Swarm optimization and heuristics methods to solve the scheduling issues associated with cloud in a multi-resource perspective. However, these approaches have not considered the equalization of dominant resources on each specific resource type. This substantial gap leads to unfair allocation, SLA degradation and resource contention. To deal with this problem, in this paper we propose a novel task scheduling mechanism called MRFS. MRFS employs Lagrangian multipliers to locate tasks in suitable servers with respect to the number of dominant resources and maximum resource availability. To evaluate MRFS, we conduct time-series experiments in the cloudsim driven by randomly generated workloads. The results show that MRFS maximizes per-user utility function by %15-20 in FFMRA compared to FFMRA in absence of MRFS. Furthermore, the mathematical proofs confirm that the sharingincentive, and Pareto-efficiency properties are improved under MRF
Improving capacity-performance tradeoffs in the storage tier
Data-set sizes are growing. New techniques are emerging to organize and analyze these data-sets. There is a key access pattern emerging with these new techniques, large sequential file accesses. The trend toward bigger files exists to help amortize the cost of data accesses from the storage layer, as many workloads are recognized to be I/O bound. The storage layer is widely recognized as the slowest layer in the system. This work focuses on the tradeoff one can make with that storage capacity to improve system performance. ^ Capacity can be leveraged for improved availability or improved performance. This tradeoff is key in the storage layer, as this allows for data loss prevention and bandwidth aggregation. Typically these tradeoffs do not allow much choice with regard to capacity use. This work will leverage replication as the enabling mechanism to improve the capacity-performance tradeoff in the storage tier, while still providing for availability. ^ This capacity-performance tradeoff can be made at both the local and distributed file system level. I propose two techniques that allow for an improved tradeoff of capacity. The local file system can be employed on scale-out or scale-up infrastructures to improve performance. The distributed file system is targeted at distributed frameworks, such as MapReduce, to improve the cluster performance. The local file system design is MorphStore, and the distributed file system is BoostDFS. ^ MorphStore is a file system that significantly improves performance when accessing large files by using two innovations. MorphStore combines (a) load-adaptive I/O access scheduling to dynamically optimize throughput (aggregation), and (b) utility-xiii driven replication to best use capacity for performance. Additionally, adaptive-access scheduling can be utilized to optimize scheduling of requests (for throughput) on systems with a large number of storage devices. Replication is utilized to make available high utility files and then optimize throughput of these high utility files based on system load. ^ BoostDFS is a distributed file system that allows a better capacity-performance tradeoff via inter-node file replication. BoostDFS is built on the observation that distributed file systems currently inter-node replication for availability, but provide no mechanism to further improve performance. Replication for availability provides diminishing returns on performance, this is due to saturation of locality. BoostDFS exploits the common by improving I/O performance of these local tasks. This is done via intra-node replication by leveraging MorphStore as the local file system. This technique allows for capacity to be traded for availability as well as performance, with a small capacity overhead under constant availability. ^ Both MorphStore and BoostDFS utilize replication. Replication allows for both bandwidth aggregation and availability, This work primarily focuses on the performance utility of replication, but does not sacrifice availability in the process. These techniques provide an improved capacity-performance tradeoff while allowing the desired level of availability
Economic regulation for multi tenant infrastructures
Large scale computing infrastructures need scalable and effi cient resource allocation mechanisms to ful l the requirements of its participants and applications while the whole system is regulated to work e ciently. Computational markets provide e fficient allocation mechanisms that aggregate information from multiple sources in large, dynamic and complex systems where there is not a single source with complete information. They have been proven to be successful in matching resource demand and resource supply in the presence of sel sh multi-objective and utility-optimizing users and sel sh pro t-optimizing providers. However, global infrastructure metrics which may not directly affect participants of the computational market still need to be addressed -a.k.a. economic externalities like load balancing or energy-efficiency.
In this thesis, we point out the need to address these economic externalities, and we design and evaluate appropriate regulation mechanisms from di erent perspectives on top of existing economic models, to incorporate a wider range of objective metrics not considered otherwise. Our main contributions in this thesis are threefold; fi rst, we propose a taxation mechanism that addresses the resource congestion problem e ffectively improving the balance of load among resources when correlated economic preferences are present; second,
we propose a game theoretic model with complete information to derive an algorithm to aid resource providers to scale up and down resource supply so energy-related costs can be reduced; and third, we relax our previous assumptions about complete information on the resource provider side and design an incentive-compatible mechanism to encourage users to truthfully report their resource requirements effectively assisting providers to make energy-eff cient allocations while providing a dynamic allocation mechanism to users.Les infraestructures computacionals de gran escala necessiten mecanismes d’assignació de recursos escalables i eficients per complir amb els requisits computacionals de tots els seus participants, assegurant-se de que el sistema és regulat apropiadament per a que funcioni de manera efectiva. Els mercats computacionals són mecanismes d’assignació de recursos eficients que incorporen informació de diferents fonts considerant sistemes de gran escala, complexos i dinà mics on no existeix una única font que proveeixi informació completa de l'estat del sistema. Aquests mercats computacionals han demostrat ser exitosos per acomodar la demanda de recursos computacionals amb la seva oferta quan els seus participants son considerats estratègics des del punt de vist de teoria de jocs. Tot i això existeixen mètriques a nivell global sobre la infraestructura que no tenen per que influenciar els usuaris a priori de manera directa. Aixà doncs, aquestes externalitats econòmiques com poden ser el balanceig de cà rrega o la eficiència energètica, conformen una lÃnia d’investigació que cal explorar. En aquesta tesi, presentem i descrivim la problemà tica derivada d'aquestes externalitats econòmiques. Un cop establert el marc d’actuació, dissenyem i avaluem mecanismes de regulació apropiats basats en models econòmics existents per resoldre aquesta problemà tica des de diferents punts de vista per incorporar un ventall més ampli de mètriques objectiu que no havien estat considerades fins al moment. Les nostres contribucions principals tenen tres vessants: en primer lloc, proposem un mecanisme de regulació de tipus impositiu que tracta de mitigar l’aparició de recursos sobre-explotats que, efectivament, millora el balanceig de la cà rrega de treball entre els recursos disponibles; en segon lloc, proposem un model teòric basat en teoria de jocs amb informació o completa que permet derivar un algorisme que facilita la tasca dels proveïdors de recursos per modi car a l'alça o a la baixa l'oferta de recursos per tal de reduir els costos relacionats amb el consum energètic; i en tercer lloc, relaxem la nostra assumpció prèvia sobre l’existència d’informació complerta per part del proveïdor de recursos i dissenyem un mecanisme basat en incentius per fomentar que els usuaris facin pública de manera verÃdica i explÃcita els seus requeriments computacionals, ajudant d'aquesta manera als proveïdors de recursos a fer assignacions eficients des del punt de vista energètic a la vegada que oferim un mecanisme l’assignació de recursos dinà mica als usuari
Managing contamination delay to improve Timing Speculation architectures
Timing Speculation (TS) is a widely known method for realizing better-than-worst-case systems. Aggressive clocking, realizable by TS, enable systems to operate beyond specified safe frequency limits to effectively exploit the data dependent circuit delay. However, the range of aggressive clocking for performance enhancement under TS is restricted by short paths. In this paper, we show that increasing the lengths of short paths of the circuit increases the effectiveness of TS, leading to performance improvement. Also, we propose an algorithm to efficiently add delay buffers to selected short paths while keeping down the area penalty. We present our algorithm results for ISCAS-85 suite and show that it is possible to increase the circuit contamination delay by up to 30% without affecting the propagation delay. We also explore the possibility of increasing short path delays further by relaxing the constraint on propagation delay and analyze the performance impact
Cloud computing resource scheduling and a survey of its evolutionary approaches
A disruptive technology fundamentally transforming the way that computing services are delivered, cloud computing offers information and communication technology users a new dimension of convenience of resources, as services via the Internet. Because cloud provides a finite pool of virtualized on-demand resources, optimally scheduling them has become an essential and rewarding topic, where a trend of using Evolutionary Computation (EC) algorithms is emerging rapidly. Through analyzing the cloud computing architecture, this survey first presents taxonomy at two levels of scheduling cloud resources. It then paints a landscape of the scheduling problem and solutions. According to the taxonomy, a comprehensive survey of state-of-the-art approaches is presented systematically. Looking forward, challenges and potential future research directions are investigated and invited, including real-time scheduling, adaptive dynamic scheduling, large-scale scheduling, multiobjective scheduling, and distributed and parallel scheduling. At the dawn of Industry 4.0, cloud computing scheduling for cyber-physical integration with the presence of big data is also discussed. Research in this area is only in its infancy, but with the rapid fusion of information and data technology, more exciting and agenda-setting topics are likely to emerge on the horizon
- …