BigDataBench: a Big Data Benchmark Suite from Internet Services
As architecture, systems, and data management communities pay greater
attention to innovative big data systems and architectures, the pressure of
benchmarking and evaluating these systems rises. Considering the broad use of
big data systems, big data benchmarks must include diversity of data and
workloads. Most of the state-of-the-art big data benchmarking efforts target
evaluating specific types of applications or system software stacks, and hence
they are not qualified for serving the purposes mentioned above. This paper
presents our joint research efforts on this issue with several industrial
partners. Our big data benchmark suite BigDataBench not only covers broad
application scenarios, but also includes diverse and representative data sets.
BigDataBench is publicly available from http://prof.ict.ac.cn/BigDataBench .
Also, we comprehensively characterize 19 big data workloads included in
BigDataBench with varying data inputs. On a typical state-of-practice
processor, Intel Xeon E5645, we have the following observations: First, in
comparison with the traditional benchmarks: including PARSEC, HPCC, and
SPECCPU, big data applications have very low operation intensity; Second, the
volume of data input has non-negligible impact on micro-architecture
characteristics, which may impose challenges for simulation-based big data
architecture research; Last but not least, corroborating the observations in
CloudSuite and DCBench (which use smaller data inputs), we find that the
numbers of L1 instruction cache misses per 1000 instructions of the big data
applications are higher than in the traditional benchmarks; also, we find that
L3 caches are effective for the big data applications, corroborating the
observation in DCBench.
Comment: 12 pages, 6 figures. The 20th IEEE International Symposium on High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, US
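The abstract reports L1 instruction cache misses per 1000 instructions (MPKI), a standard way to normalize cache behavior across workloads of different lengths. A minimal sketch of that arithmetic, using made-up counter values rather than any figures from the paper:

```python
# Misses-per-kilo-instruction (MPKI), the normalization the abstract uses to
# compare cache behavior. The counter values below are illustrative only,
# not measurements from BigDataBench.

def mpki(cache_misses: int, instructions: int) -> float:
    """Cache misses per 1000 retired instructions."""
    return cache_misses / instructions * 1000

# Example: 4.2 million L1i misses over 1 billion instructions -> 4.2 MPKI.
print(mpki(4_200_000, 1_000_000_000))  # 4.2
```

On real hardware these counters would come from performance monitoring units (e.g. via Linux perf); the formula is the same regardless of the counter source.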
Big Data Analysis: Implementations of Hadoop Map Reduce, Yarn and Spark
Nowadays, with the increasingly important role of technology, the internet, and ever-growing data volumes, managing and analyzing these data has become not only possible but necessary, because processing and retrieving information from them is difficult. The memory consumed by such data has reached terabytes or petabytes, which makes processing, analysis, and retrieval challenging, and working with conventional statistical programs has become very hard. Several programming models are used for big data processing, such as MapReduce. Big data processing also faces obstacles and challenges, including poor bounded-time performance under heavy workloads and high cost. In this study, different big data implementations are demonstrated, and we discuss open issues and challenges raised by them. The findings compare several big data platforms, namely Hadoop, Yarn, and Spark. Finally, we provide recommendations for further research on which of these implementations best suits the data to be processed, according to specific criteria.
Keywords: Big data, MapReduce, Hadoop, Spark, Yarn.
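The MapReduce model mentioned above can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases on a toy word-count job; real frameworks such as Hadoop distribute each phase across a cluster, and the records here are invented for the example:

```python
from collections import defaultdict
from itertools import chain

# Minimal single-process sketch of the MapReduce model: map emits key-value
# pairs, shuffle groups them by key, reduce aggregates each group.

def map_phase(record: str):
    """Emit (word, 1) pairs for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data needs big systems", "spark processes big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(r) for r in records)))
print(counts["big"])  # 3
```

The split into independent map tasks and key-grouped reduce tasks is what lets the model scale out: only the shuffle phase requires moving data between machines.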
Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts
The paradigm of big data is characterized by the need to collect and process
data sets of great volume, arriving at the systems with great velocity, in a
variety of formats. Spark is a widely used big data processing system that can
be integrated with Hadoop to provide powerful abstractions to developers, such
as distributed storage through HDFS and resource management through YARN. When
all the required configurations are made, Spark can also provide quality
attributes, such as scalability, fault tolerance, and security. However, all of
these benefits come at the cost of complexity, with high memory requirements,
and additional latency in processing. An alternative approach is to use a lean
software stack, like Unicage, that delegates most control back to the
developer. In this work we evaluated the performance of big data processing
with Spark versus Unicage, in a cluster environment hosted in the IBM Cloud.
Two sets of experiments were performed: batch processing of unstructured data
sets, and query processing of structured data sets. The input data sets were of
significant size, ranging from 64 GB to 8192 GB in volume. The results show
that the performance of Unicage scripts is superior to Spark for search
workloads like grep and select, but that the abstractions of distributed
storage and resource management from the Hadoop stack enable Spark to execute
workloads with inter-record dependencies, such as sort and join, with correct
outputs.
Comment: 10 pages, 14 figures
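The distinction the paper draws, between record-independent workloads (grep, select) and workloads with inter-record dependencies (sort, join), can be sketched as follows. The data and function names are illustrative, not taken from either system:

```python
import heapq

# Contrast between the two workload classes compared in the paper:
# a grep-style filter decides each record on its own and can stream,
# while a global sort must merge records across every partition
# (the shuffle step that Spark's distributed stack provides).

def grep(lines, pattern):
    """Record-independent: each line is kept or dropped in isolation,
    so the work splits across machines with no coordination."""
    for line in lines:
        if pattern in line:
            yield line

def distributed_sort(partitions):
    """Inter-record dependent: a globally sorted output requires
    combining records from all partitions, not just each one locally."""
    return list(heapq.merge(*(sorted(p) for p in partitions)))

lines = ["error: disk full", "ok", "error: timeout"]
print(list(grep(lines, "error")))          # ['error: disk full', 'error: timeout']
print(distributed_sort([[3, 1], [2, 0]]))  # [0, 1, 2, 3]
```

This is why a lean shell-pipeline stack can win on grep-like workloads but needs the distributed storage and resource management of the Hadoop stack to get sort and join right at scale.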
Data modeling with NoSQL : how, when and why
Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 201
BDEv 3.0: energy efficiency and microarchitectural characterization of Big Data processing frameworks
This is a post-peer-review, pre-copyedit version of an article published in Future Generation Computer Systems. The final authenticated version is available online at: https://doi.org/10.1016/j.future.2018.04.030
[Abstract] As the size of Big Data workloads keeps increasing, the evaluation of distributed frameworks becomes a crucial task in order to identify potential performance bottlenecks that may delay the processing of large datasets. While most of the existing works generally focus only on execution time and resource utilization, analyzing other important metrics is key to fully understanding the behavior of these frameworks. For example, microarchitecture-level events can bring meaningful insights to characterize the interaction between frameworks and hardware. Moreover, energy consumption is also gaining increasing attention as systems scale to thousands of cores. This work discusses the current state of the art in evaluating distributed processing frameworks, while extending our Big Data Evaluator tool (BDEv) to extract energy efficiency and microarchitecture-level metrics from the execution of representative Big Data workloads. An experimental evaluation using BDEv demonstrates its usefulness to bring meaningful information from popular frameworks such as Hadoop, Spark and Flink.
Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P
Ministerio de Educación; FPU14/02805
Ministerio de Educación; FPU15/0338
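The abstract says BDEv reports energy efficiency alongside execution time, but does not spell out the metric here. A hedged sketch of the standard arithmetic such evaluations build on (energy as average power times runtime, efficiency as work per joule), with invented numbers rather than BDEv's actual definitions or measurements:

```python
# Standard energy-efficiency arithmetic, not BDEv's exact metrics:
# the tool's definitions are not given in the abstract. Power, runtime,
# and record counts below are made up for illustration.

def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy consumed = average power x elapsed time."""
    return avg_power_watts * runtime_seconds

def records_per_joule(records: int, energy: float) -> float:
    """A throughput-per-energy efficiency figure."""
    return records / energy

e = energy_joules(250.0, 120.0)         # 30000.0 J for a 2-minute run at 250 W
print(records_per_joule(6_000_000, e))  # 200.0
```

Comparing frameworks on such a figure, rather than runtime alone, is what lets a slower but lower-power configuration come out ahead, which is the motivation the abstract gives for measuring energy at all.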