    A Comparative Study of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

    Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. These platforms not only aim to improve performance but also provide high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop. Yet without comparison data available, how are data scientists to know which system to choose? This research compares Apache Hadoop MapReduce, Apache Spark, and Apache Flink from the perspectives of performance, usability, and practicality for batch-oriented data analytics. We propose and apply a methodology that guides the design of multidimensional software comparisons and the presentation of their results. The methodology proved effective, providing direction and structure to the comparison, and should be helpful for future comparisons. The results confirm that Spark and Flink are superior to Hadoop MapReduce in performance and usability. Spark and Flink were similar across all three considerations; however, as the methodology allows, readers can adjust the weightings to their needs, which could differentiate the two. We also report on the design, execution, and results of a large-scale usability study with a cohort of master's students, who learned and worked with all three platforms, solving different use cases in data science contexts. Our findings show that Spark and Flink are the preferred platforms over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink. These results were included in the usability component of the multidimensional comparison.
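
    As a concrete illustration of the usability gap the abstract describes, here is a minimal, hypothetical Spark sketch in Scala (not code from the study): the filter-and-join task below takes a few operator calls, whereas the same task in Hadoop MapReduce would require separate Mapper and Reducer classes plus a driver. The dataset contents and column names are invented for illustration.

    ```scala
    // A minimal sketch (assumed example, not from the paper): filter + join
    // expressed with Spark's high-level operators.
    import org.apache.spark.sql.SparkSession

    object FilterJoinExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("FilterJoinExample")
          .master("local[*]") // local mode for illustration only
          .getOrCreate()
        import spark.implicits._

        // Hypothetical inputs: (userId, pageId) visits and (pageId, category) metadata.
        val visits = Seq((1, "p1"), (2, "p2"), (1, "p3")).toDF("userId", "pageId")
        val pages  = Seq(("p1", "news"), ("p3", "sports")).toDF("pageId", "category")

        // High-level operators: drop one user's visits, then join on pageId.
        val result = visits
          .filter($"userId" =!= 2)
          .join(pages, "pageId")

        result.show()
        spark.stop()
      }
    }
    ```

    An equivalent Flink DataSet program would look much the same; the point of such operators is that the join logic lives in one expression rather than being spread across map, shuffle, and reduce phases.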

    Mate Marote: a BigData platform for massive scale educational interventions

    In this paper we present Mate Marote, a web platform for massive-scale educational interventions. We focus on the scaling requirements imposed by this kind of deployment. We present the architecture, how its design decisions satisfy those requirements, and the implementation. To test this development, we performed a small pilot intervention in which the whole system was evaluated. We conclude that Mate Marote is ready for production deployment and for middle- to massive-scale interventions. To this end, we have deployed the platform within the CEIBAL program in Uruguay, with more than 100K potential users.

    Towards Making Distributed RDF Processing FLINKer

    In the last decade, the Resource Description Framework (RDF) has become the de facto standard for publishing semantic data on the Web. This steady adoption has led to a significant increase in the number and volume of available RDF datasets, exceeding the capabilities of traditional RDF stores. This scenario has introduced severe big semantic data challenges when it comes to managing and querying RDF data at Web scale. Despite the existence of various off-the-shelf Big Data platforms, processing RDF in a distributed environment remains a significant challenge. In this position paper, based on an in-depth analysis of the state of the art, we propose to manage large RDF datasets in Flink, a well-known scalable distributed Big Data processing framework. Our approach, which we refer to as FLINKer, extends the native graph abstraction of Flink, called Gelly, with RDF graph and SPARQL query processing capabilities.
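
    The position paper does not ship code here, but the core idea can be sketched: represent RDF triples as a distributed dataset of (subject, predicate, object) values and answer a SPARQL-style triple-pattern query with filter and join operators. The following Scala sketch uses Flink's plain DataSet API rather than the Gelly extension the authors propose, and the triples and query are hypothetical.

    ```scala
    // A minimal sketch (not the FLINKer implementation): triple-pattern
    // matching over RDF triples with Flink's batch DataSet API.
    import org.apache.flink.api.scala._

    object RdfJoinSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Tiny hypothetical RDF graph as (subject, predicate, object) triples.
        val triples = env.fromElements(
          ("alice", "knows",   "bob"),
          ("bob",   "knows",   "carol"),
          ("bob",   "livesIn", "berlin")
        )

        // SPARQL-like query:
        //   SELECT ?x ?city WHERE { alice knows ?x . ?x livesIn ?city }
        val knowsAlice = triples.filter(t => t._1 == "alice" && t._2 == "knows")
        val livesIn    = triples.filter(t => t._2 == "livesIn")

        // Join the two pattern results on the shared variable ?x:
        // the object of the "knows" triple equals the subject of "livesIn".
        val result = knowsAlice
          .join(livesIn)
          .where(2)
          .equalTo(0)
          .map { case (k, l) => (k._3, l._3) } // (?x, ?city)

        result.print()
      }
    }
    ```

    Distributing the triples this way makes each pattern a parallel scan and each shared variable a distributed join key, which is exactly where a scalable engine such as Flink can help.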

    Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture

    We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing; it points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.
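
    To make the contrast concrete, the sketch below shows the kind of in-memory structure such a low-latency engine can be built around: query co-occurrence counts kept in process memory and ranked over a short sliding window, so fresh events influence suggestions within minutes instead of waiting on a batch job. This is an illustrative Scala simplification, not Twitter's production design; the window length, session notion, and scoring are assumptions.

    ```scala
    // A hypothetical in-memory related-query counter with a sliding window.
    import scala.collection.mutable

    class RelatedQueryCounter(windowMillis: Long) {
      // query -> (relatedQuery -> timestamps of co-occurrence observations)
      private val cooccurrences =
        mutable.Map.empty[String, mutable.Map[String, mutable.ArrayBuffer[Long]]]

      /** Record that two queries were issued in the same session. */
      def observe(q1: String, q2: String, now: Long): Unit = synchronized {
        for ((a, b) <- Seq((q1, q2), (q2, q1))) {
          val inner = cooccurrences.getOrElseUpdate(a, mutable.Map.empty)
          inner.getOrElseUpdate(b, mutable.ArrayBuffer.empty) += now
        }
      }

      /** Top-k related queries, ranked by co-occurrences inside the window. */
      def suggest(q: String, now: Long, k: Int): Seq[(String, Int)] = synchronized {
        val cutoff = now - windowMillis
        cooccurrences.get(q).toSeq.flatMap { inner =>
          inner.toSeq
            .map { case (rel, ts) => (rel, ts.count(_ >= cutoff)) }
            .filter(_._2 > 0)
        }.sortBy(-_._2).take(k)
      }
    }

    object RelatedQueryCounterDemo {
      def main(args: Array[String]): Unit = {
        val counter = new RelatedQueryCounter(windowMillis = 10 * 60 * 1000)
        val now = System.currentTimeMillis()
        counter.observe("earthquake", "earthquake magnitude", now)
        counter.observe("earthquake", "tsunami warning", now)
        counter.observe("earthquake", "tsunami warning", now)
        println(counter.suggest("earthquake", now, k = 2)) // fresh events rank first
      }
    }
    ```

    Because every update and lookup touches only in-process memory, end-to-end latency is bounded by network hops rather than by job scheduling and HDFS scans, which is the gap the paper identifies in the Hadoop-based first implementation.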

    BDEv 3.0: energy efficiency and microarchitectural characterization of Big Data processing frameworks

    This is a post-peer-review, pre-copyedit version of an article published in Future Generation Computer Systems; the final authenticated version is available online at https://doi.org/10.1016/j.future.2018.04.030.
    As the size of Big Data workloads keeps increasing, the evaluation of distributed frameworks becomes a crucial task to identify potential performance bottlenecks that may delay the processing of large datasets. While most existing works focus only on execution time and resource utilization, analyzing other important metrics is key to fully understanding the behavior of these frameworks. For example, microarchitecture-level events can bring meaningful insights to characterize the interaction between frameworks and hardware. Moreover, energy consumption is gaining increasing attention as systems scale to thousands of cores. This work discusses the current state of the art in evaluating distributed processing frameworks, while extending our Big Data Evaluator tool (BDEv) to extract energy efficiency and microarchitecture-level metrics from the execution of representative Big Data workloads. An experimental evaluation using BDEv demonstrates its usefulness in bringing meaningful information from popular frameworks such as Hadoop, Spark and Flink.
    Funding: Ministerio de Economía, Industria y Competitividad (TIN2016-75845-P); Ministerio de Educación (FPU14/02805, FPU15/0338).
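
    As an illustration of the derived metrics such an evaluation can report, the following Scala sketch (not part of BDEv) combines a run's execution time and average power draw into total energy, the energy-delay product (EDP), and records processed per joule. The workload names and all measurements are hypothetical.

    ```scala
    // Hypothetical derived energy-efficiency metrics for benchmark runs.
    object EnergyMetrics {
      final case class Run(name: String, seconds: Double, avgWatts: Double, records: Long)

      def main(args: Array[String]): Unit = {
        val runs = Seq(
          Run("Hadoop WordCount", seconds = 420.0, avgWatts = 310.0, records = 2000000000L),
          Run("Spark WordCount",  seconds = 150.0, avgWatts = 340.0, records = 2000000000L)
        )
        for (r <- runs) {
          val joules = r.avgWatts * r.seconds // E = P_avg * t
          val edp    = joules * r.seconds     // EDP = E * t (lower is better)
          val recPerJ = r.records / joules    // throughput per unit of energy
          println(f"${r.name}%-18s energy=${joules}%9.0f J  EDP=${edp}%12.0f J*s  rec/J=${recPerJ}%8.0f")
        }
      }
    }
    ```

    Note how EDP penalizes slow runs twice: a framework that finishes in a third of the time can come out ahead even at a higher average power draw, which is why time-only comparisons can be misleading.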