    Multi-Objective Big Data Optimization with jMetal and Spark

    Big Data Optimization is the term used to refer to optimization problems which have to manage very large amounts of data. In this paper, we focus on the parallelization of metaheuristics with the Apache Spark cluster computing system for solving multi-objective Big Data Optimization problems. Our purpose is to study the influence of accessing data stored in the Hadoop File System (HDFS) in each evaluation step of a metaheuristic and to provide a software tool to solve these kinds of problems. This tool combines the jMetal multi-objective optimization framework with Apache Spark. We have carried out experiments to measure the performance of the proposed parallel infrastructure in an environment based on virtual machines in a local cluster comprising up to 100 cores. We obtained interesting results for computational effort and propose guidelines to face multi-objective Big Data Optimization problems.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech

    Petuum: A New Platform for Distributed Machine Learning on Big Data

    What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, allowing ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters.
    Comment: 15 pages, 10 figures, final version in KDD 2015 under the same title