
    A Novel Ant based Clustering of Gene Expression Data using MapReduce Framework

    Get PDF
    Genes which exhibit similar patterns are often functionally related. Microarray technology provides a unique tool to examine how a cell's gene expression pattern changes under various conditions. Analyzing and interpreting these gene expression data is a challenging task. Clustering is one of the most useful and popular methods for extracting patterns from these gene expression data. In this paper, a multi-colony ant-based clustering approach is proposed. The whole processing procedure is divided into two parts: the first is the construction of a minimum spanning tree from the gene expression data using a MapReduce version of ant colony optimization techniques. The second part is clustering, which is done by cutting the costlier edges from the minimum spanning tree, followed by a one-step k-means clustering procedure. Applied to different file sizes of gene expression data over different numbers of processors, the proposed approach exhibits good scalability and accuracy.
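
    As a rough single-machine illustration of the second stage, the Python sketch below builds the minimum spanning tree over a dense gene-by-condition matrix X, removes the k-1 costliest edges, and applies one k-means refinement pass. SciPy's exact minimum_spanning_tree stands in for the paper's MapReduce ant colony construction, and all function names here are ours, not the paper's.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    def mst_cut_clusters(X, k):
        """Build an MST over pairwise distances, drop the k-1 costliest
        edges, and take the resulting components as clusters (k >= 2)."""
        D = squareform(pdist(X))                  # dense pairwise distances
        mst = minimum_spanning_tree(D).toarray()  # n x n MST edge weights
        edges = np.argwhere(mst > 0)
        order = np.argsort(mst[edges[:, 0], edges[:, 1]])
        for i, j in edges[order[-(k - 1):]]:      # cut the heaviest edges
            mst[i, j] = 0.0
        _, labels = connected_components(mst, directed=False)
        return labels

    def one_step_kmeans(X, labels, k):
        """A single k-means refinement pass over the MST-induced partition."""
        centroids = np.stack([X[labels == c].mean(axis=0) for c in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)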

    Experimental Analysis of Algorithms for Coflow Scheduling

    Full text link
    Modern data centers face new scheduling challenges in optimizing job-level performance objectives; a significant one is the scheduling of highly parallel data flows with a common performance goal (e.g., the shuffle operations in MapReduce applications). Chowdhury and Stoica introduced the coflow abstraction to capture these parallel communication patterns, and Chowdhury et al. proposed effective heuristics to schedule coflows efficiently. In our previous paper, we considered the strongly NP-hard problem of minimizing the total weighted completion time of coflows with release dates, and developed the first polynomial-time scheduling algorithms with O(1)-approximation ratios. In this paper, we carry out a comprehensive experimental analysis on a Facebook trace and extensive simulated instances to evaluate the practical performance of several algorithms for coflow scheduling, including the approximation algorithms developed in our previous paper. Our experiments suggest that simple algorithms provide effective approximations of the optimum, and that the performance of our approximation algorithms is relatively robust, near optimal, and always among the best compared with the other algorithms, in both the offline and online settings. Comment: 29 pages, 8 figures, 11 tables
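
    The paper's O(1)-approximation algorithms are beyond a short excerpt, but the "simple algorithms" finding is easy to illustrate. The Python sketch below orders released coflows by total remaining bytes per unit weight, a weighted shortest-coflow-first baseline of the kind such experiments compare against; the Coflow fields are our own simplification, not the paper's model.

    from dataclasses import dataclass, field

    @dataclass
    class Coflow:
        name: str
        release: float    # release date
        weight: float     # weight in the total weighted completion time
        flows: dict = field(default_factory=dict)  # (src, dst) -> bytes left

    def weighted_size(c):
        # Fewer remaining bytes per unit weight => finish (and count) sooner.
        return sum(c.flows.values()) / c.weight

    def schedule_order(coflows, now):
        """Order the already-released coflows, weighted shortest first."""
        return sorted((c for c in coflows if c.release <= now),
                      key=weighted_size)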

    GIANT: Globally Improved Approximate Newton Method for Distributed Optimization

    Full text link
    In a distributed computing environment, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, which is sent to the main driver. The main driver then averages all the ANT directions received from workers to form a Globally Improved ANT (GIANT) direction. GIANT is highly communication-efficient and naturally exploits the trade-off between local computation and global communication, in that more local computation results in fewer overall rounds of communication. Theoretically, we show that GIANT enjoys an improved convergence rate compared with first-order methods and existing distributed Newton-type methods. Further, and in sharp contrast with many existing distributed Newton-type methods as well as popular first-order methods, a highly advantageous practical feature of GIANT is that it involves only one tuning parameter. We conduct large-scale experiments on a computer cluster and, empirically, demonstrate the superior performance of GIANT. Comment: Fixed some typos. Improved writing
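
    A minimal single-process simulation of one GIANT iteration for l2-regularized least squares (Python with NumPy) may help fix ideas. Here worker_data is a list of (X_i, y_i) shards, a unit step size stands in for any line search, and all names are illustrative rather than taken from the paper's code.

    import numpy as np

    def giant_step(w, worker_data, lam=1e-3):
        """One GIANT iteration: exact global gradient, locally solved
        Newton systems, and an averaged (GIANT) descent direction."""
        n = sum(X.shape[0] for X, _ in worker_data)
        # Global gradient of (1/2n) sum ||X w - y||^2 + (lam/2) ||w||^2
        g = sum(X.T @ (X @ w - y) for X, y in worker_data) / n + lam * w
        directions = []
        for X, _ in worker_data:
            H = X.T @ X / X.shape[0] + lam * np.eye(len(w))  # local Hessian
            directions.append(np.linalg.solve(H, g))         # ANT direction
        return w - np.mean(directions, axis=0)               # GIANT update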

    A Survey on Automatic Parameter Tuning for Big Data Processing Systems

    Get PDF
    Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators grapple with understanding and tuning them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning. Peer reviewed
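
    As a concrete taste of the first (rule-based) category, the Python sketch below derives a few widely known Spark parameters from node hardware. The sizing rules here are generic folklore chosen for illustration, not recommendations drawn from the survey.

    def rule_based_spark_settings(node_mem_gb, node_cores, executors_per_node=2):
        """Toy rule-based tuner: map node hardware to Spark settings."""
        exec_cores = max(1, node_cores // executors_per_node)
        exec_mem_gb = int(node_mem_gb / executors_per_node * 0.9)  # OS headroom
        return {
            "spark.executor.cores": str(exec_cores),
            "spark.executor.memory": f"{exec_mem_gb}g",
            "spark.sql.shuffle.partitions": str(2 * node_cores * executors_per_node),
        }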