A Novel Ant based Clustering of Gene Expression Data using MapReduce Framework
Genes that exhibit similar expression patterns are often functionally related. Microarray technology provides a unique tool to examine how a cell's gene expression pattern changes under various conditions. Analyzing and interpreting these gene expression data is a challenging task, and clustering is a useful and popular method for extracting meaningful patterns from them. In this paper, a multi-colony ant-based clustering approach is proposed. The processing procedure is divided into two parts. The first is the construction of a minimum spanning tree from the gene expression data using a MapReduce version of ant colony optimization. The second is clustering, done by cutting the costliest edges of the minimum spanning tree, followed by a one-step k-means procedure. Applied to gene expression data files of different sizes over varying numbers of processors, the proposed approach exhibits good scalability and accuracy.
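The two-phase pipeline described in the abstract (build an MST, cut the costliest edges, refine with one k-means step) can be sketched as follows. This is a hedged illustration, not the paper's implementation: the ACO/MapReduce MST construction is replaced by a plain Prim-style MST, and the function `mst_cluster` and its arguments are hypothetical names introduced here.

```python
import numpy as np

def mst_cluster(points, k):
    """Cluster rows of `points` into k groups: MST -> cut k-1
    costliest edges -> one k-means refinement step."""
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    # Prim's algorithm: grow the MST from node 0, recording edges.
    edges = []
    best = dist[0].copy()
    parent = np.zeros(n, dtype=int)
    visited = np.zeros(n, dtype=bool)
    visited[0] = True
    for _ in range(n - 1):
        best[visited] = np.inf
        j = int(np.argmin(best))
        edges.append((parent[j], j, best[j]))
        visited[j] = True
        closer = dist[j] < best
        parent[closer] = j
        best = np.minimum(best, dist[j])
    # Cut the k-1 costliest edges; connected components are clusters.
    edges.sort(key=lambda e: e[2])
    labels = np.arange(n)  # union components by label propagation
    for a, b, _ in edges[: n - k]:
        la, lb = labels[a], labels[b]
        labels[labels == lb] = la
    # One k-means step: recompute centroids, reassign each point.
    centroids = np.array([points[labels == c].mean(axis=0)
                          for c in np.unique(labels)])
    return np.argmin(
        np.linalg.norm(points[:, None] - centroids[None, :], axis=2),
        axis=1)
```

Cutting the k-1 costliest MST edges is a standard single-linkage-style way to obtain k components; the extra k-means step smooths component boundaries, as the abstract describes.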
Experimental Analysis of Algorithms for Coflow Scheduling
Modern data centers face new scheduling challenges in optimizing job-level
performance objectives, where a significant challenge is the scheduling of
highly parallel data flows with a common performance goal (e.g., the shuffle
operations in MapReduce applications). Chowdhury and Stoica introduced the
coflow abstraction to capture these parallel communication patterns, and
Chowdhury et al. proposed effective heuristics to schedule coflows efficiently.
In our previous paper, we considered the strongly NP-hard problem of minimizing
the total weighted completion time of coflows with release dates, and developed
the first polynomial-time scheduling algorithms with O(1)-approximation ratios.
In this paper, we carry out a comprehensive experimental analysis on a
Facebook trace and extensive simulated instances to evaluate the practical
performance of several algorithms for coflow scheduling, including the
approximation algorithms developed in our previous paper. Our experiments
suggest that simple algorithms provide effective approximations of the optimal,
and that the performance of our approximation algorithms is relatively robust,
near optimal, and always among the best compared with the other algorithms, in
both the offline and online settings.
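The objective studied here, total weighted completion time, has a classic single-machine analogue that helps explain why simple priority rules perform well. The toy function below (introduced for illustration only, not the paper's coflow algorithm) schedules jobs on one bottleneck in size/weight order, i.e. Smith's rule, which is optimal in that single-resource setting; real coflow scheduling is harder because each coflow consists of many parallel port-to-port flows.

```python
def weighted_completion_time(jobs):
    """jobs: list of (size, weight) pairs sharing one bottleneck.
    Returns the total weighted completion time when jobs run in
    Smith's-rule order (ascending size/weight ratio)."""
    order = sorted(jobs, key=lambda j: j[0] / j[1])
    t = total = 0.0
    for size, weight in order:
        t += size                # this job finishes at time t
        total += weight * t
    return total
```

For example, with two unit-weight jobs of sizes 4 and 1, running the short job first gives 1 + 5 = 6, versus 4 + 5 = 9 the other way around.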
GIANT: Globally Improved Approximate Newton Method for Distributed Optimization
For distributed computing environment, we consider the empirical risk
minimization problem and propose a distributed and communication-efficient
Newton-type optimization method. At every iteration, each worker locally finds
an Approximate NewTon (ANT) direction, which is sent to the main driver. The
main driver, then, averages all the ANT directions received from workers to
form a {\it Globally Improved ANT} (GIANT) direction. GIANT is highly
communication efficient and naturally exploits the trade-offs between local
computations and global communications in that more local computations result
in fewer overall rounds of communications. Theoretically, we show that GIANT
enjoys an improved convergence rate as compared with first-order methods and
existing distributed Newton-type methods. Further, and in sharp contrast with
many existing distributed Newton-type methods, as well as popular first-order
methods, a highly advantageous practical feature of GIANT is that it only
involves one tuning parameter. We conduct large-scale experiments on a computer
cluster and, empirically, demonstrate the superior performance of GIANT.
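The GIANT iteration described above can be sketched for regularized least squares under simplifying assumptions (dense shards, exact local solves, no communication layer): every worker forms a local Hessian from its own shard, solves against the shared global gradient to get a local ANT direction, and the driver averages those directions. The function name and shard layout are illustrative, not the authors' code.

```python
import numpy as np

def giant_step(shards, w, lam):
    """One GIANT-style iteration for (1/2n)||Xw-y||^2 + (lam/2)||w||^2.
    shards: list of (X_i, y_i) worker partitions."""
    n = sum(X.shape[0] for X, _ in shards)
    # Global gradient (in practice gathered via one communication round).
    g = sum(X.T @ (X @ w - y) for X, y in shards) / n + lam * w
    directions = []
    for X, _ in shards:
        # Local Hessian estimate from this worker's shard only.
        H = X.T @ X / X.shape[0] + lam * np.eye(len(w))
        directions.append(np.linalg.solve(H, g))  # local ANT direction
    # Driver averages the local directions into the GIANT direction.
    return w - np.mean(directions, axis=0)
```

The communication cost per iteration is two vector exchanges (gradient in, direction out), which is the sense in which the method trades extra local computation for fewer communication rounds.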
A Survey on Automatic Parameter Tuning for Big Data Processing Systems
Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues, yet regular users and even expert administrators grapple with understanding and tuning them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise open research problems for automatic parameter tuning.
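The simplest of the six categories, rule-based tuning, can be illustrated with a toy sketch (not drawn from the survey): hand-written rules map job characteristics to parameter values, here choosing a parallelism level from input size with a hypothetical 128 MB-per-task rule capped by available cores.

```python
def rule_based_parallelism(input_gb, cores_per_node, nodes):
    """Hypothetical rule: one task per ~128 MB of input, capped at
    twice the cluster's total core count (a common rule-of-thumb
    shape, not any system's actual default)."""
    tasks_by_size = max(1, int(input_gb * 1024) // 128)
    cap = 2 * cores_per_node * nodes
    return min(tasks_by_size, cap)
```

Such rules are cheap and transparent but brittle, which is exactly the gap the model-based and learning-based categories in the survey aim to close.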