MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
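To make the "nail" concrete: the programming model the essay defends is a stateless map over input records, a framework-managed shuffle that groups intermediate pairs by key, and a reduce over each group. The plain-Python sketch below simulates that single-pass pattern on the canonical word-count example; the function names and the simulation itself are illustrative and are not Hadoop's API.

    from collections import defaultdict

    def map_phase(records, mapper):
        # Apply the mapper to every input record, yielding (key, value) pairs.
        for record in records:
            yield from mapper(record)

    def shuffle(pairs):
        # Group intermediate values by key, as the framework does between phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(groups, reducer):
        # Apply the reducer once per key over all of its grouped values.
        return {key: reducer(key, values) for key, values in groups}

    # Word count: the canonical "nail", one map and one reduce, no iteration.
    docs = ["big data is big", "map reduce is good enough"]
    mapped = map_phase(docs, lambda line: ((w, 1) for w in line.split()))
    counts = reduce_phase(shuffle(mapped), lambda word, ones: sum(ones))
    print(counts)  # {'big': 2, 'data': 1, ...}

An iterative algorithm would have to re-run this whole pipeline once per iteration, which is exactly the mismatch the essay proposes to engineer around by choosing non-iterative alternatives.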
Scheduling MapReduce Jobs under Multi-Round Precedences
We consider non-preemptive scheduling of MapReduce jobs with multiple tasks
in the practical scenario where each job requires several map-reduce rounds. We
seek to minimize the average weighted completion time and consider scheduling
on identical and unrelated parallel processors. For identical processors, we
present LP-based O(1)-approximation algorithms. For unrelated processors, the
approximation ratio naturally depends on the maximum number of rounds of any
job. Since the number of rounds per job in typical MapReduce algorithms is a
small constant, our scheduling algorithms achieve a small approximation ratio
in practice. For the single-round case, we substantially improve on previously
best known approximation guarantees for both identical and unrelated
processors. Moreover, we conduct an experimental analysis and compare the
performance of our algorithms against a fast heuristic and a lower bound on the
optimal solution, thus demonstrating their promising practical performance.
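For intuition about the objective, the classical single-machine, single-round special case is solved exactly by Smith's rule: schedule jobs in non-increasing order of their weight-to-processing-time ratio. The sketch below is only illustrative of the sum-of-weighted-completion-times objective; the paper's algorithms are LP-based and handle the much harder multi-round, multi-processor setting.

    def wspt_schedule(jobs):
        # jobs is a list of (weight, processing_time) pairs.
        # Smith's rule: sort by w_j / p_j, largest ratio first.
        order = sorted(jobs, key=lambda j: j[0] / j[1], reverse=True)
        time, objective = 0, 0
        for weight, proc in order:
            time += proc                  # completion time C_j of this job
            objective += weight * time    # accumulate sum of w_j * C_j
        return order, objective

    jobs = [(3, 2), (1, 4), (2, 1)]  # (w_j, p_j)
    print(wspt_schedule(jobs))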
Parallelization of genetic algorithms using Hadoop Map/Reduce
In this paper we present a parallel implementation of a genetic algorithm using the map/reduce programming paradigm, built on the Hadoop implementation of the map/reduce library. We compare our implementation with the one presented in [1] on the OneMax (bit-counting) problem. The comparison criteria are fitness convergence, quality of the final solution, algorithm scalability, and cloud resource utilization. Our model for parallelizing the genetic algorithm shows better performance and fitness convergence than the model presented in [1], but it produces lower-quality solutions because of the species problem.
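For reference, the OneMax fitness and the standard genetic operators being parallelized can be sketched serially in a few lines of Python. The parameter choices below are illustrative and are not those of the paper, whose contribution is distributing these steps as Hadoop map/reduce jobs.

    import random

    def onemax_ga(bits=64, pop_size=50, generations=200, p_mut=0.01, seed=0):
        # Serial GA for OneMax: maximize the number of 1-bits in a bitstring.
        rng = random.Random(seed)
        pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
        for _ in range(generations):
            fitness = [sum(ind) for ind in pop]  # OneMax fitness: count the 1-bits
            if max(fitness) == bits:
                break
            def tournament():
                # Binary tournament selection.
                a, b = rng.randrange(pop_size), rng.randrange(pop_size)
                return pop[a] if fitness[a] >= fitness[b] else pop[b]
            nxt = []
            for _ in range(pop_size):
                p1, p2 = tournament(), tournament()
                cut = rng.randrange(1, bits)           # one-point crossover
                child = p1[:cut] + p2[cut:]
                # Bit-flip mutation with probability p_mut per bit.
                child = [b ^ 1 if rng.random() < p_mut else b for b in child]
                nxt.append(child)
            pop = nxt
        return sum(max(pop, key=sum))  # fitness of the best individual found

    print(onemax_ga())  # at most 64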
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
Kernel k-means is an effective method for data clustering that extends the commonly used k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally very complex, as it requires the complete kernel matrix to be calculated and stored. Further, the kernelized nature of the algorithm hinders the parallelization of its computations on modern distributed-computing infrastructures. In this paper, we define a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. We then propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.
Comment: Appears in Proceedings of the SIAM International Conference on Data Mining (SDM), 2014
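To illustrate the embed-then-cluster strategy (though not the paper's specific embedding family), one well-known kernel-based low-dimensional embedding is random Fourier features (Rahimi and Recht, 2007), which approximates an RBF kernel with an explicit map z(.) so that ordinary k-means, itself naturally expressible as repeated map/reduce passes, can be run on the embedded points. A minimal numpy sketch, with illustrative parameter choices:

    import numpy as np

    def rff_embed(X, dim=64, gamma=1.0, seed=0):
        # Random Fourier features: <z(x), z(y)> approximates the RBF kernel
        # exp(-gamma * ||x - y||^2) without forming the kernel matrix.
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], dim))
        b = rng.uniform(0, 2 * np.pi, size=dim)
        return np.sqrt(2.0 / dim) * np.cos(X @ W + b)

    def lloyd_kmeans(Z, k, iters=50, seed=0):
        # Ordinary k-means on the embedded points; each iteration is an
        # assign (map) step followed by a re-average (reduce) step.
        rng = np.random.default_rng(seed)
        centers = Z[rng.choice(len(Z), size=k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((Z[:, None] - centers) ** 2).sum(-1), axis=1)
            centers = np.stack([Z[labels == j].mean(0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        return labels

    X = np.vstack([np.random.randn(100, 5), np.random.randn(100, 5) + 3])
    labels = lloyd_kmeans(rff_embed(X), k=2)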
On an almost-universal hash function family with applications to authentication and secrecy codes
Universal hashing, discovered by Carter and Wegman in 1979, has many
important applications in computer science. MMH$^*$, which was shown to be
$\Delta$-universal by Halevi and Krawczyk in 1997, is a well-known universal
hash function family. We introduce a variant of MMH$^*$, that we call GRDH,
where we use an arbitrary integer $n > 1$ instead of a prime $p$ and let the keys
$x = \langle x_1, \ldots, x_k \rangle \in \mathbb{Z}_n^k$ satisfy the
conditions $\gcd(x_i, n) = t_i$ ($1 \leq i \leq k$), where $t_1, \ldots, t_k$ are
given positive divisors of $n$. Then via connecting the universal hashing
problem to the number of solutions of restricted linear congruences, we prove
that the family GRDH is an $\varepsilon$-almost-$\Delta$-universal family of
hash functions for some $\varepsilon < 1$ if and only if $n$ is odd and
$t_i = 1$ ($1 \leq i \leq k$). Furthermore, if these conditions are
satisfied, then GRDH is $\frac{1}{p-1}$-almost-$\Delta$-universal, where $p$ is
the smallest prime divisor of $n$. Finally, as an application of our results,
we propose an authentication code with secrecy scheme which strongly
generalizes the scheme studied by Alomair et al. [J. Math. Cryptol. 4 (2010), 121--148] and [J. UCS 15 (2009), 2937--2956].
Comment: International Journal of Foundations of Computer Science, to appear
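A minimal sketch of the construction as described in the abstract: an MMH$^*$-style inner product taken modulo an arbitrary integer $n$ rather than a prime, with each key coordinate constrained by $\gcd(x_i, n) = t_i$. The rejection-sampling loop and function names below are illustrative assumptions, not taken from the paper.

    import math
    import random

    def grdh_key(n, t, seed=0):
        # Sample a key x = <x_1, ..., x_k> in Z_n^k with gcd(x_i, n) = t_i,
        # the key constraint described in the abstract.
        rng = random.Random(seed)
        key = []
        for t_i in t:
            while True:
                x = rng.randrange(n)
                if math.gcd(x, n) == t_i:
                    key.append(x)
                    break
        return key

    def grdh_hash(key, msg, n):
        # MMH*-style inner product modulo an arbitrary integer n
        # (rather than a prime p), as in the GRDH family above.
        return sum(x * m for x, m in zip(key, msg)) % n

    n = 15                              # odd modulus; the abstract requires
    key = grdh_key(n, t=[1, 1, 1])      # n odd and all t_i = 1 for near-universality
    print(grdh_hash(key, [7, 2, 9], n))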