Search CORE

30,021 research outputs found

Massively-Parallel Feature Selection for Big Data

Author: Borboudakis Giorgos
Christophides Vassilis
Katsogridakis Pavlos
Pratikakis Polyvios
Tsamardinos Ioannis
Publication venue
Publication date: 23/08/2017
Field of study

We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS PFBP partitions the data matrix both in terms of rows (samples, training examples) as well as columns (features). By employing the concepts of

p

-values of conditional independence tests and meta-analysis techniques PFBP manages to rely only on computations local to a partition while minimizing communication costs. Then, it employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Hal-Diderot

Performance tradeoffs in static and dynamic load balancing strategies

Author: Bokhart S. H.
Iqbal M. A.
Saltz J. H.
Publication venue
Publication date
Field of study

The problem of uniformly distributing the load of a parallel program over a multiprocessor system was considered. A program was analyzed whose structure permits the computation of the optimal static solution. Then four strategies for load balancing were described and their performance compared. The strategies are: (1) the optimal static assignment algorithm which is guaranteed to yield the best static solution, (2) the static binary dissection method which is very fast but sub-optimal, (3) the greedy algorithm, a static fully polynomial time approximation scheme, which estimates the optimal solution to arbitrary accuracy, and (4) the predictive dynamic load balancing heuristic which uses information on the precedence relationships within the program and outperforms any of the static methods. It is also shown that the overhead incurred by the dynamic heuristic is reduced considerably if it is started off with a static assignment provided by either of the other three strategies

NASA Technical Reports Server

Methods for fast and reliable clustering

Author: Kärkkäinen Ismo
Publication venue: University of Joensuu
Publication date
Field of study

UEF Electronic Publications