4,532 research outputs found
Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS PFBP partitions the data matrix
both in terms of rows (samples, training examples) as well as columns
(features). By employing the concepts of -values of conditional independence
tests and meta-analysis techniques PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size, linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class
Dynamic Parameter Allocation in Parameter Servers
To keep up with increasing dataset sizes and model complexity, distributed
training has become a necessity for large machine learning tasks. Parameter
servers ease the implementation of distributed parameter management---a key
concern in distributed training---, but can induce severe communication
overhead. To reduce communication overhead, distributed machine learning
algorithms use techniques to increase parameter access locality (PAL),
achieving up to linear speed-ups. We found that existing parameter servers
provide only limited support for PAL techniques, however, and therefore prevent
efficient training. In this paper, we explore whether and to what extent PAL
techniques can be supported, and whether such support is beneficial. We propose
to integrate dynamic parameter allocation into parameter servers, describe an
efficient implementation of such a parameter server called Lapse, and
experimentally compare its performance to existing parameter servers across a
number of machine learning tasks. We found that Lapse provides near-linear
scaling and can be orders of magnitude faster than existing parameter servers
- …