Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) and columns
(features). By employing the concepts of p-values of conditional independence
tests and meta-analysis techniques, PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size and linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class.
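The idea of merging partition-local p-values through meta-analysis can be sketched as follows. The abstract does not name the meta-analysis technique, so this sketch assumes Fisher's combined probability test; the function names `fisher_combine` and `early_drop`, the threshold `alpha`, and the dropping rule are hypothetical illustrations, not the published PFBP procedure.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (assumed technique).

    The statistic -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom under the null, where k = len(p_values).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    # Clip to avoid log(0) for extremely small p-values.
    clipped = [max(p, 1e-300) for p in p_values]
    k = len(clipped)
    half_stat = -sum(math.log(p) for p in clipped)  # statistic divided by 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # sf(x; 2k) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def early_drop(local_p_values, alpha=0.05):
    """Keep only features whose combined p-value is at most alpha.

    local_p_values maps a feature name to the p-values computed
    independently on each row partition (hypothetical data layout).
    """
    return [
        feature
        for feature, p_vals in local_p_values.items()
        if fisher_combine(p_vals) <= alpha
    ]
```

Under these assumptions, each partition only needs to send one p-value per feature to the coordinator, which is what keeps communication small. Two local p-values of 0.01 combine to roughly 0.001, while a feature with local p-values of 0.6 and 0.7 would be dropped at `alpha=0.05`.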
Optimistic Concurrency Control for Distributed Unsupervised Learning
Research on distributed machine learning algorithms has focused primarily on
one of two extremes: algorithms that obey strict concurrency constraints or
algorithms that obey few or no such constraints. We consider an intermediate
alternative in which algorithms optimistically assume that conflicts are
unlikely and, if conflicts do arise, invoke a conflict-resolution protocol.
We view this "optimistic concurrency control" paradigm as particularly
appropriate for large-scale machine learning algorithms, especially in the
unsupervised setting. We demonstrate our approach in three problem areas:
clustering, feature learning and online facility location. We evaluate our
methods via large-scale experiments in a cluster computing environment.
Comment: 25 pages, 5 figures
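The optimistic pattern described above, applied to clustering, can be sketched as follows. The abstract does not specify the clustering rule, so this sketch assumes a distance-threshold rule in the style of DP-means: a point farther than `threshold` from every known center is proposed as a new center. All function names are hypothetical, and the workers are simulated sequentially for clarity.

```python
import math


def worker_propose(points, snapshot, threshold):
    """Optimistic phase: using a possibly stale snapshot of the centers,
    propose every point that is farther than `threshold` from all of them."""
    proposals = []
    for p in points:
        if not snapshot or min(math.dist(p, c) for c in snapshot) > threshold:
            proposals.append(p)
    return proposals


def validate(proposals, centers, threshold):
    """Conflict resolution: process proposals serially and reject any that
    falls within `threshold` of an existing or already accepted center."""
    accepted = []
    for p in proposals:
        if all(math.dist(p, c) > threshold for c in centers + accepted):
            accepted.append(p)
    return accepted


def occ_epoch(partitions, centers, threshold):
    """Run one epoch: workers propose against a shared snapshot, then a
    single validator resolves conflicts between their proposals."""
    snapshot = list(centers)
    proposals = []
    for part in partitions:  # in a real system these would run in parallel
        proposals.extend(worker_propose(part, snapshot, threshold))
    return centers + validate(proposals, centers, threshold)
```

The serial validation step only sees the proposals, not the full data, so under the assumption that conflicts are rare it remains cheap relative to the parallel phase.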
Big data analytics: Machine learning and Bayesian learning perspectives—What is done? What is not?
Big data analytics provides an interdisciplinary framework that is essential to support the current trend of solving real-world problems collaboratively. The progression of the big data analytics framework must be clearly understood so that novel approaches can be developed to advance the discipline. Failing to track the progression of this fast-growing discipline may lead to duplicated research and wasted effort. Its main companion field, machine learning, helps solve many big data analytics problems; it is therefore also important to understand the progression of machine learning within the big data analytics framework. One current research effort in big data analytics is the integration of deep learning and Bayesian optimization, which can help automate the initialization and optimization of deep learning hyperparameters and enhance the implementation of iterative algorithms in software. The hyperparameters include the weights used in deep learning and the number of clusters in Bayesian mixture models that characterize data heterogeneity. Big data analytics research also requires computer systems and software capable of storing, retrieving, processing, and analyzing big data that are generally large, complex, heterogeneous, unstructured, unpredictable, and exposed to scalability problems. It is therefore appropriate to introduce a new research topic, transformative knowledge discovery, that provides a research ground to study and develop smart machine learning models and algorithms that are automatic, adaptive, and cognitive, in order to address big data analytics problems and challenges. The new research domain will also create opportunities to work in this interdisciplinary space and to develop solutions that support research in other disciplines that may lack expertise in big data analytics.
For example, research such as the detection and characterization of retinal diseases in medical sciences and the classification of highly interacting species in environmental sciences can benefit from the knowledge and expertise in big data analytics.