String and Membrane Gaussian Processes
In this paper we introduce a novel framework for making exact nonparametric
Bayesian inference on latent functions, that is particularly suitable for Big
Data tasks. Firstly, we introduce a class of stochastic processes we refer to
as string Gaussian processes (string GPs), which are not to be mistaken for
Gaussian processes operating on text. We construct string GPs so that their
finite-dimensional marginals exhibit suitable local conditional independence
structures, which allow for scalable, distributed, and flexible nonparametric
Bayesian inference, without resorting to approximations, and while ensuring
some mild global regularity constraints. Furthermore, string GP priors
naturally cope with heterogeneous input data, and the gradient of the learned
latent function is readily available for explanatory analysis. Secondly, we
provide some theoretical results relating our approach to the standard GP
paradigm. In particular, we prove that some string GPs are Gaussian processes,
which provides a complementary global perspective on our framework. Finally, we
derive a scalable and distributed MCMC scheme for supervised learning tasks
under string GP priors. The proposed MCMC scheme has computational time
complexity O(N) and memory requirement O(dN), where N is the data size and d is the dimension of the input space. We illustrate the
efficacy of the proposed approach on several synthetic and real-world datasets,
including a dataset with millions of input points and attributes.
Comment: To appear in the Journal of Machine Learning Research (JMLR), Volume 1
HPC-oriented Canonical Workflows for Machine Learning Applications in Climate and Weather Prediction
Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power are paving the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities are slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss FAIR Digital Objects (FDOs) and Research Objects (ROs) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany by using ML. Our concept envisages the raster datacube to provide data harmonization and fast, scalable data access. We suggest the Jupyter notebook as a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to the researchers via an easy-to-use graphical interface.
Performance evaluation of a distributed clustering approach for spatial datasets
The analysis of big data requires powerful, scalable, and accurate data analytics techniques that traditional data mining and machine learning approaches do not offer as a whole. Therefore, new data analytics frameworks are needed to deal with big data challenges such as volume, velocity, veracity, and variety. Distributed data mining constitutes a promising approach for big data sets, as they are usually produced in distributed locations, and processing them at their local sites significantly reduces response times, communication costs, etc. In this paper, we study the performance of a distributed clustering approach called Dynamic Distributed Clustering (DDC). DDC can remotely generate clusters and then aggregate them using an efficient aggregation algorithm. The technique is developed for spatial datasets. We evaluated DDC using two types of communication (synchronous and asynchronous) and tested it under various load distributions. The experimental results show that the approach achieves super-linear speed-up, scales up very well, and can take advantage of recent programming models, such as the MapReduce model, as its results are not affected by the type of communication.
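The two-phase scheme the abstract describes (clusters generated locally at each site, then aggregated by a coordinator) can be sketched as follows. This is an illustrative sketch only, not the actual DDC algorithm: the farthest-point initialisation, the distance-threshold merging rule, and all function names are assumptions.

```python
import numpy as np

def local_kmeans(X, k, iters=20):
    """Plain Lloyd's k-means run at one site; returns (centroids, counts).
    Farthest-point initialisation keeps the sketch deterministic."""
    C = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in C], axis=0)
        C.append(X[np.argmax(d)])
    C = np.array(C)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    counts = np.bincount(labels, minlength=k)
    return C, counts

def aggregate(site_results, merge_dist):
    """Coordinator step: merge per-site centroids closer than merge_dist,
    weighting each centroid by the number of points it represents."""
    C = np.vstack([c for c, _ in site_results])
    w = np.concatenate([n for _, n in site_results]).astype(float)
    merged, used = [], np.zeros(len(C), bool)
    for i in range(len(C)):
        if used[i]:
            continue
        close = (np.linalg.norm(C - C[i], axis=1) < merge_dist) & ~used
        used |= close
        merged.append(np.average(C[close], axis=0, weights=w[close]))
    return np.array(merged)
```

Because only centroids and counts travel to the coordinator, the communication cost is independent of the local data sizes, which is the property that makes this style of scheme attractive for distributed spatial data.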
SigSpace – Class-Based Feature Representation for Scalable and Distributed Machine Learning
Title from PDF of title page, viewed on October 25, 2016. Thesis advisor: Yugyung Lee. Vita. Includes bibliographical references (pages 70-71). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2016.
In the era of big data, it is essential to explore the opportunities in discovering knowledge
from big data. However, traditional machine learning approaches are not well suited
to extracting the full value of big data. Specifically, current research and practice of
machine learning do not fully support some important features for big data analytics, such as
incremental learning, distributed learning, and fuzzy matching.
In this thesis, we propose a unique feature representation, named SigSpace.
It is designed for class-level incremental learning in support of distributed learning and
fuzzy matching. In SigSpace, a class-based model is built by evaluating and extending
existing machine learning models, i.e., K-means and Self-Organizing Maps
(SOM). Machine learning with SigSpace is modeled as a feature set used with standard
machine learning algorithms such as Random Forests and Decision Trees, and a class model
using L1 (Manhattan distance) and L2 (Euclidean distance) norms.
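As a rough illustration of the class-based idea, the sketch below learns one signature per class independently (so adding a class never touches the others) and matches samples under an L1 or L2 norm. Reducing each signature to a single mean vector, rather than a K-means or SOM codebook as in the thesis, is a simplifying assumption, and the function names are illustrative.

```python
import numpy as np

def fit_signatures(X, y):
    """One signature (here, a mean vector) per class; each class is
    trained independently, so a new class is just one more dictionary
    entry -- the class-level incremental property."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(x, sigs, norm="l2"):
    """Match a sample to the nearest class signature under L1 or L2."""
    p = 1 if norm == "l1" else 2
    dists = {c: np.linalg.norm(x - s, ord=p) for c, s in sigs.items()}
    return min(dists, key=dists.get)
```

Because signatures are fixed-size summaries rather than raw training data, the space needed per class stays constant regardless of how many samples contributed to it.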
In order to provide supporting evidence for the effectiveness of SigSpace, we
conducted comprehensive experiments as follows. First, multiple experiments were
conducted to evaluate the SigSpace model in image classification using large-scale image
datasets, including Caltech-101, Caltech-256, ImageNet, UEC FOOD 256, and MNIST,
with image features such as raw pixels, SIFT, and Local Binary Patterns. Second, SigSpace was
evaluated in the audio classification context with audio features extracted from
real-time audio datasets. The SigSpace system was implemented using a big data analytics tool, Apache
Spark (MLlib), with the capability of parallel and distributed learning and recognition.
The experiments on multinomial classification were conducted with 6 to 1,000 classes,
space requirements from megabytes to terabytes, and learning times ranging from minutes to
days. Although there was a slight accuracy decrease (approximately 5%) in overall
performance, SigSpace is very efficient in terms of space as well as runtime performance
for learning and recognition. Thus, the current evaluation confirms that SigSpace
offers a significant approach for distributed and scalable machine learning with big data.
Introduction -- Background and related work -- Proposed solution: SigSpace -- Implementation and evaluation -- Conclusion and future work
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
The volume of data in today’s applications has meant a change in the way machine learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples across classes. In this work we present SMOTE-BD, a fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created, to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
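The core SMOTE step the abstract refers to, interpolating each chosen minority example toward one of its k nearest minority neighbours, can be sketched in a few lines. This single-machine sketch (function name and parameters are illustrative) deliberately omits the partition-independent distributed design that is SMOTE-BD's actual contribution.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples: pick a minority sample,
    pick one of its k nearest minority neighbours, and place a new point
    at a random position on the segment between them."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        j = rng.choice(np.argsort(d)[:k])  # one of the k nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)
```

Since each synthetic point depends only on a sample and its minority-class neighbourhood, the step is embarrassingly parallel once neighbourhoods are available, which is what makes a Spark-based, partition-independent redesign feasible.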