
    String and Membrane Gaussian Processes

    In this paper we introduce a novel framework for making exact nonparametric Bayesian inference on latent functions that is particularly suitable for Big Data tasks. Firstly, we introduce a class of stochastic processes we refer to as string Gaussian processes (string GPs), which are not to be mistaken for Gaussian processes operating on text. We construct string GPs so that their finite-dimensional marginals exhibit suitable local conditional independence structures, which allow for scalable, distributed, and flexible nonparametric Bayesian inference without resorting to approximations, while ensuring some mild global regularity constraints. Furthermore, string GP priors naturally cope with heterogeneous input data, and the gradient of the learned latent function is readily available for explanatory analysis. Secondly, we provide some theoretical results relating our approach to the standard GP paradigm. In particular, we prove that some string GPs are Gaussian processes, which provides a complementary global perspective on our framework. Finally, we derive a scalable and distributed MCMC scheme for supervised learning tasks under string GP priors. The proposed MCMC scheme has computational time complexity O(N) and memory requirement O(dN), where N is the data size and d the dimension of the input space. We illustrate the efficacy of the proposed approach on several synthetic and real-world datasets, including a dataset with 6 million input points and 8 attributes.
    Comment: To appear in the Journal of Machine Learning Research (JMLR), Volume 1
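    For contrast with the O(N) scheme above, the standard GP paradigm the paper relates to requires factorizing an N-by-N covariance matrix, which costs O(N^3) time and O(N^2) memory. A minimal sketch of exact GP regression in that standard paradigm (illustrative only; the RBF kernel, lengthscale, and noise level here are assumptions, not taken from the paper):

    ```python
    import numpy as np

    def rbf(A, B, lengthscale=1.0):
        """Squared-exponential kernel between row vectors of A and B."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def gp_posterior(X, y, X_star, kernel, noise_var=1e-2):
        """Exact GP regression posterior mean and covariance.
        The Cholesky factorization below is the O(N^3) bottleneck."""
        K = kernel(X, X) + noise_var * np.eye(len(X))   # train covariance
        K_s = kernel(X, X_star)                         # train/test cross-covariance
        K_ss = kernel(X_star, X_star)                   # test covariance
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        mean = K_s.T @ alpha                            # posterior mean
        v = np.linalg.solve(L, K_s)
        cov = K_ss - v.T @ v                            # posterior covariance
        return mean, cov
    ```

    String GPs avoid this cubic cost by exploiting local conditional independence across "strings" of the input domain rather than working with one dense global covariance.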

    HPC-oriented Canonical Workflows for Machine Learning Applications in Climate and Weather Prediction

    Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power pave the way. Ensuring FAIR data and reproducible ML practices is a significant challenge for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities are slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss the FAIR Digital Object (FDO) and Research Object (RO) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany by using ML. Our concept envisages the raster datacube to provide data harmonization and fast and scalable data access. We suggest the Jupyter notebook as a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to the researchers via an easy-to-use graphical interface.

    Performance evaluation of a distributed clustering approach for spatial datasets

    The analysis of big data requires powerful, scalable, and accurate data analytics techniques that traditional data mining and machine learning do not offer as a whole. Therefore, new data analytics frameworks are needed to deal with big data challenges such as the volume, velocity, veracity, and variety of the data. Distributed data mining constitutes a promising approach for big data sets, as they are usually produced in distributed locations, and processing them on their local sites significantly reduces response times, communication costs, etc. In this paper, we study the performance of a distributed clustering approach, called Dynamic Distributed Clustering (DDC). DDC has the ability to remotely generate clusters and then aggregate them using an efficient aggregation algorithm. The technique is developed for spatial datasets. We evaluated DDC using two types of communication (synchronous and asynchronous) and tested it under various load distributions. The experimental results show that the approach achieves super-linear speed-up, scales up very well, and can take advantage of recent programming models, such as the MapReduce model, as its results are not affected by the type of communication.
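    The local-clustering-then-aggregation pattern described above can be sketched as follows. This is an illustrative two-phase scheme (plain k-means on each partition, then merging of nearby local centers), not DDC's actual algorithms; `k`, the merge radius, and the data layout are all assumptions:

    ```python
    import numpy as np

    def local_cluster(partition, k, iters=20, seed=0):
        """Phase 1 (per site): plain k-means on one local partition."""
        rng = np.random.default_rng(seed)
        centers = partition[rng.choice(len(partition), k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((partition[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                pts = partition[labels == j]
                if len(pts):
                    centers[j] = pts.mean(0)   # move center to cluster mean
        return centers

    def aggregate(all_centers, merge_radius):
        """Phase 2 (coordinator): merge local centers that lie within
        merge_radius of each other into global cluster centers."""
        merged = []   # list of (coordinate sum, count) pairs
        for c in np.vstack(all_centers):
            for i, (s, n) in enumerate(merged):
                if np.linalg.norm(c - s / n) < merge_radius:
                    merged[i] = (s + c, n + 1)   # fold into running mean
                    break
            else:
                merged.append((c, 1))            # start a new global cluster
        return np.array([s / n for s, n in merged])
    ```

    Only the small sets of local centers cross the network, which is what makes this pattern attractive for spatial data produced at distributed sites.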

    SigSpace – Class-Based Feature Representation for Scalable and Distributed Machine Learning

    Title from PDF of title page, viewed on October 25, 2016. Thesis advisor: Yugyung Lee. Vita. Includes bibliographical references (pages 70-71). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2016.
    In the era of big data, it is essential to explore the opportunities in discovering knowledge from big data. However, traditional machine learning approaches are not well suited to analyzing the full value of big data. Explicitly, current research and practice of machine learning do not fully support some important features for big data analytics, such as incremental learning, distributed learning, and fuzzy matching. In this thesis, we propose a unique feature representation, named SigSpace. It is designed for class-level incremental learning in support of distributed learning and fuzzy matching. In SigSpace, a class-based model is built by an evaluation and extension of existing machine learning models, i.e., K-means and Self-Organizing Maps (SOM). Learning with SigSpace is modeled both as a feature set fed to standard machine learning algorithms, such as Random Forests and Decision Trees, and as a class model using L1 (Manhattan distance) and L2 (Euclidean distance) norms. In order to provide supporting evidence for the effectiveness of SigSpace, we conducted comprehensive experiments as follows. Firstly, multiple experiments were conducted to evaluate the SigSpace model in image classification using large-scale image datasets, including Caltech-101, Caltech-256, ImageNet, UEC FOOD 256, and MNIST, with image features such as raw pixels, SIFT, and Local Binary Patterns. Secondly, SigSpace was evaluated in the audio classification context with imperative audio features extracted from real-time audio datasets. The SigSpace system was implemented using a big data analytics tool, Apache Spark (MLlib), with the capability of parallel and distributed learning and recognition.
    The experiments on multinomial classification were conducted with 6 to 1,000 classes, space requirements ranging from megabytes to terabytes, and learning times ranging from minutes to days. Although there is a slight accuracy decrease (approximately 5%) in the overall performance, SigSpace is very efficient in terms of space as well as runtime performance for learning and recognition. Thus, the current evaluation confirms that SigSpace is a significant approach for distributed and scalable machine learning with big data.
    Introduction -- Background and related work -- Proposed solution: SigSpace -- Implementation and evaluation -- Conclusion and future work
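    The class-based model described above can be sketched as follows: each class trains its own small k-means codebook ("signature") independently, which is what makes class-level incremental and distributed learning possible, and recognition assigns an input to the class whose signature contains the nearest centroid under the L1 or L2 norm. This is an illustrative reconstruction, not the thesis code; `k` and the toy data are assumptions:

    ```python
    import numpy as np

    def fit_class_signatures(X, y, k=3, iters=15, seed=0):
        """Per-class k-means codebook. Each class is fit independently,
        so classes can be trained incrementally or on separate workers."""
        rng = np.random.default_rng(seed)
        signatures = {}
        for c in np.unique(y):
            Xc = X[y == c]
            centers = Xc[rng.choice(len(Xc), min(k, len(Xc)), replace=False)]
            for _ in range(iters):
                labels = np.argmin(((Xc[:, None] - centers[None]) ** 2).sum(-1), axis=1)
                for j in range(len(centers)):
                    pts = Xc[labels == j]
                    if len(pts):
                        centers[j] = pts.mean(0)
            signatures[c] = centers
        return signatures

    def predict(x, signatures, norm=2):
        """Assign x to the class with the closest signature centroid,
        using the L1 (norm=1) or L2 (norm=2) distance."""
        best, best_d = None, np.inf
        for c, centers in signatures.items():
            d = np.linalg.norm(centers - x, ord=norm, axis=1).min()
            if d < best_d:
                best, best_d = c, d
        return best
    ```

    Because adding a new class only means fitting one more codebook, the existing signatures never need retraining.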

    SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data

    The volume of data in today’s applications has meant a change in the way machine learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be met through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important problem within Big Data analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the class distribution of examples. In this work we present SMOTE-BD, a fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created, in order to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
    Facultad de Informática
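    The SMOTE interpolation step the abstract refers to can be sketched as follows: each synthetic instance is placed at a random point on the segment between a minority example and one of its k nearest minority neighbors. This is the classic sequential algorithm, not SMOTE-BD's partition-independent Spark design; `k` and the data are assumptions:

    ```python
    import numpy as np

    def smote(X_min, n_new, k=5, seed=0):
        """Generate n_new synthetic minority samples by interpolating
        between minority examples and their k nearest minority neighbors."""
        rng = np.random.default_rng(seed)
        d2 = ((X_min[:, None] - X_min[None]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)                 # a point is not its own neighbor
        nn = np.argsort(d2, axis=1)[:, :k]           # k nearest minority neighbors
        synthetic = np.empty((n_new, X_min.shape[1]))
        for i in range(n_new):
            a = rng.integers(len(X_min))             # random minority example
            b = nn[a, rng.integers(min(k, len(X_min) - 1))]  # one of its neighbors
            gap = rng.random()                       # interpolation factor in [0, 1)
            synthetic[i] = X_min[a] + gap * (X_min[b] - X_min[a])
        return synthetic
    ```

    The difficulty SMOTE-BD addresses is that the neighbor search above assumes all minority examples are visible at once, which a naive per-partition port would violate.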