
    A Big Data Cleaning Method for Drinking-Water Streaming Data

    Abstract: An HA_Cart_AdaBoost model is proposed to clean drinking-water-quality streaming data. First, data that do not follow the normal distribution are regarded as outliers and eliminated. Next, the optimal control theory of nonlinear partial differential equations (PDEs) is introduced into the CART decision tree, and a CART tree of specified depth is used as the weak classifier of AdaBoost. The HA_Cart_AdaBoost model then compensates for the eliminated data, fits and predicts the missing values of the data stream, and thereby cleans the drinking-water-quality data; finally, the big data Hadoop architecture is used for real-time storage and analysis of the streaming data. The experimental results show that, compared with state-of-the-art data cleaning methods, introducing the optimal control theory of nonlinear PDEs into the CART decision tree greatly improves the stability and accuracy of the HA_Cart_AdaBoost model for water-quality data cleaning. Taking pH as an example, the HA_Cart_AdaBoost model improves RMSE by between 2.25% and 53.33%, and MAE by between 13.51% and 78.08%.
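    As a rough illustration of the imputation stage, the sketch below fits AdaBoost with fixed-depth CART weak learners on the surviving samples of a stream and predicts the eliminated values. It uses scikit-learn in place of the paper's HA_Cart_AdaBoost model; the PDE-optimal-control modification to CART is specific to the paper and is not reproduced, and the tree depth and toy pH series are assumptions.

        # Minimal sketch: AdaBoost over fixed-depth CART trees imputing
        # values removed as outliers; stands in for HA_Cart_AdaBoost.
        import numpy as np
        from sklearn.ensemble import AdaBoostRegressor
        from sklearn.tree import DecisionTreeRegressor

        def impute_stream(times, values, max_depth=4, n_estimators=50):
            """Fit on non-missing samples, predict the missing ones."""
            mask = ~np.isnan(values)
            model = AdaBoostRegressor(
                # "estimator" was "base_estimator" before scikit-learn 1.2
                estimator=DecisionTreeRegressor(max_depth=max_depth),
                n_estimators=n_estimators,
            )
            model.fit(times[mask].reshape(-1, 1), values[mask])
            filled = values.copy()
            filled[~mask] = model.predict(times[~mask].reshape(-1, 1))
            return filled

        # Toy pH series with three eliminated outliers (set to NaN).
        t = np.arange(100.0)
        ph = 7.2 + 0.1 * np.sin(t / 10.0)
        ph[[10, 40, 75]] = np.nan
        print(impute_stream(t, ph)[[10, 40, 75]])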

    Knowledge-based Data Processing for Multilingual Natural Language Analysis

    Natural Language Processing (NLP) empowers intelligent machines by enhancing human language understanding for linguistic human-computer communication. Recent growth in processing power, together with the availability of large volumes of linguistic data, has increased the demand for data-driven methods of automatic semantic analysis. This paper proposes a multilingual data-processing pipeline that combines feature extraction with deep-learning-based classification. The input text data are collected across several languages and processed to remove missing and null values. Features are then extracted using Histogram Equalization based Global Local Entropy (HEGLE) and classified using a Kernel-based Radial Basis Function (Ker_Rad_BF). We apply these architectures to the multilingual sentiment analysis problem and compare precision factors to identify the best option for multilingual sentiment analysis. On the HASOC dataset, the proposed HEGLE_Ker_Rad_BF achieves an accuracy of 98%, a precision of 97%, a recall of 90.5%, an F1-score of 85%, an RMSE of 55.6%, and a loss-curve analysis of 44%. On the TRAC dataset, it achieves an accuracy of 98%, a precision of 97%, a recall of 91%, an F1-score of 87%, and an RMSE of 55%.
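    The HEGLE features and Ker_Rad_BF classifier are this paper's own constructs, so the sketch below substitutes plain TF-IDF features and scikit-learn's RBF-kernel SVM to illustrate the general shape of a kernel radial-basis-function text classifier; the toy multilingual sentences and labels are invented for illustration.

        # Minimal stand-in pipeline: TF-IDF features + RBF-kernel SVM
        # (substitutes for the paper's HEGLE + Ker_Rad_BF stages).
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC

        texts = ["this movie was great", "ce film était horrible",
                 "película maravillosa", "terrible experience"]
        labels = [1, 0, 1, 0]        # toy multilingual sentiment labels

        clf = make_pipeline(TfidfVectorizer(),
                            SVC(kernel="rbf", gamma="scale"))
        clf.fit(texts, labels)
        print(clf.predict(["what a great experience"]))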

    Understanding ML driven HPC: Applications and Infrastructure

    We recently outlined the vision of "Learning Everywhere", which captures the possibility and impact of coupling learning methods with traditional HPC methods. A primary driver of such coupling is the promise that Machine Learning (ML) will deliver major performance improvements for traditional HPC simulations. Motivated by this potential, the "ML around HPC" class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, part of the Learning Everywhere series, we discuss how learning methods and HPC simulations are being integrated to enhance the effective performance of computations. We identify several modes in which learning methods integrate with HPC simulations (substitution, assimilation, and control) and provide representative applications in each mode. We also discuss some open research questions, which we hope will motivate and clear the ground for MLaroundHPC benchmarks. (Invited talk in the "Visionary Track" at IEEE eScience 2019.)
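    To make the "substitution" mode concrete, the sketch below trains a small neural surrogate on input-output pairs from an expensive simulation kernel and then uses it in place of that kernel; the simulate() function and the surrogate architecture are illustrative assumptions, not taken from the paper.

        # Minimal sketch of ML-around-HPC "substitution": a learned
        # surrogate replaces a costly simulation kernel at run time.
        import numpy as np
        from sklearn.neural_network import MLPRegressor

        def simulate(x):             # stand-in for an expensive HPC kernel
            return np.sin(3 * x) + 0.5 * x ** 2

        X = np.random.uniform(-2, 2, size=(2000, 1))
        y = simulate(X).ravel()

        surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
        surrogate.fit(X, y)          # trained offline on simulation data

        # The surrogate now substitutes for simulate() in the hot loop.
        x_new = np.array([[0.7]])
        print(simulate(x_new).ravel(), surrogate.predict(x_new))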

    International Conference on Nonlinear Differential Equations and Applications

    Dear Participants, Colleagues and Friends, it is a great honour and a privilege to give you all the warmest welcome to the first Portugal-Italy Conference on Nonlinear Differential Equations and Applications (PICNDEA). The conference takes place at the Colégio Espírito Santo, University of Évora, located in the beautiful city of Évora, Portugal. The host institution, as well as the associated scientific research centres, are committed to the event, hoping that it will be a benchmark for scientific collaboration between the two countries in the area of mathematics. The main scientific topics of the conference are ordinary and partial differential equations, with particular regard to nonlinear problems originating in applications and their treatment with the methods of numerical analysis. The main purpose is to bring together Italian and Portuguese researchers in the above fields, to create new collaborations and amplify existing ones, and to follow and discuss new topics in the area.

    Dynamic Data Mining: Methodology and Algorithms

    Supervised data stream mining has become an important and challenging data mining task in modern organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions. To address these three challenges, this thesis proposes the novel dynamic data mining (DDM) methodology, which effectively applies supervised ensemble models to data stream mining. DDM can be loosely defined as the categorization-organization-selection of supervised ensemble models. It is inspired by the idea that although the underlying concepts in a data stream are time-varying, their distinctions can be identified; the models trained on distinct concepts can therefore be dynamically selected to classify incoming examples of similar concepts. First, following the general paradigm of DDM, we examine different concept-drifting stream mining scenarios and propose corresponding effective and efficient data mining algorithms (a minimal sketch of the selection loop follows this abstract).
    • To address concept drift caused merely by changes of variable distributions, which we term pseudo concept drift, base models built on categorized streaming data are organized and selected according to their corresponding variable distribution characteristics.
    • To address concept drift caused by changes of variable and class joint distributions, which we term true concept drift, an effective data categorization scheme is introduced, and a group of working models is dynamically organized and selected to react to the drifting concept.
    Secondly, we introduce an integration stream mining framework that makes the paradigm advocated by DDM widely applicable to other stream mining problems; this allows us to easily introduce six effective algorithms for mining data streams with skewed class distributions. In addition, we introduce a new ensemble model approach for batch learning, following the same methodology. Both theoretical and empirical studies demonstrate its effectiveness. Future work will target improving the effectiveness and efficiency of the proposed algorithms; meanwhile, we will explore the possibilities of using the integration framework to solve other open stream mining research problems.
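    The sketch below illustrates the categorize-organize-select loop in miniature: each data chunk gets a "concept signature" (here simply the feature means, an illustrative choice), new base models are stored alongside their signatures, and the stored model with the nearest signature is selected to classify the next chunk. The signature and selection rule are assumptions for illustration, not the thesis's actual algorithms.

        # Minimal sketch of DDM-style dynamic model selection on chunks.
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        pool = []                    # list of (signature, model) pairs

        def process_chunk(X, y=None):
            sig = X.mean(axis=0)     # toy concept signature
            preds = None
            if pool:                 # select: nearest stored concept
                dists = [np.linalg.norm(sig - s) for s, _ in pool]
                preds = pool[int(np.argmin(dists))][1].predict(X)
            if y is not None:        # organize: train and store a model
                m = DecisionTreeClassifier(max_depth=5).fit(X, y)
                pool.append((sig, m))
            return preds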

    Incremental learning of concept drift from imbalanced data

    Learning from data sampled from a nonstationary distribution has been shown to be a very challenging problem in machine learning, because the joint probability distribution between the data and classes evolves over time. Learners must therefore adapt their knowledge base, including their structure or parameters, to remain strong predictors. This phenomenon of learning from an evolving data source is akin to learning to play a game while the rules of the game are being changed, and it is traditionally referred to as learning under concept drift. Climate data, financial data, epidemiological data, and spam detection are examples of applications that give rise to concept drift problems. An additional challenge arises when the classes to be learned are not represented (approximately) equally in the training data, as most machine learning algorithms work well only when the class distributions are balanced. However, rare categories are commonly faced in real-world applications, which leads to skewed or imbalanced datasets. Fraud detection, rare disease diagnosis, and anomaly detection are examples of applications that feature imbalanced datasets, where data from one category are severely underrepresented. Concept drift and class imbalance are traditionally addressed separately in machine learning, yet data streams can experience both phenomena. This work introduces Learn++.NIE (nonstationary and imbalanced environments) and Learn++.CDS (concept drift with SMOTE) as two new members of the Learn++ family of incremental learning algorithms that explicitly and simultaneously address both phenomena. The former addresses concept drift and class imbalance through modified bagging-based sampling and by replacing a class-independent error weighting mechanism, which normally favors the majority class, with a set of measures that emphasize good predictive accuracy on all classes. The latter integrates Learn++.NSE, an algorithm for concept drift, with the synthetic sampling method known as SMOTE to cope with class imbalance. This research also includes a thorough evaluation of Learn++.CDS and Learn++.NIE on several real and synthetic datasets and on several figures of merit, showing that both algorithms are able to learn in some of the most difficult learning environments.
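    In the spirit of Learn++.CDS, the sketch below rebalances each incoming batch with SMOTE before adding a new base classifier to an incremental ensemble that votes on unseen data. The per-class error weighting of Learn++.NIE is omitted for brevity, and the base learner and unweighted majority-vote rule are simplifying assumptions.

        # Minimal sketch: SMOTE-rebalanced batches feeding an
        # incremental voting ensemble (Learn++.CDS flavour).
        import numpy as np
        from imblearn.over_sampling import SMOTE
        from sklearn.tree import DecisionTreeClassifier

        ensemble = []

        def learn_batch(X, y):
            # Rebalance the minority class, then train a new base model.
            X_bal, y_bal = SMOTE().fit_resample(X, y)
            ensemble.append(
                DecisionTreeClassifier(max_depth=5).fit(X_bal, y_bal))

        def predict(X):
            # Unweighted majority vote over all stored base models.
            votes = np.stack([m.predict(X) for m in ensemble])
            return np.round(votes.mean(axis=0)).astype(int)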

    A Probabilistic Digital Twin for Leak Localization in Water Distribution Networks Using Generative Deep Learning

    Localizing leaks in large water distribution systems is an important and ever-present problem. Due to the complexity originating from water pipeline networks, too few sensors, and noisy measurements, it is a highly challenging problem to solve. In this work, we present a methodology based on generative deep learning and Bayesian inference for leak localization with uncertainty quantification. A generative model built on deep neural networks serves as a probabilistic surrogate that replaces the full equations while also incorporating the uncertainty inherent in such models. By embedding this surrogate model in a Bayesian inference scheme, leaks are located by combining sensor observations with model output, approximating the true posterior distribution over possible leak locations. We show that our methodology produces fast, accurate, and trustworthy results, with convincing performance on three problems of increasing complexity. For a simple test case, the Hanoi network, the average topological distance (ATD) between the predicted and true leak location ranged from 0.3 to 3, depending on the number of sensors and the level of measurement noise. For two more complex test cases, the ATD ranged from 0.75 to 4 and from 1.5 to 10, respectively. Furthermore, accuracies of up to 83%, 72%, and 42% were achieved for the three test cases, respectively. The computation times ranged from 0.1 to 13 s, depending on the size of the neural network employed. This work serves as an example of a digital twin: a sophisticated application of advanced mathematical and deep learning techniques to leak detection.
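    The Bayesian inversion step can be sketched as follows: a pre-trained surrogate predicts sensor readings for each candidate leak location, and Bayes' rule converts the mismatch with the observed readings into a posterior over locations. The placeholder surrogate, noise level, and uniform prior below are assumptions; the paper uses a generative deep network as the forward model.

        # Minimal sketch of surrogate-based Bayesian leak localization.
        import numpy as np

        def surrogate(leak_node, n_sensors=5):
            # Placeholder forward model: leak location -> sensor heads.
            return np.sin(leak_node + np.arange(n_sensors))

        candidates = np.arange(20)   # candidate leak nodes
        rng = np.random.default_rng(1)
        observed = surrogate(7) + 0.05 * rng.normal(size=5)

        sigma = 0.1                  # assumed sensor noise std. dev.
        log_lik = np.array([
            -0.5 * np.sum((observed - surrogate(c)) ** 2) / sigma ** 2
            for c in candidates
        ])
        posterior = np.exp(log_lik - log_lik.max())
        posterior /= posterior.sum() # uniform prior over candidates
        print("most probable leak node:", candidates[np.argmax(posterior)])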

    Deep learning for internet of underwater things and ocean data analytics

    The Internet of Underwater Things (IoUT) is an emerging technological ecosystem for connecting objects in maritime and underwater environments. IoUT technologies are empowered by very large numbers of deployed sensors and actuators. In this thesis, multiple IoUT sensory data streams are augmented with machine intelligence for forecasting purposes.

    A Survey on Graph Representation Learning Methods

    Graph representation learning has been a very active research area in recent years. Its goal is to generate representation vectors that accurately capture the structure and features of large graphs. This is especially important because the quality of the graph representation vectors affects their performance in downstream tasks such as node classification, link prediction, and anomaly detection. Many techniques have been proposed for generating effective graph representation vectors. The two most prevalent categories are graph embedding methods that do not use graph neural nets, which we denote non-GNN based graph embedding methods, and graph neural net (GNN) based methods. Non-GNN graph embedding methods build on techniques such as random walks, temporal point processes, and neural network learning methods, while GNN-based methods apply deep learning to graph data. In this survey, we provide an overview of both categories and cover the current state-of-the-art methods for static and dynamic graphs. Finally, we explore some open and ongoing research directions for future work.
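    As a concrete instance of the non-GNN category, the sketch below follows the DeepWalk recipe: uniform random walks over a graph are treated as sentences and fed to word2vec, so nodes that co-occur on walks receive nearby embedding vectors. The walk length, walk count, and embedding size are illustrative choices, and networkx's Zachary karate club graph stands in for a large graph.

        # Minimal DeepWalk-style sketch: random walks + word2vec.
        import random
        import networkx as nx
        from gensim.models import Word2Vec

        G = nx.karate_club_graph()

        def random_walk(g, start, length=10):
            walk = [start]
            for _ in range(length - 1):
                walk.append(random.choice(list(g.neighbors(walk[-1]))))
            return [str(n) for n in walk]

        walks = [random_walk(G, n) for n in G.nodes() for _ in range(20)]
        model = Word2Vec(walks, vector_size=32, window=5, min_count=1, sg=1)
        print(model.wv["0"][:5])     # embedding vector for node 0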