166 research outputs found
A Big Data Cleaning Method for Drinking-Water Streaming Data
Abstract A HA_Cart_AdaBoost model is proposed to clean drinking-water-quality data. First, data that do not follow the normal distribution are regarded as outliers and eliminated. Next, the optimal control theory of nonlinear partial differential equations (PDEs) is introduced into the CART decision tree, and a CART decision tree of specified depth is used as the weak classifier of AdaBoost. The HA_Cart_AdaBoost model then compensates for the eliminated data by fitting and predicting the missing values of the data stream, realizing the cleaning of the drinking-water-quality data; finally, the Hadoop big-data architecture is used for real-time storage and analysis of the streaming data. The experimental results show that, compared with state-of-the-art data cleaning methods, introducing the optimal control theory of nonlinear PDEs into the CART decision tree greatly improves the stability and accuracy of the HA_Cart_AdaBoost model for water-quality data cleaning. Taking pH as an example, the HA_Cart_AdaBoost model shows a minimum improvement of 2.25% and a maximum improvement of 53.33% in terms of RMSE, and a minimum improvement of 13.51% and a maximum improvement of 78.08% in terms of MAE.
Knowledge-based Data Processing for Multilingual Natural Language Analysis
Natural Language Processing (NLP) empowers intelligent machines by enhancing human language understanding for language-based human-computer communication. Recent developments in processing power, as well as the availability of large volumes of linguistic data, have increased the demand for data-driven methods for automatic semantic analysis. This paper proposes multilingual data processing using feature extraction and classification with deep learning architectures. The input text data are collected in various languages and processed to remove missing and null values. Features are extracted from the processed data using Histogram Equalization based Global Local Entropy (HEGLE) and classified using a Kernel-based Radial Basis Function (Ker_Rad_BF). These architectures can be utilized to process natural language. We present solutions to the multilingual sentiment analysis problem by implementing the algorithms and comparing precision factors to discover the optimal option for multilingual sentiment analysis. For the HASOC dataset, the proposed HEGLE_Ker_Rad_BF achieved an accuracy of 98%, a precision of 97%, a recall of 90.5%, an F1 score of 85%, an RMSE of 55.6%, and a loss of 44%. For the TRAC dataset, the accuracy is 98%, the precision is 97%, the recall is 91%, the F1 score is 87%, and the RMSE of the proposed neural network is 55%.
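The classification stage can be illustrated with a generic radial-basis-function kernel classifier. The synthetic feature vectors below merely stand in for HEGLE features, whose exact computation is not reproduced here, and the two-class setup is an assumed simplification of the sentiment task:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Stand-in for HEGLE feature vectors: two synthetic sentiment classes in feature space.
X_pos = rng.normal(loc=1.0, scale=0.5, size=(100, 8))
X_neg = rng.normal(loc=-1.0, scale=0.5, size=(100, 8))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 100 + [0] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Radial-basis-function kernel: k(x, z) = exp(-gamma * ||x - z||^2)
clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

The RBF kernel lets the classifier draw nonlinear decision boundaries in the extracted feature space, which is the role Ker_Rad_BF plays in the proposed pipeline.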
Understanding ML driven HPC: Applications and Infrastructure
We recently outlined the vision of "Learning Everywhere", which captures the possibility and impact of coupling learning methods with traditional HPC methods. A primary driver of such coupling is the promise that Machine Learning (ML) will deliver major performance improvements for traditional HPC simulations. Motivated by this potential, the ML around HPC class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, which is part of the Learning Everywhere series, we discuss "how" learning methods and HPC simulations are being integrated to enhance the effective performance of computations. The paper identifies several modes, namely substitution, assimilation, and control, in which learning methods integrate with HPC simulations, and provides representative applications in each mode. It discusses some open research questions and, we hope, will motivate and clear the ground for MLaroundHPC benchmarks.
Comment: Invited talk in the "Visionary Track" at IEEE eScience 2019. arXiv admin note: text overlap with arXiv:1806.04731 by another author.
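The substitution mode, in which a trained surrogate replaces an expensive simulation at inference time, can be sketched as follows. The toy "simulation", sampling scheme, and network size are illustrative assumptions, not drawn from the paper:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

def expensive_simulation(x):
    # Stand-in for a costly HPC kernel: a smooth nonlinear response surface.
    return np.sin(3 * x[:, 0]) * np.cos(2 * x[:, 1])

# Offline phase: sample the simulator to build training data.
X_train = rng.uniform(-1, 1, size=(2000, 2))
y_train = expensive_simulation(X_train)

# Train a neural surrogate that substitutes for the simulator at inference time.
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         random_state=2).fit(X_train, y_train)

# Online phase: the surrogate answers new queries cheaply; here we measure its
# mean absolute error against the true simulator on fresh inputs.
X_new = rng.uniform(-1, 1, size=(200, 2))
err = np.abs(surrogate.predict(X_new) - expensive_simulation(X_new)).mean()
```

The performance gain comes from amortization: the simulator is invoked only during training, after which each query costs one cheap forward pass.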
International Conference on Nonlinear Differential Equations and Applications
Dear Participants, Colleagues and Friends
It is a great honour and a privilege to give you all a warmest welcome to the first Portugal-Italy Conference on Nonlinear Differential Equations and Applications (PICNDEA).
This conference takes place at the Colégio Espírito Santo, University of Évora, located in the beautiful city of Évora, Portugal. The host institution, as well as the associated scientific research centres, are committed to the event, hoping that it will be a benchmark for scientific collaboration between the two countries in the area of mathematics.
The main scientific topics of the conference are Ordinary and Partial Differential Equations, with particular regard to nonlinear problems originating in applications, and their treatment with the methods of Numerical Analysis. The fundamental purpose is to bring together Italian and Portuguese researchers in the above fields, to create new collaborations and amplify existing ones, and to follow and discuss new topics in the area.
Dynamic Data Mining: Methodology and Algorithms
Supervised data stream mining has become an important and challenging data mining task in modern
organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples
and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions.
To address these three challenges, this thesis proposes the novel dynamic data mining (DDM)
methodology by effectively applying supervised ensemble models to data stream mining. DDM can be
loosely defined as categorization-organization-selection of supervised ensemble models. It is inspired
by the idea that although the underlying concepts in a data stream are time-varying, their distinctions
can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in
order to classify incoming examples of similar concepts.
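A minimal sketch of this categorize-organize-select idea, under assumed toy concepts: one model is trained per observed concept, keyed by a simple distribution signature (here the feature mean), and an incoming batch is routed to the model whose signature is nearest. The data generator, signature, and model choice are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def make_batch(shift, n=300):
    # Each "concept" shifts the feature distribution; the label rule moves with it.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
    return X, y

# Categorize/organize: train one model per observed concept, keyed by a
# distribution signature of the batch it was trained on.
pool = {}
for shift in (0.0, 3.0):
    X, y = make_batch(shift)
    pool[shift] = (X.mean(axis=0), LogisticRegression().fit(X, y))

# Select: route an incoming batch to the model with the nearest signature.
X_new, y_new = make_batch(3.0)
sig = X_new.mean(axis=0)
_, model = min(pool.values(), key=lambda m: np.linalg.norm(m[0] - sig))
accuracy = model.score(X_new, y_new)
```

Because the incoming batch matches the second concept, the selection step picks the model trained on that concept rather than the stale one.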
First, following the general paradigm of DDM, we examine the different concept-drifting stream
mining scenarios and propose corresponding effective and efficient data mining algorithms.
• To address concept drift caused merely by changes of variable distributions, which we term
pseudo concept drift, base models built on categorized streaming data are organized and
selected in line with their corresponding variable distribution characteristics.
• To address concept drift caused by changes of variable and class joint distributions, which we
term true concept drift, an effective data categorization scheme is introduced. A group of
working models is dynamically organized and selected for reacting to the drifting concept.
Secondly, we introduce an integration stream mining framework, enabling the paradigm advocated by
DDM to be widely applicable to other stream mining problems. As a result, we are able to easily
introduce six effective algorithms for mining data streams with skewed class distributions.
In addition, we introduce a new ensemble model approach for batch learning, following the same
methodology. Both theoretical and empirical studies demonstrate its effectiveness.
Future work will be targeted at improving the effectiveness and efficiency of the proposed
algorithms. Meanwhile, we will explore the possibility of using the integration framework to solve
other open stream mining research problems.
Incremental learning of concept drift from imbalanced data
Learning from data sampled from a nonstationary distribution has been shown to be a very challenging problem in machine learning, because the joint probability distribution between the data and classes evolves over time. Thus, learners must adapt their knowledge base, including their structure or parameters, to remain strong predictors. This phenomenon of learning from an evolving data source is akin to learning how to play a game while the rules of the game are changing, and it is traditionally referred to as learning concept drift. Climate data, financial data, epidemiological data, and spam detection are examples of applications that give rise to concept drift problems. An additional challenge arises when the classes to be learned are not represented (approximately) equally in the training data, as most machine learning algorithms work well only when the class distributions are balanced. However, rare categories are common in real-world applications, which leads to skewed or imbalanced datasets. Fraud detection, rare disease diagnosis, and anomaly detection are examples of applications that feature imbalanced datasets, where data from one category are severely underrepresented. Concept drift and class imbalance are traditionally addressed separately in machine learning, yet data streams can experience both phenomena. This work introduces Learn++.NIE (nonstationary & imbalanced environments) and Learn++.CDS (concept drift with SMOTE) as two new members of the Learn++ family of incremental learning algorithms that explicitly and simultaneously address the aforementioned phenomena. The former addresses concept drift and class imbalance through modified bagging-based sampling and by replacing a class-independent error weighting mechanism, which normally favors the majority class, with a set of measures that emphasize good predictive accuracy on all classes.
The latter integrates Learn++.NSE, an algorithm for concept drift, with the synthetic sampling method known as SMOTE, to cope with class imbalance. This research also includes a thorough evaluation of Learn++.CDS and Learn++.NIE on several real and synthetic datasets and on several figures of merit, showing that both algorithms are able to learn in some of the most difficult learning environments.
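The core SMOTE idea used by Learn++.CDS, generating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as a simplified stand-alone reimplementation (not the Learn++.CDS code; the toy data and neighbour count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def smote_like(X_min, n_new, k=5):
    """Generate synthetic minority samples by interpolating each chosen sample
    toward a random one of its k nearest minority neighbours."""
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = neighbours[i, rng.integers(k)]
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced chunk: only 20 minority samples; synthesize 180 more to
# balance against a (hypothetical) 200-sample majority class.
X_min = rng.normal(loc=2.0, scale=0.3, size=(20, 2))
X_syn = smote_like(X_min, n_new=180)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class stays inside its original region of feature space rather than being duplicated verbatim.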
A Probabilistic Digital Twin for Leak Localization in Water Distribution Networks Using Generative Deep Learning
Localizing leakages in large water distribution systems is an important and ever-present problem. Due to the complexity of water pipeline networks, too few sensors, and noisy measurements, it is highly challenging to solve. In this work, we present a methodology based on generative deep learning and Bayesian inference for leak localization with uncertainty quantification. A generative model utilizing deep neural networks serves as a probabilistic surrogate that replaces the full equations while also incorporating the uncertainty inherent in such models. By embedding this surrogate model in a Bayesian inference scheme, leaks are located by combining sensor observations with model outputs to approximate the true posterior distribution over possible leak locations. We show that our methodology produces fast, accurate, and trustworthy results, with convincing performance on three problems of increasing complexity. For a simple test case, the Hanoi network, the average topological distance (ATD) between the predicted and true leak location ranged from 0.3 to 3 with a varying number of sensors and level of measurement noise. For two more complex test cases, the ATD ranged from 0.75 to 4 and from 1.5 to 10, respectively. Furthermore, accuracies upwards of 83%, 72%, and 42% were achieved for the three test cases, respectively. The computation times ranged from 0.1 to 13 s, depending on the size of the neural network employed. This work serves as an example of a digital twin and a sophisticated application of advanced mathematical and deep learning techniques in the area of leak detection.
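The Bayesian inversion step can be illustrated on a toy one-dimensional network. The exponential "surrogate", sensor layout, and noise level below are hypothetical stand-ins for the paper's generative deep learning model and real hydraulic data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical setup: 10 candidate leak locations along a pipe, 3 pressure sensors.
sensor_pos = np.array([2.0, 5.0, 8.0])
candidates = np.arange(10, dtype=float)

def surrogate(leak_pos):
    # Stand-in for the generative surrogate: the predicted pressure signature
    # at each sensor decays with distance from the leak.
    return np.exp(-np.abs(sensor_pos - leak_pos))

# Observation from a "true" leak at position 5, corrupted by sensor noise.
true_leak = 5.0
noise_std = 0.05
obs = surrogate(true_leak) + rng.normal(0, noise_std, size=3)

# Bayesian inversion over the discrete candidate set: Gaussian likelihood,
# uniform prior, normalized posterior over leak locations.
log_lik = np.array([
    -0.5 * np.sum((obs - surrogate(c)) ** 2) / noise_std**2 for c in candidates
])
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()
best = candidates[np.argmax(posterior)]
```

The posterior itself, not just its argmax, is the useful output: its spread quantifies how confidently the sensors pin down the leak, which is the uncertainty quantification the abstract refers to.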
Deep learning for internet of underwater things and ocean data analytics
The Internet of Underwater Things (IoUT) is an emerging technological ecosystem for connecting objects in maritime and underwater environments. IoUT technologies are empowered by vast numbers of deployed sensors and actuators. In this thesis, multiple IoUT sensory data streams are augmented with machine intelligence for forecasting purposes.
A Survey on Graph Representation Learning Methods
Graph representation learning has been a very active research area in recent years. The goal of graph representation learning is to generate graph representation vectors that accurately capture the structure and features of large graphs. This is especially important because the quality of the graph representation vectors affects their performance in downstream tasks such as node classification, link prediction and anomaly detection. Many techniques have been proposed for generating effective graph representation vectors. Two of the most prevalent categories of graph representation learning are graph embedding methods that do not use graph neural nets (GNNs), which we denote as non-GNN based graph embedding methods, and GNN-based methods. Non-GNN graph embedding methods are based on techniques such as random walks, temporal point processes and neural network learning methods. GNN-based methods, on the other hand, are the application of deep learning to graph data. In this survey, we provide an overview of these two categories and cover the current state-of-the-art methods for both static and dynamic graphs. Finally, we explore some open and ongoing research directions for future work.
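The random-walk family of non-GNN embedding methods can be sketched in a few lines. The walk-plus-SVD factorization below is a linear-algebra stand-in for the skip-gram training used by methods such as DeepWalk, and the toy graph is an assumed example:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy graph: two 4-node cliques (nodes 0-3 and 4-7) joined by one bridge edge.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
edges += [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
edges += [(3, 4)]
adj = {n: [] for n in range(8)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

# Step 1: sample short random walks from random start nodes (the "corpus").
walks = []
for _ in range(200):
    node = rng.integers(8)
    walk = [node]
    for _ in range(9):
        node = rng.choice(adj[node])
        walk.append(node)
    walks.append(walk)

# Step 2: build a window-2 co-occurrence matrix over the walks, then factor it
# with SVD to obtain low-dimensional node embeddings.
co = np.zeros((8, 8))
for walk in walks:
    for i, u in enumerate(walk):
        for v in walk[max(0, i - 2):i + 3]:
            if u != v:
                co[u, v] += 1
U, S, _ = np.linalg.svd(np.log1p(co))
emb = U[:, :2] * S[:2]

# Nodes that co-occur often on walks (same clique) should embed close together.
d_within = np.linalg.norm(emb[0] - emb[1])
d_across = np.linalg.norm(emb[0] - emb[7])
```

The resulting vectors can then feed downstream tasks such as node classification or link prediction, which is exactly the role embedding vectors play in the survey's taxonomy.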