1,651 research outputs found
An Online Sparse Streaming Feature Selection Algorithm
Online streaming feature selection (OSFS), which conducts feature selection
in an online manner, plays an important role in dealing with high-dimensional
data. In many real applications, such as intelligent healthcare platforms,
streaming features always have some missing data, which raises a crucial
challenge in conducting OSFS, i.e., how to establish the uncertain relationship
between sparse streaming features and labels. Unfortunately, existing OSFS
algorithms never consider such an uncertain relationship. To fill this gap,
this paper proposes an online sparse streaming feature selection with
uncertainty (OS2FSU) algorithm. OS2FSU consists of two main parts: 1) latent
factor analysis is utilized to pre-estimate the missing data in sparse
streaming features before conducting feature selection, and 2) fuzzy logic and
neighborhood rough set are employed to alleviate the uncertainty between
estimated streaming features and labels during feature selection. In
the experiments, OS2FSU is compared with five state-of-the-art OSFS algorithms
on six real datasets. The results demonstrate that OS2FSU outperforms its
competitors when missing data are encountered in OSFS.
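The pre-estimation step described in this abstract can be illustrated with a minimal sketch. It assumes a plain matrix-factorization form of latent factor analysis trained by stochastic gradient descent on the observed entries only. The function name `lfa_impute` and all hyperparameters (`rank`, `epochs`, `lr`, `reg`) are hypothetical choices for illustration and are not taken from the OS2FSU paper.

```python
import random


def lfa_impute(matrix, rank=2, epochs=500, lr=0.01, reg=0.01, seed=0):
    """Fill None entries of a samples-by-features matrix with latent-factor estimates.

    Illustrative only: a generic regularized matrix factorization, not the
    authors' implementation.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Small random initial factors for rows (samples) and columns (features).
    P = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    # Only the observed (non-missing) entries contribute to training.
    observed = [
        (i, j, matrix[i][j])
        for i in range(n_rows)
        for j in range(n_cols)
        if matrix[i][j] is not None
    ]
    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j, value in observed:
            pred = sum(P[i][k] * Q[j][k] for k in range(rank))
            err = value - pred
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularization.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    # Keep observed values as they are; estimate only the missing ones.
    return [
        [
            matrix[i][j]
            if matrix[i][j] is not None
            else sum(P[i][k] * Q[j][k] for k in range(rank))
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


if __name__ == "__main__":
    sparse = [[1.0, 2.0, None], [2.0, None, 6.0], [3.0, 6.0, 9.0]]
    for row in lfa_impute(sparse):
        print(row)
```

In a streaming setting the completed features would then be passed to the selection step; the fuzzy-logic and neighborhood-rough-set part of OS2FSU is not sketched here.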
Online Tool Condition Monitoring Based on Parsimonious Ensemble+
Accurate diagnosis of tool wear in the metal turning process remains an open
challenge for both scientists and industrial practitioners because of
inhomogeneities in workpiece material, nonstationary machining settings to suit
production requirements, and nonlinear relations between measured variables and
tool wear. Common methodologies for tool condition monitoring still rely on
batch approaches, which cannot cope with the fast sampling rate of the metal
cutting process. Furthermore, they require a retraining process to be completed
from scratch when dealing with a new set of machining parameters. This paper
presents an online tool condition monitoring approach based on Parsimonious
Ensemble+, pENsemble+. The unique feature of pENsemble+ lies in its highly
flexible principle where both ensemble structure and base-classifier structure
can automatically grow and shrink on the fly based on the characteristics of
data streams. Moreover, an online feature selection scenario is integrated to
actively sample relevant input attributes. The paper presents an advancement of
a newly developed ensemble learning algorithm, pENsemble+, in which an online
active learning scenario is incorporated to reduce operator labelling effort.
An ensemble merging scenario is proposed, which allows reduction of ensemble
complexity while retaining its diversity. Experimental studies utilising
real-world manufacturing data streams and comparisons with well-known
algorithms were carried out. Furthermore, the efficacy of pENsemble was
examined using benchmark concept drift data streams. It has been found that
pENsemble+ incurs low structural complexity and results in a significant
reduction of operator labelling effort. Comment: this paper has been published by IEEE Transactions on Cybernetics
Graph Summarization
The continuous and rapid growth of highly interconnected datasets, which are
both voluminous and complex, calls for the development of adequate processing
and analytical techniques. One method for condensing and simplifying such
datasets is graph summarization. It denotes a series of application-specific
algorithms designed to transform graphs into more compact representations while
preserving structural patterns, query answers, or specific property
distributions. As this problem is common to several areas studying graph
topologies, different approaches, such as clustering, compression, sampling, or
influence detection, have been proposed, primarily based on statistical and
optimization methods. Our chapter pinpoints the main graph
summarization methods, with particular focus on the most recent approaches
and novel research trends on this topic not yet covered by previous surveys. Comment: To appear in the Encyclopedia of Big Data Technologies
Distributed context discovering for predictive modeling
Click prediction has applications in various areas such as advertising, search and online sales. Usually, user-intent information such as query terms and previous click history is used in click prediction. However, this information is not always available. For example, there are no queries from users on the webpages of content publishers, such as personal blogs. The information available for click prediction in this scenario is implicitly derived from users, such as visiting time and IP address. Thus, the existing approaches utilizing user-intent information may be inapplicable in this scenario, and the click prediction problem in this scenario remains unexplored to our knowledge. In addition, the challenges of handling skewed data streams also arise in prediction, since webpages often receive heavy traffic while few visitors click on them. In this thesis, we propose to use the pattern-based classification approach to tackle the click prediction problem. Attributes in webpage visits are combined by a pattern mining algorithm to enhance their power in prediction. To make the pattern-based classification handle skewed data streams, we adopt a sliding window to capture recent data, and an undersampling technique to handle the skewness. As a side problem raised by the pattern-based approach, mining patterns from large datasets is addressed by a distributed pattern sampling algorithm that we propose. This algorithm shows its scalability in experiments. We validate our pattern-based approach in click prediction on a real-world dataset from a Dutch portal website. The experiments show that our pattern-based approach can achieve an average AUC of 0.675 over a period of 36 days with a 5-day sliding window, which surpasses the baseline, a statically trained classification model without patterns, by 0.002. Besides, the average weighted F-measure of our approach is 0.009 higher than the baseline.
Therefore, our proposed approach can slightly improve classification performance; yet whether this improvement is worth deployment in real scenarios remains a question.
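The combination of a sliding window with undersampling mentioned in this abstract can be sketched as follows. This is an illustrative assumption rather than the thesis's implementation: the window here is count-based (the thesis uses a 5-day window), labels are assumed binary (1 for a click, 0 otherwise), and the class name `SlidingWindowUndersampler` is hypothetical.

```python
import random
from collections import deque


class SlidingWindowUndersampler:
    """Keep the most recent examples and draw a class-balanced training set."""

    def __init__(self, window_size, seed=0):
        # deque with maxlen drops the oldest example automatically.
        self.window = deque(maxlen=window_size)
        self.rng = random.Random(seed)

    def add(self, features, label):
        self.window.append((features, label))

    def balanced_sample(self):
        positives = [ex for ex in self.window if ex[1] == 1]
        negatives = [ex for ex in self.window if ex[1] == 0]
        minority, majority = sorted([positives, negatives], key=len)
        if not minority:
            # Only one class is present, so no balanced set can be formed.
            return []
        # Randomly undersample the majority class down to the minority size.
        return minority + self.rng.sample(majority, len(minority))


if __name__ == "__main__":
    sampler = SlidingWindowUndersampler(window_size=50)
    for i in range(100):
        # Hypothetical skewed stream: one click in every ten visits.
        sampler.add({"visit": i}, 1 if i % 10 == 0 else 0)
    batch = sampler.balanced_sample()
    print(len(batch), sum(label for _, label in batch))
```

A classifier would be retrained or updated on each balanced batch as the window advances; the pattern-mining step that builds the features is outside the scope of this sketch.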
Stream Learning in Energy IoT Systems: A Case Study in Combined Cycle Power Plants
The prediction of electrical power produced in combined cycle power plants is a key challenge in the electrical power and energy systems field. This power production can vary depending on environmental variables, such as temperature, pressure, and humidity. Thus, the business problem is how to predict the power production as a function of these environmental conditions, in order to maximize the profit. The research community has solved this problem by applying Machine Learning techniques, and has managed to reduce the computational and time costs in comparison with the traditional thermodynamical analysis. Until now, this challenge has been tackled from a batch learning perspective, in which data is assumed to be at rest, and where models do not continuously integrate new information into already constructed models. We present an approach closer to the Big Data and Internet of Things paradigms, in which data are continuously arriving and where models learn incrementally, achieving significant enhancements in terms of data processing (time, memory and computational costs), and obtaining competitive performances. This work compares and examines the hourly electrical power prediction of several streaming regressors, and discusses the best technique in terms of processing time and predictive performance to be applied to this streaming scenario. This work has been partially supported by the EU project iDev40. This project has received funding
from the ECSEL Joint Undertaking (JU) under grant agreement No 783163. The JU receives support from the
European Union’s Horizon 2020 research and innovation programme and Austria, Germany, Belgium, Italy,
Spain, Romania. It has also been supported by the Basque Government (Spain) through the project VIRTUAL
(KK-2018/00096), and by Ministerio de Economía y Competitividad of Spain (Grant Ref. TIN2017-85887-C2-2-P).
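The contrast between batch and incremental learning drawn in this abstract can be made concrete with a small sketch of a streaming regressor evaluated in test-then-train (prequential) fashion. It assumes a linear model updated by stochastic gradient descent. The names `OnlineLinearRegressor` and `prequential_mae` and the synthetic stream are hypothetical, and the regressors actually compared in the paper are not reproduced here.

```python
class OnlineLinearRegressor:
    """Linear model updated one example at a time (illustrative only)."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_one(self, x):
        return self.b + sum(w * xi for w, xi in zip(self.w, x))

    def learn_one(self, x, y):
        # Single gradient step on the squared error of this example.
        err = y - self.predict_one(x)
        self.w = [w + self.lr * err * xi for w, xi in zip(self.w, x)]
        self.b += self.lr * err


def prequential_mae(model, stream):
    """Predict each example first, then learn from it; return the mean absolute error."""
    total, count = 0.0, 0
    for x, y in stream:
        total += abs(y - model.predict_one(x))
        model.learn_one(x, y)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Hypothetical noiseless stream following y = 2x + 1.
    stream = [([(i % 10) / 10], 2 * ((i % 10) / 10) + 1) for i in range(2000)]
    model = OnlineLinearRegressor(n_features=1, lr=0.1)
    print(prequential_mae(model, stream))
    print(model.predict_one([0.5]))
```

Each example is processed once and then discarded, so memory use stays constant regardless of stream length, which is the property the abstract highlights over batch retraining.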
Data science applications to connected vehicles: Key barriers to overcome
Connected vehicles will generate huge amounts of pervasive, real-time data at very high frequencies. This poses new challenges for data science. How to analyse these data and how to address short-term and long-term storage are some of the key barriers to overcome. JRC.C.6-Economics of Climate Change, Energy and Transport
Feature-based multi-class classification and novelty detection for fault diagnosis of industrial machinery
Given the strategic role that maintenance assumes in achieving profitability and competitiveness, many industries are dedicating considerable effort and resources to improving their maintenance approaches. The concept of the Smart Factory and the possibility of highly connected plants enable the collection of massive data that allow equipment to be monitored continuously and provide real-time feedback on its health status. The main issue met by industries is the lack of data corresponding to faulty conditions, due to the environmental and safety issues that failed machinery might cause, besides the production loss and product quality issues. In this paper, a complete and easy-to-implement procedure for streaming fault diagnosis and novelty detection, using different Machine Learning techniques, is applied to an industrial machinery sub-system. The paper aims to offer useful guidelines to practitioners to choose the best solution for their systems, including a model hyperparameter optimization technique that supports the choice of the best model. Results indicate that the methodology is easy, fast, and accurate. A small amount of training data suffices to achieve high accuracy and strong generalization ability in the classification models, while the integration of a classifier and an anomaly detector reduces the number of false alarms and the computational time.