
    Ensemble model-based method for time series sensors’ data validation and imputation applied to a real waste water treatment plant

    Intelligent Decision Support Systems (IDSSs) integrate different Artificial Intelligence (AI) techniques with the aim of making or supporting human-like decisions. These techniques rely on the available data from the target process, which implies that invalid or missing data could trigger incorrect decisions and, in turn, undesirable situations in the supervised process. This is even more important in environmental systems, whose malfunction could jeopardise the related ecosystems. In data-driven applications such as IDSSs, data quality is a fundamental problem that must be addressed for the sake of overall system performance. In this paper, a data validation and imputation methodology for time series is presented. This methodology is integrated in an IDSS software tool which generates suitable set-points to control the process. The data validation and imputation approach presented here focuses on the imputation step and is based on an ensemble of different prediction models obtained for the sensors involved in the process. A Case-Based Reasoning (CBR) approach is used for data imputation, i.e., past situations similar to the current one are used to propose values for the missing ones. The CBR model is complemented with other prediction models such as Auto-Regressive (AR) models and Artificial Neural Network (ANN) models. The individual predictions are then ensembled to obtain better prediction performance than that of each individual model separately. Furthermore, the use of a meta-prediction model, trained using the predictions of all individual models as inputs, is proposed and compared with other ensemble methods to validate its performance. Finally, the approach is illustrated in a real Waste Water Treatment Plant (WWTP) case study, using one of the most relevant measurements for the correct operation of a WWTP's IDSS, the ammonia sensor, and considering real faults; the results are promising, with the ensemble approach outperforming the predictions of each individual model. The authors acknowledge the partial support of this work by the Industrial Doctorate Programme (2017DI-006) and the Research Consolidated Groups/Centres Grant (2017 SGR 574) from the Catalan Agency of University and Research Grants Management (AGAUR) of the Catalan Government.
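
    As a rough illustration of the meta-prediction idea described above, the sketch below stacks the outputs of several base predictors and trains a meta-model on their combined predictions. The synthetic series, the choice of a k-NN regressor as a stand-in for CBR retrieval, and a linear model as a stand-in for AR are assumptions for illustration only, not the authors' implementation.

```python
# Hedged sketch of a meta-prediction ensemble for sensor imputation:
# base models predict a sensor value from its recent history, and a
# meta-model learns how to combine their outputs. Illustrative only.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor   # stand-in for CBR retrieval
from sklearn.linear_model import LinearRegression   # stand-in for an AR model
from sklearn.neural_network import MLPRegressor     # the ANN predictor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 60, 2000)) + 0.1 * rng.standard_normal(2000)

# Build lagged windows: predict x[t] from the previous `lags` values.
lags = 12
X = np.stack([series[i:i + lags] for i in range(len(series) - lags)])
y = series[lags:]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False, test_size=0.3)

base_models = [
    KNeighborsRegressor(n_neighbors=5),
    LinearRegression(),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
]
for m in base_models:
    m.fit(X_tr, y_tr)

# Meta-model trained on the base models' predictions. (A more careful
# setup would use out-of-fold predictions to avoid leakage.)
meta = LinearRegression().fit(
    np.column_stack([m.predict(X_tr) for m in base_models]), y_tr)

# Impute a missing reading by ensembling the base predictions.
pred = meta.predict(np.column_stack([m.predict(X_te) for m in base_models]))
print("ensemble RMSE:", np.sqrt(np.mean((pred - y_te) ** 2)))
```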

    Strategies for Imputation of High-Resolution Environmental Data in Clinical Randomized Controlled Trials.

    Time series data collected in clinical trials can have varying degrees of missingness, adding challenges during statistical analyses. An additional layer of complexity is introduced for missing data in randomized controlled trials (RCTs), where researchers must remain blinded to intervention and control group assignments. This restriction severely limits the applicability of conventional imputation methods that would use other participants' data to improve performance. This paper explores and compares various methods to impute high-resolution temperature logger data in RCT settings. In addition to conventional non-parametric approaches, we propose a spline regression (SR) approach that captures the time-of-day dynamics of indoor temperature unique to each participant. We investigate how the inclusion of external temperature and energy use can improve model performance. Results show that SR imputation yields a 16% smaller root mean squared error (RMSE) than conventional imputation methods, with the gap widening to 22% when more than half of the data is missing. The SR method is particularly useful when missingness occurs simultaneously for multiple participants, such as during concurrent battery failures. We demonstrate how proper modelling of periodic dynamics can lead to significantly improved imputation performance, even with limited data.
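
    A minimal sketch of the per-participant idea: fit a periodic spline of indoor temperature on time of day using only that participant's observed readings, then impute the gaps from the fitted curve. The synthetic data, knot count, and use of scikit-learn's SplineTransformer are illustrative assumptions, not the paper's code.

```python
# Hedged sketch: impute one participant's missing temperature readings
# from a periodic spline fitted on time of day. Illustrative only.
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
hours = np.arange(0, 24 * 14, 0.25)                # two weeks at 15-min resolution
hod = (hours % 24).reshape(-1, 1)                  # time of day, the key feature
temp = 21 + 2 * np.sin(2 * np.pi * (hours % 24) / 24) \
       + 0.3 * rng.standard_normal(hours.size)
missing = rng.random(hours.size) < 0.5             # half the readings are lost

# Fit the spline only on this participant's observed readings.
model = make_pipeline(
    SplineTransformer(n_knots=8, degree=3, extrapolation="periodic"),
    Ridge(alpha=1e-3),
)
model.fit(hod[~missing], temp[~missing])

# Impute the gaps from the fitted time-of-day curve.
imputed = temp.copy()
imputed[missing] = model.predict(hod[missing])
rmse = np.sqrt(np.mean((imputed[missing] - temp[missing]) ** 2))
print(f"imputation RMSE: {rmse:.3f}")
```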

    An improved k-nearest neighbours method for traffic time series imputation
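
    Only the title is available here, so the paper's improved variant is not reproduced. As a generic, hedged sketch of the baseline technique the title names, the snippet below fills gaps in a traffic count series with scikit-learn's KNNImputer, treating similar days as neighbours; the data and window layout are assumptions.

```python
# Generic k-NN imputation sketch for a traffic count series; this is NOT
# the paper's improved method, just the baseline idea it builds on.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(2)
flow = 100 + 40 * np.sin(2 * np.pi * np.arange(24 * 30) / 24) \
       + 5 * rng.standard_normal(24 * 30)
flow[rng.random(flow.size) < 0.1] = np.nan         # 10% missing readings

# Reshape into day-long rows so neighbours are "similar days";
# KNNImputer fills each NaN from the k most similar days' values.
days = flow.reshape(-1, 24)
imputed_days = KNNImputer(n_neighbors=5, weights="distance").fit_transform(days)
imputed = imputed_days.reshape(-1)
print("remaining NaNs:", np.isnan(imputed).sum())
```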


    Interconnected Services for Time-Series Data Management in Smart Manufacturing Scenarios

    The rise of Smart Manufacturing, together with the strategic initiatives carried out worldwide, has promoted its adoption among manufacturers, who are increasingly interested in data-driven applications for purposes such as product quality control and predictive maintenance of equipment. However, the adoption of these approaches faces diverse technological challenges with regard to the data-related technologies supporting the manufacturing data life-cycle. The main contributions of this dissertation address two specific challenges in the early stages of that life-cycle: optimized storage of the massive amounts of data captured during production processes, and their efficient pre-processing. The first contribution is the design and development of a system that facilitates the pre-processing of captured time-series data through an automated approach that helps select the most adequate pre-processing techniques for each data type. The second contribution is the design and development of a three-level hierarchical architecture for time-series data storage in cloud environments that helps manage and reduce the required data storage resources (and consequently their associated costs). Moreover, with regard to the later stages of the life-cycle, a third contribution leverages advanced data analytics to build an alarm prediction system that enables predictive maintenance of equipment by anticipating the activation of different types of alarms in a real Smart Manufacturing scenario.
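
    As a loose, hypothetical illustration of what a hierarchical time-series storage scheme can look like (the dissertation's actual three-level architecture is not reproduced here), the sketch below routes records to hot, warm, or cold tiers by age, a common way such hierarchies cut storage costs. Tier names and thresholds are invented for illustration.

```python
# Hypothetical sketch of age-based tiering for time-series records.
# Tier names, windows, and the routing rule are illustrative assumptions,
# not the dissertation's design.
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=7)      # fast, expensive storage, raw resolution
WARM_WINDOW = timedelta(days=90)    # cheaper storage, possibly downsampled

def tier_for(timestamp: datetime, now: datetime) -> str:
    """Route a record to a storage tier according to its age."""
    age = now - timestamp
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"                   # archival object storage

now = datetime.now(timezone.utc)
for days_old in (1, 30, 400):
    ts = now - timedelta(days=days_old)
    print(days_old, "days old ->", tier_for(ts, now))
```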

    Forecasting of Photovoltaic Power Production

    Solar irradiance and temperature are among the weather parameters that affect how much power photovoltaic cells can generate. Based on these and past power production, future production can be predicted. Knowing future generation may help integrate this renewable energy source on an even larger scale than today, as well as optimize its current use. In this thesis, future power generation was forecast with an artificial neural network (ANN) model, a support vector regression (SVR) model, an auto-regressive integrated moving average (ARIMA) model, a quantile regression neural network (QRNN) model, an ensemble model of ANN and SVR, an ANN ensemble model, and an ANN model using only numerical weather predictions (NWPs) as inputs. Correlation techniques and principal component analysis were used for feature reduction in all models. The research questions for this thesis are: "How will the models perform using random training data to predict August 2021, compared to a random test sample? Will the ensemble models perform better than the standalone models, and will the quantile regression neural network produce accurate prediction intervals? How good will the predictions be if the ANN model uses only NWP data as inputs, compared to both historical power and NWPs?" Beyond answering these questions, the objective of this thesis is to provide one or more models that can accurately predict future power production for the PV power system in Lillesand. All models can predict future power production, though some with less accuracy than others. As expected, both ensemble models performed best overall in both tests; the SVR model did, however, achieve the lowest MAE on the August test. For different fits, these results will probably change slightly, but the ensemble models are still expected to perform best overall.
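
    A hedged sketch of the ANN+SVR ensemble idea described above: reduce weather features with PCA, train both regressors, and average their forecasts. The synthetic data and all hyperparameters are illustrative assumptions, not the thesis setup.

```python
# Hedged sketch: PCA feature reduction, ANN and SVR forecasts, and a
# simple averaging ensemble. Synthetic data, illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
n = 1500
weather = rng.standard_normal((n, 10))   # NWP features (irradiance, temp, ...)
power = 3 * weather[:, 0] - weather[:, 1] + 0.2 * rng.standard_normal(n)

X_tr, X_te, y_tr, y_te = train_test_split(weather, power, random_state=0)

ann = make_pipeline(StandardScaler(), PCA(n_components=5),
                    MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000,
                                 random_state=0))
svr = make_pipeline(StandardScaler(), PCA(n_components=5), SVR(C=10.0))
ann.fit(X_tr, y_tr)
svr.fit(X_tr, y_tr)

# Simple ensemble: average the two forecasts.
ensemble = (ann.predict(X_te) + svr.predict(X_te)) / 2
for name, pred in [("ANN", ann.predict(X_te)),
                   ("SVR", svr.predict(X_te)),
                   ("ensemble", ensemble)]:
    print(name, "MAE:", round(mean_absolute_error(y_te, pred), 3))
```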

    Proceedings of the ECMLPKDD 2015 Doctoral Consortium

    The ECMLPKDD 2015 Doctoral Consortium was organized for the second time as part of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD), held in Porto during September 7-11, 2015. The objective of the doctoral consortium is to provide an environment for students to exchange ideas and experiences with peers in an interactive atmosphere and to get constructive feedback from senior researchers in machine learning, data mining, and related areas. These proceedings collect and document all the contributions of the ECMLPKDD 2015 Doctoral Consortium.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other speech processing systems able to operate in real-world environments, like mobile communication services and smart homes.

    Democratizing machine learning

    Machine learning artifacts are increasingly embedded in society, often in the form of automated decision-making processes. One major reason for this, along with methodological improvements, is the increasing accessibility of data, but also of machine learning toolkits that enable access to machine learning methodology for non-experts. The core focus of this thesis is exactly this: democratizing access to machine learning in order to enable a wider audience to benefit from its potential.
Contributions in this manuscript stem from several different areas within this broader field. A major section is dedicated to automated machine learning (AutoML), with the goal of abstracting away the tedious task of obtaining an optimal predictive model for a given dataset. This process mostly consists of finding said optimal model, often through hyperparameter optimization, while the user in turn only selects the appropriate performance metric(s) and validates the resulting models. The process can be improved or sped up by learning from previous experiments. Three such methods are presented in this thesis: one aims to obtain a fixed set of hyperparameter configurations that likely contains good solutions for any new dataset, and two use dataset characteristics to propose new configurations. The thesis furthermore presents a collection of the required experiment metadata and shows how such metadata can be used in the development of, and as a test bed for, new hyperparameter optimization methods. The pervasiveness of ML-derived models in many aspects of society simultaneously calls for increased scrutiny of how such models shape society and of the potential biases they exhibit. Therefore, this thesis presents an AutoML tool that allows incorporating fairness considerations into the search for an optimal model. This requirement for fairness simultaneously poses the question of whether a model's fairness can be reliably estimated, which is studied in a further contribution. Since access to machine learning methods also heavily depends on access to software and toolboxes, several contributions in the form of software are part of this thesis. The mlr3pipelines R package allows embedding models in so-called machine learning pipelines that include the pre- and postprocessing steps often required in machine learning and AutoML. The mlr3fairness R package, on the other hand, enables users to audit models for potential biases and to reduce those biases through different debiasing techniques. One such technique, multi-calibration, is published as a separate software package, mcboost.
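
One of the meta-learning ideas mentioned above, choosing a fixed portfolio of configurations that covers many datasets well, can be sketched as greedy set selection over a matrix of past experiment results. The matrix, its shape, and the greedy criterion below are illustrative assumptions, not the thesis's method.

```python
# Hedged sketch of greedy portfolio selection: given past performance of
# many configurations on many datasets, pick a small set that minimises
# the average best-in-portfolio error. Illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(4)
# errors[i, j]: validation error of configuration i on dataset j,
# taken from a (hypothetical) experiment-metadata collection.
errors = rng.random((200, 30))

def greedy_portfolio(errors: np.ndarray, size: int) -> list[int]:
    chosen: list[int] = []
    best_so_far = np.full(errors.shape[1], np.inf)
    for _ in range(size):
        # Pick the configuration that most reduces the mean per-dataset
        # minimum error achieved by the portfolio so far.
        gains = [np.minimum(best_so_far, errors[i]).mean()
                 for i in range(len(errors))]
        pick = int(np.argmin(gains))
        chosen.append(pick)
        best_so_far = np.minimum(best_so_far, errors[pick])
    return chosen

portfolio = greedy_portfolio(errors, size=5)
print("portfolio:", portfolio,
      "mean best error:", errors[portfolio].min(axis=0).mean())
```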

    Inferring implicit relevance from physiological signals

    Ongoing growth in data availability and consumption means users are increasingly faced with the challenge of distilling relevant information from an abundance of noise. Overcoming this information overload can be particularly difficult in situations such as intelligence analysis, which involves subjectivity, ambiguity, or risky social implications. Highly automated solutions are often inadequate, so new methods are needed to augment existing analysis techniques and support user decision making. This project investigated the potential for deep learning to infer the occurrence of implicit relevance assessments from users' biometrics. Internal cognitive processes manifest involuntarily within physiological signals and are often accompanied by 'gut feelings' of intuition. Quantifying unconscious mental processes during relevance appraisal may be a useful tool during decision making by offering an element of objectivity to an inherently subjective situation. Advances in wearable and non-contact sensors have made recording these signals more accessible, whilst advances in artificial intelligence and deep learning have enhanced the discovery of latent patterns within complex data. Together, these techniques might make it possible to transform tacit knowledge into codified knowledge which can be shared. A series of user studies recorded eye gaze movements, pupillary responses, electrodermal activity, heart rate variability, and skin temperature data from participants as they completed a binary relevance assessment task. Participants were asked to explicitly identify which of 40 short-text documents were relevant to an assigned topic. The physiological data was found to contain detectable cues corresponding with relevance judgements. Random forests and artificial neural networks trained on features derived from the signals were able to produce inferences with moderate correlations with the participants' explicit relevance decisions. Several deep learning algorithms trained on the entire physiological time series were generally unable to surpass the performance of the feature-based methods, instead producing inferences with low correlations with participants' explicit judgements. Overall, pupillary responses, eye gaze movements, and electrodermal activity offered the most discriminative power, with additional physiological data providing diminishing or adverse returns. Finally, a conceptual design for a decision support system is used to discuss the social implications and practicalities of quantifying implicit relevance with deep learning techniques. Potential benefits include assisting with introspection and collaborative assessment; however, quantifying intrinsically unknowable concepts using personal data and abstruse artificial intelligence techniques is argued to pose incommensurate risks and challenges. Deep learning techniques therefore have the potential to infer implicit relevance in information-rich environments, but are not yet fit for purpose. Several avenues worthy of further research are outlined.
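
    A hedged sketch of the feature-based approach that performed well above: a random forest trained on summary features of the physiological signals, with its inferred scores then correlated against the explicit relevance judgements. The data, feature names, and label construction below are synthetic assumptions, not the study's recordings.

```python
# Hedged sketch: random forest on physiological summary features, with
# inferred relevance scores correlated against explicit judgements.
# All data here is synthetic; feature names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from scipy.stats import pointbiserialr

rng = np.random.default_rng(5)
n_docs = 400
# Hypothetical per-document features: mean pupil diameter, fixation
# count, electrodermal peaks, heart-rate variability, skin temperature.
X = rng.standard_normal((n_docs, 5))
relevant = (X[:, 0] + 0.5 * X[:, 2] + rng.standard_normal(n_docs)) > 0

clf = RandomForestClassifier(n_estimators=200, random_state=0)
proba = cross_val_predict(clf, X, relevant, cv=5,
                          method="predict_proba")[:, 1]

# Point-biserial correlation between inferred scores and explicit labels.
r, p = pointbiserialr(relevant, proba)
print(f"correlation r={r:.2f} (p={p:.3g})")
```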

    Applications

    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production, and milling for quality control during manufacturing processes; and in traffic and logistics for smart cities and mobile communications.