
    Data-driven Soft Sensors in the Process Industry

    In the last two decades, Soft Sensors have established themselves as a valuable alternative to traditional means for acquiring critical process variables, process monitoring, and other tasks related to process control. This paper discusses characteristics of process industry data that are critical for the development of data-driven Soft Sensors. These characteristics are common to a large number of process industry fields, such as the chemical, bioprocess, and steel industries. The focus of this work is on data-driven Soft Sensors because of their growing popularity, demonstrated usefulness, and huge, though not yet fully realised, potential. The main contributions of this work are a comprehensive selection of case studies covering the three most important Soft Sensor application fields, a general introduction to the most popular Soft Sensor modelling techniques, and a discussion of some open issues in Soft Sensor development and maintenance together with their possible solutions.
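
    As a rough illustration of what a data-driven soft sensor is, the sketch below fits a regression model that infers a hard-to-measure quality variable from routinely logged process measurements. The data and variable roles are synthetic, and the model choice (PLS, a common option in the soft-sensor literature) is only one of many possible techniques; this is not code from the paper.

```python
# Minimal data-driven soft sensor sketch on synthetic data (not from the paper):
# a regression model infers a lab-measured quality variable from routine
# process measurements such as temperatures, pressures and flows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                                  # easy-to-measure process variables
y = X[:, :3] @ np.array([0.8, -0.5, 0.3]) + 0.1 * rng.normal(size=500)  # hard-to-measure quality

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
soft_sensor = make_pipeline(StandardScaler(), PLSRegression(n_components=3))
soft_sensor.fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, soft_sensor.predict(X_te).ravel()))
```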

    Predicting software project effort: A grey relational analysis based method

    The inherent uncertainty of the software development process presents particular challenges for software effort prediction. We need to systematically address missing data values, outlier detection, feature subset selection, and the continuous evolution of predictions as the project unfolds, all in the context of data starvation and noisy data. In this paper, however, we focus particularly on outlier detection, feature subset selection, and effort prediction at an early stage of a project. We propose a novel approach using grey relational analysis (GRA) from grey system theory (GST), a recently developed systems engineering theory based on the uncertainty of small samples. In this work we address some of the theoretical challenges in applying GRA to outlier detection, feature subset selection, and effort prediction, and then evaluate our approach on five publicly available industrial data sets using both stepwise regression and Analogy as benchmarks. The results are very encouraging in the sense of being comparable to or better than other machine learning techniques, and thus indicate that the method has considerable potential.
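
    The core of grey relational analysis can be sketched in a few lines. The snippet below computes grey relational grades between a target project and historical projects on synthetic data, using the standard GRA formula with distinguishing coefficient rho = 0.5; it illustrates the general technique, not the authors' implementation.

```python
# Hedged sketch of grey relational analysis (GRA) for analogy selection.
# Rows of `hist` are completed projects, `target` is the new project;
# all features are assumed to be min-max normalised beforehand.
import numpy as np

def grey_relational_grade(target, hist, rho=0.5):
    """Return the grey relational grade of each historical project w.r.t. the target."""
    delta = np.abs(hist - target)                            # deviation sequences
    d_min, d_max = delta.min(), delta.max()
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)    # grey relational coefficients
    return coeff.mean(axis=1)                                # grade = mean coefficient per project

hist = np.random.default_rng(1).random((8, 5))               # 8 past projects, 5 features
target = np.random.default_rng(2).random(5)
grades = grey_relational_grade(target, hist)
print("most similar projects:", np.argsort(grades)[::-1][:3])
```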

    A Review of Missing Data Handling Techniques for Machine Learning

    Real-world data commonly contain missing values, which adversely affect the performance of most machine learning algorithms employed on such datasets. Missing values are among the various challenges occurring in real-world data. Since the accuracy and efficiency of machine learning models depend on the quality of the data used, data analysts and researchers working with data need to seek out relevant techniques for handling these inescapable missing values. This paper reviews some state-of-the-art practices from the literature for handling missing data problems in machine learning. It lists evaluation metrics used for measuring the performance of these techniques. This study tries to put these techniques and evaluation metrics in clear terms, accompanied by the relevant mathematical equations. Furthermore, some recommendations to consider when dealing with missing data handling techniques are provided.
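
    To make the reviewed ideas concrete, the illustrative sketch below applies three common imputation strategies to the same artificially incomplete matrix and scores them with RMSE on the masked cells, one of the kinds of evaluation metrics such reviews list. It is a generic example, not code from the paper.

```python
# Three common imputation strategies on the same incomplete matrix, scored by
# RMSE against the known ground truth on the cells that were masked out.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 6))
X_miss = X_true.copy()
mask = rng.random(X_true.shape) < 0.1                  # 10% of values missing at random
X_miss[mask] = np.nan

for name, imp in [("mean", SimpleImputer(strategy="mean")),
                  ("kNN", KNNImputer(n_neighbors=5)),
                  ("iterative", IterativeImputer(random_state=0))]:
    X_hat = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
    print(f"{name:9s} RMSE on imputed cells: {rmse:.3f}")
```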

    Condition Monitoring of Wind Turbines Using Intelligent Machine Learning Techniques

    Wind turbine condition monitoring can detect anomalies in turbine performance which have the potential to result in unexpected failure and financial loss. This study examines common Supervisory Control And Data Acquisition (SCADA) data over a period of 20 months for 21 pitch-regulated 2.3 MW turbines and is presented in three manuscripts. First, power curve monitoring is targeted by applying various types of Artificial Neural Networks to increase modeling accuracy, and it is shown how the proposed method can significantly improve network reliability compared with existing models. Then, an advanced technique is utilized to create a smoother dataset for network training, followed by establishing a dynamic ANFIS network; at this stage, the designed network aims to predict power generation in future hours. Finally, a recursive principal component analysis is performed to extract significant features to be used as input parameters of the network, and a novel fusion technique is then employed to build an advanced model that makes predictions of turbine performance with favorably low errors.
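
    A minimal version of power-curve monitoring can be sketched as follows: a small neural network learns the wind-speed-to-power mapping from synthetic SCADA-like data, and samples with large residuals are flagged as suspect. This is a simplified stand-in, not the ANFIS or fusion models developed in the study.

```python
# Power-curve monitoring sketch on synthetic SCADA-like data: fit a neural
# network to the wind speed -> power mapping and flag large residuals.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
wind = rng.uniform(3, 25, size=2000).reshape(-1, 1)            # wind speed, m/s
power = 2.3 / (1 + np.exp(-(wind[:, 0] - 11)))                 # idealised 2.3 MW power curve
power += rng.normal(0, 0.05, size=power.shape)                 # sensor noise, MW

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0))
model.fit(wind, power)

residual = power - model.predict(wind)
threshold = 3 * residual.std()                                 # simple anomaly threshold
print("suspect samples:", np.sum(np.abs(residual) > threshold))
```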

    Rice seed image classification based on HOG descriptor with missing values imputation

    Rice is a primary source of food consumed by almost half of the world's population, and rice quality mainly depends on the purity of the rice seed. In order to ensure the purity of a rice variety, the recognition process is an essential stage. In this paper, we first propose to use the histogram of oriented gradients (HOG) descriptor to characterize rice seed images. Since image sizes vary, the features extracted by HOG have different dimensions and cannot be used directly by a classifier; we therefore apply several imputation methods to fill in the missing data for the HOG descriptor. The experiment is carried out on the VNRICE benchmark dataset to evaluate the proposed approach.
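
    The general mechanism can be sketched as follows: HOG vectors computed from images of different sizes have different lengths, so shorter vectors are padded with placeholders treated as missing values and then filled by an imputer before classification. The snippet below uses scikit-image's hog and scikit-learn's KNNImputer on random images; it illustrates the idea rather than reproducing the paper's pipeline.

```python
# HOG descriptors of differently sized images have different lengths; pad the
# shorter ones with NaN and let an imputer fill the gaps (illustrative only).
import numpy as np
from skimage.feature import hog
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
images = [rng.random((h, w)) for h, w in [(64, 32), (72, 40), (64, 32), (80, 36)]]

feats = [hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
         for img in images]
max_len = max(len(f) for f in feats)
X = np.full((len(feats), max_len), np.nan)
for i, f in enumerate(feats):
    X[i, :len(f)] = f                          # shorter descriptors leave trailing NaNs

X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_imputed.shape)                          # fixed-dimension feature matrix for a classifier
```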

    Missing Data Imputation Using Machine Learning and Natural Language Processing for Clinical Diagnostic Codes

    Imputation of missing data is a common application in supervised classification problems, where the feature matrix of the training dataset has various degrees of missingness. Most former studies do not take into account the presence of the class label in classification problems with missing data. A widely used solution to this problem is missing data imputation based on the lazy learning technique, the k-Nearest Neighbor (KNN) approach. We work on a variant of this imputation algorithm using Gray's distance and Mutual Information (MI), called the Class-weighted Gray's k-Nearest Neighbor (CGKNN) approach. Gray's distance works well with heterogeneous mixed-type data with missing instances, and we weight the distance with mutual information (MI), a measure of feature relevance, between the features and the class label. This method performs better than traditional methods for classification problems with mixed data, as shown in simulations and applications on University of California, Irvine (UCI) Machine Learning datasets (http://archive.ics.uci.edu/ml/index.php). Data being lost to follow-up is a common problem in longitudinal data, especially when multiple visits occur over a long period of time. If the outcome of interest is present at each time point despite covariates missing due to follow-up (for example, an outcome ascertained through phone calls), then random forest imputation is a good technique for the missing covariates. The missingness of the data involves more complicated interactions over time, since most of the covariates and the outcome have repeated measurements over time. Random forests are a good non-parametric learning technique which captures complex interactions between mixed-type data. We propose a proximity imputation and a missForest-type covariate imputation with random splits while building the forest. The performance of the imputation techniques used is compared to existing techniques in various simulation settings. The Atherosclerosis Risk in Communities (ARIC) Study Cohort is a longitudinal study, started in 1987-1989, that collects data on participants across 4 states in the USA, aimed at studying the factors behind heart diseases. We consider patients at the 5th visit (which occurred in 2013) who were enrolled in continuous Medicare Fee-For-Service (FFS) insurance in the last 6 months prior to their visit, so that their hospitalization diagnostic (ICD) codes are available. Our aim is to characterize the hospitalization of patients having cognitive status ascertainment (classified into dementia, mild cognitive disorder, or no cognitive disorder) at the 5th visit. Diagnostic codes for inpatient and outpatient visits identified from CMS (Centers for Medicare & Medicaid Services) Medicare FFS data linked with ARIC participant data are stored in the form of International Classification of Diseases and related health problems (ICD) codes. We treat these codes as a bag-of-words model to apply text mining techniques and obtain meaningful clusters of ICD codes.
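
    A heavily simplified sketch of the first idea, feature-relevance-weighted kNN imputation, is given below: distances between an incomplete instance and the complete instances are weighted by each feature's mutual information with the class label, and the missing entries are filled with the mean of the nearest donors. It uses plain Euclidean distance rather than Gray's distance and omits several refinements of the CGKNN method.

```python
# Simplified, relevance-weighted kNN imputation (not the CGKNN algorithm itself):
# feature weights come from mutual information with the class label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_weighted_knn_impute(X, y, k=5):
    X = X.copy()
    complete = ~np.isnan(X).any(axis=1)                       # rows without missing values
    mi = mutual_info_classif(X[complete], y[complete], random_state=0)
    w = mi / (mi.sum() + 1e-12)                               # feature relevance weights
    for i in np.where(~complete)[0]:
        miss = np.isnan(X[i])
        d = np.sqrt((((X[complete][:, ~miss] - X[i, ~miss]) ** 2) * w[~miss]).sum(axis=1))
        donors = X[complete][np.argsort(d)[:k]]               # k nearest complete donors
        X[i, miss] = donors[:, miss].mean(axis=0)             # fill with donor means
    return X

# Hypothetical usage: X has NaNs in some rows, y holds the class label of each row.
# X_filled = mi_weighted_knn_impute(X, y, k=5)
```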

    Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study

    Being able to predict software quality is essential, but it also poses significant challenges in software engineering. Historical software project datasets are often utilized together with various machine learning algorithms for fault-proneness classification. Unfortunately, missing values in datasets have negative impacts on estimation accuracy and could therefore lead to inconsistent results. As a method for handling missing data, K nearest neighbor (KNN) imputation has gradually gained acceptance in empirical studies due to its exemplary performance and simplicity. To date, researchers still call for optimized parameter settings for KNN imputation to further improve its performance. In this work, we develop a novel incomplete-instance based KNN imputation technique, which utilizes a cross-validation scheme to optimize the parameters for each missing value. An experimental assessment is conducted on eight quality datasets under various missingness scenarios. The study also compares the proposed imputation approach with mean imputation and three other KNN imputation approaches. The results show that our proposed approach is superior to the others in general. The relatively optimal fixed parameter settings for KNN imputation for software quality data are also determined. It is observed that classification accuracy is improved, or at least maintained, by using our approach for missing data imputation.
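
    The underlying idea of tuning KNN imputation by cross-validation can be sketched as follows: some known cells are hidden, imputed back with different values of k, and the k with the lowest reconstruction error is selected. The paper optimizes parameters per missing value; this simplified stand-in picks a single global k.

```python
# Choose the kNN-imputation parameter k by hiding known cells and measuring
# how well they are reconstructed (assumes no column becomes entirely missing).
import numpy as np
from sklearn.impute import KNNImputer

def choose_k(X_incomplete, candidate_ks=(1, 3, 5, 7, 9), val_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X_incomplete))
    held_out = observed[rng.choice(len(observed), int(val_frac * len(observed)), replace=False)]
    X_val = X_incomplete.copy()
    X_val[held_out[:, 0], held_out[:, 1]] = np.nan             # hide known cells
    best_k, best_err = None, np.inf
    for k in candidate_ks:
        X_hat = KNNImputer(n_neighbors=k).fit_transform(X_val)
        err = np.mean((X_hat[held_out[:, 0], held_out[:, 1]] -
                       X_incomplete[held_out[:, 0], held_out[:, 1]]) ** 2)
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```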

    Fault diagnosis of chemical processes with incomplete observations: A comparative study

    An important problem to be addressed by diagnostic systems in industrial applications is the estimation of faults from incomplete observations. This work discusses different approaches for handling missing data and the performance of data-driven fault diagnosis schemes. Methods that exploit the classifier itself and combined methods were assessed on the Tennessee-Eastman process, for which diverse incomplete observations were produced. The use of several indicators revealed the trade-off between the performances of the different schemes. Support vector machines (SVM) and C4.5, combined with k-nearest neighbour (kNN) imputation, produce the highest robustness and accuracy, respectively. Bayesian networks (BN) and the centroid method appear to be inappropriate options in terms of accuracy, while Gaussian naive Bayes (GNB) is sensitive to imputation values. In addition, feature selection was explored for further performance enhancement, and the proposed contribution index showed promising results. Finally, an industrial case was studied to assess the informative level of incomplete data in terms of the redundancy ratio and to generalize the discussion.
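
    One of the compared schemes, kNN imputation followed by an SVM fault classifier, can be sketched as below on synthetic data standing in for the Tennessee-Eastman measurements; it illustrates the combination rather than reproducing the study's experimental setup.

```python
# kNN imputation of incomplete observations followed by an SVM fault classifier,
# evaluated with cross-validation on synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))                         # stand-in process measurements
y = (X[:, 0] + X[:, 1] > 0).astype(int)                # two synthetic fault classes
X[rng.random(X.shape) < 0.15] = np.nan                 # incomplete observations

clf = make_pipeline(KNNImputer(n_neighbors=5), StandardScaler(), SVC(kernel="rbf"))
print("accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```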

    Imputation techniques for the reconstruction of missing interconnected data from higher Educational Institutions

    Data on Educational Institutions constitute the basis for several important analyses of educational systems; however, for several reasons they often contain non-negligible shares of missing values. In this work we consider the relevant case of the European Tertiary Education Register (ETER), which describes the Educational Institutions of Europe. The presence of missing values prevents the full exploitation of this database, since several types of analyses that could be performed are currently impracticable. The imputation of artificial data, reconstructed with the aim of being statistically equivalent to the (unknown) missing data, would allow these problems to be overcome. A main complication in the imputation of this type of data is given by the correlations that exist among all the variables. We propose several imputation techniques designed to deal with the different types of missing values appearing in these interconnected data, and we use these techniques to impute the database. Moreover, we evaluate the accuracy of the proposed approach by artificially introducing missing data, imputing them, and comparing the imputed and original values. Results show that the information reconstruction does not introduce statistically significant changes in the data and that the imputed values are close enough to the original values.
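
    The evaluation protocol described above can be sketched as follows on synthetic data: a share of known values is hidden, imputed, and the imputed cells are compared with the originals both by cell-wise error and by a distributional test (here a two-sample Kolmogorov-Smirnov test, which may differ from the statistical tests used in the paper).

```python
# Evaluate an imputer by masking known values, imputing them back, and
# comparing imputed vs. original cells (error plus a distributional check).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(loc=5.0, scale=2.0, size=(300, 8))
mask = rng.random(X_true.shape) < 0.15                  # artificially introduced missingness
X_miss = np.where(mask, np.nan, X_true)

X_hat = IterativeImputer(random_state=0).fit_transform(X_miss)
print("mean abs error:", np.abs(X_hat[mask] - X_true[mask]).mean())
print("KS p-value    :", ks_2samp(X_hat[mask], X_true[mask]).pvalue)
```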

    Fuzzy C-mean missing data imputation for analogy-based effort estimation

    The accuracy of effort estimation is one of the major factors in the success or failure of software projects. Analogy-Based Estimation (ABE) is a widely accepted estimation model, since it follows human nature in selecting analogies similar to the target project. The accuracy of prediction in the ABE model is strongly associated with the quality of the dataset, since it depends on previously completed projects for estimation. Missing Data (MD) is one of the major challenges in software engineering datasets. Several missing data imputation techniques have been investigated by researchers for the ABE model. Identifying the most similar donor values from the dataset of completed software projects for imputation is a challenging issue in the existing missing data techniques adopted for the ABE model. In this study, Fuzzy C-Mean Imputation (FCMI), Mean Imputation (MI), and K-Nearest Neighbor Imputation (KNNI) are investigated for imputing missing values in the Desharnais dataset under different missing data percentages (Desh-Miss1, Desh-Miss2) for the ABE model, and an FCMI-ABE technique is proposed. An evaluation comparison among MI, KNNI, and ABE-FCMI is conducted to identify the most suitable MD imputation method for the ABE model. The results suggest that the use of ABE-FCMI, rather than MI or KNNI, imputes more reliable values for the incomplete software projects in the missing datasets. It was also found that the proposed imputation method significantly improves the software development effort prediction of the ABE model.
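
    A heavily simplified sketch of fuzzy-c-means-based imputation is given below, written from the general FCM algorithm rather than from the paper: fuzzy clusters are fitted on the complete projects, and each missing value is filled with the membership-weighted average of the cluster centroids.

```python
# Generic fuzzy c-means (FCM) imputation sketch: fit FCM on complete rows,
# then fill missing entries with membership-weighted centroid values.
import numpy as np

def fcm(X, c=3, m=2.0, iters=100, seed=0):
    """Return the c fuzzy cluster centroids of complete data X."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c)); U /= U.sum(axis=1, keepdims=True)   # random memberships
    p = 2.0 / (m - 1.0)
    for _ in range(iters):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]                     # update centroids
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=1, keepdims=True))  # update memberships
    return V

def fcm_impute(X, c=3, m=2.0):
    X = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    V = fcm(X[complete], c=c, m=m)
    p = 2.0 / (m - 1.0)
    for i in np.where(~complete)[0]:
        miss = np.isnan(X[i])
        d = np.linalg.norm(V[:, ~miss] - X[i, ~miss], axis=1) + 1e-12   # distances on observed features
        u = 1.0 / (d ** p * (1.0 / d ** p).sum())                       # memberships (sum to 1)
        X[i, miss] = u @ V[:, miss]                                     # weighted centroid average
    return X

# Hypothetical usage on a project feature matrix with NaNs:
# X_filled = fcm_impute(X_with_nans, c=3)
```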