
    Data-driven Soft Sensors in the Process Industry

    In the last two decades, Soft Sensors have established themselves as a valuable alternative to traditional means for acquiring critical process variables, process monitoring, and other tasks related to process control. This paper discusses characteristics of process industry data that are critical for the development of data-driven Soft Sensors. These characteristics are common to a large number of process industry fields, such as the chemical, bioprocess, and steel industries. The focus of this work is on data-driven Soft Sensors because of their growing popularity, demonstrated usefulness, and huge, though not yet fully realised, potential. The main contributions of this work are a comprehensive selection of case studies covering the three most important Soft Sensor application fields, a general introduction to the most popular Soft Sensor modelling techniques, and a discussion of some open issues in Soft Sensor development and maintenance together with their possible solutions.
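
    As a rough illustration of what a data-driven soft sensor is, the sketch below fits a regression model that infers a hard-to-measure quality variable from routinely logged process measurements. The data and variable roles are synthetic, and the model choice (PLS, a common option in the soft-sensor literature) is only one of many possible techniques; this is not code from the paper.

```python
# Minimal data-driven soft sensor sketch on synthetic data (not from the paper):
# a regression model infers a lab-measured quality variable from routine
# process measurements such as temperatures, pressures and flows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                                  # easy-to-measure process variables
y = X[:, :3] @ np.array([0.8, -0.5, 0.3]) + 0.1 * rng.normal(size=500)  # hard-to-measure quality

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
soft_sensor = make_pipeline(StandardScaler(), PLSRegression(n_components=3))
soft_sensor.fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, soft_sensor.predict(X_te).ravel()))
```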

    Predicting software project effort: A grey relational analysis based method

    The inherent uncertainty of the software development process presents particular challenges for software effort prediction. We need to systematically address missing data values, outlier detection, feature subset selection, and the continuous evolution of predictions as the project unfolds, all in the context of data starvation and noisy data. In this paper, however, we focus particularly on outlier detection, feature subset selection, and effort prediction at an early stage of a project. We propose a novel approach using grey relational analysis (GRA) from grey system theory (GST), a recently developed systems engineering theory based on the uncertainty of small samples. In this work we address some of the theoretical challenges in applying GRA to outlier detection, feature subset selection, and effort prediction, and then evaluate our approach on five publicly available industrial data sets using both stepwise regression and Analogy as benchmarks. The results are very encouraging in the sense of being comparable to or better than other machine learning techniques, and thus indicate that the method has considerable potential.
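
    The core of grey relational analysis can be sketched in a few lines. The snippet below computes grey relational grades between a target project and historical projects on synthetic data, using the standard GRA formula with distinguishing coefficient rho = 0.5; it illustrates the general technique, not the authors' implementation.

```python
# Hedged sketch of grey relational analysis (GRA) for analogy selection.
# Rows of `hist` are completed projects, `target` is the new project;
# all features are assumed to be min-max normalised beforehand.
import numpy as np

def grey_relational_grade(target, hist, rho=0.5):
    """Return the grey relational grade of each historical project w.r.t. the target."""
    delta = np.abs(hist - target)                            # deviation sequences
    d_min, d_max = delta.min(), delta.max()
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)    # grey relational coefficients
    return coeff.mean(axis=1)                                # grade = mean coefficient per project

hist = np.random.default_rng(1).random((8, 5))               # 8 past projects, 5 features
target = np.random.default_rng(2).random(5)
grades = grey_relational_grade(target, hist)
print("most similar projects:", np.argsort(grades)[::-1][:3])
```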

    A Review of Missing Data Handling Techniques for Machine Learning

    Real-world data commonly contain missing values, which adversely affect the performance of most machine learning algorithms employed on such datasets. Missing values are among the various challenges occurring in real-world data. Since the accuracy and efficiency of machine learning models depend on the quality of the data used, data analysts and researchers working with data need to seek out relevant techniques for handling these inescapable missing values. This paper reviews some state-of-the-art practices from the literature for handling missing data problems in machine learning. It lists evaluation metrics used for measuring the performance of these techniques. This study tries to put these techniques and evaluation metrics in clear terms, accompanied by the relevant mathematical equations. Furthermore, some recommendations to consider when dealing with missing data handling techniques are provided.
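
    To make the reviewed ideas concrete, the illustrative sketch below applies three common imputation strategies to the same artificially incomplete matrix and scores them with RMSE on the masked cells, one of the kinds of evaluation metrics such reviews list. It is a generic example, not code from the paper.

```python
# Three common imputation strategies on the same incomplete matrix, scored by
# RMSE against the known ground truth on the cells that were masked out.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 6))
X_miss = X_true.copy()
mask = rng.random(X_true.shape) < 0.1                  # 10% of values missing at random
X_miss[mask] = np.nan

for name, imp in [("mean", SimpleImputer(strategy="mean")),
                  ("kNN", KNNImputer(n_neighbors=5)),
                  ("iterative", IterativeImputer(random_state=0))]:
    X_hat = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
    print(f"{name:9s} RMSE on imputed cells: {rmse:.3f}")
```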

    Condition Monitoring of Wind Turbines Using Intelligent Machine Learning Techniques

    Wind turbine condition monitoring can detect anomalies in turbine performance which have the potential to result in unexpected failure and financial loss. This study examines common Supervisory Control And Data Acquisition (SCADA) data over a period of 20 months for 21 pitch-regulated 2.3 MW turbines and is presented in three manuscripts. First, power curve monitoring is targeted by applying various types of Artificial Neural Networks to increase modeling accuracy, and it is shown how the proposed method can significantly improve network reliability compared with existing models. Then, an advanced technique is utilized to create a smoother dataset for network training, followed by establishing a dynamic ANFIS network; at this stage, the designed network aims to predict power generation in future hours. Finally, a recursive principal component analysis is performed to extract significant features to be used as input parameters of the network, and a novel fusion technique is then employed to build an advanced model that makes predictions of turbine performance with favorably low errors.
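
    A minimal version of power-curve monitoring can be sketched as follows: a small neural network learns the wind-speed-to-power mapping from synthetic SCADA-like data, and samples with large residuals are flagged as suspect. This is a simplified stand-in, not the ANFIS or fusion models developed in the study.

```python
# Power-curve monitoring sketch on synthetic SCADA-like data: fit a neural
# network to the wind speed -> power mapping and flag large residuals.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
wind = rng.uniform(3, 25, size=2000).reshape(-1, 1)            # wind speed, m/s
power = 2.3 / (1 + np.exp(-(wind[:, 0] - 11)))                 # idealised 2.3 MW power curve
power += rng.normal(0, 0.05, size=power.shape)                 # sensor noise, MW

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0))
model.fit(wind, power)

residual = power - model.predict(wind)
threshold = 3 * residual.std()                                 # simple anomaly threshold
print("suspect samples:", np.sum(np.abs(residual) > threshold))
```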

    Rice seed image classification based on HOG descriptor with missing values imputation

    Rice is a primary source of food consumed by almost half of the world's population, and rice quality mainly depends on the purity of the rice seed. In order to ensure the purity of a rice variety, the recognition process is an essential stage. In this paper, we first propose to use the histogram of oriented gradients (HOG) descriptor to characterize rice seed images. Since image sizes vary, the features extracted by HOG have different dimensions and cannot be used directly by a classifier; we therefore apply several imputation methods to fill in the missing data for the HOG descriptor. The experiment is carried out on the VNRICE benchmark dataset to evaluate the proposed approach.
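
    The general mechanism can be sketched as follows: HOG vectors computed from images of different sizes have different lengths, so shorter vectors are padded with placeholders treated as missing values and then filled by an imputer before classification. The snippet below uses scikit-image's hog and scikit-learn's KNNImputer on random images; it illustrates the idea rather than reproducing the paper's pipeline.

```python
# HOG descriptors of differently sized images have different lengths; pad the
# shorter ones with NaN and let an imputer fill the gaps (illustrative only).
import numpy as np
from skimage.feature import hog
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
images = [rng.random((h, w)) for h, w in [(64, 32), (72, 40), (64, 32), (80, 36)]]

feats = [hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
         for img in images]
max_len = max(len(f) for f in feats)
X = np.full((len(feats), max_len), np.nan)
for i, f in enumerate(feats):
    X[i, :len(f)] = f                          # shorter descriptors leave trailing NaNs

X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_imputed.shape)                          # fixed-dimension feature matrix for a classifier
```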

    Missing Data Imputation Using Machine Learning and Natural Language Processing for Clinical Diagnostic Codes

    Imputation of missing data is a common application in supervised classification problems, where the feature matrix of the training dataset has various degrees of missingness. Most former studies do not take into account the presence of the class label in classification problems with missing data. A widely used solution to this problem is missing data imputation based on the lazy learning technique, the k-Nearest Neighbor (KNN) approach. We work on a variant of this imputation algorithm using Gray's distance and Mutual Information (MI), called the Class-weighted Gray's k-Nearest Neighbor (CGKNN) approach. Gray's distance works well with heterogeneous mixed-type data with missing instances, and we weight the distance with mutual information (MI), a measure of feature relevance, between the features and the class label. This method performs better than traditional methods for classification problems with mixed data, as shown in simulations and applications on University of California, Irvine (UCI) Machine Learning datasets (http://archive.ics.uci.edu/ml/index.php). Data being lost to follow-up is a common problem in longitudinal data, especially when multiple visits occur over a long period of time. If the outcome of interest is present at each time point despite covariates missing due to follow-up (for example, an outcome ascertained through phone calls), then random forest imputation is a good technique for the missing covariates. The missingness of the data involves more complicated interactions over time, since most of the covariates and the outcome have repeated measurements over time. Random forests are a good non-parametric learning technique which captures complex interactions between mixed-type data. We propose a proximity imputation and a missForest-type covariate imputation with random splits while building the forest. The performance of the imputation techniques used is compared to existing techniques in various simulation settings. The Atherosclerosis Risk in Communities (ARIC) Study Cohort is a longitudinal study, started in 1987-1989, that collects data on participants across 4 states in the USA, aimed at studying the factors behind heart diseases. We consider patients at the 5th visit (which occurred in 2013) who were enrolled in continuous Medicare Fee-For-Service (FFS) insurance in the last 6 months prior to their visit, so that their hospitalization diagnostic (ICD) codes are available. Our aim is to characterize the hospitalization of patients having cognitive status ascertainment (classified into dementia, mild cognitive disorder, or no cognitive disorder) at the 5th visit. Diagnostic codes for inpatient and outpatient visits identified from CMS (Centers for Medicare & Medicaid Services) Medicare FFS data linked with ARIC participant data are stored in the form of International Classification of Diseases and related health problems (ICD) codes. We treat these codes as a bag-of-words model to apply text mining techniques and obtain meaningful clusters of ICD codes.
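
    A heavily simplified sketch of the first idea, feature-relevance-weighted kNN imputation, is given below: distances between an incomplete instance and the complete instances are weighted by each feature's mutual information with the class label, and the missing entries are filled with the mean of the nearest donors. It uses plain Euclidean distance rather than Gray's distance and omits several refinements of the CGKNN method.

```python
# Simplified, relevance-weighted kNN imputation (not the CGKNN algorithm itself):
# feature weights come from mutual information with the class label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_weighted_knn_impute(X, y, k=5):
    X = X.copy()
    complete = ~np.isnan(X).any(axis=1)                       # rows without missing values
    mi = mutual_info_classif(X[complete], y[complete], random_state=0)
    w = mi / (mi.sum() + 1e-12)                               # feature relevance weights
    for i in np.where(~complete)[0]:
        miss = np.isnan(X[i])
        d = np.sqrt((((X[complete][:, ~miss] - X[i, ~miss]) ** 2) * w[~miss]).sum(axis=1))
        donors = X[complete][np.argsort(d)[:k]]               # k nearest complete donors
        X[i, miss] = donors[:, miss].mean(axis=0)             # fill with donor means
    return X

# Hypothetical usage: X has NaNs in some rows, y holds the class label of each row.
# X_filled = mi_weighted_knn_impute(X, y, k=5)
```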

    Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study

    Being able to predict software quality is essential, but it also poses significant challenges in software engineering. Historical software project datasets are often utilized together with various machine learning algorithms for fault-proneness classification. Unfortunately, missing values in datasets have negative impacts on estimation accuracy and could therefore lead to inconsistent results. As a method for handling missing data, K nearest neighbor (KNN) imputation has gradually gained acceptance in empirical studies due to its exemplary performance and simplicity. To date, researchers still call for optimized parameter settings for KNN imputation to further improve its performance. In this work, we develop a novel incomplete-instance based KNN imputation technique, which utilizes a cross-validation scheme to optimize the parameters for each missing value. An experimental assessment is conducted on eight quality datasets under various missingness scenarios. The study also compares the proposed imputation approach with mean imputation and three other KNN imputation approaches. The results show that our proposed approach is superior to the others in general. The relatively optimal fixed parameter settings for KNN imputation for software quality data are also determined. It is observed that classification accuracy is improved, or at least maintained, by using our approach for missing data imputation.
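
    The underlying idea of tuning KNN imputation by cross-validation can be sketched as follows: some known cells are hidden, imputed back with different values of k, and the k with the lowest reconstruction error is selected. The paper optimizes parameters per missing value; this simplified stand-in picks a single global k.

```python
# Choose the kNN-imputation parameter k by hiding known cells and measuring
# how well they are reconstructed (assumes no column becomes entirely missing).
import numpy as np
from sklearn.impute import KNNImputer

def choose_k(X_incomplete, candidate_ks=(1, 3, 5, 7, 9), val_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X_incomplete))
    held_out = observed[rng.choice(len(observed), int(val_frac * len(observed)), replace=False)]
    X_val = X_incomplete.copy()
    X_val[held_out[:, 0], held_out[:, 1]] = np.nan             # hide known cells
    best_k, best_err = None, np.inf
    for k in candidate_ks:
        X_hat = KNNImputer(n_neighbors=k).fit_transform(X_val)
        err = np.mean((X_hat[held_out[:, 0], held_out[:, 1]] -
                       X_incomplete[held_out[:, 0], held_out[:, 1]]) ** 2)
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```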

    Fault diagnosis of chemical processes with incomplete observations: A comparative study

    An important problem to be addressed by diagnostic systems in industrial applications is the estimation of faults from incomplete observations. This work discusses different approaches for handling missing data and the performance of data-driven fault diagnosis schemes. Methods that exploit the classifier itself and combined methods were assessed on the Tennessee-Eastman process, for which diverse incomplete observations were produced. The use of several indicators revealed the trade-off between the performances of the different schemes. Support vector machines (SVM) and C4.5, combined with k-nearest neighbour (kNN) imputation, produce the highest robustness and accuracy, respectively. Bayesian networks (BN) and the centroid method appear to be inappropriate options in terms of accuracy, while Gaussian naive Bayes (GNB) is sensitive to imputation values. In addition, feature selection was explored for further performance enhancement, and the proposed contribution index showed promising results. Finally, an industrial case was studied to assess the informative level of incomplete data in terms of the redundancy ratio and to generalize the discussion.
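
    One of the compared schemes, kNN imputation followed by an SVM fault classifier, can be sketched as below on synthetic data standing in for the Tennessee-Eastman measurements; it illustrates the combination rather than reproducing the study's experimental setup.

```python
# kNN imputation of incomplete observations followed by an SVM fault classifier,
# evaluated with cross-validation on synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))                         # stand-in process measurements
y = (X[:, 0] + X[:, 1] > 0).astype(int)                # two synthetic fault classes
X[rng.random(X.shape) < 0.15] = np.nan                 # incomplete observations

clf = make_pipeline(KNNImputer(n_neighbors=5), StandardScaler(), SVC(kernel="rbf"))
print("accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```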

    Imputation techniques for the reconstruction of missing interconnected data from higher Educational Institutions

    Data on Educational Institutions constitute the basis for several important analyses of educational systems; however, for several reasons they often contain non-negligible shares of missing values. In this work we consider the relevant case of the European Tertiary Education Register (ETER), which describes the Educational Institutions of Europe. The presence of missing values prevents the full exploitation of this database, since several types of analyses that could be performed are currently impracticable. The imputation of artificial data, reconstructed with the aim of being statistically equivalent to the (unknown) missing data, would allow these problems to be overcome. A main complication in the imputation of this type of data is given by the correlations that exist among all the variables. We propose several imputation techniques designed to deal with the different types of missing values appearing in these interconnected data, and we use these techniques to impute the database. Moreover, we evaluate the accuracy of the proposed approach by artificially introducing missing data, imputing them, and comparing the imputed and original values. Results show that the information reconstruction does not introduce statistically significant changes in the data and that the imputed values are close enough to the original values.
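
    The evaluation protocol described above can be sketched as follows on synthetic data: a share of known values is hidden, imputed, and the imputed cells are compared with the originals both by cell-wise error and by a distributional test (here a two-sample Kolmogorov-Smirnov test, which may differ from the statistical tests used in the paper).

```python
# Evaluate an imputer by masking known values, imputing them back, and
# comparing imputed vs. original cells (error plus a distributional check).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(loc=5.0, scale=2.0, size=(300, 8))
mask = rng.random(X_true.shape) < 0.15                  # artificially introduced missingness
X_miss = np.where(mask, np.nan, X_true)

X_hat = IterativeImputer(random_state=0).fit_transform(X_miss)
print("mean abs error:", np.abs(X_hat[mask] - X_true[mask]).mean())
print("KS p-value    :", ks_2samp(X_hat[mask], X_true[mask]).pvalue)
```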

    Fuzzy C-mean missing data imputation for analogy-based effort estimation

    The accuracy of effort estimation is one of the major factors in the success or failure of software projects. Analogy-Based Estimation (ABE) is a widely accepted estimation model, since it follows human nature in selecting analogies similar to the target project. The accuracy of prediction in the ABE model is strongly associated with the quality of the dataset, since it depends on previously completed projects for estimation. Missing Data (MD) is one of the major challenges in software engineering datasets. Several missing data imputation techniques have been investigated by researchers for the ABE model. Identifying the most similar donor values from the dataset of completed software projects for imputation is a challenging issue in the existing missing data techniques adopted for the ABE model. In this study, Fuzzy C-Mean Imputation (FCMI), Mean Imputation (MI), and K-Nearest Neighbor Imputation (KNNI) are investigated for imputing missing values in the Desharnais dataset under different missing data percentages (Desh-Miss1, Desh-Miss2) for the ABE model, and an FCMI-ABE technique is proposed. An evaluation comparison among MI, KNNI, and ABE-FCMI is conducted to identify the most suitable MD imputation method for the ABE model. The results suggest that the use of ABE-FCMI, rather than MI or KNNI, imputes more reliable values for the incomplete software projects in the missing datasets. It was also found that the proposed imputation method significantly improves the software development effort prediction of the ABE model.
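
    A heavily simplified sketch of fuzzy-c-means-based imputation is given below, written from the general FCM algorithm rather than from the paper: fuzzy clusters are fitted on the complete projects, and each missing value is filled with the membership-weighted average of the cluster centroids.

```python
# Generic fuzzy c-means (FCM) imputation sketch: fit FCM on complete rows,
# then fill missing entries with membership-weighted centroid values.
import numpy as np

def fcm(X, c=3, m=2.0, iters=100, seed=0):
    """Return the c fuzzy cluster centroids of complete data X."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c)); U /= U.sum(axis=1, keepdims=True)   # random memberships
    p = 2.0 / (m - 1.0)
    for _ in range(iters):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]                     # update centroids
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=1, keepdims=True))  # update memberships
    return V

def fcm_impute(X, c=3, m=2.0):
    X = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    V = fcm(X[complete], c=c, m=m)
    p = 2.0 / (m - 1.0)
    for i in np.where(~complete)[0]:
        miss = np.isnan(X[i])
        d = np.linalg.norm(V[:, ~miss] - X[i, ~miss], axis=1) + 1e-12   # distances on observed features
        u = 1.0 / (d ** p * (1.0 / d ** p).sum())                       # memberships (sum to 1)
        X[i, miss] = u @ V[:, miss]                                     # weighted centroid average
    return X

# Hypothetical usage on a project feature matrix with NaNs:
# X_filled = fcm_impute(X_with_nans, c=3)
```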