
    Data-based fault detection in chemical processes: Managing records with operator intervention and uncertain labels

    Developing data-driven fault detection systems for chemical plants requires managing uncertain data labels and dynamic attributes arising from operator-process interactions. Mislabeled data is a well-known problem in computer science that has received little attention from the process systems community. This work introduces and examines the effects of operator actions on records and labels, and their consequences for the development of detection models. Using a state-space model, it proposes an iterative relabeling scheme for retraining classifiers that continuously refines dynamic attributes and labels. Three case studies are presented: a reactor as a motivating example, flooding in a simulated de-Butanizer column as a complex case, and foaming in an absorber as an industrial challenge. For the first case, detection accuracy is shown to increase by 14% while operating costs are reduced by 20%. For the de-Butanizer column, the proposed strategy performs 10% better than the filtering strategy. Promising results are reported regarding efficient strategies for dealing with the presented problem.
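
    To make the idea concrete, here is a minimal sketch of one possible iterative relabeling loop, assuming a binary fault label and a logistic-regression base learner; the confidence threshold is an illustrative choice, not the paper's state-space formulation.

        # Hypothetical iterative relabeling loop (illustration only).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def iterative_relabel(X, y, confidence=0.95, max_iter=10):
            """Refine labels y by flipping points the model strongly contradicts."""
            y = y.copy()
            for _ in range(max_iter):
                clf = LogisticRegression(max_iter=1000).fit(X, y)
                p_fault = clf.predict_proba(X)[:, 1]            # P(fault | x)
                flip = ((p_fault > confidence) & (y == 0)) | \
                       ((p_fault < 1 - confidence) & (y == 1))
                if not flip.any():                              # labels have stabilised
                    break
                y[flip] = 1 - y[flip]                           # relabel contradicted points
            return y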

    Unlocking biomarker discovery: Large scale application of aptamer proteomic technology for early detection of lung cancer

    Lung cancer is the leading cause of cancer deaths because ~84% of cases are diagnosed at an advanced stage. Worldwide in 2008, ~1.5 million people were diagnosed and ~1.3 million died, a survival rate essentially unchanged since 1960. However, patients who are diagnosed at an early stage and undergo surgery experience an 86% overall 5-year survival rate. New diagnostics are therefore needed to identify lung cancer at this stage. Here we present the first large-scale clinical use of aptamers to discover blood protein biomarkers in disease with our breakthrough proteomic technology. This multi-center case-control study was conducted on archived samples from 1,326 subjects from four independent studies of non-small cell lung cancer (NSCLC) in long-term tobacco-exposed populations. We measured >800 proteins in 15 µL of serum, identified 44 candidate biomarkers, and developed a 12-protein panel that distinguished NSCLC from controls with 91% sensitivity and 84% specificity in a training set, and 89% sensitivity and 83% specificity in a blinded, independent verification set. Performance was similar for early- and late-stage NSCLC. This is a significant advance in proteomics in an area of high clinical need.
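
    For reference, the sensitivity and specificity figures above reduce to simple confusion-matrix ratios; the sketch below restates the definitions, with placeholder counts rather than the study's data.

        # Definitions behind the reported panel performance (placeholder counts).
        def sensitivity(tp, fn):
            return tp / (tp + fn)      # fraction of NSCLC cases correctly flagged

        def specificity(tn, fp):
            return tn / (tn + fp)      # fraction of controls correctly cleared

        print(sensitivity(tp=91, fn=9), specificity(tn=84, fp=16))   # 0.91, 0.84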

    Cost-sensitive Bayesian network learning using sampling

    A significant advance in recent years has been the development of cost-sensitive decision tree learners, recognising that real-world classification problems need to take account of the costs of misclassification rather than focus on accuracy alone. The literature contains well over 50 cost-sensitive decision tree induction algorithms, each with a different performance profile. Obtaining good Bayesian networks can be challenging, and several algorithms have therefore been proposed for learning their structure and parameters from data. However, most of these algorithms focus on learning Bayesian networks that aim to maximise classification accuracy. An obvious question thus arises: is it possible to develop cost-sensitive Bayesian networks, and would they outperform cost-sensitive decision trees in minimising classification cost? This paper explores this question by developing a new Bayesian network learning algorithm based on changing the data distribution to reflect the costs of misclassification. The proposed method is evaluated in experiments on over 20 data sets. The results show that this approach compares favourably with more complex cost-sensitive decision tree algorithms.
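
    One way to "change the data distribution to reflect the costs of misclassification" is cost-proportionate rejection sampling; the sketch below assumes a binary task with per-class costs c_fp and c_fn, and is an illustration rather than the paper's exact scheme. Any accuracy-driven learner, a Bayesian network included, trained on the resampled data then implicitly minimises expected cost rather than raw error.

        # Cost-proportionate rejection sampling (a sketch; cost values are assumed).
        import numpy as np

        def cost_proportionate_sample(X, y, c_fp=1.0, c_fn=5.0, seed=0):
            rng = np.random.default_rng(seed)
            cost = np.where(y == 1, c_fn, c_fp)             # misclassification cost per example
            keep = rng.random(len(y)) < cost / cost.max()   # accept with prob. proportional to cost
            return X[keep], y[keep]                         # train any learner on this sample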

    Evaluating Classifiers' Optimal Performances Over a Range of Misclassification Costs by Using Cost-Sensitive Classification

    We believe that classification accuracy alone is not enough to evaluate the performance of classification algorithms: it can be misleading because it overlooks an important element, namely the cost incurred when a classification is inaccurate. Furthermore, the Receiver Operating Characteristic (ROC) curve is one of the most popular graphs used to evaluate classifier performance, but one of its biggest shortcomings is the assumption of equal costs for all misclassified data. Our goal is therefore to reduce the total cost of decision making by selecting the classifier that has the least total misclassification cost. The exact misclassification cost, however, is usually unknown and hard to determine. To overcome this hurdle, we classify the data against a range of error costs, using the cost range and the operating classification threshold range to expose any performance differences among the classifiers.
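
    The comparison can be sketched as follows: for each candidate cost ratio, sweep the operating threshold and record the minimum total misclassification cost each classifier can achieve; the threshold grid and variable names below are illustrative assumptions.

        # Minimum total cost of a scored classifier over an operating-threshold sweep.
        import numpy as np

        def min_total_cost(y_true, scores, c_fp, c_fn):
            costs = []
            for t in np.linspace(0.0, 1.0, 101):       # operating threshold range
                pred = scores >= t
                fp = np.sum(pred & (y_true == 0))      # false positives at threshold t
                fn = np.sum(~pred & (y_true == 1))     # false negatives at threshold t
                costs.append(c_fp * fp + c_fn * fn)
            return min(costs)

        # Compare classifiers A and B across a range of cost ratios, e.g.:
        # for r in (1, 2, 5, 10): print(r, min_total_cost(y, s_a, 1, r),
        #                                  min_total_cost(y, s_b, 1, r))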

    Empirical Assessment of Machine Learning Techniques for Software Requirements Risk Prediction

    Software risk prediction is the most sensitive and crucial activity of the Software Development Life Cycle (SDLC); it can lead to the success or failure of a project, so risk should be predicted early to make a software project successful. A model is proposed for the prediction of software requirement risks using a requirement-risk dataset and machine learning techniques. A comparison is also made between multiple classifiers, namely K-Nearest Neighbour (KNN), Average One Dependency Estimator (A1DE), Naïve Bayes (NB), Composite Hypercube on Iterated Random Projection (CHIRP), Decision Table (DT), Decision Table/Naïve Bayes Hybrid Classifier (DTNB), Credal Decision Trees (CDT), Cost-Sensitive Decision Forest (CS-Forest), J48 Decision Tree (J48), and Random Forest (RF), to find the technique best suited to the model given the nature of the dataset. These techniques are evaluated using various metrics, including Correctly Classified Instances (CCI), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE), Root Relative Squared Error (RRSE), precision, recall, F-measure, Matthews Correlation Coefficient (MCC), Receiver Operating Characteristic area (ROC area), Precision-Recall Curve area (PRC area), and accuracy. The overall outcome of this study shows that, in terms of reducing error rates, CDT outperforms the other techniques, achieving 0.013 for MAE, 0.089 for RMSE, 4.498% for RAE, and 23.741% for RRSE; in terms of accuracy, DT, DTNB, and CDT achieve the better results. This work was supported by Generalitat Valenciana, Conselleria de Innovacion, Universidades, Ciencia y Sociedad Digital (project AICO/019/224). Naseem, R.; Shaukat, Z.; Irfan, M.; Shah, M. A.; Ahmad, A.; Muhammad, F.; Glowacz, A., et al. (2021). Empirical Assessment of Machine Learning Techniques for Software Requirements Risk Prediction. Electronics, 10(2), 1-19. https://doi.org/10.3390/electronics10020168
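
    For readers unfamiliar with the relative error metrics quoted above, the sketch below restates the standard (Weka-style) definitions of RAE and RRSE; the study's dataset and models are not reproduced here.

        # Prediction error normalised by the error of always predicting the mean.
        import numpy as np

        def rae(y_true, y_pred):
            """Relative Absolute Error."""
            return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_true.mean()))

        def rrse(y_true, y_pred):
            """Root Relative Squared Error."""
            num = np.sum((y_true - y_pred) ** 2)
            den = np.sum((y_true - y_true.mean()) ** 2)
            return np.sqrt(num / den)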

    A novel Big Data analytics and intelligent technique to predict driver's intent

    The modern age offers great potential for automatically predicting a driver's intent through the increasing miniaturization of computing technologies, rapid advances in communication technologies, and the continuous connectivity of heterogeneous smart objects. Inside the cabin and engine of modern cars, dedicated computer systems need the ability to exploit the wealth of information generated by heterogeneous data sources with different contextual and conceptual representations. Processing and utilizing this diverse and voluminous data involves many challenges concerning the design of the computational technique used to perform the task. In this paper, we investigate the various data sources available in the car and its surrounding environment that can be used as inputs to predict a driver's intent and behavior. As part of investigating these potential data sources, we conducted experiments on the e-calendars of a large number of employees and reviewed a number of available geo-referencing systems. Through the results of a statistical analysis and by computing location-recognition accuracy, we explored in detail the potential of calendar location data for detecting a driver's intentions. To exploit the numerous diverse data inputs available in modern vehicles, we investigate the suitability of different Computational Intelligence (CI) techniques and propose a novel fuzzy computational modelling methodology. Finally, we outline the impact of applying advanced CI and Big Data analytics techniques in modern vehicles on the driver and on society in general, and discuss the ethical and legal issues arising from the deployment of intelligent self-learning cars.
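
    As a toy illustration of the fuzzy modelling direction, a single Mamdani-style rule might look like the sketch below; the rule, membership functions, and numbers are invented for illustration, not the paper's methodology.

        # Toy fuzzy rule: IF speed is low AND the calendar venue is near,
        # THEN the driver likely intends to stop (all values assumed).
        def tri(x, a, b, c):
            """Triangular membership function rising on [a, b], falling on [b, c]."""
            if x <= a or x >= c:
                return 0.0
            return (x - a) / (b - a) if x < b else (c - x) / (c - b)

        speed_low = tri(20.0, 0.0, 10.0, 40.0)         # 20 km/h -> degree ~0.67
        venue_near = tri(0.5, 0.0, 0.2, 2.0)           # 0.5 km  -> degree ~0.83
        stop_intent = min(speed_low, venue_near)       # Mamdani AND (min)
        print(f"degree of 'intends to stop': {stop_intent:.2f}")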

    Fairness and Interpretability in Machine Learning Models

    Machine Learning has become more and more prominent in our daily lives as the Information Age and the Fourth Industrial Revolution progress. Many machine learning systems are evaluated by how accurately they predict the correct outcomes recorded in existing historical datasets. In recent years we have observed how evaluating machine learning systems in this way has allowed decision-making systems to treat certain groups unfairly. Several authors have proposed methods to overcome this, including new metrics that incorporate measures of unfair treatment of individuals based on group affiliation, probabilistic graphical models that assume dataset labels are inherently unfair and use the dataset to infer the true, fair labels, and tree-based methods that introduce new splitting criteria for fairness. We evaluated these methods on datasets used in fairness research and examined whether the results claimed by the authors are reproducible. Additionally, we implemented new interpretability methods on top of the proposed methods to explain their behaviour more explicitly. We found that some of the models do not achieve their claimed results and do not learn behaviour that achieves fairness, while other models do achieve fairer predictions through affirmative action. This thesis shows that machine learning interpretability, together with new machine learning models and approaches, is necessary to achieve fairer decision-making systems.
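
    One group-fairness metric this line of work builds on can be stated in a few lines; the sketch below computes the demographic parity difference, assuming a binary protected attribute (the encoding is an assumption).

        # Demographic parity difference: gap in positive-prediction rates between
        # a protected group (group == 1) and the remaining population (group == 0).
        import numpy as np

        def demographic_parity_diff(y_pred, group):
            y_pred, group = np.asarray(y_pred), np.asarray(group)
            rate_1 = y_pred[group == 1].mean()         # positive rate, protected group
            rate_0 = y_pred[group == 0].mean()         # positive rate, other group
            return rate_1 - rate_0                     # 0.0 indicates parity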

    Review of the machine learning methods in the classification of phishing attack

    Computer networks have developed rapidly, as can be seen in the worldwide trend of users connecting their computers to the Internet, whether for work or for access to social media accounts. This widespread use of networks, however, endangers the privacy of users, especially those who do not install security software on their computers, as it allows hackers to mount network attacks and steal confidential information such as bank or social media login credentials. Phishing is one such attack. The goal of this study is to review the types of phishing attacks and the current methods used to prevent them. Based on the literature, machine learning is widely used to prevent phishing attacks, and several algorithms can be applied within this approach. This study focuses on the algorithms that have been developed most thoroughly, and the methods for implementing them are discussed in detail.
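
    As a small illustration of how such classifiers are typically fed, the sketch below extracts a few lexical URL features; this feature set is a common choice in the literature, not one prescribed by this review.

        # A few lexical URL features often used in phishing classification.
        from urllib.parse import urlparse

        def url_features(url):
            host = urlparse(url).netloc
            return {
                "url_length": len(url),                         # very long URLs are suspicious
                "num_dots": host.count("."),                    # many nested subdomains
                "has_at": "@" in url,                           # '@' can disguise the real host
                "is_ip_host": host.replace(".", "").isdigit(),  # raw IP instead of a domain
            }

        print(url_features("http://192.168.0.1/login@bank.example"))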