115 research outputs found

    Ensemble model-based method for time series sensors’ data validation and imputation applied to a real waste water treatment plant

    Intelligent Decision Support Systems (IDSSs) integrate different Artificial Intelligence (AI) techniques with the aim of taking or supporting human-like decisions. To this end, these techniques rely on the available data from the target process, which implies that invalid or missing data could trigger incorrect decisions and, therefore, undesirable situations in the supervised process. This is even more important in environmental systems, whose malfunction could jeopardise the related ecosystems. In data-driven applications such as IDSSs, data quality is a fundamental problem that should be addressed for the sake of overall system performance. In this paper, a data validation and imputation methodology for time series is presented. This methodology is integrated in an IDSS software tool which generates suitable set-points to control the process. The data validation and imputation approach presented here focuses on the imputation step and is based on an ensemble of different prediction models obtained for the sensors involved in the process. A Case-Based Reasoning (CBR) approach is used for data imputation, i.e., past situations similar to the current one propose new values for the missing ones. The CBR model is complemented with other prediction models such as Auto Regressive (AR) models and Artificial Neural Network (ANN) models. The different predictions obtained are then ensembled to achieve better prediction performance than that obtained by each individual prediction model separately. Furthermore, the use of a meta-prediction model, trained using the predictions of all individual models as inputs, is proposed and compared with other ensemble methods to validate its performance. Finally, this approach is illustrated in a real Waste Water Treatment Plant (WWTP) case study using one of the most relevant measures for the correct operation of the WWTP's IDSS, i.e., the ammonia sensor, and considering real faults. The results are promising, with the ensemble approach presented here outperforming the prediction obtained by each individual model separately. The authors acknowledge the partial support of this work by the Industrial Doctorate Programme (2017DI-006) and the Research Consolidated Groups/Centres Grant (2017 SGR 574) from the Catalan Agency of University and Research Grants Management (AGAUR) of the Catalan Government.
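    The ensemble step described above can be viewed as a stacking scheme: several base predictors each propose a value for a missing sensor reading, and a meta-model trained on their outputs combines them. The sketch below illustrates that idea with scikit-learn on synthetic lagged sensor data; the base learners (a k-nearest-neighbours regressor standing in for CBR, a ridge regression on lags standing in for AR, and a small MLP standing in for the ANN) are assumptions chosen for illustration, not the paper's actual models.

```python
# Minimal stacking sketch for sensor-value imputation (illustrative only).
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic "sensor" series: the current reading is predicted from lagged readings.
n, lags = 500, 4
series = np.sin(np.linspace(0, 60, n + lags)) + 0.1 * rng.normal(size=n + lags)
X = np.column_stack([series[i:i + n] for i in range(lags)])   # lagged inputs
y = series[lags:lags + n]                                     # current reading

ensemble = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=5)),          # CBR-like: similar past cases
        ("ridge", Ridge(alpha=1.0)),                          # AR-like linear model on lags
        ("ann", MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
    ],
    final_estimator=LinearRegression(),                       # meta-prediction model
)
ensemble.fit(X[:400], y[:400])

# Impute a "missing" reading from its lagged context.
imputed = ensemble.predict(X[400:401])
print(f"imputed value: {imputed[0]:.3f}, true value: {y[400]:.3f}")
```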

    Extracting Rules for Diagnosis of Diabetes Using Genetic Programming

    Background: Diabetes is a global health challenge with a high incidence and major social and economic consequences. As such, early prevention or identification of people at risk is crucial for reducing the problems it causes. The aim of this study was to extract rules for diagnosing diabetes using genetic programming. Methods: This study utilized the PIMA dataset of the University of California, Irvine. This dataset consists of the information of 768 women of Pima heritage, including 500 healthy persons and 268 persons with diabetes. Missing values and outliers in this dataset were handled with the K-nearest neighbor and k-means methods, respectively. A genetic programming (GP) model was then developed to diagnose diabetes and to determine the most important factors affecting it. The accuracy, sensitivity, and specificity of the proposed model on the PIMA dataset were 79.32%, 58.96%, and 90.74%, respectively. Results: The experimental results of our model on PIMA revealed that age, PG concentration, BMI, Tri Fold Thick, and Serum Ins were associated with an increased risk of diabetes mellitus. The experimental results also show the good performance of the model together with the simplicity and comprehensibility of the extracted rules. Conclusions: GP can effectively extract rules for diagnosing diabetes. BMI and PG concentration are the most important factors increasing the risk of diabetes. Keywords: Diabetes, PIMA, Genetic programming, KNNi, K-means, Missing value, Outlier detection, Rule extraction
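    The missing-value handling described above (K-nearest neighbor imputation) can be sketched with scikit-learn's KNNImputer. The column names below mirror the Pima-style attributes named in the abstract, but the values are synthetic, so this only illustrates the technique, not the study's actual preprocessing.

```python
# KNN-based imputation on Pima-style columns (synthetic toy data, illustrative only).
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy rows with Pima-style columns; np.nan marks missing entries.
df = pd.DataFrame({
    "PG_concentration": [148, 85, np.nan, 89, 137],
    "BMI": [33.6, 26.6, 23.3, np.nan, 43.1],
    "TriFoldThick": [35, 29, np.nan, 23, 35],
    "SerumIns": [np.nan, 94, 168, 94, 168],
    "Age": [50, 31, 32, 21, 33],
})

# Each missing value is replaced by the mean of that feature over the
# 3 nearest rows, with distances computed on the observed features.
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed.round(2))
```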

    Representation learning in finance

    Finance studies often employ heterogeneous datasets from different sources with different structures and frequencies. Some data are noisy, sparse, and unbalanced with missing values; some are unstructured, containing text or networks. Traditional techniques often struggle to combine and effectively extract information from these datasets. This work explores representation learning as a proven machine learning technique for learning informative embeddings from complex, noisy, and dynamic financial data. This dissertation proposes novel factorization algorithms and network modeling techniques to learn the local and global representation of data in two specific financial applications: analysts' earnings forecasts and asset pricing. Financial analysts' earnings forecasts are among the most critical inputs for security valuation and investment decisions. However, it is challenging to fully utilize this type of data due to missing values. This work proposes one matrix-based algorithm, "Coupled Matrix Factorization," and one tensor-based algorithm, "Nonlinear Tensor Coupling and Completion Framework," to impute missing values in analysts' earnings forecasts and then use the imputed data to predict firms' future earnings. Experimental analysis shows that missing value imputation and representation learning by coupled matrix/tensor factorization from the observed entries improve the accuracy of firm earnings prediction. The results confirm that representing financial time series in their natural third-order tensor form improves the latent representation of the data: it learns high-quality embeddings by avoiding the information loss of flattening data in spatial or temporal dimensions. Traditional asset pricing models focus on linear relationships among asset pricing factors and often ignore nonlinear interaction among firms and factors. This dissertation formulates novel methods to identify nonlinear asset pricing factors and develops asset pricing models that capture global and local properties of data. First, this work proposes an artificial neural network autoencoder-based model to capture the latent asset pricing factors from the global representation of an equity index. It also shows that the autoencoder effectively identifies communal and non-communal assets in an index to facilitate portfolio optimization. Second, the global representation is augmented by propagating information from local communities, where the network determines the strength of this information propagation. Based on the Laplacian spectrum of the equity market network, a network factor "Z-score" is proposed to facilitate pertinent information propagation and capture dynamic changes in network structures. Finally, a "Dynamic Graph Learning Framework for Asset Pricing" is proposed to combine both global and local representations of data into one end-to-end asset pricing model. Using a graph attention mechanism and an information diffusion function, the proposed model learns new connections for implicit networks and refines connections of explicit networks. Experimental analysis shows that the proposed model incorporates information from negative and positive connections, captures the network evolution of the equity market over time, and outperforms other state-of-the-art asset pricing and predictive machine learning models in stock return prediction.
In a broader context, this is a pioneering work in FinTech, particularly in understanding complex financial market structures and developing explainable artificial intelligence models for finance applications. This work effectively demonstrates the application of machine learning to model financial networks, capture nonlinear interactions in data, and provide investors with powerful data-driven techniques for informed decision-making.
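    To illustrate the factorization idea behind the imputation of analysts' forecasts, the sketch below factorizes a partially observed matrix into low-rank factors by gradient descent on the observed entries only, then fills the gaps from the reconstruction. It is a generic low-rank imputation sketch on synthetic data, not the dissertation's Coupled Matrix Factorization or tensor-completion algorithms.

```python
# Low-rank matrix factorization for missing-value imputation (generic sketch).
import numpy as np

rng = np.random.default_rng(42)

# Synthetic low-rank "analyst x firm" forecast matrix with ~30% missing entries.
true = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 20))
mask = rng.random(true.shape) > 0.3              # True where the entry is observed
observed = np.where(mask, true, np.nan)

rank, lr, epochs, reg = 4, 0.01, 2000, 0.1
U = 0.1 * rng.normal(size=(true.shape[0], rank))
V = 0.1 * rng.normal(size=(true.shape[1], rank))

for _ in range(epochs):
    err = np.where(mask, observed - U @ V.T, 0.0)   # error on observed cells only
    U += lr * (err @ V - reg * U)                   # gradient step with L2 regularization
    V += lr * (err.T @ U - reg * V)

imputed = np.where(mask, observed, U @ V.T)         # keep observed values, fill the rest
rmse = np.sqrt(np.mean((imputed[~mask] - true[~mask]) ** 2))
print(f"RMSE on imputed entries: {rmse:.3f}")
```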

    The Effect of Financial Ratios, Corporate Governance, and Macroeconomics on Financial Distress in Companies Listed on the Indonesia Stock Exchange in 2014-2018

    Stakeholders should understand the financial distress prediction of a company as an early warning system for possible bankruptcy in the future. This study examines the effect on financial distress prediction of financial ratios, proxied by the debt to assets ratio (DAR) and total assets turnover (TATO); corporate governance, proxied by institutional ownership (KEPINS), managerial ownership (KEPMAN), and board size (UKDEDIR); and macroeconomics, proxied by inflation (INF) and exchange rates (NT). Financial distress is measured with the Taffler model. The study covers the five-year period from 2014 to 2018. The population of this study is 224 infrastructure, utilities & transportation and trade, services & investment sector companies listed on the Indonesia Stock Exchange. Purposive sampling was used to select a sample of 56 companies, which were analyzed using logistic regression; the data were processed with SPSS version 18. The results show that financial ratios affect financial distress: the debt to assets ratio has a significantly positive effect and total assets turnover a significantly negative effect. The corporate governance and macroeconomic variables do not affect financial distress. This indicates that financial ratios are a barometer of financial health: companies must control debt and use their assets effectively, since failure to do so can cause financial distress. Keywords: corporate governance; financial distress; financial ratio; macroeconomics
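    The study's estimation step, a logistic regression of a binary distress indicator on the proxy variables listed above, can be sketched as follows. The column names follow the abstract's abbreviations, but the data frame is entirely synthetic and the label is a toy construction, so this only illustrates the modelling approach, not the authors' SPSS analysis.

```python
# Logistic regression of a financial-distress flag on the abstract's proxies
# (synthetic data, illustrative only).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 280  # e.g. 56 firms observed over 5 years

df = pd.DataFrame({
    "DAR": rng.uniform(0.1, 0.9, n),       # debt to assets ratio
    "TATO": rng.uniform(0.2, 2.0, n),      # total assets turnover
    "KEPINS": rng.uniform(0.0, 1.0, n),    # institutional ownership
    "KEPMAN": rng.uniform(0.0, 0.3, n),    # managerial ownership
    "UKDEDIR": rng.integers(2, 10, n),     # board size
    "INF": rng.uniform(0.02, 0.07, n),     # inflation rate
    "NT": rng.uniform(11.0, 15.0, n),      # exchange rate (thousand IDR per USD)
})
# Toy distress label loosely driven by leverage and turnover.
score = 3 * df["DAR"] - 1.5 * df["TATO"] + rng.normal(0, 0.5, n)
df["DISTRESS"] = (score > 0.5).astype(int)

X = sm.add_constant(df.drop(columns="DISTRESS"))
model = sm.Logit(df["DISTRESS"], X).fit(disp=False)
print(model.summary())
```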

    Improving prediction of opioid use disorder with machine learning algorithms

    Title from PDF of title page, viewed August 9, 2023. Thesis advisor: Farid Nait-Abdesselam. Vita. Includes bibliographical references (pages 47-53). Thesis (M.S.)--Department of Computer Science and Electrical Engineering, University of Missouri--Kansas City, 2023. Opioid Use Disorder (OUD) refers to a concerning pattern of opioid use that results in considerable impairment or distress. It is a chronic and treatable condition that can impact individuals irrespective of their race, gender, income, or social standing. Therefore, accurately identifying individuals at high risk of developing opioid use disorder is of utmost significance. Extensive research has been undertaken to forecast opioid use disorder and determine the most effective predictors that can enhance accuracy. This research proposes an improvement to these studies by utilizing two feature selection techniques, Boruta and Recursive Feature Elimination (RFE), to identify the best predictor through a majority voting system. The selected feature, combined with 10 demographic, socioeconomic, physical, and psychological predictors (gender, age, employment status (full time, part time, ..), income, level of the impact of the disorder on an adult's life in four role domains (home management, work, social life, close relationships with others), education, age when first drinking an alcoholic beverage, any mental illness, age when first using marijuana, and overall health), resulted in better prediction accuracy using various machine learning classification algorithms. To build a labeled dataset, responses from the 2018 and 2019 editions of the National Survey on Drug Use and Health (NSDUH) were collected. This dataset was used to train and test several classification models (Artificial Neural Network, Naïve Bayes, and XGBoost), with the average recall, precision, F1 score, accuracy, and AUC compared over 50 iterations. The results showed that considering "psychotherapeutic dependence or abuse" or "illicit drug other than marijuana dependence or abuse" has a significant impact on improving the accuracy metrics of the machine learning algorithms used. Both XGBoost and Naïve Bayes were able to predict patients at risk of OUD with an AUC of 0.995. This work can aid healthcare providers in determining appropriate preventive care and resources for at-risk patients. Introduction -- Background and related work -- Proposed work -- Results and evaluations -- Conclusion and future work
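    The feature-selection-then-classification workflow described above can be illustrated with scikit-learn's Recursive Feature Elimination followed by an XGBoost classifier scored by cross-validated AUC. The data below are synthetic placeholders rather than NSDUH variables, Boruta is omitted for brevity, and the xgboost package is assumed to be installed; this is a sketch of the workflow, not the thesis pipeline.

```python
# RFE feature selection followed by an XGBoost classifier (synthetic data, illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for survey responses: 2000 respondents, 25 candidate predictors,
# with the positive (at-risk) class deliberately rare.
X, y = make_classification(n_samples=2000, n_features=25, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)

# Recursive Feature Elimination keeps the 11 strongest predictors
# (mirroring the selected feature plus 10 fixed predictors in the thesis).
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=11)
X_sel = selector.fit_transform(X, y)

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
auc = cross_val_score(clf, X_sel, y, cv=5, scoring="roc_auc")
print(f"mean cross-validated AUC: {auc.mean():.3f}")
```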

    Development of a Survival Prediction Model for Korean Disease-Free Lung Cancer Survivors Using Patient-Reported Outcome Measures

    Dissertation (Ph.D.) -- Seoul National University Graduate School: Department of Biomedical Sciences, College of Medicine, February 2018. Advisor: Young Ho Yun. Introduction: The prediction of lung cancer survival is a crucial factor for successful cancer survivorship and follow-up planning. The principal objective of this study is to construct a novel Korean prognostic model of 5-year survival among lung cancer disease-free survivors using socio-clinical and HRQOL variables and to compare its predictive performance with that of a prediction model based on traditionally known clinical variables. Diverse techniques such as the Cox proportional hazards model and machine learning techniques (MLTs) were applied to the modeling process. Methods: Data of 809 survivors who underwent lung cancer surgery between 1994 and 2002 at two Korean tertiary teaching hospitals were used. The following variables were selected as independent variables for the prognostic model using literature reviews and univariate analysis: clinical and socio-demographic variables, including age, sex, stage, metastatic lymph nodes, and income; and health-related quality of life (HRQOL) factors from the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire Core 30, the Quality of Life Questionnaire Lung Cancer Module, the Hospital Anxiety and Depression Scale, and the Post-traumatic Growth Inventory. Survivors' body mass index before surgery and physical activity were also chosen. The three prediction modeling feature sets included 1) only clinical and socio-demographic variables, 2) only HRQOL and lifestyle factors, and 3) the variables from feature sets 1 and 2 considered together. For the three feature sets, Cox proportional hazards regression models were constructed and compared with one another by evaluating their discrimination and calibration using the C-statistic and the Hosmer-Lemeshow chi-square statistic. Further, four machine learning algorithms, decision tree (DT), random forest (RF), bagging, and adaptive boosting (AdaBoost), were applied to the three feature sets and their performances compared with one another. The performance of the derived MLT-based predictive models was internally validated by K-fold cross-validation. Results: In the Cox modeling, Model Cox-3 (based on Feature set 3: HRQOL factors added to the clinical and socio-demographic variables) showed the highest area under the curve (AUC = 0.809) compared with the two other Cox regression models (Cox-1, Cox-2). When the modeling methods were applied to the other MLT-based models, the most effective models were Model DT-3 from DT, Model RF-3 from RF, Model Bag-3 from bagging, and Model AdaBoost-3 from AdaBoost, showing the highest accuracy for each MLT. Models RF-3, Bag-3, and AdaBoost-3 retained the highest accuracy after K-fold cross-validation was conducted. Conclusions: When HRQOL factors were added to the clinical and socio-demographic variables, the proposed models proved useful, whether based on the Cox model or on MLT algorithms, for the prediction of lung cancer survival. Improved accuracy of the lung cancer survival prediction model has the potential to help clinicians and survivors make more meaningful decisions about future plans and supportive cancer care. Introduction -- Materials and Methods -- Results -- Discussion -- Conclusion -- References -- Korean abstract -- Appendix
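    The study's comparison of a Cox proportional hazards model with tree-ensemble classifiers can be sketched with the lifelines and scikit-learn libraries. The toy data frame below merely stands in for the socio-clinical and HRQOL feature sets, and the 5-year outcome is simulated, so this is a sketch of the validation workflow rather than the dissertation's models.

```python
# Cox proportional hazards model plus a cross-validated random forest on the
# same toy features (synthetic data, illustrative only; lifelines assumed installed).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "age": rng.integers(40, 80, n),
    "stage": rng.integers(1, 4, n),
    "bmi": rng.normal(24, 3, n),
    "hrqol_global": rng.uniform(0, 100, n),   # stand-in for an HRQOL scale score
})
# Simulated survival times (months) and 5-year event indicator.
risk = 0.03 * df["age"] + 0.8 * df["stage"] - 0.01 * df["hrqol_global"]
df["time"] = rng.exponential(scale=60 / np.exp(risk - risk.mean()))
df["event"] = (df["time"] < 60).astype(int)   # event within 60 months

# Cox model; discrimination is summarized by the concordance index (C-statistic).
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(f"Cox concordance index: {cph.concordance_index_:.3f}")

# Random forest for the binary 5-year outcome, validated by 5-fold cross-validated AUC.
X = df[["age", "stage", "bmi", "hrqol_global"]]
rf_auc = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=0),
                         X, df["event"], cv=5, scoring="roc_auc")
print(f"Random forest mean AUC: {rf_auc.mean():.3f}")
```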

    The Data Science Design Manual


    Anonymizing and Trading Person-specific Data with Trust

    In the past decade, data privacy, security, and trustworthiness have gained tremendous attention from research communities, and these are still active areas of research with the proliferation of cloud services and social media applications. The data is growing at a rapid pace. It has become an integral part of almost every industry and business, including commercial and non-profit organizations. It often contains person-specific information, and a data custodian who holds it must be responsible for managing its use, disclosure, accuracy, and privacy protection. In this thesis, we present three research problems. The first two problems address the concerns of stakeholders on privacy protection, data trustworthiness, and profit distribution in the online market for trading person-specific data. The third problem addresses health information custodians' (HICs) concern on privacy-preserving healthcare network data publishing. Our first research problem is identified in cloud-based data integration service, where data providers collaborate with their trading partners in order to deliver quality data mining services. Data-as-a-Service (DaaS) enables data integration to serve the demands of data consumers. Data providers face challenges not only to protect private data over the cloud but also to legally adhere to privacy compliance rules when trading person-specific data. We propose a model that allows the collaboration of multiple data providers for integrating their data and derives the contribution of each data provider by valuating the incorporated cost factors. This model serves as a guide for business decision-making, such as estimating the potential privacy risk and finding the sub-optimal value for publishing mashup data. Experiments on real-life data demonstrate that our approach can identify the sub-optimal value in data mashup for different privacy models, including K-anonymity, LKC-privacy, and ϵ-differential privacy, with various anonymization algorithms and privacy parameters. Second, consumers demand a good quality of data for accurate analysis and effective decision-making, while the data providers intend to maximize their profits by competing with peer providers. In addition, the data providers or custodians must conform to privacy policies to avoid potential penalties for privacy breaches. To address these challenges, we propose a two-fold solution: (1) we present the first information entropy-based trust computation algorithm, IEB_Trust, that allows a semi-trusted arbitrator to detect the covert behavior of a dishonest data provider and choose the qualified providers for a data mashup, and (2) we incorporate the Vickrey-Clarke-Groves (VCG) auction mechanism for the valuation of data providers' attributes into the data mashup process. Experiments on real-life data demonstrate the robustness of our approach in restricting dishonest providers from participation in the data mashup and improving the efficiency in comparison to provenance-based approaches. Furthermore, we derive the monetary shares for the chosen providers from their information utility and trust scores over the differentially private release of the integrated dataset under their joint privacy requirements. Finally, we address the concerns of HICs of exchanging healthcare data to provide better and more timely services while mitigating the risk of exposing patients' sensitive information to privacy threats.
We first model a complex healthcare dataset using a heterogeneous information network that consists of multi-type entities and their relationships. We then propose DiffHetNet, an edge-based differentially private algorithm, to protect the sensitive links of patients from inbound and outbound attacks in the heterogeneous health network. We evaluate the performance of our proposed method in terms of information utility and efficiency on different types of real-life datasets that can be modeled as networks. Experimental results suggest that DiffHetNet generally yields less information loss and is significantly more efficient in terms of runtime in comparison with existing network anonymization methods. Furthermore, DiffHetNet is scalable to large network datasets.
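    Several parts of the thesis rely on ϵ-differential privacy, whose core building block is noise calibrated to a query's sensitivity. The sketch below shows the generic Laplace mechanism for a count query (sensitivity 1); it is an illustration of the privacy model named in the abstract, not the thesis' DiffHetNet or data-mashup algorithms.

```python
# Generic Laplace mechanism for an eps-differentially private count query.
import numpy as np

def private_count(values, predicate, eps, rng):
    """Noisy count of records satisfying `predicate`.

    A count query has sensitivity 1 (adding or removing one record changes the
    count by at most 1), so Laplace noise with scale 1/eps gives eps-DP.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / eps)

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1000)

for eps in (0.1, 1.0, 10.0):
    noisy = private_count(ages, lambda a: a >= 65, eps, rng)
    print(f"eps={eps:>4}: noisy count of patients aged 65+ = {noisy:.1f}")
```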

    Applications

    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production, and milling for quality control during manufacturing processes; in traffic and logistics for smart cities; and for mobile communications.