
    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, this thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, model training, estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, the thesis advances preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing: new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, and a novel imbalanced resampling approach, minority pattern reconstruction (MPR), guided by information theory. This thesis also extends model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by the new framework (Octopus) earned domain experts' approval of the new models' reliability. Performing the data quality checks and applying the MMI process also led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies produced performances better aligned with experts' success criteria than traditional imbalanced data resampling techniques.
Finally, the XDistance performance ranking metric was found to be more effective at ranking several classifiers' performances while also offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework and produce new reliable classifiers; in addition, we offer a further understanding of the impact of two newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, the newly developed methods within the framework (MMI, MPR and XDistance) stand as methodological contributions in their own right. Finally, the newly accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and toxicity side effects of advanced breast cancer radiotherapy. These contributions could guide future theories, experiments and healthcare interventions in preventive medicine and data mining.
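The imbalanced-data theme above can be made concrete with a standard baseline. The sketch below implements plain random oversampling in pure Python; it is an illustrative baseline only, not the thesis's MPR or hybrid resampling strategies, and all names in it are hypothetical:

```python
import random

def random_oversample(features, labels, minority_label, seed=0):
    """Duplicate minority-class rows at random until the classes are balanced.

    A standard baseline resampler; methods such as the thesis's MPR instead
    reconstruct minority patterns rather than duplicating existing rows.
    """
    rng = random.Random(seed)
    majority = [(x, y) for x, y in zip(features, labels) if y != minority_label]
    minority = [(x, y) for x, y in zip(features, labels) if y == minority_label]
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))   # duplicate a random minority row
    resampled = majority + minority
    rng.shuffle(resampled)
    xs, ys = zip(*resampled)
    return list(xs), list(ys)

# toy data: four negative cases, one positive (adverse) case
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
Xb, yb = random_oversample(X, y, minority_label=1)
# yb now contains four 0s and four 1s
```

Duplicating rows like this balances class frequencies but adds no new information, which is the shortcoming that pattern-reconstruction approaches aim to address.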

    An empirical study on credit evaluation of SMEs based on detailed loan data

    Small and micro-sized enterprises (SMEs) are an important part of the Chinese economic system. Establishing a credit evaluation model for SMEs can effectively help financial intermediaries reveal the credit risk of enterprises and reduce the cost of acquiring enterprise information. Besides, it can also serve as a guide to investors and help companies with good credit. This thesis conducts an empirical study based on data from a Chinese bank on loans granted to SMEs. The study aims to develop a data-driven model that can accurately predict whether a given loan carries an acceptable risk from the bank's perspective. Furthermore, we test different methods to deal with the problems of class imbalance and samples that do not reflect the real problem being modelled. Lastly, the importance of the variables is analysed: Remaining Unpaid Principal, Floating Interest Rate, Time Until Maturity Date, Real Interest Rate and Amount of Loan all have significant effects on the final prediction. The main contribution of this study is to build a credit evaluation model for small and micro enterprises, which not only helps commercial banks accurately identify the credit risk of these enterprises, but also helps the enterprises themselves overcome their credit difficulties.
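One common way to address the class-imbalance problem mentioned above is to weight each class inversely to its frequency when training a classifier. The sketch below computes such weights using the widely used "balanced" heuristic, w_c = n / (k · n_c); it is a generic illustration, not the specific method tested in the thesis:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    w_c = n_samples / (n_classes * n_c), so rare classes get large weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# toy portfolio: 90 acceptable-risk loans (0) vs 10 rejected loans (1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
# weights ≈ {0: 0.556, 1: 5.0}
```

Feeding such weights into a classifier's loss makes each rejected loan count roughly nine times as much as an accepted one, countering the skew in the training data.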

    Predicting Personality Type from Writing Style

    The study of personality types gained traction in the early 20th century, when Carl Jung's theory of psychological types attempted to categorize individual differences into the first modern personality typology. Iterating on Jung's theories, the Myers-Briggs Type Indicator (MBTI) tried to categorize each individual into one of sixteen types, with the theory that an individual's personality type manifests in virtually all aspects of their life. This study explores the relationship between an individual's MBTI type and various aspects of their writing style. Using an MBTI-labeled dataset of user posts on a personality forum, three ensemble classifiers were created to predict a user's personality type from their posts, with the goal of outperforming existing research as well as the test-retest reliability of online questionnaire-based personality assessments. With the increasing amount of textual data available today, an accurate text-based personality classifier would allow user experience designers and psychologists to better tailor their services to their users.
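The abstract mentions three classifiers; a common way to aggregate several classifiers' predictions is a simple majority vote. The sketch below is a generic illustration (the MBTI strings and classifier outputs are invented), not the study's actual ensembling code:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions position by position with a
    majority vote; ties go to the earlier-listed classifier."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# hypothetical outputs from three classifiers over three users
clf_a = ["INTJ", "ENFP", "ISTP"]
clf_b = ["INTJ", "INFP", "ISTP"]
clf_c = ["INTP", "ENFP", "ESTP"]
final = majority_vote([clf_a, clf_b, clf_c])
# final == ["INTJ", "ENFP", "ISTP"]
```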

    Machine learning methods to predict the formation of binary compact objects

    In this thesis I tested different machine learning methods (random forest, XGBoost) to predict the formation of binary compact objects solely from the initial conditions of stellar binaries (masses, metallicity and orbital parameters) and the parameters of binary evolution (e.g., common envelope efficiency). I trained the machine learning methods on simulations performed with the rapid binary population synthesis (BPS) code SEVN. I found that the tested methods can predict the formation of merging (within the Hubble time) and non-merging binary compact objects. However, the performance in predicting merging compact objects is found to be 50%, which is equivalent to random chance, suggesting that the methods cannot accurately predict the formation of merging compact objects. The methods do show good performance in distinguishing between systems that remain bound and those that do not.
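The statement that 50% accuracy on a balanced binary task is "equivalent to random chance" can be checked against a trivial baseline. The sketch below compares a model's accuracy with the accuracy of always predicting the majority class; the labels and model outputs are invented for illustration:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def chance_level(y_true):
    """Accuracy of always predicting the most common class; for a balanced
    binary task this is 0.5, so a model at 50% has learned nothing."""
    majority = max(set(y_true), key=y_true.count)
    return accuracy(y_true, [majority] * len(y_true))

# balanced toy labels: merging (1) vs non-merging (0) systems
y_true = [0, 1, 0, 1, 0, 1]
y_model = [1, 0, 1, 0, 1, 1]   # a hypothetical model at or below chance
baseline = chance_level(y_true)
# baseline == 0.5, so any useful model must beat 0.5 here
```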

    Quantifying and Explaining Causal Effects of World Bank Aid Projects

    In recent years, machine learning methods such as deep learning have enabled us to predict with good precision given large training data. However, for many problems we care more about causality than prediction. For example, instead of knowing that smoking is statistically associated with lung cancer, we are more interested in knowing that smoking is the cause of lung cancer. With causality, we can understand how the world progresses and how an outcome is affected by influencing its cause. This thesis explores how to quantify the causal effect of a treatment on an observable outcome in the presence of heterogeneity. We focus on investigating the causal impacts that World Bank projects have on environmental changes. This high-dimensional World Bank data set includes covariates from various sources and of different types: time series data such as Normalized Difference Vegetation Index (NDVI) values, temperature and precipitation; spatial data such as longitude and latitude; and many other features such as distance to roads and rivers. We estimate the heterogeneous causal effect of World Bank projects on the change in NDVI values. Building on the causal tree and causal forest proposed by Athey, we describe the challenges we met and the lessons we learned when applying these two methods to an actual World Bank data set, and we report our observations of the heterogeneous causal effect of the World Bank projects on environmental change. As we do not have ground truth for the World Bank data set, we validate the results using synthetic data in simulation studies; the synthetic data are sampled from distributions fitted to the World Bank data set. Comparing results across various causal inference methods, we observed that feature scaling is very important for generating meaningful data and results. In addition, we investigate the performance of the causal forest under various parameters such as leaf size, number of confounders, and data size.
The causal forest is a black-box model, and its results cannot be easily interpreted by humans. By taking advantage of the tree structure, we select the neighbors of the project to be explained and assign them weights according to dynamic distance metrics. We can then learn a linear regression model over these neighbors and interpret the result with its help. In summary, World Bank projects have small impacts on environmental change, and the result of an individual project can be interpreted using a linear regression model learned from nearby projects.
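The neighbor-weighted linear interpretation step described above can be sketched with closed-form weighted least squares on a single feature. This is a generic illustration under assumed toy data (the feature, weights and outcomes are invented), not the thesis's actual pipeline:

```python
def weighted_linear_fit(x, y, w):
    """Closed-form weighted least squares for one feature:
    minimise sum_i w_i * (y_i - (a + b * x_i))**2, returning (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = num / den
    a = my - b * mx
    return a, b

# hypothetical neighbours: a covariate (e.g. precipitation) vs NDVI change,
# weighted by some tree-based proximity to the project being explained
x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 1.5, 2.0]   # exactly linear here: y = 0.5 * x
w = [1.0, 2.0, 2.0, 1.0]
a, b = weighted_linear_fit(x, y, w)
# recovers slope b == 0.5 and intercept a == 0.0
```

The fitted slope then serves as a locally interpretable summary of how the covariate relates to the outcome around that one project.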

    The importance of Quality Assurance as a Data Scientist: Common pitfalls, examples and solutions found while validating and developing supervised binary classification models

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. In today's information era, where data galvanizes change, companies are aiming towards competitive advantage by mining this important resource to achieve actionable insights, knowledge, and wisdom. However, to minimize bias and obtain robust long-term solutions, the methodologies devised from Data Science and Machine Learning approaches benefit from being carefully validated by a Quality Assurance Data Scientist, who understands both business rules and analytics tasks, and who also recommends Quality Assurance guidelines and validations. Through my experience as a Data Scientist at EDP Distribuição, I identify and systematically report on seven key Quality Assurance guidelines that helped achieve more reliable products, and I provide three practical examples where validation was key in discerning improvements.

    Predicting residential demand: applying random forest to predict housing demand in Cape Town

    The literature shows that Random Forest is a suitable technique for predicting a target variable for a household with completely unseen characteristics. The models produced in this paper show that the characteristics of a household can be used to predict the type of dwelling, the tenure and the number of bedrooms to varying degrees of accuracy. While none of the sets of models produced indicates a high degree of predictive accuracy relative to hurdle rates, the paper does demonstrate the value the Random Forest technique offers in moving closer to an understanding of the complex nature of housing demand. A key finding is that the Census variables available to the models are not discriminatory enough to enable the high degree of accuracy expected from a predictive model.
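A quick way to probe whether a categorical Census variable is "discriminatory enough" is to measure the Gini impurity reduction it yields, which is the same impurity criterion Random Forest commonly uses for splits. The sketch below is a toy illustration with invented census rows, not the paper's data or code:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label list; 0.0 means the group is pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(feature_values, labels):
    """Impurity reduction from splitting on one categorical feature:
    parent impurity minus the size-weighted impurity of each group."""
    n = len(labels)
    groups = {}
    for v, lbl in zip(feature_values, labels):
        groups.setdefault(v, []).append(lbl)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted

# hypothetical census rows: income band vs dwelling type
income = ["low", "low", "high", "high"]
dwelling = ["informal", "informal", "house", "house"]
gain = gini_gain(income, dwelling)
# income perfectly separates dwelling type here, so gain == 0.5
```

A variable whose gain stays near zero across targets is, in this sense, "not discriminatory enough" for a predictive model to exploit.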

    Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements

    This book is a reprint of the Special Issue entitled "Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements" published in Remote Sensing, MDPI. It provides insights into both core technical challenges and selected critical applications of satellite remote sensing image analytics.