
    Employee Churn Prediction with Uplift Modeling Using the Logistic Regression Algorithm

    In a company, employees are a valuable asset who support the company's success, and losing them is costly. This condition is known as Employee Churn. One solution for addressing Employee Churn is to apply Uplift Modeling. In this study, the authors analyze the application of Logistic Regression to Uplift Modeling for the Employee Churn problem. The data studied are employee records from IBM HR Analytics. The predictions in this study reach an accuracy of 64.40%, while the prescriptive results are reasonably good when additional working time (overtime) is assigned to employees. Based on these results, employees actually tend to stay with the company when given additional working time.
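    As a point of reference for the approach described above, here is a minimal two-model (T-learner) uplift sketch with logistic regression. The column names ("Attrition", "OverTime") follow the public IBM HR Analytics dataset, the file path is a placeholder, and the two-model formulation is one common way to set up uplift modeling, not necessarily the paper's exact pipeline.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Placeholder path: the IBM HR Analytics attrition dataset is publicly
# available, but the exact file used in the paper is not specified.
df = pd.read_csv("ibm_hr_analytics.csv")
y = (df["Attrition"] == "Yes").astype(int)   # churn indicator
t = (df["OverTime"] == "Yes").astype(int)    # "treatment": additional working time
X = df.drop(columns=["Attrition", "OverTime"]).select_dtypes("number")

# Two-model (T-learner) uplift: fit one churn model per treatment arm.
m_treat = LogisticRegression(max_iter=1000).fit(X[t == 1], y[t == 1])
m_ctrl = LogisticRegression(max_iter=1000).fit(X[t == 0], y[t == 0])

# Uplift = change in churn probability attributed to the treatment;
# negative values mean overtime is associated with *lower* churn.
uplift = m_treat.predict_proba(X)[:, 1] - m_ctrl.predict_proba(X)[:, 1]
print(uplift[:5])
```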

    Essays on using machine learning for causal inference

    To use data effectively, modern econometricians need to expand and rethink their toolbox. One field where such a transformation has already started is causal inference. This thesis explores open issues, provides solutions, and develops new methods for using machine learning to estimate causal parameters. I categorize novel methods for estimating heterogeneous treatment effects and provide a practitioner's guide to their implementation. The parameter of interest is the conditional average treatment effect (CATE). It can be shown that an ensemble of methods is preferable to relying on any single one. A special focus, with respect to the CATE, is placed on the comparison of such methods and on the role of sample splitting and cross-fitting in restoring efficiency. Large differences in the accuracy of the estimated parameters can occur if sampling uncertainty is not correctly accounted for. One useful feature of the CATE is a coarser representation through quantiles: estimating groups of the CATE leads to estimates that are more robust to the sampling uncertainty and the resulting high variance. This thesis develops and explores methods not only for estimating treatment effect heterogeneity but also for identifying confounding variables and for selecting observations that should receive treatment. For these two tasks, it proposes the outcome-adaptive random forest for automatic variable selection, as well as supervised randomization for a cost-efficient selection of the target group. Insights into which variables are important and which are not true confounders are especially helpful for policy evaluation and in the medical sector when randomized controlled trials are not possible.
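    To make the sample-splitting point concrete, the sketch below cross-fits a simple T-learner on synthetic data and then summarises the estimated CATE by quintile groups. The estimator choice (random forests) and the data-generating process are illustrative assumptions, not the thesis's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic data with a known heterogeneous effect tau(x) = x_0.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
t = rng.integers(0, 2, size=n)              # randomized binary treatment
y = X @ np.ones(5) + X[:, 0] * t + rng.normal(size=n)

# Cross-fitting: nuisance models are trained on folds that exclude the
# observations they later predict, which restores efficiency.
cate = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m1 = RandomForestRegressor(random_state=0).fit(
        X[train][t[train] == 1], y[train][t[train] == 1])
    m0 = RandomForestRegressor(random_state=0).fit(
        X[train][t[train] == 0], y[train][t[train] == 0])
    cate[test] = m1.predict(X[test]) - m0.predict(X[test])

# Coarser, more robust summary: average the CATE within quintile groups.
groups = np.digitize(cate, np.quantile(cate, [0.2, 0.4, 0.6, 0.8]))
for g in range(5):
    print(f"quintile {g + 1}: mean CATE = {cate[groups == g].mean():.2f}")
```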

    Exploring uplift modeling with high class imbalance

    Uplift modeling refers to individual-level causal inference. Existing research on the topic ignores one prevalent and important aspect: high class imbalance. For instance, in online environments uplift modeling is used to optimally target ads and discounts, but very few users ever end up clicking an ad or buying. One common approach to dealing with imbalance in classification is undersampling the dataset. In this work, we show how undersampling can be extended to uplift modeling. We propose four undersampling methods for uplift modeling, compare them empirically, and show when some of them have a tendency to break down. One key observation is that accounting for the imbalance is particularly important for uplift random forests, which explains the poor performance of that model in earlier works. Undersampling is also crucial for models based on the class-variable transformation. Peer reviewed.
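    The following hedged sketch shows one natural way undersampling can be extended to uplift modeling: undersample non-responders within each arm, fit a per-arm logistic regression, and undo the sampling bias with the standard posterior correction p = βp_s / (βp_s + 1 − p_s). The paper itself proposes and compares four specific schemes, which this generic two-model variant does not reproduce.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_undersampled(X, y, beta, rng):
    # Keep all responders and a random fraction `beta` of non-responders.
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    keep = np.concatenate([pos, rng.choice(neg, int(beta * len(neg)), replace=False)])
    return LogisticRegression(max_iter=1000).fit(X[keep], y[keep])

def corrected_proba(model, X, beta):
    # Undo the undersampling bias: p = beta * p_s / (beta * p_s + 1 - p_s).
    p_s = model.predict_proba(X)[:, 1]
    return beta * p_s / (beta * p_s + 1.0 - p_s)

# Synthetic, highly imbalanced campaign data (~2-3% response rate).
rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 4))
t = rng.integers(0, 2, size=n)
p = np.clip(0.02 + 0.01 * (X[:, 0] > 0) + 0.01 * t * (X[:, 1] > 0), 0, 1)
y = (rng.random(n) < p).astype(int)

m1 = fit_undersampled(X[t == 1], y[t == 1], beta=0.1, rng=rng)
m0 = fit_undersampled(X[t == 0], y[t == 0], beta=0.1, rng=rng)
uplift = corrected_proba(m1, X, 0.1) - corrected_proba(m0, X, 0.1)
```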

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, model training, estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, the thesis advances preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing-data mitigation, and a novel imbalanced resampling approach, minority pattern reconstruction (MPR), led by information theory. The thesis also extends model performance evaluation with a novel classification performance ranking metric called XDistance. The experimental results show that building predictive models with the methods guided by the new framework (Octopus) yields domain experts' approval of the new models' reliable performance. Performing the data quality checks and applying the MMI process also led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to performances better aligned with experts' success criteria than traditional imbalanced-data resampling techniques. Finally, the XDistance performance ranking metric was found to be more effective at ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework and produce new, reliable classifiers; in addition, we offer a further understanding of the impact of two newly engineered features, the physical activity index (PAI) and the biological effective dose (BED). Second, new methods were developed within the framework for data preparation, preprocessing, and performance evaluation. Third, the newly developed and accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and toxicity side effects of advanced breast cancer radiotherapy. These contributions could guide future theories, experiments, and healthcare interventions in preventive medicine and data mining.
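    MMI, MPR, and XDistance are the thesis's own novel methods and are not reproduced here. As a generic point of reference, the sketch below shows the kind of traditional hybrid resampling (majority undersampling plus minority oversampling) that such approaches are typically benchmarked against.

```python
import numpy as np

def hybrid_resample(X, y, seed=0):
    # Meet in the middle: undersample the majority class and oversample the
    # minority class so both end up at the average of the two class counts.
    rng = np.random.default_rng(seed)
    target = int(np.bincount(y).mean())
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), target,
                   replace=np.sum(y == c) < target)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

# Usage on a toy imbalanced dataset (~5% minority class).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)
X_res, y_res = hybrid_resample(X, y)
print(np.bincount(y_res))   # both classes near the original average count
```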

    Risk Analytics in Econometrics

    This thesis addresses the framework of risk analytics as a compendium of four main pillars: (i) big data, (ii) intensive programming, (iii) advanced analytics and machine learning, and (iv) risk analysis. Under the latter mainstay, this PhD dissertation reviews potential hazards known as “extreme events” that could negatively impact the wellbeing of people, the profitability of firms, or the economic stability of a country, but which have been underestimated or incorrectly treated by traditional modelling techniques. The objective of this thesis is to develop econometric and machine learning algorithms that improve the predictive capacity for those extreme events and improve comprehension of the phenomena, in contrast to some modern advanced methods that are black boxes in terms of interpretation. The thesis presents seven chapters that contribute methodologically to the existing literature by building techniques that transform the valuable new insights of big data into more accurate predictions that support decisions under risk and increase robustness for more reliable results. The thesis focuses specifically on extreme events that are captured by a binary variable, a setting commonly known as class-imbalanced data or rare events in binary response; in other words, classes that are not equally distributed. The research tackles real case studies in the field of risk and insurance, where it is highly important to specify the level of claims of an event in order to foresee its impact and to provide personalized treatment. After the introduction in Chapter 1, Chapter 2 proposes a weighting mechanism to be incorporated into the weighted likelihood estimation of a generalized linear model, improving predictive performance in the highest and lowest deciles of the prediction. Chapter 3 proposes two different weighting procedures for a logistic regression model with complex survey data or data from a specific sampling design; the objective is to control for the randomness of the data and give the estimated model more sensitivity. Chapter 4 provides a rigorous review of trials with modern and classical predictive methods, uncovering and discussing the efficiency of certain methods over others, and showing which gaps in the machine learning literature can be addressed efficiently, and how. Chapter 5 proposes a novel boosting-based method that surpasses certain existing methods in predictive accuracy while recovering some interpretability of the model under imbalanced data. Chapter 6 develops another boosting-based algorithm that improves the predictive capacity for rare events and can be approximated by a generalized linear model in terms of interpretation. Finally, Chapter 7 contains the conclusions and final remarks. The thesis highlights the importance of developing alternative modelling algorithms that reduce uncertainty, especially when potential limitations make it impossible to know all the prior factors that influence the presence of a rare event or an imbalanced-data phenomenon. It merges two important approaches in the predictive modelling literature, econometrics and machine learning, and thereby contributes to enhancing the methodology of empirical analysis as practised so far in many experimental and non-experimental sciences.
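    As a generic illustration of the weighted-likelihood mechanism that Chapters 2 and 3 build on, the sketch below fits a logistic regression with inverse-class-frequency observation weights on a synthetic rare-event dataset. The thesis's own weighting schemes (decile-focused and survey-design-based) are more elaborate than this baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic rare-event data: roughly 2% positives.
rng = np.random.default_rng(1)
n = 50000
X = rng.normal(size=(n, 3))
p = 1.0 / (1.0 + np.exp(-(X[:, 0] - 4.0)))
y = (rng.random(n) < p).astype(int)

# Inverse-class-frequency weights: each class contributes equally to the
# weighted log-likelihood, so the rare class is not drowned out.
w = np.where(y == 1, 0.5 / y.mean(), 0.5 / (1.0 - y.mean()))
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
print(model.predict_proba(X[:5])[:, 1])
```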