    The Business Insight Index – Evaluating Customer Insights through Hybrid Models

    Customer segmentation and target analysis are two essential tasks when identifying a company’s customers. To perform these tasks, this thesis develops and applies hybrid data-mining models that integrate clustering and decision trees. The hybrid models are applied to data from the life-logging camera company Narrative in order to gain insights into its customers. From previous research, we found that these hybrid models lacked a means of evaluating the amount of insight presented to decision makers. For this reason, we created, tested, and validated a new evaluation measure, the Description Tree Index. Through experiments on five separate datasets, we conclude that the measure enables decision makers to evaluate the insights gained through the hybrid model. In each case, the index generates the best results for the expected number of segments. We then integrated the Description Tree Index with existing evaluation models to form a Business Insight Index. This index evaluates customer segmentation and target analysis from both a business and a data-mining perspective. By applying the index to the Narrative data, we found that four customer segments present the most insights.
Understanding their customers has become an increasingly important task for today’s companies. To give companies a better picture of their customers, this thesis develops a new measure that assesses customer information. An increasingly common problem for businesses is that sales and marketing costs are rising, because today’s customers show increasingly diverse purchasing behaviour. Prioritising the most value-creating customers now requires segmentation. Segmenting customers means dividing a market into smaller parts according to customer characteristics such as age, gender, or income level. Segmentation has, however, become an increasingly complex task, since the information available about customers’ characteristics and behaviour has grown explosively in recent years. On the social medium Facebook alone, more than 500 terabytes of data are uploaded every day. With the help of data mining, segmentation and prioritisation can be carried out even on large amounts of data. Data mining consists of tools and techniques for finding patterns, relationships, and trends in data. Decision makers can then use these insights to create competitive advantages.
The Business Insight Index. To evaluate the customer insights, the thesis creates a new measure, the Business Insight Index (BII). The measure can be used to determine whether one customer segmentation is of higher quality than another. By evaluating the amount of information made available to decision makers, the measure can improve the customer segmentation process. BII shows good results when tested on five data files for which the segmentation is known in advance. For each data file, the measure generates the best result for the expected segments.
Traditional evaluation methods do not measure up. Segmentation in data mining is usually evaluated by measuring how similar the members of each segment are to one another relative to how different the segments are from each other. These measures do not, however, take into account the amount of information conveyed to salespeople and marketers. Better customer information can in the long run lead to competitive advantages and increased sales. It is therefore important to evaluate not only the quality of the segments but also the degree to which they can be understood and communicated.
Narrative. BII is used to help the company Narrative segment its customers. Narrative is a Linköping-based company that markets lifelogging cameras. The cameras take a picture every 30 seconds, and the pictures can be uploaded to the company’s servers. Customers can then access the photos through the company’s mobile app. By dividing the company’s customers into segments and then evaluating these, Narrative learns which value drivers customers see in the product. Is high image quality, for example, more important than options for adapting to social media? Or is the image frequency the most important factor? Once the company has identified the camera’s value drivers, the product can be developed for and marketed to the different segments.
Integrating segmentation and decision trees. Analysing the segments in a decision tree reveals which characteristics distinguish the customers. The decision tree predicts which values are required for a customer to be placed in a specific segment. This tool is advantageous because it enables a visualisation of the customers that decision makers can easily understand and that can easily be explained to them.
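
The hybrid idea described above, clustering customers and then describing each cluster with a decision tree, can be illustrated with a minimal Python sketch. It is not the thesis’s implementation: the feature names (`photos_per_day`, `app_sessions_per_week`), the toy values, the farthest-first initialisation, and the one-level tree (`best_stump`) are all assumptions chosen to keep the example short.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def best_stump(points, labels, target):
    """Find the single threshold rule that best isolates cluster `target`.

    Returns (accuracy, feature index, threshold, side), side being "<=" or ">".
    """
    best = None
    n = len(points)
    for f in range(len(points[0])):
        values = sorted({p[f] for p in points})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between neighbouring values
            for side in ("<=", ">"):
                hits = 0
                for p, lab in zip(points, labels):
                    predicted_in = p[f] <= t if side == "<=" else p[f] > t
                    hits += predicted_in == (lab == target)
                candidate = (hits / n, f, t, side)
                if best is None or candidate[0] > best[0]:
                    best = candidate
    return best

# Hypothetical toy customers: (photos_per_day, app_sessions_per_week).
feature_names = ("photos_per_day", "app_sessions_per_week")
customers = [(2, 1), (3, 2), (4, 1), (20, 9), (22, 8), (25, 10)]

labels = kmeans(customers, k=2)
rule = best_stump(customers, labels, target=0)
accuracy, feature, threshold, side = rule
print(f"segment 0: {feature_names[feature]} {side} {threshold} (accuracy {accuracy:.2f})")
```

On this toy data the stump describes the first segment with one readable threshold on `photos_per_day`. This is the kind of description the decision-tree step is meant to give decision makers.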

    Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection

    Software quality is about ensuring that developed applications are failure free. Some modern systems are intricate, due to the complexity of their information processes. Software fault prediction is an important quality assurance activity: by predicting the defect proneness of modules and classifying them accordingly, it saves resources, time and developers’ effort. This study proposes a model that selects relevant features for use in defect prediction. A review of the literature revealed that process metrics, which are based on the history of the source code over time, are better predictors of defects in versioned systems. These metrics are extracted from the source-code module and include, for example, the number of additions to and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted on open source software (OSS) from software product lines (SPLs), hence process metrics were chosen. Data sets used in defect prediction may contain non-significant and redundant attributes that can reduce the accuracy of machine-learning algorithms. To improve the prediction accuracy of classification models, only features that are significant for defect prediction are used. In machine learning, feature selection techniques are applied to identify the relevant data. Feature selection is a pre-processing step that helps to reduce the dimensionality of the data. Feature selection techniques include information-theoretic methods that are based on the concept of entropy. This study evaluated the efficiency of the feature selection techniques experimentally and found that software defect prediction using significant attributes improves the prediction accuracy.
A novel MICFastCR model was developed, which uses the Maximal Information Coefficient (MIC) to select significant attributes and the Fast Correlation-Based Filter (FCBF) to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. The MICFastCR model achieved the highest prediction accuracy as reported by various performance measures.
School of Computing. Ph.D. (Computer Science)
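
The two-stage idea of keeping relevant attributes and then dropping redundant ones can be sketched in Python with entropy-based measures. This is not the MICFastCR model. As an assumption made for brevity, the sketch uses symmetrical uncertainty, the measure FCBF is built on, for both relevance and redundancy, where the abstract uses MIC for relevance. The binarised process metrics and their values are hypothetical toy data.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2 * mutual_info / (hx + hy)

def fcbf(features, target, delta=0.01):
    """FCBF-style selection: rank by relevance, then remove redundant features."""
    relevance = {name: symmetrical_uncertainty(col, target)
                 for name, col in features.items()}
    # Keep features whose relevance exceeds delta, most relevant first.
    ranked = sorted((n for n in relevance if relevance[n] > delta),
                    key=lambda n: relevance[n], reverse=True)
    selected = []
    while ranked:
        best = ranked.pop(0)
        selected.append(best)
        # A remaining feature is redundant if it is at least as correlated
        # with the selected feature as it is with the class.
        ranked = [n for n in ranked
                  if symmetrical_uncertainty(features[best], features[n]) < relevance[n]]
    return selected

# Hypothetical binarised process metrics for eight modules (1 = high).
defective = [1, 1, 1, 1, 0, 0, 0, 0]
metrics = {
    "many_modified_lines": [1, 1, 1, 0, 0, 0, 0, 0],
    "many_additions":      [1, 1, 0, 0, 0, 0, 0, 0],  # mostly repeats the metric above
    "many_committers":     [0, 0, 0, 1, 0, 0, 0, 0],
    "noise":               [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the class
}

selected = fcbf(metrics, defective)
print(selected)
```

In this toy run, `noise` fails the relevance threshold and `many_additions` is removed as redundant with `many_modified_lines`. The reduced feature set would then be passed to a classifier.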

    A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition

    Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but comparatively few drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, because the overall accuracy is usually driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve the predictive accuracy for the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (oversampling, under-sampling and the synthetic minority over-sampling technique, SMOTE) along with four popular classification methods (logistic regression, decision trees, neural networks and support vector machines). We used a large, feature-rich institutional student dataset (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. By applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.
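
Two points from the abstract can be made concrete with a short Python sketch: why overall accuracy misleads on imbalanced data, and how SMOTE-style oversampling creates synthetic minority records. Everything here is an illustrative assumption rather than the study's setup: the 80/20 class split, the two student features, and the simplified interpolation, which draws one random gap per synthetic sample.

```python
import random

# Hypothetical labels: 0 = retained (majority), 1 = dropped out (minority).
labels = [0] * 80 + [1] * 20
always_majority = [0] * len(labels)  # a "classifier" that never predicts dropout

accuracy = sum(p == y for p, y in zip(always_majority, labels)) / len(labels)
minority_recall = (sum(p == 1 and y == 1 for p, y in zip(always_majority, labels))
                   / labels.count(1))
print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, k=2, seed=0):
    """Simplified SMOTE-style oversampling.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority records: (first-term GPA, credits enrolled).
dropouts = [(2.0, 12.0), (2.5, 9.0), (1.8, 15.0), (2.2, 10.0)]
synthetic = smote(dropouts, n_new=5)
print(len(synthetic), "synthetic minority records")
```

The baseline scores high on accuracy while finding no dropouts at all, which is why the study reports minority-class performance alongside overall accuracy.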

    Learning from accidents: machine learning for safety at railway stations

    In railway systems, station safety is a critical aspect of the overall structure, and yet accidents at stations still occur. It is time to learn from these errors and improve conventional methods by utilizing the latest technology, such as machine learning (ML), to analyse accidents and enhance safety systems. ML has been employed in many fields, including engineering systems, and it interacts with us throughout our daily lives. Thus, we must consider the available technology in general, and ML in particular, in the context of safety in the railway industry. This paper explores the use of the decision tree (DT) method in safety classification and in the analysis of accidents at railway stations to predict the traits of passengers affected by accidents. The critical contribution of this study is the presentation of ML and an explanation of how this technique is applied for ensuring safety, utilizing automated processes, and gaining benefits from this powerful technology. To apply and explore this method, a case study has been selected that focuses on the fatalities caused by accidents at railway stations. An analysis of some of these fatal accidents, as reported by the Rail Safety and Standards Board (RSSB), is performed and presented in this paper to provide a broader summary of the application of supervised ML for improving safety at railway stations. Finally, this research shows the vast potential of the innovative application of ML in safety analysis for the railway industry.
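
How a decision tree chooses its first question can be sketched in a few lines of Python. The sketch uses Gini impurity, one common splitting criterion; the abstract does not say which criterion the paper uses. The records, field names and values below are invented purely for illustration and are not RSSB data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, feature, target):
    """Weighted Gini impurity after splitting rows on a categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def best_split(rows, features, target):
    """Feature whose split leaves the purest groups; a tree would test it first."""
    return min(features, key=lambda f: split_impurity(rows, f, target))

# Invented toy accident records; the target is a passenger trait.
records = [
    {"location": "platform", "time": "night", "age_group": "adult"},
    {"location": "platform", "time": "day",   "age_group": "adult"},
    {"location": "platform", "time": "night", "age_group": "adult"},
    {"location": "stairs",   "time": "day",   "age_group": "senior"},
    {"location": "stairs",   "time": "night", "age_group": "senior"},
    {"location": "stairs",   "time": "day",   "age_group": "senior"},
]

root_impurity = gini([r["age_group"] for r in records])
first_question = best_split(records, ["location", "time"], "age_group")
print(f"root impurity {root_impurity:.2f}; split first on {first_question!r}")
```

A full tree repeats this choice recursively within each resulting group until the groups are pure or a stopping rule applies.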
