    Application of bagging, boosting and stacking to intrusion detection

    This paper investigates whether ensemble algorithms can improve the performance of network intrusion detection systems. We apply three ensemble methods, bagging, boosting and stacking, to improve accuracy and reduce the false positive rate. Four data mining algorithms serve as base classifiers for these ensemble methods: Naïve Bayes, J48 (decision tree), JRip (rule induction) and IBk (nearest neighbour). Our experiments show that the prototype, which implements the four base classifiers and three ensemble algorithms, achieves an accuracy of more than 99% in detecting known intrusions but fails to detect novel intrusions, with accuracy rates of only around 60%. Bagging, boosting and stacking are unable to improve the accuracy significantly. Stacking is the only method that reduced the false positive rate by a substantial amount (46.84%); unfortunately, it also has the longest execution time and is therefore inefficient to deploy in the intrusion detection field.
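    As a rough illustration of one of the ensemble methods named above, the following sketch shows bagging with a nearest-neighbour base classifier, together with the false positive rate metric. It is a minimal sketch under stated assumptions, not the paper's prototype: the toy data, the labels "normal" and "attack", and all function names are hypothetical.

```python
import random
from collections import Counter

def nearest_neighbour_predict(train, x):
    """Return the label of the training point closest to x (1-NN base classifier)."""
    closest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return closest[1]

def bagging_predict(train, x, n_models=15, seed=0):
    """Majority vote of 1-NN classifiers, each built on a bootstrap resample."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # Bootstrap: draw len(train) rows with replacement.
        sample = [rng.choice(train) for _ in train]
        votes[nearest_neighbour_predict(sample, x)] += 1
    return votes.most_common(1)[0][0]

def false_positive_rate(y_true, y_pred, positive="attack"):
    """FP / (FP + TN): share of normal traffic wrongly flagged as an attack."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return fp / (fp + tn) if fp + tn else 0.0

# Made-up toy data: (feature vector, label). Not taken from the paper.
train = [((0.1, 0.2), "normal"), ((0.2, 0.1), "normal"), ((0.15, 0.25), "normal"),
         ((0.9, 0.8), "attack"), ((0.8, 0.9), "attack"), ((0.85, 0.95), "attack")]

print(bagging_predict(train, (0.05, 0.1)))
```

    Boosting and stacking differ in how the base models are trained and combined; only the bagging step is sketched here.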

    A Comparative Performance Analysis of Hybrid and Classical Machine Learning Method in Predicting Diabetes

    Diabetes mellitus, a disease characterized by high blood glucose levels, is one of medical science's most important research topics because of its severe consequences. Machine learning techniques make early detection possible: by predicting diabetes accurately, they can help prevent its complications. This study therefore aims to find a machine learning approach that predicts diabetes more accurately, and it compares the performance of various classical machine learning models with a hybrid machine learning approach. The hybrid approach includes homogeneous models (Random Forest, AdaBoost, XGBoost, Extra Trees and Gradient Boosting) and a heterogeneous model that uses a stacking ensemble. The stacking ensemble, or stacked generalization, approach uses a meta-classifier to combine the predictions of multiple learners. The homogeneous hybrid models and Stacked Generalization are compared with classical machine learning methods: Naive Bayes, Multilayer Perceptron, k-Nearest Neighbour and support vector machine. Experimental analysis using the Pima Indians and early-stage diabetes datasets demonstrates that the hybrid models diagnose diabetes more accurately than the classical models. Among the hybrid models, the heterogeneous model using Stacked Generalization outperformed the others, achieving accuracies of 83.9% and 98.5%.

    Who performs better? AVMs vs hedonic models

    Purpose: The literature contains numerous tests that compare the accuracy of automated valuation models (AVMs). These models are first trained on price data and property characteristics and are then tested by measuring their ability to predict prices. Most of these tests compare the effectiveness of traditional econometric models with that of machine learning algorithms. Although the latter seem to offer better performance, there is not yet a complete survey of the literature to confirm this hypothesis. Design/methodology/approach: All tests comparing regression analysis and machine learning AVMs on the same data set were identified. The accuracy scores obtained were then compared with each other. Findings: Machine learning models are more accurate than traditional regression analysis in their ability to predict value. Nevertheless, many authors point to their black-box nature and poor inferential abilities as limitations. Practical implications: Machine learning AVMs offer a huge advantage to all real estate operators who know them and can use them. Their use in public policy or litigation can be critical. Originality/value: According to the author, this is the first systematic review that collects all the articles on the subject and compares the results obtained.

    Parametric and non-parametric methods in mass appraisal on poorly developed real estate markets

    Purpose: The objective of the article is to identify the machine learning methods that provide the best real estate appraisals for small samples, particularly on poorly developed markets. The hypothesis tested is that machine learning methods produce more accurate appraisals than multiple regression models, taking sample size into account. Design/Methodology/Approach: Four types of regression were employed in the study: multiple regression, ridge regression, random forest regression and k-nearest neighbours regression. A sampling scheme was proposed that makes it possible to measure the impact of training set size on the accuracy of appraisals in the test set. Findings: The research led to several conclusions. First, the larger the training set, the more precise the appraisals in the test set. Reducing the training set therefore degrades the modelling results, but the deterioration is not substantial. Second, the ridge regression model proved to be the best model and the one most resistant to a small number of observations. In addition to this resistance, it has the advantage of being parametric and hence allows inference. Practical Implications: These considerations are important, for instance, in valuations conducted for fiscal purposes, where the value of every type of real property must be determined, even properties whose states occur only sporadically. Originality/Value: The study models values defined by property appraisers rather than prices, as in the majority of studies. This decision increased the diversity of property states covered, so that the modelling process includes more than just the properties that are most typically traded.
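    The idea of measuring how training set size affects appraisal accuracy can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the sampling scheme, the error metric or the features, so the single-feature ridge fit, the use of mean absolute percentage error (MAPE), the synthetic area/value data and all function names are hypothetical.

```python
import random

def fit_ridge_1d(xs, ys, lam=1.0):
    """Ridge regression with one feature and an unpenalized intercept.

    Closed form: slope = S_xy / (S_xx + lam), intercept = mean(y) - slope * mean(x).
    """
    xm = sum(xs) / len(xs)
    ym = sum(ys) / len(ys)
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    sxx = sum((x - xm) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return slope, ym - slope * xm

def mape(model, xs, ys):
    """Mean absolute percentage error of the fitted line (assumes no y equals 0)."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) / abs(y) for x, y in zip(xs, ys)) / len(xs)

def sample_size_curve(train, test, sizes, lam=1.0, seed=0):
    """Fit on random training subsets of each size; score on one fixed test set."""
    rng = random.Random(seed)
    test_x = [x for x, _ in test]
    test_y = [y for _, y in test]
    curve = {}
    for n in sizes:
        subset = rng.sample(train, n)  # n must not exceed len(train)
        model = fit_ridge_1d([x for x, _ in subset], [y for _, y in subset], lam)
        curve[n] = mape(model, test_x, test_y)
    return curve

# Synthetic, made-up data: (floor area in m2, appraised value).
data_rng = random.Random(1)
areas = [data_rng.uniform(40, 120) for _ in range(40)]
data = [(a, 1000.0 * a + data_rng.gauss(0, 5000)) for a in areas]
train, test = data[:30], data[30:]

for n, err in sample_size_curve(train, test, [5, 10, 20, 30]).items():
    print(n, round(err, 4))
```

    A single random subset per size is noisy; averaging over several draws per size would give a smoother curve.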

    A comparative study of tree-based models for churn prediction : a case study in the telecommunication sector

    Dissertation presented as a partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Marketing Research and CRM. In recent years, customer churn, the phenomenon of customers abandoning one company for another, has gained increasing importance. Churn plays an especially important role in saturated industries such as telecommunications, because existing customers are very valuable and the cost of acquiring new customers is very high. Companies want to know which of their customers are going to churn to another provider, and when, so that measures can be taken to retain those at risk. Such measures could take the form of incentives offered to likely churners; the downside is that misclassification is costly, especially when incentives are given to customers who would not have churned. The common challenges in predicting customer churn are how to pre-process the data and which algorithm to choose, especially when the dataset is heterogeneous, as is very common for telecommunication companies' datasets. This thesis aims to predict customer churn in the telecommunication sector using different decision tree algorithms and their ensemble models.

    A BIM and machine learning integration framework for automated property valuation

    Property valuation contributes significantly to market economic activity, yet it has been continually criticized for low transparency, inaccuracy and inefficiency. With Big Data applications in the real estate domain growing fast, computer-aided valuation systems such as AI-enhanced automated valuation models (AVMs) have the potential to address these issues. While a plethora of research has focused on improving the predictive performance of AVMs, little effort has been devoted to the information requirements of valuation models. Although the amount of data in BIM is rising exponentially, value-relevant design information has not been widely utilized for property valuation. This paper presents a system that leverages holistic data interpretation, improves information exchange between AEC projects and property valuation, and automates specific workflows for property valuation. A mixed research method was adopted, combining archival literature research with qualitative and quantitative data analysis. A BIM and machine learning (ML) integration framework for automated property valuation is proposed, which comprises a fundamental database interpretation, an IFC-based information extraction and an automated valuation model based on genetic algorithm optimized machine learning (GA-GBR). The main findings indicate that (1) partial information requirements can be extracted from BIM models, and (2) property valuation can be performed more accurately and efficiently. This research contributes to managing information exchange between AEC projects and property valuation and to supporting automated property valuation. It suggests that the infusion of BIM, ML and other emerging digital technologies may add value to property valuation and the construction industry.

    Property valuation with interpretable machine learning

    Property valuation is an important task for various stakeholders, including banks, local authorities, property developers, and brokers. Because of the characteristics of the real estate market, such as the infrequency of trades, limited supply, negotiated prices, and small submarkets with unique traits, there is no clear market value for properties. Traditionally, property valuations are done by expert appraisers. Property valuation can also be done accurately with machine learning methods, but the lack of interpretability of accurate machine learning methods can limit their adoption. Interpretable machine learning methods could be a solution to this issue, but there are concerns about their accuracy. This thesis aims to evaluate the feasibility of interpretable machine learning methods in property valuation by comparing a promising interpretable method with a more complex machine learning method that has previously produced good results in property valuation. Both methods are chosen on the basis of previous literature. The two chosen methods, Extreme Gradient Boosting (XGB) and Explainable Boosting Machine (EBM), are compared in terms of prediction accuracy for properties in six large municipalities of Denmark. In addition to the accuracy comparison, the interpretability of the EBM is highlighted. The XGB method is more accurate, although the differences between the two methods within individual municipalities are small. The interpretability of the EBM is good: it is possible to understand both how the model makes predictions in general and how individual predictions are made.

    Apprentissage statistique de modèles de comportement multimodal pour les agents conversationnels interactifs

    Face-to-face interaction is one of the most fundamental forms of human communication. It is a complex, multimodal and coupled dynamic system that involves not only speech but also numerous segments of the body, including gaze, the orientation of the head, the chest and the body, and facial and brachiomanual movements. Understanding and modeling this type of communication is a crucial stage in designing interactive agents capable of engaging in credible conversations with human partners. Concretely, a model of multimodal behavior for interactive social agents faces the complex task of generating gestural scores given an analysis of the scene and an incremental estimation of the joint objectives pursued during the conversation. The objective of this thesis is to develop models of multimodal behavior that allow artificial agents to engage in relevant co-verbal communication with a human partner. While the vast majority of work in the field of human-agent interaction (HAI) is scripted using rule-based models, our approach relies on training statistical models from traces collected during exemplary interactions demonstrated by human trainers. In this context, we introduce "sensorimotor" models of behavior, which simultaneously recognize joint cognitive states and generate social signals in an incremental way. In particular, the proposed models have to estimate the current interaction unit (IU) in which the interlocutors are jointly engaged and to predict the co-verbal behavior of the human trainer given the behavior of the interlocutor(s). The proposed models are all graphical models, namely Hidden Markov Models (HMM) and Dynamic Bayesian Networks (DBN). The models were trained and evaluated, and in particular compared with classic classifiers, using datasets collected during two different interactions. Both interactions were carefully designed so as to collect, in a minimum amount of time, a sufficient number of exemplars of mutual attention and multimodal deixis of objects and places. Our contributions are completed by original methods for the interpretation and comparative evaluation of the properties of the proposed models. By comparing the output of the models with the original scores, we show that the HMM, thanks to its sequential modeling properties, outperforms the simple classifiers. The hidden semi-Markov models (HSMM) further improve the estimation of sensorimotor states thanks to duration modeling. Finally, thanks to a rich structure of dependencies between variables learnt from the data, the DBN achieves the best performance and the multimodal coordination most faithful to the original multimodal events.
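    The incremental estimation of a hidden interaction state from observed behavior can be illustrated with HMM forward filtering. This is a minimal sketch under stated assumptions, not the thesis's model: the two states ("listen", "speak"), the gaze observations and every probability below are made up for illustration.

```python
def forward_filter(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state after each observation, using only past data.

    At each step the belief over states is propagated through the transition
    model, reweighted by the emission probability, then normalized.
    Assumes every observation has non-zero probability under some state.
    """
    if not observations:
        return []
    belief = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    total = sum(belief.values())
    belief = {s: p / total for s, p in belief.items()}
    estimates = [max(states, key=lambda s: belief[s])]
    for obs in observations[1:]:
        belief = {
            s: emit_p[s][obs] * sum(belief[r] * trans_p[r][s] for r in states)
            for s in states
        }
        total = sum(belief.values())
        belief = {s: p / total for s, p in belief.items()}
        estimates.append(max(states, key=lambda s: belief[s]))
    return estimates

# Hypothetical two-state model with a single gaze observation stream.
states = ["listen", "speak"]
start_p = {"listen": 0.5, "speak": 0.5}
trans_p = {"listen": {"listen": 0.8, "speak": 0.2},
           "speak": {"listen": 0.2, "speak": 0.8}}
emit_p = {"listen": {"at_partner": 0.9, "away": 0.1},
          "speak": {"at_partner": 0.3, "away": 0.7}}

gaze = ["at_partner", "at_partner", "away", "away"]
print(forward_filter(gaze, states, start_p, trans_p, emit_p))
# -> ['listen', 'listen', 'speak', 'speak']
```

    Filtering uses only the observations seen so far, which matches the incremental setting; offline decoding over a complete sequence would typically use the Viterbi algorithm instead.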