A comparative analysis of tree-based models classifying imbalanced breath alcohol data

Alcañiz, Manuela; Ramon, Lluís; Santolino, Miguel

Authors: Manuela Alcañiz, Lluís Ramon, Miguel Santolino
Publication date: 27 February 2018
Publisher: Sociedad de Estadística e Investigación Operativa

Abstract

When applied to binary data, most classification algorithms behave well provided the dataset is balanced. However, when a single class contains the majority of cases, good predictive performance for the minority class is not easy to achieve. We examine the strengths and weaknesses of three tree-based models when dealing with imbalanced data. We also explore sampling and cost-sensitive methods as strategies for improving machine learning algorithms. An application to a large dataset of breath alcohol content tests performed in Catalonia (Spain) to detect drunk drivers is shown. The Random Forest method proved to be the model of choice if high performance is required, while down-sampling strategies resulted in a significant reduction in computing time. When predicting alcohol impairment, the area of control (built-up or not), hour of day and driver's age were the most relevant variables for classification.
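
The down-sampling strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name downsample_majority, the fixed seed and the toy data (95 negative and 5 positive tests) are assumptions made purely for illustration. The idea is to randomly discard majority-class records until both classes are the same size, which shrinks the training set and so tends to reduce computing time.

```python
import random
from collections import Counter


def downsample_majority(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Hypothetical helper for illustration; expects exactly two classes.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # Identify the smaller class and how many records it has.
    minority_label, minority_size = min(counts.items(), key=lambda kv: kv[1])
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]
    # Keep every minority record plus an equally sized random majority sample.
    kept = sorted(minority_idx + rng.sample(majority_idx, minority_size))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up, not from the paper): 5 positive and 95 negative tests.
rows = [{"id": i} for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

bal_rows, bal_labels = downsample_majority(rows, labels)
balanced_counts = Counter(bal_labels)
```

After the call, balanced_counts holds five records per class. The trade-off is that most majority-class information is thrown away, which is why down-sampling is usually weighed against cost-sensitive alternatives that keep all records but penalise minority-class errors more heavily.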

Available Versions

Diposit Digital de la Universitat de Barcelona

oai:diposit.ub.edu:2445/120281

Last updated on 18/04/2018