2,071 research outputs found

    Imbalanced Ensemble Classifier for learning from imbalanced business school data set

    Private business schools in India face a common problem of selecting quality students for their MBA programs in order to achieve the desired placement percentage. Such datasets are generally biased towards one class, i.e., imbalanced in nature, and learning from imbalanced data is a difficult proposition. This paper proposes an imbalanced ensemble classifier that handles the imbalanced nature of the dataset and achieves higher accuracy on the combined feature selection (selecting the important characteristics of students) and classification problem (predicting placements from those characteristics) for an Indian business school dataset. The optimal value of an important model parameter is determined. Numerical evidence on the Indian business school dataset is also provided to demonstrate the strong performance of the proposed classifier.
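    The abstract does not specify how the proposed ensemble is constructed, so the following is only a minimal sketch of one common family of imbalanced ensemble classifiers (under-sampled bagging), using synthetic data as a stand-in for the business school dataset; all parameters are illustrative.

    ```python
    # Sketch of an under-sampled bagging ensemble for imbalanced data.
    # Not the paper's classifier: it only illustrates combining resampling
    # with an ensemble. Data and hyperparameters are synthetic/arbitrary.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import balanced_accuracy_score

    # Synthetic stand-in for an imbalanced placement dataset (10% positive class).
    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    rng = np.random.default_rng(0)
    minority = np.where(y_tr == 1)[0]
    majority = np.where(y_tr == 0)[0]

    ensemble = []
    for _ in range(25):  # number of base learners is an arbitrary choice
        # Each base learner sees all minority samples plus an equal-sized
        # random draw from the majority class.
        sample = np.concatenate([minority,
                                 rng.choice(majority, size=len(minority),
                                            replace=False)])
        clf = DecisionTreeClassifier(max_depth=5, random_state=0)
        ensemble.append(clf.fit(X_tr[sample], y_tr[sample]))

    # Average the base learners' predicted probabilities and threshold at 0.5.
    proba = np.mean([clf.predict_proba(X_te)[:, 1] for clf in ensemble], axis=0)
    print("balanced accuracy:", balanced_accuracy_score(y_te, proba > 0.5))
    ```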

    A Descriptive Study of Variable Discretization and Cost-Sensitive Logistic Regression on Imbalanced Credit Data

    Training classification models on imbalanced data tends to result in bias towards the majority class. In this paper, we demonstrate how variable discretization and cost-sensitive logistic regression help mitigate this bias on an imbalanced credit scoring dataset, and further show the application of the variable discretization technique on data from other domains, demonstrating its potential as a generic technique for classifying imbalanced data beyond credit scoring. The performance measurements include ROC curves, Area under ROC Curve (AUC), Type I Error, Type II Error, accuracy, and F1 score. The results show that proper variable discretization and cost-sensitive logistic regression with the best class weights can reduce the model bias and/or variance. From the perspective of the algorithm, cost-sensitive logistic regression is beneficial for increasing the value of predictors even when they are not in their optimized forms, while maintaining monotonicity. From the perspective of predictors, variable discretization performs better than cost-sensitive logistic regression, provides more reasonable coefficient estimates for predictors that have nonlinear relationships with their empirical logit, and is robust to penalty weights on misclassifications of events and non-events determined by their a priori proportions.
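    The paper's exact binning scheme and class weights are not given in the abstract, so the following is a generic sketch of the two techniques it discusses, quantile-based variable discretization and class-weighted (cost-sensitive) logistic regression, on a synthetic imbalanced dataset; the number of bins and the "balanced" weighting are illustrative choices.

    ```python
    # Generic sketch: quantile discretization of predictors followed by
    # cost-sensitive logistic regression. Data and settings are illustrative,
    # not taken from the paper.
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, f1_score

    X, y = make_classification(n_samples=5000, n_features=5, weights=[0.95, 0.05],
                               random_state=0)
    df = pd.DataFrame(X, columns=[f"x{i}" for i in range(5)])

    # Variable discretization: replace each continuous predictor with
    # one-hot-encoded quantile bins, letting the model capture nonlinear
    # relationships with the log-odds.
    codes = pd.DataFrame({c: pd.qcut(df[c], q=5, labels=False, duplicates="drop")
                          for c in df.columns})
    binned = pd.get_dummies(codes, columns=list(df.columns))

    X_tr, X_te, y_tr, y_te = train_test_split(binned, y, stratify=y, random_state=0)

    # Cost-sensitive logistic regression: weight classes inversely to their
    # frequency so that misclassifying the rare (event) class costs more.
    model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

    print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    print("F1 :", f1_score(y_te, model.predict(X_te)))
    ```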

    Empirical study of dimensionality reduction methodologies for classification problems

    When we speak about dimensionality reduction in informatics or big data, we refer to the process of reducing the number of random variables under consideration in order to obtain a smaller set of principal variables that allows us to build a data model with the same or similar accuracy from a lower amount of data. For this purpose, feature selection and feature extraction techniques are applied: with feature selection we select a subset of the original feature set using machine learning techniques, while with feature extraction we build a new set of features derived from the original ones. In this final degree project, we carry out an empirical study of the different methodologies for classification problems using a medical dataset called NCS-1 of clinical patients with different medical pathologies, we study the different algorithms that can be applied to each case with this dataset, and finally, with the data obtained, we develop a benchmark that helps us better understand the different models studied.
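    As the NCS-1 data is not reproduced here, the following is a minimal sketch contrasting the two dimensionality-reduction families named above, feature selection (keeping a subset of the original variables) and feature extraction (deriving new variables), on a synthetic dataset; the choices of scorer, k, and number of components are arbitrary.

    ```python
    # Minimal sketch: feature selection vs. feature extraction, each feeding
    # the same classifier, compared by cross-validated accuracy.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=50, n_informative=8,
                               random_state=0)

    # Feature selection: rank original features by an ANOVA F-test, keep the top 10.
    selection = make_pipeline(SelectKBest(f_classif, k=10),
                              LogisticRegression(max_iter=1000))

    # Feature extraction: project the data onto the first 10 principal components.
    extraction = make_pipeline(PCA(n_components=10),
                               LogisticRegression(max_iter=1000))

    for name, model in [("selection", selection), ("extraction", extraction)]:
        score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
        print(f"{name}: mean CV accuracy = {score:.3f}")
    ```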