2,071 research outputs found
Imbalanced Ensemble Classifier for learning from imbalanced business school data set
Private business schools in India face a common problem: selecting quality
students for their MBA programs so as to achieve the desired placement percentage.
Such data sets are generally biased towards one class, i.e., imbalanced in
nature, and learning from an imbalanced data set is difficult.
This paper proposes an imbalanced ensemble classifier that handles the
imbalanced nature of the data set and achieves higher accuracy on the combined
feature selection (selecting important characteristics of students) and
classification (predicting placements from the students'
characteristics) problem for an Indian business school data set. The optimal value of an
important model parameter is found. Numerical evidence from the
Indian business school data set is also provided to assess the outstanding performance of the
proposed classifier.
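The abstract does not describe how the proposed classifier is built, so the following is only a generic sketch of one common way to make an ensemble aware of class imbalance: each base learner is trained on a bootstrap sample that contains equally many examples of each class, and predictions are combined by majority vote. The one-feature threshold "stump", the function names, and the toy data are all assumptions made for illustration; none of them comes from the paper.

```python
import random


def balanced_sample(xs, ys, rng):
    """Draw a bootstrap sample with the same number of examples per class."""
    by_class = {}
    for x, y in zip(xs, ys):
        by_class.setdefault(y, []).append(x)
    size = min(len(values) for values in by_class.values())
    sample = []
    for label, values in by_class.items():
        sample += [(rng.choice(values), label) for _ in range(size)]
    return sample


def fit_stump(sample):
    """Pick the threshold t with the fewest errors for 'predict 1 if x > t'."""
    xs = sorted({x for x, _ in sample})
    candidates = [xs[0] - 1.0] + xs  # first candidate predicts 1 everywhere
    best_t, best_err = None, None
    for t in candidates:
        err = sum((x > t) != (y == 1) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def fit_ensemble(xs, ys, n_estimators=25, seed=0):
    """Train one stump per balanced bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(balanced_sample(xs, ys, rng)) for _ in range(n_estimators)]


def predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = sum(x > t for t in thresholds)
    return 1 if 2 * votes > len(thresholds) else 0


# Hypothetical toy data: 18 majority-class points, 2 minority-class points.
xs = [float(i) for i in range(18)] + [20.0, 21.0]
ys = [0] * 18 + [1] * 2
model = fit_ensemble(xs, ys)
```

Because every base learner sees a balanced sample, the minority class is not drowned out during training, which is the general idea behind balanced bagging approaches.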
A Descriptive Study of Variable Discretization and Cost-Sensitive Logistic Regression on Imbalanced Credit Data
Training classification models on imbalanced data tends to result in bias towards the majority class. In this paper, we demonstrate how variable discretization and cost-sensitive logistic regression help mitigate this bias on an imbalanced credit scoring dataset, and further apply the variable discretization technique to data from other domains, demonstrating its potential as a generic technique for classifying imbalanced data beyond credit scoring. The performance measurements include ROC curves, area under the ROC curve (AUC), Type I error, Type II error, accuracy, and F1 score. The results show that proper variable discretization and cost-sensitive logistic regression with the best class weights can reduce the model bias and/or variance. From the perspective of the algorithm, cost-sensitive logistic regression is beneficial for increasing the value of predictors even if they are not in their optimized forms, while maintaining monotonicity. From the perspective of predictors, variable discretization performs better than cost-sensitive logistic regression, provides more reasonable coefficient estimates for predictors that have nonlinear relationships with their empirical logit, and is robust to penalty weights on misclassifications of events and non-events determined by their a priori proportions.
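As a rough illustration of the cost-sensitive idea in this abstract, the sketch below fits a one-feature logistic regression by gradient descent on a class-weighted log loss, with weights inversely proportional to class frequency. It is not the paper's implementation: the weighting rule, the toy data, the function names, and the convention that Type I error is the false positive rate and Type II error the false negative rate (with class 1 as the event) are all assumptions. Variable discretization is not covered here.

```python
import math


def balanced_class_weights(labels):
    """Assumed rule: weight = n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, class_weight=None, lr=0.5, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) on a class-weighted log loss."""
    if class_weight is None:
        class_weight = {0: 1.0, 1: 1.0}  # ordinary, cost-insensitive fit
    w, b = 0.0, 0.0
    total = sum(class_weight[y] for y in ys)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            # Each example's gradient is scaled by the weight of its class.
            residual = class_weight[y] * (sigmoid(w * x + b) - y)
            grad_w += residual * x
            grad_b += residual
        w -= lr * grad_w / total
        b -= lr * grad_b / total
    return w, b


def error_rates(y_true, y_pred):
    """Return (Type I, Type II) = (false positive rate, false negative rate)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp / y_true.count(0), fn / y_true.count(1)


# Hypothetical toy data: 8 non-events (0) and 2 events (1), overlapping in x.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 1.0, 2.0]
ys = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

weights = balanced_class_weights(ys)
w_plain, b_plain = fit_logistic(xs, ys)
w_cost, b_cost = fit_logistic(xs, ys, class_weight=weights)

pred_plain = [1 if sigmoid(w_plain * x + b_plain) >= 0.5 else 0 for x in xs]
pred_cost = [1 if sigmoid(w_cost * x + b_cost) >= 0.5 else 0 for x in xs]
print("plain (Type I, Type II):", error_rates(ys, pred_plain))
print("cost-sensitive (Type I, Type II):", error_rates(ys, pred_cost))
```

Upweighting the rare class raises the predicted event probabilities overall, which typically trades a higher Type I error for a lower Type II error; the abstract's "best class weights" would be chosen by tuning rather than by the fixed rule assumed here.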
Empirical study of dimensionality reduction methodologies for classification problems
When we speak about dimensionality reduction in computer science or big data, we refer to the process of reducing the number of variables under consideration to obtain a smaller set of principal variables that lets us build a data model with the same or similar accuracy from a smaller amount of data.
For this purpose, we apply feature selection and feature extraction techniques. With feature selection we select a subset of the original feature set using machine learning techniques, and with feature extraction we build a new set of features from the original feature set.
In this final degree project, we carry out an empirical study of different methodologies for classification problems using a medical dataset called NCS-1 of clinical patients with different medical pathologies. We study the algorithms that can be applied to each case with this dataset, and finally use the results to build a benchmark that helps us better understand the models studied.
Grado en Ingeniería Informática
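The difference between the two families of techniques named in this abstract can be sketched with a small example. Feature selection keeps some of the original columns unchanged, while feature extraction computes new columns from them. The specific methods below (ranking by absolute Pearson correlation with the label, and the first principal component found by power iteration) are assumptions chosen for brevity, not necessarily the algorithms studied in the project, and the data is a made-up toy table rather than NCS-1.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def select_top_k(rows, labels, k):
    """Feature selection: indices of the k columns most correlated with the label."""
    d = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(d)]
    ranked = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def first_principal_component(rows, iterations=200):
    """Feature extraction: direction of maximum variance, via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [mean([r[j] for r in rows]) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Asymmetric start vector; assumes it is not orthogonal to the answer.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, means


def project(rows, component, means):
    """Replace each row by a single new feature: its score on the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]


# Hypothetical toy table: column 1 is twice column 0, column 2 is unrelated.
rows = [[1.0, 2.0, 5.0], [2.0, 4.0, 3.0], [3.0, 6.0, 3.0], [4.0, 8.0, 5.0]]
labels = [0, 0, 1, 1]

kept_columns = select_top_k(rows, labels, k=2)
component, means = first_principal_component(rows)
new_feature = project(rows, component, means)
```

Selection returns column indices, so the retained features keep their original meaning; extraction returns a single derived feature that mixes the two redundant columns into one, which is harder to interpret but removes the redundancy.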