21,472 research outputs found

A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition

Author: Delen Dursun
Kasap Nihat
Meesad Phayung
Thammasiri Dech
Publication venue: 'Elsevier BV'
Publication date: 01/08/2013

Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but fewer drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, because the overall accuracy is usually driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve the predictive accuracy for the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (oversampling, undersampling, and the synthetic minority oversampling technique, SMOTE) along with four popular classification methods (logistic regression, decision trees, neural networks, and support vector machines). We used a large, feature-rich institutional student dataset (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.
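The point about deceptively high accuracy, and the idea behind SMOTE, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the label counts, the toy feature points, and the helper names `accuracy`, `recall`, and `smote_like` are hypothetical and do not come from the study. The interpolation step is a simplified SMOTE-style procedure, not the authors' implementation.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, label):
    """Fraction of true `label` cases that were predicted as `label`."""
    hits = [p == label for t, p in zip(y_true, y_pred) if t == label]
    return sum(hits) / len(hits)

def smote_like(minority, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling (illustrative only).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Hypothetical labels: 90 retained students (0) and 10 dropouts (1).
y_true = [0] * 90 + [1] * 10
# A trivial model that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))   # high overall accuracy
print(recall(y_true, y_pred, 1))  # but no dropout is ever detected

# Hypothetical minority-class feature vectors, oversampled with 6 new points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6, k=2)
print(len(new_points))
```

The always-majority model scores well on overall accuracy while missing every minority case, which is why per-class metrics matter when comparing balancing techniques.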

Crossref

Sabanci University Research Database

Scaling properties of protein family phylogenies

Author: A Wagner
Alejandro Herrada
AM Simons
AO Mooers
AØ Mooers
B Burlando
BC Daniels
C Guyer
C Roth
Carlos M Duarte
D Garlaschelli
D Lee
DH Erwin
DJ Aldous
DJ Ford
E Hernández-García
EA Herrada
EF Harding
Emilio Hernández-García
EV Koonin
G Apic
GU Yule
HM Savage
I Pinelis
J Camacho
J Masel
JA Cotton
JC Willis
JFY Brookfield
JR Banavar
K Klemm
KMA Chan
KP Dial
LL Cavalli-Sforza
M Kirkpatrick
M Sackin
M Sales-Pardo
M Stich
MA Huynen
MGB Blum
MO Dayhoff
N Saitou
NM Luscombe
O Gascuel
PM Harrison
PRA Campos
R Dawkins
R Desper
R Unger
RE Lenski
S Guindon
S Keller-Schmidt
SB Carroll
SB Heard
SC Morris
T Grantham
T Hughes
TJ Davies
V Kunin
Víctor M Eguíluz
WJ Bruno
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

One of the classical questions in evolutionary biology is how evolutionary processes are coupled at the gene and species level. With this motivation, we compare the topological properties (mainly the depth scaling, as a characterization of balance) of a large set of protein phylogenies with a set of species phylogenies. The comparative analysis shows that both sets of phylogenies share remarkably similar scaling behavior, suggesting the universality of branching rules and of the evolutionary processes that drive biological diversification from gene to species level. In order to explain such generality, we propose a simple model which allows us to estimate the proportion of evolvability/robustness needed to approximate the scaling behavior observed in the phylogenies, highlighting the relevance of the robustness of a biological system (species or protein) in the scaling properties of the phylogenetic trees. Thus, the rules that govern the incapability of a biological system to diversify are equally relevant both at the gene and at the species level.

Comment: Replaced with final published version
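How depth relates to tree balance can be sketched with two extreme tree shapes. This is only an illustration under stated assumptions: the abstract does not give the exact depth definition used in the paper, so mean leaf depth is used here as a stand-in, and the helpers `balanced`, `caterpillar`, and `mean_depth` are hypothetical names.

```python
def leaf_depths(tree, depth=0):
    """Return the depth of every leaf; a tree is a leaf label or a tuple of subtrees."""
    if not isinstance(tree, tuple):
        return [depth]
    depths = []
    for child in tree:
        depths.extend(leaf_depths(child, depth + 1))
    return depths

def balanced(n):
    """Fully balanced binary tree with n leaves (n must be a power of two)."""
    if n == 1:
        return "leaf"
    return (balanced(n // 2), balanced(n // 2))

def caterpillar(n):
    """Maximally unbalanced binary tree: each split peels off a single leaf."""
    tree = "leaf"
    for _ in range(n - 1):
        tree = (tree, "leaf")
    return tree

def mean_depth(tree):
    """Average root-to-leaf distance, used here as an assumed proxy for depth."""
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)

for n in (8, 64):
    print(n, mean_depth(balanced(n)), mean_depth(caterpillar(n)))
```

In the balanced case the mean depth grows like the logarithm of the number of leaves, while in the caterpillar case it grows roughly linearly. Measuring how depth scales with tree size therefore places an observed phylogeny between these two extremes.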

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Digital.CSIC

Understanding Search Trees via Statistical Physics

Author: B Chauvin
B Reed
D E Knuth
D S Dean
David S. Dean
E Ben-Naim
H M Mahmoud
H-H Chern
J M Robson
L Devroye
M Drmota
P Flajolet
P L Krapivsky
R A Finkel
R M Bradley
S N Majumdar
Satya N. Majumdar
W van Saarloos
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 20/10/2004

We study the random m-ary search tree model (where m stands for the number of branches of a search tree), an important problem for data storage in computer science, using a variety of statistical physics techniques that allow us to obtain exact asymptotic results. In particular, we show that the probability distributions of extreme observables associated with a random search tree, such as the height and the balanced height of a tree, have a traveling front structure. In addition, the variance of the number of nodes needed to store a data string of a given size N is shown to undergo a striking phase transition at a critical value of the branching ratio m_c = 26. We identify the mechanism of this phase transition and show that it is generic and occurs in various other problems as well. New results are obtained when each element of the data string is a D-dimensional vector. We show that this problem also has a phase transition at a critical dimension, D_c = \pi/\sin^{-1}(1/\sqrt{8}) = 8.69363...

Comment: 11 pages, 8 .eps figures included. Invited contribution to STATPHYS-22 held at Bangalore (India) in July 2004. To appear in the proceedings of STATPHYS-22
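Two quantities from this abstract can be explored numerically. The sketch below evaluates the quoted expression for the critical dimension and measures the height of one random search tree. It covers only the binary case (m = 2) as an assumed special case of the m-ary model, and the function names `critical_dimension` and `random_bst_height` are hypothetical, not taken from the paper.

```python
import math
import random

def critical_dimension():
    """Evaluate D_c = pi / arcsin(1 / sqrt(8)) as quoted in the abstract."""
    return math.pi / math.asin(1 / math.sqrt(8))

def random_bst_height(n, seed=0):
    """Height (in nodes) of a binary search tree built from n keys in random order.

    Binary case only (m = 2); the m-ary model in the abstract stores
    several keys per node and is not implemented here.
    """
    rng = random.Random(seed)
    keys = list(range(n))
    rng.shuffle(keys)
    left, right = {}, {}  # child pointers: parent key -> child key
    root, height = keys[0], 1
    for key in keys[1:]:
        node, depth = root, 1
        while True:
            children = left if key < node else right
            if node not in children:
                children[node] = key  # attach the new key as a leaf
                break
            node = children[node]
            depth += 1
        height = max(height, depth + 1)
    return height

print(round(critical_dimension(), 5))
print(random_bst_height(1000))
```

Repeating the height measurement over many seeds and several values of n gives an empirical height distribution, which is the kind of extreme observable the abstract describes.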

arXiv.org e-Print Archive

Crossref