
    A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition

    Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data are common in the field of student retention, mainly because many students enroll while comparatively few drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, where the overall accuracy is driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve the predictive accuracy on the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (over-sampling, under-sampling and synthetic minority over-sampling, SMOTE) along with four popular classification methods: logistic regression, decision trees, neural networks and support vector machines. We used a large and feature-rich institutional student dataset (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately identify at-risk students and help reduce student dropout rates.
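    As a rough illustration of the kind of pipeline the abstract describes, the sketch below pairs SMOTE with a support vector machine under 10-fold cross-validation. It assumes the scikit-learn and imbalanced-learn libraries and runs on synthetic stand-in data; none of the features, parameters or results come from the study.

        # Minimal sketch, not the authors' code: SMOTE + SVM evaluated with
        # 10-fold cross-validation on synthetic stand-in data.
        import numpy as np
        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import Pipeline
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 20))                  # placeholder student features
        y = (rng.random(1000) < 0.15).astype(int)        # ~15% minority (attrition) class

        # Resampling inside the pipeline ensures SMOTE is applied only to each
        # training fold, so synthetic minority examples never leak into evaluation.
        model = Pipeline([
            ("scale", StandardScaler()),
            ("smote", SMOTE(random_state=42)),
            ("svm", SVC(kernel="rbf", C=1.0)),
        ])

        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
        scores = cross_val_score(model, X, y, cv=cv)
        print("mean accuracy:", scores.mean())

    Reporting per-class recall alongside overall accuracy would expose the minority-class performance the abstract emphasizes.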

    On The Stability of Interpretable Models

    Interpretable classification models are built to provide a comprehensible description of the decision logic to an external oversight agent. Considered in isolation, a decision tree, a set of classification rules, or a linear model is widely recognized as human-interpretable. However, such models are generated as part of a larger analytical process, and bias in data collection and preparation, or in the model's construction, may severely affect the accountability of the design process. We conduct an experimental study of the stability of interpretable models with respect to feature selection, instance selection, and model selection. Our conclusions should raise the scientific community's awareness of the need for a stability impact assessment of interpretable models.
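    One way such a stability check can be sketched, assuming a decision tree and bootstrap-based instance selection (the paper's exact experimental protocol is not reproduced here), is to refit the model on resampled data and compare which features each fitted tree actually relies on.

        # Illustrative sketch: stability of a decision tree's selected features
        # under instance selection, measured by pairwise Jaccard similarity.
        from itertools import combinations
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)
        rng = np.random.default_rng(0)

        feature_sets = []
        for _ in range(20):
            idx = rng.integers(0, len(X), len(X))        # bootstrap sample of instances
            tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[idx], y[idx])
            feature_sets.append(frozenset(np.flatnonzero(tree.feature_importances_ > 0)))

        # 1.0 means every resample yields the same explanation; lower values
        # indicate that the "interpretable" description is unstable.
        jaccard = [len(a & b) / len(a | b) for a, b in combinations(feature_sets, 2)]
        print("mean Jaccard stability:", np.mean(jaccard))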

    A Bayesian phylogenetic hidden Markov model for B cell receptor sequence analysis.

    The human body generates a diverse set of high affinity antibodies, the soluble form of B cell receptors (BCRs), that bind to and neutralize invading pathogens. The natural development of BCRs must be understood in order to design vaccines for highly mutable pathogens such as influenza and HIV. BCR diversity is induced by naturally occurring combinatorial "V(D)J" rearrangement, mutation, and selection processes. Most current methods for BCR sequence analysis focus on separately modeling the above processes. Statistical phylogenetic methods are often used to model the mutational dynamics of BCR sequence data, but these techniques do not consider all the complexities associated with B cell diversification, such as the V(D)J rearrangement process. In particular, standard phylogenetic approaches assume the DNA bases of the progenitor (or "naive") sequence arise independently and according to the same distribution, ignoring the complexities of V(D)J rearrangement. In this paper, we introduce a novel approach to Bayesian phylogenetic inference for BCR sequences that is based on a phylogenetic hidden Markov model (phylo-HMM). This technique not only integrates a naive rearrangement model with a phylogenetic model for BCR sequence evolution but also naturally accounts for uncertainty in all unobserved variables, including the phylogenetic tree, via posterior distribution sampling.
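    As context for the hidden Markov component only, the sketch below runs a generic forward-algorithm likelihood over per-site emission terms; in the phylo-HMM those emission terms would be phylogenetic likelihoods and inference would proceed by posterior sampling. The states, dimensions and numbers are illustrative assumptions, not values from the paper.

        # Generic HMM forward algorithm in log space; a stand-in for the hidden
        # Markov layer of a phylo-HMM, not the paper's model.
        import numpy as np

        def forward_loglik(log_init, log_trans, log_emit):
            """log_init: (K,) log initial state probabilities,
            log_trans: (K, K) log transition matrix,
            log_emit: (T, K) per-site log emission likelihoods.
            Returns log P(observed sites)."""
            alpha = log_init + log_emit[0]
            for t in range(1, len(log_emit)):
                alpha = log_emit[t] + np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
            return np.logaddexp.reduce(alpha)

        # Toy example: 3 hidden segment states over 10 sites.
        K, T = 3, 10
        rng = np.random.default_rng(1)
        log_trans = np.log(rng.dirichlet(np.ones(K), size=K))
        log_init = np.log(np.full(K, 1.0 / K))
        log_emit = np.log(rng.random((T, K)))
        print(forward_loglik(log_init, log_trans, log_emit))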

    Making biodiversity measures accessible to non-specialists: An innovative method for rapid assessment of urban biodiversity

    Urban biodiversity studies provide important inputs for studying the interactions between human societies and ecological systems. However, existing urban biodiversity methods are time-intensive and/or too complex for the purposes of rapid biodiversity assessment of large urban sites. In this paper the authors present a biodiversity assessment method that is innovative in its approach, is reliable, and generates data that can be presented in an understandable way to non-ecologists. The method is based on measuring the land cover of different vegetation structures and the diversity of vascular plants, and then combining these into an overall biodiversity score. The land cover of vegetation structures was recorded using a checklist in combination with Tandy’s Isovist Technique and the Domin cover scale. Vascular plant diversity was recorded at genus level, using a checklist, by walking along defined transects within circular sampling areas of sixty-five-meter radius. A scoring procedure assigns an overall biodiversity score to each combination of vegetation-structure land cover and vascular plant diversity. The method was tested in three urban locations in the United Kingdom that differed in size, design and land use. Descriptive statistics of the resulting biodiversity scores differentiated the biodiversity distributions within each of the three locations, as well as across them. The main strength of this rapid biodiversity assessment method is its simplicity. Furthermore, because it produces accurate results, the method can be most useful in rapidly identifying areas where more detailed ecological surveys are needed.
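    A purely hypothetical sketch of how the two kinds of measurements might be folded into one score follows; the paper's actual scoring procedure, weights and class boundaries are not reproduced here, and every constant below is an assumption.

        # Hypothetical scoring sketch: combine Domin cover classes for vegetation
        # structures with genus-level vascular plant richness into a 0-100 score.
        DOMIN_MIDPOINTS = {        # Domin class -> approximate % cover midpoint
            1: 0.5, 2: 1, 3: 3, 4: 8, 5: 18, 6: 29, 7: 42, 8: 62, 9: 83, 10: 95,
        }

        def site_score(structure_domin, genus_count, max_genera=60):
            """structure_domin: vegetation structure -> Domin class (1-10);
            genus_count: number of vascular plant genera recorded on the transects."""
            cover = sum(DOMIN_MIDPOINTS[d] for d in structure_domin.values())
            cover_score = min(cover, 100) / 100                  # structural cover, 0-1
            richness_score = min(genus_count, max_genera) / max_genera
            return round(50 * cover_score + 50 * richness_score, 1)

        print(site_score({"trees": 6, "shrubs": 4, "grassland": 7}, genus_count=32))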