873 research outputs found

    A Multi-Gene Genetic Programming Application for Predicting Students Failure at School

    Despite several efforts, accurately predicting the student failure rate (SFR) at school remains a core problem in the educational sector. Procedures for forecasting SFR are rigid and often require data scaling or conversion into binary form, as in the logistic model, which may lead to loss of information and attenuation of effect size. In addition, the large number of factors, incomplete and unbalanced datasets, and the black-box nature of Artificial Neural Networks and Fuzzy logic systems expose the need for more efficient tools. Genetic Programming (GP) currently holds great promise and has produced positive results in different sectors. This study therefore developed GPSFARPS, a software application that provides a robust solution for predicting SFR using an evolutionary algorithm known as multi-gene genetic programming. The approach is validated by feeding a testing dataset to the evolved GP models. Results from GPSFARPS simulations show its ability to evolve a suitable failure rate expression, converging at 30 generations out of a specified maximum of 500. The multi-gene system was also able to minimize the evolved model expression and accurately predict the student failure rate using a subset of the original expression.
    Comment: 14 pages, 9 figures, Journal paper. arXiv admin note: text overlap with arXiv:1403.0623 by other author

    An Integrated Framework Based on Latent Variational Autoencoder for Providing Early Warning of At-Risk Students

    The rapid development of learning technologies has made the online learning paradigm popular in both higher education and K-12, and the prediction of student performance has become one of the most popular research topics in education. However, traditional prediction algorithms were originally designed for balanced datasets, while educational datasets are typically highly imbalanced, which makes it more difficult to accurately identify at-risk students. To address this problem, this study proposes an integrated framework (LVAEPre) based on a latent variational autoencoder (LVAE) with a deep neural network (DNN) to alleviate the imbalanced distribution of educational datasets and to provide early warning of at-risk students. Specifically, with the characteristics of educational data in mind, the LVAE learns the latent distribution of at-risk students and generates at-risk samples in order to obtain a balanced dataset. The DNN then performs the final performance prediction. Extensive experiments on the collected K-12 dataset show that LVAEPre can effectively handle the imbalanced educational dataset and provides much better and more stable prediction results than baseline methods in terms of accuracy and F1.5 score. A comparison of t-SNE visualization results further confirms the advantage of LVAE in dealing with the imbalance issue in educational datasets. Finally, based on the significant predictors identified by LVAEPre in the experimental dataset, some suggestions for designing pedagogical interventions are put forward.
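The F1.5 score mentioned in this abstract is presumably the F-beta score with beta = 1.5, which weights recall more heavily than precision. The sketch below computes it from confusion-matrix counts; the function name `f_beta` and the example counts are hypothetical and are not taken from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.5) -> float:
    """F-beta score for the positive (at-risk) class from confusion counts.

    beta > 1 weights recall more heavily than precision.
    """
    beta_sq = beta ** 2
    denom = (1 + beta_sq) * tp + beta_sq * fn + fp
    if denom == 0:
        # No positives were present or predicted: define the score as 0.
        return 0.0
    return (1 + beta_sq) * tp / denom


# Hypothetical counts: 8 at-risk students found, 2 false alarms, 4 missed.
score_f1 = f_beta(tp=8, fp=2, fn=4, beta=1.0)
score_f15 = f_beta(tp=8, fp=2, fn=4, beta=1.5)
print(round(score_f1, 4), round(score_f15, 4))
```

With these counts precision (0.8) is higher than recall (about 0.667), so the beta = 1.5 score is lower than the F1 score: missed at-risk students are penalized more than false alarms.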

    Exploring and Evaluating the Scalability and Efficiency of Apache Spark using Educational Datasets

    Research into combining data mining and machine learning technology with web-based education systems (known as educational data mining, or EDM) is becoming imperative in order to enhance the quality of education by moving beyond traditional methods. With the worldwide growth of Information and Communication Technology (ICT), data are becoming available in large volume, with high velocity and extensive variety. In this thesis, four popular data mining methods are run on Apache Spark, using large datasets from Online Cognitive Learning Systems to explore the scalability and efficiency of Spark. Datasets of various sizes are tested on Spark MLlib with different running configurations and parameter tunings. The thesis presents strategies for allocating computing resources and tuning parameters to take full advantage of the in-memory system of Apache Spark when conducting data mining and machine learning tasks. Moreover, it offers insights that education experts and data scientists can use to manage and improve the quality of education, as well as to analyze and discover hidden knowledge in the era of big data.

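    Application of AdaBoost to Resolve Class Imbalance in Determining Student Graduation with the Decision Tree Method (Penerapan Adaboost untuk Penyelesaian Ketidakseimbangan Kelas pada Penentuan Kelulusan Mahasiswa dengan Metode Decision Tree)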

    Universitas Pamulang is a university with a large number of students, but its historical data show an imbalance between the number of students who graduate on time and those who graduate late. The decision tree method performs well at classifying on-time versus late graduation, but it is weak when the degree of class imbalance is high. This problem can be addressed with a method that balances the classes and improves accuracy. AdaBoost is a boosting method that can balance the classes by assigning weights based on the classification error rate, which changes the data distribution. Experiments were conducted by applying AdaBoost to a decision tree (DT) to obtain optimal results and good accuracy. The decision tree alone achieved an accuracy of 87.18%, an AUC of 0.864, and an RMSE of 0.320, while the decision tree with AdaBoost (DTBoost) achieved an accuracy of 90.45%, an AUC of 0.951, and an RMSE of 0.273. It can therefore be concluded that, for determining student graduation, the decision tree combined with AdaBoost resolves the class imbalance problem, improves accuracy, and reduces the classification error rate.
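The reweighting idea described in this abstract can be illustrated with one round of the standard discrete AdaBoost update. This is a minimal sketch of the textbook rule, not the tool or configuration used in the study; the function name `adaboost_reweight` and the toy data are hypothetical, and the input weights are assumed to sum to 1.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One round of the discrete AdaBoost weight update.

    weights:       current sample weights (assumed to sum to 1)
    misclassified: booleans, True where the weak learner was wrong
    Returns (alpha, new_weights), with new_weights normalized to sum to 1.
    """
    # Weighted error of the weak learner on the current distribution.
    err = sum(w for w, miss in zip(weights, misclassified) if miss)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    # Vote weight of this learner in the final ensemble.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Raise the weight of misclassified samples, lower the rest.
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Toy example: four samples with equal weight, the first one misclassified.
alpha, new_weights = adaboost_reweight([0.25] * 4, [True, False, False, False])
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```

After the update the misclassified samples together carry half of the total weight, so the next weak learner is pushed toward the cases the previous one got wrong, which are often minority-class samples.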

    Improving the accuracy of predicting bank depositor' behavior using decision tree

    Telemarketing is a widely adopted direct marketing technique in banks. Since customers rarely respond positively, prediction models can help in selecting the most likely prospective customers. We aim to develop an accurate classifier to predict which customers will subscribe to a long-term deposit proposed by a bank. Accordingly, this paper combines resampling, to reduce the data imbalance, with feature selection, to reduce computational complexity and the dimensionality of the data. This approach improved the performance of the classification algorithm in terms of accuracy. The experiments were run on a real bank dataset, and the J48 decision tree achieved 94.39% accuracy, with 0.975 sensitivity and 0.709 specificity, showing better results than other approaches reported in the existing literature, such as logistic regression (91.79% accuracy; 0.975 sensitivity; 0.495 specificity) and the Naive Bayes classifier (90.82% accuracy; 0.961 sensitivity; 0.507 specificity). Furthermore, our resampling and feature selection approach improved accuracy (94.39%) compared with a state-of-the-art approach based on a fuzzy algorithm (92.89%).

    Capturing "attrition intensifying" structural traits from didactic interaction sequences of MOOC learners

    This work is an attempt to discover hidden structural configurations in learning activity sequences of students in Massive Open Online Courses (MOOCs). Leveraging combined representations of video clickstream interactions and forum activities, we seek to understand traits that are predictive of decreasing engagement over time. Grounded in the interdisciplinary field of network science, we follow a graph-based approach to extract indicators of active and passive MOOC participation that reflect persistence and regularity in the overall interaction footprint. Using these rich educational semantics, we focus on the problem of predicting student attrition, one of the major topics of the MOOC literature in recent years. Our results indicate an improvement over a baseline n-gram based approach in capturing "attrition intensifying" features from the learning activities that MOOC learners engage in. Implications for future research are discussed.
    Comment: "Shared Task" submission for EMNLP 2014 Workshop on Modeling Large Scale Social Interaction in Massively Open Online Course

    A new framework in improving prediction of class imbalance for student performance in Oman educational dataset using clustering based sampling techniques

    According to the Oman Education Portal (OEP), dataset imbalances are common in student performance. Most students perform well, while only a small number underperform. Classification techniques applied to an imbalanced dataset can yield deceptively high prediction accuracy. The majority class usually drives the overall predictive accuracy at the expense of abysmal performance on the minority class. The main objective of this study was to predict students' performance from an imbalanced class distribution by exploiting different sampling techniques and several data mining classifier models. Three main sampling techniques - synthetic minority over-sampling technique (SMOTE), random under-sampling (RUS), and clustering-based sampling - were compared to improve the predictive accuracy in the minority class while maintaining satisfactory overall classification performance. Five different data mining classifiers - J48, Random Forest, K-Nearest Neighbour, Naïve Bayes, and Logistic Regression - were used to predict student performance. 10-fold cross-validation was used to minimize sampling bias. The classifiers' performance was evaluated using four metrics: accuracy, False Positive (FP), Matthews correlation coefficient (MCC), and Receiver Operating Characteristic (ROC). The OEP datasets from 2018 and 2019 were extracted to assess the efficacy of both the sampling techniques and the classification methods. The results indicated that K-Nearest Neighbour combined with the clustering-based sampling technique produced the best classification performance, with an MCC value of 98.4% on 10-fold cross-validation. The clustering-based sampling techniques improved the overall prediction performance for the minority class. In addition, the most important variables for accurately predicting student performance were identified using the Random Forest model.
    OEP contains a large amount of data, and analyses based on this large and complex data can be useful for OEP stakeholders in improving student performance and identifying students who require additional attention.
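The point about deceptively high accuracy on imbalanced data can be made concrete by comparing accuracy with the Matthews correlation coefficient (MCC) on the same confusion matrix. The sketch below is illustrative only: the function names and the counts are hypothetical and do not come from the OEP data. MCC ranges from -1 to 1, so the 98.4% reported in the abstract presumably corresponds to a value of 0.984.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical class: 95 students performing well, 5 underperforming
# (the positive class). A model that labels everyone as "performing well"
# finds none of the 5 underperforming students.
print(accuracy(tp=0, tn=95, fp=0, fn=5))  # high accuracy
print(mcc(tp=0, tn=95, fp=0, fn=5))       # no better than chance
```

In this toy case the majority class alone produces an accuracy of 0.95, while the MCC of 0.0 shows that the model carries no information about the minority class.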

    IDENTIFICATION OF STUDENTS AT RISK OF LOW PERFORMANCE BY COMBINING RULE-BASED MODELS, ENHANCED MACHINE LEARNING, AND KNOWLEDGE GRAPH TECHNIQUES

    Technologies and online learning platforms have changed the contemporary educational paradigm, giving institutions more alternatives in a complex and competitive environment. Online learning platforms, learning analytics, and data mining tools are increasingly complementing and replacing traditional education techniques. However, academic underachievement, graduation delays, and student dropouts remain common problems in educational institutions. One potential way to prevent these issues is to predict student performance using institutional data and advanced technologies. To date, however, scholars have yet to develop a module that can accurately predict students’ academic achievement and commitment. This dissertation attempts to bridge that gap by presenting a framework that allows instructors to achieve four goals: (1) track and monitor the performance of each student on their course, (2) identify at-risk students during the earliest stages of the course progression, (3) enhance the accuracy with which at-risk student performance is predicted, and (4) improve the accuracy of student ranking and the development of personalized learning interventions. These goals are achieved via four objectives. Objective One proposes a rule-based strategy and risk factor flag to warn instructors about at-risk students. Objective Two classifies at-risk students using an explainable ML-based model and a rule-based approach. It also offers remedial strategies for at-risk students at each checkpoint to address their weaknesses. Objective Three uses ML-based models, GCNs, and knowledge graphs to enhance the prediction results. Objective Four predicts students’ ranking using ML-based models and clustering-based KGEs with the aim of developing personalized learning interventions.
    It is anticipated that the solution presented in this dissertation will help educational institutions identify and analyze at-risk students on a course-by-course basis and thereby minimize course failure rates.

    Data Mining

    Data mining is a branch of computer science that is used to automatically extract meaningful, useful knowledge and previously unknown, hidden, interesting patterns from a large amount of data to support the decision-making process. This book presents recent theoretical and practical advances in the field of data mining. It discusses a number of data mining methods, including classification, clustering, and association rule mining. This book brings together many different successful data mining studies in various areas such as health, banking, education, software engineering, animal science, and the environment