    An ensemble-based decision tree approach for educational data mining

    Data mining and machine learning techniques are now applied to a wide variety of topics (e.g., healthcare and disease, security, decision support, sentiment analysis, and education). Educational data mining investigates student performance and proposes ways to enhance the quality of education. The aim of this study is to apply different data mining and machine learning algorithms to real data sets related to students. To this end, we apply two decision tree methods, which can produce several simple and understandable rules. Moreover, the performance of the decision trees is improved by using an ensemble technique, the Rotation Forest algorithm. Our findings indicate that the Rotation Forest algorithm can enhance the performance of decision trees in terms of different metrics. In addition, we found that the trees generated by the decision tree ensemble were larger than those generated by the single decision trees. This means that the proposed methodology can reveal more information concerning simple rules.
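The link between tree size and the number of readable rules can be illustrated with a minimal sketch. It assumes a binary tree stored as nested dictionaries; the tree, the feature names (`absences`, `midterm`), and the helper functions are hypothetical and do not come from the study. Each root-to-leaf path becomes one IF-THEN rule, so a larger tree yields more rules.

```python
def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per leaf by walking every root-to-leaf path."""
    if "label" in node:  # leaf node: emit the rule accumulated so far
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    test = f"{node['feature']} <= {node['threshold']}"
    negated = f"{node['feature']} > {node['threshold']}"
    return (extract_rules(node["left"], conditions + (test,))
            + extract_rules(node["right"], conditions + (negated,)))


def tree_size(node):
    """Count all nodes (internal nodes plus leaves) in the tree."""
    if "label" in node:
        return 1
    return 1 + tree_size(node["left"]) + tree_size(node["right"])


# Hypothetical toy tree for a pass/fail prediction (illustrative only).
toy_tree = {
    "feature": "absences", "threshold": 10,
    "left": {
        "feature": "midterm", "threshold": 50,
        "left": {"label": "fail"},
        "right": {"label": "pass"},
    },
    "right": {"label": "fail"},
}

rules = extract_rules(toy_tree)
size = tree_size(toy_tree)
for rule in rules:
    print(rule)
```

In this toy case the five-node tree yields three rules. A tree with more leaves would yield proportionally more rules, which is one way to read the abstract's remark about larger trees.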

    A Performance Comparison of the C4.5, Gradient Boosting Trees, Random Forests, and Deep Learning Algorithms in an Educational Data Mining Case

    At SMA Negeri 1 Jogoroto, Jombang, East Java, students are assigned to majors under the 2013 curriculum. The assignment considers not only the students' own preferences and the specialization test taken in their first week of senior high school (SMA), but also their junior high school (SMP) records (report card grades, National Examination (Ujian Nasional) scores, and the counseling teacher's recommendation) and their parents' recommendation. Until now, the school has used a conventional process based on Microsoft Excel, which is slow and prone to calculation errors. Majors are assigned at the start of each academic year for new grade X students; on average, the school handles 290 students per year with limited time and human resources. In this study, the ID3 algorithm was not suitable because the data are numeric, whereas ID3 can only handle nominal or polynomial data, so the C4.5 algorithm was used instead. However, several studies report that C4.5 performs worse than Gradient Boosting Trees, Random Forests, and Deep Learning. The four methods were therefore compared to assess their effectiveness in assigning majors at SMA. The data used in this study were the new student admission data for the 2018/2019 academic year. The results show that when the attributes are polynomial, Deep Learning performs best among all algorithms when the ExpRectifier activation function is used. When the attributes are numeric, Deep Learning performs best among all algorithms when the Tanh function is used, for all random samplings. However, Deep Learning performs worst among all algorithms when the absolute loss function is used.
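The abstract's point that ID3 cannot use numeric attributes while C4.5 can is easier to see with a small example. The sketch below shows the usual idea in simplified form: a numeric attribute is turned into a binary test by trying candidate thresholds and keeping the one with the highest information gain. The function names and the toy scores are hypothetical, and this is not the study's implementation. C4.5 itself ranks splits by gain ratio and adds pruning, both omitted here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values.
    """
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate equal values
        threshold = (low + high) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical toy data: a numeric exam score and the assigned major.
scores = [60, 65, 70, 80, 85, 90]
majors = ["social", "social", "social", "science", "science", "science"]

threshold, gain = best_threshold(scores, majors)
print(threshold, gain)
```

On this toy data the best split falls at 75.0, which separates the two majors perfectly. An ID3-style learner would instead need the scores discretized into nominal bins beforehand.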