Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems
Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, few techniques are specifically dedicated to the problem of how to incorporate categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. In addition, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.

Comment: Accepted at The 18th International Conference on Modeling Decisions for Artificial Intelligence (MDAI)
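As a rough illustration of the idea (not the authors' implementation), the sketch below encodes each category by a quantile of its target values, blended with the global quantile via additive smoothing so that rare categories fall back toward the global statistic. The function name and the exact smoothing form are assumptions for illustration:

```python
import numpy as np

def quantile_encode(categories, targets, q=0.5, m=10.0):
    """Encode each category by a smoothed q-quantile of its targets.

    Additive smoothing with strength m blends the per-category
    quantile with the global quantile; categories with few examples
    are pulled toward the global value, limiting overfitting.
    """
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    global_q = np.quantile(targets, q)
    encoding = {}
    for c in np.unique(categories):
        vals = targets[categories == c]
        n = len(vals)
        cat_q = np.quantile(vals, q)
        encoding[c] = (n * cat_q + m * global_q) / (n + m)
    return encoding
```

Expanding the encoder as described in the abstract would amount to calling this with several values of q (e.g. 0.25, 0.5, 0.75) and emitting one feature per quantile.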
Exploring Error-Sensitive Attributes and Branches of Decision Trees
The 2013 9th International Conference on Natural Computation (ICNC'13) and the 2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'13) were jointly held from 23-25 July 2013 in Shenyang, China. http://icnc-fskd.lntu.edu.cn
Development of mobile-interfaced machine learning-based predictive models for improving students' performance in programming courses
Student performance modelling (SPM) is a critical step in assessing and improving students' performance in their learning discourse. However, most existing SPM approaches are statistical: on one hand they are probabilistic, so their results are estimates; on the other hand, the actual influence of hidden factors peculiar to students, lecturers, the learning environment and the family, together with their overall effect on student performance, has not been exhaustively investigated. In this paper, Student Performance Models (SPM) for improving students' performance in programming courses were developed using the M5P Decision Tree (MDT) and a Linear Regression Classifier (LRC). The data were gathered using a structured questionnaire from 295 students at the 200 and 300 levels of study who offered Web programming, C or Java at Federal University, Oye-Ekiti, Nigeria between 2012 and 2016. Hidden factors that are significant to students' performance in programming were identified. The relevant data were gathered, normalized, coded, prepared as variable and factor datasets, and fed into the MDT algorithm and the LRC to develop the predictive models. The developed models were validated and afterwards implemented in an Android Studio 1.0.1 environment. Extensible Markup Language (XML) and Java were used for the design of the Graphical User Interface (GUI) and the logical implementation of the developed models as a mobile calculator, respectively. Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE) were the metrics used to evaluate the robustness of the MDT and LRC models. The evaluation results indicate that the variable-based LRC produced the best model, yielding the lowest values on all four metrics.
Further results established the strong significance of the attitude of students and lecturers, students' fearful perception, erratic power supply, university facilities, student health and students' attendance to students' performance in programming courses. The variable-based LRC model presented in this paper could provide baseline information about students' performance, thereby supporting better decision making towards improving teaching and learning outcomes in programming courses.
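The four evaluation metrics named above have standard definitions; a minimal sketch, assuming the usual convention that RAE and RRSE normalise the model's error by the error of a naive predictor that always outputs the mean:

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, RAE and RRSE for a set of predictions.

    RAE and RRSE divide the model's error by the error of the
    baseline that predicts the mean of y_true for every example.
    """
    n = len(y_true)
    mean_y = sum(y_true) / n
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    abs_dev = sum(abs(t - mean_y) for t in y_true)
    sq_dev = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / abs_dev,
        "RRSE": math.sqrt(sq_err / sq_dev),
    }
```

A model that does no better than predicting the mean scores RAE = RRSE = 1, which is why lower values on these two metrics indicate genuine predictive value.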
Risk based uncertainty quantification to improve robustness of manufacturing operations
The cyber-physical systems of Industry 4.0 are expected to generate vast amounts of in-process data and to revolutionise the way data, knowledge and wisdom are captured and reused in manufacturing industries. The goal is to increase profits by dramatically reducing the occurrence of unexpected process results and waste. ISO9001:2015 defines risk as the effect of uncertainty. In the 7Epsilon context, risk is defined as the effect of uncertainty on expected results. The paper proposes a novel algorithm to embed risk based thinking in quantifying uncertainty in manufacturing operations during the tolerance synthesis process. This method uses penalty functions to mathematically represent deviation from expected results and solves the tolerance synthesis problem through a quantile regression tree approach. The latter involves non-parametric estimation of conditional quantiles of a response variable from in-process data and allows process engineers to discover and visualise optimal ranges that are associated with quality improvements. In order to quantify uncertainty and predict process robustness, a probabilistic approach based on the likelihood ratio test with bootstrapping is proposed, which uses smoothed estimation of conditional probabilities. The mathematical formulation presented in this paper will allow organisations to extend Six Sigma process improvement principles to the Industry 4.0 context and implement the 7 steps of 7Epsilon in order to satisfy the requirements of clauses 6.1 and 7.1.6 of the ISO9001:2015 and aerospace AS9100:2016 quality standards.
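A quantile regression tree replaces the mean prediction in each leaf with a conditional quantile of the response. The toy sketch below is an illustrative simplification, not the paper's algorithm: it estimates an empirical quantile nonparametrically and scores a candidate split of a process variable by the contrast between the response quantiles on either side:

```python
def conditional_quantile(values, q):
    """Empirical q-quantile by linear interpolation between
    order statistics (nonparametric, no distribution assumed)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac

def quantile_split_score(x, y, threshold, q=0.9):
    """Score a candidate split of process variable x at `threshold`
    by the gap between the q-quantiles of the response on each side.
    A tree builder would greedily pick high-contrast splits; this
    scoring rule is a hypothetical simplification for illustration."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    if not left or not right:
        return 0.0
    return abs(conditional_quantile(left, q) - conditional_quantile(right, q))
```

The "optimal ranges" the abstract mentions would correspond to the intervals of x carved out by such splits in which a high quantile of the defect-related response stays low.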
Using extreme prior probabilities on the Naive Credal Classifier
The Naive Credal Classifier (NCC) was the first method proposed for Imprecise Classification. It starts from the well-known Naive Bayes algorithm (NB), which assumes that the attributes are independent given the class variable. Despite this unrealistic assumption, NB and NCC have been successfully used in practical applications. In this work, we propose a new version of NCC, called the Extreme Prior Naive Credal Classifier (EP-NCC). Unlike NCC, EP-NCC takes into consideration the lower and upper prior probabilities of the class variable in the estimation of the lower and upper conditional probabilities. We demonstrate that, with our proposed EP-NCC, the predictions are more informative than with NCC, without increasing the risk of making erroneous predictions. An experimental analysis carried out in this work shows that EP-NCC significantly outperforms NCC and obtains statistically equivalent results to the algorithm proposed so far for Imprecise Classification based on decision trees, even though EP-NCC is computationally simpler. Therefore, EP-NCC is more suitable for application to large datasets for Imprecise Classification than the methods proposed so far in this field. This is an important point in favor of our proposal, given the increasing amount of data in every area.

This work has been supported by UGR-FEDER funds under Project A-TIC-344-UGR20, by the "FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades" under Project P20_00159, and by research scholarship FPU17/02685.
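Credal classifiers like NCC are usually built on the imprecise Dirichlet model (IDM), which turns observed counts into probability intervals rather than point estimates. A minimal sketch of that building block (only the interval itself, not EP-NCC's treatment of extreme prior probabilities):

```python
def idm_interval(count, total, s=1.0):
    """Lower and upper probability of an event observed `count`
    times in `total` trials, under the imprecise Dirichlet model
    with hyperparameter s; a larger s widens the interval."""
    return count / (total + s), (count + s) / (total + s)
```

A credal classifier then keeps every class whose upper probability is not dominated by another class's lower probability, which is what makes its predictions set-valued rather than single labels.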
A data mining approach to guide students through the enrollment process based on academic performance
Student academic performance at universities is crucial for education management systems. Many actions and decisions are made based on it, specifically the enrollment process. During enrollment, students have to decide which courses to sign up for. This research presents the rationale behind the design of a recommender system to support the enrollment process using the students' academic performance record. To build this system, the CRISP-DM methodology was applied to data from students of the Computer Science Department at the University of Lima, Perú. One of the main contributions of this work is the use of two synthetic attributes to improve the relevance of the recommendations made. The first attribute estimates the inherent difficulty of a given course. The second attribute, named potential, is a measure of the competence of a student for a given course based on the grades obtained in related courses. Data was mined using C4.5, KNN (K-nearest neighbor), Naïve Bayes, Bagging and Boosting, and a set of experiments was developed in order to determine the best algorithm for this application domain. Results indicate that Bagging is the best method regarding predictive accuracy. Based on these results, the "Student Performance Recommender System" (SPRS) was developed, including a learning engine. SPRS was tested with a sample group of 39 students during the enrollment process. Results showed that the system had a very good performance under real-life conditions.
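The two synthetic attributes could be sketched as follows. The paper's exact definitions are not reproduced here, so both formulas, and the 0-20 grade scale common in Peruvian universities, are illustrative assumptions:

```python
def course_difficulty(grades, max_grade=20.0):
    """Hypothetical proxy for inherent course difficulty: one minus
    the mean normalised grade, so courses with lower average grades
    score as more difficult."""
    return 1.0 - (sum(grades) / len(grades)) / max_grade

def student_potential(student_grades, related_courses):
    """Hypothetical 'potential' attribute: the student's mean grade
    over the courses related to the target course; None if the
    student has taken none of them."""
    vals = [student_grades[c] for c in related_courses if c in student_grades]
    return sum(vals) / len(vals) if vals else None
```

Attributes like these give the classifiers a per-student, per-course signal instead of relying only on raw transcript rows.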
An Analysis of Reduced Error Pruning
Top-down induction of decision trees has been observed to suffer from the
inadequate functioning of the pruning phase. In particular, it is known that
the size of the resulting tree grows linearly with the sample size, even though
the accuracy of the tree does not improve. Reduced Error Pruning is an
algorithm that has been used as a representative technique in attempts to
explain the problems of decision tree learning. In this paper we present
analyses of Reduced Error Pruning in three different settings. First we study
the basic algorithmic properties of the method, properties that hold
independent of the input decision tree and pruning examples. Then we examine a
situation that intuitively should lead to the subtree under consideration being
replaced by a leaf node, one in which the class label and attribute values of
the pruning examples are independent of each other. This analysis is conducted
under two different assumptions. The general analysis shows that the pruning
probability of a node fitting pure noise is bounded by a function that
decreases exponentially as the size of the tree grows. In a specific analysis
we assume that the examples are distributed uniformly to the tree. This
assumption lets us approximate the number of subtrees that are pruned because
they do not receive any pruning examples. This paper clarifies the different
variants of the Reduced Error Pruning algorithm, brings new insight to its
algorithmic properties, analyses the algorithm with less imposed assumptions
than before, and includes the previously overlooked empty subtrees in the
analysis.
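The basic procedure the abstract analyses can be sketched as a bottom-up pass over the tree. This toy version (hypothetical class names, categorical splits only) replaces a subtree by a leaf carrying its majority label whenever that does not increase error on the pruning set; note that it also prunes any subtree that receives no pruning examples, the empty-subtree case the paper highlights:

```python
class Node:
    def __init__(self, label, children=None, split_attr=None):
        self.label = label               # majority class at this node
        self.children = children or {}   # attribute value -> subtree
        self.split_attr = split_attr

def classify(node, example):
    while node.children:
        child = node.children.get(example.get(node.split_attr))
        if child is None:
            break
        node = child
    return node.label

def errors(node, examples):
    return sum(1 for x, y in examples if classify(node, x) != y)

def reduced_error_prune(node, examples):
    """Bottom-up Reduced Error Pruning against a held-out pruning set."""
    if not node.children:
        return node
    for v, child in list(node.children.items()):
        subset = [(x, y) for x, y in examples if x.get(node.split_attr) == v]
        node.children[v] = reduced_error_prune(child, subset)
    leaf = Node(node.label)
    if errors(leaf, examples) <= errors(node, examples):
        return leaf
    return node
```

When the pruning examples' labels are independent of the attributes, as in the setting the abstract describes, the leaf ties or beats the subtree with high probability, so the subtree is pruned.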
Classification of Malaysian students' tendency in choosing TVET after secondary school using analytic hierarchy process and decision tree model: a case study in northern part of Malaysia
Technical and Vocational Education and Training (TVET) programmes are a channel for producing skilled workers because they provide theoretical knowledge and practical skills for students. However, current enrolment in TVET programmes among secondary school leavers is still low. Therefore, this study aims to uncover the factors that affect students' tendency to enrol in a TVET programme. The perceptions of students, the public, instructors, employers and parents, together with facility, cost and policy, have been identified as the factors in this study. In the first phase, the Analytic Hierarchy Process (AHP) is used to determine the level of importance of each factor. Based on the outcome of the first phase, various Decision Tree models are developed to classify students' tendency to enrol in a TVET programme. The respondents in this study are students from the TVET programme and from secondary schools. The AHP result shows that the parents factor is the most important, followed by instructors, employers, students, cost, facility, policy and the public. Then, four types of Decision Tree, namely ID3, CART, C4.5 and CHAID, are generated to classify students' tendency based on the four most important factors (parents, instructors, employers and students). The ID3 tree with 5-fold Cross Validation is selected as the best model due to its low misclassification rate (0.1355), high accuracy rate (0.8645), high precision rate (0.9143), high recall rate (0.8828) and high F-score (0.8983), with a tree depth of 7 and a maximum of 3 branches. Hence, in the future, this model can be used to classify students' tendency to enrol in a TVET programme. It can also assist the government in implementing strategic plans such as organizing campaigns or providing on-the-job training to students in the TVET programme. Therefore, skilled workers who can adapt to new technology and innovation can be produced.
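AHP derives factor weights from a reciprocal pairwise comparison matrix; a common approximation of the principal-eigenvector weights is the row geometric mean. The sketch below illustrates that step with a made-up two-factor matrix, not the study's data:

```python
from math import prod

def ahp_weights(pairwise):
    """Approximate AHP priority weights from a reciprocal pairwise
    comparison matrix using the row geometric mean, normalised to
    sum to one."""
    n = len(pairwise)
    gmeans = [prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(gmeans)
    return [g / total for g in gmeans]
```

For the study's eight factors the matrix would be 8x8, with entry (i, j) holding how much more important factor i is judged than factor j; the resulting ranking (parents first, public last) is what feeds the Decision Tree phase.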