23 research outputs found

    Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems

    Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, few techniques are specifically dedicated to the problem of how to incorporate categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. In addition, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.
    Comment: Accepted at The 18th International Conference on Modeling Decisions for Artificial Intelligence (MDAI)
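    A minimal sketch of the core idea, assuming the usual additive-smoothing form used for mean target encoding; the function name and the hyperparameters q and m are illustrative, not the authors' reference implementation:

```python
import numpy as np

def quantile_encode(categories, targets, q=0.5, m=30.0):
    """Smoothed quantile target encoding (illustrative sketch).

    Each category is mapped to a blend of its per-category target
    quantile and the global quantile; categories with small support
    are pulled towards the global value (additive smoothing). The
    hyperparameters q and m are assumed, not taken from the paper.
    """
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    global_q = np.quantile(targets, q)
    encoding = {}
    for cat in np.unique(categories):
        vals = targets[categories == cat]
        n = len(vals)
        # Blend the per-category quantile with the global quantile.
        encoding[cat] = (n * np.quantile(vals, q) + m * global_q) / (n + m)
    return encoding

# Toy usage: a skewed target with one rare category.
rng = np.random.default_rng(0)
cats = ["a"] * 50 + ["b"] * 3
y = list(rng.lognormal(size=50)) + [100.0, 110.0, 90.0]
print(quantile_encode(cats, y, q=0.5, m=30.0))
```

    Encoding at several values of q (say 0.25, 0.5 and 0.75) and concatenating the resulting columns gives the kind of expanded multi-quantile feature set the abstract describes.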

    Exploring Error-Sensitive Attributes and Branches of Decision Trees

    The 2013 9th International Conference on Natural Computation (ICNC'13) and the 2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'13) were jointly held from 23-25 July 2013 in Shenyang, China. http://icnc-fskd.lntu.edu.cn

    Development of mobile-interfaced machine learning-based predictive models for improving students' performance in programming courses

    Student performance modelling (SPM) is a critical step in assessing and improving students' performance in their learning discourse. However, most existing SPM approaches are statistical: on the one hand they are based on probability, so their results are estimates; on the other hand, the actual influence of hidden factors peculiar to students, lecturers, the learning environment and the family, together with their overall effect on student performance, has not been exhaustively investigated. In this paper, Student Performance Models (SPM) for improving students' performance in programming courses were developed using the M5P Decision Tree (MDT) and a Linear Regression Classifier (LRC). The data used was gathered with a structured questionnaire from 295 students in 200 and 300 levels of study who offered Web programming, C or JAVA at Federal University, Oye-Ekiti, Nigeria between 2012 and 2016. Hidden factors that are significant to students' performance in programming were identified. The relevant data was gathered, normalized, coded and prepared as variable and factor datasets, and fed into the MDT algorithm and LRC to develop the predictive models. The developed models were validated and afterwards implemented in an Android Studio 1.0.1 environment. Extensible Markup Language (XML) and Java were used for the design of the Graphical User Interface (GUI) and the logical implementation of the developed models as a mobile calculator, respectively. Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE) were the metrics used to evaluate the robustness of the MDT and LRC models. The evaluation results indicate that the variable-based LRC produced the best model, yielding the lowest values of MAE, RMSE, RAE and RRSE in all the evaluations conducted. Further results established the strong significance of the attitude of students and lecturers, fearful perception of students, erratic power supply, university facilities, student health and students' attendance to the performance of students in programming courses. The variable-based LRC model presented in this paper could provide baseline information about students' performance, thereby supporting better decision making towards improving teaching/learning outcomes in programming courses
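    For concreteness, a short sketch of the four metrics named above, using their standard definitions (RAE and RRSE normalise against the baseline predictor that always outputs the mean of the observed values):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Standard definitions of MAE, RMSE, RAE and RRSE (sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    sq_err = (y_true - y_pred) ** 2
    # Errors of the naive mean predictor, used to normalise RAE/RRSE.
    base_abs = np.abs(y_true - y_true.mean())
    base_sq = (y_true - y_true.mean()) ** 2
    return {
        "MAE": abs_err.mean(),
        "RMSE": np.sqrt(sq_err.mean()),
        "RAE": abs_err.sum() / base_abs.sum(),
        "RRSE": np.sqrt(sq_err.sum() / base_sq.sum()),
    }

# Toy usage with made-up scores.
print(regression_metrics([60, 70, 80], [58, 75, 79]))
```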

    Risk based uncertainty quantification to improve robustness of manufacturing operations

    The cyber-physical systems of Industry 4.0 are expected to generate vast amounts of in-process data and revolutionise the way data, knowledge and wisdom are captured and reused in manufacturing industries. The goal is to increase profits by dramatically reducing the occurrence of unexpected process results and waste. ISO9001:2015 defines risk as the effect of uncertainty. In the 7Epsilon context, risk is defined as the effect of uncertainty on expected results. The paper proposes a novel algorithm to embed risk-based thinking in quantifying uncertainty in manufacturing operations during the tolerance synthesis process. The method uses penalty functions to mathematically represent deviation from expected results and solves the tolerance synthesis problem with a quantile regression tree approach. The latter involves non-parametric estimation of conditional quantiles of a response variable from in-process data and allows process engineers to discover and visualise optimal ranges that are associated with quality improvements. In order to quantify uncertainty and predict process robustness, a probabilistic approach, based on the likelihood ratio test with bootstrapping, is proposed, which uses smoothed estimation of conditional probabilities. The mathematical formulation presented in this paper will allow organisations to extend Six Sigma process improvement principles to the Industry 4.0 context and implement the 7 steps of 7Epsilon in order to satisfy the requirements of clauses 6.1 and 7.1.6 of the ISO9001:2015 and aerospace AS9100:2016 quality standards
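    A hedged sketch of the quantile-regression-tree idea: fit an ordinary regression tree, then read conditional quantiles off the empirical distribution of responses in each leaf. This mirrors the non-parametric estimation described above, not the paper's exact algorithm; leaf_quantiles and its parameters are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def leaf_quantiles(X, y, X_query, q=0.9, **tree_kw):
    """Non-parametric conditional quantile estimate (sketch).

    Reports the empirical q-quantile of the training responses that
    fall in the same leaf as each query point.
    """
    tree = DecisionTreeRegressor(**tree_kw).fit(X, y)
    train_leaves = tree.apply(X)       # leaf id per training row
    query_leaves = tree.apply(X_query) # leaf id per query row
    return np.array([
        np.quantile(y[train_leaves == leaf], q) for leaf in query_leaves
    ])

# Toy usage: one process parameter driving the response spread.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 1))
y = X[:, 0] + rng.normal(scale=0.1 + 0.3 * X[:, 0], size=500)
print(leaf_quantiles(X, y, np.array([[0.2], [0.8]]), q=0.9, max_depth=3))
```

    Visualising the leaf-level quantiles over the input ranges is one way to surface the "optimal ranges" the abstract mentions.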

    Using extreme prior probabilities on the Naive Credal Classifier

    The Naive Credal Classifier (NCC) was the first method proposed for Imprecise Classification. It starts from the well-known Naive Bayes algorithm (NB), which assumes that the attributes are independent given the class variable. Despite this unrealistic assumption, NB and NCC have been successfully used in practical applications. In this work, we propose a new version of NCC, called the Extreme Prior Naive Credal Classifier (EP-NCC). Unlike NCC, EP-NCC takes into consideration the lower and upper prior probabilities of the class variable in the estimation of the lower and upper conditional probabilities. We demonstrate that, with our proposed EP-NCC, the predictions are more informative than with NCC, without increasing the risk of making erroneous predictions. An experimental analysis carried out in this work shows that EP-NCC significantly outperforms NCC and obtains statistically equivalent results to the algorithm proposed so far for Imprecise Classification based on decision trees, even though EP-NCC is computationally simpler. Therefore, EP-NCC is more suitable than the methods proposed so far in this field for application to large datasets for Imprecise Classification. This is an important issue in favor of our proposal given the increasing amount of data in every area.
    This work has been supported by UGR-FEDER funds under Project A-TIC-344-UGR20, by the “FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades” under Project P20_00159, and by research scholarship FPU17/02685
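    Credal classifiers of this family build on lower and upper probability estimates. Below is a minimal sketch of the standard imprecise Dirichlet model bounds that underlie such estimates; EP-NCC's specific use of lower and upper priors in the conditional estimates is not reproduced here.

```python
def idm_bounds(count, total, s=1.0):
    """Lower/upper probability of an event under the imprecise
    Dirichlet model with hyperparameter s (standard IDM formulas)."""
    return count / (total + s), (count + s) / (total + s)

# Toy usage: 7 of 20 training examples belong to class c.
low, high = idm_bounds(7, 20, s=1.0)
print(f"P(c) in [{low:.3f}, {high:.3f}]")
```

    The width of the interval shrinks as the sample grows, which is what lets credal classifiers stay cautious on scarce data yet informative on plentiful data.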

    A data mining approach to guide students through the enrollment process based on academic performance

    Student academic performance at universities is crucial for education management systems. Many actions and decisions are made based on it, specifically the enrollment process. During enrollment, students have to decide which courses to sign up for. This research presents the rationale behind the design of a recommender system to support the enrollment process using the students’ academic performance record. To build this system, the CRISP-DM methodology was applied to data from students of the Computer Science Department at the University of Lima, Perú. One of the main contributions of this work is the use of two synthetic attributes to improve the relevance of the recommendations made. The first attribute estimates the inherent difficulty of a given course. The second attribute, named potential, is a measure of the competence of a student for a given course based on the grades obtained in related courses. Data was mined using C4.5, KNN (K-nearest neighbor), Naïve Bayes, Bagging and Boosting, and a set of experiments was conducted to determine the best algorithm for this application domain. Results indicate that Bagging is the best method regarding predictive accuracy. Based on these results, the “Student Performance Recommender System” (SPRS) was developed, including a learning engine. SPRS was tested with a sample group of 39 students during the enrollment process. Results showed that the system performed very well under real-life conditions
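    The abstract does not give formulas for the two synthetic attributes, so the sketch below assumes simple definitions purely for illustration: difficulty as the shortfall of a course's mean grade below the passing mark, and potential as a student's mean grade over courses related to the target course. All names, marks and the related-course mapping are hypothetical.

```python
import pandas as pd

# Toy grade records: one row per (student, course, grade out of 20).
grades = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s3"],
    "course":  ["c1", "c2", "c1", "c2", "c1"],
    "grade":   [15.0, 11.0, 8.0, 9.0, 13.0],
})

# Difficulty: how far a course's mean grade sits below the passing
# mark (assumed definition, not the paper's).
PASS_MARK = 10.5
difficulty = (PASS_MARK - grades.groupby("course")["grade"].mean()).clip(lower=0)

# Potential: a student's mean grade over the courses assumed to be
# related to the target course (hard-coded toy mapping).
related = {"c3": ["c1", "c2"]}

def potential(student, course):
    prior = grades[(grades["student"] == student)
                   & (grades["course"].isin(related.get(course, [])))]
    return prior["grade"].mean()

print(difficulty.to_dict())
print(potential("s1", "c3"))
```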

    An Analysis of Reduced Error Pruning

    Top-down induction of decision trees has been observed to suffer from the inadequate functioning of the pruning phase. In particular, it is known that the size of the resulting tree grows linearly with the sample size, even though the accuracy of the tree does not improve. Reduced Error Pruning is an algorithm that has been used as a representative technique in attempts to explain the problems of decision tree learning. In this paper we present analyses of Reduced Error Pruning in three different settings. First we study the basic algorithmic properties of the method, properties that hold independently of the input decision tree and pruning examples. Then we examine a situation that intuitively should lead to the subtree under consideration being replaced by a leaf node, one in which the class label and attribute values of the pruning examples are independent of each other. This analysis is conducted under two different assumptions. The general analysis shows that the probability of pruning a node that fits pure noise is bounded by a function that decreases exponentially as the size of the tree grows. In a specific analysis we assume that the examples are distributed uniformly over the tree. This assumption lets us approximate the number of subtrees that are pruned because they receive no pruning examples. This paper clarifies the different variants of the Reduced Error Pruning algorithm, brings new insight to its algorithmic properties, analyses the algorithm with fewer imposed assumptions than before, and includes the previously overlooked empty subtrees in the analysis
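    A minimal sketch of the basic Reduced Error Pruning procedure analysed in the paper: traverse the tree bottom-up and replace a subtree by a majority-class leaf whenever that does not increase error on a held-out pruning set. The Node structure and the tie-breaking towards the smaller tree are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[str] = None    # None => leaf
    children: Optional[dict] = None  # feature value -> subtree
    label: Optional[str] = None      # majority class at this node

def errors(node, examples):
    """Classification errors of `node` on labelled (x, y) examples."""
    errs = 0
    for x, y in examples:
        n = node
        while n.feature is not None and x.get(n.feature) in n.children:
            n = n.children[x[n.feature]]
        errs += (n.label != y)
    return errs

def reduced_error_prune(node, pruning_set):
    """Bottom-up REP (sketch): replace a subtree by a majority-class
    leaf whenever the pruning-set error does not increase."""
    if node.feature is None:
        return node
    for v, child in node.children.items():
        sub = [(x, y) for x, y in pruning_set if x.get(node.feature) == v]
        node.children[v] = reduced_error_prune(child, sub)
    leaf = Node(label=node.label)
    if errors(leaf, pruning_set) <= errors(node, pruning_set):
        return leaf
    return node

# Toy usage: a stump whose split does not help on the pruning set.
stump = Node(feature="f", label="yes",
             children={"a": Node(label="yes"), "b": Node(label="no")})
pruned = reduced_error_prune(stump, [({"f": "a"}, "yes"), ({"f": "b"}, "yes")])
print(pruned.feature)  # None: the subtree was replaced by a leaf
```

    Note how a subtree that receives no pruning examples is pruned under this tie-breaking rule, which is exactly the "empty subtree" case the paper brings into the analysis.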

    Classification of Malaysian students' tendency in choosing TVET after secondary school using analytic hierarchy process and decision tree model: a case study in northern part of Malaysia

    Technical and Vocational Education and Training (TVET) programmes are a channel for producing skilled workers because they provide theoretical knowledge and practical skills for students. However, current enrolment in the TVET programme among secondary school leavers is still low. Therefore, this study aims to uncover the factors that affect students’ tendency to enrol in the TVET programme. The perceptions of students, the public, instructors, employers and parents, together with facility, cost and policy, were identified as the factors in this study. In the first phase, the Analytic Hierarchy Process (AHP) is used to determine the level of importance of each factor. Based on the outcome of the first phase, various Decision Tree models are developed to classify students’ tendency to enrol in the TVET programme. The respondents in this study are students from the TVET programme and secondary schools. The AHP result shows that parents are the most important factor, followed by instructors, employers, students, cost, facility, policy and the public. Then, four types of Decision Tree, namely ID3, CART, C4.5 and CHAID, are generated to classify the students’ tendency based on the four most important factors (parents, instructors, employers and students). The ID3 tree with 5-fold Cross Validation is selected as the best model due to its low misclassification rate (0.1355), high accuracy (0.8645), high precision (0.9143), high recall (0.8828) and high F-score (0.8983), with a tree depth of 7 and a maximum of 3 branches. Hence, this model can be used in the future to classify students’ tendency to enrol in the TVET programme. It can also assist the government in implementing strategic plans such as organizing campaigns or providing on-the-job training to students in the TVET programme, so that skilled workers who can adapt to new technology and innovation can be produced
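    For the AHP step, a minimal sketch of the standard priority-weight computation: factor weights are taken from the principal eigenvector of a pairwise comparison matrix. The matrix entries below are illustrative Saaty-scale judgements, not the study's data.

```python
import numpy as np

def ahp_weights(M):
    """Priority weights from an AHP pairwise comparison matrix:
    the principal eigenvector, normalised to sum to 1 (standard AHP)."""
    M = np.asarray(M, dtype=float)
    eigvals, eigvecs = np.linalg.eig(M)
    k = np.argmax(eigvals.real)          # principal eigenvalue
    w = np.abs(eigvecs[:, k].real)       # its eigenvector, sign-fixed
    return w / w.sum()

# Toy 3-factor comparison (e.g. parents vs instructors vs employers).
M = [[1,   3,   5],
     [1/3, 1,   2],
     [1/5, 1/2, 1]]
print(ahp_weights(M))
```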