23 research outputs found

Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems

Author: Masip David
Mougan Carlos
Nin Jordi
Pujol Oriol
Publication date: 04/07/2021

Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, there are few techniques specifically dedicated to the problem of how to incorporate categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. Besides, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model. Comment: Accepted at The 18th International Conference on Modeling Decisions for Artificial Intelligence (MDAI
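The encoding idea in this abstract can be illustrated with a minimal sketch. It assumes one common form of additive smoothing, in which the per-category quantile is pulled toward the global quantile with strength m; the function names, the parameter m, and the exact blending formula are illustrative assumptions, not taken from the listing.

```python
from collections import defaultdict

def quantile(values, p):
    """p-quantile using linear interpolation between order statistics."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def fit_quantile_encoder(categories, targets, p=0.5, m=1.0):
    """Map each category to a smoothed p-quantile of its target values.

    Assumed smoothing: (q_cat * n_cat + q_global * m) / (n_cat + m),
    so categories with few samples stay close to the global quantile.
    """
    groups = defaultdict(list)
    for c, y in zip(categories, targets):
        groups[c].append(y)
    global_q = quantile(targets, p)
    encoding = {
        c: (quantile(ys, p) * len(ys) + global_q * m) / (len(ys) + m)
        for c, ys in groups.items()
    }
    return encoding, global_q

def transform(categories, encoding, global_q):
    """Encode categories; unseen categories fall back to the global quantile."""
    return [encoding.get(c, global_q) for c in categories]

cats = ["a", "a", "a", "b"]
ys = [1, 2, 3, 10]
encoding, global_q = fit_quantile_encoder(cats, ys, p=0.5, m=1.0)
print(encoding)  # {'a': 2.125, 'b': 6.25}
```

Calling fit_quantile_encoder once per value of p (for example 0.25, 0.5 and 0.75) and stacking the results as separate columns gives one possible reading of the "set of features with different quantiles" mentioned in the abstract.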

arXiv.org e-Print Archive

Southampton (e-Prints Soton)

Exploring Error-Sensitive Attributes and Branches of Decision Trees

Author: Wu WZ
Publication venue: IEEE, Liaoning Technical University
Publication date: 01/01/2013

The 2013 9th International Conference on Natural Computation (ICNC'13) and the 2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'13) were jointly held from 23-25 July 2013 in Shenyang, China. http://icnc-fskd.lntu.edu.cn

OPUS - University of Technology Sydney

Development of mobile-interfaced machine learning-based predictive models for improving students' performance in programming courses

Author: Adebimpe Esan
Adepoju Adeyanju Ibrahim
Ayodele Oloyede
Bolaji Omodunbi
Funmilola Egbetola
Matthew Fagbola Temitayo
Olatayo Olaniyan
Olumide Obe
Publication venue: SAI Organization
Publication date: 01/01/2018

Student performance modelling (SPM) is a critical step in assessing and improving students' performance in their learning discourse. However, most existing SPM approaches are statistical: on the one hand they are based on probability, so their results are estimates; on the other hand, the actual influences of hidden factors that are peculiar to students, lecturers, the learning environment and the family, together with their overall effect on student performance, have not been exhaustively investigated. In this paper, Student Performance Models (SPM) for improving students' performance in programming courses were developed using M5P Decision Tree (MDT) and Linear Regression Classifier (LRC). The data used was gathered using a structured questionnaire from 295 students in 200 and 300 levels of study who offered Web programming, C or JAVA at Federal University, Oye-Ekiti, Nigeria between 2012 and 2016. Hidden factors that are significant to students' performance in programming were identified. The relevant data were gathered, normalized, coded and prepared as variable and factor datasets, and fed into the MDT algorithm and LRC to develop the predictive models. The developed models were obtained, validated and afterwards implemented in an Android 1.0.1 Studio environment. Extensible Markup Language (XML) and Java were used for the design of the Graphical User Interface (GUI) and the logical implementation of the developed models as a mobile calculator, respectively. Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE) were the metrics used to evaluate the robustness of the MDT and LRC models. The evaluation results indicate that the variable-based LRC produced the best model in terms of MAE, RMSE, RAE and RRSE, having yielded the lowest values in all the evaluations conducted. Further results established the strong significance of the attitude of students and lecturers, fearful perception of students, erratic power supply, university facilities, student health and students' attendance to the performance of students in programming courses. The variable-based LRC model presented in this paper could provide baseline information about students' performance, thereby offering better decision making towards improving teaching/learning outcomes in programming courses
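The four error measures named in this abstract can be sketched as follows. The definitions are the usual textbook ones, with RAE and RRSE taken relative to a baseline that always predicts the mean of the actual values; they are assumed here, since the listing does not spell them out, and the function name is illustrative.

```python
import math

def regression_errors(actual, predicted):
    """Return MAE, RMSE, RAE and RRSE for paired actual/predicted values.

    RAE and RRSE compare the model against a baseline that always
    predicts the mean of the actual values (assumed definition).
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    abs_base = sum(abs(a - mean_actual) for a in actual)
    sq_base = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / abs_base,
        "RRSE": math.sqrt(sq_err / sq_base),
    }

print(regression_errors([1, 2, 3], [1, 2, 4]))
```

Lower is better for all four. RAE or RRSE below 1 means the model beats the mean-only baseline; some tools report these two as percentages.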

Repository@Hull - Worktribe

Crossref

Risk based uncertainty quantification to improve robustness of manufacturing operations

Author: Cinzia Giannetti
Rajesh S. Ransing
Publication venue: Elsevier BV
Publication date: 01/01/2016

The cyber-physical systems of Industry 4.0 are expected to generate vast amounts of in-process data and revolutionise the way data, knowledge and wisdom are captured and reused in manufacturing industries. The goal is to increase profits by dramatically reducing the occurrence of unexpected process results and waste. ISO9001:2015 defines risk as the effect of uncertainty. In the 7Epsilon context, risk is defined as the effect of uncertainty on expected results. The paper proposes a novel algorithm to embed risk-based thinking in quantifying uncertainty in manufacturing operations during the tolerance synthesis process. This method uses penalty functions to mathematically represent deviation from expected results and solves the tolerance synthesis problem by proposing a quantile regression tree approach. The latter involves nonparametric estimation of conditional quantiles of a response variable from in-process data and allows process engineers to discover and visualise optimal ranges that are associated with quality improvements. In order to quantify uncertainty and predict process robustness, a probabilistic approach, based on the likelihood ratio test with bootstrapping, is proposed which uses smoothed probability estimation of conditional probabilities. The mathematical formulation presented in this paper will allow organisations to extend Six Sigma process improvement principles in the Industry 4.0 context and implement the 7 steps of 7Epsilon in order to satisfy the requirements of clauses 6.1 and 7.1.6 of the ISO9001:2015 and the aerospace AS9100:2016 quality standard
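Quantile estimation of the kind mentioned in this abstract is commonly built on the pinball (check) loss. The sketch below shows that loss and a brute-force search for the constant prediction that minimises it, which is what a single leaf of a quantile regression tree would output under that loss. It is a generic illustration, not the paper's algorithm; the function names are hypothetical.

```python
def pinball_loss(targets, prediction, tau):
    """Mean pinball loss of a constant prediction at quantile level tau.

    Under-predictions are weighted by tau and over-predictions by
    (1 - tau), so the minimiser is an empirical tau-quantile.
    """
    total = 0.0
    for y in targets:
        residual = y - prediction
        total += tau * residual if residual >= 0 else (tau - 1.0) * residual
    return total / len(targets)

def best_constant(targets, tau):
    """Search the observed values for the one with the lowest pinball loss."""
    return min(sorted(set(targets)), key=lambda q: pinball_loss(targets, q, tau))

measurements = [1, 2, 3, 4, 100]
print(best_constant(measurements, 0.5))  # 3
print(best_constant(measurements, 0.9))  # 100
```

With tau = 0.5 the outlier 100 barely moves the estimate, whereas the mean of the same data is 22; this robustness is one reason quantiles suit skewed process data.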

Elsevier - Publisher Connector

Crossref

Cronfa at Swansea University

Using extreme prior probabilities on the Naive Credal Classifier

Author: Abellán Mulero Joaquín
García Castellano Francisco Javier
Mantas Ruiz Carlos Javier
Moral García Serafín
Publication venue: Elsevier
Publication date: 15/02/2022

The Naive Credal Classifier (NCC) was the first method proposed for Imprecise Classification. It starts from the known Naive Bayes algorithm (NB), which assumes that the attributes are independent given the class variable. Despite this unrealistic assumption, NB and NCC have been successfully used in practical applications. In this work, we propose a new version of NCC, called Extreme Prior Naive Credal Classifier (EP-NCC). Unlike NCC, EP-NCC takes into consideration the lower and upper prior probabilities of the class variable in the estimation of the lower and upper conditional probabilities. We demonstrate that, with our proposed EP-NCC, the predictions are more informative than with NCC without increasing the risk of making erroneous predictions. An experimental analysis carried out in this work shows that EP-NCC significantly outperforms NCC and obtains statistically equivalent results to the algorithm proposed so far for Imprecise Classification based on decision trees, even though EP-NCC is computationally simpler. Therefore, EP-NCC is more suitable to be applied to large datasets for Imprecise Classification than the methods proposed so far in this field. This is an important issue in favor of our proposal due to the increasing amount of data in every area. This work has been supported by UGR-FEDER funds under Project A-TIC-344-UGR20, by the “FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades” under Project P20_00159, and by research scholarship FPU17/02685

Repositorio Institucional Universidad de Granada

A data mining approach to guide students through the enrollment process based on academic performance

Author: Alvarado Gustavo
Chue Gallardo Jorge
Estrella Jhonny
Ortigosa Álvaro
Peche Juan Pablo
Vialardi Sacín César
Vinatea Bruno
Publication venue: MTT Agrifood Research Finland
Publication date: 01/01/2011

Student academic performance at universities is crucial for education management systems. Many actions and decisions are made based on it, specifically the enrollment process. During enrollment, students have to decide which courses to sign up for. This research presents the rationale behind the design of a recommender system to support the enrollment process using the students’ academic performance record. To build this system, the CRISP-DM methodology was applied to data from students of the Computer Science Department at University of Lima, Perú. One of the main contributions of this work is the use of two synthetic attributes to improve the relevance of the recommendations made. The first attribute estimates the inherent difficulty of a given course. The second attribute, named potential, is a measure of the competence of a student for a given course based on the grades obtained in related courses. Data was mined using C4.5, KNN (K-nearest neighbor), Naïve Bayes, Bagging and Boosting, and a set of experiments was developed in order to determine the best algorithm for this application domain. Results indicate that Bagging is the best method regarding predictive accuracy. Based on these results, the “Student Performance Recommender System” (SPRS) was developed, including a learning engine. SPRS was tested with a sample group of 39 students during the enrollment process. Results showed that the system had a very good performance under real-life conditions

Repositorio Institucional Ulima

An Analysis of Reduced Error Pruning

Author: Elomaa T.
Kaariainen M.
Publication venue: AI Access Foundation
Publication date: 03/06/2011

Top-down induction of decision trees has been observed to suffer from the inadequate functioning of the pruning phase. In particular, it is known that the size of the resulting tree grows linearly with the sample size, even though the accuracy of the tree does not improve. Reduced Error Pruning is an algorithm that has been used as a representative technique in attempts to explain the problems of decision tree learning. In this paper we present analyses of Reduced Error Pruning in three different settings. First we study the basic algorithmic properties of the method, properties that hold independent of the input decision tree and pruning examples. Then we examine a situation that intuitively should lead to the subtree under consideration being replaced by a leaf node, one in which the class label and attribute values of the pruning examples are independent of each other. This analysis is conducted under two different assumptions. The general analysis shows that the pruning probability of a node fitting pure noise is bounded by a function that decreases exponentially as the size of the tree grows. In a specific analysis we assume that the examples are distributed uniformly to the tree. This assumption lets us approximate the number of subtrees that are pruned because they do not receive any pruning examples. This paper clarifies the different variants of the Reduced Error Pruning algorithm, brings new insight to its algorithmic properties, analyses the algorithm with fewer imposed assumptions than before, and includes the previously overlooked empty subtrees in the analysis
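For orientation, here is a minimal sketch of one variant of Reduced Error Pruning: a bottom-up pass that replaces a subtree with a leaf whenever the leaf makes no more errors on a separate pruning set. The Node class, the tie-breaking rule (prune on ties) and the handling of subtrees that receive no pruning examples (they are pruned) are assumptions chosen for brevity; the paper discusses several variants that differ on exactly these points.

```python
class Node:
    """Decision tree node; a node with attribute=None is a leaf."""
    def __init__(self, label, attribute=None, children=None):
        self.label = label            # majority class of training examples here
        self.attribute = attribute    # attribute tested at this node
        self.children = children or {}

    def is_leaf(self):
        return self.attribute is None

def predict(node, x):
    """Follow attribute tests; fall back to the node label on unseen values."""
    while not node.is_leaf():
        child = node.children.get(x[node.attribute])
        if child is None:
            break
        node = child
    return node.label

def reduced_error_prune(node, examples):
    """Prune in place, bottom-up; return errors of the result on examples."""
    leaf_errors = sum(1 for _, y in examples if y != node.label)
    if node.is_leaf():
        return leaf_errors
    subtree_errors = 0
    for value, child in node.children.items():
        subset = [(x, y) for x, y in examples if x[node.attribute] == value]
        subtree_errors += reduced_error_prune(child, subset)
    # Examples with an unseen attribute value are classified by node.label.
    subtree_errors += sum(
        1 for x, y in examples
        if x[node.attribute] not in node.children and y != node.label
    )
    if leaf_errors <= subtree_errors:  # ties favour the smaller tree
        node.attribute, node.children = None, {}
        return leaf_errors
    return subtree_errors

inner = Node("pos", "b", {0: Node("pos"), 1: Node("neg")})
root = Node("neg", "a", {0: Node("neg"), 1: inner})
pruning_set = [({"a": 0, "b": 0}, "neg"),
               ({"a": 1, "b": 0}, "pos"),
               ({"a": 1, "b": 1}, "pos")]
print(reduced_error_prune(root, pruning_set))  # 0
```

In this toy run the inner test on b is collapsed into a "pos" leaf because it misclassifies one pruning example, while the root test on a is kept.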

arXiv.org e-Print Archive

Crossref

From Binary to Multi-Class Divisions: improvements on Hierarchical Divisive Human Activity Recognition

Author: Tomás Vieira Caldas
Publication date: 12/07/2019

Repositório Aberto da Universidade do Porto

Classification of Malaysian students' tendency in choosing TVET after secondary school using analytic hierarchy process and decision tree model: a case study in northern part of Malaysia

Author: Hong Chia Ming
Publication date: 01/01/2021

The Technical and Vocational Education and Training (TVET) programme is a channel to produce skilled workers because it provides theoretical knowledge and practical skills for students. However, the current enrolment in the TVET programme among secondary school leavers is still low. Therefore, this study aims to uncover the factors that affect students’ tendency to enrol in the TVET programme. The perception of students, public, instructors, employers, parents, besides facility, cost and policy have been discovered as the factors in this study. In the first phase, Analytic Hierarchy Process (AHP) is used to determine the level of importance of each factor. Based on the outcome of the first phase, various Decision Tree models are developed to classify the students’ tendency to enrol in the TVET programme. The respondents in this study are students from the TVET programme and secondary schools. The result of AHP shows that the factor of parents is the most important factor, followed by instructors, employers, students, cost, facility, policy and public. Then, four types of Decision Tree, namely ID3, CART, C4.5 and CHAID, are generated to classify the students’ tendency based on the four most important factors (parents, instructors, employers and students). The ID3 tree in 5-fold Cross Validation is selected as the best model due to its low misclassification rate (0.1355), high accuracy rate (0.8645), high precision rate (0.9143), high recall rate (0.8828), and high F-score (0.8983) with tree depth of 7 and maximum branches of 3. Hence, in the future, this model can be used to classify the students’ tendency to enrol in the TVET programme. It can also assist the government to implement strategic plans such as organizing campaigns or providing on-the-job training to students in the TVET programme. Therefore, skilled workers that can adapt to new technology and innovation could be produced
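The figures reported for the ID3 model can be cross-checked with a short sketch, assuming the F-score is the usual F1 (the harmonic mean of precision and recall) and that the misclassification rate is one minus accuracy. Only the numbers come from the abstract; the helper name is illustrative.

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.9143, 0.8828
accuracy, misclassification = 0.8645, 0.1355

print(round(f_score(precision, recall), 4))     # 0.8983
print(round(accuracy + misclassification, 4))   # 1.0
```

Both checks agree with the reported values, so the metrics in the abstract are internally consistent under these assumptions.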

Universiti Utara Malaysia: UUM eTheses