
    Non-uniform Feature Sampling for Decision Tree Ensembles

    We study the effectiveness of non-uniform randomized feature selection in decision tree classification. We experimentally evaluate two feature selection methodologies based on information extracted from the provided dataset: (i) leverage-scores-based and (ii) norm-based feature selection. Experimental evaluation indicates that such approaches can be more effective than naive uniform feature selection while achieving performance comparable to the random forest algorithm [3]. Comment: 7 pages, 7 figures, 1 table
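    The norm-based scheme described above can be sketched in a few lines: features are drawn with probability proportional to their (squared) column norms instead of uniformly. The weighting and the toy data below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def norm_based_feature_sample(X, k, rng=None):
    # Sample k distinct feature indices with probability proportional
    # to squared column norms (a sketch of norm-based selection; the
    # paper's exact weighting scheme may differ).
    rng = np.random.default_rng(rng)
    norms = np.linalg.norm(X, axis=0) ** 2
    probs = norms / norms.sum()
    return rng.choice(X.shape[1], size=k, replace=False, p=probs)

# Toy data: feature 0 has a much larger norm than the others,
# so it is almost always included in the sampled subset.
X = np.array([[10.0, 0.1, 0.2],
              [9.0,  0.2, 0.1],
              [11.0, 0.1, 0.3]])
idx = norm_based_feature_sample(X, k=2, rng=0)
```

    A tree would then be grown using only the columns in `idx`, with leverage-score-based selection differing only in the sampling probabilities used.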

    University entry selection framework using rule-based and back-propagation

    Processing thousands of applications can be a challenging task, especially when applicants do not consider the university requirements and their own qualifications. The selection officer has to check the program requirements and calculate the merit score of each applicant. This process is based on rules determined by the Ministry of Education, and the institution has to select the qualified applicants from among thousands of applications. In recent years, several student selection methods have been proposed using fuzzy multiple decision making and decision trees. These approaches have produced high accuracy and good detection rates on closed-domain university data. However, the current selection procedure requires admission officers to manually evaluate the applications and match applicants' qualifications with the programs they applied for. Because the selection process is tedious and prone to mistakes, a comprehensive approach to detect and identify qualified applicants for university enrollment is highly desired. In this work, a student selection framework using rule-based processing and a back-propagation neural network is presented. Two phases are involved: the first, pre-processing, applies rules to check the university requirements, calculate merit scores, and convert the data to serve as input for the next phase. The second phase uses a back-propagation neural network model to evaluate the qualified candidates for admission to particular programs. This means only the data of applicants qualified in the first phase is sent on for further processing. The dataset consists of 3,790 records from Universiti Pendidikan Sultan Idris.
The experiments have shown that the proposed rule-based and back-propagation neural network method performs well: the framework has been successfully implemented and validated, with an average accuracy of more than 95% for student selection across all sets of test data.
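    The two-phase pipeline can be sketched as a rule-based filter followed by a small feed-forward scorer. The field names, the merit threshold, and the random network weights below are illustrative assumptions standing in for the paper's actual rules and trained back-propagation model.

```python
import numpy as np

def meets_requirements(applicant, min_merit=70.0):
    # Phase 1: rule-based screening. The merit threshold and the
    # required-subject flag are hypothetical placeholder rules.
    return applicant["merit"] >= min_merit and applicant["has_required_subject"]

def nn_score(features, W1, b1, W2, b2):
    # Phase 2: a tiny one-hidden-layer network (stand-in for the
    # trained back-propagation model) mapping features to a score in (0, 1).
    h = np.tanh(features @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

# Random untrained weights, purely for illustration.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=4), 0.0

applicants = [
    {"merit": 85.0, "has_required_subject": True,
     "features": np.array([0.85, 1.0])},
    {"merit": 55.0, "has_required_subject": True,
     "features": np.array([0.55, 1.0])},
]
# Only applicants passing phase 1 are scored in phase 2.
qualified = [a for a in applicants if meets_requirements(a)]
scores = [float(nn_score(a["features"], W1, b1, W2, b2)) for a in qualified]
```

    The key design point the abstract describes is the hand-off: phase 1 prunes the candidate pool so the neural network only ever sees applicants who already satisfy the hard requirements.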

    Apprentissage et forêts aléatoires (Learning and random forests)

    This thesis is devoted to a nonparametric estimation method called random forests, introduced by Breiman in 2001. Extensively used in a variety of areas, random forests exhibit good empirical performance and can handle massive data sets. However, the mathematical forces driving the algorithm remain largely unknown. After reviewing the theoretical literature, we focus on the link between infinite forests (analyzed in theory) and finite forests (used in practice), with the aim of narrowing the gap between theory and practice. In particular, we propose a way to select the number of trees such that the errors of finite and infinite forests are similar. We also study quantile forests, a class of algorithms close in spirit to Breiman's forests. In this context, we prove the benefit of tree aggregation: while each individual tree of a quantile forest is not consistent, with a proper subsampling step the forest is. Next, we show the connection between forests and certain kernel estimates, which can be made explicit in some cases; we also establish upper bounds on the rate of convergence of these kernel estimates. We then prove two theorems on the consistency of both pruned and unpruned Breiman forests, stressing the importance of subsampling for the consistency of unpruned forests. Finally, we present the results of a DREAM challenge whose goal was to predict the toxicity of several compounds for several patients based on their genetic profiles.
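    The finite-versus-infinite forest gap the abstract discusses can be illustrated with a toy Monte Carlo experiment: if each randomized "tree" is a noisy estimator of the same limit prediction, averaging M of them shrinks the randomization variance like 1/M, so the finite forest's error approaches that of the infinite forest as M grows. This is a deliberate simplification, not the thesis's actual construction.

```python
import numpy as np

rng = np.random.default_rng(42)
target = 1.0       # the (infinite-forest) limit prediction
tree_noise = 0.5   # spread of individual randomized tree predictions

def finite_forest(M):
    # Average of M randomized 'tree' predictions around the target.
    trees = target + tree_noise * rng.normal(size=M)
    return trees.mean()

# Variance of the finite-forest prediction over repeated draws:
# averaging 10x more trees should cut the variance by roughly 10x.
var_small = np.var([finite_forest(10) for _ in range(2000)])
var_large = np.var([finite_forest(100) for _ in range(2000)])
```

    In this idealized setting the variance of the mean of M i.i.d. trees is exactly tree_noise**2 / M, which is the kind of relationship that motivates choosing the number of trees so that the finite-forest error is close to its infinite limit.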