6 research outputs found

    Computerized adaptive test and decision trees: A unifying approach

    In the last few years, several articles have proposed decision trees (DTs) as an alternative to computerized adaptive tests (CATs). These works have focused on the differences between the two methods, with the aim of identifying the advantages of each and thus determining when it is preferable to use one or the other. This article presents Tree-CAT, a new technique for building CATs that, unlike existing work, exploits the similarities between CATs and DTs. The technique creates CATs that minimise the mean square error in the estimation of the examinee’s ability level while controlling the items’ exposure rates. The decision tree is built sequentially by an algorithmic procedure that selects the items associated with each tree branch by solving a linear program. In addition, the method offers further advantages over alternative item selection techniques with exposure control, such as instant item selection and simultaneous administration of the test to an unlimited number of participants. These advantages allow accurate on-line CATs to be implemented even when the item selection method is computationally costly. Numerical experiments were conducted in Uranus, a supercomputer cluster located at Universidad Carlos III de Madrid and jointly funded by EU-FEDER funds and by the Spanish Government via the National Projects No. UNC313-4E-2361, No. ENE2009-12213-C03-03, No. ENE2012-33219, No. ENE2012-31753 and No. ENE2015-68265-P.
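    The item selection step can be pictured as a small linear program. Below is a minimal Python sketch, assuming a tree node where each candidate item has a pre-computed expected mean square error and each item's exposure rate is capped; the function name, the objective, and the constraint set are illustrative assumptions, not the published Tree-CAT formulation.

```python
import numpy as np
from scipy.optimize import linprog

def select_item_probabilities(mse_per_item, max_exposure=0.3):
    # min_p  sum_i p_i * mse_i   s.t.  sum_i p_i = 1,  0 <= p_i <= max_exposure
    n = len(mse_per_item)
    result = linprog(
        c=mse_per_item,                    # expected MSE if item i is administered
        A_eq=np.ones((1, n)), b_eq=[1.0],  # the p_i form a probability distribution
        bounds=[(0.0, max_exposure)] * n,  # per-item exposure-rate cap
        method="highs",
    )
    return result.x

# Five hypothetical candidate items with pre-computed expected MSEs at a node.
print(select_item_probabilities(np.array([0.9, 0.7, 0.8, 0.65, 0.75])).round(3))
```

    Solving one such program per branch is consistent with building the whole tree offline, before any examinee takes the test.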

    cat.dt: An R package for fast construction of accurate Computerized Adaptive Tests using Decision Trees

    This article introduces the cat.dt package for the creation of Computerized Adaptive Tests (CATs). Unlike existing packages, cat.dt represents the CAT as a Decision Tree (DT) structure. This allows the test to be built before its administration, ensuring that the creation time is independent of the number of participants. Moreover, to accelerate the construction of the tree, the package controls its growth by joining nodes with similar estimations or distributions of the ability level, and uses techniques such as message passing and pre-calculations. The constructed tree, as well as the estimation procedure, can be visualized with the graphical tools included in the package. An experiment designed to evaluate its performance shows that cat.dt drastically reduces the computational time needed to create CATs without compromising accuracy. This article has been funded by the Spanish National Project No. RTI2018-101857-B-I00.
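    A minimal conceptual sketch (not the cat.dt API) of why a pre-built decision-tree CAT makes administration time independent of the number of participants: each examinee simply walks the tree, one node per administered item. All names below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    item_id: int | None = None   # item administered at this node; None at a leaf
    theta_hat: float = 0.0       # ability estimate stored at the node
    children: dict = field(default_factory=dict)  # response -> child Node

def administer(root, answer):
    # Walk the pre-built tree; answer(item_id) returns the examinee's response.
    node = root
    while node.item_id is not None:
        node = node.children[answer(node.item_id)]
    return node.theta_hat        # final ability estimate read off the leaf

# Tiny one-item example: response 1 = correct, 0 = incorrect.
root = Node(item_id=7, children={0: Node(theta_hat=-0.8), 1: Node(theta_hat=0.9)})
print(administer(root, answer=lambda item: 1))  # -> 0.9
```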

    Merged Tree-CAT: A fast method for building precise computerized adaptive tests based on decision trees

    Over the last few years, there has been increasing interest in the creation of Computerized Adaptive Tests (CATs) based on Decision Trees (DTs). Among the available methods, Tree-CAT demonstrated a mathematical equivalence between the two techniques. However, it has the drawback of requiring a high-performance cluster and several days of computation. This article presents Merged Tree-CAT, an extension of Tree-CAT that creates CATs based on DTs in just a few seconds on a personal computer. To do so, Merged Tree-CAT controls the growth of the tree by merging branches in which both the distribution and the estimation of the latent level are similar. Experiments show that the proposed method obtains estimations of the latent level comparable to those of state-of-the-art techniques, while drastically reducing computational time. Numerical experiments were conducted in Uranus, a supercomputer cluster located at Universidad Carlos III de Madrid and jointly funded by EU-FEDER funds and by the Spanish Government via the National Projects nos. UNC313-4E-2361, ENE2009-12213-C03-03, ENE2012-33219, ENE2012-31753 and ENE2015-68265-P. This article was also funded by the Spanish National Project no. RTI2018-101857-B-I00.
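    A hedged sketch of the merging criterion described above: two nodes are merged when both their ability estimates and their (discretised) posterior distributions over the latent level are close. The thresholds and the choice of total-variation distance are illustrative assumptions, not the published rule.

```python
import numpy as np

def similar(node_a, node_b, eps_theta=0.05, eps_tv=0.05):
    # Merge candidates must agree on the point estimate of the latent level...
    close_estimates = abs(node_a["theta_hat"] - node_b["theta_hat"]) < eps_theta
    # ...and on its posterior distribution, here compared by total-variation
    # distance between posteriors evaluated on a shared grid of theta values.
    tv = 0.5 * np.abs(node_a["posterior"] - node_b["posterior"]).sum()
    return close_estimates and tv < eps_tv
```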

    Accurate Prediction of Children's ADHD Severity Using Family Burden Information: A Neural Lasso Approach

    The deep lasso algorithm (dlasso) is introduced as a neural version of the statistical linear lasso algorithm that combines benefits from both methodologies: feature selection and automatic optimization of the parameters, including the regularization parameter. This last property makes dlasso particularly attractive for feature selection on small samples. In the first two experiments, dlasso obtained better performance than its non-neural counterpart (traditional lasso) in terms of predictive error and correct variable selection. Once its performance had been assessed, dlasso was used to determine whether the severity of symptoms in children with ADHD can be predicted from four scales that measure family burden, family functioning, parental satisfaction, and parental mental health. Results show that dlasso is able to predict parents’ assessment of the severity of their children’s inattention from only seven items of these scales, related to parents’ satisfaction and degree of parental burden. This research has been partially supported by the Spanish National Project No. RTI2018-101857-B-I00. It was also partly financed by the Community of Madrid within the multi-annual agreement with the University Carlos III Madrid, in its line of action Excellence for University Teaching Staff, established within the framework of the V Regional Plan for Scientific Research and Technological Innovation 2016–2020.
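    As a rough illustration of the underlying idea, the sketch below fits a linear model with an L1 penalty by proximal gradient descent and picks the regularization parameter on a validation split. This is a simplified stand-in for dlasso's automatic, gradient-based tuning, not the published algorithm; all names and the candidate grid are assumptions.

```python
import numpy as np

def soft_threshold(w, t):
    # Elementwise shrinkage operator used by proximal gradient descent (ISTA).
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_lasso(X, y, lam, lr=0.01, steps=2000):
    # Minimise (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by proximal gradient steps.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

def fit_dlasso_like(X_tr, y_tr, X_va, y_va, lams=(0.01, 0.1, 1.0)):
    # Stand-in for automatic tuning: pick lam by validation error, then refit.
    best = min(lams, key=lambda lam: np.mean(
        (X_va @ fit_lasso(X_tr, y_tr, lam) - y_va) ** 2))
    return fit_lasso(X_tr, y_tr, best), best
```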

    Iterative variable selection for high-dimensional data: prediction of pathological response in triple-negative breast cancer

    In the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole-genome context. The process of defining the list of genes that will characterize an expression profile remains unclear: it oscillates between selecting the genes or transcripts of interest based on previous clinical evidence and performing a whole-transcriptome analysis that rests on advanced statistics. This paper introduces a methodology for the variable selection and model estimation problems in the high-dimensional setting, which can be particularly useful in the whole-genome context. Results are validated using simulated data and a real dataset from a triple-negative breast cancer study.
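    For a concrete picture of regularized variable selection in a p >> n setting, the following sketch uses cross-validated lasso to recover a handful of truly relevant variables among hundreds of noise ones. It illustrates the general approach discussed above, not the paper's specific iterative algorithm; the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 500))      # 60 samples, 500 candidate markers
beta = np.zeros(500)
beta[:5] = 2.0                          # only five truly relevant variables
y = X @ beta + rng.standard_normal(60)

model = LassoCV(cv=5).fit(X, y)         # regularization strength chosen by CV
selected = np.flatnonzero(model.coef_)  # indices of the retained variables
print(selected)
```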

    Variable selection algorithms in generalized linear models

    International Mention in the doctoral degree.

    This thesis was developed at University Carlos III of Madrid, motivated by a collaboration with the Gregorio Marañón General University Hospital in Madrid. It is framed within the field of Penalized Linear Models, specifically Variable Selection in Regression, Classification and Survival Models, but it also explores other techniques such as Variable Clustering and Semi-Supervised Learning.

    In recent years, variable selection techniques based on penalized models have gained considerable importance. With the advance of technology in the last decade, it has become possible to collect and process huge volumes of data with algorithms of greater computational complexity. However, although it seemed that models providing simple and interpretable solutions would be definitively displaced by more complex ones, they have proved to be still very useful. Indeed, in a practical sense, a model that can filter important information and be easily extrapolated and interpreted by a human is often more valuable than a more complex model that is incapable of providing any feedback on the underlying problem, even when the latter offers better predictions.

    This thesis focuses on high-dimensional problems, in which the number of variables is of the same order as, or larger than, the sample size. In this type of problem, restrictions that eliminate variables from the model often lead to better performance and interpretability of the results. To fit linear regression in high dimensions, the Sparse Group Lasso regularization method has proven to be very efficient. However, to use the Sparse Group Lasso in practice, there are two critical aspects on which the solution depends: the correct selection of the regularization parameters, and a prior specification of groups of variables. Very little research has focused on algorithms for selecting the regularization parameters of the Sparse Group Lasso, and none has explored the grouping issue and how to relax this restriction, which in practice is an obstacle to using the method.

    The main objective of this thesis is to propose new methods of variable selection in generalized linear models. The thesis explores the Sparse Group Lasso regularization method, analyzing in detail the correct selection of the regularization parameters, and finally relaxes the problem of group specification by introducing a new variable clustering algorithm that is based on the Sparse Group Lasso but is much more flexible and extends it. In a parallel but related line of research, the thesis reveals a connection between penalized linear models and semi-supervised learning.

    The thesis is structured as a compendium of articles, divided into four chapters. Each chapter has a structure and contents independent of the rest; however, all follow a common thread. First, variable selection methods based on regularization are introduced, describing the optimization problem that arises and a numerical algorithm to approximate its solution when a term of the objective function is not differentiable; the latter occurs naturally when penalties inducing variable selection are added. A contribution of this work is the iterative Sparse Group Lasso, an algorithm to estimate the coefficients of the Sparse Group Lasso model without the need to specify the regularization parameters. It uses coordinate descent for the parameters while approximating the error function on a validation sample. Moreover, with respect to the traditional Sparse Group Lasso, this new proposal considers a more general penalty in which each group has a flexible weight.

    A separate chapter presents an extension that uses the iterative Sparse Group Lasso to rank the variables in the model according to a defined importance index. The introduction of this index is motivated by problems with a large number of variables, only a few of which are directly related to the response variable. This methodology is applied to genetic data, with promising results. A further significant contribution of this thesis is the Group Linear Algorithm with Sparse Principal decomposition, which is also motivated by problems in which only a small number of variables influence the response variable. However, unlike other methodologies, here the relevant variables are not necessarily among the observed ones. This makes it a potentially powerful method, adaptable to multiple scenarios, which is also, as a side effect, a supervised variable clustering algorithm; moreover, it can be interpreted as an extension of the Sparse Group Lasso that does not require an initial specification of the groups.

    From a computational point of view, this work presents an organized framework for solving problems in which the objective function is a linear combination of a differentiable error term and a penalty. The flexibility of this implementation allows it to be applied to problems in very different contexts, for example the proposed Generalized Elastic Net for semi-supervised learning. Regarding its main objective, the thesis offers a framework for the exploration of generalized interpretable models. The last chapter compiles a summary of the contributions of the thesis and outlines future lines of work within its scope.
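    The Sparse Group Lasso penalty described above, an L1 term plus weighted groupwise L2 terms, admits a closed-form proximal operator, which is the basic building block of the descent schemes mentioned in the abstract. The sketch below implements that standard operator with the flexible per-group weights the thesis introduces; the interface and example values are illustrative assumptions.

```python
import numpy as np

def prox_sparse_group_lasso(beta, groups, t, lam1, lam2, weights):
    # prox of t * (lam1 * ||b||_1 + lam2 * sum_g w_g * ||b_g||_2):
    # elementwise soft-thresholding followed by groupwise shrinkage.
    b = np.sign(beta) * np.maximum(np.abs(beta) - t * lam1, 0.0)
    for g, idx in enumerate(groups):
        norm = np.linalg.norm(b[idx])
        scale = max(0.0, 1.0 - t * lam2 * weights[g] / norm) if norm > 0 else 0.0
        b[idx] = scale * b[idx]
    return b

# Two groups of three variables each, with flexible per-group weights:
# the weakly informative second group is zeroed out entirely.
beta = np.array([0.9, -0.2, 0.4, 0.05, -0.03, 0.02])
groups = [np.arange(0, 3), np.arange(3, 6)]
print(prox_sparse_group_lasso(beta, groups, t=0.1, lam1=0.5, lam2=1.0,
                              weights=[1.0, 2.0]))
```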
    Simulations in Sections 3.3 and 3.4 were carried out in Uranus, a supercomputer cluster located at Universidad Carlos III de Madrid and funded jointly by EU-FEDER funds and by the Spanish Government via the National Projects No. UNC313-4E-2361, No. ENE2009-12213-C03-03, No. ENE2012-33219 and No. ENE2015-68265-P.
    Doctoral Programme in Mathematical Engineering, Universidad Carlos III de Madrid.
    Thesis committee: President, María Luz Durban Reguera; Secretary, Alberto José Ferrer Riquelme; Member, Jeff Goldsmith.