Computerized adaptive test and decision trees: A unifying approach
In the last few years, several articles have proposed decision trees (DTs) as an alternative to computerized adaptive tests (CATs). These works have focused on the differences between the two methods, with the aim of identifying the advantages of each and thus determining when one is preferable to the other. In this article, Tree-CAT, a new technique for building CATs, is presented. Unlike existing work, Tree-CAT exploits the similarities between CATs and DTs. The technique creates CATs that minimise the mean square error in the estimation of the examinee's ability level while controlling item exposure rates. The decision tree is built sequentially by means of an innovative algorithmic procedure that selects the items associated with each tree branch by solving a linear program. In addition, our work presents further advantages over alternative item selection techniques with exposure control, such as instant item selection and simultaneous administration of the test to an unlimited number of participants. These advantages allow accurate on-line CATs to be implemented even when the item selection method is computationally costly. Numerical experiments were conducted in Uranus, a supercomputer cluster located at Universidad Carlos III de Madrid and jointly funded by EU-FEDER funds and by the Spanish Government via the National Projects No. UNC313-4E-2361, No. ENE2009-12213-C03-03, No. ENE2012-33219, No. ENE2012-31753 and No. ENE2015-68265-P.
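The general idea of administering a CAT stored as a decision tree can be illustrated with a toy sketch. This is only an illustration of the DT-as-CAT structure, not the Tree-CAT construction algorithm, and all item names and ability values below are made up:

```python
# Toy sketch of a CAT represented as a decision tree: each internal
# node holds an item, each response leads to a child branch, and each
# leaf stores an ability estimate. All names are illustrative.

class Node:
    def __init__(self, item=None, children=None, theta=None):
        self.item = item          # item administered at this node
        self.children = children  # maps response -> child Node
        self.theta = theta        # ability estimate stored at a leaf

def administer(root, answer):
    """Walk the tree: present each node's item, branch on the
    response, and return the ability estimate at the reached leaf."""
    node = root
    while node.children is not None:
        response = answer(node.item)  # 0 = incorrect, 1 = correct
        node = node.children[response]
    return node.theta

# A depth-2 toy tree: a harder follow-up item after a correct answer.
tree = Node("item_A", {
    0: Node("item_B", {0: Node(theta=-1.5), 1: Node(theta=-0.5)}),
    1: Node("item_C", {0: Node(theta=0.5), 1: Node(theta=1.5)}),
})

print(administer(tree, lambda item: 1))  # every answer correct -> 1.5
```

Because the tree is built before administration, scoring an examinee reduces to this constant-time traversal, which is what makes instant item selection possible.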
cat.dt: An R package for fast construction of accurate Computerized Adaptive Tests using Decision Trees
This article introduces the cat.dt package for the creation of Computerized Adaptive Tests (CATs). Unlike existing packages, cat.dt represents the CAT as a Decision Tree (DT) structure. This allows the test to be built before its administration, ensuring that the creation time is independent of the number of participants. Moreover, to accelerate the construction of the tree, the package controls its growth by joining nodes with similar estimations or distributions of the ability level, and uses techniques such as message passing and pre-calculation. The constructed tree, as well as the estimation procedure, can be visualized using the graphical tools included in the package. An experiment designed to evaluate its performance shows that cat.dt drastically reduces the computational time needed to create CATs without compromising accuracy. This article has been funded by the Spanish National Project No. RTI2018-101857-B-I00.
Merged Tree-CAT: A fast method for building precise computerized adaptive tests based on decision trees
Over the last few years, there has been increasing interest in the creation of Computerized Adaptive Tests (CATs) based on Decision Trees (DTs). Among the available methods, Tree-CAT has demonstrated a mathematical equivalence between the two techniques. However, it has the inconvenience of requiring a high-performance cluster and several days of computation. This article presents Merged Tree-CAT, which extends the Tree-CAT technique to create CATs based on DTs in just a few seconds on a personal computer. To do so, Merged Tree-CAT controls the growth of the tree by merging branches in which both the distribution and the estimation of the latent level are similar. The experiments performed show that the proposed method obtains estimations of the latent level comparable to those obtained by state-of-the-art techniques, while drastically reducing the computational time. Numerical experiments were conducted in Uranus, a supercomputer cluster located at Universidad Carlos III de Madrid and jointly funded by EU-FEDER funds and by the Spanish Government via the National Projects nos. UNC313-4E-2361, ENE2009-12213-C03-03, ENE2012-33219, ENE2012-31753 and ENE2015-68265-P. This article was also funded by the Spanish National Project no. RTI2018-101857-B-I00.
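The branch-merging idea can be shown in miniature. The sketch below greedily merges leaves whose ability estimates fall within a tolerance, averaging them weighted by the probability of reaching each leaf; it is a hypothetical simplification of the idea, not the Merged Tree-CAT procedure itself:

```python
# Illustrative sketch (not the Merged Tree-CAT algorithm): collapse
# leaves whose ability estimates are within a tolerance, which keeps
# the tree from growing exponentially with its depth.

def merge_leaves(leaves, tol=0.1):
    """Greedily merge sorted (theta, weight) pairs whose estimates are
    within `tol`, averaging estimates weighted by reach probability."""
    merged = []
    for theta, w in sorted(leaves):
        if merged and abs(theta - merged[-1][0]) < tol:
            t0, w0 = merged[-1]
            merged[-1] = ((t0 * w0 + theta * w) / (w0 + w), w0 + w)
        else:
            merged.append((theta, w))
    return merged

# Four leaves, each reached with probability 0.25, collapse to two.
leaves = [(0.50, 0.25), (0.55, 0.25), (1.20, 0.25), (1.26, 0.25)]
print(merge_leaves(leaves))
```

The tolerance plays the role of the similarity criterion on the latent-level estimates; the actual method also compares the latent-level distributions, which this sketch omits.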
Accurate Prediction of Children's ADHD Severity Using Family Burden Information: A Neural Lasso Approach
The deep lasso algorithm (dlasso) is introduced as a neural version of the statistical linear lasso algorithm that draws benefits from both methodologies: feature selection and automatic optimization of the parameters (including the regularization parameter). This last property makes dlasso particularly attractive for feature selection on small samples. In the first two experiments conducted, dlasso obtained better performance than its non-neural counterpart (the traditional lasso) in terms of predictive error and correct variable selection. Once its performance had been assessed, dlasso was used to determine whether it is possible to predict the severity of symptoms in children with ADHD from four scales that measure family burden, family functioning, parental satisfaction, and parental mental health. Results show that dlasso is able to predict parents' assessment of the severity of their children's inattention from only seven items from these scales. The items are related to parents' satisfaction and degree of parental burden. This research has been partially supported by the Spanish National Project No. RTI2018-101857-B-I00. It was also partly financed by the Community of Madrid in the framework of the multi-annual agreement with the University Carlos III of Madrid, in its line of action Excellence for University Teaching Staff, established within the framework of the V Regional Plan for Scientific Research and Technological Innovation 2016–2020.
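The abstract does not detail dlasso's architecture, but the classical linear lasso it neuralises can be sketched with proximal gradient descent (ISTA). The data, the value of `lam`, and all names below are illustrative, not part of the dlasso method:

```python
# Sketch of the classical lasso baseline (not dlasso itself):
# proximal gradient descent (ISTA) with soft-thresholding.
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    n, p = X.shape
    beta = np.zeros(p)
    # Step size = 1 / Lipschitz constant of the gradient of the
    # smooth term (1/2n)||y - X beta||^2.
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Synthetic example: only the first two coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ beta_true + 0.1 * rng.normal(size=100)
beta_hat = lasso_ista(X, y, lam=0.1)
print(np.round(beta_hat, 2))
```

In the traditional lasso, `lam` must be tuned externally (e.g. by cross-validation); dlasso's selling point, per the abstract, is that this regularization parameter is optimized automatically along with the other parameters.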
Iterative variable selection for high-dimensional data: prediction of pathological response in triple-negative breast cancer
In the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole-genome context. The process of defining the list of genes that will characterize an expression profile remains unclear: it oscillates between selecting the genes or transcripts of interest based on previous clinical evidence and performing a whole-transcriptome analysis that rests on advanced statistics. This paper introduces a methodology for the variable selection and model estimation problems in the high-dimensional set-up, which can be particularly useful in the whole-genome context. Results are validated using simulated data and a real dataset from a triple-negative breast cancer study.
Variable selection algorithms in generalized linear models
International Mention in the doctoral degree. This thesis has been developed at Universidad Carlos III de Madrid,
motivated by a collaboration with the Gregorio Marañón General
University Hospital in Madrid. It is framed within the field of
Penalized Linear Models, specifically Variable Selection in Regression,
Classification and Survival Models, but it also explores other
techniques such as Variable Clustering and Semi-Supervised Learning.
In recent years, variable selection techniques based on penalized models
have gained considerable importance. With the advance of technologies
in the last decade, it has been possible to collect and process
huge volumes of data with algorithms of greater computational complexity.
However, although it seemed that models that provided simple
and interpretable solutions were going to be definitively displaced by
more complex ones, they have still proved to be very useful. Indeed, in
a practical sense, a model that is capable of filtering important information,
easily extrapolated and interpreted by a human, is often more
valuable than a more complex model that is incapable of providing
any kind of feedback on the underlying problem, even when the latter
offers better predictions.
This thesis focuses on high dimensional problems, in which the number
of variables is of the same order or larger than the sample size.
In this type of problem, restrictions that eliminate variables from the
model often lead to better performance and interpretability of the results.
To fit linear regression in high dimensions, the Sparse Group
Lasso regularization method has proven very efficient. However,
in order to use the Sparse Group Lasso in practice, there are two critical
aspects on which the solution depends: the correct selection of the
regularization parameters, and a prior specification of groups of variables.
Very little research has focused on algorithms for the selection
of the regularization parameters of the Sparse Group Lasso, and none
has explored the issue of the grouping and how to relax this restriction
that in practice is an obstacle to using this method.
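For reference, the Sparse Group Lasso estimator discussed here is commonly written (following the standard formulation) as the minimizer of a squared error plus a group penalty and an ℓ1 penalty:

```latex
\hat{\beta} = \arg\min_{\beta} \;
  \frac{1}{2n}\,\| y - X\beta \|_2^2
  + (1-\alpha)\,\lambda \sum_{g=1}^{G} \sqrt{p_g}\,\| \beta^{(g)} \|_2
  + \alpha\,\lambda\,\| \beta \|_1
```

Here $\beta^{(g)}$ denotes the coefficients of group $g$, of size $p_g$. The two critical aspects named above appear explicitly: the regularization parameters $\lambda$ and $\alpha$ (the group-versus-individual sparsity trade-off) must be selected correctly, and the partition of the variables into the $G$ groups must be specified in advance.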
The main objective of this thesis is to propose new methods of variable
selection in generalized linear models. This thesis explores the Sparse Group Lasso regularization method, analyzing in detail the
correct selection of the regularization parameters, and finally relaxing
the problem of group specification by introducing a new variable
clustering algorithm that is based on the Sparse Group Lasso but is much more
flexible and extends it. In a parallel but related line of research,
this thesis reveals a connection between penalized linear models and
semi-supervised learning.
This thesis is structured as a compendium of articles, divided into four
chapters. Each chapter has a structure and contents independent of
the rest; however, all of them follow a common thread. First, variable selection
methods based on regularization are introduced, describing the
optimization problem that appears and a numerical algorithm to approximate
its solution when a term of the objective function is not differentiable.
The latter occurs naturally when penalties inducing variable
selection are added. A contribution of this work is the iterative
Sparse Group Lasso, which is an algorithm to obtain the estimation
of the coefficients of the Sparse Group Lasso model, without the need
to specify the regularization parameters. It uses coordinate descent
for the parameters, while approximating the error function in a validation
sample. Moreover, with respect to the traditional Sparse Group
Lasso, this new proposal considers a more general penalty, where each
group has a flexible weight. A separate chapter presents an extension
that uses the iterative Sparse Group Lasso to order the variables in
the model according to a defined importance index. The introduction
of this index is motivated by problems in which there are a large
number of variables, only a few of which are directly related to the
response variable. This methodology is applied to genetic data, revealing
promising results. A further significant contribution of this
thesis is the Group Linear Algorithm with Sparse Principal decomposition,
which is also motivated by problems in which only a small
number of variables influence the response variable. However, unlike
other methodologies, in this case the relevant variables are not necessarily
among the observed data. This makes it a potentially powerful
method, adaptable to multiple scenarios, which is also, as a side effect,
a supervised variable clustering algorithm. Moreover, it can be
interpreted as an extension of the Sparse Group Lasso that does not
require an initial specification of the groups. From a computational point of view, this work presents an organized framework for solving
problems in which the objective function is a linear combination
of a differentiable error term and a penalty. The flexibility of this
implementation allows it to be applied to problems in very different
contexts, for example, the proposed Generalized Elastic Net for semi-supervised learning.
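The "differentiable error term plus penalty" framework described above can be sketched as generic proximal gradient descent, where each penalty enters only through its proximal operator. The code below is a minimal illustration with the group-lasso proximal operator plugged in; the names and data are assumptions, not the thesis implementation:

```python
# Generic sketch: minimize f(beta) + penalty(beta), where f is
# differentiable and the penalty is handled via its proximal operator.
import numpy as np

def proximal_gradient(grad_f, prox, beta0, step, n_iter=300):
    beta = beta0.copy()
    for _ in range(n_iter):
        beta = prox(beta - step * grad_f(beta), step)
    return beta

def prox_group_lasso(groups, lam):
    """Proximal operator of lam * sum_g ||beta_g||_2:
    block-wise soft-thresholding of each group's norm."""
    def prox(beta, step):
        out = beta.copy()
        for g in groups:
            norm = np.linalg.norm(beta[g])
            scale = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            out[g] = scale * beta[g]
        return out
    return prox

# Synthetic example: three groups of two variables; only group 0 active.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
beta_true = np.array([2.0, 2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=80)
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
n = X.shape[0]
grad = lambda b: X.T @ (X @ b - y) / n          # gradient of (1/2n)||y - Xb||^2
step = n / np.linalg.norm(X, 2) ** 2            # 1 / Lipschitz constant
beta_hat = proximal_gradient(grad, prox_group_lasso(groups, lam=0.1),
                             np.zeros(6), step)
print(np.round(beta_hat, 2))
```

Swapping `prox_group_lasso` for a different proximal operator (plain soft-thresholding, an elastic-net prox, etc.) changes the penalty without touching the optimizer, which is the flexibility the paragraph above refers to.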
Regarding its main objective, this thesis offers a framework for the
exploration of generalized interpretable models. In the last chapter,
in addition to compiling a summary of the contributions of the thesis,
future lines of work within the scope of the thesis are included.

Simulations in Sections 3.3 and 3.4 were carried out in Uranus, a supercomputer cluster located at Universidad Carlos III de Madrid and funded jointly by EU-FEDER funds and by the Spanish Government via the National Projects No. UNC313-4E-2361, No. ENE2009-12213-C03-03, No. ENE2012-33219 and No. ENE2015-68265-P. Doctoral Programme in Mathematical Engineering, Universidad Carlos III de Madrid. Committee: Chair, María Luz Durban Reguera; Secretary, Alberto José Ferrer Riquelme; Member, Jeff Goldsmit