8 research outputs found

    Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies

    Lung cancer is among the most common cancers in the United States, in terms of both incidence and mortality. In 2009, it is estimated that more than 150,000 deaths will result from lung cancer alone. Genetic information is an extremely valuable data source in characterizing the personal nature of cancer. Over the past several years, investigators have conducted numerous association studies in which intensive genetic data are collected on relatively few patients compared to the number of gene predictors, with one scientific goal being to identify genetic features associated with cancer recurrence or survival. In this note, we propose high-dimensional survival analysis through a new application of boosting, a powerful tool in machine learning. Our approach is based on an accelerated lifetime model and on minimizing the sum of pairwise differences in residuals. We apply our method to a recent microarray study of lung adenocarcinoma and find that our ensemble is composed of 19 genes, while a proportional hazards (PH) ensemble is composed of nine genes, a proper subset of the 19-gene panel. In one of our simulation scenarios, we demonstrate that PH boosting in a misspecified model tends to underfit and ignore moderately sized covariate effects, on average. Diagnostic analyses suggest that the PH assumption is not satisfied in the microarray data, which may explain, in part, the discrepancy in the sets of active coefficients. Our simulation studies and comparative data analyses demonstrate how statistical learning by PH models alone is insufficient. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/), http://dx.doi.org/10.1214/10-AOAS426, by the Institute of Mathematical Statistics (http://www.imstat.org).
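    As a rough illustration of the loss behind this approach (a sketch, not the authors' boosting implementation), the Python snippet below evaluates a Gehan-type sum of pairwise residual differences for an accelerated lifetime model; the function name gehan_loss and the simulated data are purely illustrative assumptions.

```python
import numpy as np

def gehan_loss(beta, X, log_time, event):
    """Sum-of-pairwise-differences (Gehan-type) loss for an AFT model.

    Residuals are e_i = log(T_i) - x_i' beta. A pair (i, j) contributes
    e_j - e_i whenever subject i had an observed event and e_i < e_j;
    censored subjects only enter as the 'j' member of a pair.
    """
    e = log_time - X @ beta                       # AFT residuals
    diff = e[:, None] - e[None, :]                # diff[i, j] = e_i - e_j
    return np.sum(event[:, None] * np.maximum(-diff, 0.0))

# toy usage with simulated data (illustration only, not the paper's study)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = np.array([1.0, -0.5, 0.0, 0.0, 0.3])
log_time = X @ beta_true + 0.5 * rng.gumbel(size=50)
event = rng.integers(0, 2, size=50).astype(float)  # 1 = event observed
print(gehan_loss(beta_true, X, log_time, event))
```

    Because the loss is convex and piecewise linear in beta, it lends itself to gradient-style boosting updates, which is the direction the abstract describes.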

    New insights into the suitability of the third dimension for visualizing multivariate/multidimensional data: a study based on loss of quality quantification

    Most visualization techniques have traditionally used two-dimensional rather than three-dimensional representations to visualize multidimensional and multivariate data. In this article, a way to demonstrate the underlying superiority of three-dimensional over two-dimensional representation is proposed. Specifically, it is based on the inevitable quality degradation produced when reducing the data dimensionality. The problem is tackled from two different approaches: a visual and an analytical one. First, a set of statistical tests (point classification, distance perception, and outlier identification) using two-dimensional and three-dimensional visualizations was carried out on a group of 40 users. The results indicate that the inclusion of a third dimension improves accuracy; however, they do not allow definitive conclusions to be drawn about the superiority of three-dimensional representation. Therefore, in order to draw further conclusions, a deeper study based on an analytical approach is proposed. The aim is to quantify the real loss of quality produced when the data are visualized in two-dimensional and three-dimensional spaces, relative to the original data dimensionality, and to analyze the difference between them. To achieve this, a recently proposed methodology is used. The results obtained by the analytical approach show that the loss of quality reaches significantly high values only when switching from three-dimensional to two-dimensional representation. The considerable quality degradation suffered in the two-dimensional visualization strongly suggests the suitability of the third dimension for visualizing data.
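    The snippet below is only a stand-in for the loss-of-quality methodology used in the article: it uses scikit-learn's trustworthiness score to compare how well 2D and 3D PCA projections preserve local neighbourhoods of a generic dataset (Iris here, not the study's data), as one hedged way to see the 2D-versus-3D gap numerically.

```python
import numpy as np
from sklearn.datasets import load_iris           # stand-in dataset
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X = load_iris().data                              # 4-dimensional original space

# project to 2D and to 3D and measure how well local neighbourhoods survive
for k in (2, 3):
    X_low = PCA(n_components=k).fit_transform(X)
    t = trustworthiness(X, X_low, n_neighbors=10)
    print(f"{k}D embedding: trustworthiness = {t:.3f}")
```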

    A methodology to compare dimensionality reduction algorithms in terms of loss of quality

    Dimensionality Reduction (DR) is attracting more attention these days as a result of the increasing need to handle huge amounts of data effectively. DR methods allow the number of initial features to be reduced considerably, until a set is found that preserves the original properties of the data. However, their use entails an inherent loss of quality that is likely to affect the understanding of the data for analysis. This loss of quality can be a determining factor when selecting a DR method, given the differing nature of each method. In this paper, we propose a methodology that allows different DR methods to be analyzed and compared with regard to the loss of quality they produce. The methodology uses the concept of preservation of geometry (quality assessment criteria) to assess the loss of quality. Experiments have been carried out using the best-known DR algorithms and quality assessment criteria from the literature, applied to 12 real-world datasets. The results obtained so far show that it is possible to establish a method for selecting the most appropriate DR method in terms of minimum loss of quality. The experiments have also highlighted some interesting relationships between the quality assessment criteria. Finally, the methodology makes it possible to establish the appropriate dimensionality to which data should be reduced, while incurring a minimum loss of quality.
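    A minimal sketch of the idea of ranking DR methods by a geometry-preservation criterion is given below; it uses a simple normalized-stress measure rather than the paper's specific quality assessment criteria, and the chosen algorithms and dataset are illustrative assumptions only.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, MDS

def normalized_stress(X_high, X_low):
    """Simple distance-preservation criterion: 0 means pairwise
    distances are perfectly preserved by the low-dimensional embedding."""
    d_high = pdist(X_high)
    d_low = pdist(X_low)
    return np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2)

X = load_digits().data[:300]        # small subset so MDS/Isomap stay fast

methods = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_components=2),
    "MDS": MDS(n_components=2, random_state=0),
}
for name, dr in methods.items():
    X_low = dr.fit_transform(X)
    print(f"{name}: stress = {normalized_stress(X, X_low):.3f}")
```

    Repeating the loop over several target dimensionalities would give the kind of quality-versus-dimensionality curve the methodology relies on.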

    Fundamentos e aplicações da metodologia de superfície de resposta [Fundamentals and applications of response surface methodology]

    Master's dissertation in Statistics, Mathematics and Computation presented to Universidade Aberta. The optimization of processes and products, the characterization of systems, and the quantification of the impact of uncertainty in the input parameters on the system response are of growing importance in research across many areas of society, whether because of their economic impact or because of the consequences that may follow. Response Surface Methodology (RSM), in its various approaches, has proven to be a tool of major importance in these fields. Since the publication of the paper by Box and Wilson (1951), the methodology has been a subject of interest to researchers with regard to both its fundamentals and its applications. In the traditional approach, the methodology is sequential, and each iteration involves three steps: defining the experimental design, fitting the model, and optimization. Over these six decades, experimental designs have been developed in response to new applications and objectives, so as to provide the most accurate model possible for the purpose at hand.
The models used to approximate the response have evolved from first- and second-order polynomial models, through various nonlinear models, to machine learning models. Optimization methods have gone through the same process of expansion, in order to meet increasingly demanding challenges. This path is closely tied to developments in computation and simulation: whereas at the beginning the methodology was applied only to real systems, today the simulation of systems, in many areas and with increasing degrees of complexity, relies on metamodels to reduce the associated computational cost. The probabilistic quantification of uncertainty is an excellent example of the application of RSM: the impact of uncertainty in the input variables on the system response can be quantified by implementing the methodology with a stochastic approach, which also allows sensitivity analysis to be carried out. This work surveys the developments of RSM, at the various stages of its implementation, over the six decades that have elapsed since its introduction, and presents three applications: in the ceramics industry, in forestry production, and in healthcare, specifically in breast cancer prognosis.
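    To make the fit-and-optimize step concrete, the sketch below fits a second-order polynomial response surface to toy two-factor data by least squares and locates its stationary point; the data and coefficient values are invented for illustration and are not taken from the dissertation.

```python
import numpy as np

# toy two-factor experiment (illustrative values only)
rng = np.random.default_rng(1)
x1 = rng.uniform(-1, 1, 30)
x2 = rng.uniform(-1, 1, 30)
y = 10 + 2*x1 - 1.5*x2 - 3*x1**2 - 2*x2**2 + 0.5*x1*x2 + rng.normal(0, 0.2, 30)

# second-order model: y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2
D = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1*x2])
b = np.linalg.lstsq(D, y, rcond=None)[0]

# stationary point of y = b0 + x'g + x'Bx is x_s = -(1/2) B^{-1} g
g = b[1:3]                                        # linear coefficients
B = np.array([[b[3], b[5] / 2],
              [b[5] / 2, b[4]]])                  # quadratic coefficient matrix
x_stat = np.linalg.solve(-2 * B, g)
print("fitted coefficients:", np.round(b, 2))
print("stationary point:", np.round(x_stat, 3))
```

    In the sequential RSM cycle this fitted surface would guide the next experimental design, for example by steepest ascent toward the stationary point when it is a maximum.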

    An Inductive Learning Approach to Prognostic Prediction

    This paper introduces the Recurrence Surface Approximation, an inductive learning method based on linear programming that predicts recurrence times using censored training examples, that is, examples in which the available training output may be only a lower bound on the "right answer." This approach is augmented with a feature selection method that chooses an appropriate feature set within the context of the linear programming generalizer. Computational results in the field of breast cancer prognosis are shown. A straightforward translation of the prediction method to an artificial neural network model is also proposed.
    1 INTRODUCTION. Machine learning methods have been successfully applied to the analysis of many different complex problems in recent years, including many biomedical applications. One field which can benefit from this type of approach is the analysis of survival or lifetime data (Lee, 1992; Miller Jr., 1981), in which the objective can be broadly defined as predicting…
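    A minimal sketch of the linear-programming idea is shown below, assuming a formulation in which censored examples are penalized only when the prediction falls below their observed lower bound; this is one plausible reading of the abstract, not the paper's exact Recurrence Surface Approximation, and the helper fit_censored_lp and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def fit_censored_lp(X, y, censored):
    """Fit a linear predictor w.x + b by LP with censored targets.

    Uncensored examples are penalized for any deviation from y; censored
    examples (y is only a lower bound) are penalized only when the
    prediction falls below that lower bound.
    """
    n, p = X.shape
    # decision variables: [w (p), b (1), s (n slack variables)]
    c = np.concatenate([np.zeros(p + 1), np.ones(n)])

    A, rhs = [], []
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        # prediction too low:  y_i - (w.x_i + b) <= s_i   (all examples)
        A.append(np.concatenate([-X[i], [-1.0], -e])); rhs.append(-y[i])
        if not censored[i]:
            # prediction too high: (w.x_i + b) - y_i <= s_i  (uncensored only)
            A.append(np.concatenate([X[i], [1.0], -e])); rhs.append(y[i])

    bounds = [(None, None)] * (p + 1) + [(0, None)] * n
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return res.x[:p], res.x[p]

# toy usage with synthetic censored data
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
t_true = X @ np.array([2.0, -1.0, 0.5]) + 10
censored = rng.random(40) < 0.3
y = np.where(censored, t_true - rng.uniform(0, 2, 40), t_true)  # lower bounds
w, b = fit_censored_lp(X, y, censored)
print(np.round(w, 2), round(b, 2))
```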