8 research outputs found

    Selecting Representative Data Sets


    Prognostic Methods for Integrating Data from Complex Diseases

    Statistics in medical research has seen a vast surge with the development of high-throughput biotechnologies that provide thousands of measurements for each patient. These multi-layered data have clear potential to improve disease prognosis. Data integration is increasingly becoming essential in this context, to address problems such as increasing statistical power, resolving inconsistencies between studies, obtaining more reliable biomarkers and gaining a broader understanding of the disease. This thesis focuses on addressing the challenges in the development of statistical methods while contributing to methodological advancements in this field. We propose a clinical data analysis framework to obtain a model with good prediction accuracy that addresses missing data and model instability. A detailed pre-processing pipeline is proposed for miRNA data that removes unwanted noise and offers improved concordance with qRT-PCR data. Platform-specific models are developed to uncover biomarkers using mRNA, protein and miRNA data, and to identify the source with the most important prognostic information. This thesis explores two types of data integration: horizontal, the integration of the same type of data, and vertical, the integration of data from different platforms for the same patient. We use multiple miRNA datasets to develop a meta-analysis framework addressing the challenges in horizontal data integration using a multi-step validation protocol. For vertical data integration, we extend the pre-validation principle and derive platform-dependent weights for use with the weighted Lasso. Our study revealed that integration of multi-layered data is instrumental in improving prediction accuracy and in obtaining more biologically relevant biomarkers. A novel visualisation technique that examines prediction accuracy at the patient level revealed vital findings with translational impact in personalised medicine.
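
    As a minimal sketch of the weighted-Lasso step mentioned above (not the thesis code), the example below uses the standard reformulation in which each feature is divided by its penalty weight so that an ordinary Lasso solver can be used; the platform-dependent weights and the data are hypothetical.

        # Minimal sketch: weighted Lasso via feature rescaling (hypothetical weights and data).
        # Penalising |beta_j| by w_j is equivalent to fitting a plain Lasso on X_j / w_j
        # and dividing the fitted coefficients by w_j afterwards.
        import numpy as np
        from sklearn.linear_model import Lasso

        rng = np.random.default_rng(0)
        n = 100
        X = rng.normal(size=(n, 6))                     # e.g. 3 mRNA + 3 miRNA features
        y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.8, 0.0]) + rng.normal(scale=0.5, size=n)

        w = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])    # platform-dependent penalty weights
        model = Lasso(alpha=0.1).fit(X / w, y)          # plain Lasso on rescaled features
        beta = model.coef_ / w                          # coefficients on the original scale
        print(np.round(beta, 3))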

    Advances in Data Mining Knowledge Discovery and Applications

    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is to highlight frontier fields and implementations of knowledge discovery and data mining. At first glance the same things may seem to be repeated, but in general the same approaches and techniques can be helpful across different fields and areas of expertise. This book presents knowledge discovery and data mining applications in two sections. As is well known, data mining draws on statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas, and most of these areas are covered by the applications in this book. The eighteen chapters are classified into two parts: Knowledge Discovery and Data Mining Applications.

    Statistical methods to evaluate disease outcome diagnostic accuracy of multiple biomarkers with application to HIV and TB research.

    Doctor of Philosophy in Statistics. University of KwaZulu-Natal, Pietermaritzburg, 2015. One challenge in clinical medicine is the correct diagnosis of disease. Medical researchers invest considerable time and effort in improving the accuracy of disease diagnosis, and diagnostic tests are therefore important components of modern medical practice. The receiver operating characteristic (ROC) is a statistical tool commonly used for describing the discriminatory accuracy and performance of a diagnostic test. A popular summary index of discriminatory accuracy is the area under the ROC curve (AUC). In medical research data, scientists simultaneously evaluate hundreds of biomarkers. A critical challenge is the combination of biomarkers into models that give insight into disease. In infectious disease, biomarkers are often evaluated in the host as well as in the micro-organism or virus causing infection, adding more complexity to the analysis. In addition to providing an improved understanding of factors associated with infection and disease development, combinations of relevant markers are important to the diagnosis and treatment of disease. Taken together, this extends the role of the statistical analyst and presents many novel and major challenges. This thesis discusses some of the strategies and issues in using statistical data analysis to address the diagnosis problem of selecting and combining multiple markers to estimate the predictive accuracy of test results. We also consider different methodologies to address missing data and to improve predictive accuracy in the presence of incomplete data. The thesis is divided into five parts. The first part is an introduction to the theory behind the methods used in this work. The second part places emphasis on so-called classic ROC analysis, which is applied to cross-sectional data. The main aim of this chapter is to address the problem of how to select and combine multiple markers, and to evaluate the appropriateness of certain techniques used in estimating the area under the ROC curve (AUC). Logistic regression models offer a simple method for combining markers. We applied resampling methods to adjust for the over-fitting associated with model selection, and simulated several multivariate models to evaluate the performance of the resampling approaches in this setting. We applied these methods to data collected from a study of tuberculosis immune reconstitution inflammatory syndrome (TB-IRIS) in Cape Town, South Africa. Baseline levels of five biomarkers were evaluated, and we used this dataset to assess whether a combination of these biomarkers could accurately discriminate between TB-IRIS and non-TB-IRIS patients, by applying AUC analysis and resampling methods. The third part is concerned with time-dependent ROC analysis for event-time outcomes and a comparative analysis of techniques applied to incomplete covariates. Three different methods are assessed and investigated, namely mean imputation, nearest-neighbour hot deck imputation and multivariate imputation by chained equations (MICE). These methods were used together with the bootstrap and cross-validation to estimate the time-dependent AUC using a non-parametric approach and a Cox model. We simulated several models to evaluate the performance of the resampling approaches and imputation methods, and applied the above methods to a real data set.
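
    As a rough illustration (not the thesis code) of the marker-combination idea in the second part, the sketch below combines five hypothetical biomarkers with logistic regression and uses a simple bootstrap optimism correction to adjust the apparent AUC for over-fitting; the data, the markers and the exact resampling scheme are assumptions.

        # Illustrative only: combining biomarkers with logistic regression and using a
        # bootstrap optimism correction for the AUC. Data and marker values are synthetic.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score

        rng = np.random.default_rng(0)
        n, p = 150, 5                                   # e.g. five baseline biomarkers
        X = rng.normal(size=(n, p))
        y = rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))   # case vs control

        def fitted_auc(X_fit, y_fit, X_eval, y_eval):
            model = LogisticRegression().fit(X_fit, y_fit)
            return roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])

        apparent = fitted_auc(X, y, X, y)

        # Optimism = average of (AUC on the bootstrap sample - AUC of that model on the
        # original data); the corrected AUC subtracts this from the apparent AUC.
        optimism = []
        for _ in range(200):
            idx = rng.integers(0, n, n)
            if len(np.unique(y[idx])) < 2:              # need both classes to fit and score
                continue
            optimism.append(fitted_auc(X[idx], y[idx], X[idx], y[idx])
                            - fitted_auc(X[idx], y[idx], X, y))
        corrected = apparent - np.mean(optimism)
        print(f"apparent AUC: {apparent:.3f}   optimism-corrected AUC: {corrected:.3f}")
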
The fourth part is concerned with applying more advanced variable selection methods to predict patient survival using time-dependent ROC analysis. The least absolute shrinkage and selection operator (LASSO) Cox model is applied to estimate the bootstrap cross-validated, .632 and .632+ bootstrap AUCs for a TBM/HIV data set from KwaZulu-Natal in South Africa. We also suggest the use of ridge-Cox regression to estimate the AUC and two-level bootstrapping to estimate the variances of the AUC, and we evaluate these suggested methods. The last part of the research is an application study using genetic HIV data from rural KwaZulu-Natal to evaluate sequence ambiguities as a biomarker for predicting recent infection in HIV patients.
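
    A similarly hedged sketch of the fourth part's approach, assuming scikit-survival's CoxnetSurvivalAnalysis (an L1-penalised Cox model) and its cumulative_dynamic_auc helper for the time-dependent AUC; the data are synthetic rather than the TBM/HIV cohort.

        # Illustrative only: LASSO-penalised Cox regression with a time-dependent AUC,
        # using scikit-survival on synthetic data (not the TBM/HIV data set).
        import numpy as np
        from sksurv.linear_model import CoxnetSurvivalAnalysis
        from sksurv.metrics import cumulative_dynamic_auc
        from sksurv.util import Surv

        rng = np.random.default_rng(0)
        n, p = 300, 20
        X = rng.normal(size=(n, p))
        risk = X[:, 0] + 0.5 * X[:, 1]                      # two informative covariates
        event_time = rng.exponential(scale=np.exp(-risk))
        censor_time = rng.exponential(scale=2.0, size=n)
        time = np.minimum(event_time, censor_time)
        event = event_time <= censor_time
        y = Surv.from_arrays(event=event, time=time)

        train, test = np.arange(200), np.arange(200, n)
        model = CoxnetSurvivalAnalysis(l1_ratio=1.0)        # pure L1 (LASSO) penalty
        model.fit(X[train], y[train])

        # Evaluate discrimination at a few horizons within the test follow-up period.
        times = np.quantile(time[test][event[test]], [0.25, 0.5, 0.75])
        auc, mean_auc = cumulative_dynamic_auc(y[train], y[test], model.predict(X[test]), times)
        print(np.round(auc, 3), round(float(mean_auc), 3))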

    Un Système de Classification Supervisée à Base de Règles Implicatives

    This PhD thesis presents a series of research works in the field of supervised data classification, more precisely in the domain of semi-automatic learning of fuzzy rule-based classifiers. The manuscript first gives an overview of the classification problem and of the main classification methods that have already been implemented and validated, in order to place the proposed method in the general context of the domain. Once the context is established, the actual research work is presented: the definition of a formal framework for representing an elementary fuzzy rule-based classifier in a two-dimensional space, the description of a learning algorithm that builds these elementary classifiers from a given data set, and the design of a multi-dimensional classification system able to handle multi-class problems by combining the elementary classifiers. The implementation and testing of all these functionalities are then detailed, and finally the resulting classifier is applied to two real-world digital imaging problems: the analysis of the quality of industrial products using 3D tomographic images, and the identification of regions of interest in radar satellite images.
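
    As a toy illustration of what an elementary fuzzy rule-based classifier over two features can look like (a generic sketch only; the implicative/gradual rules and the learning algorithm developed in the thesis are not reproduced here):

        # Toy sketch of a 2D fuzzy rule-based classifier (illustrative only).
        import numpy as np

        def triangular(x, a, b, c):
            """Triangular membership function peaking at b, zero outside [a, c]."""
            return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

        # One rule per class: IF x1 is A1 AND x2 is A2 THEN class c.
        # Each fuzzy set is an (a, b, c) triangle; the values here are hypothetical.
        rules = {
            "class_0": ((0.0, 1.0, 2.0), (0.0, 1.0, 2.0)),
            "class_1": ((1.5, 3.0, 4.5), (1.5, 3.0, 4.5)),
        }

        def classify(x1, x2):
            # Firing strength of a rule = conjunction (min) of its two memberships;
            # the predicted class is the rule with the highest firing strength.
            strengths = {
                label: min(triangular(x1, *f1), triangular(x2, *f2))
                for label, (f1, f2) in rules.items()
            }
            return max(strengths, key=strengths.get), strengths

        print(classify(0.8, 1.2))   # near the class_0 prototype
        print(classify(3.2, 2.9))   # near the class_1 prototype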

    Information structure and the prosodic structure of English: a probabilistic relationship

    This work concerns how information structure is signalled prosodically in English, that is, how prosodic prominence and phrasing are used to indicate the salience and organisation of information in relation to a discourse model. It has been standardly held that information structure is primarily signalled by the distribution of pitch accents within syntactic structure, as well as by intonation event type. However, we argue that these claims underestimate the importance, and richness, of metrical prosodic structure and its role in signalling information structure. We advance a new theory, that information structure is a strong constraint on the mapping of words onto metrical prosodic structure. We show that focus (kontrast) aligns with nuclear prominence, while other accents are not usually directly 'meaningful'. Information units (theme/rheme) try to align with prosodic phrases. This mapping is probabilistic, so it is also influenced by lexical and syntactic effects, as well as rhythmical constraints and other features including emphasis. Rather than being directly signalled by the prosody, the likelihood of each information structure interpretation is mediated by all these properties. We demonstrate that this theory resolves problematic facts about accent distribution in earlier accounts and makes syntactic focus projection rules unnecessary. Previous theories have claimed that contrastive accents are marked by a categorically distinct accent type from other focal accents (e.g. L+H* vs. H*). We show this distinction in fact involves two separate semantic properties: contrastiveness and theme/rheme status. Contrastiveness is marked by increased prominence in general. Themes are distinguished from rhemes by relative prominence, i.e. the rheme kontrast aligns with nuclear prominence at the level of phrasing that includes both theme and rheme units. In a series of production and perception experiments, we directly test our theory against previous accounts, showing that the only consistent cue to the distinction between theme and rheme nuclear accents is relative pitch height. This height difference accords with our understanding of the marking of nuclear prominence: theme peaks are only lower than rheme peaks in rheme-theme order, consistent with post-nuclear lowering; in theme-rheme order, the last of equal peaks is perceived as nuclear. The rest of the thesis involves analysis of a portion of the Switchboard corpus which we have annotated with substantial new layers of semantic (kontrast) and prosodic features, which are described. This work is an essentially novel approach to testing discourse semantics theories in speech. Using multiple regression analysis, we demonstrate distributional properties of the corpus consistent with our claims. Plain and nuclear accents are best distinguished by phrasal features, showing the strong constraint of phrase structure on the perception of prominence. Nuclear accents can be reliably predicted by semantic/syntactic features, particularly kontrast, while other accents cannot. Plain accents can only be identified well by acoustic features, showing their appearance is linked to rhythmical and low-level semantic features. We further show that kontrast is not only more likely in nuclear position, but also if a word is more structurally or acoustically prominent than expected given its syntactic/information status properties.
Consistent with our claim that nuclear accents are distinctive, we show that pre-, post- and nuclear accents have different acoustic profiles; and that the acoustic correlates of increased prominence vary by accent type, i.e. pre-nuclear or nuclear. Finally, we demonstrate the efficacy of our theory compared to previous accounts using examples from the corpus.
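
    Purely as an illustration of the kind of regression analysis described above, the hypothetical sketch below predicts whether a word carries a nuclear accent from invented information-structure and acoustic features; it is not the corpus, the feature set, or the models used in the thesis.

        # Hypothetical sketch: logistic regression predicting nuclear accent status
        # from information-structure and acoustic features (all data invented).
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score

        rng = np.random.default_rng(1)
        n = 500
        kontrast = rng.integers(0, 2, n)            # 1 = word is a kontrast (focus)
        content  = rng.integers(0, 2, n)            # 1 = content word, 0 = function word
        log_f0   = rng.normal(size=n)               # standardised peak F0
        duration = rng.normal(size=n)               # standardised word duration
        X = np.column_stack([kontrast, content, log_f0, duration])

        # Synthetic outcome: nuclear accent more likely for kontrasts and prominent words.
        logit = -1.5 + 2.0 * kontrast + 0.8 * content + 0.6 * log_f0 + 0.4 * duration
        y = rng.random(n) < 1 / (1 + np.exp(-logit))

        model = LogisticRegression().fit(X, y)
        print(np.round(model.coef_, 2),
              round(roc_auc_score(y, model.predict_proba(X)[:, 1]), 3))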

    Estimating the Accuracy of Learned Concepts

    This paper investigates alternative estimators of the accuracy of concepts learned from examples. In particular, the cross-validation and 632 bootstrap estimators are studied, using synthetic training data and the FOIL learning algorithm. Our experimental results contradict previous papers in statistics, which advocate the 632 bootstrap method as superior to cross-validation. Nevertheless, our results also suggest that conclusions based on cross-validation in previous machine learning papers are unreliable. Specifically, our observations are that (i) the true error of the concept learned by FOIL from independently drawn sets of examples of the same concept varies widely, (ii) the estimate of true error provided by cross-validation has high variability but is approximately unbiased, and (iii) the 632 bootstrap estimator has lower variability than cross-validation, but is systematically biased.

    Estimating the Accuracy of Learned Concepts

    This paper investigates alternative estimators of the accuracy of concepts learned from examples. In particular, the cross-validation and 632 bootstrap estimators are studied, using synthetic training data and the FOIL learning algorithm. Our experimental results contradict previous papers in statistics, which advocate the 632 bootstrap method as superior to cross-validation. Nevertheless, our results also suggest that conclusions based on cross-validation in previous machine learning papers are unreliable. Specifically, our observations are that (i) the true error of the concept learned by FOIL from independently drawn sets of examples of the same concept varies widely, (ii) the estimate of true error provided by cross-validation has high variability but is approximately unbiased, and (iii) the 632 bootstrap estimator has lower variability than cross-validation, but is systematically biased.
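
    A minimal sketch contrasting the two estimators the paper studies, k-fold cross-validation and the .632 bootstrap; a decision tree stands in for FOIL (a relational rule learner not reproduced here) and the data are synthetic.

        # Illustrative comparison of a cross-validation accuracy estimate with a
        # .632 bootstrap estimate, using a decision tree as a stand-in learner.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=200, n_features=10, random_state=0)
        clf = DecisionTreeClassifier(random_state=0)

        # 10-fold cross-validation estimate of accuracy.
        cv_acc = cross_val_score(clf, X, y, cv=10).mean()

        # .632 bootstrap: weight the apparent (resubstitution) accuracy with the
        # average out-of-bag accuracy over B bootstrap resamples.
        rng = np.random.default_rng(0)
        apparent = clf.fit(X, y).score(X, y)
        oob_accs = []
        for _ in range(100):
            idx = rng.integers(0, len(y), len(y))          # bootstrap sample (with replacement)
            oob = np.setdiff1d(np.arange(len(y)), idx)     # cases left out of the sample
            if len(oob) == 0:
                continue
            oob_accs.append(clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))
        acc_632 = 0.368 * apparent + 0.632 * np.mean(oob_accs)

        print(f"cross-validation: {cv_acc:.3f}   .632 bootstrap: {acc_632:.3f}")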