
    Feature Selection and Dimensionality Reduction in Genomics and Proteomics

    Finding reliable, meaningful patterns in data with a high number of attributes can be extremely difficult. Feature selection helps us decide which attributes, or combinations of attributes, are most important for finding these patterns. In this chapter, we study feature selection methods for building classification models from high-throughput genomic (microarray) and proteomic (mass spectrometry) data sets, in which thousands of candidate features must be analyzed, compared, and combined. We describe the basics of four different approaches to feature selection and illustrate their effects on an MS cancer proteomics data set. The closing discussion offers practical guidance for performing such analyses on high-dimensional genomic and proteomic data.
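    As a hedged illustration of this family of methods, the sketch below implements one common filter-style selector, a two-sample t-statistic ranking. The abstract does not name the chapter's four specific approaches, so the method choice, function name, and data shapes here are assumptions for illustration only, not the authors' code.

```python
# Illustrative univariate filter for feature selection (an assumed example,
# not one of the chapter's four named approaches).
import numpy as np
from scipy import stats

def t_statistic_filter(X, y, k):
    """Rank features by absolute two-sample t-statistic and keep the top k.

    X : (n_samples, n_features) expression / intensity matrix
    y : binary class labels (0/1), e.g. control vs. cancer
    """
    t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    top = np.argsort(-np.abs(t))[:k]      # indices of most discriminative features
    return top

# Toy usage: 40 samples, 1000 candidate features (e.g. m/z peaks or genes).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 1.5                      # plant signal in the first 5 features
print(t_statistic_filter(X, y, k=10))
```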

    Regularized Least Squares Cancer Classifiers from DNA microarray data

    BACKGROUND: The advent of DNA microarray technology constitutes an epochal change in the classification and discovery of different types of cancer, because the information provided by DNA microarrays allows the problem of cancer analysis to be approached from a quantitative rather than a qualitative point of view. Cancer classification requires well-founded mathematical methods that can predict the status of new specimens with high significance levels from a limited number of data. In this paper we assess the performance of Regularized Least Squares (RLS) classifiers, originally proposed in regularization theory, by comparing them with Support Vector Machines (SVM), the state-of-the-art supervised learning technique for cancer classification from DNA microarray data. The performance of both approaches has also been investigated with respect to the number of selected genes and different gene selection strategies. RESULTS: We show that RLS classifiers perform comparably to SVM classifiers, as the Leave-One-Out (LOO) error evaluated on three different data sets shows. The main advantage of RLS machines is that, to solve a classification problem, they use a linear system of order equal to either the number of features or the number of training examples. Moreover, RLS machines yield an exact measure of the LOO error with just one training run. CONCLUSION: RLS classifiers are a valuable alternative to SVM classifiers for cancer classification from gene expression data, owing to their simplicity and low computational complexity. Moreover, RLS classifiers show generalization ability comparable to that of SVM classifiers even when the classification of new specimens involves very few gene expression levels.
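    The exact-LOO property mentioned above follows from a standard RLS identity: with G = K + λI and c = G⁻¹y, the LOO residual at sample i equals cᵢ/(G⁻¹)ᵢᵢ. The sketch below illustrates this on assumed toy data; it is not the paper's code, and the data shapes and regularization value are invented.

```python
# Minimal kernel-RLS sketch: one factorization of an n x n system gives both
# the classifier and its exact leave-one-out (LOO) residuals.
import numpy as np

def rls_fit_loo(X, y, lam=1.0):
    """Linear-kernel RLS: solve (K + lam*I) c = y, return c and LOO residuals."""
    n = X.shape[0]
    K = X @ X.T                           # Gram matrix; system order = n_samples
    G_inv = np.linalg.inv(K + lam * np.eye(n))
    c = G_inv @ y
    loo_residuals = c / np.diag(G_inv)    # closed-form LOO identity
    return c, loo_residuals

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 500))            # few samples, many genes
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=30))
c, loo = rls_fit_loo(X, y, lam=10.0)
loo_error = np.mean(np.sign(y - loo) != y)   # LOO predictions = y - residuals
print(f"LOO misclassification rate: {loo_error:.2f}")
```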

    Nonparametric statistical inference for functional brain information mapping

    An ever-increasing number of functional magnetic resonance imaging (fMRI) studies are now using information-based multi-voxel pattern analysis (MVPA) techniques to decode mental states, achieving significantly greater sensitivity than univariate analysis frameworks. The two most prominent MVPA methods for information mapping are searchlight decoding and classifier weight mapping. These new MVPA brain mapping methods, however, have also posed new challenges for analysis and statistical inference at the group level. In this thesis, I discuss why the usual procedure of performing t-tests on MVPA-derived information maps across subjects to produce a group statistic is inappropriate, and I propose a fully nonparametric solution that achieves higher sensitivity than the most commonly used t-based procedure. The proposed method is based on resampling and preserves the spatial dependencies in the MVPA-derived information maps, which makes it possible to incorporate cluster-size control for the multiple testing problem. Using a volumetric searchlight decoding procedure and classifier weight maps, I demonstrate the validity and sensitivity of the new approach on both simulated and real fMRI data sets. Compared to the standard t-test procedure implemented in SPM8, the new approach showed higher sensitivity and spatial specificity. The second goal of this thesis is the comparison of the two widely used information mapping approaches, the searchlight technique and classifier weight mapping. Both methods take into account spatially distributed patterns of activation in order to predict stimulus conditions; however, the searchlight method operates solely on a local scale and has been found to be prone to spatial inaccuracies: the spatial extent of informative areas is generally exaggerated, and their spatial configuration is distorted. In this thesis, I compare searchlight decoding with linear classifier weight mapping, both within the nonparametric statistical framework proposed above, using a simulation and ultra-high-field 7T experimental data. The searchlight method led to spatial inaccuracies that are especially noticeable in high-resolution fMRI data. In contrast, the weight mapping method was more spatially precise, revealing both informative anatomical structures and the direction by which voxels contribute to the classification. By maximizing the spatial accuracy of ultra-high-field fMRI results, such global multivariate methods provide a substantial improvement for characterizing structure-function relationships.
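    To make the resampling idea concrete, here is a minimal, assumed sketch of a sign-flipping permutation test on subject-level information maps. Note that it uses the simpler maximum-statistic correction rather than the cluster-size control developed in the thesis, and all names, shapes, and data are illustrative.

```python
# Sign-flipping group test on MVPA information maps (accuracy minus chance),
# with family-wise error control via the maximum statistic. An assumed
# simplification of the thesis's procedure, not its implementation.
import numpy as np

def sign_flip_max_stat(maps, n_perm=1000, seed=0):
    """maps : (n_subjects, n_voxels) information maps, already centered at
    chance. Returns voxelwise p-values corrected via the max statistic."""
    rng = np.random.default_rng(seed)
    n_subj, n_vox = maps.shape
    observed = maps.mean(axis=0)
    max_null = np.empty(n_perm)
    for p in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(n_subj, 1))  # exchangeable under H0
        max_null[p] = (flips * maps).mean(axis=0).max()
    return (1 + (max_null[:, None] >= observed).sum(axis=0)) / (n_perm + 1)

maps = np.random.default_rng(2).normal(size=(16, 2000))
maps[:, :20] += 0.4                       # a small informative region
pvals = sign_flip_max_stat(maps)
print((pvals < 0.05).sum(), "voxels survive FWE correction")
```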

    Random Subset Feature Selection for Ecological Niche Modeling of Wildfire Activity and the Monarch Butterfly

    Correlative ecological niche models (ENMs) are essential for investigating distributions of species and natural phenomena via environmental correlates across broad fields, including the entomology and pyrogeography featured in this study. Feature (variable) selection is critical for producing more robust ENMs with greater transferability across space and time, but few studies evaluate formal feature selection algorithms (FSAs) for producing higher-performance ENMs, and the variability of ENMs arising from different feature subsets is seldom represented. A novel FSA is developed and evaluated here: the random subset feature selection algorithm (RSFSA). The RSFSA generates an ensemble of higher-accuracy ENMs from different feature subsets, producing a feature subset ensemble (FSE); these RSFSA-selected FSEs are used, in a novel way, to represent ENM variability. Wildfire activity presence/absence databases for the western US proved ideal for evaluating RSFSA-selected MaxEnt ENMs: the RSFSA was effective in identifying FSEs of 15 of 90 variables with higher accuracy and information content than random FSEs, and the selected FSEs were used to identify severe contemporary wildfire deficits and significant future increases in wildfire activity for many ecoregions. Migratory roosting localities of declining eastern North American monarch butterflies (Danaus plexippus) were used to spatially model migratory pathways, comparing RSFSA-selected MaxEnt ENMs and kernel density estimate models (KDEMs). The higher-information-content ENMs best correlated migratory pathways with nectar resources in grasslands, while the higher-accuracy KDEMs best revealed migratory pathways through less suitable desert environments. Monarch butterfly roadkill data were surveyed for Texas within the main Oklahoma-to-Mexico Central Funnel migratory pathway; a random FSE of MaxEnt roadkill ENMs was used to estimate a 2-3% loss of migrants to roadkill, and hotspots of roadkill in west Texas and Mexico were recommended for assessing roadkill mitigation to assist in monarch population recovery. The RSFSA effectively produces higher-performance ENM FSEs for estimating optimal feature subset sizes and for comparing ENM algorithms, parameters, and environmental scenarios. The RSFSA also performed comparably to expert variable selection, confirming its value in the absence of expert information. The RSFSA should be compared with other FSAs for developing ENMs and in data mining applications across other disciplines, such as image classification and molecular bioinformatics.
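    A minimal sketch of the random-subset idea as described in this abstract might look as follows. This is not the published RSFSA implementation: it substitutes scikit-learn's LogisticRegression for the MaxEnt ENMs, and every function name, parameter, and data set here is an assumption for illustration.

```python
# Random-subset feature selection sketch: draw many fixed-size feature
# subsets, score each by cross-validation, keep the top scorers as an
# ensemble (the "feature subset ensemble", FSE).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def random_subset_selection(X, y, subset_size=15, n_subsets=200,
                            n_keep=20, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    scored = []
    for _ in range(n_subsets):
        cols = rng.choice(n_features, size=subset_size, replace=False)
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X[:, cols], y, cv=5).mean()
        scored.append((acc, cols))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:n_keep]                # the feature-subset ensemble

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 90))            # 90 candidate environmental variables
y = (X[:, 0] + X[:, 3] + rng.normal(size=300) > 0).astype(int)
ensemble = random_subset_selection(X, y)
print("best subset accuracy:", round(ensemble[0][0], 3))
```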

    Statistical approaches of gene set analysis with quantitative trait loci for high-throughput genomic studies.

    Recently, gene set analysis has become the first choice for gaining insights into the underlying complex biology of diseases through high-throughput genomic studies such as microarrays, bulk RNA-sequencing, and single-cell RNA-sequencing. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. However, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. Hence, a comprehensive overview of the available gene set analysis approaches used for different high-throughput genomic studies is provided. The analysis of gene sets is usually carried out based on gene ontology terms, known biological pathways, and the like, which may not establish any formal relation between genotype and trait-specific phenotype. In plant biology and breeding, gene set analysis with trait-specific Quantitative Trait Loci (QTL) data is considered a great source of biological knowledge discovery. Therefore, innovative statistical approaches are developed for analyzing and interpreting gene expression data from microarray and RNA-sequencing studies in the context of gene sets with trait-specific QTLs. The utility of the developed approaches is studied on multiple real gene expression datasets obtained from various microarray and RNA-sequencing studies. The selection of gene sets through differential expression analysis is the primary step of gene set analysis and can be achieved using gene selection methods. The existing methods for such analysis suffer from serious limitations: in microarrays, for instance, most available methods are based on either relevancy or redundancy measures and rank genes on a single microarray expression data set, which leads to the selection of spuriously associated and redundant gene sets. Therefore, new differential expression analysis methods have been developed for microarray and single-cell RNA-sequencing studies to identify gene sets and successfully carry out gene set and other downstream analyses. Furthermore, several methods specifically designed for single-cell data have been developed in the literature for differential expression analysis. To provide guidance on choosing an appropriate tool, or on developing a new one, it is necessary to review the performance of the existing methods. Hence, a comprehensive overview, classification, and comparative study of the available single-cell methods is undertaken to study their unique features, their underlying statistical models, and their shortcomings in real applications. Moreover, to address one of these shortcomings (higher dropout rates due to lower cell capture rates), an improved statistical method for downstream analysis of single-cell data has been developed. The developed statistical methods are implemented in various software tools and made publicly available; these methods and tools will help experimental biologists and genome researchers analyze their experimental data more objectively and efficiently. The limitations and shortcomings of the available methods are also reported, and these need to be addressed collectively by statisticians and biologists to develop efficient approaches.
    These new approaches will make it possible to analyze high-throughput genomic data more efficiently, to better understand biological systems, and to increase the specificity, sensitivity, utility, and relevance of high-throughput genomic studies.
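    As one concrete instance of the gene set analysis step described above, the sketch below runs a basic over-representation (hypergeometric) test of a differentially expressed gene list against a predefined gene set. The gene identifiers and set sizes are invented for illustration and do not come from the datasets mentioned in the abstract.

```python
# Over-representation test: is the overlap between a DE gene list and a
# predefined gene set (e.g. a QTL-derived set) larger than chance?
from scipy import stats

def overrepresentation_p(de_genes, gene_set, universe):
    """P(overlap >= observed) under the hypergeometric null."""
    de = set(de_genes) & set(universe)
    gs = set(gene_set) & set(universe)
    overlap = len(de & gs)
    # hypergeom.sf(k-1, M, n, N): M = universe size, n = set size, N = draws
    return stats.hypergeom.sf(overlap - 1, len(universe), len(gs), len(de))

universe = [f"g{i}" for i in range(10000)]
gene_set = universe[:50]                       # hypothetical QTL-linked set
de_genes = universe[:10] + universe[200:400]   # 10 of 210 DE genes hit the set
print(f"enrichment p-value: {overrepresentation_p(de_genes, gene_set, universe):.2e}")
```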

    Using machine learning to predict individual severity estimates of alcohol withdrawal syndrome in patients with alcohol dependence

    Despite its high prevalence in diverse clinical settings, treatment of alcohol withdrawal syndrome (AWS) is mainly based on subjective clinical opinion. Without reliable predictors of potentially harmful AWS outcomes at the individual patient's level, decisions such as the provision of pharmacotherapy rely on resource-intensive in-patient monitoring. By contrast, an accurate risk prognosis would enable timely preemptive treatment, open up possibilities for safe out-patient care, and lead to a more efficient use of health care resources. The aim of this project was to develop such tools using clinical and patient-reported information easily attainable at the patient's admission. To this end, a machine learning framework incorporating nested cross-validation, ensemble learning, and external validation was developed to retrieve accurate, generalizable prediction models for three meaningful AWS outcomes: (1) separating mild from more severe AWS as defined by the established AWS scale, and directly identifying patients at risk of (2) delirium tremens as well as (3) withdrawal seizures. Based on 121 sociodemographic, clinical, and laboratory-based variables retrieved retrospectively from the patients' charts, this classification paradigm was used to build predictive models in two cohorts of AWS patients at major detoxification wards in Munich (Ludwig-Maximilian-Universität München, n=389; Technische Universität München, n=805). Moderate to severe AWS cases were predicted with significant balanced accuracy (BAC) in both cohorts (LMU, BAC = 69.4%; TU, BAC = 55.9%). A post-hoc association between the models' poor-outcome predictions and higher clomethiazole doses further added to their clinical validity. While delirium tremens cases were accurately identified in the TU cohort (BAC = 75%), the framework yielded no significant model for withdrawal seizures. Variable importance analyses revealed that predictive patterns varied highly between the two treatment sites and between withdrawal outcomes. Besides several previously described variables (most notably, low platelet count and cerebral brain lesions), several new predictors were identified (history of blood pressure abnormalities, positive urine-based benzodiazepine screening, and years of schooling), emphasizing the utility of data-driven, hypothesis-free prediction approaches. Due to limitations of the datasets as well as site-specific patient characteristics, the models did not generalize across treatment sites, highlighting the need for strict validation procedures before prediction tools are implemented in clinical care. In conclusion, this dissertation provides evidence for the utility of machine learning methods to enable personalized risk predictions for AWS severity. More specifically, nested cross-validation and ensemble learning could be used to ensure generalizable, clinically applicable predictions in future prospective research based on multi-center collaboration. Despite decades of research, the predictive assessment of withdrawal symptom severity in patients with alcohol dependence still relies on subjective clinical judgment. Detoxification treatments are therefore predominantly carried out in in-patient settings worldwide to ensure close clinical monitoring. Since more than 90% of withdrawal syndromes involve only mild vegetative symptoms, this practice ties up valuable resources.
    Data-driven prediction tools could make an important contribution toward individualized, accurate, and reliable course assessment, which would enable safe out-patient treatment concepts, prophylactic pharmacotherapy for at-risk patients, and innovative treatment research based on stratified risk groups. The aim of this work was to develop such predictive tools for patients with alcohol withdrawal syndrome (AWS). To this end, an innovative machine learning paradigm with strict validation methods (nested cross-validation and out-of-sample external validation) was used to develop generalizable, accurate prediction models for three meaningful clinical endpoints of AWS: (1) the classification of mild versus moderate-to-severe AWS courses, defined according to an established clinical scale (the AWS scale), and the direct identification of the complications (2) delirium tremens (DT) and (3) withdrawal seizures (WS). This paradigm was applied, using 121 retrospectively collected clinical, laboratory-based, and sociodemographic variables, to 1194 patients with alcohol dependence at two major detoxification wards in Munich (Ludwig-Maximilian-Universität München, n=389; Technische Universität München, n=805). Moderate to severe AWS courses were predicted with significant accuracy (balanced accuracy [BAC]) at both treatment centers (LMU, BAC = 69.4%; TU, BAC = 55.9%). In a post-hoc analysis, the prediction of moderate to severe courses was also associated with higher cumulative clomethiazole doses, supporting the clinical validity of the models. While DT was identified with high accuracy in the TU cohort (BAC = 75%), the prediction of withdrawal seizures was not successful. An exploratory analysis showed that the predictive importance of individual variables varied considerably between the treatment centers as well as between the individual endpoints. In addition to several predictively valuable variables already described in earlier work (in particular, a lower average blood platelet count and structural cerebral lesions), several new predictors were identified (a history of blood pressure abnormalities, a positive urine-based benzodiazepine screening, and years of schooling). These results underscore the value of data-driven, hypothesis-free prediction approaches. Owing to limitations of the retrospective data set, such as the lack of cross-center availability of some variables, and to clinical particularities of the two cohorts, the models could not be validated at the respective other treatment center. This finding underscores the need to adequately test the generalizability of prediction results before recommending tools based on them for clinical practice. Within this work, such methods were used for the first time in a research project on AWS. In summary, the results of this dissertation demonstrate for the first time the utility of machine learning approaches for individualized risk prediction of severe AWS courses.
    The cross-validated machine learning paradigm used here would be a possible analysis procedure for achieving reliable prediction results with high clinical application potential in future prospective multi-center studies.
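    A minimal sketch of the nested cross-validation scheme described in this abstract, under assumed synthetic data, might look as follows: an inner loop tunes hyperparameters, an outer loop estimates generalization, so no test fold ever informs model selection. The 121-feature stand-in data, the model, and the hyperparameter grid are illustrative choices, not the dissertation's pipeline.

```python
# Nested cross-validation with balanced accuracy (the BAC metric reported
# in the abstract); GridSearchCV is the inner loop, cross_val_score the outer.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=121,
                           weights=[0.8, 0.2], random_state=0)
inner = GridSearchCV(LogisticRegression(max_iter=5000, class_weight="balanced"),
                     param_grid={"C": [0.01, 0.1, 1, 10]},
                     scoring="balanced_accuracy",
                     cv=StratifiedKFold(5, shuffle=True, random_state=0))
outer_bac = cross_val_score(inner, X, y, scoring="balanced_accuracy",
                            cv=StratifiedKFold(5, shuffle=True, random_state=1))
print(f"nested-CV balanced accuracy: {outer_bac.mean():.3f}")
```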

    Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

    Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables, such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data with large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes involving complex methods adapted to the respective research questions. Methods: Advances in statistical methodology and machine learning offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities in studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. Results: The paper is organized around the subtopics most relevant to the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, the main analytical goals in HDD settings are outlined, and basic explanations of some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
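    As one example of the multiple-testing machinery such a review covers, here is a minimal sketch (not from the paper) of the Benjamini-Hochberg step-up procedure for false discovery rate control; the p-values are simulated.

```python
# Benjamini-Hochberg step-up procedure: reject the hypotheses with the k
# smallest p-values, where k is the largest i with p_(i) <= alpha * i / m.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()    # largest index satisfying the bound
        rejected[order[:k + 1]] = True
    return rejected

rng = np.random.default_rng(4)
pvals = np.concatenate([rng.uniform(0, 0.001, 50),   # 50 true signals
                        rng.uniform(0, 1, 9950)])    # 9950 nulls
print(benjamini_hochberg(pvals).sum(), "discoveries at FDR 0.05")
```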