180 research outputs found

    Outcome prediction based on microarray analysis: a critical perspective on methods

    Background: Information extraction from microarrays has not yet been widely used in diagnostic or prognostic decision-support systems, due to the diversity of results produced by the available techniques, their instability on different data sets and the inability to relate statistical significance to biological relevance. Thus, there is an urgent need to address the statistical framework of microarray analysis and identify its drawbacks and limitations, which will enable us to thoroughly compare methodologies under the same experimental set-up and associate results with confidence intervals meaningful to clinicians. In this study we consider gene-selection algorithms with the aim of revealing inefficiencies in performance evaluation and addressing aspects that can reduce uncertainty in algorithmic validation.
    Results: A computational study is performed on the performance of several gene-selection methodologies on publicly available microarray data. Three basic types of experimental scenarios are evaluated, i.e. the independent test-set and the 10-fold cross-validation (CV) using maximum and average performance measures. Feature selection methods behave differently under different validation strategies. The performance results from CV do not match those from the independent test-set well, except for the support vector machines (SVM) and the least squares SVM methods. However, these wrapper methods achieve variable (often low) performance, whereas the hybrid methods attain consistently higher accuracies. The use of an independent test-set within CV is important for the evaluation of the predictive power of algorithms. The optimal size of the selected gene-set also appears to depend on the evaluation scheme. The consistency of selected genes over variation of the training-set is another aspect important in reducing uncertainty in the evaluation of the derived gene signature. In all cases the presence of outlier samples can seriously affect algorithmic performance.
    Conclusion: Multiple parameters can influence the selection of a gene signature and its predictive power, thus possible biases in validation methods must always be accounted for. This paper illustrates that independent test-set evaluation reduces the bias of CV, and case-specific measures reveal stability characteristics of the gene signature over changes of the training set. Moreover, frequency measures on gene selection address the algorithmic consistency in selecting the same gene signature under different training conditions. These issues contribute to the development of an objective evaluation framework and aid the derivation of statistically consistent gene signatures that could eventually be correlated with biological relevance. The benefits of the proposed framework are supported by the evaluation results and methodological comparisons performed for several gene-selection algorithms on three publicly available datasets.
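The frequency and consistency measures mentioned in this abstract can be illustrated with a minimal sketch. It assumes the gene signatures have already been selected on several different training sets; the gene names, the function names and the use of the Jaccard index as the overlap measure are illustrative assumptions, not the measures defined in the paper.

```python
from collections import Counter
from itertools import combinations


def selection_frequency(signatures):
    """Fraction of training runs in which each gene was selected."""
    counts = Counter(gene for sig in signatures for gene in set(sig))
    return {gene: n / len(signatures) for gene, n in counts.items()}


def mean_jaccard(signatures):
    """Average pairwise Jaccard overlap between the selected gene sets.

    Assumes at least two non-empty signatures.
    """
    pairs = list(combinations([set(s) for s in signatures], 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Hypothetical signatures selected on three different training sets.
signatures = [
    ["g1", "g2", "g3"],
    ["g1", "g2", "g4"],
    ["g1", "g3", "g4"],
]
freq = selection_frequency(signatures)   # per-gene selection frequency
stability = mean_jaccard(signatures)     # overall overlap between runs
```

A gene with a frequency close to 1 is selected under almost every training condition, while a low mean overlap suggests that the signature depends strongly on the particular training set.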

    Constructing gene expression based prognostic models to predict recurrence and lymph node metastasis in colon cancer

    The main goal of this study is to identify molecular signatures to predict lymph node metastases and recurrence in colon cancer patients. Recent advances in microarray technology have facilitated the building of accurate molecular classifiers and an in-depth understanding of disease mechanisms. Lymph node metastasis cannot be accurately estimated by morphological assessment. Molecular markers have the potential to improve prognostic accuracy. The first part of our study presents a novel technique to identify molecular markers for predicting the stage of the disease based on microarray gene expression data. In the first step, random forests were used for variable selection and a 14-gene signature was identified. In the second step, the genes without differential expression in lymph node negative versus positive tumors were removed from the 14-gene signature, leading to the identification of a 9-gene signature. The lymph node status prediction accuracy of the 9-gene signature on an independent colon cancer dataset (n=17) was 82.3%. The area under the curve (AUC) obtained from the time-dependent ROC curves using the 9-gene signature was 0.85 and 0.86 for relapse-free survival and overall survival, respectively. The 9-gene signature significantly stratified patients into low-risk and high-risk groups (log-rank tests, p<0.05, n=73), with distinct relapse-free survival and overall survival. Based on the results, it could be concluded that the 9-gene signature could be used to identify lymph node metastases in patients. We further studied the 9-gene signature using correlation analysis on CGH and RNA expression datasets. It was found that the gene ITGB1 in the 9-gene signature exhibited a strong relationship between DNA copy number and gene expression. Furthermore, genome-wide correlation analysis was done on CGH and RNA data, and three or more consecutive genes with significant correlation of DNA copy number and RNA expression were identified.
These results might be helpful in identifying the regulators of gene expression. The second part of the study focused on identifying molecular signatures for patients at high risk of recurrence who would benefit from adjuvant chemotherapy. The training set (n=36) consisted of patients who remained disease-free for 5 years and patients who experienced recurrence within 5 years. The remaining patients formed the testing set (n=37). A combinatorial scheme was developed to identify gene signatures predicting colon cancer recurrence. In the first step, preprocessing was done to discard undifferentiated genes, and missing values were replaced with k=30 and k=20 using the k-nearest neighbors algorithm. Variable selection using the random forests algorithm was applied to obtain gene subsets. In the second step, the InfoGain feature selection technique was used to drop lower-ranked genes from the gene subsets based on their association with disease outcome. A 3-gene and a 5-gene signature were identified by this technique based on different missing value replacement methods. Both of the recurrence gene signatures stratified patients into low-risk and high-risk groups (log-rank tests, p<0.05, n=73), with distinct relapse-free survival and overall survival. A recurrence prediction model was built using the LWL classifier based on the 3-gene signature, with an accuracy of 91.7% on the training set (n=36). Another recurrence prediction model was built using the random tree classifier based on the 5-gene signature, with an accuracy of 83.3% on the training set (n=36). The prospective predictions obtained on the testing set using these models will be verified when the follow-up information becomes available in the future. The recurrence prediction accuracies of these gene signatures on independent colon cancer datasets were in the range 72.4% to 88.9%.
These prognostic models might be helpful to clinicians in selecting more appropriate treatments for patients who are at high risk of developing recurrence. When compared over multiple datasets, the 3-gene signature had improved prediction accuracy over the 5-gene signature. The identified lymph node and recurrence gene signatures were validated on rectal cancer data. Time-dependent ROC and Kaplan-Meier analyses were done, producing significant results. These results support the conclusion that the developed prognostic models could be used to identify patients at high risk of developing recurrence and to estimate survival times in rectal cancer patients.
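The information-gain ranking step described in this abstract can be sketched as follows. This is a minimal illustration, assuming each gene is split at a single fixed expression threshold; the threshold, the labels and the function names are hypothetical, and the abstract does not say how expression values were discretized.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(expression, labels, threshold):
    """Entropy reduction from splitting samples at an expression threshold."""
    low = [y for x, y in zip(expression, labels) if x <= threshold]
    high = [y for x, y in zip(expression, labels) if x > threshold]
    # Weighted entropy of the two groups; empty groups contribute nothing.
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - remainder


# Hypothetical outcome labels and expression values for two genes.
labels = ["recurrence", "recurrence", "disease-free", "disease-free"]
gene_a = [0.1, 0.2, 0.9, 0.8]  # separates the two outcomes at 0.5
gene_b = [0.1, 0.9, 0.2, 0.8]  # carries no information about the outcome

gain_a = info_gain(gene_a, labels, threshold=0.5)
gain_b = info_gain(gene_b, labels, threshold=0.5)
```

Genes are then ranked by their gain and the lowest-ranked ones are dropped; here `gene_a` would be kept ahead of `gene_b`.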

    Molecular Fingerprint and Developmental Regulation of the Tegmental GABAergic and Glutamatergic Neurons Derived from the Anterior Hindbrain

    Tegmental nuclei in the ventral midbrain and anterior hindbrain control motivated behavior, mood, memory, and movement. These nuclei contain inhibitory GABAergic and excitatory glutamatergic neurons, whose molecular diversity and development remain largely unresolved. Many tegmental neurons originate in the embryonic ventral rhombomere 1 (r1), where GABAergic fate is regulated by the transcription factor (TF) Tal1. We used single-cell mRNA sequencing of the mouse ventral r1 to characterize the Tal1-dependent and Tal1-independent neuronal precursors. We describe gene expression dynamics during bifurcation of the GABAergic and glutamatergic lineages and show how active Notch signaling promotes GABAergic fate selection in postmitotic precursors. We identify GABAergic precursor subtypes that give rise to distinct tegmental nuclei and demonstrate that Sox14 and Zfpm2, two TFs downstream of Tal1, are necessary for the differentiation of specific tegmental GABAergic neurons. Our results provide a framework for understanding the development of cellular diversity in the tegmental nuclei. Peer reviewed.

    Data- and expert-driven feature selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. 
This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capabilities to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection. 
Experimental evaluations demonstrate favorable properties in terms of predictive performance, stability, and interpretability of results, which carry a high potential for better data-driven decision support in clinical practice.
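The general idea of ensemble feature selection described in this abstract (repeat a selector on subsamples, then aggregate) can be sketched as below. This is not RENT or UBayFS: the base selector is a simple class-mean difference filter swapped in for illustration, and all names, parameters and the toy data are assumptions.

```python
import random
from collections import Counter


def mean_difference_scores(X, y):
    """Score each feature by the absolute difference of its class means."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(row[j] for row in pos) / len(pos)
        mean_neg = sum(row[j] for row in neg) / len(neg)
        scores.append(abs(mean_pos - mean_neg))
    return scores


def ensemble_select(X, y, n_runs=50, subsample=0.8, top_k=2, cutoff=0.9, seed=0):
    """Repeat the base selector on stratified subsamples and keep the
    features chosen in at least `cutoff` of the runs."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    counts = Counter()
    for _ in range(n_runs):
        idx = rng.sample(pos_idx, max(1, int(subsample * len(pos_idx))))
        idx += rng.sample(neg_idx, max(1, int(subsample * len(neg_idx))))
        scores = mean_difference_scores([X[i] for i in idx], [y[i] for i in idx])
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        counts.update(ranked[:top_k])
    return sorted(j for j, n in counts.items() if n / n_runs >= cutoff)


# Toy data: features 0 and 1 separate the classes, features 2 and 3 do not.
X = [[5.0 + 0.1 * i, 3.0 + 0.1 * i, (i * 7 % 5) / 5, (i * 3 % 5) / 5] for i in range(5)]
X += [[0.1 * i, 0.1 * i, (i * 2 % 5) / 5, (i * 4 % 5) / 5] for i in range(5)]
y = [1] * 5 + [0] * 5

selected = ensemble_select(X, y)
```

Aggregating by selection frequency is only one possible meta-model; the cutoff trades off the size of the selected set against its stability across subsamples.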

    Computational Cancer Research: Network-based analysis of cancer data disentangles clinically relevant alterations from molecular measurements

    Cancer is a very complex genetic disease driven by combinations of mutated genes. This complexity greatly complicates the identification of driver genes and makes it very challenging to reveal how they influence cancerogenesis, prognosis or therapy response. Thousands of molecular profiles of the major human cancer types have been measured in recent years. Apart from well-studied, frequently mutated genes, little is still known about the role of rarely mutated genes in cancer or the interplay of mutated genes in individual cancers. Gene expression and mutation profiles can be measured routinely, but computational methods that identify driver candidates and predict their potential impacts on downstream targets and clinically relevant characteristics rarely exist. Instead of focusing only on frequently mutated genes, each cancer patient should be analyzed using the full information in their cancer-specific molecular profiles, to improve the understanding of cancerogenesis and to predict prognosis and therapy response of individual patients more precisely. This requires novel computational methods for the integrative analysis of molecular cancer data. A promising way to realize this is to consider cancer as a disease of cellular networks. Therefore, I have developed a novel network-based approach for the integrative analysis of molecular cancer data over the last years. This approach directly learns gene regulatory networks from gene expression and copy number data and further makes it possible to quantify the impacts of altered genes on clinically relevant downstream targets using network propagation. This habilitation thesis summarizes the results of seven of my publications. All publications focus on the integrative analysis of molecular cancer data, with an overarching connection to the newly developed network-based approach.
In the first three publications, networks were learned to identify major regulators that distinguish characteristic gene expression signatures, with applications to astrocytomas, oligodendrogliomas, and acute myeloid leukemia. Next, the central publication of this habilitation thesis, which combines network inference with network propagation, is introduced. The value of this approach is demonstrated by quantifying potential direct and indirect impacts of rare and frequent gene copy number alterations on patient survival. Further, the publication of the corresponding user-friendly R package regNet is introduced. Finally, two additional publications that also highlight the value of the developed network-based approach are presented, with the aims of predicting cancer gene candidates within the region of the 1p/19q co-deletion of oligodendrogliomas and determining driver candidates associated with radioresistance and relapse of prostate cancer. All seven publications are embedded into a brief introduction that motivates the scientific background and the major objectives of this thesis. The background moves briefly from the hallmarks of cancer, through the complexity of cancer genomes, to the importance of networks in cancer. This includes a short introduction to the mathematical concepts that underlie the developed network inference and network propagation algorithms. Further, I briefly motivate and summarize my studies before the original publications are presented. The habilitation thesis is completed with a general discussion of the major results, with a specific focus on the network-based data analysis strategies used. Major biologically and clinically relevant findings of each publication are also briefly summarized.
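To give a feel for what network propagation means, here is a minimal sketch of one common variant, a random walk with restart on a small undirected network. It is a generic illustration and not the algorithm implemented in regNet; the network, the gene names, the restart probability and the function name are all assumptions.

```python
def propagate(adjacency, seeds, restart=0.3, n_iter=100):
    """Random walk with restart: spread seed scores over a gene network.

    Assumes every node has at least one neighbor and that every neighbor
    is itself a key of `adjacency`.
    """
    nodes = sorted(adjacency)
    total = sum(seeds.values())
    p0 = {n: seeds.get(n, 0.0) / total for n in nodes}
    p = dict(p0)
    for _ in range(n_iter):
        # Restart mass returns to the seeds; the rest moves along edges.
        nxt = {n: restart * p0[n] for n in nodes}
        for src in nodes:
            neighbors = adjacency[src]
            for dst in neighbors:
                nxt[dst] += (1 - restart) * p[src] / len(neighbors)
        p = nxt
    return p


# Hypothetical chain network A - B - C - D with an alteration in gene A.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = propagate(adjacency, seeds={"A": 1.0})
```

In this toy chain the score decays with distance from the altered gene, which is the intuition behind quantifying direct and indirect impacts on downstream targets.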

    Bioinformatic analysis and deep learning on large-scale human transcriptomic data: studies on aging, Alzheimer’s neurodegeneration and cancer

    The overall aim of the project was the integrative bioinformatic analysis of multiple proteomic and genomic datasets, combined with associated clinical data, to search for biomarkers and causal polygenic modules in complex diseases: mainly cancer of unknown primary origin, in its various types and subtypes, and neurodegenerative diseases (ND), chiefly Alzheimer's disease, as well as age-related neurodegeneration. In addition, artificial intelligence techniques, specifically deep learning neural networks, were used intensively for the analysis and prognosis of these diseases.

    Evaluating a new drug to combat Alzheimer's disease.

    M. Sc. University of KwaZulu-Natal, Durban 2014. Alzheimer’s disease (AD), a progressive neurodegenerative disorder that affects mostly the limbic system and the neocortical areas of the brain, is the most prevalent form of dementia affecting the elderly population. Twenty-nine million people live with the disease worldwide, 10% of the population > 65 years of age and 50% of the population > 85 years of age. These figures are expected to increase exponentially over the next few decades and reach 81.1 million by the year 2040. Hallmark lesions include extracellular deposition of β-amyloid protein (Aβ) fibrillar plaques and intraneuronal neurofibrillary tangles (NFTs), which impair synaptic plasticity in the target regions of the brain, thereby producing a progressive decline in cognitive function, with the earliest signs observed in learning and memory. Current therapies for AD are merely palliative and only slow down cognitive decline. In a recent study, a novel compound, poly-N-methylated amyloid beta (Aβ)-peptide C-terminal fragments (MEPTIDES), was shown to reduce Aβ toxicity in vitro and in Drosophila melanogaster. However, whether this novel drug is equally effective at inhibiting Aβ-induced toxicity in mammals remains unclear. Accordingly, in the present study we investigated the effects of MEPTIDES on the neurotoxicity induced by a single intracerebral (i.c.) injection of Aβ42 into the dorsal hippocampus of adult male Sprague-Dawley (SD) rats, a model of AD-like impaired learning and memory, and explored the implications of these findings for possible future management therapies for AD.