188 research outputs found

    Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields

    Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome

    A decision-theoretic approach for segmental classification

    This paper is concerned with statistical methods for the segmental classification of linear sequence data where the task is to segment and classify the data according to an underlying hidden discrete state sequence. Such analysis is commonplace in the empirical sciences including genomics, finance and speech processing. In particular, we are interested in answering the following question: given data $y$ and a statistical model $\pi(x,y)$ of the hidden states $x$, what should we report as the prediction $\hat{x}$ under the posterior distribution $\pi(x \mid y)$? That is, how should you make a prediction of the underlying states? We demonstrate that traditional approaches such as reporting the most probable state sequence or most probable set of marginal predictions can give undesirable classification artefacts and offer limited control over the properties of the prediction. We propose a decision theoretic approach using a novel class of Markov loss functions and report $\hat{x}$ via the principle of minimum expected loss (maximum expected utility). We demonstrate that the sequence of minimum expected loss under the Markov loss function can be enumerated exactly using dynamic programming methods and that it offers flexibility and performance improvements over existing techniques. The result is generic and applicable to any probabilistic model on a sequence, such as Hidden Markov models, change point or product partition models. Comment: Published at http://dx.doi.org/10.1214/13-AOAS657 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
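To make the two "traditional approaches" named in the abstract above concrete, here is a minimal sketch under stated assumptions. It does not implement the paper's Markov loss functions or its dynamic programming; it enumerates every state sequence of a tiny, made-up two-state hidden Markov model by brute force and picks the sequence of minimum expected loss under two standard losses. Under 0-1 loss on the whole sequence this recovers the most probable state sequence, and under Hamming loss it recovers the most probable marginal predictions. All parameter values and function names are hypothetical.

```python
from itertools import product


def joint(x, y, init, trans, emit):
    """Joint probability pi(x, y) of states x and observations y under an HMM."""
    p = init[x[0]] * emit[x[0]][y[0]]
    for t in range(1, len(x)):
        p *= trans[x[t - 1]][x[t]] * emit[x[t]][y[t]]
    return p


def posterior(y, init, trans, emit):
    """Posterior pi(x | y) over all state sequences, by brute-force enumeration."""
    seqs = list(product(range(len(init)), repeat=len(y)))
    weights = [joint(x, y, init, trans, emit) for x in seqs]
    total = sum(weights)
    return {x: w / total for x, w in zip(seqs, weights)}


def min_expected_loss(post, loss):
    """Return the candidate sequence that minimizes the posterior expected loss."""
    def expected(cand):
        return sum(p * loss(cand, x) for x, p in post.items())
    return min(post, key=expected)


def zero_one(a, b):
    """Loss of 1 unless the whole sequence is exactly right."""
    return 0.0 if a == b else 1.0


def hamming(a, b):
    """Number of positions at which the two sequences disagree."""
    return float(sum(s != t for s, t in zip(a, b)))


# Hypothetical toy model: 2 hidden states, 2 observation symbols.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.25, 0.75]]
y = [0, 1, 0, 0, 1]

post = posterior(y, init, trans, emit)
x_map = min_expected_loss(post, zero_one)   # most probable state sequence
x_marg = min_expected_loss(post, hamming)   # most probable marginal states
print(x_map, x_marg)
```

Brute-force enumeration grows exponentially with sequence length, so it is only useful for checking intuition on short sequences; the abstract states that the minimum expected loss sequence under its Markov loss can instead be found exactly by dynamic programming.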

    Conditional random pattern model for copy number aberration detection

    Background: DNA copy number aberration (CNA) is very important in the pathogenesis of tumors and other diseases. For example, CNAs may result in suppression of anti-oncogenes and activation of oncogenes, which can cause certain types of cancer. High-density single nucleotide polymorphism (SNP) array data are widely used for CNA detection. However, detecting CNAs automatically is nontrivial because the signals obtained from high-density SNP arrays often have a low signal-to-noise ratio (SNR), which might be caused by whole genome amplification, mixtures of normal and tumor cells, experimental noise or other technical limitations. As the SNR falls, many false CNA regions are detected and true CNA regions are missed. Thus, more sophisticated statistical models are needed to make CNA detection from low-SNR signals more robust and reliable. Results: This paper presents a conditional random pattern (CRP) model for CNA detection in which many contextual cues are exploited to suppress the noise and improve CNA detection accuracy. Both simulated and real data are used to evaluate the proposed model, and the validation results show that the CRP model is more robust and reliable in the presence of noise for CNA detection using high-density SNP array data, compared to a number of widely used software packages. Conclusions: The proposed conditional random pattern (CRP) model can effectively detect CNA regions in the presence of noise.

    Finding regions of aberrant DNA copy number associated with tumor phenotype

    DNA copy number alterations are a hallmark of cancer. Understanding their role in tumor progression can help improve diagnosis, prognosis and therapy selection for cancer patients. High-resolution, genome-wide measurements of DNA copy number changes for large cohorts of tumors are currently available, owing to technologies like microarray-based array comparative hybridization (arrayCGH). In this thesis, we present a computational pipeline for statistical analysis of tumor cohorts, which can help extract relevant patterns of copy number aberrations and infer their association with various phenotypical indicators. The main challenges are the instability of classification models due to the high dimensionality of the arrays compared to the small number of tumor samples, as well as the large correlations between copy number estimates measured at neighboring loci. We show that the feature ranking given by several widely-used methods for feature selection is biased due to the large correlations between features. In order to correct for the bias and instability of the feature ranking, we introduce methods for consensus segmentation of the set of arrays. We present three algorithms for consensus segmentation, which are based on identifying recurrent DNA breakpoints or DNA regions of constant copy number profile. The segmentation constitutes the basis for computing a set of super-features, corresponding to the regions. We use the super-features for supervised classification and we compare the models to baseline models trained on probe data. We validated the methods by training models for prediction of the phenotype of breast cancers and neuroblastoma tumors. We show that the multivariate segmentation affords higher model stability, in general improves prediction accuracy and facilitates model interpretation. One of our most important biological results refers to the classification of neuroblastoma tumors. 
We show that patients belonging to different age subgroups are characterized by distinct copy number patterns, with the largest discrepancy when the subgroups are defined as older or younger than 16-18 months. We thereby confirm the recommendation for a higher age cutoff than 12 months (current clinical practice) for differential diagnosis of neuroblastoma. The abnormal multiplicity of certain DNA segments (copy number aberrations) is one of the salient hallmarks of cancer. Understanding the role of this hallmark in tumor growth could contribute substantially to improving cancer diagnosis, prognosis and therapy, and thus help in selecting individualized therapies. Microarray-based technologies such as 'Array Comparative Hybridization' (array-CGH) make it possible to produce high-resolution, genome-wide copy number maps of tumor tissues. The subject of this thesis is the development of a software pipeline for the statistical analysis of tumor cohorts, which makes it possible to derive relevant patterns of abnormal copy numbers and to associate them with various phenotypic characteristics. This is done using machine learning methods for classification and feature selection, with a focus on the interpretability of the learned models (regularized linear methods and decision-tree-based models). The challenges for the methods lie mainly in the high dimensionality of the data, set against only a comparatively small number of measured tumor samples, and in the high correlation between the copy numbers measured in neighboring genomic regions. Consequently, the results of feature selection depend strongly on the choice of training data set, which severely limits reproducibility across different clinical data sets.
This thesis shows that the feature ranking determined by various common methods is strongly distorted as a result of high correlation coefficients between individual predictors. To correct this bias and the instability of the feature ranking, we introduce a dimension-reducing step into our pipeline, which consists of jointly segmenting the arrays in a multivariate fashion. We present three algorithms for this multivariate segmentation, which are based on identifying recurrent DNA breakpoints or genomic regions with constant copy number profiles. By summarizing the DNA copy number values within a region, the multivariate segmentation forms the basis for computing a smaller set of 'super-features'. Compared with classification methods that operate at the level of individual array probes, supervised classification based on the super-features improves the interpretability and stability of the models. We validate the methods in this thesis by training prediction models on breast cancer and neuroblastoma data sets. Here we show that the multivariate segmentation step achieves increased model stability without a decrease in prediction quality. The dimension of the problem is reduced considerably (up to 200-fold fewer features), which makes multivariate segmentation more than just an effective means of predicting phenotypes. The method is also suitable as a preprocessing step for later integrative analyses with other data types. The interpretability of the models is improved as well. This enables the identification of important relations between copy number changes and phenotype.
For example, we show that a co-amplification in the immediate vicinity of the ERBB2 gene locus is a highly informative predictor for distinguishing inflammatory from non-inflammatory breast cancers. We thereby confirm the hypothesis, common in the literature, that the size of an amplicon is related to the cancer subtype. In the case of neuroblastoma tumors, we show that subgroups defined by patient age can be characterized by copy number patterns. In particular, this is possible when an age threshold of 16 to 18 months is used to define the groups, which is also where the highest prediction accuracy is obtained. Consequently, we provide further evidence for the recommendation to use a threshold higher than twelve months for the differential diagnosis of neuroblastoma.
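
The super-feature step described in this abstract can be illustrated with a small sketch. It assumes that a consensus segmentation has already produced a shared list of breakpoints and that each region is summarized by the mean of its probe values; the abstract does not say which summary statistic the thesis uses, so the mean is an assumption, as are the function name and the toy data.

```python
from statistics import mean


def super_features(profile, breakpoints):
    """Summarize one array's probe-level values by the mean within each region.

    `breakpoints` lists the start index of every region after the first, so
    the regions are [0, b1), [b1, b2), ..., [bk, len(profile)). The list is
    assumed to hold distinct indices strictly between 0 and len(profile).
    """
    edges = [0] + sorted(breakpoints) + [len(profile)]
    return [mean(profile[a:b]) for a, b in zip(edges, edges[1:])]


# Hypothetical log-ratio profiles for two arrays measured on the same 8 probes.
arrays = [
    [0.0, 0.1, -0.1, 1.0, 1.2, 0.8, 0.0, 0.0],
    [0.1, -0.1, 0.0, 0.0, 0.1, -0.1, -1.0, -1.0],
]
breakpoints = [3, 6]  # shared (consensus) breakpoints, assumed given

# Each array is reduced from 8 probe values to 3 region-level features.
features = [super_features(a, breakpoints) for a in arrays]
print(features)
```

Because every array is cut at the same breakpoints, the resulting rows line up column by column and can be passed to any supervised classifier in place of the probe-level data.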

    RVD2: An ultra-sensitive variant detection model for low-depth heterogeneous next-generation sequencing data

    Motivation: Next-generation sequencing technology is increasingly being used for clinical diagnostic tests. Unlike research cell lines, clinical samples are often genomically heterogeneous due to low sample purity or the presence of genetic subpopulations. Therefore, a variant calling algorithm for calling low-frequency polymorphisms in heterogeneous samples is needed. Result: We present a novel variant calling algorithm that uses a hierarchical Bayesian model to estimate allele frequency and call variants in heterogeneous samples. We show that our algorithm improves upon current classifiers and has higher sensitivity and specificity over a wide range of median read depth and minor allele frequency. We apply our model and identify twelve mutations in the PAXP1 gene in a matched clinical breast ductal carcinoma tumor sample, two of which are loss-of-heterozygosity events
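
As a rough illustration of Bayesian allele-frequency estimation from read counts, the sketch below uses a single-level conjugate Beta-Binomial model. This is explicitly not the hierarchical model of RVD2, whose structure the abstract does not spell out; it only shows how a prior and observed variant reads combine into a posterior estimate at one position. The read counts, the uniform prior and the function names are all hypothetical.

```python
def beta_binomial_posterior(alt_reads, depth, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) parameters for the allele frequency at one position.

    With a Beta(prior_a, prior_b) prior and a Binomial likelihood for the
    number of variant reads, the posterior is again a Beta distribution.
    """
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    return prior_a + alt_reads, prior_b + (depth - alt_reads)


def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical counts at one position: variant reads out of total read depth.
tumor = beta_binomial_posterior(alt_reads=6, depth=200)
control = beta_binomial_posterior(alt_reads=1, depth=200)

print(posterior_mean(*tumor), posterior_mean(*control))
```

A real caller for heterogeneous samples would also need to model sequencing error and variation across replicates before declaring a variant; comparing two posterior means as done here is only a starting point.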

    Statistical Methods for Bioinformatics: Estimation of Copy Number and Detection of Gene Interactions

    Identification of copy number aberrations in the human genome has been an important area in cancer research. In the first part of my thesis, I propose a new model for determining genomic copy numbers using high-density single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference. The second part of the thesis describes a new method for the detection of gene-gene interactions using gene expression data extracted from microarray experiments. The method is based on a two-step Genetic Algorithm, with the first step detecting main effects and the second step looking for interacting gene pairs. The performance of both algorithms is examined on simulated data and real cancer data and compared with popular existing algorithms. Conclusions are given and possible extensions are discussed

    INVESTIGATING INVASION IN DUCTAL CARCINOMA IN SITU WITH TOPOGRAPHICAL SINGLE CELL GENOME SEQUENCING

    Synchronous Ductal Carcinoma in situ (DCIS-IDC) is an early stage breast cancer invasion in which it is possible to delineate genomic evolution during invasion because of the presence of both in situ and invasive regions within the same sample. While laser capture microdissection studies of DCIS-IDC examined the relationship between the paired in situ (DCIS) and invasive (IDC) regions, these studies were either confounded by bulk tissue or limited to a small set of genes or markers. To overcome these challenges, we developed Topographic Single Cell Sequencing (TSCS), which combines laser-catapulting with single cell DNA sequencing to measure genomic copy number profiles from single tumor cells while preserving their spatial context. We applied TSCS to sequence 1,293 single cells from 10 synchronous DCIS patients. We also applied deep-exome sequencing to the in situ, invasive and normal tissues for the DCIS-IDC patients. Previous bulk tissue studies had produced several conflicting models of tumor evolution. Our data support a multiclonal invasion model, in which genome evolution occurs within the ducts and gives rise to multiple subclones that escape the ducts into the adjacent tissues to establish the invasive carcinomas. In summary, we have developed a novel method for single cell DNA sequencing, which preserves spatial context, and applied this method to understand clonal evolution during the transition from carcinoma in situ to invasive ductal carcinoma

    Hidden Markov Models

    Hidden Markov Models (HMMs), although known for decades, are now very widely used and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neurosciences, computational biology, bioinformatics, seismology, environment protection and engineering. I hope that readers will find this book useful and helpful for their own research

    Aneuploidy in Health and Disease

    Aneuploidy means any karyotype that is not euploid, anything that stands outside the norm. Two particular characteristics make the research of aneuploidy challenging. First, it is often hard to distinguish what is a cause and what is a consequence. Secondly, aneuploidy is often associated with a persistent defect in maintenance of genome stability. Thus, working with aneuploid, unstable cells means analyzing an ever changing creature and capturing the features that persist. In the book Aneuploidy in Health and Disease we summarize the recent advances in understanding the causes and consequences of aneuploidy and its link to human pathologies

    Best practices for bioinformatic characterization of neoantigens for clinical utility

    Neoantigens are newly formed peptides created from somatic mutations that are capable of inducing tumor-specific T cell recognition. Recently, researchers and clinicians have leveraged next generation sequencing technologies to identify neoantigens and to create personalized immunotherapies for cancer treatment. To create a personalized cancer vaccine, neoantigens must be computationally predicted from matched tumor-normal sequencing data, and then ranked according to their predicted capability in stimulating a T cell response. This candidate neoantigen prediction process involves multiple steps, including somatic mutation identification, HLA typing, peptide processing, and peptide-MHC binding prediction. The general workflow has been utilized for many preclinical and clinical trials, but there is no current consensus approach and few established best practices. In this article, we review recent discoveries, summarize the available computational tools, and provide analysis considerations for each step, including neoantigen prediction, prioritization, delivery, and validation methods. In addition to reviewing the current state of neoantigen analysis, we provide practical guidance, specific recommendations, and extensive discussion of critical concepts and points of confusion in the practice of neoantigen characterization for clinical use. Finally, we outline necessary areas of development, including the need to improve HLA class II typing accuracy, to expand software support for diverse neoantigen sources, and to incorporate clinical response data to improve neoantigen prediction algorithms. The ultimate goal of neoantigen characterization workflows is to create personalized vaccines that improve patient outcomes in diverse cancer types