22 research outputs found

    Enhancing Estimates of Breakpoints in Genome Copy Number Alteration using Confidence Masks

    Chromosomal structural changes in the human body known as copy number alterations (CNAs) are often associated with diseases, such as various forms of cancer. Accurate estimation of CNA breakpoints is therefore important for understanding the genetic basis of many diseases. High-resolution comparative genomic hybridization (HR-CGH) and single-nucleotide polymorphism (SNP) technologies enable cost-efficient, high-throughput CNA detection. However, the probe data produced by these technologies are heavily contaminated by intense white Gaussian noise. We examine the probabilistic properties of CNAs in HR-CGH and SNP measurements and show that jitter in the breakpoints can be described statistically either by the discrete skew Laplace distribution, when the segmental signal-to-noise ratio (SNR) exceeds unity, or by an approximation based on the modified Bessel function, when the SNR is <1. Based on these approaches, confidence masks can be developed and used to enhance CNA estimates for a given confidence probability by removing breakpoints that are unlikely to exist

    Finding regions of aberrant DNA copy number associated with tumor phenotype

    DNA copy number alterations are a hallmark of cancer. Understanding their role in tumor progression can help improve diagnosis, prognosis and therapy selection for cancer patients. High-resolution, genome-wide measurements of DNA copy number changes for large cohorts of tumors are currently available, owing to technologies like microarray-based array comparative hybridization (arrayCGH). In this thesis, we present a computational pipeline for statistical analysis of tumor cohorts, which can help extract relevant patterns of copy number aberrations and infer their association with various phenotypical indicators. The main challenges are the instability of classification models, due to the high dimensionality of the arrays compared to the small number of tumor samples, and the large correlations between copy number estimates measured at neighboring loci. We show that the feature ranking given by several widely used feature selection methods is biased by the large correlations between features. To correct for the bias and instability of the feature ranking, we introduce methods for consensus segmentation of the set of arrays. We present three algorithms for consensus segmentation, which are based on identifying recurrent DNA breakpoints or DNA regions of constant copy number profile. The segmentation constitutes the basis for computing a set of super-features corresponding to the regions. We use the super-features for supervised classification and compare the models to baseline models trained on probe data. We validated the methods by training models to predict the phenotype of breast cancers and neuroblastoma tumors. We show that the multivariate segmentation affords higher model stability, generally improves prediction accuracy and facilitates model interpretation. One of our most important biological results concerns the classification of neuroblastoma tumors. We show that patients belonging to different age subgroups are characterized by distinct copy number patterns, with the largest discrepancy when the subgroups are defined as older or younger than 16-18 months. We thereby confirm the recommendation of a higher age cutoff than 12 months (current clinical practice) for differential diagnosis of neuroblastoma.
    Abnormal multiplicity of particular DNA segments (copy number aberrations) is one of the most prominent features of cancer. Understanding the role of this feature in tumor growth could contribute substantially to improving cancer diagnosis, prognosis and therapy, and thus help in selecting individual therapies. Microarray-based technologies such as array comparative hybridization (array-CGH) make it possible to produce high-resolution, genome-wide copy number maps of tumor tissue. The subject of this thesis is the development of a software pipeline for the statistical analysis of tumor cohorts that makes it possible to derive relevant patterns of abnormal copy numbers and to associate them with various phenotypic characteristics. This is done using machine learning methods for classification and feature selection, with a focus on the interpretability of the learned models (regularized linear methods and decision-tree-based models). The challenges for these methods lie mainly in the high dimensionality of the data, set against a comparatively small number of measured tumor samples, and in the high correlation between copy numbers measured in neighboring genomic regions. Consequently, the results of feature selection depend strongly on the choice of training data set, which severely limits reproducibility across different clinical data sets. This thesis shows that the feature ranking determined by various common methods is strongly distorted by high correlation coefficients between individual predictors. To correct this bias and the instability of the feature ranking, we introduce a dimension-reducing step into our pipeline, which consists of jointly segmenting the arrays in a multivariate fashion. We present three algorithms for this multivariate segmentation, based on identifying recurrent DNA breakpoints or genomic regions with constant copy number profiles. By aggregating the DNA copy number values within a region, the multivariate segmentation forms the basis for computing a smaller set of 'super-features'. Compared with classification methods that operate at the level of individual array probes, supervised classification based on the super-features improves the interpretability and stability of the models. We validate the methods in this thesis by training predictive models on breast cancer and neuroblastoma data sets. Here we show that the multivariate segmentation step achieves increased model stability without a loss in prediction quality. The dimension of the problem is reduced considerably (up to 200-fold fewer features), which makes multivariate segmentation more than an effective means of predicting phenotypes: the procedure is also suitable as a preprocessing step for later integrative analyses with other data types. The interpretability of the models is improved as well, which enables the identification of important relations between copy number changes and phenotype. For example, we show that a co-amplification in the immediate vicinity of the ERBB2 gene locus is a highly informative predictor for distinguishing inflammatory from non-inflammatory breast cancers. We thereby confirm the hypothesis, common in the literature, that the size of an amplicon is related to the cancer subtype. In the case of neuroblastoma tumors, we show that subgroups defined by patient age can be characterized by copy number patterns. This is possible in particular when an age threshold of 16 to 18 months is used to define the groups, which is also where the highest prediction accuracy is obtained. We thus provide further evidence for the recommendation to use a threshold higher than twelve months for the differential diagnosis of neuroblastoma

    Computational Identification of Recessive Mutations in Cancers using High Throughput SNP-arrays

    This thesis presents a highly sensitive genome-wide search method for recessive mutations. The method is suitable for distantly related samples that are divided into phenotype positives and negatives. High-throughput genotype arrays are used to identify and compare homozygous regions between the cohorts. The method is demonstrated by comparing colorectal cancer patients against unaffected references. The objective is to find homozygous regions and alleles that are more common in cancer patients. We have designed and implemented software tools to automate the data analysis from genotypes to lists of candidate genes and their properties. The programs follow a pipeline architecture that allows them to be integrated with other programs, such as biological databases and copy number analysis tools. This integration is crucial because the genome-wide analysis of cohort differences produces many candidate regions that are not related to the studied phenotype. CohortComparator is a genotype comparison tool that detects homozygous regions and compares their loci and allele constitutions between two sets of samples. The data are visualised in chromosome-specific graphs illustrating the homozygous regions and alleles of each sample. The genomic regions that may harbour recessive mutations are highlighted with different colours, and a scoring scheme is given for these regions. The detection of homozygous regions, the cohort comparisons and the result annotations all rest on assumptions, many of which have been parameterized in our programs. The effect of these parameters and the suitable scope of the methods have been evaluated. Samples with different resolutions can be balanced using the genotype estimates of their haplotypes and can then be used within the same study

    Statistical Methods for High Dimensional Networked Data Analysis.

    Networked data are frequently encountered in many scientific disciplines. Major challenges in the analysis of such data are their high dimensionality and complex dependence. My dissertation consists of three projects. The first project focuses on the development of a sparse multivariate factor analysis regression model to construct the underlying sparse association map between gene expressions and biomarkers. This is motivated by the fact that some associations may be obscured by unknown confounding factors that are not collected in the data. I have shown that accounting for such unobserved confounding factors can increase both sensitivity and specificity for detecting important gene-biomarker associations and thus lead to more interpretable association maps. The second project concerns the reconstruction of the underlying gene regulatory network using directed acyclic graphical models. My project aims to reduce false discoveries by identifying and removing edges that result from shared confounding factors. I propose sparse structural factor equation models, in which structural equation models are used to capture directed graphs while factor analysis models are used to account for potential latent factors. I have shown that the proposed method yields a simpler and more interpretable topology of a gene regulatory network. The third project is devoted to the development of a new regression analysis methodology for electroencephalogram (EEG) neuroimaging data that are correlated among electrodes within an EEG-net. To address the analytic challenges of integrating network topology into the analysis, I propose hybrid quadratic inference functions that incorporate both prior and data-driven correlations among network nodes into statistical estimation and inference. The proposed method is conceptually simple, computationally fast and, more importantly, has appealing large-sample properties. In a real EEG data analysis, I applied the proposed method and detected a significant association between iron deficiency and event-related potentials measured in two subregions, which was not found using classical spatial ANOVA random-effects models.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111595/1/zhouyan_1.pd

    Approches bio-informatiques appliquées aux technologies émergentes en génomique

    Genetic studies, such as linkage or association studies, have yielded greater knowledge of the etiology of several diseases affecting human populations. Even though some ten thousand genetic studies have been carried out on hundreds of diseases or other traits, a large part of their heritability remains unexplained. Over the past ten years or so, several breakthroughs have been made in genomics. For example, the use of high-density comparative genomic hybridization microarrays demonstrated the large-scale existence of copy number variations and polymorphisms. These can now be detected using DNA microarrays or high-throughput sequencing. In addition, recent studies using high-throughput sequencing have shown that the majority of variations present in an individual's exome are rare or even private to that individual. This led to the design of a new DNA microarray that determines, quickly and at low cost, the genotype of several thousand rare variations for a large set of individuals at once. In this context, the general objective of this thesis is the development of new methodologies and new high-performance bioinformatics tools for detecting, to high quality standards, copy number variations and rare nucleotide variations in genetic studies. In the long term, these advances will make it possible to explain a larger share of the missing heritability of complex traits, thereby advancing knowledge of their etiology. An algorithm for partitioning copy number polymorphisms was therefore designed, making it possible to use these structural variations in genetic linkage studies on family data. Next, an exploratory study characterized the various problems associated with genetic studies that use rare copy number variations in unrelated individuals. This study was carried out in collaboration with the Wellcome Trust Centre for Human Genetics of the University of Oxford. Subsequently, the performance of genotyping algorithms was compared when used with a new DNA microarray containing a majority of rare markers. Finally, a bioinformatics tool for filtering genetic data quickly and efficiently was implemented. This tool generates higher-quality data with better reproducibility of results, while reducing the chances of obtaining a false association.
    Genetic studies, such as linkage and association studies, have contributed greatly to a better understanding of the etiology of several diseases. Nonetheless, despite the tens of thousands of genetic studies performed to date, a large part of the heritability of diseases and traits remains unexplained. The last decade saw unprecedented progress in genomics. For example, the use of microarrays for high-density comparative genomic hybridization has demonstrated the existence of large-scale copy number variations and polymorphisms. These are now detectable using DNA microarrays or high-throughput sequencing. In addition, high-throughput sequencing has shown that the majority of variations in the exome are rare or unique to the individual. This has led to the design of a new type of DNA microarray, enriched for rare variants, that can be genotyped quickly and inexpensively at high throughput. In this context, the general objective of this thesis is the development of methodological approaches and bioinformatics tools for detecting copy number polymorphisms and rare single nucleotide variations to the highest quality standards. It is expected that by doing so, more of the missing heritability of complex traits can be accounted for, contributing to the advancement of knowledge of the etiology of diseases. We have developed an algorithm for partitioning copy number polymorphisms, making it feasible to use these structural changes in genetic linkage studies with family data. We have also conducted an extensive study, in collaboration with the Wellcome Trust Centre for Human Genetics of the University of Oxford, to characterize rare copy number definition metrics and their impact on study results with unrelated individuals. We have conducted a thorough comparison of the performance of genotyping algorithms when used with a new DNA microarray composed of a majority of very rare genetic variants. Finally, we have developed a bioinformatics tool for the fast and efficient processing of genetic data to increase quality and reproducibility of results and to reduce spurious associations

    Genome-Wide Analysis of Histone Modification Enrichments Induced by Marek's Disease Virus in Inbred Chicken Lines

    Covalent histone modifications constitute a complex network of transcriptional regulation involved in diverse biological processes ranging from stem cell differentiation to immune response. The advent of modern sequencing technologies enables one to query the locations of histone modifications across the genome in an efficient manner. However, inherent biases in the technology and diverse enrichment patterns complicate data analysis. Marek's disease (MD) is an acute, lymphoma-inducing disease of chickens with disease outcomes affected by multiple host and environmental factors. Inbred chicken lines 6₃ and 7₂ share the same major histocompatibility complex haplotype, but have contrasting responses to MD. This dissertation presents novel methods for analysis of genome-wide histone modification data and application of new and existing methods to the investigation of epigenetic effects of MD on these lines. First, we present WaveSeq, a novel algorithm for detection of significant enrichments in ChIP-Seq data. WaveSeq implements a distribution-free approach by combining the continuous wavelet transform with Monte Carlo sampling techniques for effective peak detection. WaveSeq outperformed existing tools, particularly for diffuse histone modification peaks, demonstrating that restrictive distributional assumptions are not necessary for accurate ChIP-Seq peak detection. Second, we investigated latent MD in thymus tissues by profiling H3K4me3 and H3K27me3 in infected and control birds from lines 6₃ and 7₂. Several genes associated with MD, e.g. MX1 and CTLA-4, along with those linked with human cancers, showed line-specific and condition-specific enrichments. One of the first studies of histone modifications in chickens, our work demonstrated that MD induced widespread epigenetic variations. Finally, we analyzed the temporal evolution of histone modifications at distinct phases of MD progression in the bursa of Fabricius. 
Genes involved in several important pathways, e.g. apoptosis and MAPK signaling, and various immune-related miRNAs showed differential histone modifications in the promoter region. Our results indicated heightened inflammation in the susceptible line during early cytolytic MD, while resistant birds showed recuperative symptoms during early MD and epigenetic silencing during latent infection. Thus, although further elucidation of underlying mechanisms is necessary, this work provided the first definitive evidence of the epigenetic effects of MD

    Investigating chemoresistance in relapsed/refractory B cell non-Hodgkin Lymphoma

    PhD Thesis. Paediatric B-cell non-Hodgkin Lymphoma (B-NHL), namely Burkitt lymphoma and diffuse large B cell lymphoma, is successfully treated in the majority of patients in the UK at the cost of debilitating toxicity. For patients who undergo disease progression the prognosis is dire, with salvage rates as low as 20%. Previous studies have identified putative markers of disease progression, but none are currently used in the clinic. There is a clear need for usable markers of relapse/refractory disease at diagnosis for paediatric B-NHL, with the aim of stratifying patients and identifying new potentially targetable genes and pathways. Copy number analysis of 162 patients from the CCLG and published data identified genomic aberrations associated with disease progression: 17p copy number neutral loss of heterozygosity (CNN-LOH), 3q29 amplification and 17q CNN-LOH. 17p CNN-LOH was a prognostic marker with a hazard ratio of 5.6 (95% CI 2-16, p=0.001, Cox proportional hazard method). TP53 was investigated further using a combination of Sanger sequencing and whole-exome sequencing. TP53 aberrations were present in 52/95 cases, with biallelic abnormalities conferring poorer outcomes. Biallelic TP53 aberrations were also associated with complex chromosomal abnormalities, including a novel aberration termed 13qplex. Copy number analysis of 105 endemic BL patients treated in Malawi showed that prognostic aberrations in sporadic BL are present but not prognostic in endemic BL. TP53 aberrations were identified in endemic BL and were not associated with relapse; however, biallelic cases had an inferior overall survival. Investigating 11 diagnostic and relapse pairs demonstrated that TP53 status drives evolution of chemo-resistant disease. BLs with TP53 aberrations at diagnosis exhibited linear evolution, while TP53 normal cases had early-diverging patterns of progression and acquired TP53 aberrations at relapse. 
We report TP53 as an important prognostic marker in paediatric B-NHL that confers higher risk of disease progression and may help inform treatment decisions allowing for the possibility of new treatments

    The Genetic Architecture of Structural Renal and Urinary Tract Malformations

    Structural renal and urinary tract malformations are the most common cause of kidney failure in children. These congenital anomalies of the kidneys and urinary tract (CAKUT) are a phenotypically diverse group of malformations that result from defects in embryonic kidney, ureter, and bladder development. A genetic basis for CAKUT has been proposed, with over 50 monogenic causes reported; however, a molecular diagnosis is made in fewer than 20% of patients. In this thesis, I used bioinformatics and statistical genetics methodology to investigate the genetic architecture of structural renal and urinary tract malformations using whole-genome sequencing (WGS) data from the 100,000 Genomes Project. Population-based rare and common variant association testing was performed in over 800 cases and 20,000 controls of diverse ancestry, seeking enrichment of single-nucleotide/indel and structural variation on a genome-wide, per-gene, and cis-regulatory element basis. Using a sequencing-based genome-wide association study (GWAS), I identified the first robust genetic associations of posterior urethral valves (PUV), the most common cause of kidney failure in boys. Bayesian fine-mapping and functional annotation mapped these two loci to the transcription factor TBX5 and the planar cell polarity gene PTK7, with both signals replicated in an independent cohort. Significant enrichment of rare structural variation affecting cis-regulatory elements was also detected, providing novel insights into the pathogenesis of this poorly understood disorder. I also demonstrated that the contribution of known monogenic disease to CAKUT has been overestimated and that common and low-frequency variation plays an important role in phenotypic variability. These findings support an omnigenic rather than monogenic model of inheritance for CAKUT and are consistent with the extensive genotypic-phenotypic heterogeneity, variable expressivity, and incomplete penetrance observed in this condition. 
Finally, this work demonstrates the value of sequencing-based GWAS methodology in rare disease, beyond conventional monogenic gene discovery, and provides strong support for an inclusive diverse-ancestry approach