
    Single-cell copy number variation detection

    Detection of chromosomal aberrations from a single cell by array comparative genomic hybridization (single-cell array CGH), instead of from a population of cells, is an emerging technique. However, such detection is challenging because of the genome artifacts and the DNA amplification process inherent to the single-cell approach. Current normalization algorithms result in inaccurate aberration detection for single-cell data. We propose a normalization method based on channel, genome composition and recurrent genome artifact corrections. We demonstrate that the proposed channel clone normalization significantly improves the copy number variation detection in both simulated and real single-cell array CGH data.

    Copy number variation detection using next generation sequencing read counts

    Background: A copy number variation (CNV) is a difference between genotypes in the number of copies of a genomic region. Next generation sequencing (NGS) technologies provide sensitive and accurate tools for detecting genomic variations that include CNVs. However, statistical approaches for CNV identification using NGS are limited. We propose a new methodology for detecting CNVs using NGS data. This method (henceforth denoted by m-HMM) is based on a hidden Markov model with emission probabilities that are governed by mixture distributions. We use the Expectation-Maximization (EM) algorithm to estimate the parameters in the model. Results: A simulation study demonstrates that our proposed m-HMM approach has greater power for detecting copy number gains and losses relative to existing methods. Furthermore, application of our m-HMM to DNA sequencing data from the two maize inbred lines B73 and Mo17 to identify CNVs that may play a role in creating phenotypic differences between these inbred lines provides results concordant with previous array-based efforts to identify CNVs. Conclusions: The new m-HMM method is a powerful and practical approach for identifying CNVs from NGS data
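The idea of a hidden Markov model whose emission probabilities are mixture distributions can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the m-HMM paper: the three states, the two-component Poisson mixtures, the transition probabilities and the read counts are all invented, and the parameters are fixed by hand rather than estimated with EM. The sketch only shows Viterbi decoding of copy number states from per-window read counts.

```python
import math

def log_poisson(k, lam):
    # log P(K = k) for a Poisson distribution with mean lam
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_mixture(k, weights, means):
    # log of a weighted sum of Poisson probabilities (log-sum-exp for stability)
    terms = [math.log(w) + log_poisson(k, m) for w, m in zip(weights, means)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def viterbi(counts, states, log_start, log_trans, emission):
    # Most likely hidden state path for a sequence of read counts
    score = {s: log_start[s] + log_mixture(counts[0], *emission[s]) for s in states}
    back = []
    for k in counts[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            ptr[s] = best
            score[s] = prev[best] + log_trans[best][s] + log_mixture(k, *emission[s])
        back.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical parameters: (mixture weights, Poisson means) per state
states = ["loss", "normal", "gain"]
emission = {
    "loss": ([0.9, 0.1], [5.0, 2.0]),
    "normal": ([0.9, 0.1], [20.0, 14.0]),
    "gain": ([0.9, 0.1], [40.0, 55.0]),
}
log_start = {"loss": math.log(0.01), "normal": math.log(0.98), "gain": math.log(0.01)}
log_trans = {
    r: {s: math.log(0.98 if r == s else 0.01) for s in states} for r in states
}

# Invented read counts for 13 consecutive genomic windows
counts = [19, 22, 21, 4, 6, 5, 20, 18, 41, 44, 39, 21, 19]
path = viterbi(counts, states, log_start, log_trans, emission)
print(list(zip(counts, path)))
```

In a full method the mixture weights, means and transition probabilities would be estimated from the data (the abstract names the EM algorithm for this) rather than set by hand.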

    Application of order restricted statistical inference and hidden Markov modeling to problems in biology and genomics

    Statistics is a powerful tool across scientific fields, supporting experimental design, data processing and statistical inference. In this thesis, we conduct theoretical and methodological statistical research with applications in biological and genomic areas. In Chapter 2, we study statistical testing problems with an order-restricted null hypothesis, where the null parameter space is a union of two disjoint convex cones. We derive the likelihood ratio test and the intersection-union test, and show that the likelihood ratio test is uniformly more powerful than the intersection-union test. We also demonstrate the situations in which the uniformly more powerful tests can be constructed, and discuss the applicability of the uniformly more powerful tests to real data analyses. In Chapter 3, we propose four testing procedures for detecting monotonic changes in multivariate gene expression distributions. We consider cases in which the treatment factor is ordinal and can be naturally ordered. The proposed procedures focus detection power on genes with monotonic departures from mean equality. The proposed methods can also handle small sample sizes and high-dimensional distributions. In Chapter 4, we propose a new methodology, based on a hidden Markov model with a mixture emission distribution, to detect copy number variations between different genomes using next generation sequencing read counts. This method improves on existing methods. We use it to identify copy number variations between two maize genotypes, and the result is concordant with previous genomic studies using microarray data. The thesis concludes in Chapter 5, which discusses future research directions.

    Statistical Methods and Analysis for Human Genetic Copy Number Variation and Homozygosity Mapping

    Single nucleotide polymorphism (SNP) arrays are used primarily for genetic association studies, with data being analyzed in most cases one SNP at a time. Several other applications of SNP arrays, however, involve integration of data over multiple markers for a single individual. Two such applications of SNP arrays are studies of copy number variants (CNVs) and regions of homozygosity or identity by descent. Hidden Markov models are a common approach to both of these problems, but other methods have been used as well. In this dissertation I address several methodological issues related to these two types of analysis, and also apply the methods to several datasets. The purpose of my studies in CNVs is to better detect and analyze CNVs. A major concern for all copy number variation (CNV) calling algorithms is their reliability and repeatability. I use family data as a verification standard to evaluate CNV calling strategies and methods. I make recommendations for how CNV calls can be used in genome-wide association studies. I then apply them to analyze CNVs in studies of psychiatric disorders and birth outcomes. Results from these studies have the potential for great public health significance, because they can lead to better understanding of the genetic etiology and eventually to better markers for disease screening and diagnosis. Homozygosity mapping is a powerful method to map genes for rare recessive disorders. However, current methods are not ideal, especially when using high density SNP array data from consanguineous families. This study develops improved methods for homozygosity mapping using dense SNP data, and thus will improve the ability of geneticists to find genetic causes of rare recessive diseases. Many of these rare disorders are life-threatening; identification of the disease genes may help with early diagnosis and treatment

    Data analysis methods for copy number discovery and interpretation

    Copy number variation (CNV) is an important type of genetic variation that can give rise to a wide variety of phenotypic traits. Differences in copy number are thought to play major roles in processes that involve dosage-sensitive genes, providing beneficial, deleterious or neutral modifications to individual phenotypes. Copy number analysis has long been a standard in clinical cytogenetic laboratories. Gene deletions and duplications can often be linked with genetic syndromes, such as the 7q11.23 deletion of Williams-Beuren syndrome, the 22q11 deletion of DiGeorge syndrome and the 17p11.2 duplication of Potocki-Lupski syndrome. Interestingly, copy number based genomic disorders often display reciprocal deletion/duplication syndromes, with the latter frequently exhibiting milder symptoms. Moreover, the study of chromosomal imbalances plays a key role in cancer research. The datasets used for the development of analysis methods during this project are generated as part of the cutting-edge translational project Deciphering Developmental Disorders (DDD). The DDD project is the first of its kind and will directly apply state-of-the-art technologies, in the form of ultra-high-resolution microarray and next generation sequencing (NGS), to real-time genetic clinical practice. It is a collaboration between the Wellcome Trust Sanger Institute (WTSI) and the National Health Service (NHS) involving the 24 regional genetic services across the UK and Ireland. Although the application of DNA microarrays for the detection of CNVs is well established, individual change point detection algorithms often display variable performance. The definition of an optimal set of parameters for achieving a certain level of performance is rarely straightforward, especially where data qualities vary ... [cont.]

    UNCERTAINTY MITIGATION IN IMAGE-BASED MACHINE LEARNING MODELS FOR PRECISION MEDICINE

    Machine learning (ML) algorithms have been developed to build predictive models in medicine and healthcare. In most cases, the performance of ML models/algorithms is measured by predictive accuracy or accuracy-related measures only. In medicine, the model results are intended to guide physicians to make critical decisions regarding patient care. This means that quantifying and mitigating the uncertainty of the output is also very important as it will allow decision makers to know how much they can rely on the model output. My dissertation focuses on studying model uncertainty of image-based ML in the context of precision medicine of brain cancer. Specifically, I focus on developing ML models to predict intra-tumor heterogeneity of genomic and molecular markers based on multi-contrast magnetic resonance imaging (MRI) data for glioblastoma (GBM) – the most aggressive type of brain cancer. Intra-tumor heterogeneity has been found to be a leading cause of treatment failure of GBM. Devising a non-invasive approach to map out the molecular/genomic distribution using MRI helps develop treatment with high precision. My dissertation research addresses the model uncertainties due to high-dimensional and noisy features, sparsity of labeled data, and utility of domain knowledge. In the first study, we developed a Semi-supervised Gaussian Process with Uncertainty-minimizing Feature-selection (SGP-UF), which can incorporate selected unlabeled samples (i.e. unbiopsied regions of a tumor) in the model training, and integrate feature selection with a new criterion of seeking features that minimize the prediction uncertainty. In the second study, we developed a Knowledge-infused Global-Local data fusion (KGL) framework, which optimally fuses three sources of data/information including biopsy samples (labeled data, local/sparse), images (unlabeled data, global), and knowledge-driven mechanistic models. 
    In the third study, we developed a Weakly Supervised Ordinal Support Vector Machine (WSO-SVM), which aims to leverage a combination of data sources including biopsy/labeled samples and unlabeled samples from the tumor and image data from the normal brain, as well as their intrinsic ordinal relationship. We demonstrate that these novel methods significantly reduce prediction uncertainty while at the same time achieving higher accuracy in precision medicine, which can inform personalized targeted treatment decisions that potentially improve clinical outcome.

    Investigation of de-novo copy number variants in patients with Autism Spectrum Disorder in Vietnam

    Autism spectrum disorder (ASD) is a neurodevelopmental disorder with a prevalence of approximately 1% of children worldwide. ASD is characterized by deficits in social communication and interaction and the presence of restricted interests and repetitive behaviours. Genetic alterations that increase the risk of ASD have been reported. Early genetic screening, especially for families with a positive ASD history, could aid early diagnosis and potentially more effective disease management strategies. Copy number variants (CNVs) are alterations in chromosome structure and have been reported to contribute significantly to the pathogenesis of ASD. Currently, genome-wide DNA microarray (e.g. microarray comparative genomic hybridization (aCGH)) is considered the first-tier screen for genetic aberrations in autistic children, albeit with limited success (around 10% in studies primarily based on patients of Caucasian origin). There is currently no comprehensive study on the diagnostic potential of aCGH in autistic children from Vietnam. This study aims to investigate the possible role of CNVs in Vietnamese patients with a clinical diagnosis of ASD. One hundred trios (both parents and at least one child) were recruited; in each trio the child was clinically diagnosed with ASD while the parents were not clinically affected. MECP2 and FMR1 DNA tests were performed to exclude Rett syndrome and Fragile X syndrome, respectively, as possible causes. The aCGH test was performed on all patients as well as their parents to identify de novo CNVs in the patients. We detected 442 non-redundant CNVs in 100 patients, with 210 (47.5%) identified as de novo in origin. We identified five variants of uncertain significance (VUS) as well as pathogenic de novo CNVs (two duplication and three deletion CNVs) in seven patients (four males and three females) related to autism based on the SFARI (Simons Foundation Autism Research Initiative) database.
    In six patients, four known pathogenic CNVs were identified (diagnostic success 6%). Deletions involving the SHANK3 gene made the highest contribution (3%), which could pave the way for future diagnostics focused on this gene alone. These findings provide initial information for building an aCGH screening test for Vietnamese autistic children and identifying the relationship between genetic alterations and the effectiveness of stem cell transplantation, bringing a new strategy for ASD management in Vietnam.
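The trio design described above, where a CNV found in the child but in neither parent is treated as de novo, can be sketched as a simple interval comparison. This is a minimal illustration under stated assumptions, not the study's pipeline: the `CNV` record, the 50% reciprocal-overlap threshold and the coordinates are all hypothetical.

```python
from typing import NamedTuple

class CNV(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str  # "del" or "dup"

def reciprocal_overlap(a, b):
    # Fraction of overlap relative to the longer of the two calls;
    # zero if the calls differ in chromosome or type, or do not overlap
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))

def de_novo(child, mother, father, threshold=0.5):
    # Keep child calls that no parental call matches at the given threshold
    parental = mother + father
    return [c for c in child
            if all(reciprocal_overlap(c, p) < threshold for p in parental)]

# Invented coordinates, for illustration only
child = [CNV("chr1", 100_000, 200_000, "del"), CNV("chr7", 1_000_000, 1_200_000, "dup")]
mother = [CNV("chr7", 1_010_000, 1_190_000, "dup")]
father = []

calls = de_novo(child, mother, father)
print(calls)
```

Here the chr7 duplication is matched by a maternal call and is treated as inherited, while the chr1 deletion has no parental counterpart. The threshold is a design choice: a stricter value calls more child CNVs de novo when parental breakpoints differ.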

    Distinct transcriptional signatures of aneuploidy in murine pluripotent cell populations

    Grant no: BB/D526261/1. Genomic integrity in mouse embryonic and induced pluripotent stem cells can be compromised by factors such as extended time in culture and cellular reprogramming. Surprisingly, only a few studies have thus far examined the accumulation of chromosomal imbalances in mouse pluripotent populations upon prolonged propagation in vitro. It is presumed that specific recurring genetic changes can confer selective growth advantage and resistance to apoptosis and/or differentiation to the affected cells, although the genes that drive these processes remain elusive. The presence of these changes in published studies can confound the analysis of the data and hinder the reproducibility of the results. At the transcriptional level, aneuploidy manifests as large chromosomal regions of aberrant gene expression. This thesis presents a method to identify these regions in large-scale datasets and interrogate them for recurrent patterns. The present analysis shows that over half of the 315 mouse pluripotent samples examined carry whole or partial-chromosome spanning clusters of aberrant transcription. Furthermore, there are common gene expression changes across samples with any type of predicted aneuploidy and samples with chromosome-specific aberrations. These transcriptional signatures have been used to train classification models which can predict aneuploid samples with over 90% accuracy. This is an important step towards the development of a low-cost and reliable transcriptional validation assay for the presence of aneuploidy.

    Designing synthetic spike-in controls for next-generation sequencing and beyond

    Next-generation sequencing (NGS) is a revolutionary tool that can be used for a myriad of applications, ranging from clinical genome sequencing, to gene expression profiling with RNA sequencing (RNA-seq), to the detection of microbes within environmental samples or isolates. However, significant analytical challenges remain with NGS data due to the complexity of genome architecture, as well as a range of biases introduced during library preparation, sequencing and analysis. These biases and challenges can be understood and mitigated through the use of spike-in controls – DNA or RNA oligonucleotides with known sequence and length that are added to samples prior to library preparation. While spike-in controls have previously been developed for transcriptomics, they were designed for technologies that predated the advent of NGS and consequently suffer from several limitations. In this thesis, I present a novel design framework for synthetic spike-in standards (‘sequins’) that can be applied to a range of NGS applications, and demonstrate how sequins can be used as internal controls to assist in the analysis of accompanying samples. In Chapter 1, I develop a set of spliced synthetic RNA standards that are encoded by artificial gene loci on an accompanying in silico chromosome. RNA sequins enable the assessment of important but previously intractable RNA-seq properties including split-read alignment, alternative splicing, isoform-level quantification and fusion gene detection. In Chapter 2, I present the design of a set of DNA sequins comprising a synthetic community of artificial microbial genomes, which can be used in metagenome sequencing and analysis. Importantly, DNA sequins facilitate the accurate resolution of microbial abundance shifts between samples, which are otherwise imperceptible with NGS. Finally, in Chapter 3, I show how RNA sequins can be used in the analysis of complex brain transcriptomes generated using targeted RNA-seq. 
    This includes an assessment of capture efficiency, quantitative accuracy, and the setting of empirical thresholds to distinguish signal from noise. These transcriptomes are presented as an atlas that can be used to link gene expression with neurological phenotypes. The technologies, associated datasets and analytical methods developed herein provide a qualitative and quantitative reference with which to navigate the complexity of genome biology.