Single-cell copy number variation detection
Detection of chromosomal aberrations from a single cell by array comparative genomic hybridization (single-cell array CGH), instead of from a population of cells, is an emerging technique. However, such detection is challenging because of the genome artifacts and the DNA amplification process inherent to the single-cell approach. Current normalization algorithms result in inaccurate aberration detection for single-cell data. We propose a normalization method based on channel, genome composition and recurrent genome artifact corrections. We demonstrate that the proposed channel clone normalization significantly improves copy number variation detection in both simulated and real single-cell array CGH data.
Copy number variation detection using next generation sequencing read counts
Background: A copy number variation (CNV) is a difference between genotypes in the number of copies of a genomic region. Next generation sequencing (NGS) technologies provide sensitive and accurate tools for detecting genomic variations that include CNVs. However, statistical approaches for CNV identification using NGS are limited. We propose a new methodology for detecting CNVs using NGS data. This method (henceforth denoted by m-HMM) is based on a hidden Markov model with emission probabilities that are governed by mixture distributions. We use the Expectation-Maximization (EM) algorithm to estimate the parameters in the model.
Results: A simulation study demonstrates that our proposed m-HMM approach has greater power for detecting copy number gains and losses relative to existing methods. Furthermore, application of our m-HMM to DNA sequencing data from the two maize inbred lines B73 and Mo17 to identify CNVs that may play a role in creating phenotypic differences between these inbred lines provides results concordant with previous array-based efforts to identify CNVs.
Conclusions: The new m-HMM method is a powerful and practical approach for identifying CNVs from NGS data.
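To make the idea of a hidden Markov model with mixture emissions concrete, the following is a minimal sketch and not the authors' m-HMM. It assumes three copy number states, a Poisson emission mixed with a small uniform outlier component, fixed (not EM-estimated) parameters, and Viterbi decoding. The function names, state means, transition probabilities and read counts are all hypothetical choices made for illustration.

```python
import math

def poisson_logpmf(k, lam):
    # Log-probability of observing count k under Poisson(lam).
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def mixture_logpmf(k, lam, outlier_weight=0.01, max_count=1000):
    # Assumed emission: (1 - w) * Poisson(lam) + w * Uniform{0..max_count}.
    # Counts are assumed not to exceed max_count.
    main = math.log(1.0 - outlier_weight) + poisson_logpmf(k, lam)
    outlier = math.log(outlier_weight) - math.log(max_count + 1)
    hi, lo = max(main, outlier), min(main, outlier)
    return hi + math.log1p(math.exp(lo - hi))  # log-sum-exp of the two terms

def viterbi(counts, means, stay=0.98):
    # Most likely state path for a sequence of per-window read counts.
    states = list(means)
    switch = (1.0 - stay) / (len(states) - 1)
    log_trans = {(a, b): math.log(stay if a == b else switch)
                 for a in states for b in states}
    score = {s: math.log(1.0 / len(states)) + mixture_logpmf(counts[0], means[s])
             for s in states}
    back = []
    for k in counts[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[(p, s)])
            pointers[s] = best
            score[s] = prev[best] + log_trans[(best, s)] + mixture_logpmf(k, means[s])
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical read counts per genomic window and expected depth per state.
counts = [20, 22, 19, 21, 41, 39, 43, 40, 20, 18, 9, 11, 10, 8, 21, 20]
means = {"loss": 10.0, "normal": 20.0, "gain": 40.0}
calls = viterbi(counts, means)
print(list(zip(counts, calls)))
```

In a full method the state means, mixture weights and transition probabilities would be estimated from the data (the abstract names the EM algorithm for this) rather than fixed by hand as they are here.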
Application of order restricted statistical inference and hidden Markov modeling to problems in biology and genomics
Statistics is a powerful tool across scientific fields, supporting experimental design, data processing and statistical inference. In this thesis, we conduct theoretical and methodological statistical research with applications in biology and genomics.
In Chapter 2, we study statistical testing problems with an order-restricted null hypothesis, where the null parameter space is a union of two disjoint convex cones. We derive the likelihood ratio test and the intersection-union test, and show that the likelihood ratio test is uniformly more powerful than the intersection-union test. We also describe the situations in which uniformly more powerful tests can be constructed, and discuss the applicability of these tests to real data analyses.
In Chapter 3, we propose four testing procedures for detecting monotonic changes in multivariate gene expression distributions. We consider cases in which the treatment factor is ordinal and can be naturally ordered. The proposed procedures concentrate detection power on genes with monotonic departures from mean equality. The proposed methods can also handle small sample sizes and high-dimensional distributions.
In Chapter 4, we propose a new methodology, based on a hidden Markov model with a mixture emission distribution, to detect copy number variations between different genomes using next generation sequencing read counts. This method improves on existing methods. We use it to identify copy number variations between two maize genotypes, and the results are concordant with previous genomic studies using microarray data.
This thesis concludes in Chapter 5, which provides a discussion of future research directions.
Statistical Methods and Analysis for Human Genetic Copy Number Variation and Homozygosity Mapping
Single nucleotide polymorphism (SNP) arrays are used primarily for genetic association studies, with data being analyzed in most cases one SNP at a time. Several other applications of SNP arrays, however, involve integration of data over multiple markers for a single individual. Two such applications of SNP arrays are studies of copy number variants (CNVs) and regions of homozygosity or identity by descent. Hidden Markov models are a common approach to both of these problems, but other methods have been used as well. In this dissertation I address several methodological issues related to these two types of analysis, and also apply the methods to several datasets.
The purpose of my studies in CNVs is to better detect and analyze CNVs. A major concern for all copy number variation (CNV) calling algorithms is their reliability and repeatability. I use family data as a verification standard to evaluate CNV calling strategies and methods. I make recommendations for how CNV calls can be used in genome-wide association studies. I then apply them to analyze CNVs in studies of psychiatric disorders and birth outcomes. Results from these studies have the potential for great public health significance, because they can lead to better understanding of the genetic etiology and eventually to better markers for disease screening and diagnosis.
Homozygosity mapping is a powerful method to map genes for rare recessive disorders. However, current methods are not ideal, especially when using high density SNP array data from consanguineous families. This study develops improved methods for homozygosity mapping using dense SNP data, and thus will improve the ability of geneticists to find genetic causes of rare recessive diseases. Many of these rare disorders are life-threatening; identification of the disease genes may help with early diagnosis and treatment.
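As a rough illustration of what a region of homozygosity looks like in dense SNP data, the sketch below scans an ordered list of genotype calls for uninterrupted homozygous runs. It is a simple run-length scan, not the hidden Markov approach or the improved methods the dissertation describes. The function name, the two-letter genotype encoding and the `min_length` threshold are assumptions for illustration.

```python
def homozygous_runs(genotypes, min_length=4):
    # Return (start, end) index pairs, end exclusive, for each run of
    # consecutive homozygous calls that spans at least min_length markers.
    runs = []
    start = None
    for i, g in enumerate(genotypes):
        homozygous = len(g) == 2 and g[0] == g[1]
        if homozygous and start is None:
            start = i
        elif not homozygous and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the last marker.
    if start is not None and len(genotypes) - start >= min_length:
        runs.append((start, len(genotypes)))
    return runs

# Hypothetical genotype calls along one chromosome, ordered by position.
calls = ["AA", "AB", "BB", "BB", "AA", "AA", "BB", "AB", "AA"]
runs = homozygous_runs(calls)
print(runs)
```

A scan like this has no tolerance for genotyping errors, which is one reason model-based approaches are generally preferred on real array data.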
Data analysis methods for copy number discovery and interpretation
Copy number variation (CNV) is an important type of genetic variation that can give rise to a wide variety of phenotypic traits. Differences in copy number are thought to play major roles in processes that involve dosage-sensitive genes, providing beneficial, deleterious or neutral modifications to individual phenotypes. Copy number analysis has long been a standard in clinical cytogenetic laboratories. Gene deletions and duplications can often be linked with genetic syndromes such as the 7q11.23 deletion of Williams-Beuren syndrome, the 22q11 deletion of DiGeorge syndrome and the 17q11.2 duplication of Potocki-Lupski syndrome. Interestingly, copy number based genomic disorders often display reciprocal deletion/duplication syndromes, with the latter frequently exhibiting milder symptoms. Moreover, the study of chromosomal imbalances plays a key role in cancer research. The datasets used for the development of analysis methods during this project are generated as part of the cutting-edge translational project Deciphering Developmental Disorders (DDD). The DDD project is the first of its kind and will directly apply state-of-the-art technologies, in the form of ultra-high resolution microarray and next generation sequencing (NGS), to real-time genetic clinical practice. It is a collaboration between the Wellcome Trust Sanger Institute (WTSI) and the National Health Service (NHS) involving the 24 regional genetic services across the UK and Ireland. Although the application of DNA microarrays for the detection of CNVs is well established, individual change point detection algorithms often display variable performance. The definition of an optimal set of parameters for achieving a certain level of performance is rarely straightforward, especially where data qualities vary ... [cont.]
UNCERTAINTY MITIGATION IN IMAGE-BASED MACHINE LEARNING MODELS FOR PRECISION MEDICINE
Machine learning (ML) algorithms have been developed to build predictive models in medicine and healthcare. In most cases, the performance of ML models/algorithms is measured by predictive accuracy or accuracy-related measures only. In medicine, the model results are intended to guide physicians to make critical decisions regarding patient care. This means that quantifying and mitigating the uncertainty of the output is also very important as it will allow decision makers to know how much they can rely on the model output.
My dissertation focuses on studying model uncertainty of image-based ML in the context of precision medicine of brain cancer. Specifically, I focus on developing ML models to predict intra-tumor heterogeneity of genomic and molecular markers based on multi-contrast magnetic resonance imaging (MRI) data for glioblastoma (GBM), the most aggressive type of brain cancer. Intra-tumor heterogeneity has been found to be a leading cause of treatment failure of GBM. Devising a non-invasive approach to map out the molecular/genomic distribution using MRI helps develop treatment with high precision. My dissertation research addresses the model uncertainties due to high-dimensional and noisy features, sparsity of labeled data, and utility of domain knowledge.
In the first study, we developed a Semi-supervised Gaussian Process with Uncertainty-minimizing Feature-selection (SGP-UF), which can incorporate selected unlabeled samples (i.e. unbiopsied regions of a tumor) in the model training, and integrate feature selection with a new criterion of seeking features that minimize the prediction uncertainty.
In the second study, we developed a Knowledge-infused Global-Local data fusion (KGL) framework, which optimally fuses three sources of data/information including biopsy samples (labeled data, local/sparse), images (unlabeled data, global), and knowledge-driven mechanistic models.
In the third study, we developed a Weakly Supervised Ordinal Support Vector Machine (WSO-SVM), which aims to leverage a combination of data sources including biopsy/labeled samples and unlabeled samples from the tumor and image data from the normal brain, as well as their intrinsic ordinal relationship.
We demonstrate that these novel methods significantly reduce prediction uncertainty while at the same time achieving higher accuracy in precision medicine, which can inform personalized targeted treatment decisions that potentially improve clinical outcome. Ph.D.
Investigation of de-novo copy number variants in patients with Autism Spectrum Disorder in Vietnam
Autism spectrum disorder (ASD) is a neurodevelopmental disorder with a prevalence of approximately 1% of children worldwide. ASD is characterized by deficits in social communication and interaction and the presence of restricted interests and repetitive behaviours. Genetic alterations that increase the risk of ASD have been reported. Early genetic screening, especially for families with a positive ASD history, could aid early diagnosis and potentially more effective disease management strategies. Copy number variants (CNVs) are alterations in chromosome structure and have been reported as a significant contributor to the pathogenesis of ASD. Currently, genome-wide DNA microarray (e.g. microarray comparative genomic hybridization (aCGH)) is considered the first-tier screen for genetic aberrations in autistic children, albeit with limited success (around 10% in studies primarily based on patients of Caucasian origin). There is currently no comprehensive study on the diagnostic potential of aCGH in autistic children from Vietnam. This study aims to investigate the possible role of CNVs in Vietnamese patients with a clinical diagnosis of ASD. One hundred trios (both parents and at least one child) were recruited, where in each trio the child was clinically diagnosed with ASD while the parents were not clinically affected. MECP2 and FMR1 DNA tests were performed to exclude Rett syndrome and Fragile X syndrome, respectively, as possible causes. The aCGH test was performed on all patients as well as their parents to identify de novo CNVs in the patients. We detected 442 non-redundant CNVs in 100 patients, with 210 (47.5%) identified as de novo in origin. We identified five variants of uncertain significance (VUS) as well as pathogenic de novo CNVs (two duplication and three deletion CNVs) in seven patients (four males and three females) related to autism based on the SFARI (Simons Foundation Autism Research Initiative) database.
In six patients, four known pathogenic CNVs were identified (diagnostic success 6%). We found the highest contribution (3%) from deletions involving the SHANK3 gene, which could pave the way for future diagnostics focused on this gene alone. These findings provide initial information for building an aCGH screening test for Vietnamese autistic children and for identifying the relationship between genetic alterations and the effectiveness of stem cell transplantation, bringing a new strategy for ASD management in Vietnam.
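The trio design described above can be sketched as a simple filter: a CNV in the child is treated as de novo when neither parent carries a CNV of the same type covering the same region. This is a minimal sketch under assumptions, not the study's pipeline. The `CNV` record, the 50% overlap threshold and the coordinates are hypothetical; only the counts 210 and 442 come from the abstract.

```python
from typing import NamedTuple

class CNV(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str  # "del" or "dup"

def overlap_fraction(child_cnv, parent_cnv):
    # Fraction of the child's CNV covered by a parental CNV of the same type.
    if child_cnv.chrom != parent_cnv.chrom or child_cnv.kind != parent_cnv.kind:
        return 0.0
    shared = min(child_cnv.end, parent_cnv.end) - max(child_cnv.start, parent_cnv.start)
    return max(shared, 0) / (child_cnv.end - child_cnv.start)

def de_novo(child_cnvs, parent_cnvs, min_overlap=0.5):
    # Keep child CNVs that no parental CNV covers by at least min_overlap.
    return [c for c in child_cnvs
            if all(overlap_fraction(c, p) < min_overlap for p in parent_cnvs)]

# Hypothetical calls for one trio (coordinates are made up).
child = [CNV("1", 5_000_000, 5_100_000, "del"), CNV("2", 1_000_000, 1_200_000, "dup")]
parents = [CNV("2", 950_000, 1_250_000, "dup")]
candidates = de_novo(child, parents)
print(candidates)

# Proportion reported in the abstract: 210 de novo out of 442 CNVs.
de_novo_rate = round(100 * 210 / 442, 1)
print(de_novo_rate)
```

The overlap threshold is a design choice: a stricter value calls more CNVs de novo, so real analyses typically also account for the resolution of the array and the confidence of each call.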
Distinct transcriptional signatures of aneuploidy in murine pluripotent cell populations
Grant no: BB/D526261/1. Genomic integrity in mouse embryonic and induced pluripotent stem cells can be compromised by factors such as extended time in culture and cellular reprogramming. Surprisingly, only a few studies have thus far examined the accumulation of chromosomal imbalances in mouse pluripotent populations upon prolonged propagation in vitro. It is presumed that specific recurring genetic changes can confer selective growth advantage and resistance to apoptosis and/or differentiation to the affected cells, although the genes that drive these processes remain elusive. The presence of these changes in published studies can confound the analysis of the data and hinder the reproducibility of the results. At the transcriptional level, aneuploidy manifests as large chromosomal regions of aberrant gene expression. This thesis presents a method to identify these regions in large-scale datasets and interrogate them for recurrent patterns. The present analysis shows that over half of the 315 mouse pluripotent samples examined carry whole or partial-chromosome spanning clusters of aberrant transcription. Furthermore, there are common gene expression changes across samples with any type of predicted aneuploidy and samples with chromosome-specific aberrations. These transcriptional signatures have been used to train classification models which can predict aneuploid samples with over 90% accuracy. This is an important step towards the development of a low-cost and reliable transcriptional validation assay for the presence of aneuploidy.
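One simple way to picture "large chromosomal regions of aberrant gene expression" is a sliding window over genes ordered by chromosomal position, flagging windows whose median expression shift is consistently away from zero. The sketch below is an assumption-laden illustration of that idea and not the method developed in the thesis. The function name, window size, threshold and log-ratio values are hypothetical.

```python
from statistics import median

def aberrant_windows(log_ratios, window=5, threshold=0.3):
    # log_ratios: per-gene expression log-ratios (sample vs. reference),
    # ordered by position along one chromosome.
    # Returns (start, end, median) for each window whose median shift
    # is at least `threshold` in absolute value; end is exclusive.
    flagged = []
    for i in range(len(log_ratios) - window + 1):
        m = median(log_ratios[i:i + window])
        if abs(m) >= threshold:
            flagged.append((i, i + window, m))
    return flagged

# Hypothetical values: a block of elevated expression in the middle.
values = [0.0, 0.05, -0.1, 0.02, 0.6, 0.5, 0.55, 0.7, 0.45, 0.0, -0.05]
hits = aberrant_windows(values)
print(hits)
```

Using the median rather than the mean keeps a single strongly regulated gene from flagging a window, which matters because the signal of interest is a coordinated shift across many neighbouring genes.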
Designing synthetic spike-in controls for next-generation sequencing and beyond
Next-generation sequencing (NGS) is a revolutionary tool that can be used for a myriad of applications, ranging from clinical genome sequencing, to gene expression profiling with RNA sequencing (RNA-seq), to the detection of microbes within environmental samples or isolates. However, significant analytical challenges remain with NGS data due to the complexity of genome architecture, as well as a range of biases introduced during library preparation, sequencing and analysis. These biases and challenges can be understood and mitigated through the use of spike-in controls: DNA or RNA oligonucleotides with known sequence and length that are added to samples prior to library preparation. While spike-in controls have previously been developed for transcriptomics, they were designed for technologies that predated the advent of NGS and consequently suffer from several limitations. In this thesis, I present a novel design framework for synthetic spike-in standards ("sequins") that can be applied to a range of NGS applications, and demonstrate how sequins can be used as internal controls to assist in the analysis of accompanying samples. In Chapter 1, I develop a set of spliced synthetic RNA standards that are encoded by artificial gene loci on an accompanying in silico chromosome. RNA sequins enable the assessment of important but previously intractable RNA-seq properties including split-read alignment, alternative splicing, isoform-level quantification and fusion gene detection. In Chapter 2, I present the design of a set of DNA sequins comprising a synthetic community of artificial microbial genomes, which can be used in metagenome sequencing and analysis. Importantly, DNA sequins facilitate the accurate resolution of microbial abundance shifts between samples, which are otherwise imperceptible with NGS. Finally, in Chapter 3, I show how RNA sequins can be used in the analysis of complex brain transcriptomes generated using targeted RNA-seq.
This includes an assessment of capture efficiency, quantitative accuracy, and the setting of empirical thresholds to distinguish signal from noise. These transcriptomes are presented as an atlas that can be used to link gene expression with neurological phenotypes. The technologies, associated datasets and analytical methods developed herein provide a qualitative and quantitative reference with which to navigate the complexity of genome biology.