Nonparametric false discovery rate control for identifying simultaneous signals
It is frequently of interest to jointly analyze multiple sequences of
multiple tests in order to identify simultaneous signals, defined as features
tested in multiple studies whose test statistics are non-null in each. In many
problems, however, the null distributions of the test statistics may be
complicated or even unknown, and there do not currently exist any procedures
that can be employed in these cases. This paper proposes a new nonparametric
procedure that can identify simultaneous signals across multiple studies even
without knowing the null distributions of the test statistics. The method is
shown to asymptotically control the false discovery rate, and in simulations
had excellent power and error control. In an analysis of gene expression and
histone acetylation patterns in the brains of mice exposed to a conspecific
intruder, it identified genes that were both differentially expressed and next
to differentially accessible chromatin. The proposed method is available in the
R package github.com/sdzhao/ssa
Nonparametric False Discovery Rate Control for Identifying Simultaneous Signals
It is frequently of interest to identify simultaneous signals, defined as features that exhibit statistical significance across each of several independent experiments. For example, genes that are consistently differentially expressed across experiments in different animal species can reveal evolutionarily conserved biological mechanisms. However, in some problems the test statistics corresponding to these features can have complicated or unknown null distributions. This paper proposes a novel nonparametric false discovery rate control procedure that can identify simultaneous signals even without knowing these null distributions. The method is shown, theoretically and in simulations, to asymptotically control the false discovery rate. It was also used to identify genes that were both differentially expressed and proximal to differentially accessible chromatin in the brains of mice exposed to a conspecific intruder. The proposed method is available in the R package github.com/sdzhao/ssa
A physical basis for quantitative ChIP-sequencing
ChIP followed by next-generation sequencing (ChIP-Seq) is a key technique for mapping the distribution of histone posttranslational modifications (PTMs) and chromatin-associated factors across genomes. There is a perceived challenge to define a quantitative scale for ChIP-Seq data, and as such, several approaches making use of exogenous additives, or "spike-ins," have recently been developed. Herein, we report on the development of a quantitative, physical model defining ChIP-Seq. The quantitative scale on which ChIP-Seq results should be compared emerges from the model. To test the model and demonstrate the quantitative scale, we examine the impacts of an EZH2 inhibitor through the lens of ChIP-Seq. We report a significant increase in immunoprecipitation of presumed off-target histone PTMs after inhibitor treatment, a trend predicted by the model but contrary to spike-in-based indications. Our work also identifies a sensitivity issue in spike-in normalization that has not been considered in the literature, placing limitations on its utility and trustworthiness. We call our new approach the sans-spike-in method for quantitative ChIP-sequencing (siQ-ChIP). A number of changes in community practice of ChIP-Seq, data reporting, and analysis are motivated by this work
Homologous recombination-deficient cancers: approaches to improve treatment and patient selection
In order for cells to divide, all the DNA in a cell must be copied and divided into two new cells. However, DNA in our cells is constantly dealing with different types of damage, either from factors outside (e.g. UV rays in sunlight) or inside the body (e.g. due to errors that occur during the copying of the DNA). To ensure that this damage does not lead to permanent changes, cells have DNA damage repair mechanisms. An important mechanism is homologous recombination (HR), which repairs double-stranded DNA breaks. Without this mechanism, cells cannot survive. However, some cancers have a defect in HR. This is a paradox, because healthy cells do not survive without HR, while these cancer cells apparently survive without HR. BRCA1 and BRCA2 are two important genes in HR and a BRCA1/2 mutation is associated with an increased risk of developing breast and ovarian cancer. In this thesis, models are used in which a BRCA1/2 defect is induced to study an HR defect in cancer. Since 2013, PARP inhibitors have been approved for the treatment of patients with BRCA1/2-mutated breast and ovarian cancer. However, a defect in HR can also be caused by other gene mutations and these patients could also benefit from PARP inhibitors. In this thesis, a test is validated to select the right patients for PARP inhibitor treatment. The working mechanisms of PARP inhibitors are also being investigated to make treatment even more effective. In addition, the immune system plays an important role in cancers with an HR defect. Some of these mechanisms are described and investigated
Uncovering rare genetic variants predisposing to coeliac disease
Coeliac disease is a common (1% prevalence) inflammatory disease of the small intestine, involving the role of tissue transglutaminase and HLA-DQ binding immuno-dominant wheat peptides. The disease is highly heritable, however, at most only 40% of this heritability is explained by HLA-DQ and risk variants from genome wide association and fine mapping studies. The hypothesis of the research in this thesis is that rare (minor allele frequency <0.5%) mutations of large effect size (odds ratios ~2-5) exist, especially in multiply affected pedigrees, which account for the missing heritability of disease. NimbleGen exome capture and Illumina GAIIx high throughput sequencing was performed in 75 coeliac disease individuals from 55 multiply affected families. Candidate genes were chosen from various analytical strategies: linkage, shared variants between multiple related subjects and gene burden tests for multiple potentially causal variants. Highly multiplexed amplicon sequencing, using Fluidigm technology, of all RefSeq exons from 24 candidate genes in 2,304 coeliac cases and 2,304 controls was performed to locate further rare variation. Gene burden tests on a highly stringent post quality control dataset identified no significant associations (P<1x10^-3) at the resequenced candidate genes. The strategy of sequencing multiply affected families, and deep follow up of candidate genes, has not identified new disease risk mutations. Common variants (and other factors, e.g. environmental) may instead account for familial clustering in this common autoimmune disease
Transcription Factor-Mediated Epigenetic Regulation in the Healthy Brain and Neurological Disease
Proper cellular development and function is a complex process established by elaborate gene expression networks. These networks are regulated by epigenetic processes, which alter chromatin states and coordinate the binding of transcription factors (TFs) to regulatory elements (REs), such as enhancers, across the genome to facilitate gene expression. It follows then that a major experimental effort is to profile and understand the binding patterns of TFs to REs in various cellular types and contexts. Critically however, current TF profiling techniques are limited in their abilities to profile TF occupancy in targeted cellular populations and temporal windows, hindering investigations into epigenetic control in complex, multicellular systems, such as the brain. This dissertation focuses on two related areas: firstly, the design of new tools for profiling TF genome occupancy in the mouse brain in specific cellular populations and time periods, and secondly, investigating TF-mediated mechanisms of disease pathogenesis in animal models. In Chapter 2, we describe the development of a novel, viral-mediated method, termed adeno-associated virus (AAV) calling cards, for profiling binding sites of TFs across the genome in the live mouse brain. The AAV calling cards approach allows unique access to TF occupancy information that is inaccessible with other existing techniques, including cell type specificity (through Cre-mediated conditional expression) and historical binding (through longitudinal occupancy recording). Then, in Chapters 3 and 4, we apply this new technique to mouse models to investigate epigenetic misregulation in disease. Previous studies have demonstrated that a large portion of genetic variation associated with cellular dysfunction or disease exists in TF-bound enhancers, demonstrating the criticality of proper TF binding in maintaining cellular homeostasis. However, whether these elements are misregulated more broadly in disease contexts is unclear. 
In Chapter 3, we apply AAV calling cards to a model of acute seizure and uncover aberrant epigenetic regulation which is predictive of phenotypic outcomes. Particularly important in this study is the ability of AAV calling cards to record and integrate historical TF binding information, allowing linkage of antecedent epigenetic events to eventual seizure outcomes. Here, we longitudinally recorded prodromal enhancer activity to identify loci which are predictive of seizure severity. Next, in Chapter 4, we investigate epigenetic regulation in animal models and postmortem tissues from individuals with amyotrophic lateral sclerosis (ALS). In this study, we focus on a subset of ALS caused by a large hexanucleotide (G4C2) repeat expansion in the gene chromosome 9 open reading frame 72 (C9orf72), which is the most common genetic cause of ALS (C9ALS). Utilizing AAV calling cards as well as other established epigenomic profiling techniques, we observe broad epigenetic misregulation both in C9ALS mouse models and human tissues at the transcriptional and translational levels. Importantly, the C9ALS mouse models used in this study do not develop motor neuron degeneration or ALS-like phenotypes and were profiled at an early age, suggesting that these changes occur early in the disease process and are likely driven by C9orf72-related pathologic species, such as dipeptide repeat proteins (DPRs). Finally, in Chapter 5 we investigate the characteristic properties of C9orf72-specific pathologies, including DPRs, in human C9ALS. We probed size and abundance of DNA expansions and DPRs in blood, cerebrospinal fluid, and postmortem tissues from C9ALS and sporadic ALS (sALS) individuals and identified novel correlations of C9ALS patient pathologies with clinical and demographic data. Moving forward, these data will facilitate mechanistic studies and clinical trials aimed at reducing or altering C9ALS pathologies in the central nervous system (CNS).
In summary, the body of work detailed here extends our knowledge of TFs in both the healthy and diseased central nervous system (CNS), providing new insights into the role of epigenetic regulation in disease pathogenesis. Further, the establishment of AAV calling cards as a widely applicable epigenomic tool will empower innovative new studies in a variety of tissue and model systems
Investigating the role of nuclear encoded mitochondrial genes in the onset of type 2 diabetes
Mitochondrial dysfunction has long been implicated in Type 2 diabetes (T2D). This relationship appears to be bidirectional, with evidence that mitochondrial dysfunction is both caused by and causal of T2D-related phenotypes. A potential causal role in T2D onset would be supported by evidence of a genetic predisposition to mitochondrial dysfunction, since inherited genetic risk factors precede and contribute to disease onset. Here, a genetic study design is used to investigate the potential role of T2D-associated genetic risk loci (T2D loci) in disrupting mitochondrial function through the altered expression of nuclear-encoded mitochondrial genes (NEMGs). The mitochondria are targeted by multiple T2D drugs and therefore such loci may be informative for effective treatment and prevention measures. The functional cis-genes regulated by T2D loci were identified based on the co-location of T2D loci with adipose tissue expression quantitative trait loci (eQTL) within a genetic distance of 1 LDU. T2D loci and eQTL were previously mapped using LDU-based gene mapping, which is compared and contrasted in this thesis to other popular tests of association. 50 of the identified T2D cis-genes were NEMGs and implicated a number of pathways in the inherited risk of T2D, including the relevant pathway of branched-chain amino acid catabolism. These same 50 genes were enriched for decreased expression in T2D cases compared to controls in independent gene expression datasets. Compared to the total known NEMGs, the 50 cis-NEMGs showed further enrichment for decreased expression, suggesting that T2D-eQTL co-location may identify specific subsets of causal genes. Finally, a candidate T2D locus associated with the cis-NEMG ACAD11 was fine-mapped using targeted sequence data for 94 T2D cases and 94 controls. Several candidate causal variants were identified, including two low-frequency haplotypes, one of which contained both an ACAD11 splicing mutation and a mutation predicted to disrupt the observed binding of HNF4A and COUP-TFII within the ACAD11 promoter region.
Studying the effects of genetic factors on the female reproductive lifespan
The objective of my research was to investigate the rare and very-rare genetic factors influencing female reproductive ageing in humans using large-scale population exome-sequencing data. Over the past decade, most studies have relied on non-sequencing genomic data, which only allowed analysis of common genomic variants. However, these genome-wide array studies have limitations in capturing the complete range of genetic variation. Consequently, our understanding of the role of rare genomic variants, which may have a significant impact on menopause timing, has been limited. Furthermore, comprehensive studies exploring genetic factors associated with menopause age, particularly early and very early menopause, have been limited by the lack of large-scale sequencing genomic data, such as population-based datasets. Most of the previously published research has been derived from clinical and family studies, and there has been a dearth of population-based studies that can validate and identify novel genomic factors using a cohort of healthy individuals. Consequently, my aim was to utilise population whole-exome sequencing data for the first time to advance our understanding of genomic factors that impact female reproductive lifespan.
In Chapter 1, I provide an introduction to the biology of menopause. I emphasise the importance of studying menopause timing and the revolutionary impact of using population sequencing genomic data to improve our understanding of the underlying genomic causes of menopause timing.
Chapter 2 comprises analysis focusing on the correlation between bone morphogenetic protein 15 (BMP15) and its previously reported variants in relation to menopause timing. The BMP15 gene and its missense variants have been identified as a potential candidate for premature ovarian insufficiency (POI) based on prior investigations. However, our study revealed no evidence of the previously reported variants being causative factors for POI. Furthermore, when conducting a gene burden association test, we found no significant association between various types of BMP15 variants and early menopause.
Chapter 3 builds on the previous chapter and presents an in-depth analysis aimed at assessing the penetrance of over 100 genes associated with premature ovarian insufficiency (POI). The findings of this investigation provide limited evidence supporting the existence of autosomal dominant effects in the reported POI genes. Surprisingly, the vast majority of heterozygous effects on these genes were ruled out, with 99.9% of all protein-truncating variants being observed in women with normal reproductive health. However, we did observe evidence of haploinsufficiency effects in certain genes, including TWNK and SOHLH2.
Chapter 4 is an exome-wide association study to identify rare genetic variants associated with menopause timing. We identified effects ~5 times larger than previously discovered in analyses of common variants, highlighting protein-coding variants in ETAA1, ZNF518A, PNPLA8, PALB2 and SAMHD1. We found rare loss-of-function variants in the ZNF518A gene, which reduced menopause age by approximately six years.
Chapter 5 culminates by assessing the significant contributions made by this study in advancing our comprehension of the variation in genetic risk factors associated with female reproductive lifespan. Additionally, it outlines potential directions for future research in this field, highlighting areas that warrant further exploration and investigation
Discovering pathways to autism spectrum disorder by using functional and integrative genomics approaches to assess monozygotic twin differences
Autism spectrum disorder (ASD) is a common developmental disorder typified
by deficits in social communication and stereotyped behaviours. Despite evidence
of a strong genetic basis to the disorder, molecular studies have thus far had little
success in identifying risk variants or other biomarkers, and presently there is
no unified pathomechanistic explanation. Monozygotic (MZ) twins show incomplete
concordance in autistic traits, which suggests that alternative risk pathways
involving non-shared environmental (NSE) factors could also have an important
role to play in ASD. In this thesis, we describe microarray and RNA-seq studies
characterising gene expression in a sample of 53 ASD MZ twin pairs from TEDS.
The overall aims were to: 1) establish convergent evidence for genes and pathways
involved in the etiology of ASD by comparing affected and unaffected subjects
across the sample, and 2) identify those responsive to the environment by examining
differences within the discordant pairs. We found a number of genes were differentially
expressed including DEPDC1B - the most significant finding in cases
vs controls, which also showed consistent down regulation within pairs. We further
identified IGHG4, IGHG3, IGHV3-66, HSPA8P14, HSPA13, SLC15A2, and
found that these results were enriched for transcriptional control, immune, and
PI3K/AKT signalling pathways. We suggest that as these were found to be perturbed
in the discordant twins, they could represent ASD risk pathways sensitive
to the NSE. Next, we investigated integrative genomics methods for performing
meta-dimensional analysis using the expression data along with methylation data
on the same cohort. After applying regression-based joint analysis methods, and
meta-analysis p-value combination methods to our datasets, a number of genes
obtained nominal significance across the datasets, including potential genes of interest:
NLGN2, UBE3A, OXTR. We suggest these represent genes with evidence for
being functionally relevant to ASD
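
The abstract above mentions meta-analysis p-value combination methods without naming one. As a minimal sketch, assuming Fisher's method (one common choice, not necessarily the one used in the thesis) and independent p-values, the combination step for a single gene could look like this in Python. The function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one gene with Fisher's method.

    Assumption: the inputs are independent p-values for the same gene,
    e.g. one from an expression dataset and one from a methylation dataset.
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    # Fisher's statistic is X = -2 * sum(log p_i); under the null it follows
    # a chi-square distribution with 2k degrees of freedom. We keep X / 2.
    half_stat = -sum(math.log(p) for p in pvalues)

    # For even degrees of freedom 2k the chi-square survival function has
    # the closed form P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Example: nominal evidence in two datasets for the same gene.
combined = fisher_combined_pvalue([0.05, 0.05])
```

If the datasets come from the same individuals, as in a twin cohort, the p-values may be correlated, and the independence assumption behind this sketch would need to be checked or relaxed.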
Spatial statistical modelling of epigenomic variability
Each cell in our body carries the same genetic information encoded in the DNA, yet the human
organism contains hundreds of cell types which differ substantially in physiology and functionality.
This variability stems from the existence of regulatory mechanisms that control gene expression,
and hence phenotype. The field of epigenetics studies how changes in biochemical factors, other
than the DNA sequence itself, might affect gene regulation. The advent of high throughput
sequencing platforms has enabled the profiling of different epigenetic marks on a genome-wide
scale; however, bespoke computational methods are required to interpret these high-dimensional
data and investigate the coupling between the epigenome and transcriptome.
This thesis contributes to the development of statistical models to capture spatial correlations
of epigenetic marks, with the main focus being DNA methylation. To this end, we developed
BPRMeth (Bayesian Probit Regression for Methylation), a probabilistic model for extracting
higher order methylation features that precisely quantify the spatial variability of bulk DNA
methylation patterns. Using such features, we constructed an accurate machine learning
predictor of gene expression from DNA methylation and identified prototypical methylation
profiles that explain most of the variability across promoter regions. The BPRMeth model, and
its algorithmic implementation, were subsequently substantially extended both to accommodate
different data types, and to improve the scalability of the algorithm.
Bulk experiments have paved the way for mapping the epigenetic landscape; nonetheless,
they fall short of explaining the epigenetic heterogeneity and quantifying its dynamics, which
inherently occur at the single cell level. Single cell bisulfite sequencing protocols have been
recently developed; however, due to intrinsic limitations of the technology, they result in
extremely sparse coverage of CpG sites, effectively limiting the analysis repertoire to a semi-quantitative
level. To overcome these difficulties we developed Melissa (MEthyLation Inference
for Single cell Analysis), a Bayesian hierarchical model that leverages local correlations between
neighbouring CpGs and similarity between individual cells to jointly impute missing methylation
states, and cluster cells based on their genome-wide methylation profiles.
A recent experimental innovation enables the parallel profiling of DNA methylation, transcription
and chromatin accessibility (scNMT-seq), making it possible to link transcriptional
and epigenetic heterogeneity at the single cell resolution. For the scNMT-seq study, we applied
the extended BPRMeth model to quantify cell-to-cell chromatin accessibility heterogeneity
around promoter regions and subsequently link it to transcript abundance. This revealed that
genes with conserved accessibility profiles are associated with higher average expression levels.
In summary, this thesis proposes statistical methods to model and interpret epigenomic data
generated from high throughput sequencing experiments. Due to their statistical power and
flexibility we anticipate that these methods will be applicable to future sequencing technologies
and become widespread tools in the high throughput bioinformatics workbench for performing
biomedical data analysis