736 research outputs found
Developing tools for making inferences from genomic data
The central question that the King lab seeks to answer is, "What happens on the genomic level when phenotypes evolve?" To this end, the core focus of my work has been on developing software and analytical methods that help other researchers working in this space answer their questions of interest. ... In the second chapter, we describe the problem of inferring ancestral haplotype frequencies from pooled sequence data. We discuss five existing approaches to this problem, and we present a new genetic-algorithm-based approach that we designed. We benchmark a method developed by Burke et al. and compare it to the genetic algorithm by evaluating the performance of each on a simulated dataset. We find that the new method outperforms the existing one. By Paul Petrowski. Includes bibliographical references.
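To make the haplotype-frequency problem in this abstract concrete, here is a minimal sketch of a genetic algorithm that searches for founder haplotype frequencies whose predicted pooled allele frequencies match the observed ones. It is not the author's method: the names `predicted_freqs`, `sq_error`, and `fit_haplotype_freqs`, the squared-error fitness, and all parameter values are assumptions chosen for illustration.

```python
import random


def predicted_freqs(founders, h):
    """Pooled allele frequency at each SNP given haplotype frequencies h.

    founders[k][s] is the 0/1 allele carried by founder k at SNP s.
    """
    n_snps = len(founders[0])
    return [sum(h[k] * founders[k][s] for k in range(len(founders)))
            for s in range(n_snps)]


def sq_error(founders, h, observed):
    """Sum of squared differences between predicted and observed frequencies."""
    return sum((p - o) ** 2
               for p, o in zip(predicted_freqs(founders, h), observed))


def normalise(h):
    """Rescale a positive vector so that it sums to 1."""
    total = sum(h)
    return [x / total for x in h]


def mutate(h, rng, scale=0.05):
    """Add Gaussian noise to each frequency, keep it positive, renormalise."""
    return normalise([max(1e-9, x + rng.gauss(0, scale)) for x in h])


def crossover(a, b, rng):
    """Blend two candidate frequency vectors with a random weight."""
    w = rng.random()
    return normalise([w * x + (1 - w) * y for x, y in zip(a, b)])


def fit_haplotype_freqs(founders, observed, pop_size=40, generations=200, seed=0):
    """Hypothetical GA: keep the best half, refill with mutated crossovers."""
    rng = random.Random(seed)
    k = len(founders)
    population = [normalise([rng.random() + 1e-9 for _ in range(k)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda h: sq_error(founders, h, observed))
        survivors = population[:pop_size // 2]
        children = [mutate(crossover(rng.choice(survivors),
                                     rng.choice(survivors), rng), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return min(population, key=lambda h: sq_error(founders, h, observed))


# Toy data (made up): three founders, four SNPs, known true frequencies.
founders = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
observed = predicted_freqs(founders, [0.5, 0.3, 0.2])
estimate = fit_haplotype_freqs(founders, observed)
```

Because the best half of the population is always retained, the fit never gets worse from one generation to the next; a real analysis would also have to account for sequencing noise and read depth.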
Optimizing rare variant association studies in theory and practice
Genome-wide association studies (GWAS) have greatly improved our understanding of the genetic basis of complex traits. However, there are two major limitations with GWAS. First, most common variants identified by GWAS individually or in combination explain only a small proportion of heritability. This raises the possibility that additional forms of genetic variation, such as rare variants, could contribute to the missing heritability. The second limitation is that GWAS typically cannot identify which genes are being affected by the associated variants. Examination of rare variants, especially those in coding regions of the genome, can help address these issues. Moreover, several studies have recently identified low-frequency variants at both known and novel loci associated with complex traits, suggesting that functionally significant rare variants exist in the human population
Application of Pool-seq for variation detection and proteogenomic database creation in β-hemolytic streptococci.
Proteogenomics is an emerging field that combines genomic (transcriptomic) and proteomic data with the aim of improving gene models and identification of proteins. Technological advances in each domain increase the potential of the field in fostering further understanding of organisms. For instance, the current low cost and fast sequencing technologies have made it possible to sequence multiple representative samples of organisms thus improving the comprehensiveness of the organisms’ reference proteomes. At the same time, improvements in mass spectrometry techniques have led to an increase in the quality and quantity of proteomics data produced, which are utilized to update the annotation of coding sequences in genomes.
Sequencing of pooled individual DNAs (Pool-seq) is one method for sequencing large numbers of samples cost effectively. It is a robust method that can accurately identify variations that exist between samples. Similar to other proteogenomics methods such as the sample specific databases derived from RNA-seq data, the variants from Pool-seq experiments can be utilized to create variant protein databases and improve the completeness of protein reference databases used in mass spectrometry (MS)-based proteomics analysis. In this thesis work, the efficiency of Pool-seq in identifying variants and estimating allele frequencies from strains of three β-hemolytic bacteria (GAS, GGS and GBS) is investigated. Moreover, in this work a novel Python package (‘PoolSeqProGen’) for creating variant protein databases from the Pool-seq experiments was developed. To our knowledge, this was the first work to use Pool-seq for sequencing large numbers of β-hemolytic bacteria and assess its efficiency on such genetically polymorphic bacteria. The ‘PoolSeqProGen’ tool is also the first and only tool available to create proteogenomic databases from Pool-seq data.
For organisms such as the β-hemolytic bacteria GAS, GBS and GGS that have open pangenomes, the sequencing and annotation of multiple representative strains is paramount in advancing our understanding of these human pathogens and in developing mass spectrometry databases. Due to the increasing use of MS in diagnostics of infectious diseases, this in turn translates to better diagnosis and treatment of the diseases caused by the pathogens and to alleviating their devastating burden on the human population. In this thesis, it is demonstrated that Pool-seq can be used to cost-effectively and accurately identify variations that exist among strains of these polymorphic bacteria. In addition, the thesis illustrates the utility of the tool developed to extend single-genome-based databases, which uses variants identified from Pool-seq experiments to improve the completeness of the databases and peptide/protein identifications.
Proteogenomics is an emerging field that combines genomics and proteomics to improve gene models and identify proteins. Technical advances in both fields increase the potential of this combined discipline for understanding how different organisms function. For example, today's inexpensive and fast sequencing technologies have made comprehensive sequencing of many different organisms possible, which naturally also improves the coverage of these organisms' reference proteomes. At the same time, advances in mass spectrometry have improved the quality and increased the depth of proteomic analyses. This makes it possible to validate predicted sequence regions (e.g. novel genes).
Sequencing of pooled individual DNA samples (Pool-seq) allows large numbers of samples to be sequenced very cost-effectively. It is a reliable method that can accurately identify variation between samples. Variants from Pool-seq experiments can be used to create variant protein databases and to improve the coverage of the protein databases used in mass spectrometry. This thesis investigated the efficiency of Pool-seq in identifying variants and estimating allele frequencies in strains of three β-hemolytic streptococci (GAS, GGS and GBS). In addition, a new software package written in Python ('PoolSeqProGen') was developed for creating variant protein databases from Pool-seq experiments. This is the first work in which Pool-seq was used to sequence a large number of streptococci and to assess the method's efficiency in genetically polymorphic bacteria. The 'PoolSeqProGen' tool is also the first and only available tool for creating proteogenomic databases from data produced by Pool-seq.
When developing mass spectrometry databases for organisms with open pangenomes, such as the β-hemolytic streptococci GAS, GBS and GGS, sequencing and annotating multiple representative strains is of paramount importance. The increasing use of mass spectrometry in diagnosing infectious diseases improves the diagnosis of the diseases these microbes cause and thereby also allows treatment to be better targeted. This thesis demonstrates that Pool-seq can be used cost-effectively and accurately to identify variation among polymorphic bacterial strains. In addition, we illustrate the usefulness of the tool developed to extend single-genome-based databases, which can improve database coverage and peptide and protein identification by using variants identified in Pool-seq experiments
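The allele-frequency estimation mentioned in this abstract can be illustrated with a minimal sketch: in a pooled sample, the frequency of an allele is commonly approximated by the fraction of reads that carry it. The function name `pooled_allele_frequencies` and the `min_depth` threshold are assumptions for illustration and are not taken from PoolSeqProGen or the thesis.

```python
def pooled_allele_frequencies(ref_count, alt_counts, min_depth=20):
    """Estimate allele frequencies at one site from pooled read counts.

    ref_count  -- number of reads supporting the reference allele
    alt_counts -- dict mapping each alternate allele to its read count
    min_depth  -- hypothetical threshold; sites with fewer reads return None
    """
    depth = ref_count + sum(alt_counts.values())
    if depth < min_depth:
        return None  # too few reads for a meaningful estimate
    return {allele: count / depth for allele, count in alt_counts.items()}


# 10 of 100 reads carry the alternate allele T.
freqs = pooled_allele_frequencies(90, {"T": 10})
```

The read fraction is only a proxy for the true frequency in the pool: unequal DNA contributions from individual strains and sequencing errors both add noise, which is why a depth filter of some kind is usually applied.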
Bioinformatics' approaches to detect genetic variation in whole genome sequencing data
Current genetic marker repositories are insufficient or entirely lacking for most farm animals. However, genetic markers are essential for developing research tools that facilitate the discovery of genetic factors contributing to disease resistance and to the overall welfare and performance of farm animals. By large-scale identification of Single Nucleotide Polymorphisms (SNPs) and Structural Variants (SVs), we aimed to contribute to the development of a repository of genetic variants for farm animals. For this purpose, bioinformatics data pipelines were designed and validated to address the challenge of cost-effective identification of genetic markers in DNA sequencing data, even in the absence of a fully sequenced reference genome. To find SNPs in pig, we analysed publicly available whole genome shotgun sequencing datasets by sequence alignment and clustering. Sequence clusters were assigned to genomic locations using publicly available BAC sequencing and BAC mapping data. Within the sequence clusters, thousands of SNPs were detected whose genomic location is roughly known. For turkey and duck, species that both lacked a sufficient sequence data repository for variant discovery, we applied next-generation sequencing (NGS) to a reduced genome representation of a pooled DNA sample. For turkey, a genome reference was reconstructed from our sequencing data and available public sequencing data, whereas for duck the reference genome constructed by an NGS project was used. SNPs obtained by our cost-effective SNP detection procedure still turned out to cover, at intervals, the whole turkey and duck genomes and are of sufficient quality to be used in genotyping studies. Allele frequencies obtained by genotyping animal panels with a subset of our SNPs correlated well with those observed during SNP detection. The availability of two external duck SNP datasets allowed for the construction of a subset of SNPs which we had in common with these sets.
Genotyping showed that this subset was of outstanding quality and can be used for benchmarking other SNPs that we identified within duck. Ongoing developments in NGS allowed for paired-end sequencing, an extension of sequencing analysis that provides information about which pairs of reads come from the outer ends of one sequenced DNA fragment. We applied this technique to a reduced genome representation of four chicken breeds to detect SVs. Paired-end reads were mapped to the chicken reference genome, and SVs were identified as abnormally aligned read pairs whose orientation or span size is discordant with the reference genome. SV detection parameters, to distinguish true structural variants from false positives, were designed and optimized by validating a small representative sample of SVs using PCR and traditional capillary sequencing. To conclude: we developed SNP repositories which fulfil a requirement for SNPs to perform linkage analysis, comparative genomics, QTL studies and ultimately GWA studies in a range of farm animals. We also took the first step in developing a repository for SVs in chicken, a relatively new genetic marker in animal sciences.
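The idea of flagging read pairs whose orientation or span is discordant with the reference can be sketched as follows. This is an illustrative classifier only, not the pipeline from the thesis: the `ReadPair` type, the category labels, and the span thresholds are assumptions, and it presumes a standard paired-end library in which a concordant pair has the left mate on the forward strand and the right mate on the reverse strand.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, min_span, max_span):
    """Label a mapped read pair as concordant or as an SV candidate.

    min_span and max_span bound the expected distance between mates;
    in practice they would be derived from the library's insert sizes.
    """
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal_candidate"
    # Order the two mates by mapping position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if left_strand == right_strand:
        return "inversion_candidate"           # both mates on the same strand
    if left_strand == "-" and right_strand == "+":
        return "tandem_duplication_candidate"  # mates point away from each other
    span = right_pos - left_pos
    if span > max_span:
        return "deletion_candidate"            # mates map further apart than expected
    if span < min_span:
        return "insertion_candidate"           # mates map closer than expected
    return "concordant"


example = classify_pair(ReadPair("chr1", 100, "+", "chr1", 5100, "-"), 200, 600)
```

A single discordant pair is weak evidence; callers typically require several pairs supporting the same event, which is one kind of parameter the abstract describes tuning against PCR-validated SVs.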
Genetic Factors that Contribute to the Pathogenesis of Amyotrophic Lateral Sclerosis
Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease for which there is no cure. The only treatment available extends survival by only a matter of months. There are over 20 genes that are known to cause ALS. Over half of the ALS cases with a family history of disease (FALS) can be explained by mutations in known ALS genes, with hexanucleotide repeat expansions in C9ORF72 accounting for 40% of families. However, roughly 90% of cases have no family history of disease (sporadic ALS or SALS), and a much smaller proportion (10%) of these cases can be explained by mutations in known ALS genes. Understanding the genetic factors that cause ALS or influence its progression will help us understand the cellular pathways involved in disease and identify potential therapeutic targets.
We used a pooled-sample sequencing approach to identify mutations in 17 ALS genes in a cohort of FALS and SALS patients to investigate the contribution of these genes to SALS, including the role of rare variants and the effect of mutations in multiple ALS genes in an individual. We identified potentially pathogenic mutations in 64.3% of familial and 27.8% of sporadic subjects. 3.8% of subjects had mutations in more than one ALS gene and these individuals on average had onset 10 years earlier than those with mutations in only one ALS gene (p=0.0046). There were no individual rare variants that were significantly associated with sporadic ALS, but rare variants in SOD1 were cumulatively more common in SALS subjects.
In addition, we investigated the genetic background and stability of C9ORF72 repeat expansions in ALS. The presence of a risk haplotype shared between all expansion carriers led to the prevailing idea of a founder expansion event; however, this shared haplotype also supports the hypothesis of a genetic background that is more prone to expansion. We identified a rare variant, rs147599399, on this genetic background that is present in some expansion carriers and some non-expansion carriers, indicating that the expansion arose on at least two separate occasions. This raises the possibility that C9ORF72 repeat expansions in sporadic ALS could be the result of de novo expansions on the risk haplotype. Furthermore, we showed that expansion carriers with the rs147599399 minor allele had longer survival than expansion carriers without the SNP (p=0.00047), indicating that the genetic background surrounding C9ORF72 influences the effects of the expansion.
We performed Southern blotting to explore the size and stability of C9ORF72 repeat expansions. There was a high degree of somatic instability and instability in transmissions between families. There was no difference between expansion sizes in symptomatic and asymptomatic expansion carriers in families, and there was no correlation between expansion size in any patient tissues and any clinical characteristics. These results need to be confirmed in a larger sample cohort, but suggest that expansion size alone doesn’t determine pathogenicity of C9ORF72 repeat expansions.
Lastly, we examined the candidate gene TREM2 as a risk factor for ALS. This gene is involved in regulation of microglial activity, which is a known component of ALS pathogenesis, and the rare variant p.R47H was recently associated with risk of Alzheimer’s disease. We found that the same p.R47H variant was significantly associated with ALS in our cohort and that expression of TREM2 was increased in ALS patients and SOD1 mutant mice compared to controls. A variant in the related gene TREML4 was marginally associated with ALS, but the effect of this variant is unknown. Mutations in the TREM genes provide a genetic link to the neuro-inflammatory component of ALS and suggest other genes involved in microglial activation are good candidates for novel variant identification
Uncovering rare genetic variants predisposing to coeliac disease
PhD
Coeliac disease is a common (1% prevalence) inflammatory disease of the small intestine, involving the role of tissue transglutaminase and HLA-DQ binding immuno-dominant wheat peptides. The disease is highly heritable, however, at most only 40% of this heritability is explained by HLA-DQ and risk variants from genome wide association and fine mapping studies. The hypothesis of the research in this thesis is that rare (minor allele frequency <0.5%) mutations of large effect size (odds ratios ~2–5) exist, especially in multiply affected pedigrees, which account for the missing heritability of disease. NimbleGen exome capture and Illumina GAIIx high throughput sequencing was performed in 75 coeliac disease individuals from 55 multiply affected families. Candidate genes were chosen from various analytical strategies: linkage, shared variants between multiple related subjects and gene burden tests for multiple potentially causal variants. Highly multiplexed amplicon sequencing, using Fluidigm technology, of all RefSeq exons from 24 candidate genes in 2,304 coeliac cases and 2,304 controls was performed to locate further rare variation. Gene burden tests on a highly stringent post quality control dataset identified no significant associations (P < 1 × 10^-3) at the resequenced candidate genes. The strategy of sequencing multiply affected families, and deep follow up of candidate genes, has not identified new disease risk mutations. Common variants (and other factors, e.g. environmental) may instead account for familial clustering in this common autoimmune disease
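A gene burden test of the kind mentioned in this abstract can be sketched by collapsing all rare variants in a gene into a single carrier/non-carrier indicator and comparing carrier counts between cases and controls. The example below uses a one-sided Fisher's exact test; this choice, the function name `burden_fisher_p`, and the counts in the usage line are assumptions for illustration, since the abstract does not state which burden statistic was used.

```python
from math import comb


def burden_fisher_p(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact p-value for an excess of carriers in cases.

    A carrier is an individual with at least one qualifying rare variant
    in the gene. The p-value is the hypergeometric probability of seeing
    case_carriers or more carriers among the cases, given the totals.
    """
    total_carriers = case_carriers + control_carriers
    n_total = n_cases + n_controls
    denominator = comb(n_total, total_carriers)
    upper = min(total_carriers, n_cases)
    numerator = sum(
        comb(n_cases, k) * comb(n_controls, total_carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return numerator / denominator


# Made-up counts: 3 of 3 cases carry a rare variant, 0 of 3 controls do.
p_value = burden_fisher_p(3, 3, 0, 3)  # 1 of 20 possible arrangements
```

With 24 candidate genes tested, a per-gene p-value would be compared against a threshold that accounts for multiple testing, such as the P < 1 × 10^-3 level quoted in the abstract.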
Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders
Unlike copy number variants (CNVs), inversions remain an underexplored genetic variation class. By integrating multiple genomic technologies, we discover 729 inversions in 41 human genomes. Approximately 85% of inversions <2 kbp form by twin-priming during L1 retrotransposition; 80% of the larger inversions are balanced and affect twice as many nucleotides as CNVs. Balanced inversions show an excess of common variants, and 72% are flanked by segmental duplications (SDs) or retrotransposons. Since flanking repeats promote non-allelic homologous recombination, we developed complementary approaches to identify recurrent inversion formation. We describe 40 recurrent inversions encompassing 0.6% of the genome, showing inversion rates up to 2.7 × 10(-4) per locus per generation. Recurrent inversions exhibit a sex-chromosomal bias and co-localize with genomic disorder critical regions. We propose that inversion recurrence results in an elevated number of heterozygous carriers and structural SD diversity, which increases mutability in the population and predisposes specific haplotypes to disease-causing CNVs
- …