736 research outputs found

    Developing tools for making inferences from genomic data

    The central question that the King lab seeks to answer is, "What happens on the genomic level when phenotypes evolve?" To this end, the core focus of my work has been on developing software and analytical methods that help other researchers working in this space answer their questions of interest. ... In the second chapter, we describe the problem of inferring ancestral haplotype frequencies from pooled sequence data. We discuss five existing approaches to this problem and a new genetic-algorithm-based approach that we designed. We benchmark a method developed by Burke et al. and compare it to the genetic algorithm by evaluating the performance of each on a simulated dataset. We find that the new method outperforms the existing one. By Paul Petrowski. Includes bibliographical references.

    Application of Pool-seq for variation detection and proteogenomic database creation in β-hemolytic streptococci.

    Proteogenomics is an emerging field that combines genomic (transcriptomic) and proteomic data with the aim of improving gene models and the identification of proteins. Technological advances in each domain increase the potential of the field to further our understanding of organisms. For instance, current low-cost, fast sequencing technologies have made it possible to sequence multiple representative samples of organisms, thus improving the comprehensiveness of the organisms’ reference proteomes. At the same time, improvements in mass spectrometry techniques have led to an increase in the quality and quantity of proteomics data produced, which are used to update the annotation of coding sequences in genomes. Sequencing of pooled individual DNAs (Pool-seq) is one method for sequencing large numbers of samples cost-effectively. It is a robust method that can accurately identify variations that exist between samples. As with other proteogenomics methods, such as sample-specific databases derived from RNA-seq data, the variants from Pool-seq experiments can be used to create variant protein databases and improve the completeness of the protein reference databases used in mass spectrometry (MS)-based proteomics analysis. In this thesis work, the efficiency of Pool-seq in identifying variants and estimating allele frequencies from strains of three β-hemolytic bacteria (GAS, GGS and GBS) is investigated. Moreover, a novel Python package (‘PoolSeqProGen’) for creating variant protein databases from Pool-seq experiments was developed. To our knowledge, this was the first work to use Pool-seq for sequencing large numbers of β-hemolytic bacteria and to assess its efficiency on such genetically polymorphic bacteria. The ‘PoolSeqProGen’ tool is also the first and only tool available to create proteogenomic databases from Pool-seq data.
For organisms such as the β-hemolytic bacteria GAS, GBS and GGS that have open pangenomes, the sequencing and annotation of multiple representative strains is paramount in advancing our understanding of these human pathogens and in developing mass spectrometry databases. Given the increasing use of MS in the diagnostics of infectious diseases, this in turn translates to better diagnosis and treatment of the diseases caused by these pathogens and to alleviating their devastating burden on the human population. In this thesis, it is demonstrated that Pool-seq can be used to cost-effectively and accurately identify variations that exist among strains of these polymorphic bacteria. In addition, the thesis illustrates the utility of the developed tool for extending single-genome-based databases with variants identified from Pool-seq experiments, thereby improving the completeness of the databases and peptide/protein identification.

    Bioinformatics' approaches to detect genetic variation in whole genome sequencing data

    Current genetic marker repositories are insufficient or completely lacking for most farm animals. However, genetic markers are essential for developing research tools that facilitate the discovery of genetic factors contributing to disease resistance and to the overall welfare and performance of farm animals. By large-scale identification of Single Nucleotide Polymorphisms (SNPs) and Structural Variants (SVs), we aimed to contribute to the development of a repository of genetic variants for farm animals. For this purpose, bioinformatics data pipelines were designed and validated to address the challenge of cost-effective identification of genetic markers in DNA sequencing data, even in the absence of a fully sequenced reference genome. To find SNPs in pig, we analysed publicly available whole genome shotgun sequencing datasets by sequence alignment and clustering. Sequence clusters were assigned to genomic locations using publicly available BAC sequencing and BAC mapping data. Within the sequence clusters, thousands of SNPs were detected whose genomic location is roughly known. For turkey and duck, species that both lacked a sufficient sequence data repository for variant discovery, we applied next-generation sequencing (NGS) to a reduced genome representation of a pooled DNA sample. For turkey, a genome reference was reconstructed from our sequencing data and available public sequencing data, whereas for duck the reference genome constructed by an NGS project was used. SNPs obtained by our cost-effective SNP detection procedure still turned out to cover, at intervals, the whole turkey and duck genomes and are of sufficient quality to be used in genotyping studies. Allele frequencies obtained by genotyping animal panels with a subset of our SNPs correlated well with those observed during SNP detection. The availability of two external duck SNP datasets allowed for the construction of a subset of SNPs that we had in common with these sets.
Genotyping showed that this subset was of outstanding quality and can be used for benchmarking the other SNPs that we identified in duck. Ongoing developments in NGS allowed for paired-end sequencing, an extension of sequencing analysis that provides information about which pairs of reads come from the outer ends of one sequenced DNA fragment. We applied this technique to a reduced genome representation of four chicken breeds to detect SVs. Paired-end reads were mapped to the chicken reference genome, and SVs were identified from abnormally aligned read pairs whose orientation or span size is discordant with the reference genome. SV detection parameters, used to distinguish true structural variants from false positives, were designed and optimized by validating a small representative sample of SVs using PCR and traditional capillary sequencing. To conclude: we developed SNP repositories that fulfil the requirement for SNPs to perform linkage analysis, comparative genomics, QTL studies and ultimately GWA studies in a range of farm animals. We also took the first step in developing a repository for SVs in chicken, a relatively new genetic marker in animal sciences.

    Genetic Factors that Contribute to the Pathogenesis of Amyotrophic Lateral Sclerosis

    Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease for which there is no cure. The only treatment available extends survival by only a matter of months. There are over 20 genes that are known to cause ALS. Over half of the ALS cases with a family history of disease (FALS) can be explained by mutations in known ALS genes, with hexanucleotide repeat expansions in C9ORF72 accounting for 40% of families. However, roughly 90% of cases have no family history of disease (sporadic ALS or SALS), and a much smaller proportion (10%) of these cases can be explained by mutations in known ALS genes. Understanding the genetic factors that cause ALS or influence its progression will help us understand the cellular pathways involved in disease and identify potential therapeutic targets. We used a pooled-sample sequencing approach to identify mutations in 17 ALS genes in a cohort of FALS and SALS patients to investigate the contribution of these genes to SALS, including the role of rare variants and the effect of mutations in multiple ALS genes in an individual. We identified potentially pathogenic mutations in 64.3% of familial and 27.8% of sporadic subjects. 3.8% of subjects had mutations in more than one ALS gene, and these individuals on average had onset 10 years earlier than those with mutations in only one ALS gene (p=0.0046). There were no individual rare variants that were significantly associated with sporadic ALS, but rare variants in SOD1 were cumulatively more common in SALS subjects. In addition, we investigated the genetic background and stability of C9ORF72 repeat expansions in ALS. The presence of a risk haplotype shared between all expansion carriers led to the prevailing idea of a founder expansion event; however, this shared haplotype also supports the hypothesis of a genetic background that is more prone to expansion.
We identified a rare variant, rs147599399, on this genetic background that is present in some expansion carriers and some non-expansion carriers, indicating that the expansion arose on at least two separate occasions. This raises the possibility that C9ORF72 repeat expansions in sporadic ALS could be the result of de novo expansions on the risk haplotype. Furthermore, we showed that expansion carriers with the rs147599399 minor allele had longer survival than expansion carriers without the SNP (p=0.00047), indicating that the genetic background surrounding C9ORF72 influences the effects of the expansion. We performed Southern blotting to explore the size and stability of C9ORF72 repeat expansions. There was a high degree of somatic instability and instability in transmissions between families. There was no difference between expansion sizes in symptomatic and asymptomatic expansion carriers in families, and there was no correlation between expansion size in any patient tissues and any clinical characteristics. These results need to be confirmed in a larger sample cohort, but they suggest that expansion size alone does not determine the pathogenicity of C9ORF72 repeat expansions. Lastly, we examined the candidate gene TREM2 as a risk factor for ALS. This gene is involved in the regulation of microglial activity, which is a known component of ALS pathogenesis, and the rare variant p.R47H was recently associated with risk of Alzheimer’s disease. We found that the same p.R47H variant was significantly associated with ALS in our cohort and that expression of TREM2 was increased in ALS patients and SOD1 mutant mice compared to controls. A variant in the related gene TREML4 was marginally associated with ALS, but the effect of this variant is unknown. Mutations in the TREM genes provide a genetic link to the neuroinflammatory component of ALS and suggest that other genes involved in microglial activation are good candidates for novel variant identification.

    Uncovering rare genetic variants predisposing to coeliac disease

    PhD. Coeliac disease is a common (1% prevalence) inflammatory disease of the small intestine, involving tissue transglutaminase and HLA-DQ binding of immunodominant wheat peptides. The disease is highly heritable; however, at most 40% of this heritability is explained by HLA-DQ and risk variants from genome-wide association and fine-mapping studies. The hypothesis of the research in this thesis is that rare (minor allele frequency <0.5%) mutations of large effect size (odds ratios ~2–5) exist, especially in multiply affected pedigrees, and account for the missing heritability of the disease. NimbleGen exome capture and Illumina GAIIx high-throughput sequencing were performed in 75 individuals with coeliac disease from 55 multiply affected families. Candidate genes were chosen using several analytical strategies: linkage, variants shared between multiple related subjects, and gene burden tests for multiple potentially causal variants. Highly multiplexed amplicon sequencing, using Fluidigm technology, of all RefSeq exons from 24 candidate genes in 2,304 coeliac cases and 2,304 controls was performed to locate further rare variation. Gene burden tests on a highly stringent post-quality-control dataset identified no significant associations (P < 1 × 10⁻³) at the resequenced candidate genes. The strategy of sequencing multiply affected families, with deep follow-up of candidate genes, has not identified new disease risk mutations. Common variants (and other factors, e.g. environmental) may instead account for familial clustering in this common autoimmune disease.

    Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders

    Unlike copy number variants (CNVs), inversions remain an underexplored genetic variation class. By integrating multiple genomic technologies, we discover 729 inversions in 41 human genomes. Approximately 85% of inversions <2 kbp form by twin-priming during L1 retrotransposition; 80% of the larger inversions are balanced and affect twice as many nucleotides as CNVs. Balanced inversions show an excess of common variants, and 72% are flanked by segmental duplications (SDs) or retrotransposons. Since flanking repeats promote non-allelic homologous recombination, we developed complementary approaches to identify recurrent inversion formation. We describe 40 recurrent inversions encompassing 0.6% of the genome, showing inversion rates up to 2.7 × 10⁻⁴ per locus per generation. Recurrent inversions exhibit a sex-chromosomal bias and co-localize with genomic disorder critical regions. We propose that inversion recurrence results in an elevated number of heterozygous carriers and structural SD diversity, which increases mutability in the population and predisposes specific haplotypes to disease-causing CNVs.