505 research outputs found

    Systematic Inference of Copy-Number Genotypes from Personal Genome Sequencing Data Reveals Extensive Olfactory Receptor Gene Content Diversity

    Copy-number variations (CNVs) are widespread in the human genome, but comprehensive assignments of integer locus copy-numbers (i.e., copy-number genotypes) that, for example, enable discrimination of homozygous from heterozygous CNVs, have remained challenging. Here we present CopySeq, a novel computational approach with an underlying statistical framework that analyzes the depth-of-coverage of high-throughput DNA sequencing reads, and can incorporate paired-end and breakpoint junction analysis based CNV-analysis approaches, to infer locus copy-number genotypes. We benchmarked CopySeq by genotyping 500 chromosome 1 CNV regions in 150 personal genomes sequenced at low-coverage. The assessed copy-number genotypes were highly concordant with our performed qPCR experiments (Pearson correlation coefficient 0.94), and with the published results of two microarray platforms (95–99% concordance). We further demonstrated the utility of CopySeq for analyzing gene regions enriched for segmental duplications by comprehensively inferring copy-number genotypes in the CNV-enriched >800 olfactory receptor (OR) human gene and pseudogene loci. CopySeq revealed that OR loci display an extensive range of locus copy-numbers across individuals, with zero to two copies in some OR loci, and two to nine copies in others. Among genetic variants affecting OR loci we identified deleterious variants including CNVs and SNPs affecting ~15% and ~20% of the human OR gene repertoire, respectively, implying that genetic variants with a possible impact on smell perception are widespread. Finally, we found that for several OR loci the reference genome appears to represent a minor-frequency variant, implying a necessary revision of the OR repertoire for future functional studies. 
CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high-throughput sequencing
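
The depth-of-coverage idea mentioned in the abstract above can be illustrated with a deliberately naive sketch. This is not CopySeq's statistical framework, which the abstract does not specify; it only assumes that read depth scales linearly with copy number and that a reference depth for the normal two-copy state is known. The function name `estimate_copy_number` and its parameters are hypothetical.

```python
def estimate_copy_number(region_depth, diploid_depth, ploidy=2):
    """Naive integer copy-number estimate from read depth.

    Assumption (not taken from the abstract): depth is proportional to
    copy number, and `diploid_depth` is the mean depth observed in
    regions known to carry `ploidy` copies.
    """
    if diploid_depth <= 0:
        raise ValueError("diploid_depth must be positive")
    ratio = region_depth / diploid_depth
    # Round half up to the nearest non-negative integer copy number.
    return max(0, int(ploidy * ratio + 0.5))


if __name__ == "__main__":
    # A region covered at half the two-copy depth suggests a heterozygous deletion.
    print(estimate_copy_number(15.0, 30.0))
```

A real genotyper would also have to model sequencing noise, GC bias and mappability, which this sketch ignores.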

    WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing.

    Motivation: Copy number variations (CNVs) are a major source of genomic variability and are especially significant in cancer. Until recently, microarray technologies have been used to characterize CNVs in genomes. However, advances in next-generation sequencing technology offer significant opportunities to deduce copy number directly from genome sequencing data. Unfortunately, cancer genomes differ from normal genomes in several aspects that make them far less amenable to copy number detection. For example, cancer genomes are often aneuploid and an admixture of diploid/non-tumor cell fractions. Also, patient-derived xenograft models can be laden with mouse contamination that strongly affects accurate assignment of copy number. Hence, there is a need to develop analytical tools that can take into account cancer-specific parameters for detecting CNVs directly from genome sequencing data. Results: We have developed WaveCNV, a software package to identify copy number alterations by detecting breakpoints of CNVs using translation-invariant discrete wavelet transforms and assign digitized copy numbers to each event using next-generation sequencing data. We also assign alleles specifying the chromosomal ratio following duplication/loss. We verified copy number calls using both microarray (correlation coefficient 0.97) and quantitative polymerase chain reaction (correlation coefficient 0.94) and found them to be highly concordant. We demonstrate its utility in pancreatic primary and xenograft sequencing data. Availability and implementation: Source code and executables are available at https://github.com/WaveCNV. The segmentation algorithm is implemented in MATLAB, and copy number assignment is implemented [email protected] Supplementary information: Supplementary data are available at Bioinformatics online

    Copy number variations among silkworms


    The missense of smell: functional variability in the human odorant receptor repertoire.

    Humans have ~400 intact odorant receptors, but each individual has a unique set of genetic variations that lead to variation in olfactory perception. We used a heterologous assay to determine how often genetic polymorphisms in odorant receptors alter receptor function. We identified agonists for 18 odorant receptors and found that 63% of the odorant receptors we examined had polymorphisms that altered in vitro function. On average, two individuals have functional differences at over 30% of their odorant receptor alleles. To show that these in vitro results are relevant to olfactory perception, we verified that variations in OR10G4 genotype explain over 15% of the observed variation in perceived intensity and over 10% of the observed variation in perceived valence for the high-affinity in vitro agonist guaiacol but do not explain phenotype variation for the lower-affinity agonists vanillin and ethyl vanillin

    A Computational Framework Discovers New Copy Number Variants with Functional Importance

    Structural variants which cause changes in copy numbers constitute an important component of genomic variability. They account for 0.7% of genomic differences in two individual genomes, of which copy number variants (CNVs) are the largest component. A recent population-based CNV study revealed the need for better characterization of CNVs, especially the small ones (<500 bp). We propose a three-step computational framework (Identification of germline Changes in Copy Number or IgC2N) to discover and genotype germline CNVs. First, we detect candidate CNV loci by combining information across multiple samples without imposing restrictions on the number of coverage markers or on the variant size. Second, we fine-tune the detection of rare variants and infer the putative copy number classes for each locus. Last, for each variant we combine the relative distance between consecutive copy number classes with genetic information in a novel attempt to estimate the reference model bias. This computational approach is applied to genome-wide data from 1250 HapMap individuals. Novel variants were discovered and characterized in terms of size, minor allele frequency, type of polymorphism (gains, losses or both), and mechanism of formation. Using data generated for a subset of individuals by a 42 million marker platform, we validated the majority of the variants; the highest validation rate (66.7%) was for variants larger than 1 kb. Finally, we queried transcriptomic data from 129 individuals determined by RNA-sequencing as further validation and to assess the functional role of the new variants. We investigated the possible enrichment for regulatory effects of the variants and found that smaller variants (<1 kb) are more likely to regulate gene transcripts than larger variants (p-value = 2.04e-08). 
Our results support the validity of the computational framework to detect novel variants relevant to disease susceptibility studies and provide evidence of the importance of genetic variants in regulatory network studies

    Immunomic and transcriptomic profiling of the immune response to gluten exposure in celiac disease

    Celiac disease (CD) is an immune-mediated gastrointestinal disease that is precipitated by ingestion of dietary gluten, a protein found in wheat, barley and rye. The development of CD almost always requires genetic predisposition with the patients carrying either the DQ2 or DQ8 HLA-DQ alleles. As these genetic alleles are prevalent also in the healthy population, the main factors involved in CD pathogenesis, gluten and the genetic association, are necessary but not sufficient for CD development. Thus, the initial milieu of factors that lead to CD pathogenesis is still not completely understood, with current research suggesting other yet unidentified factors. The complexity of CD is also mirrored in its manifestation, with CD patients having symptoms ranging from gastrointestinal disorders to system-wide extra-intestinal manifestations, including skin and neurological conditions. It can also be asymptomatic. Consequently, CD misdiagnosis or late diagnosis is prevalent. Thus, further investigation of the immune response in CD is needed to better shed light on the intricacies of the cell and molecular changes involved, as well as to provide better diagnostic and therapeutic options. The advent and application of high throughput sequencing at the beginning of the last decade provided the opportunity to study the immune response in CD on an unprecedented scale. Particularly, with immune repertoire sequencing (RepSeq) and genome wide RNA sequencing (RNAseq), as well as the development of bioinformatics analysis methods, we considered the possibility of investigating the effect of gluten exposure in CD at a systemic level. The aim of this work was then to utilize RepSeq and RNAseq to characterize the global immunological and transcriptomic changes that occur in CD during in vivo gluten exposure. We also aimed to develop new computational methodologies for mining immune repertoire datasets to identify gluten associated T-cell receptor (TCR) clonotypes. 
At the beginning of our first study, which examined the global immune repertoire, there were only a few groundbreaking studies that had reported immunogenic gluten peptide-specific T-cell receptors in CD patients. However, as these studies used tetramer assays that allowed investigation of only a handful of gluten peptides at a time, they largely ignored the repertoire-wide immune response dynamics and the repertoire of T-cell receptors induced by gluten that may target not just gluten peptides but other antigens possibly relevant in CD. With study I, some of these shortcomings were addressed by using RepSeq to study the gluten-exposed global repertoire in the blood and gut of CD patients in an unbiased manner. The study showed that gluten exposure leads to increased TCR sharing in both blood and gut between unrelated CD patients, suggesting that the public component of the TCR immune response is important in CD. In addition, we identified particular TCR clonotypes that were induced by gluten exposure through the bioinformatics pipeline developed for differential abundance analysis in this study. The identified gluten-induced TCR clonotypes included novel as well as previously reported gluten-peptide binding TCRs, indicating that in spite of the immense diversity of the total immune repertoire, it was possible to utilize RepSeq to identify CD-relevant clonotypes computationally without necessarily knowing their targeted antigen. In study I, the limited sharing of TCRs across individuals necessitated the comparison of TCR abundances within an individual (across different time points) or across individuals, but only using the small set of public TCRs that were seen in multiple individuals. In study II, a comprehensive bioinformatic method that allows direct population-level comparison of RepSeq datasets in two conditions for the identification of both public and private condition-associated TCRs was developed. 
The method relies on the assumption that private TCRs that are specific to an antigen, for example a gluten peptide, are likely to have high similarity in their sequence to public TCRs targeting the same antigen and thus could be detected by proxy. It also assumes that such immune sequence components needed to mount an immune response to an antigen are likely to be shared across individuals, at least to a degree that may prove useful for computational identification. By dissecting the immune repertoire into clusters of TCRs with similar kmer composition and finding shared clusters of TCRs with similar kmer composition across individuals, the method facilitates the comparison of clonal abundances between condition groups and the identification of condition-relevant TCRs. The method was applied to CD RepSeq datasets from study I and successfully identified gluten-induced clonotypes, with TRBV-gene usage and positional amino acid usage patterns similar to known gluten-specific clonotypes. Overall, development of the method and its application to CD demonstrated that direct cross-individual comparison of immune repertoires for identification of disease-relevant TCRs was possible, paving the way for direct investigation of the TCR immune response at the population level, without necessarily knowing all the antigens targeted in an autoimmune disease like CD. In the final study, RNAseq was utilized to investigate the genome-wide transcriptional changes in the PBMC of CD patients, which showed that a short 3-day gluten exposure was enough to induce a distinct transcriptional profile in patients. Importantly, this study identified genes with persistently altered expression and biological pathways with persistently perturbed regulation in CD patient PBMC, despite a long period of treatment with a gluten-free diet. This study also suggested new candidate genes for known CD linked and/or associated genetic loci 19p13.11 and 21q22.3. 
In conclusion, this thesis developed new bioinformatic methods for the analysis of high throughput TCR immune repertoire datasets and the identification of condition-relevant clonotypes and applied the method to CD patient immune repertoires to identify gluten-induced clonotypes. The thesis also provides several new insights into the global immune and transcriptional signatures associated with gluten exposure in CD. The methods and findings in this thesis have potential future use in CD disease stratification, diagnosis, therapy, and monitoring.
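
The notion of "similar kmer composition" used in the abstract above can be made concrete with a small sketch. It is not the thesis method, whose clustering procedure is not described here; it only assumes that each receptor sequence is summarised by counts of its overlapping k-mers and that two sequences are compared by cosine similarity. The function names and the example sequences are made up for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Count the overlapping k-mers of a sequence (empty if len(seq) < k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count vectors, in [0, 1]."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical amino acid sequences, differing at a single position.
    first = kmer_counts("CASSLGQAYEQYF")
    second = kmer_counts("CASSLGQGYEQYF")
    print(cosine_similarity(first, second))
```

Sequences whose pairwise similarity exceeds a chosen threshold could then be grouped into clusters; the threshold and the clustering algorithm are further assumptions left open here.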

    Copy number variations in the gene space of Picea glauca

    Copy number variations (CNVs) are large genetic variations detected among the individuals of every multicellular organism examined so far. These variations have a considerable impact on gene structure and function and have been shown to be involved in the control of several phenotypic traits. In plants, the key genetic features of CNVs are still poorly understood and even less is known about CNVs in trees. The goals of this thesis were to i) develop an approach for the identification of CNVs in the gene space of the conifer tree Picea glauca, ii) estimate the rate of CNV generation genome-wide and iii) examine the transmission patterns of CNVs from one generation to the next. We used SNP-array raw intensity genotyping data for 3663 individuals belonging to 55 full-sib families to scan more than 14 000 genes for CNVs. Our findings show that CNVs affect a small proportion of the gene space and copy number variants detected in the progeny were either inherited or generated through de novo events. 
Our analyses show that copy number (CN) mutation rate estimates spanned at least three orders of magnitude, could reach high levels and varied for different genes, alleles and CNV classes. CN mutation rate was also correlated with gene expression levels and the relationship between mutation rate and gene expression was best explained within the frame of the drift-barrier hypothesis (DBH). With regard to CNV inheritance, our results show that most CNVs (70%) are transmitted from the parents in violation of Mendelian expectations. The majority of transmission distortions favored the one-copy allele and contributed to the rapid restoration of the two-copy genotype in the next generation. The observed distortion levels varied considerably and were influenced by parental, partner genotype and genetic background effects. We also identified instances where the loss of a gene copy was favored and subject to different types of selection pressures. This study shows that de novo mutations and transmission distortions of CNVs both contribute to the shaping of the standing genetic variation and play an important role in species adaptation and evolution

    The development of computational methods for large-scale comparisons and analyses of genome evolution

    The last four decades have seen the development of a number of experimental methods for the deduction of the whole genome sequences of an ever-increasing number of organisms. These sequences have, in the first instance, allowed their investigators the opportunity to examine the molecular primary structure of areas of scientific interest, but with the increased sampling of organisms across the phylogenetic tree and the improved quality and coverage of genome sequences and their associated annotations, the opportunity to undertake detailed comparisons both within and between taxonomic groups has presented itself. The work described in this thesis details the application of comparative bioinformatics analyses on inter- and intra-genomic datasets, to elucidate those genomic changes, which may underlie organismal adaptations and contribute to changes in the complexity of genome content and structure over time. The results contained herein demonstrate the power and flexibility of the comparative approach, utilising whole genome data, to elucidate the answers to some of the most pressing questions in the biological sciences today. As the volume of genomic data increases, both as a result of increased sampling of the tree of life and due to an increase in the quality and throughput of the sequencing methods, it has become clear that there is a necessity for computational analyses of these data. Manual analysis of this volume of data, which can extend beyond petabytes of storage space, is now impossible. Automated computational pipelines are therefore required to retrieve, categorise and analyse these data. Chapter two discusses the development of a computational pipeline named the Genome Comparison and Analysis Toolkit (GCAT). The pipeline was developed using the Perl programming language and is tightly integrated with the Ensembl Perl API allowing for the retrieval and analyses of their rich genomic resources. 
In the first instance the pipeline was tested for its robustness by retrieving and describing various components of genomic architecture across a number of taxonomic groups. Additionally, the need for programmatically independent means of accessing data and in particular the need for Semantic Web based protocols and tools for the sharing of genomics resources is highlighted. This is not just for the requirements of researchers, but for improved communication and sharing between computational infrastructure. A prototype Ensembl REST web service was developed in collaboration with the European Bioinformatics Institute (EBI) to provide a means of accessing Ensembl's genomic data without having to rely on their Perl API. A comparison of the runtime and memory usage of the Ensembl Perl API and prototype REST API was made relative to baseline raw SQL queries, which highlights the overheads inherent in building wrappers around the SQL queries. Differences in the efficiency of the approaches were highlighted, and the importance of investing in the development of Semantic Web technologies as a tool to improve access to data for the wider scientific community is discussed. Data highlighted in chapter two led to the identification of relative differences in the intron structure of a number of organisms including teleost fish. Chapter three encompasses a published, peer-reviewed study. Inter-genomic comparisons were undertaken utilising the 5 available teleost genome sequences in order to examine and describe their intron content. The number and sizes of introns were compared across these fish and a frequency distribution of intron size was produced that identified a novel expansion in the Zebrafish lineage of introns in the size range of approximately 500-2,000 bp. Further hypothesis-driven analyses of the introns across the whole distribution of intron sizes identified that the majority, but not all of the introns were largely comprised of repetitive elements. 
It was concluded that the introns in the Zebrafish peak were likely the result of an ancient expansion of repetitive elements that had since degraded beyond the ability of computational algorithms to identify them. Additional sampling throughout the teleost fish lineage will allow for more focused phylogenetically driven analyses to be undertaken in the future. In chapter four phylogenetic comparative analyses of gene duplications were undertaken across primate and rodent taxonomic groups with the intention of identifying significantly expanded or contracted gene families. Changes in the size of gene families may indicate adaptive evolution. A larger number of expansions, relative to time since common ancestor, were identified in the branch leading to modern humans than in any other primate species. Due to the unique nature of the human data in terms of quantity and quality of annotation, additional analyses were undertaken to determine whether the expansions were methodological artefacts or real biological changes. Novel approaches were developed to test the validity of the data including comparisons to other highly annotated genomes. No similar expansion was seen in mouse when comparing with rodent data, though, as assemblies and annotations were updated, there were differences in the number of significant changes, which brings into question the reliability of the underlying assembly and annotation data. This emphasises the importance of an understanding that computational predictions, in the absence of supporting evidence, may be unlikely to represent the actual genomic structure, and instead be more an artefact of the software parameter space. In particular, significant shortcomings are highlighted due to the assumptions and parameters of the models used by the CAFE gene family analysis software. 
We must bear in mind that genome assemblies and annotations are hypotheses that themselves need to be questioned and subjected to robust controls to increase the confidence in any conclusions that can be drawn from them. In addition, functional genomics analyses were undertaken to identify the role of significantly changed genes and gene families in primates, testing against a hypothesis that would see the majority of changes involving immune, sensory or reproductive genes. Gene Ontology (GO) annotations were retrieved for these data, which enabled highlighting the broad GO groupings and more specific functional classifications of these data. The results showed that the majority of gene expansions were in families that may have arisen due to adaptation, or were maintained due to their necessary involvement in developmental and metabolic processes. Comparisons were made to previously published studies to determine whether the Ensembl functional annotations were supported by the de-novo analyses undertaken in those studies. The majority were not, with only a small number of previously identified functional annotations being present in the most recent Ensembl releases. The impact of gene family evolution on intron evolution was explored in chapter five, by analysing gene family data and intron characteristics across the genomes of 61 vertebrate species. General descriptive statistics and visualisations were produced, along with tests for correlation between change in gene family size and the number, size and density of their associated introns. There was shown to be very little impact of change in gene family size on the underlying intron evolution. Other, non-family effects were therefore considered. These analyses showed that introns were restricted to euchromatic regions, with heterochromatic regions such as the centromeres and telomeres being largely devoid of any such features. 
Spatial mechanisms such as recombination, GC-bias across GC-rich isochores and biased gene conversion were thus proposed to play a greater role, though this depends largely on the population genetic and life history traits of the organisms involved. Additional population-level sequencing and comparative analyses across a divergent group of species with available recombination maps and life history data would be a useful future direction in understanding the processes involved.

    Study of Genomic Copy Number Variation in Equine Health and Disease

    This is a study of copy number variations (CNVs) in the horse genome to gain knowledge about the role of CNVs in equine biology, and their contribution to complex diseases and disorders. We constructed a 400K whole-genome tiling array and applied it for the discovery of CNVs in 38 normal horses of 16 diverse breeds, and the Przewalski horse. Altogether, 258 CNV regions (CNVRs) were identified across all autosomes, chrX, and chrUn. The CNVRs comprised 1.3% of the horse genome, with chr12 being most enriched. American Miniature Horses had the highest and American Quarter Horses the lowest number of CNVs in relation to Thoroughbred references. The Przewalski horse was similar to native ponies and draft breeds. About 20% of CNVRs were intergenic, while 80% involved 750 annotated genes with molecular functions predominantly in sensory perception, immunity, and reproduction. The findings were integrated with previous CNV studies in the horse to generate a composite genome-wide dataset of 1476 CNVRs. Of these, 301 CNVRs were shared between studies, while 1174 were novel and require further validation. Integrated data revealed that only 41 out of over 400 breeds of the domestic horse have been analyzed for CNVs, whereas this study added 11 new breeds. The composite CNV dataset served as a foundation for the discovery of variants contributing to Recurrent Airway Obstruction (RAO) and XY disorders of sexual development (DSDs), such as cryptorchidism and XY sex reversal. In 16 RAO-affected horses, 363 CNVRs were identified, of which 31 were novel and not found in healthy horses. A deletion in SPI2 and SERPINA1 was studied in detail because the genes are involved in respiratory diseases in humans. In horses with XY DSDs, over 50 novel CNVRs were identified, including deletions of functional interest in the pseudoautosomal region and the ATRX gene. 
A potentially causative homozygous deletion in chr29 disrupting AKR1C genes with functions in sex hormone metabolism was shared between a cryptorchid and two sex reversal horses. The findings improved the knowledge about CNVs in horses, in health and disease, and generated resources for future studies.