9 research outputs found

    SNPFile – A software library and file format for large scale association mapping and population genetics studies

    Background: High-throughput genotyping technology has enabled cost-effective typing of thousands of individuals at hundreds of thousands of markers for use in genome-wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to store and manipulate the data efficiently. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions. Results: We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods. Conclusion: The new file format has been very useful for our own studies, where it has significantly reduced the informatics burden of keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation of the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. This is somewhat alleviated by a scripting interface that makes it easy to write converters to and from the format.
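As a rough illustration of why a binary layout is more compact than text, the sketch below packs biallelic genotypes at 2 bits each. This is not the SNPFile format, whose on-disk layout the abstract does not describe; the encoding and the function names `pack_genotypes` and `unpack_genotypes` are assumptions made purely for illustration.

```python
# Hypothetical 2-bit encoding (an assumption, not the SNPFile layout):
# 0, 1, 2 = copies of the minor allele, 3 = missing genotype.

def pack_genotypes(genotypes):
    """Pack genotype codes (0-3) into bytes, four genotypes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        # Each genotype occupies two bits; the first sits in the lowest bits.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover the first `count` genotype codes from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


genotypes = [0, 1, 2, 3, 2, 1]
packed = pack_genotypes(genotypes)
restored = unpack_genotypes(packed, len(genotypes))
```

Under this assumed encoding, six genotypes fit in two bytes, whereas a text file would typically spend at least one character plus a separator on each.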

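    Use of linkage and association methods with large families and a continuous variable: the heritability of musical aptitude as a case study (Kytkentä- ja assosiaatiomenetelmien käyttö suurten perheiden ja jatkuvan muuttujan kanssa : tutkimustapauksena musikaalisuuden perinnöllisyys)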

    Genome-wide linkage and association methods are used to map genes affecting traits with a genetic predisposition. In this thesis, I compare methods suitable for quantitative trait mapping in complex, extended pedigrees. As a case study, a gene-mapping study of musical aptitude is performed with these methods. Linkage analysis methods were developed for family studies, but only a few are suitable for extended families with a quantitative trait. Three linkage programs were successfully applied to such data in this study: SOLAR, JPSGCS and KELVIN. The three programs are based on different methods, so the same calculations are not repeated: SOLAR is based on the variance components method, JPSGCS on a graphical method and KELVIN on a Bayesian method. Association analysis is also difficult to implement for large pedigrees, because it is best suited to case-control data. Fortunately, the methods have also been extended to family-based studies. Here, a genomic control method, as implemented in the GenABEL program, was used to correct for familial relationships. The method estimates the relatedness between individuals from the concordance of their genome-wide genotypes, and the association tests are then corrected for the estimated relatedness. Musical aptitude is understood here as the ability to perceive the melody, harmony and rhythm of music, and to recognize structures in a set of sounds. These abilities were tested with Carl Seashore's tests for pitch and time and Kai Karma's test for auditory structuring. The data consist of 107 pedigrees and 93 sporadic subjects, 915 individuals in total. Each family includes 2 to 50 individuals. These individuals were genotyped with a SNP chip for over 700,000 SNPs. The linkage analyses revealed several promising loci for musical aptitude.
The best result was located at 4q12 and was found with all three linkage programs. Most of the other results could also be identified with multiple programs, although some differences occurred. However, association analysis produced no result at the genome-wide significance level, probably because the sample size was too small.
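The correction step can be illustrated with the textbook genomic-control inflation factor: the median of the observed 1-df chi-square association statistics divided by the expected median under the null (about 0.4549), with each statistic then divided by that factor. This is a minimal sketch of the general technique, not the thesis's or GenABEL's actual code; the function name `genomic_control` and the input values are invented for illustration.

```python
from statistics import median

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549


def genomic_control(chi2_stats):
    """Return the inflation factor lambda and the deflated test statistics.

    Lambda is floored at 1.0 so that statistics are never inflated,
    which is the usual convention for genomic control.
    """
    lam = max(1.0, median(chi2_stats) / CHI2_1DF_MEDIAN)
    return lam, [x / lam for x in chi2_stats]


# Invented example statistics, one per SNP.
lam, corrected = genomic_control([0.2, 0.9098, 1.5, 3.0, 0.5])
```

With these made-up inputs the median statistic is twice the null expectation, so every statistic is halved before p-values would be computed.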

    Development of a tool for copy number analysis of cancer genomes using high throughput sequencing data

    Genomic copy number alterations (CNA) and loss of heterozygosity (LOH) are two types of genomic instability associated with cancer. Acquisition of these instabilities affects the expression level of oncogenes and tumor suppressor genes, so accurate detection of these abnormalities is a crucial step in identifying novel oncogenes and tumor suppressor genes. Whole-genome sequencing of tumor tissues has opened new opportunities for detecting and characterizing such aberrations in tumor samples. In this work, a fast tool for the identification of CNAs and copy-neutral LOH in tumor samples using whole-genome sequencing data was developed. The tool segments the genome by analyzing the read-depth and B-allele fraction profiles using a double sliding window method. It requires a matched normal sample to correct for biases such as GC content and mappability and to discriminate somatic from germline events. The tool was evaluated on both simulated and real whole-genome sequencing data against competing, state-of-the-art tools to demonstrate its accuracy. The tool, written in the Python programming language, is fast and performs segmentation of a whole genome in less than two minutes.
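One plausible reading of a "double sliding window" is two adjacent windows moved along the tumor/normal log2 read-depth ratio, with a breakpoint proposed where their means differ most. The sketch below follows that reading only; the abstract does not specify the tool's algorithm, and the function names, window size and depth values are assumptions.

```python
from math import log2
from statistics import mean


def log2_ratios(tumor_depth, normal_depth):
    """Per-bin log2 ratio of tumor to matched-normal read depth."""
    return [log2(t / n) for t, n in zip(tumor_depth, normal_depth)]


def best_breakpoint(profile, window=3):
    """Slide two adjacent windows along the profile.

    Returns (index, difference) for the position where the mean of the
    right window differs most from the mean of the left window.
    """
    best_index, best_diff = None, 0.0
    for i in range(window, len(profile) - window + 1):
        left = mean(profile[i - window:i])
        right = mean(profile[i:i + window])
        diff = abs(right - left)
        if diff > best_diff:
            best_index, best_diff = i, diff
    return best_index, best_diff


# Invented depths: the last four bins have doubled coverage in the tumor.
tumor = [30, 30, 30, 30, 60, 60, 60, 60]
normal = [30] * 8
index, diff = best_breakpoint(log2_ratios(tumor, normal))
```

A real segmenter would also use the B-allele fraction profile and would report every significant breakpoint rather than only the strongest one.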

    Introgression patterns in Scottish blue mussel (Mytilus edulis) populations

    Background: The blue mussel, Mytilus edulis L., is an important contributor to the shellfish sector of Scottish aquaculture, with 7,270 tonnes worth £8.8 million being produced for the year 2015. Since 2010, production values have fluctuated as a result of inconsistent spat settlement, several business closures, and heightened levels of marine toxins in some areas. On Scotland’s west coast, some farms (most notably Loch Etive) have suffered production losses from the appearance of non-marketable mussels with particularly fragile shells and poor quality meat. Recent research has demonstrated that these undesirable traits have a genetic factor, linked to the presence of a non-native but related species Mytilus trossulus (Gould, 1850) and often its hybrids with the native M. edulis. M. trossulus has been classed as a commercially damaging species under Scottish law, but there is insufficient data on hybridisation and introgression patterns in Scottish mussel populations to evaluate any possible impacts this could have on production. Existing research has focused on single locus genotyping to identify Mytilus spp. and their hybrids in Scotland. By instead utilising multilocus genotyping, introgression could be identified and a better understanding of population structure could be gained, with implications for management to maintain productivity and profitability. The aim of the research presented here was to develop and validate a suite of new species diagnostic markers for multilocus genotyping of field populations of Scottish mussels, thereby establishing a more complete picture of the taxonomic relationships between species than previous studies have permitted. Results: Analysis of SNPs identified with RADseq confirmed the presence of three genetically distinct Mytilus species in Scotland: M. edulis, M. galloprovincialis and M. trossulus. 
RADseq and KASP genotyping technology successfully identified and validated a suite of 12 highly robust diagnostic SNP markers for multilocus genotyping of Mytilus mussel populations. These markers permitted more comprehensive genotyping than previous studies had, allowing presumed pure species individuals to be distinguished from first generation (F1) hybrids and introgressed (FX) genotypes in reference populations, and subsequently presented the possibility of exploring introgression in a wider scale study. Multilocus genotyping of mussel populations from around Scotland revealed widespread introgression of M. edulis with both M. galloprovincialis and M. trossulus. No pure M. galloprovincialis was identified and pure M. trossulus was restricted to a single site in Loch Etive, possibly part of a relict population. F1 hybrids between M. edulis and M. trossulus were identified in Loch Etive and in Loch Fyne on the west coast. This was evidence of ongoing hybridisation and suggested an active hybrid zone existed in Scotland, something that previous single locus genotyping studies had not acknowledged. A link between shell fragility and M. trossulus introgression was recognised at a single site outside of Loch Etive, but this was not apparent anywhere else and the actual causes of shell fragility remain unevaluated. There was a clear difference between the genetics of most farmed stock and wild populations, which indicated an anthropogenic effect on introgression and subsequent species composition, and had implications for future farm site selection and broodstock sourcing. Temporal species composition in Loch Etive differed over a short time period, but high proportions of M. trossulus alleles were observable some 25 months after a major fallowing event had taken place. Pure M. trossulus was also identifiable, which was consistent with the presence of an established population of M. trossulus existing in this area. 
Conclusion: Multilocus genotyping has produced a more in-depth picture of species diversity in Scottish mussel populations. SNP assays revealed widespread introgression between three genetically distinct species (M. edulis, M. galloprovincialis and M. trossulus) and furthermore showed that, to date, single locus genotyping has overestimated the abundance of pure Mytilus mussels in Scottish waters. However, this hitherto unidentified genetic complexity does not appear disadvantageous to mussel production, despite the prevalence of M. trossulus introgression among farmed populations, and it is somewhat unlikely that genetics are the sole cause of undesirable shell characteristics among Mytilus spp. mussels.
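To show why multilocus genotyping can separate pure individuals, F1 hybrids and introgressed genotypes where a single locus cannot, here is a simplified sketch for two species. It assumes fully diagnostic markers, with each genotype coded as the number of alleles from the second species (0, 1 or 2); the function name `classify`, the labels and the decision rules are illustrative assumptions, not the thesis's method.

```python
def classify(allele_counts):
    """Classify one individual from diagnostic-SNP genotypes.

    allele_counts: per-locus count (0, 1 or 2) of alleles from species B.
    Returns (label, hybrid_index, heterozygosity).
    """
    n = len(allele_counts)
    # Proportion of all alleles that come from species B.
    hybrid_index = sum(allele_counts) / (2 * n)
    # Proportion of loci carrying one allele from each species.
    heterozygosity = sum(1 for c in allele_counts if c == 1) / n
    if hybrid_index == 0.0:
        label = "pure species A"
    elif hybrid_index == 1.0:
        label = "pure species B"
    elif heterozygosity == 1.0:
        label = "F1 hybrid"  # heterozygous at every diagnostic locus
    else:
        label = "introgressed"
    return label, hybrid_index, heterozygosity


pure_a = classify([0] * 12)
f1 = classify([1] * 12)
backcross = classify([0] * 10 + [1, 1])
```

The third individual would look pure at any of the ten homozygous loci taken alone, which is how single locus genotyping can overestimate the abundance of pure individuals.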

    Computational identification of synonymous SNPs in the human genome and their potential role in disease

    The potential phenotypic effects of synonymous SNPs (sSNPs) have long been overlooked. Although several sSNPs are no longer thought to be silent, no one has identified which sSNPs may contribute to phenotypic variation on a genome-wide scale. sSNPs that cause a change in codon-usage frequency or mRNA secondary structures may alter translational and protein folding kinetics. In addition, sSNPs that alter splice-site consensus sequences may cause aberrant splicing, which could change the protein product. An sSNP that contributes to any of these molecular mechanisms may thus alter protein structure and function. To computationally identify sSNPs with a potential impact, SynSNP was created. SynSNP is a text-based tool written in Python. All sSNPs published within dbSNP are first identified. SynSNP uses established bioinformatics tools to determine which of the sSNPs may potentially result in a molecular effect. The potentially functional sSNPs are then assessed to determine whether any have previously been associated with a trait or disease in genome-wide association studies (GWAS) and/or occur within genes known to be associated with disease in OMIM (Online Mendelian Inheritance in Man). Of the 90,102 identified sSNPs, 21,086 (23.4%) were predicted to potentially have a functional impact, through one or more of the three molecular mechanisms investigated. Of the sSNPs predicted to potentially have a functional impact, 14 (0.07%) had previously been associated with a trait or disease in GWAS. A subset of 4,057 (19.2%) of the potentially functional sSNPs were within genes known to be associated with disease in OMIM. Only six (0.03%) of the potentially functional sSNPs had previously been associated with a trait or disease in GWAS and occurred within genes known to be associated with disease in OMIM. SynSNP could be developed further to aid the discovery of more sSNPs with a potential functional impact.
A significant proportion of sSNPs may have a functional impact and their potential role in disease should therefore not be underestimated or neglected.
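The percentages quoted in this abstract can be checked directly from its counts. The short calculation below uses only the figures stated above; the helper name `pct` is introduced here for illustration.

```python
def pct(part, whole, digits):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


total_ssnps = 90_102   # all sSNPs identified
functional = 21_086    # predicted to have a potential functional impact
in_gwas = 14           # functional sSNPs with a prior GWAS association
in_omim = 4_057        # functional sSNPs within OMIM disease genes
in_both = 6            # functional sSNPs meeting both criteria

share_functional = pct(functional, total_ssnps, 1)
share_gwas = pct(in_gwas, functional, 2)
share_omim = pct(in_omim, functional, 1)
share_both = pct(in_both, functional, 2)
```

Note that the last three percentages are taken relative to the 21,086 potentially functional sSNPs, not the full set of 90,102.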

    Population-Sequencing as a Biomarker of Burkholderia mallei and Burkholderia pseudomallei Evolution through Microbial Forensic Analysis

    Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples depends on accurate measurement of genetic variation between isolates, with resolution down to strain level. The sensitivity of a single biomarker is often augmented by using multiple biomarkers or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome so that the genome itself serves as the biomarker for identification. However, genome variation and population diversity complicate the use of direct sequencing, as do differences caused by sample preparation protocols, including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness, and sought to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate the use of sequencing to provide genetic characterization of populations: direct sequencing of populations.

    Using “Omics” to Discover Predictive Biomarkers in Women at High Risk of Spontaneous Preterm Birth

    Spontaneous preterm birth (sPTB) is a complex pregnancy syndrome that remains poorly understood and is associated with significant perinatal morbidity and mortality worldwide. Current research suggests that there are multiple disordered physiological processes that trigger a final common pathway of early labour, rather than a single specific cause. It is this heterogeneity that has hindered the discovery of a single predictive biomarker, and existing screening methods for sPTB prediction are insufficient to detect all women at risk. Consequently, our inability to identify women at risk inhibits efforts at prevention, which cannot be achieved without a better understanding of causation or a more robust way of accurately discriminating those at high risk. The development of “omics” technology has led to exciting breakthroughs in other areas of medicine and offers new avenues of investigation for sPTB prediction. The primary aim of the thesis was to establish a way of combining different types of “omics” analysis from the same individual in a pilot study to identify candidate biomarker predictors or pathways. Three different “omic” methodologies (genomics, transcriptomics and metabolomics) were used to analyse blood taken from asymptomatic women at high risk of sPTB at 16 and 20 weeks of pregnancy. Lastly, I investigated whether there are distinct differences in biomarkers between the PPROM and sPTB subgroups of spontaneous preterm birth. At the level of individual omics, only transcriptomics showed an association with sPTB. Gene set enrichment in this population demonstrates that the selenoamino acid pathway differentiates asymptomatic high-risk women. Hierarchical clustering in a non-linear distance matrix differentiated all but one of the sPTB and PPROM cases. More studies are required to validate the findings from our analysis. Data from each omics discipline were combined in a single data matrix and machine learning analyses were applied.
The area under the curve (AUC) of receiver operating characteristic (ROC) values for linear discriminant analysis (0.90), genetic expression programming (0.70), K-means (1.00), linear support vector machine (0.96), support vector machine with a Gaussian kernel (0.96), probabilistic neural network (1.00) and random forest (0.96) demonstrate that most machine learning methods perform well on our dataset. Sample sizes needed to reach excellent (AUC = 0.9) vs. moderate (AUC = 0.7) prediction performance were found to be within realistic ranges. This study provides a conceptual analytical framework for the prediction of sPTB. For a larger cohort, prediction power is excellent, making individualized preterm prediction a realistic possibility.
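For readers unfamiliar with the metric, the AUC of a ROC curve equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the scores and labels are invented for illustration and are not data from the thesis.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case (e.g. sPTB), 0 for a control. Ties count as 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Invented classifier scores for four individuals.
auc_mixed = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
auc_perfect = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
```

An AUC of 1.00 means every case outscored every control. In a small pilot cohort such a result is easier to obtain by chance, which is one reason validation in a larger cohort matters.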