726 research outputs found

    OPENMENDEL: A Cooperative Programming Project for Statistical Genetics

    Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project. Comment: 16 pages, 2 figures, 2 tables

    Novel techniques for accelerating statistical operations on compressed genomic data

    Over the last decades, the availability of genetic data has exploded and genomic information is widely used in a variety of fields today. While the cost of genotyping and sequence assembly has been steadily decreasing, software in quantitative genetics has been struggling to keep up with increasing computational demands. Many existing software solutions use strategies for shared-memory parallelism and instruction-level parallelism. However, partly due to a lack of suitable hardware instructions, the dissemination of software that utilizes accelerator hardware has been limited. In this thesis, novel methods for the efficient processing of genomic data are presented. By utilizing low-precision integer instructions on modern NVIDIA® GPUs, the necessity to decompress SNP data for statistical evaluations is avoided. Due to the memory efficiency of compressed genomic storage formats, datasets of large populations with a high number of SNPs can be analyzed on a single datacenter GPU. The benefits of these new techniques are demonstrated through examples of important quantities in quantitative genetics. First, it is shown that the analytical calculation of population statistics, such as the genomic relationship matrix or linkage disequilibrium, is significantly accelerated compared to existing methods. Second, the numerical evaluation of a single-step BLUP model is used to demonstrate that the use of accelerators can significantly reduce computing times required for estimating genetic values based on iterative-solver methods. Lastly, it is illustrated that the estimation of parameters for an important covariance model can be significantly improved
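The idea of computing a genomic relationship matrix from compressed genotypes can be illustrated with a minimal NumPy sketch. It is not the thesis's GPU implementation: it assumes a simple 2-bit packing (four allele counts per byte), unpacks one block of bytes at a time on the CPU, and uses the VanRaden-style scaling G = ZZ'/(2 Σ p(1-p)) with column-centered Z. The function names `pack_2bit`, `unpack_block`, and `grm_from_packed` are hypothetical. The point it shows is that the heavy product M M' can be accumulated in integer arithmetic on the raw allele counts, with centering applied afterwards as a cheap correction.

```python
import numpy as np

def pack_2bit(genotypes):
    """Pack an (n, m) array of allele counts 0/1/2 into 2 bits per entry."""
    g = np.asarray(genotypes, dtype=np.uint8)
    pad = (-g.shape[1]) % 4                  # pad SNP count to a multiple of 4
    g = np.pad(g, ((0, 0), (0, pad)))        # padded entries are 0 and contribute nothing
    g = g.reshape(g.shape[0], -1, 4)
    return g[..., 0] | (g[..., 1] << 2) | (g[..., 2] << 4) | (g[..., 3] << 6)

def unpack_block(packed_block):
    """Expand a block of packed bytes back to allele counts (4 SNPs per byte)."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    g = (packed_block[..., None] >> shifts) & 3
    return g.reshape(packed_block.shape[0], -1)

def grm_from_packed(packed, block_bytes=1024):
    """Genomic relationship matrix from packed genotypes (assumed scaling, see text)."""
    n = packed.shape[0]
    mmt = np.zeros((n, n), dtype=np.int64)   # integer accumulator for M M^T
    denom = 0.0                              # accumulates 2 * sum_j p_j (1 - p_j)
    for start in range(0, packed.shape[1], block_bytes):
        block = unpack_block(packed[:, start:start + block_bytes]).astype(np.int32)
        mmt += block @ block.T               # integer matrix product on raw counts
        q = block.sum(axis=0) / n            # q_j = 2 p_j (column mean)
        denom += np.sum(q * (1.0 - q / 2.0))
    # Centering correction: Z Z^T = M M^T - r 1^T - 1 r^T + c, where
    # r = M M^T 1 / n and c = 1^T M M^T 1 / n^2 (follows from Z = M - 1 q^T).
    r = mmt.sum(axis=1) / n
    c = mmt.sum() / n**2
    return (mmt - r[:, None] - r[None, :] + c) / denom
```

Only one block of unpacked genotypes is held in memory at a time, so the working set stays small even when the packed matrix is large; a monomorphic-only dataset would make the denominator zero and is not handled here.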

    The discovery of novel recessive genetic disorders in dairy cattle : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Animal Science at AL Rae Centre of Genetics and Breeding, Massey University, Palmerston North, New Zealand

    The selection of desirable characteristics in livestock has resulted in the transmission of advantageous genetic variants for generations. The advent of artificial insemination has accelerated the propagation of these advantageous genetic variants and led to tremendous advances in animal productivity. However, this intensive selection has led to the rapid uptake of deleterious alleles as well. Recently, a recessive mutation in the GALNT2 gene was identified that dramatically impairs growth and production traits in dairy cattle, causing small calf syndrome. The research presented here seeks to further investigate the presence and impact of recessive mutations in dairy cattle. A primary aim of genetics is to identify causal variants and understand how they act to influence a phenotype. As datasets have expanded, larger analyses are now possible and statistical methods to discover causal mutations have become commonplace. One such method, the genome-wide association study (GWAS), presents considerable exploratory utility in identifying quantitative trait loci (QTL) and causal mutations. GWAS have predominantly focused on identifying additive genetic effects, assuming that each allele at a locus acts independently of the other, whereas non-additive effects, including dominant, recessive, and epistatic effects, have been neglected. Here, we developed a single-locus non-additive GWAS model intended for the detection of dominant and recessive genetic mechanisms. We applied our non-additive GWAS model to growth, developmental, and lactation phenotypes in dairy cattle. We identified several candidate causal mutations that are associated with moderate to large deleterious recessive disorders of animal welfare and production. These mutations included premature-stop (MUS81, ITGAL, LRCH4, RBM34), splice-disrupting (FGD4, GALNT2), and missense (PLCD4, MTRF1, DPF2, DOCK8, SLC25A4, KIAA0556, IL4R) variants, and these occur at surprisingly high frequencies in cattle.
We further investigated these candidates for anatomical, molecular, and metabolic phenotypes to understand how these disorders might manifest. In some cases, these mutations were analogous to disorder-causing mutations in other species, including Coffin-Siris syndrome (DPF2); Charcot-Marie-Tooth disease (FGD4); a congenital disorder of glycosylation (GALNT2); hyper-immunoglobulin E syndrome (DOCK8); Joubert syndrome (KIAA0556); and mitochondrial disease (SLC25A4). These discoveries demonstrate that deleterious recessive mutations exist in dairy cattle at remarkably high frequencies and that we are able to detect these disorders through modern genotyping and phenotyping capabilities. These are important findings that can be used to improve the health and productivity of dairy cattle in New Zealand and internationally.
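The abstract does not spell out the non-additive model, so the following is only a generic textbook-style sketch of how a single-locus test can separate additive from dominance effects. It assumes ordinary least squares on two covariates, the allele count (0/1/2) and a heterozygote indicator, and ignores everything a real cattle GWAS would need (relatedness, fixed effects, multiple testing). The function name `fit_additive_dominance` is hypothetical.

```python
import numpy as np

def fit_additive_dominance(allele_counts, phenotype):
    """Least-squares fit of y = mu + a * x + d * het, with x in {0, 1, 2}.

    het is 1 for heterozygotes (x == 1) and 0 otherwise. All three genotype
    classes must be present, otherwise the design matrix is rank deficient.
    """
    x = np.asarray(allele_counts, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    het = (x == 1).astype(float)
    design = np.column_stack([np.ones_like(x), x, het])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"intercept": coef[0], "additive": coef[1], "dominance": coef[2]}
```

Under this coding the genotype means are mu, mu + a + d, and mu + 2a. A fully recessive effect of the counted allele leaves heterozygotes at the baseline mean, which corresponds to d = -a; a purely additive locus has d = 0.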

    Metabolomics : a tool for studying plant biology

    In recent years new technologies have allowed gene expression, protein and metabolite profiles in different tissues and developmental stages to be monitored. This is an emerging field in plant science and is applied to diverse plant systems in order to elucidate the regulation of growth and development. The goal in plant metabolomics is to analyze, identify and quantify all low molecular weight molecules of plant organisms. The plant metabolites are extracted and analyzed using various sensitive analytical techniques, usually mass spectrometry (MS) in combination with chromatography. In order to compare the metabolome of different plants in a high-throughput manner, a number of biological, analytical and data processing steps have to be performed. In the work underlying this thesis we developed a fast and robust method for routine analysis of plant metabolite patterns using Gas Chromatography-Mass Spectrometry (GC/MS). The method was developed according to Design of Experiment (DOE) to investigate factors affecting the extraction and derivatization of the metabolites from leaves of the plant Arabidopsis thaliana. The outcome of metabolic analysis by GC/MS is a complex mixture of approximately 400 overlapping peaks. Resolving (deconvoluting) overlapping peaks is time-consuming and difficult to automate, and additional processing is needed in order to compare samples. To avoid deconvolution being a major bottleneck in high-throughput analyses we developed a new semi-automated strategy using hierarchical methods for processing GC/MS data that can be applied to all samples simultaneously. The two methods include base-line correction of the non-processed MS-data files, alignment, time-window determinations, Alternating Regression and multivariate analysis in order to detect metabolites that differ in relative concentrations between samples.
The developed methodology was applied to study the effects of the plant hormone gibberellin (GA) on the metabolome, with specific emphasis on auxin levels in Arabidopsis thaliana mutants defective in GA biosynthesis and signalling. A large series of plant samples was analysed and the resulting data were processed in less than one week with minimal labour, which is similar to the time required for the GC/MS analyses of the samples.

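    Multivariat analyse som verktøy til forståelse og reduksjon av kompleksitet av matematiske modeller i systembiologi [Multivariate analysis as a tool for understanding and reducing the complexity of mathematical models in systems biology]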

    In systems biology, technologies develop very fast, which allows us to collect massive amounts of varied data. The main interest of scientists is to gain insight into the obtained data sets and discover their inherent properties. Since the data are often rather complex and intimidating equations may be required for modelling, data analysis can be quite challenging for the majority of bio-scientists who have not mastered advanced mathematics. In this thesis it is proposed to use multivariate statistical methods as a tool for understanding the properties of complex models used for describing biological systems. The methods of multivariate analysis employed in this thesis search for latent variables that form a basis of all processes in a system. This often reduces the dimensions of the system and makes it easier to get the whole picture of what is going on. Thus, in this work, methods of multivariate analysis were used with a descriptive purpose in Papers I and IV to discover effects of input variables on a response. Often it is necessary to know a functional form that could have generated the collected data in order to study the behaviour of the system when one or another parameter is tuned. For this purpose, we propose the Direct Look-Up (DLU) approach, which is claimed here to be a worthy alternative to existing fitting methods due to its high computational speed and ability to avoid many problems such as subjectivity, choice of initial values, local optima and so on (Papers II and III). Another aspect covered in this thesis is the interpretation of function parameters in everyday human language with the use of multivariate analysis. This would enable mathematicians and bio-scientists to understand each other when describing the same object. It was accomplished here by using the concept of a metamodel and sensory analysis in Paper IV. In Paper I, a similar approach was used even though the main focus of the paper was slightly different.
The original aim of the article was to show the advantages of the multi-way GEMANOVA analysis over the traditional ANOVA analysis for certain types of data. However, in addition, the relationship between human profiling of data samples and function parameters was discovered. In situations when funds for conducting experiments are limited and it is infeasible to study all possible parameter combinations, it is necessary to have a smart way of choosing a few but most representative conditions for a particular system. In Paper V, the Multi-level Binary Replacement (MBR) design was developed for this purpose; it can also be used for searching for a relevant parameter range. This new design method was applied here in Papers II and IV for selection of samples for further analyses.
Technology in systems biology is now developing so rapidly that very large amounts of data can be collected in a short time and at relatively low cost. Researchers' main interest is typically to gain insight into the data and their inherent properties. Since the data can be quite complex and are often described by complicated, frequently non-linear, functions, data analysis can be quite challenging for many bio-scientists who have not mastered advanced mathematics. This work proposes using multivariate statistical analysis to come closer to an understanding of the properties of complicated models used to describe biological systems. The multivariate methods employed in this thesis search for latent variables that form a linear basis for, and approximation of, the complex processes in a system. This yields a simplification of the system that is easier to interpret. In this work, multivariate analysis methods were used for this descriptive purpose in Papers I and IV to discover the effects of function parameters on the properties of complex mathematical models.
It is often necessary to find a mathematical function that could have generated the collected data in order to study the behaviour of the system. For this purpose we propose a model-fitting method, the Direct Look-Up (DLU) method, which is claimed here to be a valuable alternative to existing estimation methods because of its high speed and its ability to avoid typical problems such as subjectivity, choice of initial values, local optima, and so on (Papers II and III). Another aspect covered in this thesis is the use of multivariate analysis to interpret mathematical function parameters using an everyday vocabulary. This can make it easier for mathematicians and bio-scientists to understand each other when they describe the same object. This was done here by using the idea of a metamodel and sensory analysis in Paper IV. In Paper I, a similar method was also used to obtain sensory descriptions of images generated from differential equations. The main focus of Paper I was different, however: to show the advantage of multi-way GEMANOVA analysis over traditional ANOVA for certain types of data. In that paper, GEMANOVA was used to uncover the relationship between complicated combinations of the function parameters and image descriptors. When resources for experiments are limited and it is impossible to try all parameter combinations, methods are needed that can determine a small number of parameter settings that are as representative as possible of a given system. Paper V therefore developed the Multi-level Binary Replacement (MBR) design as such a method; it can also be used to search for a relevant parameter space for computer simulations. The new design method was applied in Papers II and IV to select parameter values for further analyses.

    Accelerated matrix-vector multiplications for matrices involving genotype covariates with applications in genomic prediction

    In the last decade, a number of methods have been suggested to deal with large amounts of genetic data in genomic predictions. Yet, steadily growing population sizes and the suboptimal use of computational resources are pushing the practical application of these approaches to their limits. As an extension to the C/CUDA library miraculix, we have developed tailored solutions for the computation of genotype matrix multiplications, which is a critical bottleneck in the empirical evaluation of many statistical models. We demonstrate the benefits of our solutions using the example of single-step models, which make repeated use of this kind of multiplication. Targeting modern NVIDIA® GPUs as well as a broad range of CPU architectures, our implementation significantly reduces the time required for the estimation of breeding values in large populations. miraculix is released under the Apache 2.0 license and is freely available at https://github.com/alexfreudenberg/miraculix
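To make the role of genotype matrix-vector products in iterative solvers concrete, here is a small NumPy sketch. It does not use or imitate the miraculix API; it only illustrates one common ingredient under stated assumptions: the genotype matrix M holds small integer allele counts, the model needs products with the column-centered matrix Z = M - 1q' (q being the column means), and those products can be formed without ever materializing the floating-point matrix Z. The function names `centered_matvec` and `centered_rmatvec` are hypothetical.

```python
import numpy as np

def centered_matvec(M, v):
    """Return Z @ v for Z = M - 1 q^T without forming Z (q = column means of M)."""
    M = np.asarray(M)
    v = np.asarray(v, dtype=float)
    q = M.mean(axis=0)
    # (M - 1 q^T) v = M v - (q . v) * 1, so centering costs one extra dot product.
    return M @ v - q @ v

def centered_rmatvec(M, u):
    """Return Z.T @ u for Z = M - 1 q^T without forming Z."""
    M = np.asarray(M)
    u = np.asarray(u, dtype=float)
    q = M.mean(axis=0)
    # (M - 1 q^T)^T u = M^T u - q * sum(u)
    return M.T @ u - q * u.sum()
```

Because M stays in its compact integer form, an iterative solver that repeatedly calls these two functions only touches the small-integer genotype data plus a handful of length-n and length-m vectors, which is the kind of access pattern that compressed storage and accelerator kernels are meant to exploit.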