netgwas: An R Package for Network-Based Genome-Wide Association Studies
Graphical models are powerful tools for modeling and making statistical
inferences regarding complex associations among variables in multivariate data.
In this paper we introduce the R package netgwas, which is designed based on
undirected graphical models to accomplish three important and interrelated
goals in genetics: constructing linkage maps, reconstructing linkage
disequilibrium (LD) networks from multi-loci genotype data, and detecting
high-dimensional genotype-phenotype networks. The netgwas package deals with
species with any chromosome copy number in a unified way, unlike other
software. It implements recent improvements in both linkage map construction
(Behrouzi and Wit, 2018) and the reconstruction of conditional independence
networks for non-Gaussian continuous data, discrete data, and mixed
discrete-and-continuous data (Behrouzi and Wit, 2017). Such datasets, for
example genotype data and genotype-phenotype data, routinely occur in genetics
and genomics. We demonstrate the value of our package functionality by applying it to
various multivariate example datasets taken from the literature. We show, in
particular, that our package allows a more realistic analysis of data, as it
adjusts for the effect of all other variables while performing pairwise
associations. This feature controls for spurious associations between variables
that can arise from classical multiple testing approaches. This paper includes a
brief overview of the statistical methods which have been implemented in the
package. The main body of the paper explains how to use the package. The
package uses a parallelization strategy on multi-core processors to speed up
computations for large datasets. In addition, it contains several functions for
simulation and visualization. The netgwas package is freely available at
https://cran.r-project.org/web/packages/netgwas
Comment: 32 pages, 9 figures; due to the limitation "The abstract field cannot
be longer than 1,920 characters", the abstract appearing here is slightly
shorter than that in the PDF file.
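The abstract's point about adjusting for all other variables when assessing a pairwise association can be illustrated with a partial correlation. The following is a minimal Python sketch under stated assumptions, not the netgwas implementation or its API: the toy variables `x`, `y`, `z` and the helper names `pearson` and `partial_corr` are hypothetical, and the three-variable formula stands in for the copula-based graphical models the abstract describes.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation of x and y after adjusting for a single third variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (hypothetical): z drives both x and y; x and y have no direct link.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [v + 0.5 * rng.gauss(0, 1) for v in z]
y = [v + 0.5 * rng.gauss(0, 1) for v in z]

marginal = pearson(x, y)          # strong, induced by the shared driver z
adjusted = partial_corr(x, y, z)  # close to zero once z is accounted for
print(f"marginal={marginal:.2f} adjusted={adjusted:.2f}")
```

In this toy setting a purely pairwise test would flag `x` and `y` as associated, whereas the adjusted value shows no direct link, which is the kind of spurious association the abstract refers to.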
De novo construction of polyploid linkage maps using discrete graphical models
Linkage maps are used to identify the location of genes responsible for
traits and diseases. New sequencing techniques have created opportunities to
substantially increase the density of genetic markers. Such revolutionary
advances in technology have given rise to new challenges, such as creating
high-density linkage maps. Current multiple testing approaches based on
pairwise recombination fractions are underpowered in the high-dimensional
setting and do not extend easily to polyploid species. We propose to construct
linkage maps using graphical models either via a sparse Gaussian copula or a
nonparanormal skeptic approach. Linkage groups (LGs), typically chromosomes,
and the order of markers in each LG are determined by inferring the conditional
independence relationships among large numbers of markers in the genome.
Through simulations, we illustrate the utility of our map construction method
and compare its performance with other available methods, both when the data
are clean and contain no missing observations and when data contain genotyping
errors and are incomplete. We apply the proposed method to two genotype
datasets: barley and potato from diploid and polyploid populations,
respectively. Our comprehensive map construction method makes full use of the
dosage SNP data to reconstruct linkage maps for any bi-parental diploid and
polyploid species. We have implemented the method in the R package netgwas.
Comment: 25 pages, 7 figures
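The nonparanormal skeptic approach mentioned in the abstract replaces the sample correlation matrix with one built from rank correlations. A minimal Python sketch of that first step follows, using the standard transform sin(pi/2 * tau) of Kendall's tau. It is an illustration only, not the netgwas code: the function names, the use of tau-b to handle tied dosage values, and the toy marker dosages are all assumptions, and the subsequent sparse inverse-covariance estimation is not shown.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation; tolerates ties, as in dosage data."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def skeptic_correlation(x, y):
    """Rank-based correlation estimate: sin(pi/2 * tau)."""
    return math.sin(math.pi / 2 * kendall_tau_b(x, y))


def skeptic_matrix(columns):
    """Symmetric matrix of skeptic correlations between all marker columns."""
    p = len(columns)
    return [
        [1.0 if i == j else skeptic_correlation(columns[i], columns[j])
         for j in range(p)]
        for i in range(p)
    ]


# Hypothetical allele dosages (0..4) for three markers in eight individuals.
markers = [
    [0, 1, 1, 2, 3, 4, 2, 0],
    [0, 1, 2, 2, 3, 4, 1, 0],
    [4, 0, 2, 1, 3, 0, 2, 1],
]
for row in skeptic_matrix(markers):
    print([round(v, 2) for v in row])
```

In a full analysis this matrix would be passed to a sparse graphical model estimator, and the inferred conditional independence structure would then be used to group and order markers.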
Genomic and phenotypic analysis of Vavilov's historic landraces reveals the impact of environment and genomic islands of agronomic traits.
The Vavilov Institute of Plant Genetic Resources (VIR), in St. Petersburg, Russia, houses a unique genebank, with historical collections of landraces. When they were collected, the geographical distribution and genetic diversity of most crops closely reflected their historical patterns of cultivation established over the preceding millennia. We employed a combination of genomics, computational biology and phenotyping to characterize VIR's 147 chickpea accessions from Turkey and Ethiopia, representing chickpea's center of origin and a major location of secondary diversity. Genotyping by sequencing identified 14,059 segregating polymorphisms and genome-wide association studies revealed 28 GWAS hits in potential candidate genes likely to affect traits of agricultural importance. The proportion of polymorphisms shared among accessions is a strong predictor of phenotypic resemblance, and of environmental similarity between historical sampling sites. We found that 20 out of 28 polymorphisms, associated with multiple traits, including days to maturity, plant phenology, and yield-related traits such as pod number, localized to chromosome 4. We hypothesize that selection and introgression via inadvertent hybridization between more and less advanced morphotypes might have resulted in agricultural improvement genes being aggregated to genomic 'agro islands', and in genotype-to-phenotype relationships resembling widespread pleiotropy
Populations in statistical genetic modelling and inference
What is a population? This review considers how a population may be defined
in terms of understanding the structure of the underlying genetics of the
individuals involved. The main approach is to consider statistically
identifiable groups of randomly mating individuals, which is well defined in
theory for any type of (sexual) organism. We discuss generative models using
drift, admixture and spatial structure, and the ancestral recombination graph.
These are contrasted with statistical models for inference, principal component
analysis and other `non-parametric' methods. The relationships between these
approaches are explored with both simulated and real-data examples. The
state-of-the-art practical software tools are discussed and contrasted. We
conclude that populations are a useful theoretical construct that can be well
defined in theory and often approximately exist in practice.
Advancing the analysis of bisulfite sequencing data in its application to ecological plant epigenetics
The aim of this thesis is to bridge the gap between the state-of-the-art bioinformatic tools and resources, currently at the forefront of epigenetic analysis, and their emerging applications to non-model species in the context of plant ecology. New, high-resolution research tools are presented; first in a specific sense, by providing new genomic resources for a selected non-model plant species, and also in a broader sense, by developing new software pipelines to streamline the analysis of bisulfite sequencing data, in a manner which is applicable to a wide range of non-model plant species. The selected species is the annual field pennycress, Thlaspi arvense, which belongs in the same lineage of the Brassicaceae as the closely-related model species, Arabidopsis thaliana, and yet does not benefit from such extensive genomic resources. It is one of three key species in a Europe-wide initiative to understand how epigenetic mechanisms contribute to natural variation, stress responses and long-term adaptation of plants.
To this end, this thesis provides a high-quality, chromosome-level assembly for T. arvense, alongside a rich complement of feature annotations of particular relevance to the study of epigenetics. The genome assembly encompasses a hybrid approach, involving both PacBio continuous long reads and circular consensus sequences, alongside Hi-C sequencing, PCR-free Illumina sequencing and genetic maps. The result is a significant improvement in contiguity over the existing draft state from earlier studies.
Much of the basis for building an understanding of epigenetic mechanisms in non-model species centres around the study of DNA methylation, and in particular the analysis of bisulfite sequencing data to bring methylation patterns into nucleotide-level resolution. In order to maintain a broad level of comparison between T. arvense and the other selected species under the same initiative, a suite of software pipelines has also been developed, covering mapping, the quantification of methylation values, differential methylation between groups, and epigenome-wide association studies. Furthermore, presented herein is a novel algorithm which can facilitate accurate variant calling from bisulfite sequencing data using conventional approaches, such as FreeBayes or Genome Analysis ToolKit (GATK), which until now was feasible only with specifically-adapted software. This enables researchers to obtain high-quality genetic variants, often essential for contextualising the results of epigenetic experiments, without the need for additional sequencing libraries alongside. Each of these aspects is thoroughly benchmarked, integrated into a robust workflow management system, and adheres to the principles of FAIR (Findability, Accessibility, Interoperability and Reusability). Finally, further consideration is given to the unique difficulties presented by population-scale data, and a number of concepts and ideas are explored in order to improve the feasibility of such analyses.
In summary, this thesis introduces new high-resolution tools to facilitate the analysis of epigenetic mechanisms, specifically relating to DNA methylation, in non-model plant data. In addition, thorough benchmarking standards are applied, showcasing the range of technical considerations which are of principal importance when developing new pipelines and tools for the analysis of bisulfite sequencing data. The complete “Epidiverse Toolkit” is available at https://github.com/EpiDiverse and will continue to be updated and improved in the future.
ABSTRACT
ACKNOWLEDGEMENTS
1 INTRODUCTION
1.1 ABOUT THIS WORK
1.2 BIOLOGICAL BACKGROUND
1.2.1 Epigenetics in plant ecology
1.2.2 DNA methylation
1.2.3 Maintenance of 5mC patterns in plants
1.2.4 Distribution of 5mC patterns in plants
1.3 TECHNICAL BACKGROUND
1.3.1 DNA sequencing
1.3.2 The case for a high-quality genome assembly
1.3.3 Sequence alignment for NGS
1.3.4 Variant calling approaches
2 BUILDING A SUITABLE REFERENCE GENOME
2.1 INTRODUCTION
2.2 MATERIALS AND METHODS
2.2.1 Seeds for the reference genome development
2.2.2 Sample collection, library preparation, and DNA sequencing
2.2.3 Contig assembly and initial scaffolding
2.2.4 Re-scaffolding
2.2.5 Comparative genomics
2.3 RESULTS
2.3.1 An improved reference genome sequence
2.3.2 Comparative genomics
2.4 DISCUSSION
3 FEATURE ANNOTATION FOR EPIGENOMICS
3.1 INTRODUCTION
3.2 MATERIALS AND METHODS
3.2.1 Tissue preparation for RNA sequencing
3.2.2 RNA extraction and sequencing
3.2.3 Transcriptome assembly
3.2.4 Genome annotation
3.2.5 Transposable element annotations
3.2.6 Small RNA annotations
3.2.7 Expression atlas
3.2.8 DNA methylation
3.3 RESULTS
3.3.1 Transcriptome assembly
3.3.2 Protein-coding genes
3.3.3 Non-coding loci
3.3.4 Transposable elements
3.3.5 Small RNA
3.3.6 Pseudogenes
3.3.7 Gene expression atlas
3.3.8 DNA Methylation
3.4 DISCUSSION
4 BISULFITE SEQUENCING METHODS
4.1 INTRODUCTION
4.2 PRINCIPLES OF BISULFITE SEQUENCING
4.3 EXPERIMENTAL DESIGN
4.4 LIBRARY PREPARATION
4.4.1 Whole Genome Bisulfite Sequencing (WGBS)
4.4.2 Reduced Representation Bisulfite Sequencing (RRBS)
4.4.3 Target capture bisulfite sequencing
4.5 BIOINFORMATIC ANALYSIS OF BISULFITE DATA
4.5.1 Quality Control
4.5.2 Read Alignment
4.5.3 Methylation Calling
4.6 ALTERNATIVE METHODS
5 FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS
5.1 INTRODUCTION
5.2 MATERIALS AND METHODS
5.2.1 Reference species
5.2.2 Natural accessions
5.2.3 Read simulation
5.2.4 Read alignment
5.2.5 Mapping rates
5.2.6 Precision-recall
5.2.7 Coverage deviation
5.2.8 DNA methylation analysis
5.3 RESULTS
5.4 DISCUSSION
5.5 A PIPELINE FOR WGBS ANALYSIS
6 THERE AND BACK AGAIN: INFERRING GENOMIC INFORMATION
6.1 INTRODUCTION
6.1.1 Implementing a new approach
6.2 MATERIALS AND METHODS
6.2.1 Validation datasets
6.2.2 Read processing and alignment
6.2.3 Variant calling
6.2.4 Benchmarking
6.3 RESULTS
6.4 DISCUSSION
6.5 A PIPELINE FOR SNP VARIANT ANALYSIS
7 POPULATION-LEVEL EPIGENOMICS
7.1 INTRODUCTION
7.2 CHALLENGES IN POPULATION-LEVEL EPIGENOMICS
7.3 DIFFERENTIAL METHYLATION
7.3.1 A pipeline for case/control DMRs
7.3.2 A pipeline for population-level DMRs
7.4 EPIGENOME-WIDE ASSOCIATION STUDIES (EWAS)
7.4.1 A pipeline for EWAS analysis
7.5 GENOTYPING-BY-SEQUENCING (EPIGBS)
7.5.1 Extending the epiGBS pipeline
7.6 POPULATION-LEVEL HAPLOTYPES
7.6.1 Extending the EpiDiverse/SNP pipeline
8 CONCLUSION
APPENDICES
A. SUPPLEMENT: BUILDING A SUITABLE REFERENCE GENOME
B. SUPPLEMENT: FEATURE ANNOTATION FOR EPIGENOMICS
C. SUPPLEMENT: FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS
D. SUPPLEMENT: INFERRING GENOMIC INFORMATION
BIBLIOGRAPHY
THREaD Mapper Studio: a novel, visual web server for the estimation of genetic linkage maps
The estimation of genetic linkage maps is a key component in plant and animal research, providing both an indication of the genetic structure of an organism and a mechanism for identifying candidate genes associated with traits of interest. Because of this importance, several computational solutions to genetic map estimation exist, mostly implemented as stand-alone software packages. However, the estimation process is often largely hidden from the user. Consequently, problems such as a program crashing may occur that leave a user baffled. THREaD Mapper Studio (http://cbr.jic.ac.uk/threadmapper) is a new web site that implements a novel, visual and interactive method for the estimation of genetic linkage maps from DNA markers. The rationale behind the web site is to make the estimation process as transparent and robust as possible, while also allowing users to use their expert knowledge during analysis. Indeed, the 3D visual nature of the tool allows users to spot features in a data set, such as outlying markers and potential structural rearrangements that could cause problems with the estimation procedure and to account for them in their analysis. Furthermore, THREaD Mapper Studio facilitates the visual comparison of genetic map solutions from third party software, aiding users in developing robust solutions for their data sets
Genotype imputation accuracy in a F2 pig population using high density and low density SNP panels
Background: F2 resource populations have been used extensively to map QTL segregating between pig breeds. A limitation associated with the use of these resource populations for fine mapping of QTL is the reduced number of founding individuals and recombinations of founding haplotypes occurring in the population. These limitations, however, become advantageous when attempting to impute unobserved genotypes using within-family segregation information. A trade-off would be to re-type F2 populations using high density SNP panels for founding individuals and low density panels (tagSNP) in F2 individuals, followed by imputation. Subsequently, a combined meta-analysis of several populations would provide adequate power and resolution for QTL mapping, and could be achieved at relatively low cost. Such a strategy allows the wealth of phenotypic information that has previously been obtained on experimental resource populations to be further mined for QTL identification. In this study we used experimental and simulated high density genotypes (HD-60K) from an F2 cross to estimate imputation accuracy under several genotyping scenarios. Results: Selection of tagSNP using physical distance or linkage disequilibrium information produced similar imputation accuracies. In particular, tagSNP sets averaging 1 SNP every 2.1 Mb (1,200 SNP genome-wide) yielded imputation accuracies (IA) close to 0.97. If, instead of using custom panels, the commercially available 9K chip is used in the F2, IA reaches 0.99. In order to attain such high imputation accuracy, the F0 and F1 generations should be genotyped at high density. Alternatively, when only the F0 is genotyped at HD, while F1 and F2 are genotyped with a 9K panel, IA drops to 0.90. Conclusions: Combining 60K and 9K panels with imputation in F2 populations is an appealing strategy to re-genotype existing populations at a fraction of the cost.
Affiliations: Gualdron Duarte, Jose Luis: Michigan State University, United States; Universidad de Buenos Aires, Facultad de Agronomia, Departamento de Producción Animal, Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina. Bates, Ronald O.: Michigan State University, United States. Ernst, Catherine W.: Michigan State University, United States. Raney, Nancy E.: Michigan State University, United States. Cantet, Rodolfo Juan Carlos: Universidad de Buenos Aires, Facultad de Agronomia, Departamento de Producción Animal, Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina. Steibel, Juan P.: Michigan State University, United States.
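The abstract reports imputation accuracy (IA) without defining it in the text shown here. The Python sketch below assumes one common definition, the proportion of masked genotype calls that the imputation recovers exactly; the study itself may use a different measure (for example a correlation between true and imputed dosages). The function name, the 0/1/2 genotype coding, and the toy data are hypothetical.

```python
def imputation_accuracy(true_genotypes, imputed_genotypes, masked_positions):
    """Fraction of masked genotype calls that were imputed correctly.

    Genotypes are coded as allele counts (0, 1, 2). Only positions that were
    hidden from the imputation step are scored.
    """
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    matches = sum(
        1 for i in masked_positions
        if true_genotypes[i] == imputed_genotypes[i]
    )
    return matches / len(masked_positions)


# Hypothetical example: ten SNPs for one individual, five of them masked
# (as if typed only on the high density panel) and then imputed.
true_calls = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
imputed_calls = [0, 1, 2, 1, 0, 1, 1, 1, 0, 2]
masked = [1, 3, 5, 7, 9]

ia = imputation_accuracy(true_calls, imputed_calls, masked)
print(f"IA = {ia:.2f}")
```

Scoring only the masked positions matters: including SNPs that were actually genotyped on the low density panel would inflate the estimate.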
Genomic tools and molecular breeding approaches for the domestication of field cress (Lepidium campestre L.)
Field cress (Lepidium campestre L.) is a biennial, self-pollinated plant with a small genome. The ever-increasing global population, alongside climate change, calls for urgent action to protect ecosystems. Domesticating multi-purpose species such as field cress could be part of the solution to the challenges posed by climate change and population growth. In addition to its oil-producing potential, growing domesticated field cress on arable land has multiple benefits, such as protecting against environmental contamination and providing food and feed. In view of this potential, identifying the genomic variation underlying important traits using genomic tools is a pivotal step in field cress domestication. The main goal of the research in this thesis was to develop genomic tools for field cress domestication, specifically by constructing a genetic linkage map, identifying the quantitative trait loci (QTL) underpinning domestication traits, and elucidating the common genetic variants associated with seed yield as well as seed oil, protein, and moisture contents. An integrated mapping approach was used to develop the first genetic linkage map for field cress. Relying on this linkage map, domestication QTL were identified using linkage analysis, and common variants were identified using a genome-wide association study (GWAS). Furthermore, the linkage map will be used to guide the development of a reference genome from whole-genome sequencing (WGS) data. Given further functional genomic efforts, the identified QTL and single variants could facilitate domestication and genomics-assisted breeding in field cress, including the use of evolving approaches such as genome-wide prediction.