622 research outputs found

    Distribution of Genomic Variation in the USDA Soybean Germplasm Collection and Relationship with Phenotypic Variation

    The USDA Soybean Germplasm Collection harbors a large stock of genetic diversity with potential to accelerate soybean cultivar development. The extent and nature of favorable alleles contained in the collection are not well known, nor is the distribution of genetic variation and how it relates to phenotypic variation. The genotyping of the entire USDA Soybean Germplasm Collection marked the beginning of a systematic exploration of genetic diversity for genetic research and breeding. In this research, we conducted the first comprehensive analysis of population structure on the collection of ~14,400 soybean accessions [Glycine max (L.) Merr. and G. soja Siebold & Zucc.] that were genotyped using a 50K SNP chip. Accessions originating from Japan and Korea diverged from the Chinese accessions. The ancestry of founders of the American accessions derived mostly from two Chinese subpopulations, which reflects the composition of the American accessions as a whole. A genome-wide association study of seed protein and oil on ~12,000 accessions is the largest reported to date in plants and identified strong single nucleotide polymorphism (SNP) signals on chromosomes 20 and 15. The haplotype effects of the chromosome 20 region show a strong negative relationship between oil and protein at this locus, indicating negative pleiotropic effects or multiple closely linked loci in repulsion phase linkage. Genome-wide association mapping for ten descriptive traits identified a total of 23 known and unknown genes controlling the phenotypic variants. Because some of those genes had been cloned, we were able to show that the narrow SNP signal regions had chromosomal base pair spans that, with few exceptions, bracketed the base pair region of the cloned gene coding sequences, despite variation in the SNP distribution of the chip SNP set.
We also elucidate the genetic basis of local adaptation by exploring the natural variation available in 3,012 locally adapted landrace accessions from across the geographical range of soybean. Our approach, using selection mapping and landscape genomic association methods, identified important candidate genes related to drought and heat stress, and revealed important signatures of directional selection that are likely involved in the geographic divergence of soybean. Advisors: Aaron J. Lorenz and George L. Grae

    Development of a multi-omics approach to identify highly correlated transcriptomic, proteomic and metabolic signatures in maize B73 and FR697 drought stressed nodal roots

    Maize is one of the most important crops grown in the continental US and worldwide, and as such, major interest is directed towards understanding the impact of drought conditions on maize growth and development. Nodal roots, which develop from the base of the stem and produce the framework of the mature root system, can continue to grow under water stress conditions that inhibit the growth of the leaves and stem. To better understand the molecular mechanisms that lead to this remarkable ability, we analyzed multiomics (transcriptome, proteome, metabolome) datasets generated from the growth zone of nodal roots collected from the reference inbred line B73 and from inbred line FR697, which exhibits a relatively greater ability to maintain root elongation under water-stressed conditions. We developed an informatics analytics pipeline consisting of a discriminatory multiomics data integration approach combining sparse Generalized Canonical Correlation Analysis (sGCCA) and generalized Partial Least Squares (PLS) analysis to incorporate all datasets into one holistic global network and form clusters spanning all omics levels. Significant elements from these clusters were connected to various observations associated with water stress in the root tip samples and reinforced by their roles in biological pathways. We also generated an annotated "SuperTranscriptome" assembly from PacBio Iso-Seq and RNA-Seq datasets to serve as a representative assembly for the FR697 genotype. The results were incorporated into the KBCommons maize database for storage and analysis from various viewpoints. To visualize interactions between the many elements, we are also developing a suite of 3D visualization tools, collectively called the "KBCommons Omics Studio", integrated with the KBCommons framework. Using these methods, we showcase possible biomarkers related to drought stress and allied observations. Supported by NSF Plant Genome Program IOS #1444448

    From Classical to Modern Computational Approaches to Identify Key Genetic Regulatory Components in Plant Biology

    The selection of plant genotypes with improved productivity and tolerance to environmental constraints has always been a major concern in plant breeding. Classical approaches based on the generation of variability and selection of better phenotypes from large variant collections have improved their efficacy and processivity due to the implementation of molecular biology techniques, particularly genomics, Next Generation Sequencing and other omics such as proteomics and metabolomics. In this regard, the identification of interesting variants before they develop the phenotype trait of interest with molecular markers has advanced the breeding process of new varieties. Moreover, the correlation of phenotype or biochemical traits with gene expression or protein abundance has boosted the identification of potential new regulators of the traits of interest, using a relatively low number of variants. These important breakthrough technologies, built on top of classical approaches, will be improved in the future by including the spatial variable, allowing the identification of gene(s) involved in key processes at the tissue and cell levels

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    Next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than their array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that clusters the rows and columns of a gene expression matrix simultaneously. Compared with traditional clustering, which targets global patterns, biclustering can predict local patterns. This unique feature makes biclustering very useful when applied to big gene expression data, since genes that participate in a cellular process are only active in specific conditions and thus are usually co-expressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, mainly due to the lack of (i) a consideration of the high sparsity of RNA-Seq data, especially for scRNA-Seq data, and (ii) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data analysis.
Critical novelties of the algorithm include (i) the use of a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) the adoption of the Gaussian mixture distribution and an information-divergence objective function to capture shared transcriptional regulation signals among a set of genes; (iii) the use of a Dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) the development of a statistical framework to evaluate the significance of all the identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 had superior performance in functional module detection and cell type classification. The applications to temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR. This R package is characterized by an 82% average improved efficiency compared to the source C code of QUBIC. It provides a set of comprehensive functions to facilitate biclustering-based biological studies, including the discretization of expression data, query-based biclustering, bicluster expanding, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. In the end, a systematic summary is provided regarding the primary applications of biclustering for biological data and more advanced applications for biomedical data. It will help researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency

    Beyond skin-deep: targeting the plant surface for crop improvement

    The aboveground plant surface is a well-adapted tissue layer that acts as an interface between the plant and its surrounding environment. As such, its primary role is to protect against desiccation and maintain the gaseous exchange required for photosynthesis. Further, this surface layer provides a barrier against pathogens and herbivory, while attracting pollinators and agents of seed dispersal. In the context of agriculture, the plant surface is strongly linked to postharvest crop quality and yield. The epidermal layer contains several unique cell types adapted for these functions, while the nonlignified aboveground plant organs are covered by a hydrophobic cuticular membrane. This review aims to provide an overview of the latest understanding of the molecular mechanisms underlying crop cuticle and epidermal cell formation, with focus placed on genetic elements contributing towards quality, yield, drought tolerance, herbivory defence, pathogen resistance, pollinator attraction and sterility, while highlighting the interrelatedness of plant surface development and traits. Potential crop improvement strategies utilising this knowledge are outlined in the context of the recent development of new breeding technique

    Identification of rumen microbial biomarkers linked to methane emission in Holstein dairy cows

    Mitigation of greenhouse gas emissions is relevant for reducing the environmental impact of ruminant production. In this study, the rumen microbiome from Holstein cows was characterized through a combination of 16S rRNA gene and shotgun metagenomic sequencing. Methane production (CH4) and dry matter intake (DMI) were individually measured over 4–6 weeks to calculate the CH4 yield (CH4y = CH4/DMI) per cow. We implemented a combination of clustering, multivariate and mixed model analyses to identify a set of operational taxonomic units (OTUs) jointly associated with CH4y and the structure of ruminal microbial communities. Three ruminotype clusters (R1, R2 and R3) were identified, and R2 was associated with higher CH4y. The taxonomic composition of R2 had lower abundance of Succinivibrionaceae and Methanosphaera, and higher abundance of Ruminococcaceae, Christensenellaceae and Lachnospiraceae. Metagenomic data confirmed the lower abundance of Succinivibrionaceae and Methanosphaera in R2 and identified genera (Fibrobacter and unclassified Bacteroidales) not highlighted by metataxonomic analysis. In addition, the functional metagenomic analysis revealed that samples classified in cluster R2 were overrepresented by genes coding for KEGG modules associated with methanogenesis, including a significant relative abundance of the methyl‐coenzyme M reductase enzyme. Based on the cluster assignment, we applied a sparse partial least‐squares discriminant analysis at the taxonomic and functional levels. In addition, we implemented a sparse partial least‐squares (sPLS) regression model using the phenotypic variation of CH4y. By combining these two approaches, we identified 86 discriminant bacterial OTUs, notably including families linked to CH4 emission such as Succinivibrionaceae, Ruminococcaceae, Christensenellaceae, Lachnospiraceae and Rikenellaceae. These selected OTUs explained 24% of the CH4y phenotypic variance, whereas the host genome contribution was ~14%.
In summary, we identified rumen microbial biomarkers associated with the methane production of dairy cows; these biomarkers could be used for targeted methane‐reduction selection programmes in the dairy cattle industry provided they are heritable.

    Benchmarking of differential abundance methods and development of bioinformatics and statistical tools for metagenomics data analysis

    Microbiome and metagenomics data analysis has been the main theme of my PhD programme. As a main goal, the thesis moves from the observed limitations of the differential abundance analysis tools to a benchmark and a framework against which they can be measured and compared. As a secondary goal, the presentation of some case studies emphasises the need for sound exploratory and inferential statistical analysis of metabarcoding data. Firstly, I present two closely related studies in which differential abundance analysis methods play the main role. Differential abundance analysis is the principal approach to detect differences in microbial community composition between sample groups and is hence the first step towards understanding microbial community structures and the relationships between microbial compositions and the environment.
I start by introducing a benchmarking study in which differential abundance analysis methods, including some from other domains (e.g., RNA-Seq and single-cell RNA-Seq), were applied to a collection of microbiome datasets to evaluate their performance. Then, I present a software package that I developed from the results obtained in the previous research. The software package, written in R, is currently available on Bioconductor (i.e., an open-source software platform for analysing and visualising biological data). It allows users to replicate the benchmarking of differential abundance analysis methods and evaluate their performance on their own datasets. Secondly, I highlight the challenges of microbiome data analysis by presenting two case studies about the human microbiome and its composition and dynamics in both disease and healthy states. In the first study, healthy volunteers were treated with a probiotic mixture and the changes in the gut microbiome were studied in conjunction with some psychological aspects. Careful data exploration, clustering, and mixed-effects linear regression models unveiled a strong subject-specific effect and the presence of different baseline bacteriotypes, which masked the overall effect of the probiotic treatment. In the second study, I show how disease-related microbial biomarkers for eosinophilic oesophagitis (i.e., a chronic immune-mediated inflammatory disease of the oesophagus that causes dysphagia, food impaction of the oesophagus, and oesophageal strictures) were identified from saliva samples. Despite the low sample size, it was possible to train a model to discriminate between case and control states with decent accuracy. While still premature, this represents a promising step towards the non-invasive diagnosis of eosinophilic oesophagitis, which is currently possible only through oesophageal biopsy

    Distinct expression and methylation patterns for genes with different fates following a single whole-genome duplication in flowering plants

    For most sequenced flowering plants, multiple whole-genome duplications (WGDs) are found. Duplicated genes following a WGD often have different fates: they can quickly disappear again, be retained for long(er) periods, or subsequently undergo small-scale duplications. However, how differences in expression, epigenetic regulation, and functional constraints are associated with these different gene fates following a WGD still requires further investigation, because successive WGDs in angiosperms complicate the gene trajectories. In this study, we investigate lotus (Nelumbo nucifera), an angiosperm with a single WGD at the K–Pg boundary. Based on improved intraspecific-synteny identification by a chromosome-level assembly, transcriptome, and bisulfite sequencing, we explore not only the fundamental distinctions in genomic features, expression, and methylation patterns of genes with different fates after a WGD but also the factors that shape post-WGD expression divergence and expression bias between duplicates. We found that after a WGD, genes that returned to single copies show the highest levels and breadth of expression, gene body methylation, and intron numbers, whereas the long-retained duplicates exhibit the highest degrees of protein–protein interactions and protein lengths and the lowest methylation in gene flanking regions. For those long-retained duplicate pairs, the degree of expression divergence correlates with their sequence divergence, degree in protein–protein interactions, and expression level, whereas their biases in expression level reflecting subgenome dominance are associated with the bias of subgenome fractionation. Overall, our study on the paleopolyploid nature of lotus highlights the impact of different functional constraints on gene fate and duplicate divergence following a single WGD in plant