EXPLoRA-web: linkage analysis of quantitative trait loci using bulk segregant analysis
Identification of genomic regions associated with a phenotype of interest is a fundamental step toward solving questions in biology and improving industrial research. Bulk segregant analysis (BSA) combined with high-throughput sequencing is a technique to efficiently identify these genomic regions associated with a trait of interest. However, distinguishing true from spuriously linked genomic regions and accurately delineating the genomic positions of these truly linked regions requires the use of complex statistical models currently implemented in software tools that are generally difficult to operate for non-expert users. To facilitate the exploration and analysis of data generated by bulk segregant analysis, we present EXPLoRA-web, a web service wrapped around our previously published algorithm EXPLoRA, which exploits linkage disequilibrium to increase the power and accuracy of quantitative trait loci identification in BSA. EXPLoRA-web provides a user-friendly interface that enables easy data upload and parallel processing of different parameter configurations. Results are provided graphically and as BED and/or text files, and the input is expected in widely used formats, enabling straightforward BSA data analysis. The web server is available at http://bioinformatics.intec.ugent.be/explora-web/
An integrated platform for genome assembly, comparative genomics and management of genomic variation databases
The use of long read DNA sequencing technologies is producing an explosion of high-quality
de-novo genome assemblies. The availability of these genomes represents a major step
forward for evolution, population genomics, epidemiology, among other applications. A major
bottleneck for many research groups continues to be the availability of tools to build and
analyze the large datasets of genomes that can be produced using these technologies. In this
talk, I summarize the functionalities developed by my research group in version four of
the Next Generation Sequencing Experience Platform (NGSEP) to perform a comprehensive
analysis of long and short DNA sequencing reads. First, we designed new algorithms for
assembly of haploid and diploid samples from long DNA sequencing reads. A minimizers table
is constructed from the reads, using k-mer hash codes calculated from rankings relative to
the mode of the k-mer counts distribution. Statistics collected during this process are used as
features to build layout paths. For diploid samples, we integrated a reimplementation of the
ReFHap algorithm to perform molecular phasing. Benchmark experiments using PacBio HiFi
and Nanopore sequencing data for different species show that our solution has competitive
contiguity and efficiency, as well as superior accuracy in some cases, compared to other
currently used software. We also developed a functionality to perform ortholog identification
and gene-based alignment of assembled genomes. Proteomes for each genome are extracted
and homology relationships are efficiently predicted by building indexes of amino acid sequences
by k-mer occurrence. Then, genes are clustered in orthogroups based on the topology of the
graph induced by the predicted relationships. Gene presence/absence matrices are derived
from these orthogroups. If genome assemblies are provided as input, synteny relationships
are identified for each pair of genomes. We also implemented algorithms to perform alignment
of short and long reads to a reference genome. Based on aligned long reads, we improved the
classical variants detector to detect long structural variants. Adding up these developments,
NGSEP is a comprehensive tool to perform de-novo and reference-based analysis of DNA
sequencing reads in a wide variety of experimental settings to solve different research goals.
Book of abstracts: 4th Belgrade Bioinformatics Conference, June 19-23, 202
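The minimizers table mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a generic 64-bit hash of each k-mer rather than the ranking relative to the mode of the k-mer count distribution that the abstract describes, and the names `kmer_hash`, `minimizers` and `build_table` are hypothetical, not part of NGSEP.

```python
import hashlib
from collections import defaultdict


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (assumption: any stable hash works here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> list:
    """Return (position, k-mer) pairs: in every window of w consecutive k-mers,
    keep the k-mer with the smallest hash (leftmost on ties)."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        offset = min(range(w), key=window.__getitem__)
        selected.add(start + offset)
    return [(pos, seq[pos:pos + k]) for pos in sorted(selected)]


def build_table(reads: list, k: int = 5, w: int = 4) -> dict:
    """Map each minimizer k-mer to the (read index, position) pairs where it was selected."""
    table = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for pos, kmer in minimizers(seq, k, w):
            table[kmer].append((read_id, pos))
    return table


# Two toy reads that overlap by 13 bases; overlapping reads share minimizers.
reads = ["ACGTACGTTGCAAGCT", "TACGTTGCAAGCTGGA"]
table = build_table(reads)
shared = {kmer for kmer, hits in table.items() if {r for r, _ in hits} == {0, 1}}
```

Any two reads that share a substring of at least `w + k - 1` bases select at least one common minimizer, which is why a table like this can be used to find candidate overlaps without comparing all pairs of reads.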
Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data
BACKGROUND: Massively parallel transcriptome sequencing (RNA-Seq) is becoming the method of choice for studying functional effects of genetic variability and establishing causal relationships between genetic variants and disease. However, RNA-Seq poses new technical and computational challenges compared to genome sequencing. In particular, mapping transcriptome reads onto the genome is more challenging than mapping genomic reads due to splicing. Furthermore, detection and genotyping of single nucleotide variants (SNVs) requires statistical models that are robust to variability in read coverage due to unequal transcript expression levels. RESULTS: In this paper we present a strategy to more reliably map transcriptome reads by taking advantage of the availability of both the genome reference sequence and transcript databases such as CCDS. We also present a novel Bayesian model for SNV discovery and genotyping based on quality scores. CONCLUSIONS: Experimental results on RNA-Seq data generated from blood cell tissue of three HapMap individuals show that our methods yield increased accuracy compared to several widely used methods. The open source code implementing our methods, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/NGSTools/
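To make the idea of quality-score-based Bayesian genotyping concrete, here is a minimal sketch of a textbook diploid genotype posterior computed from base calls and Phred qualities. It is not the model from the paper: the prior values, the uniform split of errors across the three other bases, and the function name `genotype_posteriors` are all assumptions made for illustration.

```python
import math


def genotype_posteriors(bases, quals, ref, alt, het_prior=0.001):
    """Posterior over the genotypes ref/ref, ref/alt and alt/alt at one site.

    bases: observed base calls covering the site
    quals: Phred-scaled base qualities, one per base call
    Assumes ref != alt and independent reads (illustrative simplifications).
    """
    priors = {
        ref + ref: 1.0 - 1.5 * het_prior,
        ref + alt: het_prior,
        alt + alt: het_prior / 2.0,
    }
    log_post = {}
    for genotype, prior in priors.items():
        total = math.log(prior)
        for base, qual in zip(bases, quals):
            # Phred quality to error probability, capped so no probability is zero.
            err = min(10 ** (-qual / 10), 0.75)
            # Each read samples one of the two alleles with probability 1/2.
            p = sum(0.5 * ((1 - err) if base == allele else err / 3)
                    for allele in genotype)
            total += math.log(p)
        log_post[genotype] = total
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}


het = genotype_posteriors("AAAAGGGG", [30] * 8, ref="A", alt="G")
hom = genotype_posteriors("AAAAAAAA", [30] * 8, ref="A", alt="G")
```

With four reads supporting each allele, the heterozygous genotype receives the largest posterior; with eight reference reads, the homozygous reference genotype does. In RNA-Seq data the per-site coverage varies with expression level, so the number of terms in the likelihood differs widely between sites.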
Linkage disequilibrium based genotype calling from low-coverage shotgun sequencing reads
Background: Recent technology advances have enabled sequencing of individual genomes, promising to revolutionize biomedical research. However, deep sequencing remains more expensive than microarrays for performing whole-genome SNP genotyping. Results: In this paper we introduce a new multi-locus statistical model and computationally efficient genotype calling algorithms that integrate shotgun sequencing data with linkage disequilibrium (LD) information extracted from reference population panels such as HapMap or the 1000 Genomes Project. Experiments on publicly available 454, Illumina, and ABI SOLiD sequencing datasets suggest that integration of LD information results in genotype calling accuracy from low-coverage sequencing data comparable to that of microarray platforms. A software package implementing our algorithm, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/GeneSeq/. Conclusions: Integration of LD information leads to significant improvements in genotype calling accuracy compared to prior LD-oblivious methods, rendering low-coverage sequencing a viable alternative to microarrays for conducting large-scale genome-wide association studies
Improved linkage analysis of Quantitative Trait Loci using bulk segregants unveils a novel determinant of high ethanol tolerance in yeast
Background: Bulk segregant analysis (BSA) coupled to high-throughput sequencing is a powerful method to map genomic regions related to phenotypes of interest. It relies on crossing two parents, one inferior and one superior for a trait of interest. Segregants displaying the trait of the superior parent are pooled, and their DNA is extracted and sequenced. Genomic regions linked to the trait of interest are identified by searching the pool for overrepresented alleles that normally originate from the superior parent. BSA data analysis is non-trivial due to sequencing, alignment and screening errors.
Results: To increase the power of the BSA technology and obtain a better distinction between spuriously and truly linked regions, we developed EXPLoRA (EXtraction of over-rePresented aLleles in BSA), an algorithm for BSA data analysis that explicitly models the dependency between neighboring marker sites by exploiting the properties of linkage disequilibrium through a Hidden Markov Model (HMM). Reanalyzing a BSA dataset for high ethanol tolerance in yeast allowed the reliable identification of QTLs linked to this phenotype that could not be identified with statistical significance in the original study. Experimental validation of one of the least pronounced linked regions, by identifying its causative gene VPS70, confirmed the potential of our method.
Conclusions: EXPLoRA performs at least as well as the state of the art and is robust even at low signal-to-noise ratios, i.e., when the true linkage signal is diluted by sampling or screening errors, or when few segregants are available
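The benefit of modeling the dependency between neighboring markers can be illustrated with a toy two-state HMM decoded with the Viterbi algorithm. This is a simplified sketch, not the EXPLoRA model: the binomial emissions, the fixed allele frequencies of 0.5 and 0.9, the `stay` probability, and the function names are assumptions chosen only for illustration.

```python
import math


def log_binom(k: int, n: int, p: float) -> float:
    """Log-probability of k successes in n trials with success probability p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def viterbi_linked(counts, p_unlinked=0.5, p_linked=0.9, stay=0.95):
    """Label each marker 0 (unlinked) or 1 (linked) with a two-state HMM.

    counts: (reads carrying the superior-parent allele, total reads) per marker,
            in genomic order.
    """
    freqs = (p_unlinked, p_linked)
    log_trans = [[math.log(stay), math.log(1 - stay)],
                 [math.log(1 - stay), math.log(stay)]]
    k0, n0 = counts[0]
    score = [math.log(0.5) + log_binom(k0, n0, f) for f in freqs]
    back = []
    for k, n in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            # Best previous state for arriving in state s at this marker.
            prev = max((0, 1), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s]
                             + log_binom(k, n, freqs[s]))
        score = new_score
        back.append(pointers)
    # Trace back the most probable state path.
    state = max((0, 1), key=lambda r: score[r])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


region = [(10, 20)] * 5 + [(18, 20)] * 5 + [(10, 20)] * 5
spike = [(10, 20)] * 3 + [(16, 20)] + [(10, 20)] * 3
region_path = viterbi_linked(region)
spike_path = viterbi_linked(spike)
```

In this toy setting a run of five markers with a high superior-allele frequency is labeled as linked, while a single isolated marker with an elevated frequency is not, because the cost of switching state twice outweighs the evidence from one marker. That is the intuition behind using neighboring sites to separate true from spurious linkage.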
Prototypes for daylighting, water heating and gas supply made from recycled material for low-cost housing
Research work. A document is produced describing the technical procedures and materials
needed to build three prototypes for daylighting,
water heating and gas supply from recycled materials. Undergraduate. Ingeniero Civi
QTL mapping for pod quality and yield traits in snap bean (Phaseolus vulgaris L.).
Pod quality and yield traits in snap bean (Phaseolus vulgaris L.) influence consumer preferences, crop adoption by farmers, and the ability of the product to be commercially competitive locally and globally. The objective of the study was to identify the quantitative trait loci (QTL) for pod quality and yield traits in a snap × dry bean recombinant inbred line (RIL) population. A total of 184 F6 RILs derived from a cross between Vanilla (snap bean) and MCM5001 (dry bean) were grown in three field sites in Kenya and one greenhouse environment in Davis, CA, USA. They were genotyped at 5,951 single nucleotide polymorphisms (SNPs), and composite interval mapping was conducted to identify QTL for 16 pod quality and yield traits, including pod wall fiber, pod string, pod size, and harvest metrics. A combined total of 44 QTL were identified in field and greenhouse trials. The QTL for pod quality were identified on chromosomes Pv01, Pv02, Pv03, Pv04, Pv06, and Pv07, and for pod yield were identified on Pv08. Co-localization of QTL was observed for pod quality and yield traits. Some identified QTL overlapped with previously mapped QTL for pod quality and yield traits, with several others identified as novel. The identified QTL can be used in future marker-assisted selection in snap bean
Combining image analysis, genome wide association studies and different field trials to reveal stable genetic regions related to panicle architecture and the number of spikelets per panicle in rice
Number of spikelets per panicle (NSP) is a key trait to increase yield potential in rice (O. sativa). The architecture of the rice inflorescence, which is mainly determined by the length and number of primary (PBL and PBN) and secondary (SBL and SBN) branches, can influence NSP. Although several genes controlling panicle architecture and NSP in rice have been identified, there is little evidence of (i) the genetic control of panicle architecture and NSP in different environments and (ii) the presence of stable genetic associations with panicle architecture across environments. This study combines image phenotyping of 225 accessions belonging to a genetic diversity array of indica rice grown under irrigated field conditions in two different environments and Genome Wide Association Studies (GWAS) based on the genotyping of the diversity panel, providing 83,374 SNPs. Accessions sown under direct seeding in one environment had reduced Panicle Length (PL), NSP, PBN, PBL, SBN, and SBL compared to those established under transplanting in the second environment. Across environments, NSP was significantly and positively correlated with PBN, SBN and PBL. However, the length of branches (PBL and SBL) was not significantly correlated with variables related to number of branches (PBN and SBN), suggesting independent genetic control. Twenty-three GWAS sites were detected with P ≤ 1.0E-04 and 27 GWAS sites with P ≤ 5.9E-04. We found 17 GWAS sites related to NSP, 10 for PBN and 11 for SBN, 7 for PBL and 11 for SBL. This study revealed new regions related to NSP, but only three associations were related to both branching number (PBN and SBN) and NSP. Two GWAS sites associated with SBL and SBN were stable across contrasting environments and were not related to genes previously reported. The new regions reported in this study can help improve NSP in rice for both direct seeded and transplanted conditions.
The integrated approach of high-throughput phenotyping, multi-environment field trials and GWAS has the potential to dissect complex traits, such as NSP, into less complex traits and to match single nucleotide polymorphisms with relevant function under different environments, offering a potential use for molecular breeding
Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques
Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics
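One common way to compare candidate haplotypes in Single Individual Haplotyping is the minimum error correction (MEC) score: the number of allele calls that would have to be flipped so that every fragment agrees with one of the two complementary haplotypes. The sketch below only evaluates that score for a given haplotype. It is an illustration of the objective, not the ReFHap algorithm or the evaluation procedure used in the study, and the data layout and the name `mec_score` are assumptions.

```python
def mec_score(fragments, haplotype):
    """Minimum error correction score of a haplotype against a set of fragments.

    fragments: list of dicts mapping heterozygous SNP index -> allele call (0 or 1)
    haplotype: list of 0/1 alleles; the second haplotype is its complement
    """
    total = 0
    for fragment in fragments:
        mismatches = sum(1 for pos, allele in fragment.items()
                         if haplotype[pos] != allele)
        # Each fragment is assigned to whichever haplotype it disagrees with least;
        # disagreements with the complement equal len(fragment) - mismatches.
        total += min(mismatches, len(fragment) - mismatches)
    return total


fragments = [
    {0: 0, 1: 1, 2: 0},  # consistent with the haplotype below
    {1: 0, 2: 1, 3: 1},  # consistent with its complement
    {2: 0, 3: 0, 4: 0},  # one conflicting call at SNP 4
]
haplotype = [0, 1, 0, 0, 1]
score = mec_score(fragments, haplotype)
```

A haplotype and its complement always receive the same score, so only the relative phase between SNPs matters. Long haploid fragments such as 40-kb fosmids span many heterozygous SNPs each, which gives the phasing problem more constraints per fragment.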