Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Nanopore sequencing technology has the potential to render other sequencing
technologies obsolete with its ability to generate long reads and provide
portability. However, high error rates of the technology pose a challenge while
generating accurate genome assemblies. The tools used for nanopore sequence
analysis are of critical importance, as they must overcome the high error
rates of the technology. Our goal in this work is to comprehensively analyze
current publicly available tools for nanopore sequence analysis to understand
their advantages, disadvantages, and performance bottlenecks. It is important
to understand where the current tools do not perform well to develop better
tools. To this end, we 1) analyze the multiple steps and the associated tools
in the genome assembly pipeline using nanopore sequence data, and 2) provide
guidelines for determining the appropriate tools for each step. We analyze
various combinations of different tools and expose the tradeoffs between
accuracy, performance, memory usage and scalability. We conclude that our
observations can guide researchers and practitioners in making conscious and
effective choices for each step of the genome assembly pipeline using nanopore
sequence data. The bottlenecks we have identified can also help developers
improve the current tools or build new ones that are both accurate and fast, in
order to overcome the high error rates of the nanopore sequencing technology.
Comment: To appear in Briefings in Bioinformatics (BIB), 201
Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence
BACKGROUND: Previous studies have suggested that recent segmental duplications, which are often involved in chromosome rearrangements underlying genomic disease, account for some 5% of the human genome. We have developed rapid computational heuristics based on BLAST analysis to detect segmental duplications, as well as regions containing potential sequence misassignments, in the human genome assemblies. RESULTS: Our analysis of the June 2002 public human genome assembly revealed that 107.4 of 3,043.1 megabases (Mb) (3.53%) of sequence contained segmental duplications, each at least 5 kb in size and with at least 90% identity. We have also detected that 38.9 Mb (1.28%) of sequence within this assembly is likely to be involved in sequence misassignment errors. Furthermore, we have identified a significant subset (199,965 of 2,327,473, or 8.6%) of single-nucleotide polymorphisms (SNPs) in the public databases that are not true SNPs but are potential paralogous sequence variants. CONCLUSION: Using two distinct computational approaches, we have identified most of the sequences in the human genome that have undergone recent segmental duplications. Near-identical segmental duplications present a major challenge to the completion of the human genome sequence. Potential sequence misassignments detected in this study will require additional efforts to resolve.
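As a hedged illustration of this kind of BLAST-based screen (not the authors' actual heuristics), the filter below keeps alignments from a tabular BLAST self-comparison (-outfmt 6) that are at least 5 kb long and at least 90% identical, skipping the trivial hit of a region against itself; the column layout follows the default outfmt 6 fields.

    MIN_LEN = 5000      # >= 5 kb
    MIN_IDENT = 90.0    # >= 90% identity

    def segmental_duplication_hits(blast_tab_path):
        """Yield candidate segmental-duplication alignments from BLAST -outfmt 6."""
        with open(blast_tab_path) as fh:
            for line in fh:
                f = line.rstrip("\n").split("\t")
                query, subject = f[0], f[1]
                ident, aln_len = float(f[2]), int(f[3])
                qstart, qend, sstart, send = map(int, f[6:10])
                if query == subject and (qstart, qend) == (sstart, send):
                    continue  # trivial self-alignment of a region to itself
                if aln_len >= MIN_LEN and ident >= MIN_IDENT:
                    yield query, subject, ident, aln_len

    # Usage: for hit in segmental_duplication_hits("self_vs_self.blast"): print(hit)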
The Sugarcane genome sequencing effort: An overview of the strategy, goals and existing data : [Abstract W538]
Sugarcane is a major feedstock used for biofuel production worldwide. Sugarcane cultivars (Saccharum spp) are derived from interspecific hybridization obtained a century ago by crossing Saccharum officinarum (2n=8x=80) and S. spontaneum (2n=5x=40 to 2n=16x=128). The challenge in a sugarcane genome sequencing project is the size (10 Gb) and complexity of its genome structure, which is highly polyploid and aneuploid (2n= ca 110 to 120) with a complete set of homo(eo)logous genes predicted to range from 10 to 12 copies (alleles). An initial strategy is to capture much of the gene-rich, recombinationally active euchromatin. The Sugarcane Genome Sequencing Initiative (SUGESI) was envisaged to join efforts to produce a reference sequence of one sugarcane cultivar using a combination of approaches, including BAC sequencing and whole-genome shotgun approaches. Cultivar R570 was chosen since it is the most intensively characterized to date. We expect that around 4-5 thousand BAC sequences can cover the monoploid euchromatic genome of this cultivar. BAC selection is underway using overgo and EST hybridization data. A next step is to sequence cultivars of interest to breeding programs. Under the shotgun approach, gene-rich regions are being targeted for genotypes that are parents of mapping populations. This should allow the identification of very large numbers of polymorphic markers that are expected to assist genome assembly. Pilot experiments are underway to define the best technologies for gene-rich region or promoter identification. A database is under construction (http://sugarcanegenome.org). The initiative is led by researchers in Australia, Brazil, China, France, South Africa and the United States.
novoBreak: local assembly for breakpoint detection in cancer genomes.
We present novoBreak, a genome-wide local assembly algorithm that discovers somatic and germline structural variation breakpoints in whole-genome sequencing data. novoBreak consistently outperformed existing algorithms on real cancer genome data and on synthetic tumors in the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling Challenge, primarily because it more effectively utilized reads spanning breakpoints. novoBreak also demonstrated high sensitivity in identifying short insertions and deletions.
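A rough, hypothetical sketch of the underlying idea of exploiting breakpoint-spanning reads (this is not novoBreak's implementation): tumor reads that share k-mers absent from the matched normal sample are grouped together, and each group then becomes the input for a local assembly around one candidate breakpoint. The k-mer size and data structures are illustrative assumptions.

    from collections import defaultdict

    K = 31  # illustrative k-mer size; the real tool's parameters may differ

    def kmers(seq, k=K):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def group_reads_by_novel_kmers(tumor_reads, normal_reads):
        """Group tumor reads that share k-mers not seen in the normal sample."""
        normal_kmers = set()
        for read in normal_reads:
            normal_kmers |= kmers(read)
        groups = defaultdict(list)
        for read in tumor_reads:
            for km in kmers(read) - normal_kmers:  # k-mers unique to the tumor
                groups[km].append(read)
        return groups  # each group feeds a local assembly of one candidate breakpoint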
A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified
single-cell genomes, and metagenomes has enabled investigation of a wide range
of organisms and ecosystems. However, sampling variation in short-read data
sets and high sequencing error rates of modern sequencers present many new
computational challenges in data interpretation. These challenges have led to
the development of new classes of mapping tools and {\em de novo} assemblers.
These algorithms are challenged by the continued improvement in sequencing
throughput. We here describe digital normalization, a single-pass computational
algorithm that systematizes coverage in shotgun sequencing data sets, thereby
decreasing sampling variation, discarding redundant data, and removing the
majority of errors. Digital normalization substantially reduces the size of
shotgun data sets and decreases the memory and time requirements for {\em de
novo} sequence assembly, all without significantly impacting content of the
generated contigs. We apply digital normalization to the assembly of microbial
genomic data, amplified single-cell genomic data, and transcriptomic data. Our
implementation is freely available for use and modification.
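A minimal sketch of the digital normalization idea described above, using a plain Python dictionary in place of the probabilistic k-mer counting the published implementation relies on; the k-mer size and coverage cutoff shown are illustrative assumptions, not the tool's defaults.

    from collections import Counter
    from statistics import median

    K = 20        # illustrative k-mer size
    CUTOFF = 20   # illustrative target coverage
    kmer_counts = Counter()

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def keep_read(seq):
        """Keep a read only if its estimated median k-mer coverage is below CUTOFF."""
        kms = kmers(seq)
        if not kms:
            return False
        if median(kmer_counts[km] for km in kms) >= CUTOFF:
            return False           # locus already well covered: read is redundant
        kmer_counts.update(kms)    # count k-mers only for reads that are kept
        return True

    # Usage: retained = [read for read in reads if keep_read(read)]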
SEED: efficient clustering of next-generation sequences.
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
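A rough sketch of the block spaced-seed bucketing idea behind this kind of clustering, not the actual SEED code: reads hashed to the same spaced-seed key become candidate members of one cluster, which the full algorithm would then refine against a virtual center sequence using the mismatch and overhang limits. The seed pattern shown is an arbitrary illustration.

    from collections import defaultdict

    SEED_PATTERN = "111010010100110111"  # illustrative spaced seed; '1' = sampled position

    def spaced_key(read, pattern=SEED_PATTERN):
        """Concatenate the bases at the sampled ('1') positions of the pattern."""
        if len(read) < len(pattern):
            return None
        return "".join(base for base, p in zip(read, pattern) if p == "1")

    def candidate_clusters(reads):
        buckets = defaultdict(list)
        for read in reads:
            key = spaced_key(read)
            if key is not None:
                buckets[key].append(read)
        return buckets  # each bucket is then checked against mismatch/overhang limits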
Functional analysis and transcriptional output of the Göttingen minipig genome
In the past decade the Göttingen minipig has gained increasing recognition as an animal model in pharmaceutical and safety research because it recapitulates many aspects of human physiology and metabolism. Genome-based comparison of drug targets, together with quantitative tissue expression analysis, allows rational prediction of pharmacology and cross-reactivity of human drugs in animal models, thereby reducing drug attrition, which is an important challenge in the process of drug development. Here we present a new chromosome-level version of the Göttingen minipig genome together with a comparative transcriptional analysis of tissues with pharmaceutical relevance as a basis for translational research. We relied on mapping and assembly of WGS (whole-genome shotgun sequencing) derived reads to the reference genome of the Duroc pig and predict 19,228 human orthologous protein-coding genes. Genome-based prediction of the sequence of human drug targets enables the prediction of drug cross-reactivity based on conservation of binding sites. We further support the finding that the genome of Sus scrofa contains about ten times fewer pseudogenized genes compared to other vertebrates. Among the functional human orthologs of these minipig pseudogenes we found HEPN1, a putative tumor suppressor gene. The genomes of Sus scrofa, the Tibetan boar, the African Bushpig, and the Warthog show sequence conservation of all inactivating HEPN1 mutations, suggesting disruption before the evolutionary split of these pig species. We identify 133 Sus scrofa specific, conserved long non-coding RNAs (lncRNAs) in the minipig genome and show that these transcripts are highly conserved in the African pigs and the Tibetan boar, suggesting functional significance. Using a new minipig-specific microarray we show high conservation of gene expression signatures in 13 tissues with biomedical relevance between humans and adult minipigs. We underline this relationship for minipig and human liver, where we demonstrate similar expression levels for most phase I drug-metabolizing enzymes. Higher expression levels and metabolic activities were found for FMO1, AKR/CRs and for phase II drug-metabolizing enzymes in minipig as compared to human. The variability of gene expression in equivalent human and minipig tissues is considerably higher in minipig organs, which is important for study design in case a human target belongs to this variable category in the minipig. The first analysis of gene expression in multiple tissues during development from young to adult shows that the majority of transcriptional programs are concluded four weeks after birth. This finding is in line with the advanced state of human postnatal organ development at comparable age categories and further supports the minipig as a model for pediatric drug safety studies. Genome-based assessment of sequence conservation combined with gene expression data in several tissues improves the translational value of the minipig for human drug development. The genome and gene expression data presented here are important resources for researchers using the minipig as a model for biomedical research or commercial breeding. The potential impact of our data on comparative genomics, translational research, and experimental medicine is discussed.
