What's in your next-generation sequence data? An exploration of unmapped DNA and RNA sequence reads from the bovine reference individual.
Background: Next-generation sequencing projects commonly commence by aligning reads to a reference genome assembly. While improvements in alignment algorithms and computational hardware have greatly enhanced the efficiency and accuracy of alignments, a significant percentage of reads often remain unmapped. Results: We generated de novo assemblies of unmapped reads from the DNA and RNA sequencing of the Bos taurus reference individual and identified the closest matching sequence to each contig by alignment to the NCBI non-redundant nucleotide database using BLAST. As expected, many of these contigs represent vertebrate sequence that is absent, incomplete, or misassembled in the UMD3.1 reference assembly. However, numerous additional contigs represent invertebrate species. Most prominent were several species of Spirurid nematodes and a blood-borne parasite, Babesia bigemina. These species are either not present in the US or are not known to infect taurine cattle, and the reference animal appears to have been host to unsequenced sister species. Conclusions: We demonstrate the importance of exploring unmapped reads to ascertain sequences that are either absent or misassembled in the reference assembly and for detecting sequences indicative of parasitic or commensal organisms.
Findings of the E2E NLG Challenge
This paper summarises the experimental setup and results of the first shared
task on end-to-end (E2E) natural language generation (NLG) in spoken dialogue
systems. Recent end-to-end generation systems are promising since they reduce
the need for data annotation. However, they are currently limited to small,
delexicalised datasets. The E2E NLG shared task aims to assess whether these
novel approaches can generate better-quality output by learning from a dataset
containing higher lexical richness, syntactic complexity and diverse discourse
phenomena. We compare 62 systems submitted by 17 institutions, covering a wide
range of approaches, including machine learning architectures -- with the
majority implementing sequence-to-sequence models (seq2seq) -- as well as
systems based on grammatical rules and templates.
Comment: Accepted to INLG 201
BamView: visualizing and interpretation of next-generation sequencing read alignments.
So-called next-generation sequencing (NGS) has provided the ability to sequence on a massive scale at low cost, enabling biologists to perform powerful experiments and gain insight into biological processes. BamView has been developed to visualize and analyse sequence reads from NGS platforms that have been aligned to a reference sequence. It is a desktop application for browsing the aligned or mapped reads [Ruffalo, M, LaFramboise, T, Koyutürk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 2011;27:2790-6] at different levels of magnification, from nucleotide level, where the base qualities can be seen, to genome or chromosome level, where overall coverage is shown. To enable in-depth investigation of NGS data, various views are provided that can be configured to highlight interesting aspects of the data. Multiple read alignment files can be overlaid to compare results from different experiments, and filters can be applied to facilitate the interpretation of the aligned reads. As well as being a standalone application, it can be used as an integrated part of the Artemis genome browser. BamView allows the user to study NGS data in the context of the sequence and annotation of the reference genome. Single nucleotide polymorphism (SNP) density and candidate SNP sites can be highlighted and investigated, and read-pair information can be used to discover large structural insertions and deletions. The application will also perform simple analyses of the read mapping, including reporting the read counts and reads per kilobase per million mapped reads (RPKM) for genes selected by the user.
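The RPKM value mentioned above follows the standard definition: read count scaled by gene length in kilobases and by total mapped reads in millions. The sketch below illustrates that arithmetic only; it is not BamView's code, and the function name and example numbers are hypothetical.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads (standard definition).

    This is an illustrative helper, not part of BamView.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) simplifies to reads * 1e9 / (length * total)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp gene in a library
# with 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
```

Normalising by both gene length and library size is what makes values comparable across genes of different lengths and across experiments of different sequencing depth.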
Clustering and Alignment of Polymorphic Sequences for HLA-DRB1 Genotyping
Located on Chromosome 6p21, classical human leukocyte antigen genes are highly polymorphic. HLA alleles associate with a variety of phenotypes, such as narcolepsy and autoimmunity, as well as immunologic response to infectious disease. Moreover, high resolution genotyping of these loci is critical to achieving long-term survival of allogeneic transplants. Development of methods to obtain high resolution analysis of HLA genotypes will lead to improved understanding of how select alleles contribute to human health and disease risk. Genomic DNAs were obtained from a cohort of n = 383 subjects recruited as part of an Ulcerative Colitis study and analyzed for HLA-DRB1. HLA genotypes were determined using sequence-specific oligonucleotide probes and by next-generation sequencing using the Roche/454 GSFLX instrument. The Clustering and Alignment of Polymorphic Sequences (CAPSeq) software application was developed to analyze next-generation sequencing data. The application generates HLA sequence-specific 6-digit genotype information from next-generation sequencing data, using MUMmer to align sequences and the R package diffusionMap to classify sequences into their respective allelic groups. The incorporation of Bootstrap Aggregating (Bagging) to aid in sorting sequences into allele classes resulted in improved genotyping accuracy. Using 60 Bagging iterations, the genotyping results obtained with CAPSeq showed high concordance with the 4-digit genotypes characterized by sequence-specific oligonucleotide probes, matching at 759 out of 766 (99.1%) alleles. © 2013 Ringquist et al
Genetic Sequence Matching Using D4M Big Data Approaches
Recent technological advances in Next Generation Sequencing tools have led to
increasing speeds of DNA sample collection, preparation, and sequencing. One
instrument can produce over 600 Gb of genetic sequence data in a single run.
This creates new opportunities to efficiently handle the increasing workload.
We propose a new method of fast genetic sequence analysis using the Dynamic
Distributed Dimensional Data Model (D4M) - an associative array environment for
MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and
statistical properties, the method leverages big data techniques and the
implementation of an Apache Accumulo database to accelerate computations
one-hundredfold over other methods. Comparisons of the D4M method with the
current gold standard for sequence analysis, BLAST, show the two are comparable
in the alignments they find. This paper will present an overview of the D4M
genetic sequence algorithm and statistical comparisons with BLAST.
Comment: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC) 201
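The abstract above does not spell out the matching algorithm, so the following is only a rough sketch of one general way associative arrays can support sequence matching: index every fixed-length subsequence (k-mer) of each reference, then count how many k-mers a sample shares with each reference. Plain Python dictionaries stand in for associative arrays here; this is not the D4M API or an Accumulo client, and the function names, the choice of k, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> set:
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references: dict, k: int) -> dict:
    """Map each k-mer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def match_counts(sample: str, index: dict, k: int) -> dict:
    """Count distinct k-mers shared between the sample and each reference."""
    counts = defaultdict(int)
    for kmer in kmers(sample, k):
        for name in index.get(kmer, ()):
            counts[name] += 1
    return dict(counts)


# Toy data (hypothetical): the sample is a fragment of refA only.
references = {"refA": "ACGTACGTGA", "refB": "TTTTGGGGCC"}
index = build_index(references, k=4)
hits = match_counts("CGTACGTG", index, k=4)
```

References with many shared k-mers are candidate matches that a slower, alignment-based tool could then examine in detail.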
