225 research outputs found
Detection of microRNAs in color space
Motivation: Deep sequencing provides inexpensive opportunities to characterize the transcriptional diversity of known genomes. The AB SOLiD technology generates millions of short sequencing reads in color space; that is, the raw data is a sequence of colors, where each color represents 2 nt and each nucleotide is represented by two consecutive colors. This strategy is purported to have several advantages, including increased ability to distinguish sequencing errors from polymorphisms. Several programs have been developed to map short reads to genomes in color space. However, a number of previously unexplored technical issues arise when using SOLiD technology to characterize microRNAs. Results: Here we explore these technical difficulties. First, since the sequenced reads are longer than the biological sequences, every read is expected to contain linker fragments. The color-calling error rate increases toward the 3′ end of the read such that recognizing the linker sequence for removal becomes problematic. Second, mapping in color space may lead to the loss of the first nucleotide of each read. We propose a sequential trimming and mapping approach to map small RNAs. Using our strategy, we reanalyze three published insect small RNA deep sequencing datasets and characterize 22 new microRNAs. Availability and implementation: A bash shell script to perform the sequential trimming and mapping procedure, called SeqTrimMap, is available at: http://www.mirbase.org/tools/seqtrimmap/ Supplementary information: Supplementary data are available at Bioinformatics online.
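The color-space representation described in this abstract can be illustrated with a minimal Python sketch. It assumes the standard SOLiD di-base scheme, in which each color is the XOR of the 2-bit codes of two adjacent bases; the abstract itself does not give the color table, and the function names here are hypothetical, not part of SeqTrimMap.

```python
# 2-bit base codes under which a di-base color equals the XOR of two bases
# (assumed standard SOLiD scheme; not taken from the abstract).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return (first_base, colors) with one color per adjacent base pair."""
    bits = [BASE_TO_BITS[b] for b in seq]
    return seq[0], [a ^ b for a, b in zip(bits, bits[1:])]


def decode_colors(first_base, colors):
    """Rebuild the nucleotide sequence from a known first base and its colors."""
    bits = [BASE_TO_BITS[first_base]]
    for color in colors:
        # Each color, combined with the previous base, determines the next base.
        bits.append(bits[-1] ^ color)
    return "".join(BITS_TO_BASE[b] for b in bits)


first, colors = encode_colors("ATGGC")
print(first, colors)
print(decode_colors(first, colors))
print(decode_colors("C", colors))  # same colors, different anchor base
```

Decoding the same colors from a different first base yields a different sequence, which is one way to see why losing the first nucleotide during color-space mapping matters.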
Sensitive Long-Indel-Aware Alignment of Sequencing Reads
The tremendous advances in high-throughput sequencing technologies have made
population-scale sequencing as performed in the 1000 Genomes project and the
Genome of the Netherlands project possible. Next-generation sequencing has
allowed genome-wide discovery of variations beyond single-nucleotide
polymorphisms (SNPs), in particular of structural variations (SVs) like
deletions, insertions, duplications, translocations, inversions, and even more
complex rearrangements. Here, we design a read aligner with special emphasis on
the following properties: (1) high sensitivity, i.e. find all (reasonable)
alignments; (2) ability to find (long) indels; (3) statistically sound
alignment scores; and (4) runtime fast enough to be applied to whole genome
data. We compare performance to BWA, bowtie2, and stampy and find that our method
is especially advantageous on reads containing larger indels.
Approximate Two-Party Privacy-Preserving String Matching with Linear Complexity
Consider two parties who want to compare their strings, e.g., genomes, but do
not want to reveal them to each other. We present a system for
privacy-preserving matching of strings, which differs from existing systems by
providing a deterministic approximation instead of an exact distance. It is
efficient (linear complexity), non-interactive, and does not involve a third
party, which makes it particularly suitable for cloud computing. We extend our
protocol such that it mitigates the iterated differential attacks proposed by
Goodrich. Further, an implementation of the system is evaluated and compared
against current privacy-preserving string matching algorithms. Comment: 6 pages, 4 figures
Languages of lossless seeds
Several algorithms for similarity search employ seeding techniques to quickly
discard very dissimilar regions. In this paper, we study theoretical properties
of lossless seeds, i.e., spaced seeds having full sensitivity. We prove that
lossless seeds coincide with languages of certain sofic subshifts, hence they
can be recognized by finite automata. Moreover, we show that these subshifts
are fully given by the number of allowed errors k and the seed margin l. We
also show that for a fixed k, optimal seeds must asymptotically satisfy l ~
m^(k/(k+1)). Comment: In Proceedings AFL 2014, arXiv:1405.527
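To make the notion of a lossless seed concrete, here is a brute-force Python sketch. It assumes the usual definition: a seed written with `#` (must match) and `-` (don't care) is lossless for read length m and k mismatches if every placement of k mismatches still leaves at least one seed offset whose `#` positions are all matches. The function names are hypothetical, the seed margin l from the abstract is not modelled explicitly, and the automata-based characterization proved in the paper is not reproduced here.

```python
from itertools import combinations


def seed_hits(seed, mismatches, m):
    """True if some offset places every '#' of the seed on a matching position."""
    required = [i for i, ch in enumerate(seed) if ch == "#"]
    for offset in range(m - len(seed) + 1):
        if all(offset + i not in mismatches for i in required):
            return True
    return False


def is_lossless(seed, m, k):
    """Exhaustively test every set of exactly k mismatch positions.

    Checking exactly k is enough: removing a mismatch can only help the seed.
    """
    if not 0 <= k <= m:
        raise ValueError("k must be between 0 and m")
    return all(
        seed_hits(seed, set(positions), m)
        for positions in combinations(range(m), k)
    )


print(is_lossless("###", 6, 1))
print(is_lossless("###", 5, 1))
print(is_lossless("#-#", 4, 1))
```

The exhaustive check grows combinatorially with m and k, so it is only useful for small examples.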
A case report of congenital myasthenic syndrome caused by a mutation in the CHRNE gene in the Iranian population
Congenital myasthenic syndrome (CMS) refers to a heterogeneous group of inherited disorders, characterized by defective transmission at the neuromuscular junction (NMJ). Patients with CMS show similar muscle weakness, while other clinical manifestations are mostly dependent on genetic factors. This disease, caused by different DNA mutations, is genetically inherited. It is also associated with mutations of genes at the NMJ, involving the acetylcholine receptor (AChR) subunits. Here, we present the case of a five-year-old Iranian boy with CMS, who underwent targeted sequencing of a panel of genes associated with arthrogryposis and CMS. The patient had six affected relatives in his genetic pedigree chart. The investigations indicated a homozygous single base pair deletion at exon 12 of the CHRNE gene (chr17:4802186delC). This region was conserved across mammalian evolution and was not submitted to the 1000 Genomes Project database. Overall, the CHRNE variant may be classified as a significant variant in the etiology of CMS. It can be suggested that the Iranian CMS population carries regional pathogenic mutations, which can be detected via targeted and whole genome sequencing.
SEAL: a distributed short read mapping and duplicate removal tool
Summary: SEAL is a scalable tool for short read pair mapping and duplicate removal. It computes mappings that are consistent with those produced by BWA and removes duplicates according to the same criteria employed by Picard MarkDuplicates. On a 16-node Hadoop cluster, it is capable of processing about 13 GB per hour in map+rmdup mode, while reaching a throughput of 19 GB per hour in mapping-only mode
Analysis of quality raw data of second generation sequencers with Quality Assessment Software
Background: Second generation technologies have advantages over Sanger; however, they have resulted in new challenges for the genome construction process, especially because of the small size of the reads, despite the high degree of coverage. Independent of the program chosen for the construction process, DNA sequences are superimposed, based on identity, to extend the reads, generating contigs; mismatches indicate a lack of homology and are not included. This process improves our confidence in the sequences that are generated. Findings: We developed Quality Assessment Software, with which one can review graphs showing the distribution of quality values from the sequencing reads. This software allows us to adopt more stringent quality standards for sequence data, based on quality-graph analysis and estimated coverage after applying the quality filter, providing acceptable sequence coverage for genome construction from short reads. Conclusions: Quality filtering is a fundamental step in the process of constructing genomes, as it reduces the frequency of incorrect alignments that are caused by measuring errors, which can occur during the construction process due to the size of the reads, provoking misassemblies. Application of quality filters to sequence data, using the software Quality Assessment, along with graphing analyses, provided greater precision in the definition of cutoff parameters, which increased the accuracy of genome construction.
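The idea of filtering reads by quality and then re-estimating coverage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: quality strings are taken to be Phred+33 encoded, reads are filtered on mean quality, and coverage is estimated as total retained bases divided by genome size. None of these choices, nor the function names, are taken from the Quality Assessment Software itself.

```python
def phred_scores(quality_string, offset=33):
    """Convert an ASCII quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def filter_reads(reads, min_mean_quality):
    """Keep (sequence, quality_string) pairs whose mean quality meets the cutoff."""
    kept = []
    for seq, qual in reads:
        scores = phred_scores(qual)
        if scores and sum(scores) / len(scores) >= min_mean_quality:
            kept.append((seq, qual))
    return kept


def estimated_coverage(reads, genome_size):
    """Estimated coverage: total bases in the reads divided by genome size."""
    return sum(len(seq) for seq, _ in reads) / genome_size


# Toy data: one high-quality and one low-quality read, hypothetical 8 bp genome.
reads = [("ACGT", "IIII"), ("ACGT", "!!!!")]
kept = filter_reads(reads, min_mean_quality=20)
print(estimated_coverage(reads, 8), estimated_coverage(kept, 8))
```

Comparing coverage before and after filtering is one way to judge whether a cutoff leaves enough data for assembly.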
Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples
Motivation: Whole-genome high-coverage sequencing has been widely used for
personal and cancer genomics as well as in various research areas. However, in
the lack of an unbiased whole-genome truth set, the global error rate of
variant calls and the leading causal artifacts still remain unclear even given
the great efforts in the evaluation of variant calling methods.
Results: We made ten SNP and INDEL call sets with two read mappers and five
variant callers, both on a haploid human genome and a diploid genome at a
similar coverage. By investigating false heterozygous calls in the haploid
genome, we identified the erroneous realignment in low-complexity regions and
the incomplete reference genome with respect to the sample as the two major
sources of errors, which press for continued improvements in these two areas.
We estimated that the error rate of raw genotype calls is as high as 1 in
10-15kb, but the error rate of post-filtered calls is reduced to 1 in 100-200kb
without significant compromise on the sensitivity.
Availability: BWA-MEM alignment: http://bit.ly/1g8XqRt; Scripts:
https://github.com/lh3/varcmp; Additional data:
https://figshare.com/articles/Towards_better_understanding_of_artifacts_in_variating_calling_from_high_coverage_samples/981073 Comment: Published version
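As a rough worked example of the error rates quoted in this abstract, the Python sketch below converts "1 error in N kb" into an expected number of erroneous calls genome-wide. The genome size of 3 billion bp is an assumption introduced here for illustration; the abstract does not state one, and the function name is hypothetical.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed round figure for a human genome


def expected_errors(genome_size_bp, kb_per_error):
    """Expected number of erroneous calls at a rate of one error per kb_per_error kb."""
    return genome_size_bp / (kb_per_error * 1000)


# Raw calls: 1 error in 10-15 kb; post-filtered calls: 1 error in 100-200 kb.
for label, kb in [("raw, 1 in 10 kb", 10), ("raw, 1 in 15 kb", 15),
                  ("filtered, 1 in 100 kb", 100), ("filtered, 1 in 200 kb", 200)]:
    print(label, expected_errors(GENOME_SIZE_BP, kb))
```

Under that assumed genome size, filtering reduces the expected number of erroneous calls by roughly an order of magnitude.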