301 research outputs found
Highly Scalable Algorithms for Robust String Barcoding
String barcoding is a recently introduced technique for genomic-based
identification of microorganisms. In this paper we describe the engineering of
highly scalable algorithms for robust string barcoding. Our methods enable
distinguisher selection based on whole genomic sequences of hundreds of
microorganisms of up to bacterial size on a well-equipped workstation, and can
be easily parallelized to further extend the applicability range to thousands
of bacterial size genomes. Experimental results on both randomly generated and
NCBI genomic data show that whole-genome based selection results in a number of
distinguishers nearly matching the information theoretic lower bounds for the
problem
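The selection problem described in this abstract can be illustrated with a small sketch. It assumes a generic greedy set-cover heuristic over fixed-length substrings, which is not necessarily the algorithm engineered in the paper; the function names and the parameter `k` are hypothetical. A substring "separates" two genomes when it occurs in exactly one of them, and at least ceil(log2(n)) distinguishers are needed to give n genomes distinct presence/absence barcodes.

```python
from itertools import combinations
from math import ceil, log2


def candidate_substrings(genomes, k):
    """Return every length-k substring that occurs in at least one genome."""
    return {g[i:i + k] for g in genomes for i in range(len(g) - k + 1)}


def greedy_distinguishers(genomes, k):
    """Greedily pick substrings until every pair of genomes is separated.

    Illustrative heuristic only (an assumption, not the paper's method).
    """
    unseparated = set(combinations(range(len(genomes)), 2))
    candidates = sorted(candidate_substrings(genomes, k))
    chosen = []
    while unseparated:
        def gain(s):
            # Number of still-unseparated pairs that s would separate.
            return sum((s in genomes[a]) != (s in genomes[b])
                       for a, b in unseparated)

        best = max(candidates, key=gain)
        if gain(best) == 0:
            raise ValueError("remaining pairs cannot be separated with this k")
        chosen.append(best)
        unseparated = {(a, b) for a, b in unseparated
                       if (best in genomes[a]) == (best in genomes[b])}
    return chosen


def barcode(genome, distinguishers):
    """Presence/absence vector of the chosen distinguishers in one genome."""
    return tuple(int(d in genome) for d in distinguishers)


def lower_bound(n):
    """Information-theoretic minimum number of binary distinguishers."""
    return ceil(log2(n))


toy_genomes = ["ACGTAC", "ACGGTT", "TTGACA", "GGTACC"]  # made-up data
picked = greedy_distinguishers(toy_genomes, k=2)
print(picked, [barcode(g, picked) for g in toy_genomes], lower_bound(4))
```

On whole genomes the candidate set is far too large for this naive scan, which is presumably where the engineering effort described in the abstract goes.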
Robust and scalable barcoding for massively parallel long‑read sequencing
Nucleic-acid barcoding is an enabling technique for many applications, but its use remains limited
in emerging long-read sequencing technologies with intrinsically low raw accuracy. Here, we apply
so-called NS-watermark barcodes, whose error correction capability was previously validated
in silico, in a proof of concept where we synthesize 3840 NS-watermark barcodes and use them
to asymmetrically tag and simultaneously sequence amplicons from two evolutionarily distant
species (namely Bordetella pertussis and Drosophila mojavensis) on the ONT MinION platform. To our
knowledge, this is the largest number of distinct, non-random tags ever sequenced in parallel and the
first report of microarray-based synthesis as a source for large oligonucleotide pools for barcoding.
We recovered the identity of more than 86% of the barcodes, with a crosstalk rate of 0.17% (i.e., one
misassignment every 584 reads). This falls in the range of the index hopping rate of established, high-accuracy Illumina sequencing, despite the increased number of tags and the relatively low accuracy
of both microarray-based synthesis and long-read sequencing. The robustness of NS-watermark
barcodes, together with their scalable design and compatibility with low-cost massive synthesis,
makes them promising for present and future sequencing applications requiring massive labeling, such
as long-read single-cell RNA-Seq.
Affiliation: Ezpeleta, Joaquín. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina.
Affiliation: Labari, Ignacio Garcia. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina.
Affiliation: Bulacio, Pilar. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina.
Affiliation: Tapia, Elizabeth. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina.
Affiliation: Ezpeleta, Joaquín. Universidad Nacional de Rosario. Facultad de Ciencias Exactas, Ingeniería y Agrimensura; Argentina.
Affiliation: Bulacio, Pilar. Universidad Nacional de Rosario. Facultad de Ciencias Exactas, Ingeniería y Agrimensura; Argentina.
Affiliation: Tapia, Elizabeth. Universidad Nacional de Rosario. Facultad de Ciencias Exactas, Ingeniería y Agrimensura; Argentina.
Affiliation: Villanova, Gabriela Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina.
Affiliation: Lavista Llanos, Sofía. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina.
Affiliation: Villanova, Gabriela Vanina. Universidad Nacional de Rosario. Facultad de Ciencias Bioquímicas y Farmacéuticas. Laboratorio Mixto de Biotecnología Acuática. Centro Científico Tecnológico y Educativo Acuario del Río Paraná; Argentina.
Affiliation: Posner, Victoria María. Universidad Nacional de Rosario. Facultad de Ciencias Bioquímicas y Farmacéuticas. Laboratorio Mixto de Biotecnología Acuática. Centro Científico Tecnológico y Educativo Acuario del Río Paraná; Argentina.
Affiliation: Arranz, Silvia Eda. Universidad Nacional de Rosario. Facultad de Ciencias Bioquímicas y Farmacéuticas. Laboratorio Mixto de Biotecnología Acuática. Centro Científico Tecnológico y Educativo Acuario del Río Paraná; Argentina.
Affiliation: Krsticevic, Flavia. The Hebrew University of Jerusalem. Robert H Smith Faculty of Agriculture, Food and Environment; Israel
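The two crosstalk figures quoted in the abstract above (0.17% and one misassignment every 584 reads) are the same quantity expressed two ways. A minimal arithmetic check, with a hypothetical helper name:

```python
def crosstalk_rate(reads_per_misassignment: int) -> float:
    """Fraction of reads assigned to the wrong barcode."""
    return 1.0 / reads_per_misassignment


rate = crosstalk_rate(584)
print(f"{rate:.2%}")  # prints 0.17%
```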
An efficient and accurate framework for large-scale sequences of DNA barcodes
Integrated master's dissertation in Informatics Engineering.
DNA barcodes are short sequences of pre-defined gene regions that contain a sufficient
amount of intra- and inter-species genetic information. High-throughput sequencing techniques are currently used to identify large sequences of DNA barcodes in a species' genome in a relatively short time.
Domain experts require adequate self-contained tools to accurately and efficiently process
DNA barcode data in a reasonable time, taking advantage of current parallel and heterogeneous computing systems. They also expect to use these tools on different computing platforms, from laptops to high-performance servers, without requiring a broad knowledge in software engineering to develop efficient computational applications.
The main goal of this project was to develop a framework and associated user-friendly tools
for domain experts to efficiently support DNA barcoding studies, providing an abstraction
of the performance issues.
4SpecID is the key outcome of this work: a software application that integrates a
semi-automated auditing and annotation tool for reference libraries, to ensure the quality
standards of the compiled data, aiming to enable a grounded decision when identifying
species from DNA barcodes. Its graphical interface helps the end user specify the operations
and also simplifies data filtering and remote file handling.
The C++ version, ported from MATLAB, was fully tested and is more robust than
the original version. Architecture features common to laptops and compute servers were
exploited, namely parallel programming techniques and memory models.
The presented validation and performance results show significant improvements in
execution times, not only in the sequential version, but also by using the available parallel
capabilities of the underlying computing platforms.
DNA barcodes are short sequences of predefined gene regions that contain a sufficient
amount of intra- and inter-species genetic information.
High-throughput sequencing techniques are used to identify large
sequences of DNA barcodes in a species' genome.
However, adequate tools must be developed so that
domain experts can process DNA barcode data accurately and
within a feasible time, using existing parallel and heterogeneous computing systems. These tools are expected to be usable on
different computing platforms, from laptops to high-performance servers, without
requiring broad software engineering knowledge either to use them or to
build other tools with them.
The main goal of this project is to develop a framework that abstracts away
the potential performance challenges and gives domain experts
a computationally efficient way to carry out a DNA barcoding study.
This project developed a tool, 4SpecID, that aims to enable a grounded
decision when identifying species from DNA barcodes: a
semi-automated auditing and annotation tool for reference libraries that
ensures the quality standards of the compiled data.
This project also exploited the advantages of the most common compute server
and laptop architectures, such as parallel programming techniques and memory models. The
validation and performance results presented show that better
execution times can be obtained by using the available features of the underlying platforms.
High-Throughput SNP Genotyping by SBE/SBH
Despite much progress over the past decade, current Single Nucleotide
Polymorphism (SNP) genotyping technologies still offer an insufficient degree
of multiplexing when required to handle user-selected sets of SNPs. In this
paper we propose a new genotyping assay architecture combining multiplexed
solution-phase single-base extension (SBE) reactions with sequencing by
hybridization (SBH) using universal DNA arrays such as all k-mer arrays. In
addition to PCR amplification of genomic DNA, SNP genotyping using SBE/SBH
assays involves the following steps: (1) Synthesizing primers complementing the
genomic sequence immediately preceding SNPs of interest; (2) Hybridizing these
primers with the genomic DNA; (3) Extending each primer by a single base using
polymerase enzyme and dideoxynucleotides labeled with 4 different fluorescent
dyes; and finally (4) Hybridizing extended primers to a universal DNA array and
determining the identity of the bases that extend each primer by hybridization
pattern analysis. Our contributions include a study of multiplexing algorithms
for SBE/SBH genotyping assays and preliminary experimental results showing the
achievable tradeoffs between the number of array probes and primer length on
one hand and the number of SNPs that can be assayed simultaneously on the
other. Simulation results on datasets both randomly generated and extracted
from the NCBI dbSNP database suggest that the SBE/SBH architecture provides a
flexible and cost-effective alternative to genotyping assays currently used in
the industry, enabling genotyping of up to hundreds of thousands of
user-specified SNPs per assay.
Comment: 19 pages
DNA Barcoding in the Cycadales: Testing the Potential of Proposed Barcoding Markers for Species Identification of Cycads
Barcodes are short segments of DNA that can be used to uniquely identify an unknown specimen to species, particularly when diagnostic morphological features are absent. These sequences could offer a new forensic tool in plant and animal conservation—especially for endangered species such as members of the Cycadales. Ideally, barcodes could be used to positively identify illegally obtained material even in cases where diagnostic features have been purposefully removed or to release confiscated organisms into the proper breeding population. In order to be useful, a DNA barcode sequence must not only easily PCR amplify with universal or near-universal reaction conditions and primers, but also contain enough variation to generate unique identifiers at either the species or population levels. Chloroplast regions suggested by the Plant Working Group of the Consortium for the Barcode of Life (CBoL), and two alternatives, the chloroplast psbA-trnH intergenic spacer and the nuclear ribosomal internal transcribed spacer (nrITS), were tested for their utility in generating unique identifiers for members of the Cycadales. Ease of amplification and sequence generation with universal primers and reaction conditions was determined for each of the seven proposed markers. While none of the proposed markers provided unique identifiers for all species tested, nrITS showed the most promise in terms of variability, although sequencing difficulties remain a drawback. We suggest a workflow for DNA barcoding, including database generation and management, which will ultimately be necessary if we are to succeed in establishing a universal DNA barcode for plants
Inference of single-cell phylogenies from lineage tracing data using Cassiopeia.
The pairing of CRISPR/Cas9-based gene editing with massively parallel single-cell readouts now enables large-scale lineage tracing. However, the rapid growth in complexity of data from these assays has outpaced our ability to accurately infer phylogenetic relationships. First, we introduce Cassiopeia, a suite of scalable maximum parsimony approaches for tree reconstruction. Second, we provide a simulation framework for evaluating algorithms and exploring lineage tracer design principles. Finally, we generate the most complex experimental lineage tracing dataset to date, 34,557 human cells continuously traced over 15 generations, and use it for benchmarking phylogenetic inference approaches. We show that Cassiopeia outperforms traditional methods by several metrics and under a wide variety of parameter regimes, and provide insight into the principles for the design of improved Cas9-enabled recorders. Together, these should broadly enable large-scale mammalian lineage tracing efforts. Cassiopeia and its benchmarking resources are publicly available at www.github.com/YosefLab/Cassiopeia
DNA Barcode Sequence Identification Incorporating Taxonomic Hierarchy and within Taxon Variability
For DNA barcoding to succeed as a scientific endeavor, an accurate and expeditious query sequence identification method is needed. Although a global multiple-sequence alignment can be generated for some barcoding markers (e.g. COI, rbcL), not all barcoding markers are as structurally conserved (e.g. matK). Thus, algorithms that depend on global multiple-sequence alignments are not universally applicable. Some sequence identification methods that use local pairwise alignments (e.g. BLAST) are unable to accurately differentiate between highly similar sequences and are not designed to cope with hierarchic phylogenetic relationships or within-taxon variability. Here, I present a novel alignment-free sequence identification algorithm, BRONX, that accounts for observed within-taxon variability and hierarchic relationships among taxa. BRONX identifies short variable segments and corresponding invariant flanking regions in reference sequences. These flanking regions are used to score variable regions in the query sequence without the production of a global multiple-sequence alignment. By incorporating observed within-taxon variability into the scoring procedure, misidentifications arising from shared alleles/haplotypes are minimized. An explicit treatment of more inclusive terminals allows for separate identifications to be made for each taxonomic level and/or for user-defined terminals. BRONX performs better than all other methods when there is imperfect overlap between query and reference sequences (e.g. mini-barcode queries against a full-length barcode database). BRONX consistently produced better identifications at the genus level for all query types
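The flank-based scoring idea in this abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the BRONX implementation: the flank pairs are supplied by hand rather than discovered from the references, matching is exact, and every name (`extract_variable`, `score_taxa`, the toy sequences) is hypothetical. Each taxon keeps the set of variable-segment variants observed among its reference sequences, and flank pairs absent from a short query are skipped, so partially overlapping queries can still be scored.

```python
def extract_variable(seq, left, right):
    """Return the segment between two flanking strings, or None if absent."""
    i = seq.find(left)
    if i < 0:
        return None
    start = i + len(left)
    j = seq.find(right, start)
    if j < 0:
        return None
    return seq[start:j]


def score_taxa(query, references, flanks):
    """Count, per taxon, the flank pairs whose query segment was observed.

    references: dict mapping taxon name -> list of reference sequences.
    flanks: list of (left, right) invariant flanking strings (assumed given).
    """
    scores = {}
    for taxon, seqs in references.items():
        hits = 0
        for left, right in flanks:
            segment = extract_variable(query, left, right)
            if segment is None:
                continue  # query does not cover this region; skip it
            # Within-taxon variability: any variant seen in the taxon counts.
            observed = {extract_variable(s, left, right) for s in seqs}
            if segment in observed:
                hits += 1
        scores[taxon] = hits
    return scores


toy_references = {  # made-up sequences for illustration only
    "Genus a": ["ACGTCCTTGA", "ACGTCATTGA"],
    "Genus b": ["ACGTGGTTGA"],
}
toy_flanks = [("ACGT", "TTGA")]
print(score_taxa("GGACGTCATTGACC", toy_references, toy_flanks))
```

No alignment is built at any point; the query is only searched for the flanking strings.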