301 research outputs found

    Highly Scalable Algorithms for Robust String Barcoding

    String barcoding is a recently introduced technique for genomic-based identification of microorganisms. In this paper we describe the engineering of highly scalable algorithms for robust string barcoding. Our methods enable distinguisher selection based on whole genomic sequences of hundreds of microorganisms of up to bacterial size on a well-equipped workstation, and can be easily parallelized to extend the applicability range to thousands of bacterial-size genomes. Experimental results on both randomly generated and NCBI genomic data show that whole-genome-based selection yields a number of distinguishers nearly matching the information-theoretic lower bounds for the problem.
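To make "distinguisher selection" and the "information-theoretic lower bound" concrete, here is a minimal sketch. It assumes a plain greedy heuristic over fixed-length substrings, which is not necessarily the authors' algorithm, and it ignores the redundancy that makes a barcode robust. The function names and the toy genomes are hypothetical. The bound itself is standard: each distinguisher contributes one presence/absence bit, so n genomes need at least ceil(log2 n) distinguishers.

```python
from itertools import combinations
from math import ceil, log2


def candidate_substrings(genomes, k):
    """All length-k substrings occurring in at least one genome."""
    return {g[i:i + k] for g in genomes for i in range(len(g) - k + 1)}


def greedy_distinguishers(genomes, k):
    """Greedily pick substrings until every pair of genomes is separated.

    A substring separates a pair when it occurs in exactly one of the two.
    Illustrative heuristic only (assumption), not the paper's method.
    """
    unresolved = set(combinations(range(len(genomes)), 2))
    candidates = sorted(candidate_substrings(genomes, k))
    chosen = []

    def separated_by(s):
        return {(i, j) for (i, j) in unresolved
                if (s in genomes[i]) != (s in genomes[j])}

    while unresolved:
        best = max(candidates, key=lambda s: len(separated_by(s)))
        gained = separated_by(best)
        if not gained:
            break  # no length-k substring separates the remaining pairs
        chosen.append(best)
        unresolved -= gained
    return chosen


def lower_bound(n):
    """Minimum number of binary distinguishers needed for n genomes."""
    return ceil(log2(n)) if n > 1 else 0


if __name__ == "__main__":
    toy_genomes = ["ACGT", "ACGA", "TCGA", "TTTT"]  # hypothetical data
    picked = greedy_distinguishers(toy_genomes, k=2)
    print(picked, "lower bound:", lower_bound(len(toy_genomes)))
```

The barcode of a genome is then the presence/absence vector of the chosen substrings. The greedy choice keeps the sketch short, but it does not guarantee that the selection meets the lower bound.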

    Robust and scalable barcoding for massively parallel long‑read sequencing

    Nucleic-acid barcoding is an enabling technique for many applications, but its use remains limited in emerging long-read sequencing technologies with intrinsically low raw accuracy. Here, we apply so-called NS-watermark barcodes, whose error correction capability was previously validated in silico, in a proof of concept where we synthesize 3840 NS-watermark barcodes and use them to asymmetrically tag and simultaneously sequence amplicons from two evolutionarily distant species (namely Bordetella pertussis and Drosophila mojavensis) on the ONT MinION platform. To our knowledge, this is the largest number of distinct, non-random tags ever sequenced in parallel and the first report of microarray-based synthesis as a source for large oligonucleotide pools for barcoding. We recovered the identity of more than 86% of the barcodes, with a crosstalk rate of 0.17% (i.e., one misassignment every 584 reads). This falls in the range of the index hopping rate of established, high-accuracy Illumina sequencing, despite the increased number of tags and the relatively low accuracy of both microarray-based synthesis and long-read sequencing. The robustness of NS-watermark barcodes, together with their scalable design and compatibility with low-cost massive synthesis, makes them promising for present and future sequencing applications requiring massive labeling, such as long-read single-cell RNA-Seq.
    Affiliations:
    Ezpeleta, Joaquín. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina. Universidad Nacional de Rosario. Facultad de Ciencias Exactas, Ingeniería y Agrimensura; Argentina.
    Labari, Ignacio Garcia. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina.
    Bulacio, Pilar. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina. Universidad Nacional de Rosario. Facultad de Ciencias Exactas, Ingeniería y Agrimensura; Argentina.
    Tapia, Elizabeth. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina. Universidad Nacional de Rosario. Facultad de Ciencias Exactas, Ingeniería y Agrimensura; Argentina.
    Villanova, Gabriela Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de Rosario. Facultad de Ciencias Bioquímicas y Farmacéuticas. Laboratorio Mixto de Biotecnología Acuática. Centro Científico Tecnológico y Educativo Acuario del Río Paraná; Argentina.
    Lavista Llanos, Sofía. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina.
    Posner, Victoria María. Universidad Nacional de Rosario. Facultad de Ciencias Bioquímicas y Farmacéuticas. Laboratorio Mixto de Biotecnología Acuática. Centro Científico Tecnológico y Educativo Acuario del Río Paraná; Argentina.
    Arranz, Silvia Eda. Universidad Nacional de Rosario. Facultad de Ciencias Bioquímicas y Farmacéuticas. Laboratorio Mixto de Biotecnología Acuática. Centro Científico Tecnológico y Educativo Acuario del Río Paraná; Argentina.
    Krsticevic, Flavia. The Hebrew University of Jerusalem. Robert H Smith Faculty of Agriculture, Food and Environment; Israel.
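The crosstalk figure quoted in the NS-watermark abstract (0.17%, or one misassignment every 584 reads) can be checked with a few lines of arithmetic. The variable names below are illustrative; the only input taken from the abstract is the 584-read figure.

```python
# Figure quoted in the abstract: one misassigned read out of every 584.
reads_per_misassignment = 584

# Crosstalk rate as a fraction and as a percentage.
crosstalk_rate = 1 / reads_per_misassignment
crosstalk_percent = 100 * crosstalk_rate

print(f"{crosstalk_percent:.2f}%")  # prints 0.17%
```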

    An efficient and accurate framework for large-scale sequences of DNA barcodes

    Integrated master's dissertation in Informatics Engineering.
    DNA barcodes are short sequences of predefined gene regions that contain a sufficient amount of intra- and inter-species genetic information. High-throughput sequencing techniques are currently used to identify large sequences of DNA barcodes in a species' genome in a relatively short time. Domain experts require adequate self-contained tools to process DNA barcode data accurately and efficiently, taking advantage of current parallel and heterogeneous computing systems. They also expect to use these tools on different computing platforms, from laptops to high-performance servers, without needing broad knowledge of software engineering to develop efficient computational applications. The main goal of this project was to develop a framework and associated user-friendly tools that let domain experts efficiently support DNA barcoding studies while abstracting away performance issues. 4SpecID is the key outcome of this work: an application that integrates a semi-automated auditing and annotation tool for reference libraries, to ensure the quality standards of the compiled data, aiming to enable a grounded decision when identifying species from DNA barcodes. Its graphical interface helps the end user specify the operations and also simplifies data filtering and remote file handling. The version ported from MATLAB to C++ was fully tested and is more robust than the original. Architecture features common to laptops and compute servers were exploited, namely parallel programming techniques and memory models. The presented validation and performance results show significant improvements in execution times, not only in the sequential version, but also when using the available parallel capabilities of the underlying computing platforms.
    Abstract (translated from Portuguese): DNA barcodes are short sequences of predefined gene regions that contain a sufficient amount of intra- and interspecies genetic information. High-throughput sequencing techniques are used to identify large sequences of DNA barcodes in a species' genome. However, adequate tools must be developed so that domain experts can process DNA barcode data accurately and within a feasible time, using existing parallel and heterogeneous computing systems. These tools are expected to be usable on different computing platforms, from laptops to high-performance servers, without requiring broad knowledge of software engineering to use them or to build other tools with them. The main goal of this project is to develop a framework that provides an abstraction of possible performance challenges and gives domain experts a computationally efficient way to carry out a DNA barcoding study. The project developed a tool, 4SpecID, which aims to enable a grounded decision when identifying species from DNA barcodes: a semi-automated auditing and annotation tool for reference libraries, to ensure the quality standards of the compiled data. The project also exploited the advantages of the most common compute server and laptop architectures, such as parallel programming techniques and memory models. The validation and performance results presented show that better execution times can be obtained by using the available features of the underlying platforms.

    High-Throughput SNP Genotyping by SBE/SBH

    Despite much progress over the past decade, current Single Nucleotide Polymorphism (SNP) genotyping technologies still offer an insufficient degree of multiplexing when required to handle user-selected sets of SNPs. In this paper we propose a new genotyping assay architecture combining multiplexed solution-phase single-base extension (SBE) reactions with sequencing by hybridization (SBH) using universal DNA arrays such as all k-mer arrays. In addition to PCR amplification of genomic DNA, SNP genotyping using SBE/SBH assays involves the following steps: (1) Synthesizing primers complementing the genomic sequence immediately preceding SNPs of interest; (2) Hybridizing these primers with the genomic DNA; (3) Extending each primer by a single base using polymerase enzyme and dideoxynucleotides labeled with 4 different fluorescent dyes; and finally (4) Hybridizing extended primers to a universal DNA array and determining the identity of the bases that extend each primer by hybridization pattern analysis. Our contributions include a study of multiplexing algorithms for SBE/SBH genotyping assays and preliminary experimental results showing the achievable tradeoffs between the number of array probes and primer length on one hand and the number of SNPs that can be assayed simultaneously on the other. Simulation results on datasets both randomly generated and extracted from the NCBI dbSNP database suggest that the SBE/SBH architecture provides a flexible and cost-effective alternative to genotyping assays currently used in the industry, enabling genotyping of up to hundreds of thousands of user-specified SNPs per assay. Comment: 19 pages
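Steps (1), (3) and (4) of the SBE/SBH assay can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the paper's multiplexing algorithm: sequences are written on the forward strand only, array probes are represented by the k-mers they would detect rather than by their complements, and the function names and example sequence are hypothetical.

```python
def extended_primers(forward, snp_pos, primer_len):
    """Model steps (1) and (3): primer plus each possible single-base extension.

    The primer is taken as the `primer_len` bases immediately preceding the
    SNP on the forward strand (simplification: strand orientation ignored).
    """
    primer = forward[snp_pos - primer_len:snp_pos]
    return {base: primer + base for base in "ACGT"}


def spectrum(seq, k):
    """Model step (4): the set of k-mers a universal all-k-mer array would detect."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


if __name__ == "__main__":
    forward = "GATTACAGGCT"  # hypothetical genomic fragment, SNP at index 7
    for base, ext in extended_primers(forward, snp_pos=7, primer_len=5).items():
        print(base, ext, sorted(spectrum(ext, k=3)))
```

In this model the four possible extensions of a primer differ only in the k-mer that covers the extended base, which is what lets the hybridization pattern reveal the genotype. Multiplexing becomes hard when spectra from different primers overlap, which the sketch does not address.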

    DNA Barcoding in the Cycadales: Testing the Potential of Proposed Barcoding Markers for Species Identification of Cycads

    Barcodes are short segments of DNA that can be used to uniquely identify an unknown specimen to species, particularly when diagnostic morphological features are absent. These sequences could offer a new forensic tool in plant and animal conservation, especially for endangered species such as members of the Cycadales. Ideally, barcodes could be used to positively identify illegally obtained material even in cases where diagnostic features have been purposefully removed, or to release confiscated organisms into the proper breeding population. To be useful, a DNA barcode sequence must not only amplify easily by PCR with universal or near-universal reaction conditions and primers, but also contain enough variation to generate unique identifiers at either the species or population level. Chloroplast regions suggested by the Plant Working Group of the Consortium for the Barcode of Life (CBoL), and two alternatives, the chloroplast psbA-trnH intergenic spacer and the nuclear ribosomal internal transcribed spacer (nrITS), were tested for their utility in generating unique identifiers for members of the Cycadales. Ease of amplification and sequence generation with universal primers and reaction conditions was determined for each of the seven proposed markers. While none of the proposed markers provided unique identifiers for all species tested, nrITS showed the most promise in terms of variability, although sequencing difficulties remain a drawback. We suggest a workflow for DNA barcoding, including database generation and management, which will ultimately be necessary if we are to succeed in establishing a universal DNA barcode for plants.

    DNA Barcode Sequence Identification Incorporating Taxonomic Hierarchy and within Taxon Variability

    For DNA barcoding to succeed as a scientific endeavor, an accurate and expeditious query sequence identification method is needed. Although a global multiple-sequence alignment can be generated for some barcoding markers (e.g. COI, rbcL), not all barcoding markers are as structurally conserved (e.g. matK). Thus, algorithms that depend on global multiple-sequence alignments are not universally applicable. Some sequence identification methods that use local pairwise alignments (e.g. BLAST) are unable to accurately differentiate between highly similar sequences and are not designed to cope with hierarchic phylogenetic relationships or within-taxon variability. Here, I present a novel alignment-free sequence identification algorithm, BRONX, that accounts for observed within-taxon variability and hierarchic relationships among taxa. BRONX identifies short variable segments and corresponding invariant flanking regions in reference sequences. These flanking regions are used to score variable regions in the query sequence without the production of a global multiple-sequence alignment. By incorporating observed within-taxon variability into the scoring procedure, misidentifications arising from shared alleles/haplotypes are minimized. An explicit treatment of more inclusive terminals allows for separate identifications to be made for each taxonomic level and/or for user-defined terminals. BRONX performs better than all other methods when there is imperfect overlap between query and reference sequences (e.g. mini-barcode queries against a full-length barcode database). BRONX consistently produced better identifications at the genus level for all query types.
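The idea of scoring variable segments located by their invariant flanks, without building a global alignment, can be sketched as follows. This is a toy illustration and not the BRONX implementation: the flank pairs are supplied by hand rather than discovered from the references, the score is a simple match count, the taxonomic hierarchy is omitted, and all names and sequences are hypothetical.

```python
from collections import defaultdict


def extract_segment(seq, left, right):
    """Return the text between the first `left` flank and the next `right` flank."""
    start = seq.find(left)
    if start == -1:
        return None
    start += len(left)
    end = seq.find(right, start)
    if end == -1:
        return None
    return seq[start:end]


def build_index(references, flanks):
    """Map each flank pair to {taxon: set of variable segments observed in it}.

    Keeping a set per taxon records within-taxon variability.
    """
    index = {pair: defaultdict(set) for pair in flanks}
    for taxon, seqs in references.items():
        for seq in seqs:
            for pair in flanks:
                segment = extract_segment(seq, *pair)
                if segment is not None:
                    index[pair][taxon].add(segment)
    return index


def score_query(query, index):
    """Count, per taxon, the flank pairs whose query segment matches a known variant."""
    scores = defaultdict(int)
    for pair, by_taxon in index.items():
        segment = extract_segment(query, *pair)
        if segment is None:
            continue  # flanks absent, e.g. a mini-barcode covering another region
        for taxon, variants in by_taxon.items():
            if segment in variants:
                scores[taxon] += 1
    return dict(scores)


if __name__ == "__main__":
    references = {  # hypothetical reference library
        "Taxon A": ["AAACCGTTTGGG", "AAACCATTTGGG"],
        "Taxon B": ["AAACTGTTTGGG"],
    }
    index = build_index(references, flanks=[("AAAC", "TTTG")])
    print(score_query("AAACCATTTG", index))
```

Because each flank pair is looked up independently, a short query that covers only some of the variable regions can still be scored, which is the situation the abstract describes for mini-barcode queries.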