8 research outputs found

    Gsufsort: Constructing suffix arrays, LCP arrays and BWTs for string collections

    Background: The construction of a suffix array for a collection of strings is a fundamental task in bioinformatics and in many other applications that process strings. Related data structures, such as the longest common prefix (LCP) array, the Burrows-Wheeler transform (BWT), and the document array, are often needed alongside the suffix array to solve a wide variety of problems efficiently. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been placed on algorithms that construct suffix arrays for string collections. Results: In this paper we introduce gsufsort, an open-source tool for constructing the suffix array and related indexing data structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large FASTA, FASTQ and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
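
    To make the data structures named in this abstract concrete, here is a minimal sketch that builds a suffix array, LCP array, document array, and BWT for a small string collection. It is a naive sort-based construction, not gSACA-K and not the gsufsort implementation, and it does not run in O(N) time. The function name `build_collection_index` and the terminator convention (one terminator per string, smaller than every character, with ties broken by string order) are assumptions made for illustration.

```python
def build_collection_index(strings):
    """Naive index for a string collection (illustrative only, not linear time).

    Returns (sa, lcp, da, bwt) over the concatenation of all strings, where
    each string is followed by its own terminator symbol.
    """
    text = []    # symbols as comparable tuples
    doc_of = []  # doc_of[i] = index of the string that position i belongs to
    for d, s in enumerate(strings):
        for ch in s:
            text.append((1, ch))   # ordinary character
            doc_of.append(d)
        text.append((0, d))        # terminator: sorts before all characters
        doc_of.append(d)

    n = len(text)
    # Suffix array: start positions of all suffixes in lexicographic order.
    sa = sorted(range(n), key=lambda i: text[i:])
    # Document array: which string each sorted suffix starts in.
    da = [doc_of[i] for i in sa]

    # LCP array: common prefix length of each suffix with its predecessor.
    lcp = [0] * n
    for k in range(1, n):
        a, b = sa[k - 1], sa[k]
        length = 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        lcp[k] = length

    # BWT: the symbol preceding each sorted suffix; terminators print as "$".
    bwt = ["$" if text[i - 1][0] == 0 else text[i - 1][1] for i in sa]
    return sa, lcp, da, bwt


sa, lcp, da, bwt = build_collection_index(["ab", "a"])
```

    For the collection `["ab", "a"]` the concatenated text has five positions, and the two suffixes starting with `a` end up adjacent in the suffix array with an LCP value of 1.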

    Hadooping the genome: The impact of big data tools on biology

    This essay examines the consequences of so-called ‘big data’ technologies in biomedicine. Analyzing the algorithms and data structures used by biologists can provide insight into how biologists perceive and understand their objects of study. As such, I examine some of the most widely used algorithms in genomics: those used for sequence comparison or sequence mapping. These algorithms are derived from the powerful tools for text searching and indexing that have been developed since the 1950s and now play an important role in online search. In biology, sequence comparison algorithms have been used to assemble genomes, process next-generation sequence data, and, most recently, for ‘precision medicine.’ I argue that the predominance of a specific set of text-matching and pattern-finding tools has influenced problem choice in genomics. It has allowed genomics to continue to think of genomes as textual objects and has increasingly locked genomics into ‘big data’-driven text-searching methods. Many ‘big data’ methods are designed for finding patterns in human-written texts. However, genomes and other ‘omic’ data are not human-written and are unlikely to be meaningful in the same way.

    TeloPortWrapper: A New Tool for Understanding the Dynamic World of Fungal Telomere Ends

    Telomeres are repetitive DNA sequence motifs found at eukaryote chromosome ends. Telomeres help protect chromosome ends from DNA damage and promote chromosome stability. Chromosomes play important roles in aging, mutation, and cancer. Eukaryotic pathogens also use telomeres to mutate and manage virulence genes. In response to chromosome end breakage, new telomeres, called de novo telomeres, are formed to recreate the lost telomere and sub-telomeric regions. Magnaporthe oryzae is a fungal pathogen that causes wheat blast, a deadly disease of wheat. Magnaporthe oryzae is also known for its highly variable sub-regions, which show high amounts of induced variability due to de novo telomere formation. This variance is associated with mutation and adaptation. Little is known about de novo telomere formation, as telomeres are often underrepresented in standard sequencing assemblies. TeloPortWrapper is a new tool to collect and sort telomeric reads from the raw reads, identify and analyze de novo telomeres, and create visual result summaries. Using TeloPortWrapper on 940,225,828 reads across 14 different Magnaporthe oryzae strain genomes and aligning them to their assembled genomes, it was found that the vast majority of new telomeric regions are being pulled from sections in the middle of chromosomes rather than close to the breakage point itself, as was previously assumed. Manually checking all the aligned reads confirmed that TeloPortWrapper is an accurate tool for collecting and analyzing de novo telomeres.

    Accessing genetic variability in Spanish barleys through high-throughput sequencing

    193 pages, with tables and graphs. Doctoral thesis presented by Carlos Pérez Cantalapiedra to obtain the degree of Doctor in Plant Biology and Biotechnology from Universidad Autónoma de Barcelona (UAB). This work was carried out at Estación Experimental de Aula Dei (EEAD), part of Consejo Superior de Investigaciones Científicas (CSIC), in Zaragoza. High-throughput sequencing (HTS) has revolutionized plant research, making it possible to sequence the genomes of multiple organisms. The sequence-enriched physical map of barley was published in late 2012. A first step toward exploiting barley genomics for practical purposes was to give geneticists and breeders easier access to the barley physical map. This aim led us to develop Barleymap, a software tool that locates genetic markers in the barley physical-genetic map. The application integrates and maps markers from different widely used barley genotyping platforms and, in general, any marker with sequence information. Peer reviewed.

    Scalable succinct indexing for large text collections

    Self-indexes save space by emulating operations of traditional data structures using basic operations on bitvectors. Succinct text indexes provide the full-text search functionality that is traditionally provided by suffix trees and suffix arrays for a given text, while using space equivalent to the compressed representation of the text. Succinct text indexes can therefore provide full-text search functionality over inputs much larger than what is viable using traditional uncompressed suffix-based data structures. Fields such as Information Retrieval involve the processing of massive text collections. However, the in-memory space requirements of succinct text indexes during construction have hampered their adoption for large text collections. One promising approach to support larger data sets is to avoid constructing the full suffix array by using alternative indexing representations. This thesis focuses on several aspects related to the scalability of text indexes to larger data sets. We identify practical improvements in the core building blocks of all succinct text indexing algorithms, and subsequently improve the index performance on large data sets. We evaluate our findings using several standard text collections and demonstrate: (1) the practical applications of our improved indexing techniques; and (2) that succinct text indexes are a practical alternative to inverted indexes for a variety of top-k ranked document retrieval problems.
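
    The "basic operations on bitvectors" that this abstract mentions are typically rank and select. The sketch below shows what those two operations compute, using per-block cumulative counts. It is only an illustration of the interface under assumed names (`RankBitVector`, `rank1`, `select1`); it stores bits in a plain Python list and so has none of the space guarantees of a real succinct structure, and it is not taken from the thesis.

```python
class RankBitVector:
    """Toy bitvector with rank/select support (illustrative, not succinct)."""

    def __init__(self, bits, block=64):
        self.bits = list(bits)
        self.block = block
        # cum[b] = number of 1s in all blocks before block b
        self.cum = [0]
        for start in range(0, len(self.bits), block):
            self.cum.append(self.cum[-1] + sum(self.bits[start:start + block]))

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        b = i // self.block
        return self.cum[b] + sum(self.bits[b * self.block:i])

    def select1(self, k):
        """Position of the k-th 1 (k is 1-based), found by binary search on rank."""
        if k < 1 or k > self.rank1(len(self.bits)):
            raise ValueError("k out of range")
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) < k:
                lo = mid + 1
            else:
                hi = mid
        return lo


bv = RankBitVector([1, 0, 1, 1, 0, 0, 1], block=4)
ones_before_5 = bv.rank1(5)   # counts the 1s among the first five bits
third_one = bv.select1(3)     # index of the third 1
```

    Production libraries pack bits into machine words and keep the auxiliary counts small relative to the bitvector, which is where the space savings described in the abstract come from.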