
    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying). Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201

    Big Data meets High Performance Computing: Genomics and Natural Language Processing as case studies

    The main objective of this thesis is to chart a path towards the convergence of the Big Data and High Performance Computing worlds. To this end, we study the application of these technologies to two real-world scientific problems: sequence alignment in genomics and natural language processing. Both problems have very large inputs and outputs and are computationally intensive, requiring very long execution times. In tackling them, we also develop new tools that professionals in these areas can use. Conclusions about the convergence of the two worlds are presented, taking into account the results of this study

    FANSe: an accurate algorithm for quantitative mapping of large scale sequencing reads

    The most crucial step in data processing from high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. The accurate detection of insertions and deletions (indels) and errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of the RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. We developed a new, fast and accurate algorithm for nucleic acid sequence analysis, FANSe, with adjustable mismatch allowance settings and the ability to handle indels, to accurately and quantitatively map millions of reads to small or large reference genomes. It is a seed-based algorithm which uses the whole read information for mapping; high sensitivity and low ambiguity are achieved by using short and non-overlapping reads. Furthermore, FANSe uses a hotspot score to prioritize the processing of highly probable matches and implements a modified Smith–Waterman refinement with a reduced scoring matrix to accelerate the calculation without compromising its sensitivity. The FANSe algorithm stably processes datasets from various sequencing platforms, masked or unmasked, and small or large genomes. It shows a remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets
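The refinement step mentioned in the abstract above builds on Smith–Waterman local alignment. As a rough illustration only, here is the textbook scoring recurrence, not FANSe's modified version with a reduced scoring matrix. The function name and the `match`, `mismatch`, and `gap` values are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Textbook Smith-Waterman with a linear gap penalty (illustrative scores,
    not the parameters used by FANSe). Only two rows of the dynamic
    programming table are kept in memory at a time.
    """
    cols = len(b) + 1
    prev = [0] * cols          # row i-1 of the score table
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * cols       # row i of the score table
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

With these scores, two reads that share the four-base core `ACGT` but differ in their flanks score 8, the same as aligning `ACGT` against itself.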

    Genome sequence alignment in processing-In-memory architectures

    Finally, we also carry out an experimental study of several architectures with different memory technologies (DDR and HBM) and processing cores of various types, exploiting, in some cases, processing in memory (PIM). The reference application is Bowtie2, a complete application for genome sequence alignment. These architectures are implemented and evaluated using an architectural simulator based on gem5. The combination of an emerging bottleneck in data access and the growing importance of data-intensive applications, which are heavily limited by the memory system, creates a major problem that must be addressed. In this thesis we therefore set out to tackle this problem and to reduce its effect as far as possible. The main objective of this thesis is the design of new architectural and algorithmic solutions to overcome the bottleneck known as the memory wall and to improve the performance of memory-intensive applications that cannot benefit sufficiently from current memory hierarchies. In addition, we believe it is essential to focus on energy efficiency, a factor whose importance grows every day and one of the most limiting factors in high-performance computing. The main contributions of this thesis are as follows. First, we analyse the behaviour of applications with random memory accesses, which do not make good use of new memory architectures with deep cache hierarchies. Specifically, we analyse the FM-index data structure and a sequence search algorithm based on that structure, widely used in genome sequence alignment. 
After this analysis, and having gained a more detailed understanding of the memory bottleneck, we propose a new version of the FM-index that reduces memory bandwidth consumption and thereby significantly improves computational performance. We then propose a new energy-efficient architecture based on a 3D-stacked memory cube, to whose logic layer we add simple low-power cores. This architecture enables execution close to the data (near-data-processing
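For readers unfamiliar with the FM-index search that this abstract analyses, the following is a minimal sketch of standard backward search over a naively built index. It is not the bandwidth-optimised variant proposed in the thesis. The function names are hypothetical, and the sketch assumes the text does not contain the sentinel `$` and that `$` sorts before every other character. Each step of the loop reads the occurrence table at two unrelated positions, which hints at why the access pattern is effectively random.

```python
def build_fm_index(text):
    """Build a toy FM-index: the C table, the Occ table, and the BWT length.

    Naive construction for illustration (sorting all suffixes directly);
    assumes '$' does not occur in text and sorts before all other symbols.
    """
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to '$'

    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1

    # C[c] = number of symbols in the text strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)


def count_occurrences(pattern, C, occ, n):
    """Count occurrences of pattern via backward search."""
    lo, hi = 0, n  # half-open range of matching rows in the sorted suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, after `C, occ, n = build_fm_index("ACGTACGTAC")`, calling `count_occurrences("AC", C, occ, n)` counts the three places where `AC` occurs in the text.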

    Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet

    The suffix array augmented with the LCP array is perhaps the most important data structure in modern string processing. There has been a lot of recent research activity on constructing these arrays in external memory. In this paper, we engineer the two fastest LCP array construction algorithms (ESA 2016) and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism. Just 8 threads are sufficient to make the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms by making them in-place: the input (text and suffix array) is treated as read-only and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets. All previous implementations assume the byte alphabet
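To make the two data structures in this abstract concrete, here is a small in-memory sketch that builds a suffix array by direct sorting and then derives the LCP array with Kasai's linear-time algorithm. This is a swapped-in textbook technique for illustration, not the external-memory, parallel, in-place algorithms the paper engineers, and the function names are assumptions.

```python
def suffix_array(text):
    """Return the starting positions of the suffixes of text in sorted order.

    Naive construction by sorting suffixes; fine for small examples only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Return lcp, where lcp[r] is the length of the longest common prefix
    of the suffixes at sa[r-1] and sa[r], and lcp[0] is 0 (Kasai's algorithm).
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    lcp = [0] * n
    h = 0  # current match length, carried over between text positions
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix immediately before suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp
```

For the text `banana`, the sorted suffixes are `a`, `ana`, `anana`, `banana`, `na`, `nana`, so the suffix array is `[5, 3, 1, 0, 4, 2]` and the LCP array is `[0, 1, 3, 0, 0, 2]`.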

    Nucleotide Sequence Similarity Search Using Techniques from Content-Based Image Retrieval

    The amount of DNA data continues to increase exponentially as a result of high-throughput next generation sequencing. Current state-of-the-art tools for nucleotide sequence similarity search are not equipped to deal with this growth, and new thinking is needed to tackle the rising scalability challenges. This thesis investigates the experimental approach of translating DNA sequences into images and applying state-of-the-art techniques from the field of content-based image retrieval to index and search the resulting images. The challenges of translating DNA sequences into images are discussed and two algorithms for image generation are proposed. We look into the different feature descriptors that are available and evaluate them in the context of the generated images. Lastly, the approach as a whole is evaluated with the mean average precision metric using BLAST as the gold standard reference. The results show that the proposed approach does not match BLAST in retrieval performance, but offers a significant reduction in index size and thus better performance and scalability on large DNA databases

    Scalable succinct indexing for large text collections

    Self-indexes save space by emulating operations of traditional data structures using basic operations on bitvectors. Succinct text indexes provide full-text search functionality which is traditionally provided by suffix trees and suffix arrays for a given text, while using space equivalent to the compressed representation of the text. Succinct text indexes can therefore provide full-text search functionality over inputs much larger than what is viable using traditional uncompressed suffix-based data structures. Fields such as Information Retrieval involve the processing of massive text collections. However, the in-memory space requirements of succinct text indexes during construction have hampered their adoption for large text collections. One promising approach to support larger data sets is to avoid constructing the full suffix array by using alternative indexing representations. This thesis focuses on several aspects related to the scalability of text indexes to larger data sets. We identify practical improvements in the core building blocks of all succinct text indexing algorithms, and subsequently improve the index performance on large data sets. We evaluate our findings using several standard text collections and demonstrate: (1) the practical applications of our improved indexing techniques; and (2) that succinct text indexes are a practical alternative to inverted indexes for a variety of top-k ranked document retrieval problems

    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequence data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors can only offer limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. 
In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of the data-level parallelism in the Intel AVX-512 architecture, in order to develop a parallel version of the UdeACompress algorithms to reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the Suffix Array Construction

    Hardware / Software System for Portable and Low-Cost Genome Assembly

    “The enjoyment of the highest attainable standard of health is one of the fundamental rights of every human being without distinction of race, religion, political belief, economic or social condition” [56]. Genomics (the study of the entire DNA) provides such a standard of health for people with rare diseases and helps control the spread of pandemics. Still, millions of human beings are unable to access genomics due to its cost and lack of portability. In genomics, DNA sequencers digitise DNA information, and computers analyse the digitised information. We have desktop and thumb-sized DNA sequencers that digitise the DNA data rapidly. But the computations necessary for the analysis of this data are inevitably performed on high-performance computers (HPCs) and cloud computers. These computations not only require powerful computers but also necessitate high-speed networks, since the data generated are in the hundreds of gigabytes. Reliance on HPCs and high-speed networks denies the benefits of genomics to the masses who live in remote areas and in poorer nations. Developing a low-cost and portable genomics computation platform would provide personalised treatment based on an individual’s DNA and identify the source of fast-spreading epidemics in remote areas and areas without HPC or network infrastructure. But developing a low-cost and portable genome analysis computing platform is a challenging task. This thesis develops novel computer architecture solutions to assemble the whole human DNA and COVID-19 virus RNA on a low-cost and portable platform. The first phase of the solution describes a ring-pipelined processor architecture for a key genome assembly algorithm. The human genome is partitioned to fit into the small memory footprint of embedded processors. These techniques allow an entire human genome to be assembled using highly portable and low-cost embedded processor cores. These processor cores can be housed within a single chip. 
Each processor was only 0.08 mm² and consumed just 37.5 mW. It has only 2 GB memory, 32-bit instruction width, and a clock with a 1 GHz frequency. The second phase of the solution describes how application-specific instruction-set processors can be sped up to execute a key genome assembly algorithm. A fully automated design system is presented, which improves the performance of large applications (such as a genome assembly algorithm) and generates application-specific instructions for a commercial processor design tool (Xtensa). The tool enhances the base processor, which was used in the ring pipeline processor architecture. Thus, the alignment algorithms execute 2.1 times faster with only 11% additional hardware. The energy-delay product was reduced by 7.3× compared to the base processor. This tool is the only one of its type that can handle large applications. The third phase of the solution designs a portable low-cost genome assembly computer (PGA). PGA enhances the ring pipeline architecture with the customised processor found in phase two and with improved inter-processor communication. The results show that the COVID-19 virus RNA can be assembled in under 10 minutes and the whole human genome can be assembled in 11 days on a portable platform (an HPC takes around two days) for 30× coverage. PGA has an area footprint of just 5.68 mm² in a 28 nm technology node and is far smaller than a high-performance computer processor chip. The PGA consumes only 4 W of power, which is lower than the power requirement of a high-performance processor chip. The manufacturing cost of the PGA would also be much lower than the cost of a high-performance system when produced in volume. The developed solution can be powered by the USB port of a laptop. This thesis is the first of its type to show the design of a single-chip solution able to process a complex genomic problem. 
This thesis contributes to attaining one of the fundamental rights of every human being wherever they may live