    Prefix-Free Parsing for Building Big BWTs

    High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and, in one pass, generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.
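
The one-pass construction of D and P can be illustrated with a minimal Python sketch. It assumes the usual formulation of prefix-free parsing, which the abstract does not spell out: a window of length `w` slides over T, and a phrase ends wherever the window's hash is 0 modulo `p`, with consecutive phrases overlapping by `w` characters. The function names, the sentinel characters, the simple hash, and the default values of `w` and `p` are illustrative assumptions, not the authors' implementation, and the BWT construction from D and P is not shown.

```python
def window_hash(window: str) -> int:
    """Toy polynomial hash; a real implementation would use a rolling hash."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text: str, w: int = 2, p: int = 3):
    """Return (dictionary, parse) for text in one left-to-right pass.

    Assumes text contains neither of the sentinel characters below.
    """
    start, end = "\x01", "\x00"          # hypothetical sentinels
    padded = start + text + end * w
    phrases = []
    begin = 0
    for i in range(1, len(padded) - w + 1):
        window = padded[i:i + w]
        # A trigger window closes the current phrase and opens the next one.
        if window == end * w or window_hash(window) % p == 0:
            phrases.append(padded[begin:i + w])
            begin = i
    dictionary = sorted(set(phrases))     # D: distinct phrases, sorted
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]   # P: phrase ranks in order
    return dictionary, parse


def reconstruct(dictionary, parse, w: int) -> str:
    """Rebuild the text from D and P, undoing the w-character overlaps."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out[1:-w]                      # strip the sentinels
```

On repetitive input the same phrases recur, so D stays small while P records only their order; this sketch recomputes each window hash from scratch for clarity, which a practical one-pass implementation would avoid.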

    Third-Generation Sequencing Data Analytics on Mobile Devices: Cache-Oblivious and Out-of-Core Approaches as a Proof of Concept

    Mobile (third-generation) sequencing technologies, including Oxford Nanopore's MinION and SmidgION, have the benefit of outputting long sequence reads (up to hundreds of thousands of bases) in a portable manner. These sequencing devices fit in the palm of a hand and only require a USB outlet. Unfortunately, the development of data analysis tools for these technologies is in a nascent stage, which impedes the portability of these devices. The objective of this work is to introduce an out-of-core approach to port Nanopore analytics to mobile devices such as tablets or smartphones, which are often used in extreme experimental settings with special needs for ergonomics and ease of sterilization. In this paper, we present a serial k-mer parser/counter for FAST5 files and a de Bruijn graph construction method that can run on a hand-held device. To accomplish this portability, we develop novel cache-oblivious data structures and out-of-core chunked processing methods. Our toolset, which we refer to as the Nanopore Portable Analytics Library (NanoPAL), was implemented in ISO C++14 and compiled for Android devices. Using MinION data (Zaire Ebolavirus species and others), we evaluate the time required to parse and build the de Bruijn graph with respect to file size and RAM allocation. These metrics were compared to those of minimap/miniasm. On an LG Nexus 5 with 2 GB of RAM, 2 MB of L2 cache and 16 GB of storage, the out-of-core NanoPAL is able to process FAST5 files at about 30 minutes per 0.5 GB, creating sorted k-mer and de Bruijn graph files. The recompiled minimap/miniasm tool cannot finish processing FAST5 files larger than 170 MB. In conjunction with base calling/error correction, and with the addition of assembly procedures downstream, NanoPAL can be effectively used to perform analyses of MinION/SmidgION data locally on a mobile device.
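
The idea of out-of-core chunked k-mer counting followed by de Bruijn graph construction can be sketched as follows. This is a Python illustration under stated assumptions, not NanoPAL itself (which the abstract describes as C++): reads are assumed to be already base-called plain strings rather than FAST5 records, and the function names, the `chunk_size` parameter, and the tab-separated spill-file format are all hypothetical. It shows only the chunk, spill, and merge pattern, not the cache-oblivious data structures.

```python
import heapq
import itertools
import os
import tempfile
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of one read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def spill_sorted_counts(reads, k, chunk_size, tmpdir):
    """Count k-mers one chunk of reads at a time; write sorted counts to disk."""
    paths = []
    it = iter(reads)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        counts = Counter(km for read in chunk for km in kmers(read, k))
        fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".kmers")
        with os.fdopen(fd, "w") as fh:
            for km in sorted(counts):
                fh.write(f"{km}\t{counts[km]}\n")
        paths.append(path)
    return paths


def merge_counts(paths):
    """Stream-merge the sorted chunk files, summing counts of equal k-mers."""
    files = [open(p) for p in paths]
    try:
        rows = heapq.merge(
            *((line.rstrip("\n").split("\t") for line in fh) for fh in files)
        )
        for km, group in itertools.groupby(rows, key=lambda row: row[0]):
            yield km, sum(int(count) for _, count in group)
    finally:
        for fh in files:
            fh.close()


def de_bruijn_edges(kmer_counts):
    """Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    for km, count in kmer_counts:
        yield km[:-1], km[1:], count
```

Only one chunk of reads is held in memory at a time, and the merge step reads the spill files sequentially, which is the property that matters on a device with little RAM.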

    Combinatorial and Probabilistic Approaches to Motif Recognition

    Short substrings of genomic data that are responsible for biological processes, such as gene expression, are referred to as motifs. Motifs with the same function may not entirely match, due to mutation events at a few of the motif positions. Allowing for non-exact occurrences significantly complicates their discovery. Given a number of DNA strings, the motif recognition problem is the task of detecting motif instances in every given sequence without knowledge of the position of the instances or the pattern shared by these substrings. We describe a novel approach to motif recognition, and provide theoretical and experimental results that demonstrate its efficiency and accuracy. Our algorithm, MCL-WMR, builds an edge-weighted graph model of the given motif recognition problem and uses a graph clustering algorithm to quickly determine important subgraphs that need to be searched further for valid motifs. By considering a weighted graph model, we narrow the search dramatically to smaller problems that can be solved with significantly less computation. The Closest String problem is a subproblem of motif recognition, and it is NP-hard. We give a linear-time algorithm for a restricted version of the Closest String problem, and an efficient polynomial-time heuristic that solves the general problem with high probability. We initiate the study of the smoothed complexity of the Closest String problem, which in turn explains our empirical results that demonstrate the great capability of our probabilistic heuristic. Important to this analysis is the introduction of a perturbation model of the Closest String instances within which we provide a probabilistic analysis of our algorithm. The smoothed analysis suggests reasons why a well-known fixed-parameter tractable algorithm solves Closest String instances extremely efficiently in practice.
Although the Closest String model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the Closest String with Outliers problem, to overcome this limitation. A systematic parameterized complexity analysis accompanies the introduction of this problem, providing a surprising insight into the sensitivity of this problem to slightly different parameterizations. Through the application of probabilistic and combinatorial insights into the Closest String problem, we develop sMCL-WMR, a program that is much faster than its predecessor MCL-WMR. We apply and adapt sMCL-WMR and MCL-WMR to analyze the promoter regions of the canola seed-coat. Our results identify important regions of the canola genome that are responsible for specific biological activities. This knowledge may be used in the long-term aim of developing crop varieties with specific biological characteristics, such as being disease-resistant.
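
For readers unfamiliar with the Closest String problem, the following Python sketch may help. It assumes the standard definition (given equal-length strings and a bound d, find a string within Hamming distance d of every input) and uses a bounded search tree in the style of the classic fixed-parameter approach parameterized by d. It is an illustration only: the function names are hypothetical, and it is not claimed to be the exact algorithm or heuristic analyzed in the thesis.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_string(strings, d):
    """Return a string within Hamming distance d of every input, or None.

    Starts from the first input and moves the candidate toward any string
    that is still too far away, spending at most d single-character moves.
    """
    def search(candidate, budget):
        for s in strings:
            dist = hamming(candidate, s)
            if dist > d:
                # Too far to repair with the remaining moves: prune.
                if budget == 0 or dist > d + budget:
                    return None
                mismatches = [i for i in range(len(s)) if candidate[i] != s[i]]
                # Trying d + 1 mismatch positions suffices: any solution
                # must agree with s on at least one of them.
                for i in mismatches[:d + 1]:
                    moved = candidate[:i] + s[i] + candidate[i + 1:]
                    found = search(moved, budget - 1)
                    if found is not None:
                        return found
                return None
        return candidate  # within distance d of every input

    return search(strings[0], d)
```

The search tree has depth at most d and branching factor at most d + 1, so the running time depends exponentially on d only, not on the number or length of the strings.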

    Chronic Fatigue Syndrome: Little Known but Not Without Solutions

    Integration project completed as part of course PHT-6113. Introduction: Chronic fatigue syndrome is defined as a multisystem condition characterized mainly by severe, disabling fatigue. This fatigue is accompanied by several other symptoms, such as post-exertional malaise, pain, and neuroendocrine, immune, and autonomic manifestations. Objective: The aim of this work is to raise awareness of this syndrome among health professionals, primarily physiotherapists, so that they can recognize it, and to equip them for both assessment and treatment. Brief description: The etiology of this syndrome is unknown, as is that of fibromyalgia. Indeed, the two conditions closely resemble each other in clinical presentation and pathophysiology, and several researchers even consider a common etiology possible. Currently, only two treatments have been shown to be effective in this population: cognitive behavioral therapy and graded exercise programs. Various muscular and cardiovascular abnormalities have also been documented in the literature. These elements justify the physiotherapist's involvement in developing an intervention plan. Results: Optimal parameters for cardiovascular exercise were therefore determined from the best available evidence. Parameters for muscle training, by contrast, have been little studied to date in this population. However, existing general guidelines on muscle training, combined with an analysis of the impairments present in these patients, made it possible, within this work, to develop a program combining cardiovascular and muscular exercises. Conclusion: Although the current literature supports the management of patients with CFS, many parameters remain to be established by evidence.

    Acceleration of FM-Index Queries Through Prefix-Free Parsing

    FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [Ferragina and Fischer, 2007] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [Deng et al., 2022] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing, which takes parameters that let us tune the average length of the phrases, instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate that it is between 3 and 18 times faster than competing methods on queries to GRCh38, and consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method accelerates count queries relative to all state-of-the-art methods, with a minor increase in memory.
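
The claim that a standard FM-index spends at least one random access per pattern character can be seen in a minimal sketch of character-based backward search. This Python illustration uses a naive BWT construction and uncompressed rank tables, so it is suitable only for tiny inputs. The function names are hypothetical, the text is assumed not to contain `$` and to use only characters that sort after it, and the code shows the baseline character-based index, not the word-based method proposed in the paper.

```python
def bwt(text: str) -> str:
    """Naive BWT via sorted rotations of text + '$' (quadratic space)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm(text: str):
    """Build the C array and full rank (occ) tables over the BWT."""
    last = bwt(text)
    alphabet = sorted(set(last))
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total                 # characters strictly smaller than ch
        total += last.count(ch)
    occ = {ch: [0] * (len(last) + 1) for ch in alphabet}
    for i, ch in enumerate(last):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (a == ch)   # rank of a in last[:i+1]
    return C, occ, len(last)


def count(pattern: str, C, occ, n: int) -> int:
    """Backward search: one step (two rank lookups) per pattern character."""
    lo, hi = 0, n                     # half-open range of matching rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each loop iteration jumps to two unrelated positions in the rank tables, which is why the number of steps, and hence the phrase length in a word-based index, affects query time.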
