
    Parallel methods for short read assembly

    This work addresses the parallel de novo assembly of genomic sequences from short sequence reads. Because short reads undermine the reliability of read overlaps as predictors of genomic co-location, graph-based methods have seen a revival and now underpin the development of short-read assemblers. While these methods predate short-read technology, their reach had not extended significantly beyond bacterial genomes because of the memory they require, a limitation exacerbated by the high coverage needed to compensate for shorter read lengths. As a result, prior to our work, short-read de novo assembly had been demonstrated only on relatively small genomes of a few million bases. We advance the field of short-read assembly in several ways. First, we extend models and ideas proposed and tested with small genomes on serial machines to large-scale distributed-memory parallel machines. Second, we present assembly ideas that are especially suited to the reconstruction of very large genomes on these machines. Additionally, we present the first assembler that specifically takes advantage of a variable number of fragment sizes (insert lengths) concurrently when making assembly decisions, while still working well for data with a single insert length.
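
    The graph-based approach referenced above is, in modern short-read assemblers, typically a de Bruijn graph over k-mers. Below is a minimal serial sketch of that construction, with hypothetical toy reads; the thesis' contribution is the distributed-memory parallelization, which this sketch does not attempt.

        # Minimal de Bruijn graph construction from short reads.
        # Serial toy code; real assemblers shard this k-mer table across
        # distributed memory, as the work above describes.
        from collections import defaultdict

        def build_de_bruijn(reads, k):
            """Map each (k-1)-mer to the (k-1)-mers following it in some read."""
            graph = defaultdict(set)
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
            return graph

        reads = ["AGCTGA", "GCTGAC", "CTGACA"]  # hypothetical toy reads
        for node, succs in sorted(build_de_bruijn(reads, 4).items()):
            print(node, "->", sorted(succs))

    Contigs then correspond to maximal non-branching paths in this graph; note that the high coverage mentioned above inflates the number of distinct k-mers and hence the memory footprint.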

    Advances in Bioinformatics : contributions to high-throughput proteomics-based identification, quantification and systems biology

    Doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Ciencias, Departamento de Biología Molecular, on 27-03-2017. The analysis of high-throughput proteomics data presents the challenge of extracting biological meaning from a wealth of protein identifications and quantifications. In the last decade, technology in this area has undergone a major transformation, which required continuous and substantial development of bioinformatic tools to establish the foundations of the algorithms to be used in the coming years of proteomics research. In this work we present three papers that represent three milestones in this endeavour. In the first publication, we present an in-depth analysis of the performance and influence of peptide identification search algorithms following the advent of high-resolution, high-accuracy mass spectrometers.
It is shown that, in many relevant cases, using smaller precursor ion mass tolerances to identify peptides leads to an increase in incorrectly identified peptides that is greatly underestimated by the false discovery rate (FDR). Here we propose a change in the search algorithm, consisting of the use of wide precursor mass windows followed by post-scoring mass filtering. The second publication is dedicated to WSPP (Weighted Spectrum, Peptide, Protein), a statistical model for the analysis of high-throughput quantitative proteomics experiments. The model can be used with a wide range of combinations of stable isotope labelling (SIL) techniques and mass spectrometers. In addition, it provides a general statistical framework for these experiments: thanks to its unique capacity to separate the different sources of variance, it allows results to be compared across laboratories and errors to be interpreted at different levels. In the third and final paper, we present an innovative method for performing systems biology analyses from the proteomics perspective that takes into account the degree of coordination of the proteome, building on the statistical basis provided by the WSPP model. This was made possible by the development of the Generic Integration Algorithm (GIA), which allows quantitative information to be integrated from any lower level to any higher level, rather than being limited to the traditional spectrum-peptide-protein workflow. All these models are implemented in SanXoT, a software package developed to make them usable in practice for quantitative proteomics. These three steps represented a dramatic change in the way proteomes were analysed in our laboratory, and opened countless possibilities for further development and enhancement of high-throughput proteomics.
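
A minimal sketch of the wide-window search idea described above, assuming a hypothetical score function, candidate list, and placeholder tolerances (the published pipeline is SanXoT, not this code): candidates are scored inside a wide precursor-mass window, and the tight precursor-mass filter is applied only after scoring.

    # Wide precursor window during search, tight mass filter after scoring.
    # `score`, the tolerances, and the candidate list are illustrative
    # placeholders, not the published implementation.
    WIDE_TOL_DA = 500.0    # wide window used while scoring candidates
    TIGHT_TOL_PPM = 10.0   # tight filter applied post-scoring

    def ppm_error(observed, theoretical):
        return abs(observed - theoretical) / theoretical * 1e6

    def search(spectrum_mass, candidates, score):
        """candidates: iterable of (peptide, theoretical_mass) pairs."""
        scored = [(score(pep), pep, mass)
                  for pep, mass in candidates
                  if abs(mass - spectrum_mass) <= WIDE_TOL_DA]
        if not scored:
            return None
        best_score, pep, mass = max(scored)  # best hit over the wide window
        if ppm_error(spectrum_mass, mass) <= TIGHT_TOL_PPM:
            return pep, best_score
        return None  # best-scoring hit fails the tight mass filter

The rationale, as stated above, is that filtering by precursor mass after scoring avoids the FDR underestimation observed when the search itself is restricted to a narrow window.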

    Sorting permutations with pattern-avoiding machines

    In this thesis we introduce and study a new family of sorting devices, which we call pattern-avoiding machines. They consist of two stacks in series, equipped with a greedy procedure. On both stacks we impose a static constraint in terms of pattern containment: reading the content from top to bottom, the first stack is not allowed to contain occurrences of a given pattern $\sigma$, whereas the second one is not allowed to contain occurrences of $21$. By analyzing the behavior of pattern-avoiding machines, we aim to gain a better understanding of the problem of sorting permutations with two consecutive stacks, which is currently one of the most challenging open problems in combinatorics. Comment: PhD thesis, 137 pages.
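
    A runnable sketch of the greedy device just described, using brute-force pattern containment (adequate for toy sizes only) and the classical fact that a greedy 21-avoiding stack performs ordinary stack sorting, which sorts exactly the 231-avoiding permutations:

        from itertools import combinations

        def contains(seq, patt):
            """Does seq contain an occurrence of the classical pattern patt?"""
            for idx in combinations(range(len(seq)), len(patt)):
                window = [seq[i] for i in idx]
                if all((window[a] < window[b]) == (patt[a] < patt[b])
                       for a in range(len(patt)) for b in range(a + 1, len(patt))):
                    return True
            return False

        def sigma_stack_pass(perm, sigma):
            """Greedy pass through a stack whose content, read top to bottom
            (stored here with index 0 as the top), must avoid sigma."""
            stack, out = [], []
            for x in perm:
                while contains([x] + stack, sigma):  # pushing x would embed sigma
                    out.append(stack.pop(0))         # so pop the top instead
                stack.insert(0, x)
            return out + stack                       # flush remaining, top first

        def sortable(perm, sigma):
            """Is perm sorted by the sigma-machine (sigma-stack, then 21-stack)?"""
            return not contains(sigma_stack_pass(perm, sigma), (2, 3, 1))

        print(sortable((3, 1, 2), (1, 2)))  # True: the first pass outputs 321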

    The mapping task and its various applications in next-generation sequencing

    The aim of this thesis is the development and benchmarking of computational methods for the analysis of high-throughput data from tiling arrays and next-generation sequencing. Tiling arrays have been a mainstay of genome-wide transcriptomics, e.g., in the identification of functional elements in the human genome. Due to limitations of existing methods for analyzing such data, a novel statistical approach is presented that identifies expressed segments as significant deviations from the background distribution and thus avoids dataset-specific parameters. This method detects differentially expressed segments in biological data with significantly lower false discovery rates at equivalent sensitivity compared to commonly used methods. It is also clearly superior in the recovery of exon-intron structures. Moreover, the search for local accumulations of expressed segments in tiling array data has led to the identification of very large expressed regions that may constitute a new class of macroRNAs. The thesis then proceeds to next-generation sequencing (NGS), for which various protocols have been devised to study genomic, transcriptomic, and epigenomic features. One of the first crucial steps in most NGS data analyses is the mapping of sequencing reads to a reference genome. This work introduces algorithmic methods that solve the mapping task for three major NGS protocols: DNA-seq, RNA-seq, and MethylC-seq. All methods have been thoroughly benchmarked and integrated into the segemehl mapping suite. First, mapping of DNA-seq data is facilitated by the core mapping algorithm of segemehl, which has been continuously updated and expanded since its initial publication. Extensive and reproducible benchmarks are presented that compare segemehl to state-of-the-art read aligners on various data sets. The results indicate that it is not only more sensitive in finding the optimal alignment with respect to the unit edit distance but also very specific compared to the most commonly used alternative read mappers. These advantages are observable for both real and simulated reads and are largely independent of read length and sequencing technology, but come at the cost of higher running time and memory consumption. Second, the split-read extension of segemehl, presented by Hoffmann, enables the mapping of RNA-seq data, a computationally harder form of the mapping task due to the occurrence of splicing. Here, the novel tool lack is presented, which aims to recover missed RNA-seq read alignments using de novo splice junction information. It performs very well in benchmarks and may thus be a beneficial extension to RNA-seq analysis pipelines. Third, a novel method is introduced that facilitates the mapping of bisulfite-treated sequencing data. This protocol is considered the gold standard in genome-wide studies of DNA methylation, one of the major epigenetic modifications in animals and plants. Treating DNA with sodium bisulfite selectively converts unmethylated cytosines to uracils, while methylated ones remain unchanged. The bisulfite extension developed here performs seed searches on a collapsed alphabet followed by bisulfite-sensitive dynamic-programming alignments. It is therefore insensitive to bisulfite-related mismatches and, in contrast to other methods, does not rely on post-processing.
In comparison to state-of-the-art tools, this method achieves significantly higher sensitivity and is competitive in runtime when mapping millions of sequencing reads to vertebrate genomes. Remarkably, the increase in sensitivity does not come at the cost of decreased specificity and may thus ultimately yield better performance in calling methylation rates. Lastly, the potential of mapping strategies for de novo genome assemblies is demonstrated with the introduction of a new guided assembly procedure. It incorporates mapping as its major component and uses additional information (e.g., annotation) as a guide. With this method, the complete mitochondrial genome of Eulimnogammarus verrucosus was successfully assembled even though the sequencing library was heavily dominated by nuclear DNA. In summary, this thesis introduces algorithmic methods that significantly improve the analysis of tiling array, DNA-seq, RNA-seq, and MethylC-seq data, and proposes standards for benchmarking NGS read aligners. Moreover, it presents a new guided assembly procedure that has been successfully applied in the de novo assembly of a crustacean mitogenome.
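
The collapsed-alphabet idea behind the bisulfite extension can be illustrated in a few lines: seeds are matched with C collapsed to T, and verification treats a genome C read as T as a legal, bisulfite-converted position. This is a toy, single-strand, mismatch-only sketch with made-up sequences; real data also requires the G-to-A collapsed reverse strand and full dynamic-programming alignment, as described above.

    def collapse(seq):
        return seq.replace("C", "T")  # C and T become indistinguishable

    def seed_hits(read, genome, seed_len=8):
        """Exact matches of the read's seed prefix on the collapsed texts."""
        seed, flat = collapse(read[:seed_len]), collapse(genome)
        hits, pos = [], flat.find(seed)
        while pos != -1:
            hits.append(pos)
            pos = flat.find(seed, pos + 1)
        return hits

    def bisulfite_mismatches(read, genome, pos):
        """Count mismatches, allowing genome C to appear as read T."""
        return sum(1 for r, g in zip(read, genome[pos:])
                   if r != g and not (g == "C" and r == "T"))

    genome = "ACGTTCGATCGTTAGC"  # toy reference, illustrative only
    read = "TTGATTGT"            # fully converted copy of 'TCGATCGT'
    for p in seed_hits(read, genome):
        print(p, bisulfite_mismatches(read, genome, p))  # prints: 4 0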

    LIPIcs, Volume 274, ESA 2023, Complete Volume

    LIPIcs, Volume 274, ESA 2023, Complete Volume.

    35th Symposium on Theoretical Aspects of Computer Science: STACS 2018, February 28-March 3, 2018, Caen, France


    Combinatoire des cartes et polynôme de Tutte

    A map is a graph together with a particular (proper) embedding in a surface. Maps are a natural way of representing discrete surfaces, and as such they appear both in computer science (encoding of visual data) and in physics (random lattices of statistical physics and quantum gravity). We establish enumerative results for new classes of maps. Moreover, we define several bijections between maps and simpler combinatorial classes (planar walks, pairs of trees). These bijections highlight important structural properties of maps and allow one to count, encode, and randomly sample maps efficiently. Lastly, we give a new characterization of an important graph invariant, the Tutte polynomial, by making use of maps. This characterization allows us to establish bijections between several structures (spanning trees, sandpile configurations, outdegree sequences) counted by the Tutte polynomial.
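
    For reference, and as standard background rather than anything specific to this thesis, the invariant in question has the subset expansion
    $$T_G(x,y) = \sum_{A \subseteq E} (x-1)^{r(E)-r(A)}\,(y-1)^{|A|-r(A)}, \qquad r(A) = |V| - c(V,A),$$
    where $c(V,A)$ is the number of connected components of the spanning subgraph $(V,A)$. Specializations such as $T_G(1,1)$, which counts the spanning trees of a connected graph, are what make bijections between the structures listed above possible.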

    Proceedings of JAC 2010. Journées Automates Cellulaires

    The second Symposium on Cellular Automata "Journées Automates Cellulaires" (JAC 2010) took place in Turku, Finland, on December 15-17, 2010. The first two conference days were held in the Educarium building of the University of Turku, while the talks of the third day were given on board passenger ferries in the beautiful Turku archipelago, along the route Turku–Mariehamn–Turku. The conference was organized by FUNDIM, the Fundamentals of Computing and Discrete Mathematics research center at the mathematics department of the University of Turku. The program of the conference included 17 submitted papers, selected by the international program committee on the basis of three peer reviews per paper. These papers form the core of these proceedings. I want to thank the members of the program committee and the external referees for the excellent work they have done in choosing the papers to be presented at the conference. In addition to the submitted papers, the program of JAC 2010 included four distinguished invited speakers: Michel Coornaert (Université de Strasbourg, France), Bruno Durand (Université de Provence, Marseille, France), Dora Giammarresi (Università di Roma Tor Vergata, Italy), and Martin Kutrib (Universität Gießen, Germany). I sincerely thank the invited speakers for accepting our invitation to come and give a plenary talk at the conference. The invited talk by Bruno Durand was eventually given by his co-author Alexander Shen, and I thank him for agreeing to give the presentation at short notice. Abstracts or extended abstracts of the invited presentations appear in the first part of this volume. The program also included several informal presentations describing very recent developments and ongoing research projects. I wish to thank all the speakers for their contribution to the success of the symposium. I would also like to thank the sponsors and our collaborators: the Finnish Academy of Science and Letters, the French National Research Agency project EMC (ANR-09-BLAN-0164), Turku Centre for Computer Science, the University of Turku, and Centro Hotel. Finally, I sincerely thank the members of the local organizing committee for making the conference possible. These proceedings are published both electronically and in print. The electronic proceedings are available in the electronic repository HAL, managed by several French research agencies. The printed version is published in the general publications series of TUCS, Turku Centre for Computer Science. We thank both HAL and TUCS for agreeing to publish the proceedings.