Upper clique transversals in graphs
A clique transversal in a graph is a set of vertices intersecting all maximal
cliques. The problem of determining the minimum size of a clique transversal
has received considerable attention in the literature. In this paper, we
initiate the study of the "upper" variant of this parameter, the upper clique
transversal number, defined as the maximum size of a minimal clique
transversal. We investigate this parameter from the algorithmic and complexity
points of view, with a focus on various graph classes. We show that the
corresponding decision problem is NP-complete in the classes of chordal graphs,
chordal bipartite graphs, and line graphs of bipartite graphs, but solvable in
linear time in the classes of split graphs and proper interval graphs.
Comment: Full version of a WG 2023 paper
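To make the two parameters concrete, here is a brute-force sketch (our own illustration, not an algorithm from the paper; exponential time, so suitable only for tiny graphs) that computes the upper clique transversal number. The star K(1,3) shows that it can far exceed the minimum clique transversal size:

```python
from itertools import combinations

def maximal_cliques(vertices, adj):
    # Enumerate all cliques, keep the inclusion-maximal ones.
    cliques = []
    for r in range(1, len(vertices) + 1):
        for c in combinations(vertices, r):
            if all(v in adj[u] for u, v in combinations(c, 2)):
                cliques.append(frozenset(c))
    return [c for c in cliques if not any(c < d for d in cliques)]

def is_transversal(s, cliques):
    # A clique transversal intersects every maximal clique.
    return all(s & c for c in cliques)

def upper_clique_transversal_number(vertices, adj):
    # Maximum size of a *minimal* clique transversal.
    cliques = maximal_cliques(vertices, adj)
    best = 0
    for r in range(1, len(vertices) + 1):
        for s in map(frozenset, combinations(vertices, r)):
            if is_transversal(s, cliques) and \
               all(not is_transversal(s - {v}, cliques) for v in s):
                best = max(best, len(s))
    return best

# Star K(1,3): center 'c' adjacent to leaves 'x', 'y', 'z'.
# {c} is a minimum clique transversal (size 1), while {x, y, z}
# is a minimal clique transversal of size 3.
star = {'c': {'x', 'y', 'z'}, 'x': {'c'}, 'y': {'c'}, 'z': {'c'}}
print(upper_clique_transversal_number(['c', 'x', 'y', 'z'], star))
```

The gap between the minimum (1) and upper (3) values on this star is exactly the kind of spread the paper's parameter captures.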
New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly
Great efforts have been devoted to deciphering the sequence composition of
the genomes and transcriptomes of diverse organisms. Continuing advances in
high-throughput sequencing technologies have led to a decline in associated
costs, facilitating a rapid increase in the amount of available genetic data. In
particular, genome studies have undergone a fundamental paradigm shift in which
genome projects are no longer limited by sequencing costs, but rather by
computational problems associated with assembly. There is an urgent demand
for more efficient and more accurate methods. Most recently, “hybrid”
methods that integrate short- and long-read data have been devised to address
this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a
bipartite overlap graph between long reads and restrictively filtered short-read
unitigs. This graph is translated into a long-read overlap graph. By design,
unitigs are both unique and almost free of assembly errors. As a consequence,
only a few spurious overlaps are introduced into the graph. Instead of the more
conventional approach of removing tips, bubbles, and other local features,
LazyB extracts subgraphs whose global properties approach a disjoint union of
paths in multiple steps, utilizing properties of proper interval graphs. A
prototype implementation of LazyB, entirely written in Python, not only yields
significantly more accurate assemblies of the yeast, fruit fly, and human
genomes compared to state-of-the-art pipelines, but also requires much less
computational effort. An optimized C++ implementation dubbed MuCHSALSA
further significantly reduces resource demands.
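As a simplified illustration of the graph translation step described above (function names and data layout are invented; the actual LazyB/MuCHSALSA pipeline also tracks overlap positions, orientations, and quality), two long reads become adjacent in the long-read overlap graph when they are anchored by a common short-read unitig:

```python
from collections import defaultdict

def long_read_overlap_graph(anchors):
    """Translate a bipartite overlap graph into a long-read overlap graph.

    anchors: dict mapping each long read to the set of unitigs it overlaps
    (the bipartite graph). Because unitigs are unique and nearly error-free,
    sharing a unitig is strong evidence that two long reads overlap.
    """
    by_unitig = defaultdict(set)
    for read, unitigs in anchors.items():
        for u in unitigs:
            by_unitig[u].add(read)
    edges = set()
    for reads in by_unitig.values():
        for r1 in reads:
            for r2 in reads:
                if r1 < r2:          # emit each undirected edge once
                    edges.add((r1, r2))
    return edges

# Reads A and B share unitig u2; read C shares nothing.
print(long_read_overlap_graph({'A': {'u1', 'u2'}, 'B': {'u2'}, 'C': {'u3'}}))
```

In the real assembler, the resulting graph is then pruned toward a disjoint union of paths rather than used directly.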
Advances in RNA-seq have facilitated tremendous insights into the role of
both coding and non-coding transcripts. Yet, the complete and accurate
annotation of the transcriptomes of even model organisms has remained elusive.
RNA-seq produces reads significantly shorter than the average distance
between related splice events, and presents high noise levels and other biases.
The computational reconstruction of transcripts therefore remains a critical bottleneck.
Ryūtō implements an extension of common splice graphs facilitating the integration
of reads spanning multiple splice sites and paired-end reads bridging distant
transcript parts. The decomposition of read coverage patterns is modeled as a
minimum-cost flow problem. Using phasing information from multi-splice and
paired-end reads, nodes with uncertain connections are decomposed step-wise
via Linear Programming.
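Ryūtō's actual decomposition is formulated as a minimum-cost flow problem refined by linear programming; purely to illustrate the underlying idea of decomposing read coverage on a splice graph into transcript paths, here is a greedy bottleneck-path sketch over a toy coverage-weighted splice DAG (all node names and coverage values are invented):

```python
from collections import defaultdict

def topo_order(graph):
    # Kahn's algorithm; graph: node -> {successor: coverage}.
    indeg = defaultdict(int)
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1
    order, stack = [], [u for u in graph if indeg[u] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in graph[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return order

def max_bottleneck_path(graph, source, sink):
    # DP over a topological order: path maximizing its minimum edge coverage.
    best, pred = {source: float('inf')}, {}
    for u in topo_order(graph):
        if u not in best:
            continue
        for v, cov in graph[u].items():
            b = min(best[u], cov)
            if cov > 0 and b > best.get(v, 0):
                best[v], pred[v] = b, u
    if sink not in best:
        return None, 0
    path, node = [sink], sink
    while node != source:
        node = pred[node]
        path.append(node)
    return path[::-1], best[sink]

def decompose(graph, source, sink):
    # Repeatedly peel off the heaviest path; mutates edge coverages in place.
    transcripts = []
    while True:
        path, flow = max_bottleneck_path(graph, source, sink)
        if not path or flow == 0:
            break
        transcripts.append((path, flow))
        for u, v in zip(path, path[1:]):
            graph[u][v] -= flow
    return transcripts

# Two isoforms sharing source 's' and sink 't', with exon nodes 'a' and 'b'.
g = {'s': {'a': 10, 'b': 4}, 'a': {'t': 10}, 'b': {'t': 4}, 't': {}}
print(decompose(g, 's', 't'))
```

Greedy peeling like this is known to be suboptimal on ambiguous graphs, which is precisely why Ryūtō resolves uncertain nodes with phasing information and LP instead.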
Ryūtō's performance compares favorably with
state-of-the-art methods on both simulated and real-life datasets. Despite
ongoing research and our own contributions, progress on traditional
single-sample assembly has brought no major breakthrough. Multi-sample RNA-seq
experiments provide more information which, however, is challenging to utilize
due to the large amount of accumulating errors. An extension to Ryūtō
enables the reconstruction of consensus transcriptomes from multiple RNA-seq
data sets, incorporating consensus calling at low level features. Benchmarks
show stable improvements already at 3 replicates.
Ryūtō outperforms competing approaches, providing a better and user-adjustable
sensitivity-precision trade-off. Ryūtō consistently improves assembly on
replicates, demonstrable also when mixing conditions or time series and for
differential expression analysis. Ryūtō's approach to guided assembly is
likewise distinctive: it allows users to adjust results based on the quality of
the guide, even for multi-sample assembly.

1 Preface
1.1 Assembly: A vast and fast evolving field
1.2 Structure of this Work
1.3 Available
2 Introduction
2.1 Mathematical Background
2.2 High-Throughput Sequencing
2.3 Assembly
2.4 Transcriptome Expression
3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
3.1 Background
3.2 Strategy
3.3 Data preprocessing
3.4 Processing of the overlap graph
3.5 Post Processing of the Path Decomposition
3.6 Benchmarking
3.7 MuCHSALSA – Moving towards the future
4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
4.1 Background
4.2 Strategy
4.3 The Ryūtō core algorithm
4.4 Improved Multi-sample transcript assembly with Ryūtō
5 Conclusion & Future Work
5.1 Discussion and Outlook
5.2 Summary and Conclusion
Examining recombination and intra-genomic conflict dynamics in the evolution of anti-microbial resistant bacteria
The spread of antimicrobial resistance (AMR) among pathogenic bacterial species threatens to undercut much of the progress made in treating infectious diseases. AMR genes can disseminate between and within populations via horizontal gene transfer (HGT). Selfish mobile genetic elements (MGEs) can encode resistance and spread between host cells. Homologous recombination with resistant donors can likewise alter the core genes of pathogens via HGT. MGEs may be cured from host genomes through transformation; hence, MGEs may be able to avoid deletion by disrupting transformation. This work aims to understand how the dynamics of these processes affect the epidemiology of AMR pathogens.
To understand these dynamics, I co-developed a new version of the popular recombination detection tool Gubbins. Through simulation studies, I find this new version to be both accurate in reconstructing the relationships between isolates, and efficient in terms of its use of computational resources.
I then apply Gubbins to both AMR lineages and species-wide datasets of the pathogen Streptococcus pneumoniae. I find that recombination frequently occurs around core genes involved in both drug resistance and the host immune response. Additionally, an MGE was able to successfully spread within a population by disrupting the transformation machinery, preventing its loss from the host.
Finally, I investigate two recent examples of MGEs disrupting transformation in the gram-negative species Acinetobacter baumannii and Legionella pneumophila. I find that while these insertions may decrease the efficiency of transformation within cells, the observed recombination rates largely reflect the selection pressures on isolates, with MGEs only partially able to inhibit these observable transformation events.
These results show how selection pressures from clinical interventions shape pathogen genomes through diverse, often interspecies, recombination events. The spread of MGEs can also be favoured by both these selection pressures, and their ability to disrupt host cell machinery.
Open Access
Special Topics in Information Technology
This open access book presents thirteen outstanding doctoral dissertations in Information Technology from the Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy. Information Technology has always been highly interdisciplinary, as many aspects have to be considered in IT systems. The doctoral studies program in IT at Politecnico di Milano emphasizes this interdisciplinary nature, which is becoming more and more important in recent technological advances, in collaborative projects, and in the education of young researchers. Accordingly, the focus of advanced research is on pursuing a rigorous approach to specific research topics starting from a broad background in various areas of Information Technology, especially Computer Science and Engineering, Electronics, Systems and Control, and Telecommunications. Each year, more than 50 PhDs graduate from the program. This book gathers the outcomes of the thirteen best theses defended in 2020-21 and selected for the IT PhD Award. Each of the authors provides a chapter summarizing his/her findings, including an introduction, description of methods, main achievements and future work on the topic. Hence, the book provides a cutting-edge overview of the latest research trends in Information Technology at Politecnico di Milano, presented in an easy-to-read format that will also appeal to non-specialists.
Quantitative Approaches to the Genomics of Clonal Evolution
Many problems in the biological sciences reduce to questions of genetic evolution. Entire classes of medical pathology, such as malignant neoplasia or infectious disease, can be viewed in the light of Darwinian competition of genomes. With the benefit of today's maturing sequencing technologies we can observe and quantify genetic evolution with nucleotide resolution. This provides a molecular view of genetic material that has adapted, or is in the process of adapting, to its local selection pressures. A series of problems will be discussed in this thesis, all involving the mathematical modeling of genomic data derived from clonally evolving populations. We use a variety of computational approaches to characterize over-represented features in the data, with the underlying hypothesis that we may be detecting fitness-conferring features of the biology.
In Part I we consider the cross-sectional sampling of human tumors via RNA-sequencing, and devise computational pipelines for detecting oncogenic gene fusions and oncovirus infections. Genomic translocation and oncovirus infection can each be a highly penetrant alteration in a tumor's evolutionary history, with famous examples of both populating the cancer biology literature. In order to exert a transforming influence over the host cell, gene fusions and viral genetic programs need to be expressed and thus can be detected via whole transcriptome sequencing of a malignant cell population. We describe our approaches to predicting oncogenic gene fusions (Chapter 2) and quantifying host-viral interactions (Chapter 3) in large panels of human tumor tissue. The alterations that we characterize prompt the larger question of how the genetics of tumors and viruses might vary in time, leading us to the study of serially sampled populations.
In Part II we consider longitudinal sampling of a clonally evolving population. Phylogenetic trees are the standard representation of a clonal process, an evolutionary picture as old as Darwin's voyages on the Beagle. Chapter 4 first reviews phylogenetic inference and then introduces a certain phylogenetic tree space that forms the starting point of our work on the topic. Specifically, Chapter 4 describes the construction of our projective tree space along with an explicit implementation for visualizing point clouds of rescaled trees. The Chapter finishes by defining a method for stable dimensionality reduction of large phylogenies, which is useful for analyzing long genomic time series. In Chapter 5 we consider medically relevant instances of clonal evolution and the longitudinal genetic data sets to which they give rise. We analyze data from (i) the sequencing of cancers along their therapeutic course, (ii) the passaging of a xenografted tumor through a mouse model, and (iii) the seasonal surveillance of H3N2 influenza's hemagglutinin segment. A novel approach to predicting influenza vaccine effectiveness is demonstrated using statistics of point clouds in tree spaces.
Our investigations into clonal processes may be extended beyond naturally occurring genomes. In Part III we focus on the directed clonal evolution of populations of synthetic RNAs in vitro. Analogous to the selection pressures exerted upon malignant cells or viral particles, these synthetic RNA genomes can be evolved against a desired fitness objective. We investigate fitness objectives related to reprogramming ribosomal translation. Chapter 6 identifies high-fitness RNA pseudoknot geometries capable of inducing ribosomal frameshifting, while Chapter 7 takes an unbiased approach to evolving sequence and structural elements that promote stop codon readthrough.
Algorithms and Data Structures for Indexing, Querying, and Analyzing Large Collections of Sequencing Data in the Presence or Absence of a Reference
High-throughput sequencing has helped to transform our study of biological organisms and processes. For example, RNA-seq is one popular sequencing assay that allows measuring dynamic transcriptomes and enables the discovery (via assembly) of novel transcripts. Likewise, metagenomic sequencing lets us probe natural environments to profile organismal diversity and to discover new strains and species that may be integral to the environment or process being studied. The vast amount of available sequencing data, and its growth rate over the past decade, also bring with them some immense computational challenges. One of these is how to design memory-efficient structures for indexing and querying this data. This challenge is not limited to raw sequencing data (i.e. reads) but extends to the growing collection of reference sequences (genomes and genes) that are assembled from this raw data. We have developed new data structures (both reference-based and reference-free) to index raw sequencing data and assembled reference sequences. Specifically, we describe three separate indices: “Pufferfish”, an index over a set of genomes or transcriptomes, and “Rainbowfish” and “Mantis”, which are both indices over sets of raw sequencing data sets. All of these indices are designed for high query performance together with memory-efficient construction and querying.
The Pufferfish data structure is based on constructing a compacted, colored, reference de Bruijn graph (ccdbg), and then indexing this structure in an efficient manner. We have developed both sparse and dense indexing schemes which allow trading index space for query speed (though queries always remain asymptotically optimal). Pufferfish provides a full reference index that can return the set of references, positions and orientations of any k-mer (substring of fixed length k) in the input genomes. We have built an alignment tool, Puffaligner, around this index for aligning sequencing reads to reference sequences. We demonstrate that Puffaligner is able to produce highly-sensitive alignments, similar to those of Bowtie2, but much more quickly, and exhibits speed similar to the ultrafast STAR aligner while requiring considerably less memory to construct its index and align reads.
The Rainbowfish and Mantis data structures, on the other hand, are based on reference-free colored de Bruijn graphs (cdbg) constructed over raw sequencing data. Rainbowfish introduces a new efficient representation of the color information which is then adopted and refined by Mantis. Mantis supports graph traversal and other topological analyses, but is also particularly well-suited for large-scale sequence-level search over thousands of samples. We develop multiple and successively-refined versions of the Mantis index, culminating in an index that adopts a minimizer-partitioned representation of the underlying k-mer set and a referential encoding of the color information that exploits fast near-neighbor search and efficient encoding via a minimum spanning tree. We describe, further, how this index can be made incrementally updatable by developing an efficient merge algorithm and storing the overall index in a multi-level log-structured merge (LSM) tree. We demonstrate the utility of this index by building a searchable Mantis, via recursive merging, over 10,000 raw sequencing samples, which we then scale to over 15,000 samples via incremental update. This index can be queried, on a commodity server, to discover the samples likely containing thousands of reference sequences in only a few minutes.
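As a naive illustration of what such a full reference index returns (a plain hash table, nothing like Pufferfish's compacted colored de Bruijn graph or its sparse/dense sampling schemes; all names here are invented), a k-mer can be mapped to its references, positions, and orientations as follows:

```python
def build_kmer_index(references, k):
    """Toy full reference index: maps every k-mer of each reference,
    and its reverse complement, to a list of (ref_name, pos, orientation).
    """
    comp = str.maketrans('ACGT', 'TGCA')
    index = {}
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            index.setdefault(kmer, []).append((name, i, '+'))
            # Reverse complement hits map back to the forward position.
            rc = kmer.translate(comp)[::-1]
            index.setdefault(rc, []).append((name, i, '-'))
    return index

def query(index, kmer):
    # Returns all (reference, position, orientation) hits, or [].
    return index.get(kmer, [])

idx = build_kmer_index({'ref1': 'ACGTAC'}, 3)
print(query(idx, 'ACG'))
```

A real index replaces the hash table with a compacted graph plus rank/select structures so that memory stays proportional to the unique sequence content, not to the raw number of k-mer occurrences.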
Investigating human population structure through time with new computational methods and ancient DNA data
The revolution of sequencing technology has brought an exponential increase in the production of genomic data. This thesis tackles global and continental questions on human demographic history from two directions using genetic data. Manuscript A provides a novel analytical method for estimating migration rate and effective population size utilizing high-coverage whole-genome sequences, while manuscripts B and C reveal the history of population movements and interactions by directly analyzing genome-wide data from ancient individuals.