Inapproximability of maximal strip recovery
In comparative genomics, the first step of sequence analysis is usually to
decompose two or more genomes into syntenic blocks that are segments of
homologous chromosomes. For the reliable recovery of syntenic blocks, noise and
ambiguities in the genomic maps need to be removed first. Maximal Strip
Recovery (MSR) is an optimization problem proposed by Zheng, Zhu, and Sankoff
for reliably recovering syntenic blocks from genomic maps in the midst of noise
and ambiguities. Given d genomic maps as sequences of gene markers, the
objective of MSR-d is to find d subsequences, one subsequence of each
genomic map, such that the total length of syntenic blocks in these
subsequences is maximized. For any constant d ≥ 2, a polynomial-time
2d-approximation for MSR-d was previously known. In this paper, we show that
for any d ≥ 2, MSR-d is APX-hard, even for the most basic version of the
problem in which all gene markers are distinct and appear in positive
orientation in each genomic map. Moreover, we provide the first explicit lower
bounds on approximating MSR-d for all d ≥ 2. In particular, we show that
MSR-d is NP-hard to approximate within Ω(d/log d). From the other
direction, we show that the previous 2d-approximation for MSR-d can be
optimized into a polynomial-time algorithm even if d is not a constant but is
part of the input. We then extend our inapproximability results to several
related problems including CMSR-d, δ-gap-MSR-d, and
δ-gap-CMSR-d.
Comment: A preliminary version of this paper appeared in two parts in the
Proceedings of the 20th International Symposium on Algorithms and Computation
(ISAAC 2009) and the Proceedings of the 4th International Frontiers of
Algorithmics Workshop (FAW 2010).
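To make the maximal strip recovery objective concrete, the following is a minimal brute-force sketch for two genomic maps in the basic setting described above (all markers distinct and positively oriented). It is not the approximation algorithm from the paper: it tries every subset of markers, so it is only usable on toy inputs. The function names `strip_length` and `msr2_brute_force` are hypothetical, and a strip is assumed here to be a run of at least two markers that are consecutive and in the same order in both maps.

```python
from itertools import combinations


def strip_length(map_a, map_b):
    """Total length of strips shared by two maps over the same distinct markers.

    A strip is taken to be a maximal run of >= 2 markers that appear
    consecutively and in the same order in both maps (assumption).
    """
    pos_b = {marker: i for i, marker in enumerate(map_b)}
    total, run = 0, 1
    for prev, cur in zip(map_a, map_a[1:]):
        if pos_b[cur] == pos_b[prev] + 1:
            run += 1          # cur extends the current run in both maps
        else:
            if run >= 2:
                total += run  # close a strip of length >= 2
            run = 1
    if run >= 2:
        total += run
    return total


def msr2_brute_force(map_a, map_b):
    """Exhaustively choose which markers to keep and maximize strip length."""
    markers = list(map_a)
    best = 0
    for k in range(len(markers), 1, -1):
        for keep in combinations(markers, k):
            kept = set(keep)
            sub_a = [m for m in map_a if m in kept]
            sub_b = [m for m in map_b if m in kept]
            best = max(best, strip_length(sub_a, sub_b))
    return best


if __name__ == "__main__":
    # Deleting marker 3 (or 2) leaves the single strip 1 2 4 5 in both maps.
    print(msr2_brute_force([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))
```

The exponential enumeration is deliberate: the abstract states that the problem is APX-hard, so this sketch only illustrates the objective, not an efficient method.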
Bacterial microevolution and the Pangenome
The comparison of multiple genome sequences sampled from a bacterial population reveals considerable diversity in both the core and the accessory parts of the pangenome. This diversity can be analysed in terms of microevolutionary events that took place since the genomes shared a common ancestor, especially deletion, duplication, and recombination. We review the basic modelling ingredients used implicitly or explicitly when performing such a pangenome analysis. In particular, we describe a basic neutral phylogenetic framework of bacterial pangenome microevolution, which is not incompatible with evaluating the role of natural selection. We survey the different ways in which pangenome data is summarised in order to be included in microevolutionary models, as well as the main methodological approaches that have been proposed to reconstruct pangenome microevolutionary history.
AccuSyn: Using Simulated Annealing to Declutter Genome Visualizations
We apply Simulated Annealing, a well-known metaheuristic for obtaining near-optimal solutions to optimization problems, to discover conserved synteny relations (similar features) in genomes. The analysis of synteny gives biologists insights into the evolutionary history of species and the functional relationships between genes. However, as even simple organisms have huge numbers of genomic features, syntenic plots initially present an enormous clutter of connections, making the structure difficult to understand. We address this problem by using Simulated Annealing to minimize link crossings. Our interactive web-based synteny browser, AccuSyn, visualizes syntenic relations with circular plots of chromosomes and draws links between similar blocks of genes. It also brings together a huge amount of genomic data by integrating an adjacent view and additional tracks, to visualize the details of the blocks and accompanying genomic data, respectively. Our work shows multiple ways to manually declutter a synteny plot and then thoroughly explains how we integrated Simulated Annealing, along with human interventions as a human-in-the-loop approach, to achieve an accurate representation of conserved synteny relations for any genome. The goal of AccuSyn was to make a fairly complete tool combining ideas from four major areas: genetics, information visualization, heuristic search, and human-in-the-loop. Our results contribute to a better understanding of synteny plots and show the potential that decluttering algorithms have for syntenic analysis, adding more clues for evolutionary development. At the time of writing, AccuSyn is already actively used in research at the University of Saskatchewan and has already produced a visualization of the recently sequenced wheat genome.
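The idea of using Simulated Annealing to minimize link crossings on a circular plot can be sketched as follows. This is an illustrative toy, not AccuSyn's implementation: the move set (swapping two blocks on the circle), the cooling schedule, and the names `count_crossings` and `anneal` are all assumptions made for the example. Two links are counted as crossing when their endpoints interleave around the circle.

```python
import math
import random


def count_crossings(order, links):
    """Count pairs of links whose endpoints interleave on the circle."""
    pos = {block: i for i, block in enumerate(order)}
    chords = [tuple(sorted((pos[a], pos[b]))) for a, b in links]
    crossings = 0
    for i in range(len(chords)):
        for j in range(i + 1, len(chords)):
            (a, b), (c, d) = chords[i], chords[j]
            if a < c < b < d or c < a < d < b:
                crossings += 1
    return crossings


def anneal(order, links, steps=2000, t0=2.0, cooling=0.995, seed=0):
    """Swap-based simulated annealing; returns the best order seen and its cost."""
    rng = random.Random(seed)
    current = list(order)
    cur_cost = count_crossings(current, links)
    best, best_cost = list(current), cur_cost
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = count_crossings(current, links)
        delta = new_cost - cur_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
        temp *= cooling
    return best, best_cost


if __name__ == "__main__":
    blocks = list("ABCDEFGH")
    links = [("A", "E"), ("B", "F"), ("C", "G"), ("D", "H"), ("A", "C")]
    print(count_crossings(blocks, links))
    print(anneal(blocks, links))
```

Because the best order seen so far is tracked separately from the current state, the returned layout never has more crossings than the starting one, even though individual annealing steps may temporarily accept worse layouts.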
The zero exemplar distance problem
Given two genomes with duplicate genes, Zero Exemplar Distance is
the problem of deciding whether the two genomes can be reduced to the same
genome without duplicate genes by deleting all but one copy of each gene in
each genome. Blin, Fertin, Sikora, and Vialette recently proved that
Zero Exemplar Distance for monochromosomal genomes is NP-hard even if
each gene appears at most two times in each genome, thereby settling an
important open question on genome rearrangement in the exemplar model. In this
paper, we give a very simple alternative proof of this result. We also study
the problem Zero Exemplar Distance for multichromosomal genomes
without gene order, and prove the analogous result that it is also NP-hard even
if each gene appears at most two times in each genome. For the positive
direction, we show that both variants of Zero Exemplar Distance admit
polynomial-time algorithms if each gene appears exactly once in one genome and
at least once in the other genome. In addition, we present a polynomial-time
algorithm for the related problem Exemplar Longest Common Subsequence
in the special case that each mandatory symbol appears exactly once in one
input sequence and at least once in the other input sequence. This answers an
open question of Bonizzoni et al. We also show that Zero Exemplar
Distance for multichromosomal genomes without gene order is fixed-parameter
tractable if the parameter is the maximum number of chromosomes in each genome.
Comment: Strengthened and reorganized
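The decision problem defined above can be illustrated with a small exhaustive check for monochromosomal genomes. This is a sketch under simplifying assumptions, not an algorithm from the paper: genes are treated as unsigned symbols, genome reversal is ignored, and the hypothetical helpers `exemplars` and `zero_exemplar_distance` enumerate every way of keeping one copy per gene, which is exponential in the number of duplicated genes.

```python
from itertools import product


def exemplars(genome):
    """All sequences obtained by keeping exactly one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    result = set()
    # Pick one occurrence per gene, then read the kept genes in genome order.
    for choice in product(*positions.values()):
        result.add(tuple(genome[i] for i in sorted(choice)))
    return result


def zero_exemplar_distance(genome_1, genome_2):
    """True if both genomes can be reduced to the same duplicate-free genome."""
    if set(genome_1) != set(genome_2):
        return False
    return not exemplars(genome_1).isdisjoint(exemplars(genome_2))


if __name__ == "__main__":
    # "abcab" and "cbab" can both be reduced to "cab".
    print(zero_exemplar_distance("abcab", "cbab"))
    print(zero_exemplar_distance("abc", "cba"))
```

The brute-force enumeration is consistent with the hardness result quoted above: the problem is NP-hard even when each gene appears at most twice, so no efficient general procedure is expected.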
SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS
In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise-hashing (minhash) and locality-sensitive hashing (LSH). The streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (Kmer-Estimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use a suffix tree, a trie data structure, for alignment-free, non-pairwise algorithms for a conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for identification of longer CNSs ( 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Also using LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust) that also uses minhash and LSH techniques was developed for an isoform clustering problem. Isoforms are generated from the same gene but by alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequencing clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve the assembly accuracy using ensemble approaches. First, we did a comprehensive performance analysis on different transcriptome assemblers using simulated benchmark datasets. Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform-clustering using minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods.
Adviser: Jitender S. Deogu
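The use of minhash to estimate Jaccard similarity between k-mer sets can be sketched in a few lines. This is a generic illustration of the technique, not the MinCNE or MinIsoClust code: the helper names, the choice of salted BLAKE2b as the hash family, and the signature length of 128 are assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=128):
    """One minimum per salted hash function (salted BLAKE2b, an assumption)."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a, b = kmers("ACGT", 2), kmers("CGTA", 2)
    print(jaccard(a, b))  # exact value: 2 shared k-mers out of 4 distinct
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

For each hash function, the two sets share the same minimum with probability equal to their Jaccard similarity, so the fraction of agreeing positions is an unbiased estimate whose variance shrinks as the signature grows.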
Inferring genome-scale rearrangement phylogeny and ancestral gene order: a Drosophila case study
A simple, fast, and biologically inspired computational approach to infer genome-scale rearrangement phylogeny and ancestral gene order has been developed and applied to eight Drosophila genomes, providing insights into evolutionary chromosomal dynamics.