Efficient algorithms for gene cluster detection in prokaryotic genomes
Schmidt T. Efficient algorithms for gene cluster detection in prokaryotic genomes. Bielefeld (Germany): Bielefeld University; 2005.
Genomics research has advanced rapidly in the last few years, and the number of completely sequenced genomes keeps growing thanks to semi-automatic sequencing machines. Many of these sequences, mostly prokaryotic ones, are also well annotated, which means that the positions of their genes and parts of their regulatory or metabolic pathways are known. A new task in bioinformatics is now to derive gene or protein information from the comparison of genomes at a higher level.
In "comparative genomics", researchers in bioinformatics attempt to locate groups or clusters of orthologous genes that may have the same function in multiple genomes. This research is often anchored in the simple but biologically verified fact that functionally related proteins are usually encoded by genes located in close genomic neighborhood across different species.
From an algorithmic and combinatorial point of view, the first descriptions of the concept of "closely placed genes" were fragmentary and sometimes confusing. The proposed algorithms often lacked the formal grounding needed to prove their correctness or to assess their complexity.
In the first formal models of a conserved genomic neighborhood, genomes are often represented as permutations of their genes, and common intervals, i.e. intervals containing the same set of genes, are interpreted as gene clusters. The major disadvantage of representing genomes as permutations, however, is that paralogous copies of the same gene within one genome cannot be modelled. Since large genomes in particular contain numerous paralogous genes, this model is insufficient for real genomic data.
In this work, we consider a modified model of gene clusters that allows paralogs, simply by representing genomes as sequences rather than permutations of genes. We define common intervals based on this model, and we present a simple algorithm that finds all common intervals of two sequences in Θ(n^2) time using Θ(n^2) space. Another, more complicated algorithm runs in O(n^2) time and uses only linear space. We also show how to extend these algorithms to more than two genomes, and we present the implementation of the algorithms as well as the visualization of the located clusters in the tool Gecko. Since creating the string representation of a set of genomes is a non-trivial task, we also present the data preparation tool GhostFam, which groups all genes from the given set of genomes into their families of homologs. In an evaluation on a set of 20 bacterial genomes, we show that the presented approach can correctly locate gene clusters that are known from the literature and can successfully predict new groups of functionally related genes.
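To make the notion of a common interval of two sequences concrete, here is a brute-force sketch in Python. It is not the algorithm from the thesis and makes no attempt to match its time or space bounds: it enumerates every substring of each sequence, records its character set, and reports the sets that occur in both. The function names `char_sets` and `common_intervals` are hypothetical, and genes are assumed to be encoded as integer family identifiers.

```python
def char_sets(seq):
    """Map each character set occurring in seq to the inclusive
    (start, end) intervals whose elements form exactly that set."""
    found = {}
    for i in range(len(seq)):
        seen = set()
        for j in range(i, len(seq)):
            seen.add(seq[j])  # extend the interval [i, j] by one position
            found.setdefault(frozenset(seen), []).append((i, j))
    return found


def common_intervals(s1, s2):
    """Return {character set: (intervals in s1, intervals in s2)} for
    every character set that is realized by an interval in both sequences."""
    a, b = char_sets(s1), char_sets(s2)
    return {cs: (a[cs], b[cs]) for cs in a.keys() & b.keys()}


if __name__ == "__main__":
    # Repeated identifiers (paralogs) are allowed because inputs are
    # sequences, not permutations.
    for genes, (left, right) in common_intervals([1, 2, 1, 3], [3, 1, 2]).items():
        print(sorted(genes), left, right)
```

Because the inputs are plain sequences, the same gene family may appear several times in one genome, which is exactly the case the permutation model cannot express.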
Longest Common Separable Pattern between Permutations
In this article, we study the problem of finding the longest common separable
pattern between several permutations. We give a polynomial-time algorithm when
the number of input permutations is fixed and show that the problem is NP-hard
for an arbitrary number of input permutations even if these permutations are
separable. On the other hand, we show that the NP-hard problem of finding the
longest common pattern between two permutations cannot be approximated better
than within a ratio of sqrt(Opt) (where Opt is the size of an optimal
solution) when taking common patterns belonging to pattern-avoiding classes of
permutations. Comment: 15 pages
Longest Common Pattern between two Permutations
In this paper, we give a polynomial (O(n^8)) algorithm for finding a longest
common pattern between two permutations of size n given that one is separable.
We also give an algorithm for general permutations whose complexity depends on
the length of the longest simple permutation involved in one of our
permutations.
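As an illustration of the problem statement only, the following Python sketch finds a longest common pattern of two small permutations by exhaustive search. It is exponential in the input size and is unrelated to the polynomial algorithm described in the abstract. The helper names `normalize`, `patterns`, and `longest_common_pattern` are hypothetical; a pattern is taken to be the order-isomorphism class of a subsequence, written as a permutation of 1..k.

```python
from itertools import combinations


def normalize(seq):
    """Replace the values of seq by their ranks 1..k, keeping relative order."""
    rank = {v: r for r, v in enumerate(sorted(seq), start=1)}
    return tuple(rank[v] for v in seq)


def patterns(perm, k):
    """All patterns of length k that occur in perm as a subsequence."""
    return {
        normalize([perm[i] for i in idx])
        for idx in combinations(range(len(perm)), k)
    }


def longest_common_pattern(p, q):
    """Return one longest pattern contained in both p and q (brute force)."""
    for k in range(min(len(p), len(q)), 0, -1):
        common = patterns(p, k) & patterns(q, k)
        if common:
            return min(common)  # deterministic choice among ties
    return ()


if __name__ == "__main__":
    print(longest_common_pattern([2, 4, 1, 3], [1, 2, 3, 4]))
```

The search tries lengths from largest to smallest, so the first non-empty intersection gives a longest common pattern.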
Conservative Hypothesis Tests and Confidence Intervals using Importance Sampling
Importance sampling is a common technique for Monte Carlo approximation,
including Monte Carlo approximation of p-values. Here it is shown that a simple
correction of the usual importance sampling p-values creates valid p-values,
meaning that a hypothesis test created by rejecting the null when the p-value
is <= alpha will also have a type I error rate <= alpha. This correction uses
the importance weight of the original observation, which gives valuable
diagnostic information under the null hypothesis. Using the corrected p-values
can be crucial for multiple testing and also in problems where evaluating the
accuracy of importance sampling approximations is difficult. Inverting the
corrected p-values provides a useful way to create Monte Carlo confidence
intervals that maintain the nominal significance level and use only a single
Monte Carlo sample. Several applications are described, including accelerated
multiple testing for a large neurophysiological dataset and exact conditional
inference for a logistic regression model with nuisance parameters. Comment: 26 pages, 3 figures, 3 tables [significant rewrite of version 1,
including additional examples, title change
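The abstract does not spell out the correction, so the following Python sketch rests on an assumption: that the corrected p-value adds the importance weight of the observed data to the numerator and one to the denominator of the usual unnormalized importance sampling estimate. The function name `corrected_is_pvalue` and its parameters are hypothetical, and capping the result at 1 is a choice made here, not something taken from the abstract.

```python
def corrected_is_pvalue(t_obs, w_obs, t_samples, w_samples):
    """Corrected importance sampling p-value (assumed form).

    t_obs      test statistic of the observed data
    w_obs      importance weight (target density / proposal density)
               evaluated at the observed data
    t_samples  test statistics of n draws from the proposal
    w_samples  importance weights of those n draws
    """
    n = len(t_samples)
    # Weighted count of proposal draws at least as extreme as the observation.
    tail = sum(w for t, w in zip(t_samples, w_samples) if t >= t_obs)
    # Assumed correction: include the observation's own weight and count.
    return min(1.0, (w_obs + tail) / (n + 1))


if __name__ == "__main__":
    print(corrected_is_pvalue(2.0, 0.5, [1.0, 2.0, 3.0], [2.0, 0.5, 1.0]))
```

When every weight equals 1, this expression reduces to the familiar Monte Carlo p-value (1 + count) / (n + 1), which is a useful sanity check on the assumed form.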
Colorful Strips
Given a planar point set and an integer k, we wish to color the points with
k colors so that any axis-aligned strip containing enough points contains all
colors. The goal is to bound the necessary size of such a strip, as a function
of k. We show that if the strip size is at least 2k-1, such a coloring
can always be found. We prove that the size of the strip is also bounded in any
fixed number of dimensions. In contrast to the planar case, we show that
deciding whether a 3D point set can be 2-colored so that any strip containing
at least three points contains both colors is NP-complete.
We also consider the problem of coloring a given set of axis-aligned strips,
so that any sufficiently covered point in the plane is covered by k colors.
We show that in d dimensions the required coverage is at most d(k-1)+1.
Lower bounds are given for the two problems. This complements recent
impossibility results on decomposition of strip coverings with arbitrary
orientations. Finally, we study a variant where strips are replaced by wedges.
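The planar point-coloring condition can be checked mechanically. The Python sketch below is a verifier, not a coloring algorithm, and it assumes that no two points share an x-coordinate or a y-coordinate. Under that assumption, the points inside an axis-aligned strip are consecutive in the sorted order along one axis, so it suffices to inspect every window of s consecutive points. The name `is_valid_strip_coloring` and the parameters k (number of colors, labelled 0..k-1) and s (strip size) are hypothetical.

```python
def is_valid_strip_coloring(points, colors, k, s):
    """Check that every axis-aligned strip containing at least s points
    contains all k colors. points[i] is an (x, y) pair and colors[i]
    is the color of points[i], an integer in range(k)."""
    all_colors = set(range(k))
    for axis in (0, 1):  # 0: vertical strips (sort by x), 1: horizontal (sort by y)
        order = sorted(range(len(points)), key=lambda i: points[i][axis])
        # Any strip with more than s points contains a window of exactly s
        # consecutive points, so checking windows of size s is enough.
        for start in range(len(order) - s + 1):
            window = {colors[i] for i in order[start:start + s]}
            if not all_colors <= window:
                return False
    return True


if __name__ == "__main__":
    pts = [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(is_valid_strip_coloring(pts, [0, 1, 0, 1], k=2, s=2))
```

If the point set has fewer than s points, no strip is large enough and the check passes vacuously.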
Interlocked permutations
The zero-error capacity of channels with a countably infinite input alphabet
formally generalises Shannon's classical problem about the capacity of discrete
memoryless channels. We solve the problem for three particular channels. Our
results are purely combinatorial and in line with previous work of the third
author about permutation capacity. Comment: 8 pages