41,209 research outputs found

    Efficient algorithms for gene cluster detection in prokaryotic genomes

    Schmidt T. Efficient algorithms for gene cluster detection in prokaryotic genomes. Bielefeld (Germany): Bielefeld University; 2005.
    Genomics research has emerged rapidly in the last few years, and the number of completely sequenced genomes is increasing continuously thanks to semi-automatic sequencing machines. Many of these sequences, mostly prokaryotic ones, are also well annotated, which means that the positions of their genes and parts of their regulatory or metabolic pathways are known. A new task in bioinformatics is to gain gene or protein information from the comparison of genomes on a higher level. In "comparative genomics", researchers attempt to locate groups or clusters of orthologous genes that may have the same function in multiple genomes. This research is often anchored in the simple but biologically verified fact that functionally related proteins are usually coded by genes located in close genomic neighborhood in different species. From an algorithmic and combinatorial point of view, the first descriptions of the concept of "closely placed genes" were only fragmentary and sometimes confusing. The given algorithms often lack the grounds necessary to prove their correctness or to assess their complexity. In the first formal models of a conserved genomic neighborhood, genomes are often represented as permutations of their genes, and common intervals, i.e. intervals containing the same set of genes, are interpreted as gene clusters. The major disadvantage of representing genomes as permutations is that paralogous copies of the same gene inside one genome cannot be modelled. Since large genomes in particular contain numerous paralogous genes, this model is insufficient for real genomic data.
In this work, we consider a modified model of gene clusters that allows paralogs, simply by representing genomes as sequences rather than permutations of genes. We define common intervals based on this model, and we present a simple algorithm that finds all common intervals of two sequences in $\Theta(n^2)$ time using $\Theta(n^2)$ space. Another, more complicated algorithm runs in $O(n^2)$ time and uses only linear space. We also show how to extend these algorithms to more than two genomes and present the implementation of the algorithms as well as the visualization of the located clusters in the tool Gecko. Since the creation of the string representation of a set of genomes is a non-trivial task, we also present the data preparation tool GhostFam that groups all genes from the given set of genomes to their families of homologs. In the evaluation on a set of 20 bacterial genomes, we show that with the presented approach it is possible to correctly locate gene clusters that are known from the literature, and to successfully predict new groups of functionally related genes.
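
To make the notion of a common interval on sequences concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the thesis: it runs in roughly cubic time, and the function names `maximal_intervals` and `common_intervals` are hypothetical. It assumes genes are given as family identifiers, that intervals are inclusive index pairs, and that only maximal intervals (those that cannot be extended without changing their gene set) are reported.

```python
from typing import Dict, FrozenSet, Hashable, List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) positions


def maximal_intervals(seq: Sequence[Hashable]) -> Dict[FrozenSet, List[Interval]]:
    """Map each gene set to the maximal intervals of seq having exactly that set."""
    n = len(seq)
    result: Dict[FrozenSet, List[Interval]] = {}
    for i in range(n):
        genes = set()
        for j in range(i, n):
            genes.add(seq[j])
            # Maximal: extending left or right would add a new gene.
            left_ok = i == 0 or seq[i - 1] not in genes
            right_ok = j == n - 1 or seq[j + 1] not in genes
            if left_ok and right_ok:
                result.setdefault(frozenset(genes), []).append((i, j))
    return result


def common_intervals(a: Sequence[Hashable], b: Sequence[Hashable]):
    """Gene sets that occur as maximal intervals in both sequences."""
    ia, ib = maximal_intervals(a), maximal_intervals(b)
    return {genes: (ia[genes], ib[genes]) for genes in ia.keys() & ib.keys()}


# Gene family 1 has two paralogous copies in the first genome.
clusters = common_intervals([1, 2, 1, 3], [3, 1, 2])
print(clusters[frozenset({1, 2})])  # ([(0, 2)], [(1, 2)])
```

Because the first genome contains family 1 twice, it could not be written as a permutation, yet the gene set {1, 2} is still found as a shared cluster.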

    Longest Common Separable Pattern between Permutations

    In this article, we study the problem of finding the longest common separable pattern between several permutations. We give a polynomial-time algorithm when the number of input permutations is fixed and show that the problem is NP-hard for an arbitrary number of input permutations even if these permutations are separable. On the other hand, we show that the NP-hard problem of finding the longest common pattern between two permutations cannot be approximated better than within a ratio of $\sqrt{Opt}$ (where $Opt$ is the size of an optimal solution) when taking common patterns belonging to pattern-avoiding classes of permutations. Comment: 15 pages

    Longest Common Pattern between two Permutations

    In this paper, we give a polynomial ($O(n^8)$) algorithm for finding a longest common pattern between two permutations of size $n$ given that one is separable. We also give an algorithm for general permutations whose complexity depends on the length of the longest simple permutation involved in one of our permutations.
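
For readers unfamiliar with the problem, the following Python sketch shows what a longest common pattern is, using exhaustive search. It is an exponential-time illustration only, not the polynomial algorithm of the paper; the helper names `normalize`, `patterns_of_length`, and `longest_common_pattern` are hypothetical. A pattern is taken here to be the order-isomorphic reduction of a subsequence to the values 1..k.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple


def normalize(values: Sequence[int]) -> Tuple[int, ...]:
    """Reduce distinct values to the order-isomorphic permutation of 1..k."""
    rank = {v: r for r, v in enumerate(sorted(values), start=1)}
    return tuple(rank[v] for v in values)


def patterns_of_length(perm: Sequence[int], k: int) -> Set[Tuple[int, ...]]:
    """All patterns of length k that occur in perm as a subsequence."""
    # combinations() keeps positional order, so each tuple is a subsequence.
    return {normalize(sub) for sub in combinations(perm, k)}


def longest_common_pattern(p: Sequence[int], q: Sequence[int]) -> Tuple[int, ...]:
    """One longest pattern occurring in both p and q (brute force)."""
    for k in range(min(len(p), len(q)), 0, -1):
        common = patterns_of_length(p, k) & patterns_of_length(q, k)
        if common:
            return min(common)  # deterministic choice among ties
    return ()


print(longest_common_pattern((2, 4, 1, 3), (1, 2, 3, 4)))  # (1, 2)
```

The search tries the largest length first, so the first non-empty intersection gives an optimal answer; the cost grows with the number of subsequences, which is why structured cases such as separable permutations are of interest.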

    Conservative Hypothesis Tests and Confidence Intervals using Importance Sampling

    Importance sampling is a common technique for Monte Carlo approximation, including Monte Carlo approximation of p-values. Here it is shown that a simple correction of the usual importance sampling p-values creates valid p-values, meaning that a hypothesis test created by rejecting the null when the p-value is <= alpha will also have a type I error rate <= alpha. This correction uses the importance weight of the original observation, which gives valuable diagnostic information under the null hypothesis. Using the corrected p-values can be crucial for multiple testing and also in problems where evaluating the accuracy of importance sampling approximations is difficult. Inverting the corrected p-values provides a useful way to create Monte Carlo confidence intervals that maintain the nominal significance level and use only a single Monte Carlo sample. Several applications are described, including accelerated multiple testing for a large neurophysiological dataset and exact conditional inference for a logistic regression model with nuisance parameters. Comment: 26 pages, 3 figures, 3 tables [significant rewrite of version 1, including additional examples, title change

    Colorful Strips

    Given a planar point set and an integer $k$, we wish to color the points with $k$ colors so that any axis-aligned strip containing enough points contains all colors. The goal is to bound the necessary size of such a strip, as a function of $k$. We show that if the strip size is at least $2k-1$, such a coloring can always be found. We prove that the size of the strip is also bounded in any fixed number of dimensions. In contrast to the planar case, we show that deciding whether a 3D point set can be 2-colored so that any strip containing at least three points contains both colors is NP-complete. We also consider the problem of coloring a given set of axis-aligned strips, so that any sufficiently covered point in the plane is covered by $k$ colors. We show that in $d$ dimensions the required coverage is at most $d(k-1)+1$. Lower bounds are given for the two problems. This complements recent impossibility results on decomposition of strip coverings with arbitrary orientations. Finally, we study a variant where strips are replaced by wedges.
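
The coloring condition in this abstract can be checked mechanically. The Python sketch below is a verifier only, not the coloring construction from the paper, and the function name `strips_are_colorful` is hypothetical. It assumes that no two points share a coordinate on any axis, so the points inside an axis-aligned strip are exactly a run of consecutive points in the order sorted along that axis.

```python
from typing import Hashable, Sequence, Tuple

Point = Tuple[float, ...]


def strips_are_colorful(points: Sequence[Point], colors: Sequence[Hashable],
                        k: int, size: int) -> bool:
    """True if every axis-aligned strip holding >= size points shows all k colors.

    Assumes distinct coordinates per axis and colors drawn from k values.
    """
    n = len(points)
    if n == 0:
        return True
    for axis in range(len(points[0])):
        order = sorted(range(n), key=lambda i: points[i][axis])
        # Checking windows of exactly `size` points suffices: any larger
        # strip contains such a window.
        for start in range(n - size + 1):
            seen = {colors[i] for i in order[start:start + size]}
            if len(seen) < k:
                return False
    return True


pts = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(strips_are_colorful(pts, [0, 0, 1, 1], k=2, size=3))  # True
print(strips_are_colorful(pts, [0, 0, 1, 1], k=2, size=2))  # False
```

In the example, strip size 3 equals $2k-1$ for $k=2$, the size at which the abstract states a valid coloring always exists in the plane; the verifier only confirms that this particular coloring works.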

    Interlocked permutations

    The zero-error capacity of channels with a countably infinite input alphabet formally generalises Shannon's classical problem about the capacity of discrete memoryless channels. We solve the problem for three particular channels. Our results are purely combinatorial and in line with previous work of the third author about permutation capacity. Comment: 8 pages