
    Mutation model for nucleotide sequences based on crystal basis

    A nucleotide sequence is identified, in the two- (four-) letter alphabet, by the labels of a vector state of an irreducible representation of U_q(sl(2)) (U_q(sl(2) + sl(2))), in the limit q -> 0. A master equation for the distribution function is written, where the intensity of the one-spin flip is assumed to depend on the variation of the labels of the state. In the two-letter approximation, the numerically computed equilibrium distribution for short sequences is well fitted by a Yule distribution, which is the observed distribution of the ranked frequencies of short oligonucleotides in DNA. The four-letter alphabet description, applied to codons, reproduces the form of the fitted rank-ordered usage frequency distribution. Comment: 27 pages, 9 figures

    An Efficient Rank Based Approach for Closest String and Closest Substring

    This paper presents a new genetic approach that uses rank distance to solve two known NP-hard problems, closest string and closest substring, and compares rank distance with other distance measures for strings. For each problem we build a genetic algorithm and describe the genetic operations involved. Both genetic algorithms use a fitness function based on rank distance. We compare our algorithms, on real DNA sequences, with other genetic algorithms that use different distance measures, such as Hamming distance or Levenshtein distance. Our experiments show that the genetic algorithms based on rank distance achieve the best results.
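The abstract does not define rank distance, so the sketch below assumes the commonly cited formulation: each character is indexed by its occurrence number, characters present in both strings contribute the absolute difference of their 1-based positions, and characters present in only one string contribute their position. The function names are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def annotate(s):
    """Map each indexed character (char, k-th occurrence) to its 1-based position."""
    seen = defaultdict(int)
    positions = {}
    for pos, ch in enumerate(s, start=1):
        seen[ch] += 1
        positions[(ch, seen[ch])] = pos
    return positions


def rank_distance(u, v):
    """Rank distance between two strings under the assumed definition above."""
    pu, pv = annotate(u), annotate(v)
    total = 0
    for key in pu.keys() | pv.keys():
        if key in pu and key in pv:
            # Indexed character occurs in both strings: compare positions.
            total += abs(pu[key] - pv[key])
        else:
            # Occurs in only one string: contributes its position there.
            total += pu.get(key, 0) + pv.get(key, 0)
    return total


print(rank_distance("AACG", "ACGA"))
```

Unlike Hamming distance, this measure is defined for strings of different lengths, and it can be computed in linear time, whereas the standard dynamic program for Levenshtein distance is quadratic.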

    Test Set Diameter: Quantifying the Diversity of Sets of Test Cases

    A common and natural intuition among software testers is that test cases need to differ if a software system is to be tested properly and its quality ensured. Consequently, much research has gone into formulating distance measures for how test cases, their inputs and/or their outputs differ. However, common to these proposals is that they are data type specific and/or calculate the diversity only between pairs of test inputs, traces or outputs. We propose a new metric to measure the diversity of sets of tests: the test set diameter (TSDm). It extends our earlier, pairwise test diversity metrics based on recent advances in information theory regarding the calculation of the normalized compression distance (NCD) for multisets. An advantage is that TSDm can be applied regardless of data type and on any test-related information, not only the test inputs. A downside is the increased computational time compared to competing approaches. Our experiments on four different systems show that the test set diameter can help select test sets with higher structural and fault coverage than random selection even when only applied to test inputs. This can enable early test design and selection, prior to even having a software system to test, and complement other types of test automation and analysis. We argue that this quantification of test set diversity creates a number of opportunities to better understand software quality and provides practical ways to increase it. Comment: In submission
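To make the compression-based idea concrete, here is a minimal sketch of the pairwise NCD and of one multiset form. It assumes zlib as the compressor and assumes the multiset form (C(X) - min C(x)) / max C(X \ {x}); whether TSDm uses exactly this form, this compressor, or this concatenation order is not stated in the abstract. The names `ncd` and `ncd1` are illustrative.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, used as a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Pairwise normalized compression distance."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


def ncd1(items: list) -> float:
    """Assumed multiset form: (C(X) - min C(x)) / max C(X without x)."""
    if len(items) < 2:
        raise ValueError("need at least two items")
    whole = c(b"".join(items))
    smallest = min(c(x) for x in items)
    # Largest compressed size over all subsets that leave one item out.
    largest_rest = max(
        c(b"".join(items[:i] + items[i + 1:])) for i in range(len(items))
    )
    return (whole - smallest) / largest_rest
```

Because the measure only needs a byte serialization of each test artifact, it applies to inputs, outputs or traces alike; the price is one compression per candidate subset, which is consistent with the computational cost the abstract mentions.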

    A GRASP-based memetic algorithm with path relinking for the far from most string problem.

    Open access policy taken from: https://www.elsevier.com/about/policies-and-standards/copyright. The FAR FROM MOST STRING PROBLEM (FFMSP) is a string selection problem. The objective is to find a string whose distance to other strings in a certain input set is above a given threshold for as many of those strings as possible. This problem has links with some tasks in computational biology, and solving it has been shown to be very hard. We propose a memetic algorithm (MA) to tackle the FFMSP. This MA exploits a heuristic objective function for the problem and features initialization of the population via a Greedy Randomized Adaptive Search Procedure (GRASP) metaheuristic, intensive recombination via path relinking and local improvement via hill climbing. An extensive empirical evaluation using problem instances of both random and biological origin is carried out to assess parameter sensitivity and draw performance comparisons with other state-of-the-art techniques. The MA is shown to perform better than these latter techniques with statistical significance. Projects ANYSELF (TIN2011-28627-C04-01) of MICINN and DNEMESIS (TIC-6083) of Junta de Andalucía.
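The plain FFMSP objective described above can be sketched as follows. This assumes Hamming distance over equal-length strings and treats "above a given threshold" as greater than or equal to the threshold; both are assumptions, and this is the basic count, not the heuristic objective function the memetic algorithm exploits.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ffmsp_objective(candidate: str, strings: list, threshold: int) -> int:
    """Count the input strings that lie at distance >= threshold from candidate."""
    return sum(hamming(candidate, s) >= threshold for s in strings)


inputs = ["AAAA", "AACC", "CCCC"]
print(ffmsp_objective("CCCC", inputs, 2))  # far from "AAAA" and "AACC" only
print(ffmsp_objective("GGGG", inputs, 2))  # far from all three inputs
```

Many different candidates share the same count, so the search landscape has large plateaus, which is one reason a finer-grained heuristic objective is useful for guiding local search.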

    Network Analysis of Differential Expression for the Identification of Disease-Causing Genes

    Genetic studies (in particular linkage and association studies) identify chromosomal regions involved in a disease or phenotype of interest, but those regions often contain many candidate genes, only a few of which can be followed up for biological validation. Recently, computational methods to identify (prioritize) the most promising candidates within a region have been proposed, but they are usually not applicable to cases where little is known about the phenotype (no or few confirmed disease genes, fragmentary understanding of the biological cascades involved). We seek to overcome this limitation by replacing knowledge about the biological process with experimental data on differential gene expression between affected and healthy individuals. Considering the problem from the perspective of a gene/protein network, we assess a candidate gene by considering the level of differential expression in its neighborhood, under the assumption that strong candidates will tend to be surrounded by differentially expressed neighbors. We define a notion of soft neighborhood where each gene is given a contributing weight, which decreases with the distance from the candidate gene on the protein network. To account for multiple paths between genes, we define the distance using the Laplacian exponential diffusion kernel. We score candidates by aggregating the differential expression of neighbors weighted as a function of distance. Through a randomization procedure, we rank candidates by p-values. We illustrate our approach on four monogenic diseases and successfully prioritize the known disease-causing genes.
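For reference, the Laplacian exponential diffusion kernel has the standard form below, where A is the adjacency matrix of the protein network, D its diagonal degree matrix and β > 0 a diffusion parameter. The score s_i is only an assumed illustration of "aggregating the differential expression of neighbors": a kernel-weighted sum of the differential expression levels d_j. The abstract does not give the actual aggregation function.

```latex
L = D - A, \qquad
K = e^{-\beta L} = \sum_{k=0}^{\infty} \frac{(-\beta L)^{k}}{k!}, \qquad
s_i = \sum_{j \neq i} K_{ij}\, d_j
```

Because the series sums over walks of every length, K_{ij} grows with the number of paths connecting genes i and j, which is how the kernel accounts for multiple paths in a way that shortest-path distance does not.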