4 research outputs found

    Prism complexity of matrices

    n-Subword Complexity Measure of DNA Sequences

    String complexity has many definitions: Kolmogorov complexity [30], Lempel-Ziv complexity [14], [27], linguistic complexity [42], subword complexity [10], etc. In this thesis we consider the n-subword complexity studied in [2] and [13]. The n-subword complexity Pw(n) of a genomic sequence w was defined in [13] as the number of distinct factors (subwords) of length n that occur in w. In [2] a new measure, the n-subword deficit, was defined as the difference between the number of subwords of length n of a genomic sequence w and that of a random genomic sequence of the same length. This definition was applied to short sequences (2000 base pairs). In this thesis, we extend the definition so that it applies to very long sequences as well as short ones (from 100 base pairs to 200,000 base pairs). The aim of our work is to answer the following questions: 1. Do biological sequences show an n-subword deficit, and is their n-subword deficit length dependent? 2. Is the n-subword deficit gene specific? 3. Is the n-subword deficit genome specific? Our results indicate that the answers to questions 1 to 3 appear to be Yes, No, and No, respectively. Moreover, the insects Apis mellifera and Drosophila melanogaster were found to have the genomes with the lowest maximal n-subword deficit value among the genomes examined, in all experiments that were conducted.
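    As a rough illustration of the two quantities described in this abstract, the following Python sketch counts distinct length-n factors and estimates a deficit against random sequences. It is not the thesis's implementation: the function names, the uniform random model over the alphabet ACGT, the averaging over several random trials, and the sign convention (random count minus observed count) are all assumptions made for the example.

```python
import random


def subword_complexity(w: str, n: int) -> int:
    """Return the number of distinct factors (subwords) of length n in w."""
    if n <= 0 or n > len(w):
        return 0
    return len({w[i:i + n] for i in range(len(w) - n + 1)})


def subword_deficit(w: str, n: int, trials: int = 20,
                    alphabet: str = "ACGT", seed: int = 0) -> float:
    """Estimate an n-subword deficit for w.

    Assumption: the random reference is a uniform i.i.d. sequence over
    `alphabet` with the same length as w, averaged over `trials` samples,
    and the deficit is (random count) - (observed count).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        r = "".join(rng.choice(alphabet) for _ in range(len(w)))
        total += subword_complexity(r, n)
    return total / trials - subword_complexity(w, n)
```

    Under this sign convention, a positive value means the sequence contains fewer distinct subwords of length n than a random sequence of the same length would be expected to contain.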

    A comparative study of automated reviewer assignment methods

    vii, 75 leaves : ill. ; 29 cm. Includes abstract. Includes bibliographical references (leaves 55-60). The reviewer assignment problem is the problem of determining suitable reviewers for papers submitted to journals or conferences. Automated solutions to this problem have used standard information retrieval methods such as the vector space model and latent semantic indexing. In this work we introduce two new methods. The first method assigns reviewers using compression-approximated information distance. It approximates the Kolmogorov complexity of papers by their size when compressed by a compression program, and then approximates the relatedness of the papers using an information distance equation. This method performs better than standard information retrieval methods. The second method assigns reviewers using Google Desktop, a more advanced information retrieval system. It searches for key terms from a paper needing reviewers in a set of papers written by possible reviewers and uses the search results as votes for reviewers. This method is relatively simple and is very effective for assigning reviewers.
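    The compression-based idea can be sketched in a few lines of Python. The abstract does not say which compressor or which information distance equation was used, so this example assumes zlib and the normalized compression distance (NCD); the function names and the ranking helper are hypothetical and only illustrate the general technique.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate Kolmogorov complexity by zlib-compressed length."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (assumed here as the distance)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_reviewers(paper: str, reviewer_texts: dict[str, str]) -> list[tuple[str, float]]:
    """Rank candidate reviewers by NCD between the paper and their writing.

    A smaller distance is treated as closer relatedness.
    """
    p = paper.encode("utf-8")
    scores = [(name, ncd(p, text.encode("utf-8")))
              for name, text in reviewer_texts.items()]
    return sorted(scores, key=lambda item: item[1])
```

    In this sketch each reviewer is represented by a single concatenated text of their papers; how the original work aggregated multiple papers per reviewer is not stated in the abstract.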

    WORD COMPLEXITY AND REPETITIONS IN WORDS

    Using ideas from data compression and combinatorics on words, we introduce a complexity measure for words, called repetition complexity, which quantifies the amount of repetition in a word. The repetition complexity of w, r(w), is defined as the smallest amount of space needed to store w when it is reduced by repeatedly applying the following procedure: n consecutive occurrences uu...u of the same subword u of w are stored as (u, n). The repetition complexity has interesting relations with well-known complexity measures, such as subword complexity, sub, and Lempel-Ziv complexity, lz. We always have r(w) ≥ lz(w), and it may even happen that the former is linear while the latter is only logarithmic; e.g., this happens for prefixes of certain infinite words obtained by iterated morphisms. An infinite word α being ultimately periodic is equivalent to: (i