
    Illustration of the proof of Theorem 3.

    Shown are two aligned copies of a section of the reference sequence, where black marks the sequence between two equivalent deletions (green and magenta). The filled polygon illustrates the sequence identity that holds if and only if the two deletions are equivalent. Following the grey arrows up and down and from left to right shows that this section consists of repeats of the deletion sequence until the last copy overlaps with the other deletion sequence.

    Equivalent Indels – Ambiguous Functional Classes and Redundancy in Databases

    There is considerable interest in studying sequenced variations. However, while the positions of substitutions are uniquely identifiable by sequence alignment, the location of insertions and deletions still poses problems. Each insertion and deletion causes a change of sequence. Yet, due to low complexity or repetitive sequence structures, the same indel can sometimes be annotated in different ways. Two indels that differ in allele sequence and position can be one and the same, i.e. the alternative sequence of the whole chromosome is identical in both cases and, therefore, the two indels are biologically equivalent. In such a case, it is impossible to identify the exact position of an indel based on sequence alignment alone. Thus, variation entries in a mutation database are not necessarily uniquely defined. We prove the existence of a contiguous region around an indel in which all deletions of the same length are biologically identical. Databases often show only one of several possible locations for a given variation. Furthermore, different database entries can represent equivalent variation events. We identified 1,045,590 such problematic entries of insertions and deletions out of 5,860,408 indel entries in the current human database of Ensembl. Equivalent indels are found in sequence regions of different functions, such as exons, introns, or 5' and 3' UTRs. One and the same variation can be assigned to several different functional classifications, of which only one is correct. We implemented an algorithm that determines for each indel database entry its complete set of equivalent indels, which is uniquely characterized by the indel itself and a given interval of the reference sequence.
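
The equivalence criterion stated in the abstract (two indels are equivalent when they yield the same alternative sequence) can be sketched for deletions as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0-based coordinates, and the toy reference string are all assumptions made for this example.

```python
def apply_deletion(reference: str, start: int, length: int) -> str:
    """Return the alternative sequence after deleting reference[start:start+length].

    Coordinates are 0-based (an assumption of this sketch).
    """
    return reference[:start] + reference[start + length:]


def are_equivalent(reference: str, first: tuple, second: tuple) -> bool:
    """Two deletions, each given as (start, length), are equivalent
    if and only if they produce the same alternative sequence."""
    return apply_deletion(reference, *first) == apply_deletion(reference, *second)


# Hypothetical toy reference with a CAT repeat.
reference = "GGCATCATCATTT"

# Deleting "CAT" at index 2 and "ATC" at index 3 gives the same result,
# although the two deletions differ in both position and allele sequence.
print(are_equivalent(reference, (2, 3), (3, 3)))
```

Comparing whole alternative sequences is the most direct check, but it costs time proportional to the sequence length per comparison, which is why a local comparison around the indel is preferable at database scale.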

    Illustration of Theorem 1.

    The first line is the reference sequence. The second and third lines contain two deletions. are the nucleotides. If , then the two deletions are equivalent.

    Code 2: Pseudocode for the identification of the region of ambiguity for all indels.

    The variables are the start position of the deletion, the most downstream start position of equivalent indels, and the most upstream start position. Compared to the code of Table 2 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0062803#pone-0062803-t002), the index now cyclically provides the nucleotide indel[] directly from the indel allele.
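
The idea described in this caption, reading nucleotides cyclically from the indel allele while scanning in both directions, might look like the sketch below. It is one possible reading of the caption, not the published pseudocode: the function name, the parameter names, the 0-based coordinates, and the convention that an insertion at `start` is placed before `reference[start]` are assumptions of this example.

```python
def ambiguity_region(reference: str, start: int, allele: str, is_deletion: bool):
    """Return (most_upstream_start, most_downstream_start) of all indels
    equivalent to the given one. Coordinates are 0-based (assumption).

    For a deletion, `allele` is reference[start:start+len(allele)].
    For an insertion, `allele` is inserted before reference[start].
    """
    n = len(allele)
    if n == 0:
        raise ValueError("allele must not be empty")

    # First reference position to the right of the indel.
    right = start + n if is_deletion else start

    # Shift downstream while the next flanking nucleotide matches the
    # allele, indexing the allele cyclically.
    down = 0
    while right + down < len(reference) and reference[right + down] == allele[down % n]:
        down += 1

    # Shift upstream while the previous flanking nucleotide matches the
    # allele read cyclically from its end.
    up = 0
    while start - 1 - up >= 0 and reference[start - 1 - up] == allele[(-1 - up) % n]:
        up += 1

    return start - up, start + down


# Hypothetical toy example: a "CAT" deletion inside a CAT repeat.
print(ambiguity_region("GGCATCATCATTT", 5, "CAT", True))
```

Indexing the allele rather than the reference lets the same loop handle insertions, whose allele does not occur in the reference at the annotated position.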

    Different variations lead to the same alternative sequence and, therefore, the variations are equivalent.

    The first variation is a deletion at position 15, the second variation is a deletion at position 14, and the third variation is a deletion at position 5.

    This deletion is annotated in Ensembl as lying in the start codon of transcript HRNR-001 and therefore leads to the loss of the start codon. The equivalent indel has no effect on the protein. (All sequences are shown as reverse complements because the transcript is located on the reverse strand.) Regular characters denote the upstream region; bold italic characters denote the coding sequence.

    Number of observed insertions (a) and deletions (b) versus length.

    There is a strong correlation between length (x-axis) and frequency (y-axis) of the indels.

    Example of the prefix in a sequence.

    In this example, the missing sequence CATT is repeated, and the prefix of the repeated sequence is CA, which is shown in bold italic.

    An example of equivalent deletions in a non-repetitive sequence on a human chromosome.

    The deletions are shown in bold italic. All four variations are equivalent although they are located in a non-repetitive sequence. All four variations are annotated in dbSNP. The alternative sequence is in each case ACGTGGG.

    An example of the algorithm in Code 1: a deletion sequence is shifted downstream to detect equivalent deletions.

    The deletion sequence is printed in bold italic. The nucleotide following the deletion sequence is compared with the first nucleotide of the deletion. If both are equal, the variation is shifted 1 bp downstream; otherwise the algorithm terminates. In our example the algorithm terminates after the 6th alignment.
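
The downstream shift described in this caption can be sketched in a few lines. This is an illustrative reading of the caption, not the original Code 1: the function name, the 0-based coordinates, and the toy sequence are assumptions.

```python
def shift_deletion_downstream(reference: str, start: int, length: int) -> int:
    """Return the most downstream start of a deletion equivalent to
    reference[start:start+length]. Coordinates are 0-based (assumption)."""
    pos = start
    # Compare the first nucleotide of the current deletion with the
    # nucleotide directly following it; shift by 1 bp while they match.
    while pos + length < len(reference) and reference[pos] == reference[pos + length]:
        pos += 1
    return pos


# Hypothetical toy example: "CAT" deleted at index 2 of a CAT repeat.
print(shift_deletion_downstream("GGCATCATCATTT", 2, 3))
```

Each shift is valid because moving the deletion window by one position changes the alternative sequence only at the position where the first deleted nucleotide is exchanged for the nucleotide following the window.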

    Different functional classes of insertion rs55710688 on a human chromosome.

    The coding region is shown in lower case. The insertion rs55710688 is shown in bold italic. Lines 1 and 2 represent the alignment as annotated in dbSNP: the insertion lies inside the coding region and causes the loss of the start codon ("start lost"). Lines 3 and 4 represent an alternative alignment: the insertion lies outside the coding region and does not affect the start codon ATG.