
    NTRFinder: a software tool to find nested tandem repeats

    We introduce the software tool NTRFinder to search for a complex repetitive structure in DNA we call a nested tandem repeat (NTR). An NTR is a recurrence of two or more distinct tandem motifs interspersed with each other. We propose that NTRs can be used as phylogenetic and population markers. We have tested our algorithm on both real and simulated data, and present some real NTRs of interest. NTRFinder can be downloaded from http://www.maths.otago.ac.nz/~aamatroud/
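
    To make the definition concrete, the minimal Python sketch below builds a toy sequence with the interspersed two-motif structure that NTRFinder searches for; the motifs and the interspersion pattern are invented for illustration and are not taken from the paper.

        def nested_tandem_repeat(motif_x, motif_y, pattern):
            """Build a toy nested tandem repeat (NTR): copies of two distinct
            tandem motifs interspersed with each other, e.g. pattern "XXYXXXYX"
            expands to x x y x x x y x."""
            return "".join(motif_x if p == "X" else motif_y for p in pattern)

        # Hypothetical motifs; real NTR motifs are typically longer.
        print(nested_tandem_repeat("ACG", "TTAGG", "XXYXXXYX"))
        # -> ACGACGTTAGGACGACGACGTTAGGACG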

    TRStalker: an efficient heuristic for finding fuzzy tandem repeats

    Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms and also in several diseases (e.g. the group of trinucleotide repeat disorders). Current methods are rather effective for TRs with a low or medium level of divergence, but the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. Detecting fuzzy TRs is a prerequisite for enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important tools for shedding light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events
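
    As a rough illustration of what "fuzzy" means here (this is not TRStalker's heuristic), the sketch below scores a candidate period by comparing consecutive copies of the repeat unit with a standard similarity ratio; diverged copies pull the score below 1.0 while a clean repeat scores exactly 1.0.

        from difflib import SequenceMatcher

        def fuzzy_tr_score(sequence, period):
            """Average similarity between consecutive copies of a candidate
            repeat unit. Values near 1.0 indicate a clean tandem repeat;
            moderately lower values suggest a diverged (fuzzy) TR.
            Illustrative only."""
            copies = [sequence[i:i + period]
                      for i in range(0, len(sequence) - period + 1, period)]
            if len(copies) < 2:
                return 0.0
            sims = [SequenceMatcher(None, a, b).ratio()
                    for a, b in zip(copies, copies[1:])]
            return sum(sims) / len(sims)

        # A repeat of "ACGTT" carrying a few point mutations.
        print(fuzzy_tr_score("ACGTTACGATACGTTACCTT", 5))   # approx. 0.8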

    Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs.

    Variable number tandem repeats (VNTRs) are composed of consecutive repetitive DNA with hypervariable repeat count and composition. They include protein-coding sequences and have known associations with clinical disorders. It has been difficult to incorporate VNTR analysis into disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. In this work, we solve VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies. We develop software to build an RPGG, and use the RPGG to estimate VNTR composition with short reads. We use this to discover VNTRs whose length is stratified by continental population, as well as expression quantitative trait loci, indicating that RPGG analysis of VNTRs will be critical for future studies of diversity and disease
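
    The sketch below is a minimal stand-in for the RPGG idea, under simplifying assumptions: a de Bruijn-style k-mer graph built from several hypothetical haplotype sequences of one VNTR locus, plus a crude read-support score. The paper's RPGG additionally encodes locus boundaries and population diversity and uses proper alignment rather than exact k-mer lookup.

        from collections import defaultdict

        def build_rpgg(haplotype_seqs, k):
            """Toy repeat-pangenome graph: nodes are k-mers seen in any
            haplotype, edges link overlapping k-mers. A simplification of the
            paper's RPGG."""
            edges = defaultdict(set)
            for seq in haplotype_seqs:
                for i in range(len(seq) - k):
                    edges[seq[i:i + k]].add(seq[i + 1:i + 1 + k])
            return edges

        def kmer_support(graph, read, k):
            """Fraction of a read's k-mers present in the graph -- a crude
            stand-in for estimating VNTR composition from short reads."""
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            return sum(km in graph for km in kmers) / max(len(kmers), 1)

        # Two hypothetical haplotypes of the same VNTR locus, differing in repeat count.
        graph = build_rpgg(["ACGTACGTACGTT", "ACGTACGTT"], k=4)
        print(kmer_support(graph, "ACGTACGT", k=4))   # 1.0: every k-mer is in the graph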

    A fast and effective approach for the detection of units in tandem repeat proteins


    Fixed-Parameter Algorithms for the Consensus Analysis of Genomic Data

    Fixed-parameter algorithms offer a constructive and powerful approach to efficiently obtaining solutions for NP-hard problems, combining two important goals: they compute optimal solutions, and they do so within provable time bounds, despite the (almost inevitable) computational intractability of NP-hard problems. The essential idea is to identify one or more aspects of the input to a problem as the parameters, and to confine the combinatorial explosion of computational difficulty to a function of the parameters, such that the costs are polynomial in the non-parameterized part of the input. This is especially sensible for parameters that take only small values in applications. Fixed-parameter algorithms have become an established algorithmic tool in a variety of application areas, among them computational biology, where small values for problem parameters are often observed. A number of design techniques for fixed-parameter algorithms have been proposed; bounded search trees are one of them. In computational biology, however, examples of bounded search tree algorithms have so far been rare. This thesis investigates the use of bounded search tree algorithms for consensus problems in the analysis of DNA and RNA data. More precisely, we investigate consensus problems in the contexts of sequence analysis, of quartet methods for phylogenetic reconstruction, of gene order analysis, and of RNA secondary structure comparison. In all cases, we present new efficient algorithms that incorporate the bounded search tree paradigm in novel ways. Along the way, we also obtain parameterized hardness results, showing that the respective problems are unlikely to admit a fixed-parameter algorithm, and we introduce integer linear programs (ILPs) as a tool for classifying problems as fixed-parameter tractable, i.e., as having fixed-parameter algorithms. Most of our algorithms were implemented and tested on practical data.
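
    As a concrete illustration of the bounded search tree paradigm in a consensus setting, the sketch below gives the textbook depth-d branching for the Closest String problem (find a string within Hamming distance d of all inputs). It follows the classic scheme rather than the exact algorithms of this thesis.

        def closest_string(strings, d):
            """Bounded search tree for Closest String: return a string within
            Hamming distance d of every input string, or None if impossible.
            The recursion depth is bounded by the parameter d, which confines
            the combinatorial explosion to a function of d alone."""
            def dist(a, b):
                return sum(x != y for x, y in zip(a, b))

            def search(candidate, budget):
                for s in strings:
                    if dist(candidate, s) > d:          # this string is still too far
                        if budget == 0:
                            return None
                        # Branch: move the candidate toward s at one of at most
                        # d+1 mismatch positions, spending one unit of budget.
                        mismatches = [i for i, (x, y) in enumerate(zip(candidate, s)) if x != y]
                        for i in mismatches[:d + 1]:
                            hit = search(candidate[:i] + s[i] + candidate[i + 1:], budget - 1)
                            if hit is not None:
                                return hit
                        return None
                return candidate                        # within distance d of all strings

            return search(strings[0], d)

        print(closest_string(["ACGT", "ACGA", "TCGT"], 1))   # -> ACGT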

    Homology inference with specific molecular constraints

    Evolutionary processes can be considered at multiple levels of biological organization. The work developed in this thesis focuses on protein molecular evolution. Although proteins are linear polymers composed from a basic set of 20 amino acids, they generate an enormous variety of form and function. Proteins that have arisen by common descent are classified into families; they often share common properties, including similarities in sequence, structure, and function. Multiple methods have been developed to infer evolutionary relationships between proteins and classify them into families. Yet these generic methods are often inaccurate, especially when specific protein properties limit their applicability. In this thesis, we analyse two protein classes that are often difficult for evolutionary analysis: coiled coils, repetitive protein domains defined by a simple, widespread peptide motif (chapters 2 and 3), and Rab small GTPases, a large family of closely related proteins (chapters 4 and 5). In both cases, we analyse the specific properties that determine protein structure and function and use them to improve their evolutionary inference

    Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

    About half of the human genome consists of transposable elements (TEs), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TEs. TEs affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TEs are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TEs are developed. These features are of two types. The first type consists of statistical features, based on the Fourier transform, that assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, is generated by finite state machines augmented with counters that track the number of times each state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets, incorporating a genetic algorithm and a novel clustering method, are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon-finding algorithms, to build classifiers that distinguish TEs from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TEs. In addition, the features are used to describe the TEs in the newly sequenced ciliate Tetrahymena thermophila, providing information useful to biologists in forming experimentally testable hypotheses about the role of these TEs and the mechanisms that govern them
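
    A minimal sketch of the side effect machine idea follows: run a fixed finite state machine over a DNA string and report normalized state-visit counts as features. The 3-state transition table is hypothetical; in the thesis the machine structures themselves are selected by a genetic algorithm.

        def sem_features(sequence, transitions, n_states, start_state=0):
            """Side effect machine (SEM) features: drive a finite state machine
            with the DNA sequence and count how often each state is visited;
            the counts, normalized by sequence length, form the feature vector."""
            counts = [0] * n_states
            state = start_state
            for base in sequence:
                state = transitions[state][base]
                counts[state] += 1
            total = max(len(sequence), 1)
            return [c / total for c in counts]

        # A hypothetical 3-state machine over the DNA alphabet.
        toy_machine = [
            {"A": 0, "C": 1, "G": 2, "T": 0},
            {"A": 2, "C": 1, "G": 0, "T": 1},
            {"A": 1, "C": 0, "G": 2, "T": 2},
        ]
        print(sem_features("ACGTACGGTT", toy_machine, n_states=3))   # [0.5, 0.2, 0.3]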

    The Nested Periodic Subspaces: Extensions of Ramanujan Sums for Period Estimation

    In the year 1918, the Indian mathematician Srinivasa Ramanujan proposed a set of sequences called Ramanujan Sums as bases to expand arithmetic functions in number theory. Today, exactly 100 years later, we will show that these sequences re-emerge as exciting tools in a completely different context: the extraction of periodic patterns in data. Combined with state-of-the-art DSP techniques, Ramanujan Sums can be used as the starting point for developing powerful algorithms for periodicity applications. The primary inspiration for this thesis comes from a recent extension of Ramanujan sums to subspaces known as the Ramanujan subspaces. These subspaces were designed to span any sequence with integer periodicity, and have many interesting properties. Starting with Ramanujan subspaces, this thesis first develops an entire family of such subspace representations for periodic sequences. This family, called Nested Periodic Subspaces due to their unique structure, turns out to be the least redundant set of subspaces that can span periodic sequences. Three classes of new algorithms are proposed using the Nested Periodic Subspaces: dictionaries, filter banks, and eigen-space methods based on the auto-correlation matrix of the signal. It will be shown that these methods are especially advantageous when the data-length is short, or when the signal is a mixture of multiple hidden periods. The dictionary techniques were inspired by recent advances in sparsity-based compressed sensing. Apart from the l1-norm-based convex programs currently used in other applications, our dictionaries admit l2-norm formulations that have linear and closed-form solutions, even when the system is under-determined. A new filter bank is also proposed using the Ramanujan sums. This, named the Ramanujan Filter Bank, can accurately track the instantaneous period for signals that exhibit a time-varying periodic nature. The filters in the Ramanujan Filter Bank have simple integer-valued coefficients, and directly tile the period vs. time plane, unlike the classical STFT (Short Time Fourier Transform) and wavelets, which tile the time-frequency plane. The third family of techniques developed here is a generalization of the classic MUSIC (MUltiple SIgnal Classification) algorithm for periodic signals. MUSIC is one of the most popular techniques today for line spectral estimation. However, periodic signals are not just unstructured line spectral signals: there is a harmonic spacing between the lines which is not exploited by plain MUSIC. We will show that one can design much more accurate adaptations of MUSIC using Nested Periodic Subspaces. Compared to prior variants of MUSIC for the periodicity problem, our approach is much faster and yields much more accurate results for signals with integer periods. This work is also the first extension of MUSIC that uses simple integer-valued basis vectors instead of traditional complex exponentials to span the signal subspace. The advantages of the new methods are demonstrated both on simulations and on real-world applications such as DNA microsatellites, protein repeats and absence seizures. Apart from practical contributions, the theory of Nested Periodic Subspaces offers answers to a number of fundamental questions that were previously unanswered. For example, what is the minimum contiguous data-length needed to identify the period of a signal unambiguously?
Notice that the answer we seek is a fundamental identifiability bound, independent of any particular period estimation technique. Surprisingly, this basic question has never been answered before. In this thesis, we derive precise expressions for the minimum necessary and sufficient data-lengths for this question. We also extend these bounds to the context of mixtures of periodic signals. Once again, even though mixtures of periodic signals occur often in many applications, aspects such as the unique identifiability of the component periods were never rigorously analyzed before. We present such an analysis as well. While the above question deals with the minimum contiguous data-length required for period estimation, one may ask a slightly different question: if we are allowed to pick the samples of a signal in a non-contiguous fashion, how should we pick them so that we can estimate the period using the fewest samples? This question turns out to be quite difficult to answer in general. In this thesis, we analyze a smaller case in this regard, namely, that of resolving between two periods. It will be shown that the analysis is quite involved even in this case, and that the optimal sampling pattern takes an interesting form of sparsely located bunches. This result can also be extended to the case of multi-dimensional periodic signals. We very briefly address multi-dimensional periodicity in this thesis. Most prior DSP literature on multi-dimensional discrete-time periodic signals assumes the periods to be parallelepipeds. But as shown by the artist M. C. Escher, one can tile the space using a much more diverse variety of shapes. Is it always possible to account for such other periodic shapes using the traditional notion of parallelepiped periods? An interesting analysis in this regard is presented towards the end of the thesis.
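
    For readers unfamiliar with the building blocks, the sketch below computes the Ramanujan sum c_q(n) directly from its definition (the sum of exp(2*pi*i*k*n/q) over k coprime to q); these integer-valued periodic sequences are the starting point for the Nested Periodic Subspaces and the Ramanujan Filter Bank. This is a plain definition-level computation, not the thesis's algorithms.

        import numpy as np
        from math import gcd

        def ramanujan_sum(q, n):
            """Ramanujan sum c_q(n) = sum over k coprime to q of exp(2*pi*i*k*n/q).
            The result is always an integer, so the real part is rounded."""
            ks = np.array([k for k in range(1, q + 1) if gcd(k, q) == 1])
            return int(round(np.sum(np.cos(2 * np.pi * ks * n / q))))

        # One period of c_q for a few small q; note the integer values.
        for q in (1, 2, 3, 4, 6):
            print(q, [ramanujan_sum(q, n) for n in range(q)])
        # 1 [1]
        # 2 [1, -1]
        # 3 [2, -1, -1]
        # 4 [2, 0, -2, 0]
        # 6 [2, 1, -1, -2, -1, 1]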

    Bioinformatic Investigations Into the Genetic Architecture of Renal Disorders

    Modern genomic analysis has a significant bioinformatic component due to the high volume of complex data involved. During investigations into the genetic components of two renal diseases, we developed two software tools. // Genome-Wide Association Study (GWAS) datasets may be genotyped on different microarrays and subject to different annotation, leading to a mosaic case-control cohort with inherent errors, primarily due to strand mismatching. Our software REMEDY detects and corrects the strand designation of input datasets, and filters common sources of noise such as structural and multi-allelic variants. We performed a GWAS on a large cohort of steroid-sensitive nephrotic syndrome samples; the mosaic input datasets were pre-processed with REMEDY prior to merging and analysis. Our results show that REMEDY significantly reduced noise in the GWAS output. REMEDY outperforms existing software, offering automatic strand designation detection, comprehensive variant filtering, and high-speed variant matching against dbSNP. // The second tool supported the analysis of a newly characterised rare renal disorder: polycystic kidney disease with hyperinsulinemic hypoglycemia (HIPKD). Identification of the underlying genetic cause led to the hypothesis that a change in chromatin looping at a specific locus affects the aetiology of the disease. We developed LOOPER, a software suite capable of predicting chromatin loops from ChIP-Seq data, to explore the possible conformations of chromatin architecture in the HIPKD genomic region. LOOPER predicted several interesting functional and structural loops that supported our hypothesis. We then extended LOOPER to visualise ChIA-PET and ChIP-Seq data as a force-directed graph showing experimental structural and functional chromatin interactions. Next, we re-analysed the HIPKD region with LOOPER to show experimentally validated chromatin interactions. We first confirmed our original predicted loops and subsequently discovered that the local genomic region has many more chromatin features than first thought
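
    The kind of check REMEDY automates can be illustrated with a small sketch (this is not REMEDY's code): a variant whose alleles match the reference panel only after complementing both alleles is flagged as lying on the opposite strand, while ambiguous A/T and C/G variants cannot be resolved this way and are typically filtered.

        COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

        def needs_strand_flip(dataset_alleles, reference_alleles):
            """Return True if the variant's alleles match the reference panel
            only after flipping to the opposite strand. Illustrative of an
            automatic strand-designation check, not REMEDY's implementation."""
            a, b = dataset_alleles
            if {a, b} == {COMPLEMENT[a], COMPLEMENT[b]}:
                raise ValueError("ambiguous A/T or C/G variant: strand cannot be inferred")
            if {a, b} == set(reference_alleles):
                return False                               # already on the same strand
            if {COMPLEMENT[a], COMPLEMENT[b]} == set(reference_alleles):
                return True                                # opposite strand: flip needed
            raise ValueError("alleles do not match the reference on either strand")

        print(needs_strand_flip(("A", "G"), ("T", "C")))   # True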