
    NTRFinder: a software tool to find nested tandem repeats

    We introduce the software tool NTRFinder to search for a complex repetitive structure in DNA we call a nested tandem repeat (NTR). An NTR is a recurrence of two or more distinct tandem motifs interspersed with each other. We propose that NTRs can be used as phylogenetic and population markers. We have tested our algorithm on both real and simulated data, and present some real NTRs of interest. NTRFinder can be downloaded from http://www.maths.otago.ac.nz/~aamatroud/
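
    To make the definition concrete, the minimal Python sketch below builds a toy sequence with the interspersed two-motif structure that NTRFinder searches for; the motifs and the interspersion pattern are invented for illustration and are not taken from the paper.

        def nested_tandem_repeat(motif_x, motif_y, pattern):
            """Build a toy nested tandem repeat (NTR): copies of two distinct
            tandem motifs interspersed with each other, e.g. pattern "XXYXXXYX"
            expands to x x y x x x y x."""
            return "".join(motif_x if p == "X" else motif_y for p in pattern)

        # Hypothetical motifs; real NTR motifs are typically longer.
        print(nested_tandem_repeat("ACG", "TTAGG", "XXYXXXYX"))
        # -> ACGACGTTAGGACGACGACGTTAGGACG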

    TRStalker: an efficient heuristic for finding fuzzy tandem repeats

    Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms and also in several diseases (e.g. the group of trinucleotide repeat disorders). Current methods are rather effective for TRs with a low or medium level of divergence, but the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. Detecting fuzzy TRs is a prerequisite for enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important tools for shedding light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events
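
    As a rough illustration of what "fuzzy" means here (this is not TRStalker's heuristic), the sketch below scores a candidate period by comparing consecutive copies of the repeat unit with a standard similarity ratio; diverged copies pull the score below 1.0 while a clean repeat scores exactly 1.0.

        from difflib import SequenceMatcher

        def fuzzy_tr_score(sequence, period):
            """Average similarity between consecutive copies of a candidate
            repeat unit. Values near 1.0 indicate a clean tandem repeat;
            moderately lower values suggest a diverged (fuzzy) TR.
            Illustrative only."""
            copies = [sequence[i:i + period]
                      for i in range(0, len(sequence) - period + 1, period)]
            if len(copies) < 2:
                return 0.0
            sims = [SequenceMatcher(None, a, b).ratio()
                    for a, b in zip(copies, copies[1:])]
            return sum(sims) / len(sims)

        # A repeat of "ACGTT" carrying a few point mutations.
        print(fuzzy_tr_score("ACGTTACGATACGTTACCTT", 5))   # approx. 0.8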

    Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs.

    Variable number tandem repeats (VNTRs) are composed of consecutive repetitive DNA with hypervariable repeat count and composition. They include protein-coding sequences and have known associations with clinical disorders. It has been difficult to incorporate VNTR analysis into disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. In this work, we solve VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies. We develop software to build an RPGG, and use the RPGG to estimate VNTR composition with short reads. We use this to discover VNTRs whose length is stratified by continental population, as well as expression quantitative trait loci, indicating that RPGG analysis of VNTRs will be critical for future studies of diversity and disease
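
    The sketch below is a minimal stand-in for the RPGG idea, under simplifying assumptions: a de Bruijn-style k-mer graph built from several hypothetical haplotype sequences of one VNTR locus, plus a crude read-support score. The paper's RPGG additionally encodes locus boundaries and population diversity and uses proper alignment rather than exact k-mer lookup.

        from collections import defaultdict

        def build_rpgg(haplotype_seqs, k):
            """Toy repeat-pangenome graph: nodes are k-mers seen in any
            haplotype, edges link overlapping k-mers. A simplification of the
            paper's RPGG."""
            edges = defaultdict(set)
            for seq in haplotype_seqs:
                for i in range(len(seq) - k):
                    edges[seq[i:i + k]].add(seq[i + 1:i + 1 + k])
            return edges

        def kmer_support(graph, read, k):
            """Fraction of a read's k-mers present in the graph -- a crude
            stand-in for estimating VNTR composition from short reads."""
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            return sum(km in graph for km in kmers) / max(len(kmers), 1)

        # Two hypothetical haplotypes of the same VNTR locus, differing in repeat count.
        graph = build_rpgg(["ACGTACGTACGTT", "ACGTACGTT"], k=4)
        print(kmer_support(graph, "ACGTACGT", k=4))   # 1.0: every k-mer is in the graph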

    A fast and effective approach for the detection of units in tandem repeat proteins


    Fixed-Parameter Algorithms for the Consensus Analysis of Genomic Data

    Fixed-parameter algorithms offer a constructive and powerful approach to efficiently obtaining solutions for NP-hard problems, combining two important goals: they compute optimal solutions, and they do so within provable time bounds, despite the (almost inevitable) computational intractability of NP-hard problems. The essential idea is to identify one or more aspects of the input to a problem as the parameters, and to confine the combinatorial explosion of computational difficulty to a function of the parameters, such that the costs are polynomial in the non-parameterized part of the input. This is especially sensible for parameters that take only small values in applications. Fixed-parameter algorithms have become an established algorithmic tool in a variety of application areas, among them computational biology, where small values for problem parameters are often observed. A number of design techniques for fixed-parameter algorithms have been proposed; bounded search trees are one of them. In computational biology, however, examples of bounded search tree algorithms have so far been rare. This thesis investigates the use of bounded search tree algorithms for consensus problems in the analysis of DNA and RNA data. More precisely, we investigate consensus problems in the contexts of sequence analysis, of quartet methods for phylogenetic reconstruction, of gene order analysis, and of RNA secondary structure comparison. In all cases, we present new efficient algorithms that incorporate the bounded search tree paradigm in novel ways. Along the way, we also obtain parameterized hardness results, showing that the respective problems are unlikely to admit a fixed-parameter algorithm, and we introduce integer linear programs (ILPs) as a tool for classifying problems as fixed-parameter tractable, i.e., as having fixed-parameter algorithms. Most of our algorithms were implemented and tested on practical data.
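
    As a concrete illustration of the bounded search tree paradigm in a consensus setting, the sketch below gives the textbook depth-d branching for the Closest String problem (find a string within Hamming distance d of all inputs). It follows the classic scheme rather than the exact algorithms of this thesis.

        def closest_string(strings, d):
            """Bounded search tree for Closest String: return a string within
            Hamming distance d of every input string, or None if impossible.
            The recursion depth is bounded by the parameter d, which confines
            the combinatorial explosion to a function of d alone."""
            def dist(a, b):
                return sum(x != y for x, y in zip(a, b))

            def search(candidate, budget):
                for s in strings:
                    if dist(candidate, s) > d:          # this string is still too far
                        if budget == 0:
                            return None
                        # Branch: move the candidate toward s at one of at most
                        # d+1 mismatch positions, spending one unit of budget.
                        mismatches = [i for i, (x, y) in enumerate(zip(candidate, s)) if x != y]
                        for i in mismatches[:d + 1]:
                            hit = search(candidate[:i] + s[i] + candidate[i + 1:], budget - 1)
                            if hit is not None:
                                return hit
                        return None
                return candidate                        # within distance d of all strings

            return search(strings[0], d)

        print(closest_string(["ACGT", "ACGA", "TCGT"], 1))   # -> ACGT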

    Homology inference with specific molecular constraints

    Evolutionary processes can be considered at multiple levels of biological organization. The work developed in this thesis focuses on protein molecular evolution. Although proteins are linear polymers composed from a basic set of 20 amino acids, they generate an enormous variety of form and function. Proteins that have arisen by common descent are classified into families; they often share common properties, including similarities in sequence, structure, and function. Multiple methods have been developed to infer evolutionary relationships between proteins and classify them into families. Yet these generic methods are often inaccurate, especially when specific protein properties limit their applicability. In this thesis, we analyse two protein classes that are often difficult for evolutionary analysis: coiled coils, repetitive protein domains defined by a simple, widespread peptide motif (chapters 2 and 3), and Rab small GTPases, a large family of closely related proteins (chapters 4 and 5). In both cases, we analyse the specific properties that determine protein structure and function and use them to improve their evolutionary inference

    Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

    About half of the human genome consists of transposable elements (TEs), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TEs. TEs affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TEs are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TEs are developed. These features are of two types. The first type consists of statistical features, based on the Fourier transform, that assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, is generated by finite state machines augmented with counters that track the number of times each state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets, incorporating a genetic algorithm and a novel clustering method, are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon-finding algorithms, to build classifiers that distinguish TEs from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TEs. In addition, the features are used to describe the TEs in the newly sequenced ciliate Tetrahymena thermophila, providing information useful to biologists in forming experimentally testable hypotheses about the role of these TEs and the mechanisms that govern them
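
    A minimal sketch of the side effect machine idea follows: run a fixed finite state machine over a DNA string and report normalized state-visit counts as features. The 3-state transition table is hypothetical; in the thesis the machine structures themselves are selected by a genetic algorithm.

        def sem_features(sequence, transitions, n_states, start_state=0):
            """Side effect machine (SEM) features: drive a finite state machine
            with the DNA sequence and count how often each state is visited;
            the counts, normalized by sequence length, form the feature vector."""
            counts = [0] * n_states
            state = start_state
            for base in sequence:
                state = transitions[state][base]
                counts[state] += 1
            total = max(len(sequence), 1)
            return [c / total for c in counts]

        # A hypothetical 3-state machine over the DNA alphabet.
        toy_machine = [
            {"A": 0, "C": 1, "G": 2, "T": 0},
            {"A": 2, "C": 1, "G": 0, "T": 1},
            {"A": 1, "C": 0, "G": 2, "T": 2},
        ]
        print(sem_features("ACGTACGGTT", toy_machine, n_states=3))   # [0.5, 0.2, 0.3]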

    The Nested Periodic Subspaces: Extensions of Ramanujan Sums for Period Estimation

    In the year 1918, the Indian mathematician Srinivasa Ramanujan proposed a set of sequences called Ramanujan Sums as bases to expand arithmetic functions in number theory. Today, exactly 100 years later, we will show that these sequences re-emerge as exciting tools in a completely different context: the extraction of periodic patterns in data. Combined with state-of-the-art DSP techniques, Ramanujan Sums can be used as the starting point for developing powerful algorithms for periodicity applications. The primary inspiration for this thesis comes from a recent extension of Ramanujan sums to subspaces known as the Ramanujan subspaces. These subspaces were designed to span any sequence with integer periodicity, and have many interesting properties. Starting with Ramanujan subspaces, this thesis first develops an entire family of such subspace representations for periodic sequences. This family, called Nested Periodic Subspaces due to their unique structure, turns out to be the least redundant set of subspaces that can span periodic sequences. Three classes of new algorithms are proposed using the Nested Periodic Subspaces: dictionaries, filter banks, and eigen-space methods based on the auto-correlation matrix of the signal. It will be shown that these methods are especially advantageous when the data-length is short, or when the signal is a mixture of multiple hidden periods. The dictionary techniques were inspired by recent advances in sparsity-based compressed sensing. Apart from the l1-norm-based convex programs currently used in other applications, our dictionaries admit l2-norm formulations that have linear and closed-form solutions, even when the system is under-determined. A new filter bank is also proposed using the Ramanujan sums. This, named the Ramanujan Filter Bank, can accurately track the instantaneous period for signals that exhibit a time-varying periodic nature. The filters in the Ramanujan Filter Bank have simple integer-valued coefficients, and directly tile the period vs. time plane, unlike the classical STFT (Short Time Fourier Transform) and wavelets, which tile the time-frequency plane. The third family of techniques developed here is a generalization of the classic MUSIC (MUltiple SIgnal Classification) algorithm for periodic signals. MUSIC is one of the most popular techniques today for line spectral estimation. However, periodic signals are not just unstructured line spectral signals: there is a harmonic spacing between the lines which is not exploited by plain MUSIC. We will show that one can design much more accurate adaptations of MUSIC using Nested Periodic Subspaces. Compared to prior variants of MUSIC for the periodicity problem, our approach is much faster and yields much more accurate results for signals with integer periods. This work is also the first extension of MUSIC that uses simple integer-valued basis vectors instead of traditional complex exponentials to span the signal subspace. The advantages of the new methods are demonstrated both on simulations and on real-world applications such as DNA microsatellites, protein repeats and absence seizures. Apart from practical contributions, the theory of Nested Periodic Subspaces offers answers to a number of fundamental questions that were previously unanswered. For example, what is the minimum contiguous data-length needed to identify the period of a signal unambiguously?
Notice that the answer we seek is a fundamental identifiability bound, independent of any particular period estimation technique. Surprisingly, this basic question has never been answered before. In this thesis, we derive precise expressions for the minimum necessary and sufficient data-lengths for this question. We also extend these bounds to the context of mixtures of periodic signals. Once again, even though mixtures of periodic signals occur often in many applications, aspects such as the unique identifiability of the component periods were never rigorously analyzed before. We present such an analysis as well. While the above question deals with the minimum contiguous data-length required for period estimation, one may ask a slightly different question: if we are allowed to pick the samples of a signal in a non-contiguous fashion, how should we pick them so that we can estimate the period using the fewest samples? This question turns out to be quite difficult to answer in general. In this thesis, we analyze a smaller case in this regard, namely, that of resolving between two periods. It will be shown that the analysis is quite involved even in this case, and that the optimal sampling pattern takes an interesting form of sparsely located bunches. This result can also be extended to the case of multi-dimensional periodic signals. We very briefly address multi-dimensional periodicity in this thesis. Most prior DSP literature on multi-dimensional discrete-time periodic signals assumes the periods to be parallelepipeds. But as shown by the artist M. C. Escher, one can tile the space using a much more diverse variety of shapes. Is it always possible to account for such other periodic shapes using the traditional notion of parallelepiped periods? An interesting analysis in this regard is presented towards the end of the thesis.
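
    For readers unfamiliar with the building blocks, the sketch below computes the Ramanujan sum c_q(n) directly from its definition (the sum of exp(2*pi*i*k*n/q) over k coprime to q); these integer-valued periodic sequences are the starting point for the Nested Periodic Subspaces and the Ramanujan Filter Bank. This is a plain definition-level computation, not the thesis's algorithms.

        import numpy as np
        from math import gcd

        def ramanujan_sum(q, n):
            """Ramanujan sum c_q(n) = sum over k coprime to q of exp(2*pi*i*k*n/q).
            The result is always an integer, so the real part is rounded."""
            ks = np.array([k for k in range(1, q + 1) if gcd(k, q) == 1])
            return int(round(np.sum(np.cos(2 * np.pi * ks * n / q))))

        # One period of c_q for a few small q; note the integer values.
        for q in (1, 2, 3, 4, 6):
            print(q, [ramanujan_sum(q, n) for n in range(q)])
        # 1 [1]
        # 2 [1, -1]
        # 3 [2, -1, -1]
        # 4 [2, 0, -2, 0]
        # 6 [2, 1, -1, -2, -1, 1]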

    Bioinformatic Investigations Into the Genetic Architecture of Renal Disorders

    Modern genomic analysis has a significant bioinformatic component due to the high volume of complex data involved. During investigations into the genetic components of two renal diseases, we developed two software tools. // Genome-Wide Association Study (GWAS) datasets may be genotyped on different microarrays and subject to different annotation, leading to a mosaic case-control cohort with inherent errors, primarily due to strand mismatching. Our software REMEDY detects and corrects the strand designation of input datasets, and filters common sources of noise such as structural and multi-allelic variants. We performed a GWAS on a large cohort of steroid-sensitive nephrotic syndrome samples; the mosaic input datasets were pre-processed with REMEDY prior to merging and analysis. Our results show that REMEDY significantly reduced noise in the GWAS output. REMEDY outperforms existing software, offering automatic strand designation detection, comprehensive variant filtering, and high-speed variant matching against dbSNP. // The second tool supported the analysis of a newly characterised rare renal disorder: polycystic kidney disease with hyperinsulinemic hypoglycemia (HIPKD). Identification of the underlying genetic cause led to the hypothesis that a change in chromatin looping at a specific locus affects the aetiology of the disease. We developed LOOPER, a software suite capable of predicting chromatin loops from ChIP-Seq data, to explore the possible conformations of chromatin architecture in the HIPKD genomic region. LOOPER predicted several interesting functional and structural loops that supported our hypothesis. We then extended LOOPER to visualise ChIA-PET and ChIP-Seq data as a force-directed graph showing experimental structural and functional chromatin interactions. Next, we re-analysed the HIPKD region with LOOPER to show experimentally validated chromatin interactions. We first confirmed our original predicted loops and subsequently discovered that the local genomic region has many more chromatin features than first thought
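
    The kind of check REMEDY automates can be illustrated with a small sketch (this is not REMEDY's code): a variant whose alleles match the reference panel only after complementing both alleles is flagged as lying on the opposite strand, while ambiguous A/T and C/G variants cannot be resolved this way and are typically filtered.

        COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

        def needs_strand_flip(dataset_alleles, reference_alleles):
            """Return True if the variant's alleles match the reference panel
            only after flipping to the opposite strand. Illustrative of an
            automatic strand-designation check, not REMEDY's implementation."""
            a, b = dataset_alleles
            if {a, b} == {COMPLEMENT[a], COMPLEMENT[b]}:
                raise ValueError("ambiguous A/T or C/G variant: strand cannot be inferred")
            if {a, b} == set(reference_alleles):
                return False                               # already on the same strand
            if {COMPLEMENT[a], COMPLEMENT[b]} == set(reference_alleles):
                return True                                # opposite strand: flip needed
            raise ValueError("alleles do not match the reference on either strand")

        print(needs_strand_flip(("A", "G"), ("T", "C")))   # True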