9 research outputs found
Meta-Alignment with Crumble and Prune: Partitioning very large alignment problems for performance and parallelization
Background: Continuing research into the global multiple sequence alignment problem has resulted in more sophisticated and principled alignment methods. Unfortunately, these new algorithms often require large amounts of time and memory to run, making it nearly impossible to run these algorithms on large datasets. As a solution, we present two general methods, Crumble and Prune, for breaking a phylogenetic alignment problem into smaller, more tractable sub-problems. We call Crumble and Prune meta-alignment methods because they use existing alignment algorithms and can be used with many current alignment programs. Crumble breaks long alignment problems into shorter sub-problems. Prune divides the phylogenetic tree into a collection of smaller trees to reduce the number of sequences in each alignment problem. These methods are orthogonal: they can be applied together to provide better scaling in terms of sequence length and in sequence depth. Both methods partition the problem such that many of the sub-problems can be solved independently. The results are then combined to form a solution to the full alignment problem.
Results: Crumble and Prune each provide a significant performance improvement with little loss of accuracy. In some cases, a gain in accuracy was observed. Crumble and Prune were tested on real and simulated data. Furthermore, we have implemented a system called Job-tree that allows hierarchical sub-problems to be solved in parallel on a compute cluster, significantly shortening the run-time.
Conclusions: These methods enabled us to solve gigabase alignment problems. These methods could enable a new generation of biologically realistic alignment algorithms to be applied to real world, large scale alignment problems.
Integrated multiple sequence alignment
Sammeth M. Integrated multiple sequence alignment. Bielefeld (Germany): Bielefeld University; 2005. The thesis presents enhancements for automated and manual multiple sequence alignment: existing alignment algorithms are made more easily accessible and new algorithms are designed for difficult cases.
Firstly, we introduce the QAlign framework, a graphical user interface for multiple sequence alignment. It comprises several state-of-the-art algorithms and exposes their parameters through convenient dialogs. An alignment viewer with guided editing functionality can also highlight or print regions of the alignment. Phylogenetic features are also provided, e.g., distance-based tree reconstruction methods, corrections for multiple substitutions and a tree viewer. The modular concept and the platform-independent implementation make the framework easy to extend.
Further, we develop a constrained version of the divide-and-conquer alignment such that it can be restricted by anchors found earlier with local alignments. It can be shown that this method shares attributes of both local and global aligners, in the quality of results as well as in the computation time. We further modify the local alignment step to work on bipartite (or even multipartite) sets for sequences where repeats overshadow valuable sequence information. In the end, a technique is established that can accurately align sequences containing possibly repeated motifs.
Finally, another algorithm is presented that compares tandem repeat sequences by aligning them with respect to their possible repeat histories. We describe an evolutionary model including tandem duplications and excisions, and give an exact algorithm to compare two sequences under this model.
Multiple sequence alignment with user-defined constraints at GOBICS
Most multi-alignment methods are fully automated, i.e. they are based on a fixed set of mathematical rules. For various reasons, such methods may fail to produce biologically meaningful alignments. Herein, we describe a semi-automatic approach to multiple sequence alignment where biological expert knowledge can be used to influence the alignment procedure. The user can specify parts of the sequences that are biologically related to each other; our software program uses these sites as anchor points and creates a multiple alignment respecting these user-defined constraints. By using known functionally, structurally or evolutionarily related positions of the input sequences as anchor points, our method can produce alignments that reflect the true biological relationships among the input sequences more accurately than fully automated procedures can.
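To make the idea of anchor points concrete, here is a minimal pairwise sketch. It is not the method of the software described above; the function names, the 0-based `(position_in_a, position_in_b)` anchor representation, and the single-residue anchors are all assumptions made for illustration. The sketch checks that a set of anchors can be respected by one alignment (no two anchors cross) and then cuts both sequences at the anchors, leaving segments that any aligner could process independently.

```python
from typing import List, Tuple

# Hypothetical representation: one anchor pairs a 0-based position in
# sequence A with a 0-based position in sequence B.
Anchor = Tuple[int, int]


def anchors_consistent(anchors: List[Anchor]) -> bool:
    """Return True if all anchors can be respected by a single alignment.

    Anchors are consistent when, sorted by position in A, their positions
    in B are strictly increasing as well (no crossing, no shared position).
    """
    ordered = sorted(anchors)
    return all(
        a1 < a2 and b1 < b2
        for (a1, b1), (a2, b2) in zip(ordered, ordered[1:])
    )


def split_at_anchors(
    seq_a: str, seq_b: str, anchors: List[Anchor]
) -> List[Tuple[str, str]]:
    """Cut both sequences at the anchored positions.

    The anchored residues themselves are forced to align, so only the
    segments between them are returned for independent alignment.
    """
    if not anchors_consistent(anchors):
        raise ValueError("anchors cross or share a position")
    segments = []
    prev_a = prev_b = 0
    for a, b in sorted(anchors):
        segments.append((seq_a[prev_a:a], seq_b[prev_b:b]))
        prev_a, prev_b = a + 1, b + 1
    segments.append((seq_a[prev_a:], seq_b[prev_b:]))
    return segments
```

For example, `split_at_anchors("ACGTAC", "AGGTTAC", [(2, 2), (4, 5)])` yields three segment pairs, one before, one between, and one after the two anchored residues.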
Generalization of predicates with string arguments
Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2002. Thesis (Master's) -- Bilkent University, 2002. Includes bibliographical references (leaves 60-63). String/sequence generalization is used in many different areas such as machine
learning, example-based machine translation and DNA sequence alignment. In this
thesis, a method is proposed to find the generalizations of the predicates with string
arguments from the given examples. Trying to learn from examples is a very hard
problem in machine learning, since finding the global optimal point to stop
generalization is a difficult and time consuming process. All the work done until now is
about employing a heuristic to find the best solution. This work is one of them. In this
study, some restrictions applied by the SLGG (Specific Least General Generalization)
algorithm, which is developed to be used in an example-based machine translation
system, are relaxed to find all possible alignments of two strings. Moreover, a
Euclidean-distance-like scoring mechanism is used to find the most specific
generalizations. Some of the generated templates are eliminated by four different
selection/filtering approaches to get a good solution set. Finally, the result set is
presented as a decision list, which handles exceptional cases. Canıtezer, Göker. M.S.
Tracing the evolution of long non-coding RNAs: Principles of comparative transcriptomics for splice site conservation and biological applications
Eukaryotic cells exhibit an extensive transcriptional diversity. Only about a quarter of the total
RNA in the human cell can be accounted for by messenger RNA (mRNA), which conveys the genetic
code for protein generation. The remaining part of the transcriptome consists of rather heterogeneous
molecules. While some classes are well defined and have been shown to carry out distinct functions,
ranging from housekeeping to complex regulatory tasks, a big fraction of the transcriptional output is
categorized solely based on the lack of protein-coding capacity and transcript length. Several studies
have shown that, as a group, mRNA-like long non-coding RNAs (lncRNAs) are under stabilizing
selection, albeit at much weaker levels than mRNAs. The conservation at the level of primary
sequence is even lower, blurring the contrast between exonic and intronic parts, which impedes
traditional methods of genome-wide homology search. As a consequence their evolutionary history
is a fairly unexplored field and apart from a few experimentally studied cases, the vast majority
of them is reported to be poorly conserved. However, the pervasive transcription and the highly
spatio-temporally specific expression patterns of lncRNAs suggest their functional importance and
make their evolutionary age and conservation patterns a topic of interest. By employing diverse
computational methods, recent studies shed light on the common conservation of lncRNAs' secondary
and gene structures, highlighting the significance of structural features on functionality. Splice sites,
in particular, are frequently retained over very large evolutionary time scales, as they maintain the
intron-exon-structure of the transcript.
Consequently, the conservation of splice sites can be utilized in a comparative genomics approach to
establish homology and predict evolutionarily well-conserved transcripts, regardless of their coding
capacity. Since splice site conservation cannot be directly inferred from experimental evidence, in
the course of this thesis a computational pipeline was established to generate comparative maps
of splice sites based on multiple sequence alignments together with transcriptomics data. Scoring
schemes for splice site motifs are employed to assess the conservation of orthologs. This resource
can then be used to systematically study the conservation patterns of RNAs and their gene structures.
This thesis will demonstrate the versatility of this method by showcasing biological applications of
three distinct studies.
First, a comprehensive annotation of the human transcriptome, from RefSeq, ESTs and GENCODE,
was used to trace the evolution of human lncRNAs. A large majority of human lncRNAs is found to
be conserved across Eutheria, and many hundreds originated before the divergence of marsupials and
placental mammals. However, they exhibit a rapid turnover of their transcript structures, indicating
that they are actual ancient components of the vertebrate genome with outstanding evolutionary
plasticity. Additionally, a public web server was set up, which allows the user to retrieve sets of
orthologous splice sites from pre-computed comparative splice site maps and inspect visualizations
of their conservation in the respective species.
Second, a more specific data set of non-colinearly spliced latimerian RNAs is studied to fathom the
origins of atypical transcripts. RNA-seq data from two coelacanth species are analyzed, yielding
thousands of circular and trans-spliced products, with a surprising exclusivity of the majority of
their splice junctions to atypically spliced forms, that is, they are not used in linear isoforms. The
conservation analysis with comparative splice site maps yielded high conservation levels for both
circularizing and trans-connecting splice sites. This fact in combination with their abundance strongly
suggests that atypical RNAs are evolutionarily old and of functional importance.
Lastly, comparative splice site maps are used to investigate the role of lncRNAs in the evolution of
Alzheimer’s disease (AD). The human specificity of AD clearly points out a phylogenetic aspect
of the disease, which makes the evolutionary analysis a very promising field of research.
Protein-coding and non-protein-coding regions that have been identified as differentially expressed in AD
patients are analyzed for conservation of their splice sites and evolution of their exon-intron structure.
Both non-coding and protein-coding AD-associated genes are shown to have evolved more rapidly
in their gene structure than the genome at large. This supports the view of AD as a consequence
of the recent rapid adaptive evolution of the human brain. This phylogenetic trait might have far
reaching consequences with respect to the appropriateness of animal models and the development
of disease-modifying strategies.
Progressive Multiple Alignment with Constraints
A progressive alignment algorithm produces a multi-alignment of a set of sequences by repeatedly aligning pairs of sequences and/or previously generated alignments. We describe a method for guaranteeing that the alignment generated by a progressive alignment strategy satisfies a user-specified collection of constraints about where certain sequence positions should appear relative to others. Given a collection of C constraints over K sequences whose total length is N, our algorithm takes O(K(N^2 + KC)) time. An alignment of the β-like globin gene clusters of several mammals illustrates the practicality of the method.
Key words: Multiple sequence alignment, constrained alignment, dynamic programming
1 Introduction
It is straightforward to extend the dynamic programming alignment algorithm (Needleman and Wunsch 1970) to the simultaneous alignment of K > 2 sequences. However, the O(2^K N^K) execution time for sequences of length N makes it impractical to align more than three seque..
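For orientation, the pairwise dynamic programming alignment of Needleman and Wunsch that the abstract builds on can be sketched as follows. This is a minimal illustration and not the constrained progressive algorithm of the paper; the function name and the scoring parameters (`match`, `mismatch`, and a linear `gap` penalty) are assumptions chosen for the example. For two sequences of lengths n and m it fills an (n+1) by (m+1) table, which is where the quadratic term in the running time comes from.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Optimal global alignment score of a and b under a linear gap penalty.

    Only two rows of the dynamic programming table are kept in memory,
    since each cell depends on its left, upper, and upper-left neighbours.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]
```

Extending the same recurrence to K sequences requires a K-dimensional table in which each cell has 2^K - 1 predecessors, which is the source of the O(2^K N^K) cost quoted above.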