601 research outputs found

    Circular Languages Generated by Complete Splicing Systems and Pure Unitary Languages

    Circular splicing systems are a formal model of a generative mechanism of circular words, inspired by the recombinant behaviour of circular DNA. Some unanswered questions concern the computational power of such systems, and finding a characterization of the class of circular languages generated by circular splicing systems is still an open problem. In this paper we solve this problem for complete systems, which are special finite circular splicing systems. We show that a circular language L is generated by a complete system if and only if the set Lin(L) of all words corresponding to L is a pure unitary language generated by a set closed under the conjugacy relation. The class of pure unitary languages was introduced by A. Ehrenfeucht, D. Haussler, and G. Rozenberg in 1983 as a subclass of the class of context-free languages, together with a characterization of regular pure unitary languages by means of a decidable property. As a direct consequence, we characterize (regular) circular languages generated by complete systems. We can also decide whether the language generated by a complete system is regular. Finally, we point out that complete systems have the same computational power as finite simple systems, an easy type of circular splicing system, defined in the literature from the very beginning, in which only one rule is allowed. From our results on complete systems, it follows that finite simple systems generate a class of context-free languages containing non-regular languages, showing the incorrectness of a longstanding result on simple systems.
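    The two notions in this statement can be made concrete with a few lines of code. Below is a minimal Python sketch (not taken from the paper; the insertion set I and the length bound are made-up example values), assuming the standard definition of a pure unitary language as the set of words derivable from the empty word by repeatedly inserting members of a finite set I at arbitrary positions; it also checks whether I is closed under the conjugacy relation, i.e. under cyclic rotations of its words.

```python
# Illustrative sketch (not from the paper): a pure unitary language generated
# by a finite insertion set I, and a check that I is closed under conjugacy.

def conjugates(w):
    """All cyclic rotations (conjugates) of w."""
    return {w[i:] + w[:i] for i in range(max(len(w), 1))}

def closed_under_conjugacy(I):
    """True iff every rotation of every word of I is again in I."""
    return all(conjugates(w) <= I for w in I)

def pure_unitary_language(I, max_len):
    """Words derivable from the empty word by repeatedly inserting some
    u in I at any position, restricted here to length <= max_len."""
    lang, frontier = {""}, {""}
    while frontier:
        new = set()
        for w in frontier:
            for u in I:
                if len(w) + len(u) > max_len:
                    continue
                for i in range(len(w) + 1):
                    v = w[:i] + u + w[i:]
                    if v not in lang:
                        new.add(v)
        lang |= new
        frontier = new
    return lang

I = {"ab", "ba"}                  # toy insertion set, closed under conjugacy
print(closed_under_conjugacy(I))  # True
print(sorted(pure_unitary_language(I, 4)))
```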

    Splicing Systems from Past to Future: Old and New Challenges

    A splicing system is a formal model of the recombinant behaviour of sets of double-stranded DNA molecules when acted on by restriction enzymes and ligase. In this survey we will concentrate on a specific behaviour of one type of splicing systems, introduced by Păun and subsequently developed by many researchers for both the linear and the circular case of the splicing definition. In particular, we will present recent results on this topic and show how they stimulate new challenging investigations. Comment: Appeared in: Discrete Mathematics and Computer Science. Papers in Memoriam Alexandru Mateescu (1952-2005). The Publishing House of the Romanian Academy, 2014. arXiv admin note: text overlap with arXiv:1112.4897 by other authors

    A generic imperative language for polynomial time

    The ramification method in Implicit Computational Complexity has been associated with functional programming, but adapting it to generic imperative programming is highly desirable, given the wider algorithmic applicability of imperative programming. We introduce a new approach to ramification which, among other benefits, adapts readily to fully general imperative programming. The novelty is in ramifying finite second-order objects, namely finite structures, rather than ramifying elements of free algebras. In so doing we bridge between Implicit Complexity's type-theoretic characterizations of feasibility and the data-flow approach of Static Analysis. Comment: 18 pages, submitted to a conference
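    To convey the flavour of the tiering discipline in an ordinary imperative setting, here is a purely illustrative Python sketch (not the paper's language, type system, or examples): every loop ranges only over the read-only input structure, and data built during the run never drives a loop bound, which is the informal reason such programs stay within polynomial time.

```python
# Illustrative sketch of the tiering idea behind ramification (not the paper's
# formalism): loops are controlled solely by the read-only input structure, so
# the running time is polynomially bounded in the size of that structure.

def transitive_closure(domain, edges):
    """domain: nodes of the input structure; edges: set of (u, v) pairs."""
    reach = set(edges)      # working data: may grow, but never bounds a loop
    for _ in domain:        # at most |domain| rounds
        for u in domain:    # every loop ranges over the fixed input domain
            for v in domain:
                for w in domain:
                    if (u, v) in reach and (v, w) in reach:
                        reach.add((u, w))
    return reach

print(transitive_closure([1, 2, 3], {(1, 2), (2, 3)}))
# the result also contains the derived pair (1, 3)
```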

    Declarative operations on nets

    To increase the expressiveness of knowledge representations, the graph-theoretical basis of semantic networks is reconsidered. Directed labeled graphs are generalized to directed recursive labelnode hypergraphs, which permit a most natural representation of multi-level structures and n-ary relationships. This net formalism is embedded into the relational/functional programming language RELFUN. Operations on (generalized) graphs are specified in a declarative fashion to enhance readability and maintainability. For this, nets are represented as nested RELFUN terms kept in a normal form by rules associated directly with their constructors. These rules rely on equational axioms postulated in the formal definition of the generalized graphs as a constructor algebra. Certain kinds of sharing in net diagrams are mirrored by binding common subterms to logical variables. A package of declarative transformations on net terms is developed. It includes generalized set operations, structure-reducing operations, and extended path searching. The generation of parts lists is given as an application in mechanical engineering. Finally, imperative net storage and retrieval operations are discussed.
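    To illustrate the term-based view of nets described above, here is a loose Python sketch (invented constructor names and toy data, not RELFUN syntax or the paper's package): a small directed labeled net encoded as a nested term, with a simple declarative-style path search over it.

```python
# Sketch (hypothetical constructors, not RELFUN): a directed labeled net as a
# nested term, plus a path search written declaratively (no mutation of the net).

# A net is a term ('net', (edge, ...)); each edge is ('edge', label, src, dst).
NET = ('net', (
    ('edge', 'part-of', 'wheel', 'car'),
    ('edge', 'part-of', 'bolt', 'wheel'),
    ('edge', 'made-of', 'bolt', 'steel'),
))

def successors(net, node, label=None):
    """All (label, dst) pairs reachable from node by one matching edge."""
    _, edges = net
    return [(lab, dst) for (_, lab, src, dst) in edges
            if src == node and (label is None or lab == label)]

def paths(net, start, goal, seen=()):
    """Label sequences along all simple paths from start to goal."""
    if start == goal:
        return [()]
    found = []
    for lab, nxt in successors(net, start):
        if nxt not in seen:
            found += [(lab,) + rest
                      for rest in paths(net, nxt, goal, seen + (start,))]
    return found

print(paths(NET, 'bolt', 'car'))  # [('part-of', 'part-of')]
```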

    New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly

    Great efforts have been devoted to deciphering the sequence composition of the genomes and transcriptomes of diverse organisms. Continuing advances in high-throughput sequencing technologies have led to a decline in associated costs, facilitating a rapid increase in the amount of available genetic data. In particular, genome studies have undergone a fundamental paradigm shift where genome projects are no longer limited by sequencing costs, but rather by the computational problems associated with assembly. There is an urgent demand for more efficient and more accurate methods. Most recently, “hybrid” methods that integrate short- and long-read data have been devised to address this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph. By design, unitigs are both unique and almost free of assembly errors. As a consequence, only few spurious overlaps are introduced into the graph. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts subgraphs whose global properties approach a disjoint union of paths in multiple steps, utilizing properties of proper interval graphs. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast, fruit fly, and human genomes compared to state-of-the-art pipelines, but also requires much less computational effort. An optimized C++ implementation dubbed MuCHSALSA further significantly reduces resource demands. Advances in RNA-seq have facilitated tremendous insights into the role of both coding and non-coding transcripts. Yet, the complete and accurate annotation of the transcriptomes of even model organisms has remained elusive. RNA-seq produces reads significantly shorter than the average distance between related splice events and presents high noise levels and other biases. The computational reconstruction remains a critical bottleneck. Ryūtō implements an extension of common splice graphs facilitating the integration of reads spanning multiple splice sites and paired-end reads bridging distant transcript parts. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem. Using phasing information from multi-splice and paired-end reads, nodes with uncertain connections are decomposed step-wise via Linear Programming. Ryūtō's performance compares favorably with state-of-the-art methods on both simulated and real-life datasets. Despite ongoing research and our own contributions, progress on traditional single-sample assembly has brought no major breakthrough. Multi-sample RNA-seq experiments provide more information which, however, is challenging to utilize due to the large amount of accumulating errors. An extension to Ryūtō enables the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at low-level features. Benchmarks show stable improvements already at 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō consistently improves assembly on replicates, demonstrably also when mixing conditions or time series and for differential expression analysis. Ryūtō's approach towards guided assembly is equally unique.
    It allows users to adjust results based on the quality of the guide, even for multi-sample assembly.
    Contents:
    1 Preface
      1.1 Assembly: A vast and fast evolving field
      1.2 Structure of this Work
      1.3 Available
    2 Introduction
      2.1 Mathematical Background
      2.2 High-Throughput Sequencing
      2.3 Assembly
      2.4 Transcriptome Expression
    3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
      3.1 Background
      3.2 Strategy
      3.3 Data preprocessing
      3.4 Processing of the overlap graph
      3.5 Post Processing of the Path Decomposition
      3.6 Benchmarking
      3.7 MuCHSALSA – Moving towards the future
    4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
      4.1 Background
      4.2 Strategy
      4.3 The Ryūtō core algorithm
      4.4 Improved Multi-sample transcript assembly with Ryūtō
    5 Conclusion & Future Work
      5.1 Discussion and Outlook
      5.2 Summary and Conclusion
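    To illustrate the graph construction described above for LazyB, here is a heavily simplified Python toy (invented data, not the LazyB or MuCHSALSA code): a bipartite "unitig occurs in long read" relation is projected onto a long-read overlap graph, from which simple paths are then greedily peeled.

```python
# Simplified sketch (not LazyB/MuCHSALSA): project a bipartite unitig/long-read
# relation onto a long-read overlap graph, then greedily extract simple paths.
from collections import defaultdict
from itertools import combinations

# toy input: which filtered short-read unitigs anchor in which long reads
unitig_hits = {
    'u1': ['r1', 'r2'],
    'u2': ['r2', 'r3'],
    'u3': ['r3', 'r4'],
}

# two long reads overlap if they share at least one anchoring unitig
overlap = defaultdict(set)
for reads in unitig_hits.values():
    for a, b in combinations(sorted(set(reads)), 2):
        overlap[a].add(b)
        overlap[b].add(a)

def peel_paths(graph):
    """Greedily walk unvisited nodes into simple paths (a stand-in for the far
    more careful path extraction the actual assembler performs)."""
    visited, paths = set(), []
    for start in sorted(graph):
        if start in visited:
            continue
        path, node = [start], start
        visited.add(start)
        while True:
            candidates = [n for n in sorted(graph[node]) if n not in visited]
            if not candidates:
                break
            node = candidates[0]
            visited.add(node)
            path.append(node)
        paths.append(path)
    return paths

print(peel_paths(overlap))  # [['r1', 'r2', 'r3', 'r4']]
```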

    Languages Generated by Iterated Idempotencies.

    The rewrite relation with parameters m and n and with the possible length limit = k or ≤ k we denote by ω^n_m, =kω^n_m, or ≤kω^n_m, respectively. The idempotency languages generated from a starting word w by the respective operations are denoted accordingly. Also other special cases of idempotency languages besides duplication have come up in different contexts. The investigations of Ito et al. about insertion and deletion, i.e., operations that are also observed in DNA molecules, have established that both of these operations preserve regularity.
    Our investigations about idempotency relations and languages start out from the case of a uniform length bound. For these relations =kω^n_m the conditions for confluence are characterized completely. Also the question of regularity is answered for the corresponding languages; the cases that are not regular are more complicated, but still belong to the class of context-free languages. For a general length bound, i.e. for the relations ≤kω^n_m, confluence does not hold so frequently. This complicatedness of the relations results also in more complicated languages, which are often non-regular. Without any length bound, idempotency relations have a very complicated structure. Over alphabets of one or two letters we still characterize the conditions for confluence. Over three or more letters, in contrast, only a few cases are solved. We determine the combinations of parameters that result in the regularity of the generated languages.
    In a second chapter some more involved questions are solved for the special case of duplication. First we shed some light on the reasons why it is so difficult to determine the context-freeness of duplication languages. We show that they fulfil all pumping properties and that they are very dense. Therefore all the standard tools to prove non-context-freeness do not apply here. The concept of root in Formal Language Theory is frequently used to describe the reduction of a word to another one which is in some sense elementary. For example, there are primitive roots, periodicity roots, etc. Elementary in connection with duplication are square-free words, i.e., words that do not contain any repetition. Thus we define the duplication root of w to consist of all the square-free words from which w can be reached via the duplication relation. Besides some general observations we prove the decidability of the question whether the duplication root of a language is finite. Then we devise a code which is robust under duplication of its code words. This would keep the result of a computation from being destroyed by duplications in the code words. We determine the exact conditions under which infinite such codes exist: over an alphabet of two letters they exist for a length bound of 2, over three letters already for a length bound of 1. Also we apply duplication to entire languages rather than to single words; then it is interesting to determine whether regular and context-free languages are closed under this operation. We show that the regular languages are closed under uniformly bounded duplication, while they are not closed under duplication with a general length bound. The context-free languages are closed under both operations. The thesis concludes with a list of open problems related to the thesis' topics.
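    The central objects of the thesis are easy to experiment with. The following Python sketch (toy parameters, not taken from the thesis) enumerates a length-bounded duplication language, tests square-freeness, and computes duplication roots by undoing duplications, i.e. by replacing square factors uu with u.

```python
# Toy sketch: bounded duplication, square-freeness, and duplication roots.

def square_free(w):
    """True iff w contains no factor of the form uu with u non-empty."""
    n = len(w)
    return not any(w[i:i + l] == w[i + l:i + 2 * l]
                   for l in range(1, n // 2 + 1)
                   for i in range(n - 2 * l + 1))

def duplication_step(w, k):
    """All words obtained from w by duplicating one factor of length <= k."""
    return {w[:i] + w[i:j] * 2 + w[j:]
            for i in range(len(w))
            for j in range(i + 1, min(i + k, len(w)) + 1)}

def duplication_language(w, k, max_len):
    """Words reachable from w by length-bounded duplication, up to max_len."""
    lang, frontier = {w}, {w}
    while frontier:
        frontier = {v for u in frontier for v in duplication_step(u, k)
                    if len(v) <= max_len} - lang
        lang |= frontier
    return lang

def undo_step(w):
    """All words obtained by replacing one square factor uu with u."""
    return {w[:i] + w[i + l:]
            for l in range(1, len(w) // 2 + 1)
            for i in range(len(w) - 2 * l + 1)
            if w[i:i + l] == w[i + l:i + 2 * l]}

def duplication_roots(w):
    """Square-free words from which w can be reached by duplications."""
    seen, frontier, roots = {w}, {w}, set()
    while frontier:
        nxt = set()
        for u in frontier:
            if square_free(u):
                roots.add(u)
            nxt |= undo_step(u) - seen
        seen |= nxt
        frontier = nxt
    return roots

print(sorted(duplication_language("ab", 1, 4)))  # words reached by duplicating single letters
print(duplication_roots("abab"))                 # {'ab'}
```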

    Structured Parallel Programming Using Trees (木を用いた構造化並列プログラミング)

    High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered irregular algorithms. General graph structures, which irregular algorithms generally deal with, are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are understandably hard to handle. However, trees lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a programming tool. We therefore deal with abstractions of tree-based computations. Our study started from Matsuzaki's work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation aspects. Specifically, we have dealt with two issues. First, we implemented a loose coupling between skeletons and data structures and developed a flexible tree skeleton library. Second, we implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and lets programmers use tree skeletons without extra burden. Unfortunately, the practicality of tree skeletons has not improved correspondingly. On the basis of observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs. Program analysis is therefore difficult to divide and conquer. To resolve this problem, we have developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen's high-level approach. Specifically, we have dealt with data-flow analysis based on Tarjan's formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality. A naive parallel neighborhood computation without locality enhancement causes many cache misses. The divide-and-conquer paradigm is known to be useful also for locality enhancement. We therefore applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations. (The University of Electro-Communications, 201)
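    To give a feel for what a tree skeleton provides, here is a deliberately minimal Python sketch (an invented interface, not Matsuzaki's library or the thesis' implementation): a reduce skeleton over binary trees whose two recursive calls are independent, which is exactly the divide-and-conquer structure a parallel implementation can exploit by evaluating subtrees on separate workers.

```python
# Sketch (invented API): a reduce skeleton over binary trees.  The two
# recursive calls are independent, so a parallel implementation could
# evaluate the subtrees concurrently.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: int
    left: Optional['Node'] = None
    right: Optional['Node'] = None

def tree_reduce(node, leaf, combine):
    """'leaf' stands in for a missing child; 'combine' merges a node's value
    with the results computed for its two subtrees."""
    if node is None:
        return leaf
    l = tree_reduce(node.left, leaf, combine)   # independent of ...
    r = tree_reduce(node.right, leaf, combine)  # ... this call: parallelizable
    return combine(node.value, l, r)

t = Node(1, Node(2, Node(4)), Node(3))
print(tree_reduce(t, 0, lambda v, l, r: v + l + r))      # 10 (sum of values)
print(tree_reduce(t, 0, lambda v, l, r: 1 + max(l, r)))  # 3  (height)
```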

    Tracing the evolution of long non-coding RNAs: Principles of comparative transcriptomics for splice site conservation and biological applications

    Eukaryotic cells exhibit extensive transcriptional diversity. Only about a quarter of the total RNA in the human cell can be accounted for by messenger RNA (mRNA), which conveys the genetic code for protein production. The remaining part of the transcriptome consists of rather heterogeneous molecules. While some classes are well defined and have been shown to carry out distinct functions, ranging from housekeeping to complex regulatory tasks, a large fraction of the transcriptional output is categorized solely on the basis of its lack of protein-coding capacity and its transcript length. Several studies have shown that, as a group, mRNA-like long non-coding RNAs (lncRNAs) are under stabilizing selection, albeit at much weaker levels than mRNAs. The conservation at the level of primary sequence is even lower, blurring the contrast between exonic and intronic parts, which impedes traditional methods of genome-wide homology search. As a consequence, their evolutionary history is a fairly unexplored field, and apart from a few experimentally studied cases, the vast majority of them are reported to be poorly conserved. However, the pervasive transcription and the highly specific spatio-temporal expression patterns of lncRNAs suggest their functional importance and make their evolutionary age and conservation patterns a topic of interest. By employing diverse computational methods, recent studies have shed light on the common conservation of lncRNAs' secondary and gene structures, highlighting the significance of structural features for functionality. Splice sites, in particular, are frequently retained over very large evolutionary time scales, as they maintain the intron-exon structure of the transcript. Consequently, the conservation of splice sites can be utilized in a comparative genomics approach to establish homology and predict evolutionarily well-conserved transcripts, regardless of their coding capacity. Since splice site conservation cannot be directly inferred from experimental evidence, in the course of this thesis a computational pipeline was established to generate comparative maps of splice sites based on multiple sequence alignments together with transcriptomics data. Scoring schemes for splice site motifs are employed to assess the conservation of orthologs. This resource can then be used to systematically study the conservation patterns of RNAs and their gene structures. This thesis demonstrates the versatility of this method by showcasing biological applications in three distinct studies. First, a comprehensive annotation of the human transcriptome, from RefSeq, ESTs and GENCODE, was used to trace the evolution of human lncRNAs. A large majority of human lncRNAs is found to be conserved across Eutheria, and many hundreds originated before the divergence of marsupials and placental mammals. However, they exhibit a rapid turnover of their transcript structures, indicating that they are in fact ancient components of the vertebrate genome with outstanding evolutionary plasticity. Additionally, a public web server was set up, which allows the user to retrieve sets of orthologous splice sites from pre-computed comparative splice site maps and inspect visualizations of their conservation in the respective species. Second, a more specific data set of non-colinearly spliced Latimeria RNAs is studied to fathom the origins of atypical transcripts.
    RNA-seq data from two coelacanth species are analyzed, yielding thousands of circular and trans-spliced products, with the majority of their splice junctions being surprisingly exclusive to atypically spliced forms, that is, they are not used in linear isoforms. The conservation analysis with comparative splice site maps yielded high conservation levels for both circularizing and trans-connecting splice sites. This fact, in combination with their abundance, strongly suggests that atypical RNAs are evolutionarily old and of functional importance. Lastly, comparative splice site maps are used to investigate the role of lncRNAs in the evolution of Alzheimer's disease (AD). The human specificity of AD clearly points to a phylogenetic aspect of the disease, which makes evolutionary analysis a very promising field of research. Protein-coding and non-protein-coding regions that have been identified as differentially expressed in AD patients are analyzed for conservation of their splice sites and evolution of their exon-intron structure. Both non-coding and protein-coding AD-associated genes are shown to have evolved more rapidly in their gene structure than the genome at large. This supports the view of AD as a consequence of the recent rapid adaptive evolution of the human brain. This phylogenetic trait might have far-reaching consequences with respect to the appropriateness of animal models and the development of disease-modifying strategies.
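    To make the idea of splice-site conservation scoring concrete, here is a toy Python sketch (invented alignment data and a deliberately naive score; the thesis' actual scoring scheme for splice site motifs is not reproduced here): it checks whether the canonical GT/AG dinucleotides survive at orthologous alignment columns and reports in how many species they do.

```python
# Toy sketch (invented data, not the thesis' pipeline): check canonical GT
# (donor) and AG (acceptor) dinucleotides at orthologous alignment columns
# and report a naive conservation count.

alignment = {                    # aligned genomic sequences, '-' = gap
    'human': 'CAGGTAAGT--TTTCAGGT',
    'mouse': 'CAGGTAAGTC-TTTCAGGT',
    'chick': 'CAGGCAAGT--TTTTAGGT',
}
donor_col, acceptor_col = 3, 15  # alignment columns of the human GT / AG

def splice_site_conserved(seq, donor_col, acceptor_col):
    """True iff the canonical dinucleotides are present at both sites."""
    return (seq[donor_col:donor_col + 2] == 'GT'
            and seq[acceptor_col:acceptor_col + 2] == 'AG')

scores = {sp: splice_site_conserved(seq, donor_col, acceptor_col)
          for sp, seq in alignment.items()}
print(scores)                    # chick has lost the donor GT in this toy data
print('conserved in', sum(scores.values()), 'of', len(scores), 'species')
```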

    The Last Universal Common Ancestor: emergence, constitution and genetic legacy of an elusive forerunner
