20 research outputs found

    A review of the current methods for computational analysis of tandem repeats

    This paper considers some of the most important methods for computational tandem repeat analysis. The problem of repeat analysis is far from trivial because tandem repeats tend to be highly polymorphic motifs, i.e. different types of mutations within repeats have to be considered. The computational analysis of all types of mutations within repeats increases execution time, especially if chromosomes or whole genomes are the subject of the analysis. On the other hand, the time complexity improves significantly if only exact tandem repeats are considered, but this has less practical application. Each of the methods considered has pros and cons, and the most suitable solution may be a compromise between the opposing approaches.
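
The exact-repeat case mentioned in the abstract above can be sketched as follows. This is only an illustrative, naive scan, not a method taken from the reviewed paper: the function name, the `max_unit` and `min_copies` parameters and the greedy "longest span wins" rule are all assumptions made for the example. It ignores mutations entirely, which is what makes it fast and also what limits its practical use.

```python
def find_exact_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats in seq.

    Hypothetical helper: tries every unit length up to max_unit at each
    position and greedily keeps the candidate covering the longest span.
    """
    hits = []
    i = 0
    n = len(seq)
    while i < n:
        best = None
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # not enough sequence left for a unit of this length
            copies = 1
            # Count how many times the unit repeats back to back.
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                # Strict ">" prefers the shorter unit when spans are equal.
                if best is None or copies * k > best[1] * best[2]:
                    best = (unit, k, copies)
        if best is not None:
            unit, k, copies = best
            hits.append((i, unit, copies))
            i += k * copies  # skip past the reported repeat
        else:
            i += 1
    return hits


if __name__ == "__main__":
    print(find_exact_tandem_repeats("GGACACACACTT"))
```

Tolerating substitutions or indels would mean replacing the exact string comparison with an alignment or an edit-distance test, which is where the extra running time discussed in the abstract comes from.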

    Analysis Of DNA Motifs In The Human Genome

    DNA motifs include repeat elements, promoter elements and gene regulator elements, and play a critical role in the human genome. This thesis describes a genome-wide computational study on two groups of motifs: tandem repeats and core promoter elements. Tandem repeats in DNA sequences are highly relevant to biological phenomena and diagnostic tools. Computational programs that discover tandem repeats generate a huge volume of data, which can be difficult to interpret without further organization. A new method is presented here to organize and rank detected tandem repeats through clustering and classification. Our work presents multiple ways of expressing tandem repeats using the n-gram model with different clustering distance measures. Analysis of the clusters for the tandem repeats in the human genome shows that the method yields a well-defined grouping in which similarity among repeats is apparent. Our new, alignment-free method facilitates the analysis of the myriad tandem repeats in the human genome. We believe that this work will lead to new discoveries on the roles, origins, and significance of tandem repeats. As with tandem repeats, promoter sequences of genes contain binding sites for proteins that play critical roles in mediating expression levels. Promoter region binding proteins and their co-factors influence the timing and context of transcription. Despite the critical regulatory role of these non-coding sequences, computational methods to identify and predict DNA binding sites are extremely limited. The work reported here analyzes the relative occurrence of core promoter elements (CPEs) in and around transcription start sites. We found that, across all data sets, 49%-63% of upstream regions have either TATA box or DPE elements. Our results suggest the possibility of predicting transcription start sites by combining CPE signals with other promoter signals such as CpG islands and clusters of specific transcription factor binding sites.
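
The alignment-free idea in the abstract above, representing each repeat by its n-gram counts and comparing the resulting profiles with a distance measure, can be illustrated with a minimal sketch. The abstract does not say which distance measures the thesis uses, so cosine distance is an assumption chosen here for illustration, as are the function names and the default `n=3`.

```python
import math
from collections import Counter


def ngram_profile(seq, n=3):
    """Count every overlapping n-gram in seq (hypothetical helper)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def cosine_distance(p, q):
    """1 - cosine similarity of two n-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # an empty profile is treated as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


if __name__ == "__main__":
    a = ngram_profile("ACACACACAC", n=2)
    b = ngram_profile("ACACACAC", n=2)
    c = ngram_profile("GTTGTTGTT", n=2)
    print(cosine_distance(a, b), cosine_distance(a, c))
```

A pairwise distance matrix built this way could be handed to any standard clustering routine. No alignment is computed at any point, which is what keeps the comparison cheap for large numbers of repeats.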

    RIME: Repeat Identification

    We present an algorithm for detecting long similar fragments occurring at least twice in a set of biological sequences. The problem becomes computationally challenging when the frequency of a repeat is allowed to increase and when a non-negligible number of insertions, deletions and substitutions are allowed. We introduce in this paper an algorithm, Rime (for Repeat Identification: long, Multiple, and with Edits), that performs this task and manages instances whose size and combination of parameters cannot be handled by other existing methods. (Rime is also a reference to Coleridge's poem "The Rime of the Ancient Mariner", which contains many repetitions as a poetic device.) This is achieved by using a filter as a preprocessing step and then exploiting the information gathered by the filter in the subsequent repeat inference step. To the best of our knowledge, Rime is the first algorithm that can accurately deal with very long repeats (up to a few thousand), possibly occurring several times, and with a rate of differences (substitutions and indels) among copies of the same repeat of 10-15% or even more.

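    APPROCCI BIOINFORMATICI PER L’ANALISI DI DATI DI NEXT GENERATION SEQUENCING IN OPHRYS (ORCHIDACEAE) [Bioinformatic approaches for the analysis of next generation sequencing data in Ophrys (Orchidaceae)]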

    The doctoral project aimed to optimise bioinformatic techniques in an evolutionary context and to assemble genomes in order to help enrich the databases for the non-model genus Ophrys (Orchidaceae). This genus remains an open challenge for researchers, because it is characterised by a rapid evolutionary radiation that has prevented clear species identification using traditional genetic techniques. Two case studies were therefore addressed during the doctorate: the assembly of new plastid genomes of two species of the genus Ophrys, with their addition to the databases, and a more in-depth critical analysis aimed at optimising the analysis of GBS data in a complex of species of the genus Ophrys. The seed-extend approach proved to be the best for assembling the plastid genomes. The genomes contain 127 distinct genes, 25 of which are duplicated because they lie in the Inverted Repeat. In one species the ndhF gene was found to be truncated, while in the other the ycf1 gene was found to be partially duplicated. This rearrangement shifted the junction between the Inverted Repeat and Small Single Copy regions. Both plastid genomes have 213 microsatellite loci, 67 of which are polymorphic and can be used for phylogeographic analyses. The critical analysis of the GBS data was carried out using different strategies for filtering missing data and heterozygous loci. Using the plastid genome of one species as a reference, it was possible to distinguish six haplotypes, which allowed two phyletic clades to be identified. This result is consistent with a phylogenetic analysis performed by removing all heterozygous loci and selecting those shared by at least 70% of individuals. By contrast, including the heterozygous loci in the analysis and analysing those shared by at least 30% of individuals made it possible to distinguish the species. Overall, these two case studies made it possible to test and identify strategies for the bioinformatic analysis of genomic data in an evolutionary context and to assemble genomes that help enrich the databases. These techniques can be used for the annotation of organellar genomes in non-model species and for the analysis of GBS data.
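
The two GBS filtering strategies described in this abstract (a strict one that drops heterozygous loci and keeps loci shared by at least 70% of individuals, and a relaxed one that keeps heterozygous loci and lowers the threshold to 30%) can be sketched roughly as below. The data layout, a dictionary mapping each locus to one diploid genotype call per individual with `None` for missing data, is a hypothetical format chosen for the example, as are the function and parameter names. The thesis's actual pipeline is not described in the abstract.

```python
def filter_loci(calls, min_shared=0.7, drop_heterozygous=True):
    """Keep loci called in at least min_shared of individuals.

    calls: dict mapping locus name -> list of (allele, allele) tuples,
           one per individual, with None for a missing call (assumed format).
    """
    kept = {}
    for locus, genotypes in calls.items():
        present = [g for g in genotypes if g is not None]
        if not genotypes or len(present) / len(genotypes) < min_shared:
            continue  # too much missing data at this locus
        if drop_heterozygous and any(a != b for a, b in present):
            continue  # at least one individual is heterozygous here
        kept[locus] = genotypes
    return kept


if __name__ == "__main__":
    calls = {
        "L1": [("A", "A"), ("A", "A"), ("G", "G"), None],
        "L2": [("A", "T"), ("A", "A"), ("A", "A"), ("A", "A")],
        "L3": [("C", "C"), None, None, None],
    }
    print(sorted(filter_loci(calls, 0.7, True)))   # strict setting
    print(sorted(filter_loci(calls, 0.3, False)))  # relaxed setting
```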

    Lentiviral vector packaging cell line development using genome editing to target optimal loci discovered by high throughput DNA barcoding

    Lentiviral vectors are increasingly used as delivery methods in gene therapy clinical trials due to their high efficiency in transducing cells and the stability of transgene expression. The development of packaging and producer cell lines for the production of lentiviral vectors has always been a labour-intensive and lengthy process. Sequential introduction of vector components, adaptability to suspension cultures, autotransduction, and genetic, transcriptional or cell line growth instability are some of the limitations that cause significant drops in productivity. Improved transcription of self-inactivating vectors leading to high titers has been attempted in different ways with the intent to find a high stable producer clone. In this project, we studied the use of lentiviral vectors as a tool to target and identify high-transcribing loci in the genome of our host cells for lentiviral packaging cell line development. Third generation lentiviral vectors carrying eGFP under the control of an endogenous clinically-tested promoter (short EF1α) were produced, containing a variable DNA sequence tag (barcode) in their long terminal repeat (LTR). The aim of the barcode is to uniquely tag, identify and track a particular clone within the heterologous expressing population. Human embryonic kidney cell lines (HEK-293) were transduced with a barcoded lentiviral library at a low multiplicity of infection. We demonstrated that integration site analysis and next-generation sequencing of lentiviral barcoded vector junctions by ligation-mediated PCR (LM-PCR), coupled with RNA-Seq, allows for quantification of the relative abundance of each barcode variant in each specific genomic position. Expression cassettes containing lentiviral vector components were then site-specifically integrated into these genomic sites using CRISPR-Cas9 technology. The barcoding lentiviral system allows for rapid, high-resolution, high-throughput screening of gene expression in a large number of genomic positions naturally targeted for optimal vector expression, but also of lower expressing sites, in order to meet lentiviral cytotoxicity and stoichiometric constraints.

    An Information Theoretic Approach to Speaker Diarization of Meeting Recordings

    In this thesis we investigate a non-parametric approach to speaker diarization for meeting recordings based on an information theoretic framework. The problem is formulated using the Information Bottleneck (IB) principle. Unlike other approaches where the distance between speaker segments is arbitrarily introduced, the IB method seeks the partition that maximizes the mutual information between observations and variables relevant for the problem while minimizing the distortion between observations. The distance between speech segments is selected as the Jensen-Shannon divergence, as it arises from the IB objective function optimization. In the first part of the thesis, we explore IB based diarization with Mel frequency cepstral coefficients (MFCC) as input features. We study issues related to IB based speaker diarization such as optimizing the IB objective function and criteria for inferring the number of speakers. Furthermore, we benchmark the proposed system against a state-of-the-art system on the NIST RT06 (Rich Transcription) meeting data for speaker diarization. The IB based system achieves a speaker error rate (16.8%) similar to that of a baseline HMM/GMM system (17.0%). Being a non-parametric clustering approach, it performs diarization six times faster than real time, while the baseline is slower than real time. The second part of the thesis proposes a novel feature combination system in the context of IB diarization. Both the speaker clustering and speaker realignment steps are discussed. In contrast to current systems, the proposed method avoids feature combination by averaging log-likelihood scores. Two different sets of features were considered: (a) a combination of MFCC features with time delay of arrival (TDOA) features, and (b) a four feature stream combination that combines MFCC, TDOA, modulation spectrum and frequency domain linear prediction. Experiments show that the proposed system achieves a 5% absolute improvement over the baseline with the two feature combination, and 7% with the four feature combination. The increase in algorithmic complexity of the IB system with more features is minimal. The system with four feature input runs in real time, which is ten times faster than the GMM based system.
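
The Jensen-Shannon divergence named in this abstract as the distance between speech segments can be written down in a few lines. The sketch below is a generic textbook form, not the thesis's implementation: the function names are illustrative, the inputs are assumed to be discrete probability vectors of equal length, and the weights `w_p` and `w_q` (which in a clustering setting would typically reflect the relative sizes of the two segments) default to 0.5 as an assumption.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, w_p=0.5, w_q=0.5):
    """Weighted Jensen-Shannon divergence between two discrete distributions.

    Assumes w_p + w_q = 1 and that p and q each sum to 1.
    """
    # The mixture m is non-zero wherever p or q is, so the KL terms are finite.
    m = [w_p * pi + w_q * qi for pi, qi in zip(p, q)]
    return w_p * kl_divergence(p, m) + w_q * kl_divergence(q, m)


if __name__ == "__main__":
    print(js_divergence([0.9, 0.1], [0.1, 0.9]))
    print(js_divergence([0.5, 0.5], [0.5, 0.5]))
```

With equal weights the measure is symmetric and bounded by 1 bit, which makes it convenient as a merge cost in agglomerative clustering.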

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Classification of taxonomic units for biology and conservation in the cases of Lathyrus pannonicus and Oxytropis pilosa - Evaluation of morphological and phytosociological studies integrating molecular genetic data

    In order to conserve biodiversity and develop taxon-specific conservation measures it is important to define and identify taxonomic units that require protection. In this thesis, the differentiation processes in two species of the Fabaceae, the Hungarian Pea Lathyrus pannonicus (Jacq.) Garcke and the Woolly Milkvetch Oxytropis pilosa DC., are investigated using a combination of phytosociological surveys (using the Braun-Blanquet method) and molecular sequence data, and a set of analyses which include entirely novel approaches. Both species are relict species and elements of the Pontian-Pannonian floristic province. In the case of the generally more variable Lathyrus pannonicus, correlations are found between ecological, morphological and genetic inter-stand distances. Mean Ellenberg indicator values calculated for stands (based on the Braun-Blanquet relevés) made it possible to characterise the ecological properties of members of all subspecies, and to conclude that the genetic differentiation found in L. pannonicus is closely linked to Ellenberg’s moisture figure “F”. Lathyrus pannonicus and its subspecies fall into two major lineages in Europe: (1) a dry-adapted lineage (subspecies collinus, suevicus, and varius) thriving in habitats with mean “F” values of 4. Oxytropis pilosa is genetically less variable than Lathyrus pannonicus, which corresponds to its highly conserved morphology and ecology. Nevertheless, two major genetic variants are present. Using non-hierarchical clustering it is shown that this intra-specific variation in O. pilosa is as high as or higher than inter-species divergence in other subclades of the Astragalus/Oxytropis genus complex. The two main genetic types reflect geographic differentiation. Pure and mixed populations are found, all of which ought to be protected to maintain current levels of biodiversity. 
The results of the phytosociological and genetic analyses of both target species are discussed in the context of conservation strategies: rather than merely maintaining high numbers of (currently accepted or proposed) species, it is important to select and preserve the populations that provide the genetic resources within a species. The importance of taxon-specific conservation measures is highlighted, and recommendations are given on how to preserve biodiversity. The aim of this work is to trace the recent range history and speciation processes of two rare (Pontic-)Pannonian steppe plants of the legume family (Fabaceae), which differ in their degree of morphological subdivision, both by molecular genetics (cloning and sequencing of rDNA spacer regions) and by phytosociology. The two species studied, the Woolly Milkvetch (Oxytropis pilosa) and the Hungarian Pea (Lathyrus pannonicus), have a similar distribution across large parts of Eurasia. Whereas O. pilosa is morphologically homogeneous and only weakly differentiated genetically across its entire range, L. pannonicus is strongly subdivided into morphologically defined subspecies with remarkable genetic variability, which can be grouped on genetic and ecological grounds into one drought-loving and one moisture-loving ecotype. The ecological differentiation of the two species can be recorded and characterised with the Braun-Blanquet method, although a precise synsystematic assignment is neither possible nor intended. Ellenberg indicator values and ecologically defined habitat distances (via Bray-Curtis distances) were derived from the Braun-Blanquet relevés and compared with the molecular data obtained by cloning and sequencing. In this way it was possible to reach an evolutionarily interpretable result for L. pannonicus despite its high inter- and intra-individual variability. In L. pannonicus, ecological differentiation can be seen as the driving force of speciation, whereas a geographic differentiation signal is of secondary importance. The traditionally morphologically distinguished subspecies are replaced by two main types (evolutionary lineages) that can be clearly characterised genetically and ecologically and might even be regarded as species. The morphological uniformity of O. pilosa goes hand in hand with its ecological constancy, yet here too there are distinct genetic types, which correlate with geographic distribution. It was shown that the genetic variation within O. pilosa equals, or even statistically significantly exceeds, the divergence between generally accepted species of its sister genus Astragalus, some of which are also regarded as relicts. The comprehensive characterisation of the relict species studied shows which subspecies, ecotypes and ultimately populations should be given particular priority in conservation, and how their habitats can be characterised alongside the species and subspecies themselves. It becomes apparent that traditional taxonomic categories (subspecies, species, genera) are unsuitable for defining units worthy of protection. In the spirit of process conservation, it is desirable that the ecological site conditions, for all their variability, continue to allow the relict species to persist. The phytosociological relevés indicate that the species studied colonise disturbed biotopes in particular: they preferentially occupy habitats with gaps created by heavy disturbance, for example grazed areas or areas otherwise strongly influenced by human activity and thus kept very open, including railway embankments. Monitoring as a check on the effectiveness of measures is moreover a necessity for these relict species in their highly changeable habitats. The difficulty of conservation measures under these circumstances is discussed, as is habitat connectivity, which is not necessarily effective for relict species. In particular, the distinctive character of many relict sites makes it difficult to connect habitats. For the preservation of biodiversity it is important, and also promising, to secure the continued existence of these relict species, whose particular characteristics and adaptations are now comprehensively understood, and to continue following their speciation process, not least under conditions of climate change.

    The dynamics of complex systems. Studies and applications in computer science and biology

    Our research has focused on the study of complex dynamics and on their use in both information security and bioinformatics. Our first work has been on chaotic discrete dynamical systems, and links have been established between these dynamics on the one hand, and either random or complex behaviors on the other. Applications to information security include pseudorandom number generation, hash functions, information hiding, and security aspects of wireless sensor networks. On the bioinformatics level, we have applied our studies of complex systems to the evolution of genomes and to protein folding.

    Complexity, Emergent Systems and Complex Biological Systems: Complex Systems Theory and Biodynamics. [Edited book by I.C. Baianu, with listed contributors (2011)]

    An overview is presented of system dynamics (the study of the behaviour of complex systems), dynamical systems in mathematics, dynamic programming in computer science and control theory, complex systems biology, neurodynamics and psychodynamics.