1,208 research outputs found

    On the genealogy of a population of biparental individuals

    If one goes backward in time, the number of ancestors of an individual doubles at each generation. This exponential growth very quickly exceeds the population size when that size is finite. As a consequence, the ancestors of a given individual cannot all be different, and most remote ancestors are repeated many times in any genealogical tree. The statistical properties of these repetitions in genealogical trees of individuals for a panmictic closed population of constant size N can be calculated. We show that the distribution of the repetitions of ancestors reaches a stationary shape after a small number Gc ~ log N of generations in the past, that only about 80% of the ancestral population belongs to the tree (due to coalescence of branches), and that two trees for individuals in the same population become identical after Gc generations have elapsed. Our analysis is easy to extend to the case of an exponentially growing population. Comment: 14 pages, 7 figures, to appear in the Journal of Theoretical Biology
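
The "about 80%" figure in this abstract can be illustrated with a small sketch. It is not the authors' calculation. It assumes a simplified model in which every individual draws two parents independently and uniformly from the previous generation (the two draws may coincide, and sexes are ignored). Under that assumption, if a fraction x of one generation belongs to the tree, roughly 1 - exp(-2x) of the generation before it does, and iterating that map settles near 0.80. The function names are hypothetical.

```python
import math
import random


def ancestor_fraction_fixed_point(iterations=200):
    """Iterate x -> 1 - exp(-2x), a large-N approximation (an assumption of
    this sketch) for the fraction of a generation that belongs to the tree."""
    x = 0.5
    for _ in range(iterations):
        x = 1.0 - math.exp(-2.0 * x)
    return x


def simulate_ancestor_fraction(n, generations, seed=0):
    """Trace one individual's ancestors backward in a population of constant
    size n, returning the fraction of each past generation that is in the tree."""
    rng = random.Random(seed)
    ancestors = {0}  # start from a single present-day individual
    fractions = []
    for _ in range(generations):
        parents = set()
        for _ in ancestors:
            # Two parents per individual, drawn uniformly (assumed model).
            parents.add(rng.randrange(n))
            parents.add(rng.randrange(n))
        ancestors = parents
        fractions.append(len(ancestors) / n)
    return fractions


if __name__ == "__main__":
    print("fixed point:", round(ancestor_fraction_fixed_point(), 4))
    print("simulated:  ", simulate_ancestor_fraction(2000, 40)[-1])
```

The simulated fraction grows roughly geometrically for the first log2(N) generations and then fluctuates around the fixed point, which matches the saturation after Gc ~ log N generations described above.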

    Algorithms and methods for large-scale genome rearrangements identification

    This thesis by compendium addresses the formal definition of SBs, starting from High-scoring Segment Pairs (HSPs), which are well known and widely accepted. The first objective focused on detecting SBs as a combination of HSPs, including repeats, which increased the complexity of the model. The result was a more precise method that improves on the quality of state-of-the-art results. This method applies rules based on the adjacency of SBs and can also detect LSGRs and identify them as inversions, translocations, or duplications, constituting a framework capable of working with LSGRs for single-chromosome organisms. Later, in a second article, this framework was used to refine the borders of SBs. In our novel proposal, the repeats flanking the SBs were used to refine the borders by exploiting the redundancy those repeats introduce. From a multiple alignment of these repeats, the identity vectors of the SB and of the consensus sequence of the aligned repeats are computed. A finite-state machine designed to detect the transition points in the difference between the two vectors then determines the start and end points of the refined SBs. This method also proved useful for detecting break points (BPs). These points appear as the region between two adjacent SBs. The method does not force the BP to be a region or a point; rather, that depends on the alignments of the repeats and of the SB in question. The method is applied in a third work, which tackles a metagenome analysis use case. It is well known that the information stored in databases does not necessarily correspond to the uncultured samples contained in a metagenome, and one can imagine that the assignment of a metagenome sample could be hindered by a rearrangement event. 
The article shows that the metagenome samples that map to the exclusive regions of a genome (those it does not share with other genomes) support the presence of that genome in the metagenome. These exclusive regions are easily derived from a multiple genome comparison, as the regions that are not part of any SB. A definition in a multiple-genome comparison space is more precise than definitions built from pairwise comparisons since, among other things, it allows refinement following a procedure similar to the one described in the second article (using SBs instead of repeats). This definition also resolves the contradiction in the definition of BPs (mentioned in the second publication), whereby the same region of a genome can be detected as a BP or form part of an SB depending on the genome it is compared with. This multiple-comparison definition of SB also provides precise information for reconstructing LSGRs, with a view to approximating the true common ancestor of the species. In addition, it offers a solution to the granularity problem in SB detection: we start with small, well-conserved SBs, and the reconstruction of LSGRs gradually increases the size of these blocks. 
The results expected from this line of work point to the definition of a metric for obtaining more precise inter-genome distances, combining sequence similarity and LSGR frequencies. This thesis is a compendium of three articles recently published in high-impact journals, in which we show the process that led us to propose the definition of Elementary Conservation Units (conserved regions between genomes that are detected after a multiple comparison), together with some basic operations such as inversions, transpositions, and duplications. The three articles are transversally connected by the detection of Synteny Blocks (SBs) and large-scale genome rearrangements (LSGRs) (see section 2), and they support the need to develop the framework described in the section "Systems And Methods". Indeed, the intellectual work carried out in this thesis and the conclusions contributed by the publications have been essential to understanding that an appropriate definition of SB is the key to many comparative genomics methods. DNA rearrangement events are one of the main drivers of evolution, and their effects can be observed in new species, new biological functions, etc. Small-scale rearrangements such as insertions, deletions, or substitutions have been widely studied, and accepted models exist for detecting them. However, methods for identifying large-scale rearrangements still suffer from limitations and a lack of precision, mainly because there is still no accepted definition of SB. The concept of SB refers to regions conserved between two genomes that keep the same order and strand. Although methods exist to detect them, they avoid dealing with repeats or restrict the search to coding regions for the sake of a simpler model. 
Refining the borders of these blocks remains an unsolved problem today

    Cedratvirus lausannensis - digging into Pithoviridae diversity.

    Amoeba-infecting viruses have raised scientists' interest due to their novel particle morphologies, their large genome size and their genomic content challenging previously established dogma. We report here the discovery and the characterization of Cedratvirus lausannensis, a novel member of the Megavirales, with a 0.75-1 µm long amphora-shaped particle closed by two striped plugs. Among numerous host cell types tested, the virus replicates only in Acanthamoeba castellanii leading to host cell lysis within 24 h. C. lausannensis was resistant to ethanol, hydrogen peroxide and heating treatments. Like 30 000-year-old Pithovirus sibericum, C. lausannensis enters by phagocytosis, releases its genetic content by fusion of the internal membrane with the inclusion membrane and replicates in intracytoplasmic viral factories. The genome encodes 643 proteins that confirmed the grouping of C. lausannensis with Cedratvirus A11 as phylogenetically distant members of the family Pithoviridae. The 575,161 bp AT-rich genome is essentially devoid of the numerous repeats harbored by Pithovirus, suggesting that these non-coding repetitions might be due to a selfish element rather than particular characteristics of the Pithoviridae family. The discovery of C. lausannensis confirms the contemporary worldwide distribution of Pithoviridae members and the characterization of its genome paves the way to better understand their evolution

    Genomic Scaffold Filling Revisited

    The genomic scaffold filling problem has attracted a lot of attention recently. The problem is to fill an incomplete sequence (scaffold) I into I', with respect to a complete reference genome G, such that the number of adjacencies between G and I' is maximized. The problem is NP-complete and APX-hard, and admits a 1.2-approximation. However, the sequence input I is not quite practical and does not fit most real datasets (where a scaffold is more often given as a list of contigs). In this paper, we revisit the genomic scaffold filling problem by considering this important case when (1) a scaffold S is given, the missing genes X = c(G) - c(S) can only be inserted in between the contigs, and the objective is to maximize the number of adjacencies between G and the filled S', and (2) a scaffold S is given, a subset of the missing genes X' ⊂ X = c(G) - c(S) can only be inserted in between the contigs, and the objective is still to maximize the number of adjacencies between G and the filled S''. For problem (1), we present a simple NP-completeness proof, we then present a factor-2 greedy approximation algorithm, and finally we show that the problem is FPT when each gene appears at most d times in G. For problem (2), we prove that the problem is W[1]-hard and then we present a factor-2 FPT-approximation for the case when each gene appears at most d times in G
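
To make the objective concrete, here is a minimal sketch of how the quantities in this abstract could be computed. It is not the paper's algorithm. It assumes that an adjacency is an unordered pair of consecutive genes, that the number of adjacencies between two sequences is the size of the multiset intersection of their adjacency multisets, and that c(·) is the multiset of genes. The helper names are hypothetical.

```python
from collections import Counter


def adjacencies(seq):
    """Multiset of unordered pairs of consecutive genes (assumed definition)."""
    return Counter(tuple(sorted(pair)) for pair in zip(seq, seq[1:]))


def common_adjacencies(g, s):
    """Number of adjacencies shared by g and s, counted with multiplicity."""
    return sum((adjacencies(g) & adjacencies(s)).values())


def missing_genes(g, contigs):
    """X = c(G) - c(S): genes of G not yet present in the scaffold's contigs."""
    present = Counter(gene for contig in contigs for gene in contig)
    return Counter(g) - present


if __name__ == "__main__":
    g = list("abcde")
    contigs = [list("ab"), list("de")]
    print(missing_genes(g, contigs))             # the genes still to insert
    unfilled = contigs[0] + contigs[1]
    filled = contigs[0] + ["c"] + contigs[1]     # insert between the contigs
    print(common_adjacencies(g, unfilled), common_adjacencies(g, filled))
```

In this toy instance, inserting the missing gene between the two contigs raises the shared adjacency count, which is the quantity problems (1) and (2) seek to maximize.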

    Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence Comparison

    Sequence analysis and structure analysis are two of the fundamental areas of bioinformatics research. This dissertation discusses, specifically, protein structure related problems including protein structure alignment and query, and genome sequence related problems including haplotype reconstruction and genome rearrangement. It first presents an algorithm for pairwise protein structure alignment that is tested with structures from the Protein Data Bank (PDB). In many cases it outperforms two other well-known algorithms, DaliLite and CE. The preliminary algorithm is a graph-theory based approach, which uses the concept of "stars" to reduce the complexity of clique-finding algorithms. The algorithm is then improved by introducing "double-center stars" in the graph and applying a self-learning strategy. The updated algorithm is tested with a much larger set of protein structures and shown to be an improvement in accuracy, especially in cases of weak similarity. A protein structure query algorithm is designed to search for similar structures in the PDB, using the improved alignment algorithm. It is compared with SSM and shows better performance with lower maximum and average Q-score for missing proteins. An interesting problem dealing with the calculation of the diameter of a 3-D sequence of points arose and its connection to the sublinear time computation is discussed. The diameter calculation of a 3-D sequence is approximated by a series of sublinear time deterministic, zero-error and bounded-error randomized algorithms and we have obtained a series of separations about the power of sublinear time computations. This dissertation also discusses two genome sequence related problems. A probabilistic model is proposed for reconstructing haplotypes from SNP matrices with incomplete and inconsistent errors. The experiments with simulated data show both high accuracy and speed, conforming to the theoretically provable efficiency and accuracy of the algorithm. 
Finally, a genome rearrangement problem is studied. The concept of non-breaking similarity is introduced. Approximating the exemplar non-breaking similarity to factor n^(1-ε) is proven to be NP-hard. Interestingly, for several practical cases, several polynomial time algorithms are presented

    A candidate gene for fire blight resistance in Malus × robusta 5 is coding for a CC-NBS-LRR

    Acquired under the Swiss National Licences (http://www.nationallizenzen.ch). Fire blight is the most important bacterial disease in apple (Malus × domestica) and pear (Pyrus communis) production. Today, the causal bacterium Erwinia amylovora is present in many apple- and pear-growing areas. We investigated the natural resistance of the wild apple Malus × robusta 5 against E. amylovora, previously mapped to linkage group 3. With a fine-mapping approach on a population of 2,133 individuals followed by phenotyping of the recombinants from the region of interest, we developed flanking markers useful for marker-assisted selection. Open reading frames were predicted on the sequence of a BAC spanning the resistance locus. One open reading frame coded for a protein belonging to the NBS–LRR family. The in silico investigation of the structure of the candidate resistance gene against fire blight of M. × robusta 5, FB_MR5, led us to hypothesize the presence of a coiled-coil region followed by an NBS and an LRR-like structure with the consensus ‘LxxLx[IL]xxCxxLxxL’. The function of FB_MR5 was predicted in agreement with the decoy/guard model, that FB_MR5 monitors the transcribed RIN4_MR5, a homolog of RIN4 of Arabidopsis thaliana that could interact with the previously described effector AvrRpt2EA of E. amylovora

    Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome

    The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a greater length increases the chance of reads being uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 to 1000 basepairs. We use the proportion of non-singleton k-mers to evaluate the mappability of reads for a corresponding read length. We observe that the proportion of non-singletons decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different k ranges. A faster decay at smaller values for k indicates more limited gains for read lengths > 200 basepairs. The frequency distributions of k-mers exhibit long tails in a power-law-like trend, and rank frequency plots exhibit a concave Zipf's curve. The location of the most frequent 1000-mers comprises 172 kilobase-ranged regions, including four large stretches on chromosomes 1 and X, containing genes with biomedical implications. Even the read length 1000 would be insufficient to reliably sequence these specific regions. Comment: 5 figures
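
The mappability measure described in this abstract can be sketched on a toy sequence. This is an illustration only, not the authors' pipeline. It assumes the proportion is taken over k-mer positions (each occurrence counts once), it looks at a single strand and ignores reverse complements, and the function name is hypothetical.

```python
from collections import Counter


def non_singleton_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer positions in seq whose k-mer occurs more than once.

    Assumption of this sketch: single strand only, counted per position.
    """
    n_positions = len(seq) - k + 1
    if k <= 0 or n_positions <= 0:
        raise ValueError("k must be between 1 and len(seq)")
    counts = Counter(seq[i:i + k] for i in range(n_positions))
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n_positions


if __name__ == "__main__":
    toy = "ACGTACGTTT"
    for k in (3, 4, 5):
        print(k, non_singleton_fraction(toy, k))
```

On the toy sequence the fraction falls as k grows, the same qualitative trend the abstract reports for the human reference genome. A genome-scale analysis would need a memory-efficient k-mer counter instead of an in-memory `Counter`.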

    ALIGNMENT-FREE METHODS AND ITS APPLICATIONS

    Comparing biological sequences remains one of the most vital activities in bioinformatics. It addresses the relatedness between species and finds similar structures that might lead to similar functions. Sequence alignment is the default method and has been used in the domain for over four decades. It has gained a lot of trust, but limitations and even failures have been reported, especially with newly generated genomes. These genomes are larger and, to some extent, contain errors, which come mainly from the sequencing machine. Such sequencing errors should be considered when submitting sequences to GenBank; in sequence comparison, it is often hard to address or even trace this problem. Alignment-based methods can fail in the presence of such errors, and although biologists still trust them, reports have shown failures with these methods. The poor results of alignment-based methods on error-prone sequences motivated researchers in the domain to look for alternatives. These alternative methods are alignment-free and are intended to overcome the shortcomings of alignment-based methods. The work of this thesis is based on alignment-free methods; it conducts an in-depth study to evaluate these methods and to find the right application domain for them. One way to find that domain is to apply the methods to data subjected to manufactured errors, and to test whether they provide better comparison results on data that naturally contain severe errors. The two techniques used in this work are compression-based and motif-based (also called k-mer based or signal based). We also addressed the selection of the motifs used in the second technique, and how to improve the results by selecting specific motifs that enhance their quality. In addition, we applied an alignment-free method to a different domain, gene prediction. 
We use alignment-free methods in gene prediction to speed up the process while providing high-quality results, and to predict accurate stretches in the DNA sequence that would be considered parts of genes
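
As an illustration of what a compression-based, alignment-free comparison can look like, the sketch below computes the normalized compression distance (NCD) with Python's standard `zlib` module. The abstract does not say which compression measure or compressor the thesis uses, so both NCD and `zlib` are assumptions chosen for this example.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    rng = random.Random(0)
    a = bytes(rng.choice(b"ACGT") for _ in range(2000))
    b = bytes(rng.choice(b"ACGT") for _ in range(2000))
    # Copy of `a` with a few simulated sequencing errors (substitutions).
    noisy = bytearray(a)
    for pos in (100, 700, 1500):
        noisy[pos] = ord("A") if noisy[pos] != ord("A") else ord("C")
    print("a vs noisy copy:", ncd(a, bytes(noisy)))
    print("a vs unrelated: ", ncd(a, b))
```

No alignment is computed, so a handful of substitutions leaves the noisy copy much closer to the original than an unrelated sequence is. Note that `zlib` only finds repeats within a 32 KB window, so sequences longer than that would need a different compressor.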

    Comparative genomics of Steinernema reveals deeply conserved gene regulatory networks

    Background: Parasitism is a major ecological niche for a variety of nematodes. Multiple nematode lineages have specialized as pathogens, including deadly parasites of insects that are used in biological control. We have sequenced and analyzed the draft genomes and transcriptomes of the entomopathogenic nematode Steinernema carpocapsae and four congeners (S. scapterisci, S. monticolum, S. feltiae, and S. glaseri). Results: We used these genomes to establish phylogenetic relationships, explore gene conservation across species, and identify genes uniquely expanded in insect parasites. Protein domain analysis in Steinernema revealed a striking expansion of numerous putative parasitism genes, including certain protease and protease inhibitor families, as well as fatty acid- and retinol-binding proteins. Stage-specific gene expression of some of these expanded families further supports the notion that they are involved in insect parasitism by Steinernema. We show that sets of novel conserved non-coding regulatory motifs are associated with orthologous genes in Steinernema and Caenorhabditis. Conclusions: We have identified a set of expanded gene families that are likely to be involved in parasitism. We have also identified a set of non-coding motifs associated with groups of orthologous genes in Steinernema and Caenorhabditis involved in neurogenesis and embryonic development that are likely part of conserved protein–DNA relationships shared between these two genera