
    A Fast and Specific Alignment Method for Minisatellite Maps

    Background: Variable minisatellites count among the most polymorphic markers of eukaryotic and prokaryotic genomes. This variability can affect gene coding regions, as in the prion protein gene, or gene regulation regions, as for the cystatin B gene, and can be associated with or implicated in diseases: Creutzfeldt-Jakob disease and myoclonus epilepsy type 1, respectively. When it affects neutrally evolving regions, the polymorphism in length (i.e. in number of copies) of minisatellites has proved useful in population genetics. Motivation: In these tandem repeat sequences, different mutational mechanisms cause the number of copies, as well as the copies themselves, to vary. In particular, the interspersion of tandem duplication/contraction events and point mutations makes the succession of variant repeats much more informative than allele length alone. Exploiting this information requires the ability to align minisatellite alleles while accounting for both point mutations and tandem duplications. Results: We propose a minisatellite map alignment program that improves on previous solutions. Our new program is faster, simpler, considers an extended evolutionary model, and is available to the community. We test it on the data set of 609 alleles of the MSY1 (DYF155S1) human minisatellite and confirm its ability to recover known evolutionary signals. Our experiments highlight that the informativeness of minisatellites resides in both their length and composition polymorphisms. Exploiting both simultaneously is critical to unravel the implications of variable minisatellites in the control of gene expression and diseases. Availability: Software is available at http://atgc.lirmm.fr/ms_align/ Keywords: VNTR, tandem repeat, tandem duplication, variable costs, dynamic programming, sequence comparison
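The idea of aligning minisatellite maps under variable costs can be illustrated with a small dynamic-programming sketch. This is not the ms_align algorithm: it is a deliberately simplified edit distance in which removing or adding a repeat unit identical to its left neighbour (a stand-in for tandem contraction/duplication) is cheaper than an arbitrary indel. The function name `map_distance` and the cost values `sub`, `indel` and `dup` are assumptions chosen for illustration only.

```python
def map_distance(a, b, sub=1.0, indel=2.0, dup=0.5):
    """Simplified variable-cost edit distance between two minisatellite maps.

    a, b : sequences of variant-repeat codes (e.g. strings such as "aab").
    sub, indel, dup : hypothetical costs for a point mutation, an arbitrary
    indel, and a tandem duplication/contraction of a single unit.
    """
    def unit_cost(seq, i):
        # Removing seq[i] counts as a contraction (cheap) when it copies
        # the unit immediately to its left, otherwise as a plain indel.
        return dup if i > 0 and seq[i] == seq[i - 1] else indel

    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + unit_cost(a, i - 1)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + unit_cost(b, j - 1)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = d[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub)
            delete = d[i - 1][j] + unit_cost(a, i - 1)
            insert = d[i][j - 1] + unit_cost(b, j - 1)
            d[i][j] = min(match, delete, insert)
    return d[n][m]
```

For example, `map_distance("aab", "ab")` charges only the duplication cost, because the two maps differ by one copy of the unit `a`, whereas `map_distance("ab", "ac")` charges one point mutation.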

    Multilocus variable number of tandem repeat analysis reveals multiple introductions in Spain of Xanthomonas arboricola pv. pruni, the causal agent of bacterial spot disease of stone fruits and almond

    Xanthomonas arboricola pv. pruni is the causal agent of the bacterial spot disease of stone fruits, almond and some ornamental Prunus species. In Spain it was first detected in 2002 and since then, several outbreaks have occurred in different regions affecting mainly Japanese plum, peach and almond, both in commercial orchards and nurseries. As the origin of the introduction(s) was unknown, we have assessed the genetic diversity of 239 X. arboricola pv. pruni strains collected from 11 Spanish provinces from 2002 to 2013 and 25 reference strains from international collections. We have developed an optimized multilocus variable number of tandem repeat analysis (MLVA) scheme targeting 18 microsatellites and five minisatellites. A high discriminatory power was achieved since almost 50% of the Spanish strains were distinguishable, confirming the usefulness of this genotyping technique at small spatio-temporal scales. Spanish strains grouped in 18 genetic clusters (conservatively delineated so that each cluster contained haplotype networks linked by up to quadruple-locus variations). Furthermore, pairwise comparisons among populations from different provinces showed a strong genetic differentiation. Our results suggest multiple introductions of this pathogen in Spain and redistribution through contaminated nursery propagative plant material

    TRStalker: an efficient heuristic for finding fuzzy tandem repeats

    Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While current methods are rather effective for TRs with a low or medium level of divergence, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events.

    TRStalker: an Efficient Heuristic for Finding NP-Complete Tandem Repeats

    Genomic sequences in higher eucaryotic organisms contain a substantial amount of (almost) repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage, are characterized by close spatial contiguity, and play an important role in several molecular regulatory mechanisms. Certain types of tandem repeats are highly polymorphic and constitute a fingerprint feature of individuals. Abnormal TRs are known to be linked to several diseases. Over the last 20 years, researchers in bioinformatics have proposed many formal definitions for the rather loose notion of a Tandem Repeat and have proposed exact or heuristic algorithms to detect TRs in genomic sequences. The general trend has been to use formal (implicit or explicit) definitions of TR for which verification of the solution was easy (with complexity linear, or polynomial, in the TR's length and substitution+indel rates), while the effort was directed towards efficiently identifying the substrings of the input to submit to the verification phase (either implicitly or explicitly). In this paper we take a step forward: we use a definition of TR for which the verification step is also difficult (in effect, NP-complete) and we develop new filtering techniques for coping with high error levels. The resulting heuristic algorithm, christened TRStalker, is approximate since it cannot guarantee that all NP-Complete Tandem Repeats satisfying the target definition in the input string will be found. However, in synthetic experiments with 30% of errors allowed, TRStalker has demonstrated a very high recall (ranging from 100% to 60%, depending on motif length and repetition number) for the NP-complete TRs. TRStalker has consistently better performance than some state-of-the-art methods for a large range of parameters on the class of NP-complete Tandem Repeats. TRStalker aims at improving the capability of TR detection for classes of TRs for which existing methods do not perform well.

    ModuleOrganizer: detecting modules in families of transposable elements

    Background: Most known eukaryotic genomes contain mobile copied elements called transposable elements. In some species, these elements account for the majority of the genome sequence. They have been subject to many mutations and other genomic events (copies, deletions, captures) during transposition. The identification of these transformations remains a difficult issue. The study of families of transposable elements is generally founded on a multiple alignment of their sequences, a critical step that is adapted to transposons containing mostly localized nucleotide mutations. Many transposons that have lost their protein-coding capacity have undergone more complex rearrangements, which call for more complex methods to characterize the architecture of sequence variations. Results: In this study, we introduce the concept of a transposable element module, a flexible motif present in at least two sequences of a family of transposable elements and built on a succession of maximal repeats. The paper proposes an assembly method that works on the exact maximal repeats of a set of sequences to create such modules. It results in a graphical view of sequences segmented into modules, a representation that allows a flexible analysis of the transformations that have occurred between them. As a demonstration data set, we have chosen an in-depth analysis of the transposable element Foldback in Drosophila melanogaster. Comparison with multiple alignment methods shows that our method is more sensitive for highly variable sequences. The study of this family and of two other families, AtREP21 and SIDER2, reveals new copies of very different sizes and various combinations of modules, which shows the potential of our method. Conclusions: ModuleOrganizer is available on the Genouest bioinformatics center at http://moduleorganizer.genouest.org

    A graphical simulation model of the entire DNA process associated with the analysis of short tandem repeat loci

    The use of expert systems to interpret short tandem repeat DNA profiles in forensic, medical and ancient DNA applications is becoming increasingly prevalent as high-throughput analytical systems generate large amounts of data that are time-consuming to process. With special reference to low copy number (LCN) applications, we use a graphical model to simulate stochastic variation associated with the entire DNA process, starting with extraction of the sample, followed by the processing associated with the preparation of a PCR reaction mixture and PCR itself. Each part of the process is modelled with input efficiency parameters. Then, the key output parameters that define the characteristics of a DNA profile are derived, namely heterozygote balance (Hb) and the probability of allelic drop-out p(D). The model can be used to estimate the unknown efficiency parameters, such as π(extraction). ‘What-if’ scenarios can be used to improve and optimize the entire process: for example, the improvement to a given DNA profile expected from increasing the aliquot forwarded to PCR can be reliably predicted. We demonstrate that Hb and drop-out are mainly a function of the stochastic effect of pre-PCR molecular selection. Whole genome amplification is unlikely to give any benefit over conventional PCR for LCN.
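The pre-PCR part of such a simulation can be sketched as two rounds of binomial sampling per allele: molecules surviving extraction, then molecules picked up in the aliquot forwarded to PCR. The sketch below is only an illustration of that idea, not the model from the paper. The function names, the definition of Hb as the smaller count divided by the larger, and the definition of drop-out as "at least one allele has no template molecule" are all assumptions, and PCR amplification itself is not simulated.

```python
import random


def binomial(n, p, rng):
    """Number of successes in n independent trials with success probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_pre_pcr(n_copies, extraction_eff, aliquot_frac, trials=2000, seed=0):
    """Estimate drop-out probability and mean heterozygote balance.

    n_copies       : starting copies of each of the two alleles (assumed equal)
    extraction_eff : probability that a molecule survives extraction
    aliquot_frac   : probability that an extracted molecule enters the PCR
    Returns (p_dropout, mean_hb); mean_hb is None if every trial dropped out.
    """
    rng = random.Random(seed)
    dropouts = 0
    balances = []
    for _ in range(trials):
        counts = []
        for _allele in range(2):
            extracted = binomial(n_copies, extraction_eff, rng)
            counts.append(binomial(extracted, aliquot_frac, rng))
        if min(counts) == 0:
            # Assumed drop-out definition: an allele with no template left.
            dropouts += 1
        else:
            # Assumed Hb definition: smaller count over larger count.
            balances.append(min(counts) / max(counts))
    mean_hb = sum(balances) / len(balances) if balances else None
    return dropouts / trials, mean_hb
```

A "what-if" comparison then amounts to calling `simulate_pre_pcr` twice with different `aliquot_frac` values and comparing the two estimated drop-out probabilities.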

    The shape of human gene family phylogenies

    BACKGROUND: The shape of phylogenetic trees has been used to make inferences about the evolutionary process by comparing the shapes of actual phylogenies with those expected under simple models of the speciation process. Previous studies have focused on speciation events, but gene duplication is another lineage-splitting event, analogous to speciation, and gene loss or deletion is analogous to extinction. Measures of the shape of gene family phylogenies can thus be used to investigate the processes of gene duplication and loss. We make the first systematic attempt to use tree shape to study gene duplication using human gene phylogenies. RESULTS: We find that gene duplication has produced gene family trees significantly less balanced than expected from a simple model of the process, and less balanced than species phylogenies: the opposite of what might be expected under the 2R hypothesis. CONCLUSION: While other explanations are plausible, we suggest that the greater imbalance of gene family trees than species trees is due to the prevalence of tandem duplications over regional duplications during the evolution of the human genome.
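One common way to quantify how balanced a tree is, is the Colless index: the sum, over all internal nodes, of the absolute difference between the number of leaves on the two sides. The abstract does not say which shape statistic the authors used, so the following is only a minimal sketch of this one measure, assuming strictly binary trees written as nested 2-tuples with any non-tuple value as a leaf.

```python
def colless(tree):
    """Colless imbalance index of a binary tree given as nested 2-tuples.

    Leaves are any non-tuple objects (e.g. gene names). Returns 0 for a
    perfectly balanced tree; larger values mean a more imbalanced tree.
    """
    def walk(node):
        # Returns (number of leaves below node, imbalance accumulated below node).
        if not isinstance(node, tuple):
            return 1, 0
        left, right = node
        n_left, imb_left = walk(left)
        n_right, imb_right = walk(right)
        return n_left + n_right, imb_left + imb_right + abs(n_left - n_right)

    return walk(tree)[1]
```

For instance, the balanced four-leaf tree `(("a", "b"), ("c", "d"))` scores 0, while the fully ladder-like tree `((("a", "b"), "c"), "d")` scores 3, the kind of shape a run of successive duplications on one lineage would produce.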

    Characterizing VNTRs in human populations

    Over half the human genome consists of repetitive sequences. One major class is the tandem repeats (TRs), which are defined by their location in the genome, repeat unit, and copy number. TR loci that exhibit variant copy numbers are called Variable Number Tandem Repeats (VNTRs). High VNTR mutation rates of approximately 0.0001 per generation make them suitable for forensic studies, and of interest for potential roles in gene regulation and disease. TRs are generally divided into three classes: 1) microsatellites or short tandem repeats (STRs) with patterns 100 bp. To date, mini- and macrosatellites have been poorly characterized, mainly due to a lack of computational tools. In this thesis, I use a tool, VNTRseek, to identify human minisatellite VNTRs using short-read sequencing data from nearly 2,800 individuals, and I develop a new computational tool, MaSUD, to identify human macrosatellite VNTRs using data from 2,504 individuals. MaSUD is the first high-throughput tool to genotype macrosatellites using short reads. I identified over 35,000 minisatellite VNTRs and over 4,000 macrosatellite VNTRs, most previously unknown. A small subset in each VNTR class was validated experimentally and in silico. The detected VNTRs were further studied for their effects on gene expression, ability to distinguish human populations, and functional enrichment. Unlike STRs, mini- and macrosatellite VNTRs are enriched in regions with functional importance, e.g., introns, promoters, and transcription factor binding sites. A study of VNTRs across 26 populations shows that minisatellite VNTR genotypes can be used to predict super-populations with >90% accuracy. In addition, genotypes for 195 minisatellite VNTRs and 22 macrosatellite VNTRs were shown to be associated with differential expression in nearby genes (eQTLs). Finally, I developed a computational tool, mlZ, to infer undetected VNTR alleles and to detect false positive predictions. mlZ is applicable to other tools that use read support for predicting short variants. Overall, these studies provide the most comprehensive analysis of mini- and macrosatellites in human populations and will facilitate the application of VNTRs for clinical purposes.