67 research outputs found

    CRAC: an integrated approach to the analysis of RNA-seq reads

    A large number of RNA-sequencing studies set out to predict mutations, splice junctions or fusion RNAs. We propose a method, CRAC, that integrates genomic locations and local coverage to enable such predictions to be made directly from RNA-seq read analysis. A k-mer profiling approach detects candidate mutations, indels and splice or chimeric junctions in each single read. CRAC increases precision compared with existing tools, reaching 99.5% for splice junctions, without losing sensitivity. Importantly, CRAC predictions improve with read length. In cancer libraries, CRAC recovered 74% of validated fusion RNAs and predicted novel recurrent chimeric junctions. CRAC is available at http://crac.gforge.inria.fr

    A Fast and Specific Alignment Method for Minisatellite Maps

    Background: Variable minisatellites count among the most polymorphic markers of eukaryotic and prokaryotic genomes. This variability can affect gene coding regions, as in the prion protein gene, or gene regulation regions, as for the cystatin B gene, and be associated with or implicated in diseases: Creutzfeldt-Jakob disease and myoclonus epilepsy type 1, for our examples. When it affects neutrally evolving regions, the polymorphism in length (i.e. in number of copies) of minisatellites has proved useful in population genetics. Motivation: In these tandem repeat sequences, different mutational mechanisms cause the number of copies, as well as the copies themselves, to vary. In particular, the interspersion of tandem duplication/contraction events and point mutations makes the succession of variant repeats much more informative than the allele length alone. Exploiting this information requires the ability to align minisatellite alleles by accounting for both point mutations and tandem duplications. Results: We propose a minisatellite map alignment program that improves on previous solutions. Our new program is faster, simpler, considers an extended evolutionary model, and is available to the community. We test it on the data set of 609 alleles of the MSY1 (DYF155S1) human minisatellite and confirm its ability to recover known evolutionary signals. Our experiments highlight that the informativeness of minisatellites resides in their length and composition polymorphisms. Exploiting both simultaneously is critical to unravel the implications of variable minisatellites in the control of gene expression and diseases. Availability: Software is available at http://atgc.lirmm.fr/ms_align/ Keywords: VNTR, tandem repeat, tandem duplication, variable costs, dynamic programming, sequence comparison

    Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity

    Ultra high-throughput sequencing is used to analyse the transcriptome or interactome at unprecedented depth on a genome-wide scale. These techniques yield short sequence reads that are then mapped on a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as background distribution, sequence errors, and read length impact on the prediction capacity of sequence census experiments. Here we suggest a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. For instance, by analysing chromatin immunoprecipitation read sets, we estimate that 4.6% of reads are affected by SNPs. We show that, although the nucleotide error probability is low, it significantly increases with the position in the sequence. Choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions, while above 20 bp the number of uniquely mapped reads decreases. With our procedure, we obtain 0.6% false positives among genomic locations. Hence, even rare signatures should identify biologically relevant regions, if they are mapped on the genome. This indicates that digital transcriptomics may help to characterize the wealth of yet undiscovered, low-abundance transcripts

    Querying large read collections in main memory: a versatile data structure

    Background: High Throughput Sequencing (HTS) is now heavily exploited for genome (re-)sequencing, metagenomics, epigenomics, and transcriptomics, and requires different but computationally intensive bioinformatic analyses. When a reference genome is available, mapping reads on it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads using hash tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires determining the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. Currently, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence throughput calls for new algorithmic solutions to query large read collections efficiently. Results: Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq). Conclusions: Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible brick to design innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under Cecill (GPL compliant) license from http://www.atgc-montpellier.fr/ngs/.
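The query quoted in this abstract ("given a k-mer, get the reads containing this k-mer") can be illustrated with an ordinary dictionary index. This is a minimal sketch only: it is not the Gk arrays structure and makes no attempt to match its memory footprint. The function names `build_kmer_index` and `reads_containing` and the toy reads are hypothetical, introduced purely for illustration.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to {read_id: number of occurrences in that read}."""
    index = defaultdict(dict)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            index[kmer][read_id] = index[kmer].get(read_id, 0) + 1
    return index


def reads_containing(index, kmer, exactly_once=False):
    """Return sorted ids of reads containing kmer (at least once, or exactly once)."""
    hits = index.get(kmer, {})
    if exactly_once:
        return sorted(r for r, count in hits.items() if count == 1)
    return sorted(hits)


# Toy read collection (illustrative data, not from the paper).
reads = ["ACGTAC", "TTACGT", "ACACAC"]
index = build_kmer_index(reads, 2)
print(reads_containing(index, "AC"))
print(reads_containing(index, "AC", exactly_once=True))
```

A plain hash table like this stores every k-mer string explicitly, which is the kind of memory cost a dedicated read index is meant to reduce.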

    Régimes en densité et détachement dans le plasma de bord d’ITER (Density regimes and detachment in the ITER edge plasma)

    The properties and behavior of the plasma edge in a nuclear fusion reactor are critical for its performance and operation. Just outside the boundary of the confined region where fusion occurs, a layer forms which interacts with the vacuum vessel walls. While most of the plasma falls on the so-called "divertor" section of the machine, specially designed to sustain high power fluxes, some of it can still interact with the rest of the vacuum vessel (the "First Wall"). Plasma particles impacting surfaces can erode materials, producing impurities that can contaminate the core plasma, degrade fusion performance, and shorten the components' lifetime. To optimize operations, one needs quantitative assessments of the plasma fluxes and conditions at surfaces for different plasma scenarios. As the phenomena involved in the edge plasma are highly diverse and non-linear, numerical simulation tools are required. In this thesis, we use the SOLEDGE3X code to simulate the behavior of the ITER Scrape-Off Layer (SOL) in the first non-active operation phase (PFPO-1), with a fluid model for the plasma coupled to a kinetic model for neutrals using the EIRENE code. This study specifically focuses on detachment in the divertor and the plasma fluxes and conditions at the beryllium First Wall. Due to the conditions expected in the ITER machine, the physics model in SOLEDGE3X had to be improved to better describe plasma-surface and plasma-neutral interactions in the divertor. The increased complexity of the new model also mandated several improvements to the numerical scheme. Then, following the new ability of the code to deal with ITER cases, a scan over a range of gas injection rates is performed to study the mechanisms at play in plasma detachment. More specifically, the contributions of the plasma-neutral interaction processes are analyzed for the attached/high-recycling, rollover, and partial detachment regimes. Further analysis of the SOL width $\lambda_q$ indicates that it varies with the divertor regime, giving guidelines on the procedure to set up perpendicular transport coefficients for high-density regime simulations. The contributions of neutrals and ions in the assessed gross beryllium erosion rates are analyzed, including their energy spectra. Then, since observations in current machines point to the possible formation of density shoulders in the far-SOL of ITER, a sensitivity study of the results is carried out by increasing the far-SOL perpendicular transport coefficients in the simulations. The impacts on the divertor regime, conditions, fluxes, and gross erosion rate at the first wall are assessed. Finally, potential future steps to improve simulations are proposed, along with the initial simulation results of a Q=10 plasma with Neon and Helium impurities.

    Hardness results for the center and median string problems under the weighted and unweighted edit distances

    Given a finite set of strings, the MEDIAN STRING problem consists in finding a string that minimizes the sum of the edit distances to the strings in the set. Approximations of the median string are used in a very broad range of applications where one needs a representative string that summarizes the information common to the strings of the set. This is the case in classification, in speech and pattern recognition, and in computational biology. In the latter, MEDIAN STRING is related to the key problem of multiple alignment. In the recent literature, one finds a theorem stating the NP-completeness of MEDIAN STRING for unbounded alphabets. However, in the above-mentioned areas, the alphabet is often finite. Thus, it remains a crucial question whether the MEDIAN STRING problem is NP-complete for bounded and even binary alphabets. In this work, we provide an answer to this question and also give the complexity of the related CENTER STRING problem. Moreover, we study the parameterized complexity of both problems with respect to the number of input strings. In addition, we provide an algorithm to compute an optimal center under a weighted edit distance in polynomial time when the number of input strings is fixed.
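The two objectives named in this abstract can be made concrete on a tiny instance. The sketch below assumes the unweighted (unit-cost) edit distance and searches exhaustively over candidate strings up to a chosen maximum length; both the length cap and the function names `edit_distance` and `brute_force` are assumptions made for illustration, not the authors' algorithm. The search is exponential in the candidate length, so it is only usable on toy inputs.

```python
from itertools import product


def edit_distance(a, b):
    """Unit-cost Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def brute_force(strings, alphabet, max_len, objective):
    """Search all strings of length <= max_len for one minimizing the objective.

    objective=sum gives the median string criterion,
    objective=max gives the center string criterion.
    """
    best, best_cost = None, None
    for n in range(max_len + 1):
        for cand in map("".join, product(alphabet, repeat=n)):
            cost = objective(edit_distance(cand, s) for s in strings)
            if best_cost is None or cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost


# Toy instance over a binary alphabet (illustrative data).
strings = ["ab", "abb", "bb"]
median, median_cost = brute_force(strings, "ab", 3, sum)
center, center_cost = brute_force(strings, "ab", 3, max)
print(median, median_cost)
print(center, center_cost)
```

Passing `sum` or `max` as the objective is the only difference between the two problems in this sketch, which mirrors how the abstract defines the median string by the sum of distances.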

    Longest Common Subsequence Problem for Unoriented and Cyclic Strings

    Article submitted to SPIRE'04.

    Hardness of optimal spaced seed design

    Speeding up approximate pattern matching has been a line of research in stringology since the 1980s. Practically fast approaches belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are first excluded, and the remaining regions are then compared to the pattern by dynamic programming. Among the conditions used to test similarity between the regions and the pattern, many require a minimum number of common substrings between them. When only substitutions are taken into account for measuring dissimilarity, counting spaced subwords instead of substrings improves the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called spaced seeds (or gapped seeds), for the subwords, depending on the search parameters. Two distinct lines of research appear in the literature: one with probabilistic formulations of seed design problems, in which one wishes for instance to compute a seed with the highest probability to detect the desired similarities (lossy filtration), and a second with combinatorial formulations, where the goal is to find a seed that detects all or a maximum number of similarities (both lossless and lossy filtration). We concentrate on combinatorial seed design problems and consider formulations in which the set of sought similarities is either listed explicitly (RSOS), or characterised by their length and maximal number of mismatches (NON-DETECTION). Several articles exhibit exponential algorithms for these problems. In this work, we provide hardness and inapproximability results for several seed design problems, thereby justifying the complexity of known algorithms. Moreover, we introduce a new formulation of seed design (MWLS), in which the weight of the seed has to be maximised, and show it is as difficult to approximate as MAXIMUM INDEPENDENT SET.
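How a spaced seed can detect a similarity that a contiguous seed of the same weight misses can be sketched as follows. The encoding is an assumption made for illustration: a seed is written with `#` for must-match positions and `-` for don't-care positions, and a similarity (substitutions only) is written as a string of `1` for a match and `0` for a mismatch. The function names `seed_weight` and `detects` are hypothetical.

```python
def seed_weight(seed):
    """Number of must-match positions ('#') in the seed."""
    return seed.count("#")


def detects(seed, similarity):
    """True if the seed hits the similarity at some offset.

    The seed hits at an offset when every '#' position of the seed
    falls on a '1' (match) of the similarity.
    """
    span = len(seed)
    for offset in range(len(similarity) - span + 1):
        if all(similarity[offset + i] == "1"
               for i, c in enumerate(seed) if c == "#"):
            return True
    return False


# Illustrative similarity with a mismatch at every third position.
similarity = "1101101101"
print(seed_weight("###"), detects("###", similarity))    # contiguous seed
print(seed_weight("##-#"), detects("##-#", similarity))  # spaced seed
```

Checking one seed against one similarity is cheap; the design problems discussed in the abstract concern choosing the seed itself so that it detects all, or as many as possible, of a whole set of similarities.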

    Complexities of the centre and median string problems

    Given a finite set of strings, the median string problem consists in finding a string that minimizes the sum of the distances to the strings in the set. Approximations of the median string are used in a very broad range of applications where one needs a representative string that summarizes the information common to the strings of the set. This is the case in classification, in speech and pattern recognition, and in computational biology. In the latter, median string is related to the key problem of multiple alignment. In the recent literature, one finds a theorem stating the NP-completeness of the median string problem for unbounded alphabets. However, in the above-mentioned areas, the alphabet is often finite. Thus, it remains a crucial question whether the median string problem is NP-complete for finite and even binary alphabets. In this work, we provide an answer to this question and also give the complexity of the related centre string problem. Moreover, we study the parameterized complexity of both problems with respect to the number of input strings.