64 research outputs found
Fast index based algorithms and software for matching position specific scoring matrices
BACKGROUND: In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. RESULTS: We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. 
CONCLUSION: Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than $|\mathcal{A}|^m + m - 1$, where $m$ is the length of the PSSM and $\mathcal{A}$ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
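To make the matching task in this abstract concrete, the following is a minimal sketch of scanning a sequence with a PSSM against a score threshold. It is a plain sliding-window scan with early termination (a window is abandoned once even the best possible remaining columns cannot reach the threshold), not the enhanced-suffix-array algorithm of ESAsearch. The matrix values, the function names, and the threshold are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# A PSSM is modeled here as one dict per position: symbol -> score.
PSSM = List[Dict[str, float]]


def max_suffix_scores(pssm: PSSM) -> List[float]:
    """best[i] is the maximum score attainable from columns i..m-1."""
    best = [0.0] * (len(pssm) + 1)
    for i in range(len(pssm) - 1, -1, -1):
        best[i] = best[i + 1] + max(pssm[i].values())
    return best


def scan(pssm: PSSM, seq: str, threshold: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every window scoring at least threshold."""
    m = len(pssm)
    best = max_suffix_scores(pssm)
    hits = []
    for start in range(len(seq) - m + 1):
        score = 0.0
        for i in range(m):
            score += pssm[i][seq[start + i]]
            # Stop early if the threshold is out of reach for this window.
            if score + best[i + 1] < threshold:
                break
        else:
            # The loop finished without a break, so the window is a match.
            hits.append((start, score))
    return hits


# Hypothetical toy matrix of length 3 over the DNA alphabet.
toy_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": 0},
]
toy_hits = scan(toy_pssm, "TACGACTA", threshold=4)
```

The scan visits every window, so its cost grows linearly with the sequence length at best; the index-based approach described in the abstract aims to avoid exactly that.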
Efficient and accurate P-value computation for Position Weight Matrices
BACKGROUND: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. Using PWMs requires knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time. RESULTS: The contribution of this paper is twofold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software tool called TFM-PVALUE, which is freely available. CONCLUSION: We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
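The two problems named in this abstract (score to P-value, and P-value to score threshold) can be illustrated with the basic dynamic-programming approach that the abstract attributes to existing methods. The sketch below assumes integer matrix entries and an independent, identically distributed background model; it is not the refinement algorithm of TFM-PVALUE, and the matrix, the background frequencies, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def score_distribution(matrix: List[Dict[str, int]],
                       background: Dict[str, float]) -> Dict[int, float]:
    """Exact distribution of the total score under an i.i.d. background."""
    dist = {0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for symbol, weight in column.items():
                nxt[score + weight] += prob * background[symbol]
        dist = dict(nxt)
    return dist


def p_value(dist: Dict[int, float], score: int) -> float:
    """Probability that a background word scores at least `score`."""
    return sum(p for s, p in dist.items() if s >= score)


def threshold_for(dist: Dict[int, float], target_p: float) -> Optional[int]:
    """Lowest attainable score whose P-value does not exceed target_p."""
    tail = 0.0
    best = None
    for s in sorted(dist, reverse=True):
        tail += dist[s]
        if tail > target_p:
            break
        best = s
    return best


# Hypothetical two-column integer matrix and uniform DNA background.
toy_matrix = [
    {"A": 2, "C": 0, "G": 0, "T": -1},
    {"A": 0, "C": 3, "G": -1, "T": -1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_dist = score_distribution(toy_matrix, uniform)
```

With real-valued matrices the number of distinct scores can grow very quickly, which is one reason a discretization scheme such as the one the abstract describes becomes attractive.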
Statistical significance of cis-regulatory modules
BACKGROUND: It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning. RESULTS: We introduce methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization. In order to determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method of estimating the statistical significance of single site matches using a database of known promoters to produce data structures that can be used to estimate p-values for binding site matches. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a max-gap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimation of statistical significance of cis-regulatory module sites. CONCLUSION: The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cis-regulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software
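The abstract does not spell out the max-gap model, so the sketch below assumes the common reading: predicted binding sites belong to the same candidate module as long as consecutive sites are separated by no more than a fixed maximum gap. It covers only the grouping step, not the statistical significance calculation, and the function name, parameters, and positions are hypothetical.

```python
from typing import List


def max_gap_clusters(positions: List[int], max_gap: int,
                     min_sites: int = 2) -> List[List[int]]:
    """Group site positions into clusters whose adjacent gaps are <= max_gap."""
    clusters: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current cluster.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    # Flush the last open cluster.
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters


# Hypothetical match positions along a genomic region.
toy_clusters = max_gap_clusters([100, 130, 150, 400, 420, 900], max_gap=50)
```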
Reliable transfer of transcriptional gene regulatory networks between taxonomically related organisms
Baumbach J, Rahmann S, Tauch A. Reliable transfer of transcriptional gene regulatory networks between taxonomically related organisms. BMC Systems Biology. 2009;3(1):8. Background: Transcriptional regulation of gene activity is essential for any living organism. Transcription factors therefore recognize specific binding sites within the DNA to regulate the expression of particular target genes. The genome-scale reconstruction of the emerging regulatory networks is important for biotechnology and human medicine but cost-intensive, time-consuming, and impossible to perform for any species separately. By using bioinformatics methods one can partially transfer networks from well-studied model organisms to closely related species. However, the prediction quality is limited by the low level of evolutionary conservation of the transcription factor binding sites, even within organisms of the same genus. Results: Here we present an integrated bioinformatics workflow that assures the reliability of transferred gene regulatory networks. Our approach combines three methods that can be applied on a large scale: re-assessment of annotated binding sites, subsequent binding site prediction, and homology detection. A gene regulatory interaction is considered to be conserved if (1) the transcription factor, (2) the adjusted binding site, and (3) the target gene are conserved. The power of the approach is demonstrated by transferring gene regulations from the model organism Corynebacterium glutamicum to the human pathogens C. diphtheriae, C. jeikeium, and the biotechnologically relevant C. efficiens. For these three organisms we identified reliable transcriptional regulations for ~40% of the common transcription factors, compared to ~5% for which knowledge was available before.
Conclusion: Our results suggest that trustworthy genome-scale transfer of gene regulatory networks between organisms is feasible in general but still limited by the level of evolutionary conservation
A Comparative Chemogenomics Strategy to Predict Potential Drug Targets in the Metazoan Pathogen, Schistosoma mansoni
Schistosomiasis is a prevalent and chronic helminthic disease in tropical regions. Treatment and control rely on chemotherapy with just one drug, praziquantel, and this reliance is of concern should clinically relevant drug resistance emerge and spread. Therefore, to identify potential target proteins for new avenues of drug discovery we have taken a comparative chemogenomics approach utilizing the putative proteome of Schistosoma mansoni compared to the proteomes of two model organisms, the nematode, Caenorhabditis elegans and the fruitfly, Drosophila melanogaster. Using the genome comparison software Genlight, two separate in silico workflows were implemented to derive a set of parasite proteins for which gene disruption of the orthologs in both the model organisms yielded deleterious phenotypes (e.g., lethal, impairment of motility), i.e., are essential genes/proteins. Of the 67 and 68 sequences generated for each workflow, 63 were identical in both sets, leading to a final set of 72 parasite proteins. All but one of these were expressed in the relevant developmental stages of the parasite infecting humans. Subsequent in-depth manual curation of the combined workflow output revealed 57 candidate proteins. Scrutiny of these for ‘druggable’ protein homologs in the literature identified 35 S. mansoni sequences, 18 of which were homologous to proteins with 3D structures including co-crystallized ligands that will allow further structure-based drug design studies. The comparative chemogenomics strategy presented generates a tractable set of S. mansoni proteins for experimental validation as drug targets against this insidious human pathogen.
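The figure of 72 proteins is consistent with taking the union of the two workflow outputs. Writing $W_1$ and $W_2$ for the two sets (symbols introduced here for illustration only, and assuming the final set is indeed the union), the count follows from inclusion-exclusion:

```latex
|W_1 \cup W_2| = |W_1| + |W_2| - |W_1 \cap W_2| = 67 + 68 - 63 = 72
```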
Mycolactone Gene Expression Is Controlled by Strong SigA-Like Promoters with Utility in Studies of Mycobacterium ulcerans and Buruli Ulcer
Mycolactone A/B is a lipophilic macrocyclic polyketide that is the primary virulence factor produced by Mycobacterium ulcerans, a human pathogen and the causative agent of Buruli ulcer. In M. ulcerans strain Agy99 the mycolactone polyketide synthase (PKS) locus spans a 120 kb region of a 174 kb megaplasmid. Here we have identified promoter regions of this PKS locus using GFP reporter assays, in silico analysis, primer extension, and site-directed mutagenesis. Transcription of the large PKS genes mlsA1 (51 kb), mlsA2 (7 kb) and mlsB (42 kb) is driven by a novel and powerful SigA-like promoter sequence situated 533 bp upstream of both the mlsA1 and mlsB initiation codons, which is also functional in Escherichia coli, Mycobacterium smegmatis and Mycobacterium marinum. Promoter regions were also identified upstream of the putative mycolactone accessory genes mup045 and mup053. We transformed M. ulcerans with a GFP-reporter plasmid under the control of the mls promoter to produce a highly green-fluorescent bacterium. The strain remained virulent, producing both GFP and mycolactone and causing ulcerative disease in mice. Mosquitoes have been proposed as a potential vector of M. ulcerans so we utilized M. ulcerans-GFP in microcosm feeding experiments with captured mosquito larvae. M. ulcerans-GFP accumulated within the mouth and midgut of the insect over four instars, whereas the closely related, non-mycolactone-producing species M. marinum harbouring the same GFP reporter system did not. This is the first report to identify M. ulcerans toxin gene promoters, and we have used our findings to develop M. ulcerans-GFP, a strain in which fluorescence and toxin gene expression are linked, thus providing a tool for studying Buruli ulcer pathogenesis and potential transmission to humans
RNA Interference in Schistosoma mansoni Schistosomula: Selectivity, Sensitivity and Operation for Larger-Scale Screening
RNA interference (RNAi) is a technique to selectively suppress mRNA of individual genes and, consequently, their cognate proteins. RNAi using double-stranded (ds) RNA has been used to interrogate the function of mainly single genes in the flatworm, Schistosoma mansoni, one of a number of schistosome species causing schistosomiasis. In consideration of large-scale screens to identify candidate drug targets, we examined the selectivity and sensitivity (the degree of suppression) of RNAi for 11 genes produced in different tissues of the parasite: the gut, tegument (surface) and otherwise. We used the schistosomulum stage prepared from infective cercariae larvae which are accessible in large numbers and adaptable to automated screening platforms. We found that RNAi suppresses transcripts selectively; however, the sensitivity of suppression varies (40% to >75%). No obvious changes in the parasite occurred post-RNAi, including after targeting the mRNA of genes that had been computationally predicted to be essential for survival. Additionally, we defined operational parameters to facilitate large-scale RNAi, including choice of culture medium, transfection strategy to deliver dsRNA, dose- and time-dependency, and dosing limits. Finally, using fluorescent probes, we show that the developing gut allows rapid entrance of dsRNA into the parasite to initiate RNAi.
Critical Assessment of Metagenome Interpretation: A benchmark of metagenomics software
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from newly sequenced ~700 microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.
The complete genome sequence of Corynebacterium pseudotuberculosis FRC41 isolated from a 12-year-old girl with necrotizing lymphadenitis reveals insights into gene-regulatory networks contributing to virulence
Trost E, Ott L, Schneider J, et al. The complete genome sequence of Corynebacterium pseudotuberculosis FRC41 isolated from a 12-year-old girl with necrotizing lymphadenitis reveals insights into gene-regulatory networks contributing to virulence. BMC Genomics. 2010;11(1): 728
- …