832 research outputs found

    DNA Motif Match Statistics Without Poisson Approximation

    Transcription factors (TFs) play a crucial role in gene regulation by binding to specific regulatory sequences. The sequence motifs recognized by a TF can be described in terms of position frequency matrices. Searching for motif matches with a given position frequency matrix is achieved by employing a predefined score cutoff and subsequently counting the number of matches above this cutoff. In this article, we approximate the distribution of the number of motif matches based on a novel dynamic programming approach, which accounts for higher order sequence background (e.g., as is characteristic for CpG islands) and overlapping motif matches on both DNA strands. A comparison with our previously published compound Poisson approximation and a binomial approximation demonstrates that in particular for relaxed score thresholds, the dynamic programming approach yields more accurate results
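The match-counting step this abstract describes (score every window against the matrix, keep windows at or above a cutoff, scan both strands) can be sketched as below. This is only an illustration of the counting, not the paper's dynamic programming approximation of the count distribution. The log-odds conversion, the pseudocount, the uniform background and all function names are assumptions made for the example.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def pfm_to_log_odds(pfm, pseudocount=1.0, background=0.25):
    """Convert a position frequency matrix (one [A, C, G, T] count row per
    motif position) into log-odds scores against a uniform background.
    The pseudocount and background are illustrative assumptions."""
    pwm = []
    for column in pfm:
        total = sum(column) + 4 * pseudocount
        pwm.append([math.log(((c + pseudocount) / total) / background)
                    for c in column])
    return pwm


def window_score(pwm, window):
    """Sum the per-position log-odds scores of one sequence window."""
    return sum(pwm[i][BASES.index(base)] for i, base in enumerate(window))


def count_matches(pwm, seq, cutoff):
    """Count windows scoring >= cutoff on the forward strand and on the
    reverse complement. Overlapping matches are all counted."""
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    width = len(pwm)
    hits = 0
    for strand in (seq, reverse_complement):
        for start in range(len(strand) - width + 1):
            if window_score(pwm, strand[start:start + width]) >= cutoff:
                hits += 1
    return hits


# Hypothetical motif that strongly prefers "ACG".
example_pwm = pfm_to_log_odds([[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0]])
print(count_matches(example_pwm, "ACGAA", cutoff=3.0))
```

Lowering the cutoff admits more, and more heavily overlapping, matches. That is the regime in which the abstract reports the dynamic programming approach being more accurate than the compound Poisson and binomial approximations.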

    Statistical detection of cooperative transcription factors with similarity adjustment

    Motivation: Statistical assessment of cis-regulatory modules (CRMs) is a crucial task in computational biology. Usually, one concludes from exceptional co-occurrences of DNA motifs that the corresponding transcription factors (TFs) are cooperative. However, similar DNA motifs tend to co-occur in random sequences due to high probability of overlapping occurrences. Therefore, it is important to consider similarity of DNA motifs in the statistical assessment

    Poisson approximation for search of rare words in DNA sequences

    Using recent results on the occurrence times of a string of symbols in a stochastic process with mixing properties, we present a new method for the search of rare words in biological sequences generally modelled by a Markov chain. We obtain a bound on the error between the distribution of the number of occurrences of a word in a sequence (under a Markov model) and its Poisson approximation. A global bound is already given by a Chen-Stein method. Our approach, the psi-mixing method, gives local bounds. Since we only need the error in the tails of the distribution, the global uniform bound of Chen-Stein is too large, and it is better to consider local bounds. We search for two thresholds on the number of occurrences beyond which the studied word can be regarded as over-represented or under-represented. A biological role is suggested for these over- or under-represented words. Our method gives such thresholds for a much broader panel of words than the Chen-Stein method. Comparing the methods, we observe better accuracy for the psi-mixing method in bounding the tails of the distribution. We also present the software PANOW (available at http://stat.genopole.cnrs.fr/software/panowdir/), dedicated to the computation of the error term and the thresholds for a studied word. Comment: 29 pages, 0 figures.
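As a rough illustration of the Poisson approximation this abstract builds on, the sketch below computes a word's expected count under a first-order Markov model and the Poisson upper-tail probability for an observed count. It does not compute the psi-mixing or Chen-Stein error bounds and is unrelated to the PANOW implementation. The function names and the uniform example model are assumptions.

```python
import math


def word_probability(word, stationary, transition):
    """Probability of seeing `word` at a fixed position under a first-order
    Markov chain started from its stationary distribution."""
    p = stationary[word[0]]
    for prev, nxt in zip(word, word[1:]):
        p *= transition[prev][nxt]
    return p


def expected_count(seq_length, word, stationary, transition):
    """Expected number of (possibly overlapping) occurrences of `word`
    in a sequence of length `seq_length`."""
    positions = seq_length - len(word) + 1
    return positions * word_probability(word, stationary, transition)


def poisson_upper_tail(k, lam):
    """P(N >= k) for N ~ Poisson(lam), via the complement of the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(N = 0)
    cdf = term
    for i in range(1, k):      # accumulate P(N = 1) .. P(N = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


# Hypothetical uniform model over A, C, G, T.
uniform = {b: 0.25 for b in "ACGT"}
uniform_transition = {a: dict(uniform) for a in "ACGT"}
lam = expected_count(1000, "ACGT", uniform, uniform_transition)
print(lam, poisson_upper_tail(10, lam))
```

A small upper-tail probability would flag the word as a candidate for over-representation. How far that probability can be trusted is exactly what the error bounds discussed in the abstract address.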

    Natural similarity measures between position frequency matrices with an application to clustering

    Motivation: Transcription factors (TFs) play a key role in gene regulation by binding to target sequences. In silico prediction of potential binding of a TF to a binding site is a well-studied problem in computational biology. The binding sites for one TF are represented by a position frequency matrix (PFM). The discovery of new PFMs requires the comparison to known PFMs to avoid redundancies. In general, two PFMs are similar if they occur at overlapping positions under a null model. Still, most existing methods compute similarity according to probabilistic distances of the PFMs. Here we propose a natural similarity measure based on the asymptotic covariance between the number of PFM hits incorporating both strands. Furthermore, we introduce a second measure based on the same idea to cluster a set of the Jaspar PFMs. Results: We show that the asymptotic covariance can be efficiently computed by a two dimensional convolution of the score distributions. The asymptotic covariance approach shows strong correlation with simulated data. It outperforms three alternative methods. The Jaspar clustering yields distinct groups of TFs of the same class. Furthermore, a representative PFM is given for each class. In contrast to most other clustering methods, PFMs with low similarity automatically remain singletons. Availability: A website to compute the similarity and to perform clustering, the source code and Supplementary Material are available at http://mosta.molgen.mpg.d

    In silico analyses of maleidride biosynthetic gene clusters

    Maleidrides are a family of structurally related fungal natural products, many of which possess diverse, potent bioactivities. Previous identification of several maleidride biosynthetic gene clusters, and subsequent experimental work, has determined the ‘core’ set of genes required to construct the characteristic medium-sized alicyclic ring with maleic anhydride moieties. Through genome mining, this work has used these core genes to discover ten entirely novel putative maleidride biosynthetic gene clusters, amongst both publicly available genomes, and encoded within the genome of the previously un-sequenced epiheveadride producer Wicklowia aquatica CBS 125634. We have undertaken phylogenetic analyses and comparative bioinformatics on all known and putative maleidride biosynthetic gene clusters to gain further insights regarding these unique biosynthetic pathways. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s40694-022-00132-z

    Development of novel anticancer agents targeting G protein coupled receptor: GPR120

    The G-protein coupled receptor, GPR120, has ubiquitous expression and multifaceted roles in modulating metabolic and anti-inflammatory processes. GPR120 - also known as Free Fatty Acid Receptor 4 (FFAR4) - is classified as a free fatty acid receptor of the Class A GPCR family. GPR120 has recently been implicated as a novel target for cancer management. GPR120 gene knockdown in breast cancer studies revealed a role of GPR120-induced chemoresistance in epirubicin and cisplatin-induced DNA damage in tumour cells. Higher expression and activation levels of GPR120 are also reported to promote tumour angiogenesis and cell migration in colorectal cancer. A number of agonists targeting GPR120 have been reported, such as TUG891 and Compound39, but to date development of small-molecule inhibitors of GPR120 is limited. This research applied a rational drug discovery approach to discover and design novel anticancer agents targeting the GPR120 receptor. A homology model of GPR120 (short isoform) was generated to identify potential anticancer compounds using a combined in silico docking-based virtual screening (DBVS), molecular dynamics (MD) assisted pharmacophore screenings, structure–activity relationships (SAR) and in vitro screening approach. A pharmacophore hypothesis was derived from analysis of 300 ns all-atomic MD simulations on apo, TUG891-bound and Compound39-bound GPR120 (short isoform) receptor models and was used to screen for ligands interacting with Trp277 and Asn313 of GPR120. Comparative analysis of 100 ns all-atomic MD simulations of 9 selected compounds predicted the effects of ligand binding on the stability of the “ionic lock” – a characteristic of Class A GPCR activation and inactivation. The “ionic lock” between TM3(Arg136) and TM6(Asp) is known to prevent G-protein recruitment, while GPCR agonist binding is coupled to outward movement of TM6, breaking the “ionic lock”, which facilitates G-protein recruitment. 
The MD-assisted pharmacophore hypothesis predicted Cpd 9, (2-hydroxy-N-{4-[(6-hydroxy-2-methylpyrimidin-4-yl) amino] phenyl} benzamide) to act as a GPR120S antagonist which can be evaluated and characterised in future studies. Additionally, DBVS of a small molecule database (~350,000 synthetic chemical compounds) against the developed GPR120 (short isoform) model led to selection of 13 hit molecules, which were then tested in vitro to evaluate their cytotoxic, colony forming and cell migration activities against SW480 – a human CRC cell line expressing GPR120. Two of the DBVS hit molecules showed significant (> 90%) inhibitory effects on cell growth with micromolar affinities (at 100 μM) - AK-968/12713190 (dihydrospiro(benzo[h]quinazoline-5,1′-cyclopentane)-4(3H)-one) and AG-690/40104520 (fluoren-9-one). SAR analysis of these two test compounds led to the identification of more active compounds in cell-based cytotoxicity assays – AL-281/36997031 (IC50 = 5.89–6.715 μM), AL-281/36997034 (IC50 = 6.789 to 7.502 μM) and AP-845/40876799 (IC50 = 14.16-18.02 μM). In addition, AL-281/36997031 and AP-845/40876799 were found to be significantly target-specific during comparative cytotoxicity profiling in GPR120-silenced and GPR120-expressing SW480 cells. In wound healing assays, AL-281/36997031 was found to be the most active at 3 μM (IC25) and prevented cell migration. Likewise, in clonogenic assays assessing the ability of a single cell to survive and form colonies, AL-281/36997031 was found to be the most potent of all three test compounds, with a survival rate of ~ 30% at 3 μM. The inter-disciplinary approach applied in this work identified potential chemical scaffolds – spiral benzo-quinazoline and fluorenone – targeting GPR120, which can be further explored in designing anti-cancer drug development studies.

    Structural characterization and selective drug targeting of higher-order DNA G-quadruplex systems.

    There is now substantial evidence that guanine-rich regions of DNA form non-B DNA structures known as G-quadruplexes in cells. G-quadruplexes (G4s) are tetraplex DNA structures that form amid four runs of guanines which are stabilized via Hoogsteen hydrogen bonding to form stacked tetrads. DNA G4s have roles in key genomic functions such as regulating gene expression, replication, and telomere homeostasis. Because of their apparent role in disease, G4s are now viewed as important molecular targets for anticancer therapeutics. To date, the structures of many important G4 systems have been solved by NMR or X-ray crystallographic techniques. Small molecules developed to target these structures have shown promising results in treating cancer in vitro and in vivo; however, these compounds commonly lack the selectivity required for clinical success. There is now evidence that long single-stranded G-rich regions can stack or otherwise interact intramolecularly to form G4-multimers, opening a new avenue for rational drug design. For a variety of reasons, G4 multimers are not amenable to NMR or X-ray crystallography. In the current dissertation, I apply a variety of biophysical techniques in an integrative structural biology (ISB) approach to determine the primary conformation of two disputed higher-order G4 systems: (1) the extended human telomere G-quadruplex and (2) the G4-multimer formed within the human telomerase reverse transcriptase (hTERT) gene core promoter. Using the higher-order human telomere structure in virtual drug discovery approaches, I demonstrate that novel small molecule scaffolds can be identified which bind to this sequence in vitro. I subsequently summarize the current state of G-quadruplex focused virtual drug discovery in a review that highlights successes and pitfalls of in silico drug screens. 
I then present the results of a massive virtual drug discovery campaign targeting the hTERT core promoter G4 multimer and show that discovering selective small molecules that target its loops and grooves is feasible. Lastly, I demonstrate that one of these small molecules is effective in down-regulating hTERT transcription in breast cancer cells. Taken together, I present here a rigorous ISB platform that allows for the characterization of higher-order DNA G-quadruplex structures as unique targets for anticancer therapeutic discovery.

    Development of a data processing toolkit for the analysis of next-generation sequencing data generated using the primer ID approach

    Philosophiae Doctor - PhD
    Sequencing an HIV quasispecies with next generation sequencing technologies yields a dataset with significant amplification bias and errors resulting from both the PCR and sequencing steps. Both the amplification bias and sequencing error can be reduced by labelling each cDNA (generated during the reverse transcription of the viral RNA to DNA prior to PCR) with a random sequence tag called a Primer ID (PID). Processing PID data requires additional computational steps, presenting a barrier to the uptake of this method. MotifBinner is an R package designed to handle PID data with a focus on resolving potential problems in the dataset. MotifBinner groups sequences into bins by their PID tags, identifies and removes false unique bins produced by sequencing errors in the PID tags, and removes outlier sequences from within a bin. MotifBinner produces a consensus sequence for each bin, as well as a detailed report for the dataset, detailing the number of sequences per bin, the number of outlying sequences per bin, rates of chimerism, the number of degenerate letters in the final consensus sequences and the most divergent consensus sequences (potential contaminants). We characterized the ability of the PID approach to reduce the effect of sequencing error, to detect minority variants in viral quasispecies and to reduce the rates of PCR induced recombination. We produced reference samples with known variants at known frequencies to study the effectiveness of increasing PCR elongation time, decreasing the number of PCR cycles, and sample partitioning, by means of dPCR (droplet PCR), on PCR induced recombination. After sequencing these artificial samples with the PID approach, each consensus sequence was compared to the known variants. There are complex relationships between the sample preparation protocol and the characteristics of the resulting dataset. 
We produce a set of recommendations that can be used to choose the sample preparation that is most useful for the particular study. The AMP trial infuses HIV-negative patients with the VRC01 antibody and monitors for HIV infections. Accurately timing the infection event and reconstructing the founder viruses of these infections are critical for relating infection risk to antibody titer and homology between the founder virus and antibody binding sites. Dr. Paul Edlefsen at the Fred Hutch Cancer Research Institute developed a pipeline that performs infection timing and founder reconstruction. Here, we document a portion of the pipeline, produce detailed tests for that portion of the pipeline and investigate the robustness of some of the tools used in the pipeline to violations of their assumptions.
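The binning and consensus steps described in this abstract can be sketched in a few lines. MotifBinner itself is an R package, and none of the names below come from it. The fixed-length PID at the start of each read, the minimum bin size, the equal-length inserts and the tie-to-`N` rule are all simplifying assumptions for the example.

```python
from collections import Counter, defaultdict


def bin_by_pid(reads, pid_length):
    """Group reads by their Primer ID tag. Assumes (hypothetically) that the
    first `pid_length` bases of each read are the PID and the rest is the
    insert."""
    bins = defaultdict(list)
    for read in reads:
        bins[read[:pid_length]].append(read[pid_length:])
    return dict(bins)


def consensus(sequences):
    """Column-wise majority vote over equal-length sequences.
    A tie for the most common base is reported as the degenerate letter 'N'."""
    result = []
    for column in zip(*sequences):
        top = Counter(column).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            result.append("N")
        else:
            result.append(top[0][0])
    return "".join(result)


def consensus_per_bin(reads, pid_length, min_bin_size=2):
    """Build one consensus per PID bin, skipping bins smaller than
    `min_bin_size` (a crude stand-in for filtering spurious bins)."""
    return {pid: consensus(seqs)
            for pid, seqs in bin_by_pid(reads, pid_length).items()
            if len(seqs) >= min_bin_size}


reads = ["AAAACGT", "AAAACGT", "AAAACTT", "CCCCGGA"]
print(consensus_per_bin(reads, pid_length=4))
```

A real pipeline would also need alignment within bins, outlier removal and detection of PIDs created by sequencing errors, all of which the abstract attributes to MotifBinner and none of which this sketch attempts.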