28 research outputs found

    A new hash function and its use in read mapping on genome

    Mapping reads onto genomes is an indispensable step in sequencing data analysis. A widely used method to speed up mapping is to index a genome with a hash table that stores the genomic positions of k-mers. The hash table size increases exponentially with the k-mer length, so the traditional hash function is not appropriate for a k-mer as long as a read. We present a hashing mechanism based on two functions, named score1 and score2, which can hash sequences with the length of reads. The size of the hash table is directly proportional to the genome size, which is far smaller than that of a hash table built by the conventional hash function. We evaluate our hashing system by developing a read mapper and running the mapper on the E. coli genome with some simulated data sets. The results show that a high percentage of simulated reads can be mapped to the correct locations on the genome.
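
    As a point of reference for the conventional approach the abstract contrasts itself with, a k-mer position index could be sketched as follows. This is not the score1/score2 mechanism, whose definition the abstract does not give; the function names `build_kmer_index` and `lookup` and the toy genome are illustrative assumptions.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict:
    """Map every k-mer in the genome to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)


def lookup(index: dict, read: str, k: int) -> list:
    """Return candidate mapping positions for a read, seeded by its first k-mer."""
    return index.get(read[:k], [])


genome = "ACGTACGTGA"  # toy sequence, assumed for illustration only
index = build_kmer_index(genome, 4)
print(lookup(index, "ACGTG", 4))  # [0, 4]

# A directly addressed table needs one slot per possible k-mer, i.e. 4**k slots
# for the DNA alphabet, which is why its size grows exponentially with k.
print(4 ** 12)  # 16777216
```

    A dictionary keyed by the k-mers that actually occur stays bounded by the genome length, but a candidate position still has to be verified against the full read.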

    New scoring schema for finding motifs in DNA Sequences

    Background: Pattern discovery in DNA sequences is one of the most fundamental problems in molecular biology, with important applications in finding regulatory signals and transcription factor binding sites. An important task in this problem is to search for (or predict) known binding sites in a new DNA sequence. To do this, all subsequences of the given DNA sequence are scored with a scoring function, and the prediction is made by selecting the best score. Most of the available tools for known binding site prediction are designed on the assumption of no dependency between binding site base positions. Recently, Tomovic and Oakeley investigated the statistical basis for either a claim of dependence or independence, to determine whether such a claim is generally true, and they presented a scoring function for binding site prediction based on the dependency between binding site base positions. Our primary objective is to investigate the scoring functions that can be used in known binding site prediction under the assumption of dependency or independency in binding site base positions. Results: We propose a new scoring function based on the dependency between all base positions in the binding site. This scoring function uses joint information content and mutual information as a measure of dependency between positions in the transcription factor binding site. Our method for modeling dependencies is simply an extension of position independency methods. We evaluate our new scoring function on real data sets extracted from the JASPAR and TRANSFAC databases, and compare the obtained results with two other well-known scoring functions. Conclusion: The results demonstrate that the new approach improves known binding site discovery and show that the joint information content and mutual information provide a better and more general criterion to investigate the relationships between positions in the TFBS. Our scoring function is formulated by simple mathematical calculations. From applying our method to several biological data sets, it can be inferred that this method performs better than methods that do not consider dependencies.
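
    To make the dependency measure concrete, the textbook mutual information between two columns of a set of aligned binding sites could be computed as in the sketch below. It illustrates the standard definition only, not the authors' scoring function, which the abstract does not spell out; the function name `mutual_information` and the toy sites are assumptions.

```python
import math
from collections import Counter


def mutual_information(sites: list, i: int, j: int) -> float:
    """Mutual information, in bits, between columns i and j of aligned sites."""
    n = len(sites)
    count_i = Counter(s[i] for s in sites)            # base counts in column i
    count_j = Counter(s[j] for s in sites)            # base counts in column j
    count_ij = Counter((s[i], s[j]) for s in sites)   # joint counts of base pairs
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


dependent = ["AC", "AC", "GT", "GT"]    # column 1 is fully determined by column 0
independent = ["AC", "AT", "GC", "GT"]  # every pair of bases is equally likely
print(mutual_information(dependent, 0, 1))    # 1.0
print(mutual_information(independent, 0, 1))  # 0.0
```

    A value near zero supports treating the two positions as independent, while larger values indicate the kind of dependency a position-independent model ignores.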

    Iranian Journal of Biotechnology

    Background: RNA molecules play many important regulatory, catalytic and structural roles in the cell, and RNA secondary structure prediction with pseudoknots is one of the most important problems in biology. An RNA pseudoknot is an element of the RNA secondary structure in which bases of a single-stranded loop pair with complementary bases outside the loop. Modeling these nested structures (pseudoknots) causes numerous computational difficulties, and so it has been generally neglected in RNA structure prediction algorithms. Objectives: In this study, we present a new heuristic algorithm for the Prediction of RNA Knotted structures using Tree Adjoining Grammars (named PreRKTAG). Materials and Methods: For a given RNA sequence, PreRKTAG uses a genetic algorithm on tree adjoining grammars to propose a structure with minimum thermodynamic energy. The genetic algorithm employs a subclass of tree adjoining grammars as individuals, by which the secondary structure of RNAs is modeled. New crossover and mutation operations were designed for the tree adjoining grammars. The fitness function is defined according to the RNA thermodynamic energy function, which drives the algorithm to converge to a stable structure. Results: The applicability of our algorithm is demonstrated by comparing its results with three well-known RNA secondary structure prediction algorithms that support crossed structures. Conclusions: We performed our comparison on a set of RNA sequences from the RNAseP database, where the outcomes show the efficiency and practicality of the proposed algorithm.

    A novel data augmentation approach for influenza A subtype prediction based on HA proteins

    Influenza, a pervasive viral respiratory illness, remains a significant global health concern. The influenza A virus, capable of causing pandemics, necessitates timely identification of specific subtypes for effective prevention and control, as highlighted by the World Health Organization. The genetic diversity of influenza A virus, especially in the hemagglutinin protein, presents challenges for accurate subtype prediction. This study introduces PreIS as a novel pipeline utilizing advanced protein language models and supervised data augmentation to discern subtle differences in hemagglutinin protein sequences. PreIS demonstrates two key contributions: leveraging pretrained protein language models for influenza subtype classification and utilizing supervised data augmentation to generate additional training data without extensive annotations. The effectiveness of the pipeline has been rigorously assessed through extensive experiments, demonstrating a superior performance with an impressive accuracy of 94.54% compared to the current state-of-the-art model, the MC-NN model, which achieves an accuracy of 89.6%. PreIS also exhibits proficiency in handling unknown subtypes, emphasizing the importance of early detection. Pioneering the classification of HxNy subtypes solely based on the hemagglutinin protein chain, this research sets a benchmark for future studies. These findings promise more precise and timely influenza subtype prediction, enhancing public health preparedness against influenza outbreaks and pandemics. The data and code underlying this article are available in https://github.com/CBRC-lab/PreIS

    Improvement in Sp1 binding site prediction by combining MNN data from different modifications.

    No full text
    ROC curves for a number of different methods for predicting bound locations. Results of predictions made by combining all 21 modifications (green line), 8 modifications (black line), and integrating H2A.Z and H3K4me3 data (blue line). Comparing this figure with Figure 1 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0089226#pone-0089226-g001) shows that LRCs applied to the data of single modifications perform better than LRCs trained on the combination of histone modifications. This may be due to the fact that the predictive ability for distinguishing true target regions is redundantly encoded among histone marks.

    Distributions of nucleosome positions around Sp1 binding sites.

    No full text
    Distributions of the central positions of nucleosomes for the top 8 marks and 3 repressive marks around Sp1 binding sites on the genome. The x-axis shows genomic positions with respect to the central position of Sp1 binding sites (from −1015 bp to +1015 bp). The positions of nucleosomes are defined as the positions from −15 bp to 15 bp with respect to the center of the nucleosome. Active marks are highly enriched around binding sites and show a bimodal distribution around these sites. A nucleosome-free region with respect to the central position of binding sites is also observable in all top marks.

    ROC curves for predicting the binding regions of Sp1 using the MNN feature.

    No full text
    ROC curves for 21 LRCs trained on individual histone modifications for prediction of Sp1 binding regions, using the MNN feature. The LRCs corresponding to each histone modification were trained on Chromosome 1 and tested on Chromosomes 2 to 22 and the two sex chromosomes. The LRCs assign a score to each interval. Predictions of binding regions are based on these scores. These curves show that the MNN feature is predictive of binding regions even when no PWM score is used. The x-axis is the false positive rate and the y-axis is the true positive rate. Shown are the curves of the most predictive modifications. ROC curves for the remaining 13 modifications can be found in Figure S1 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0089226#pone.0089226.s001).

    AUC values corresponding to different histone modifications for predicting the binding regions of Sp1 based on the MNN feature.

    No full text
    Results are shown for predicting the binding sites of Sp1 in CD4+ T cells using the MNN feature. The height of each bar corresponds to the area under the ROC curve (AUC). Certain modifications are more predictive of true binding regions. Comparing the results with those obtained using the PWM alone (Figure 3) clearly shows that the MNN feature, especially for certain modifications, can be used as an informative feature for TFBS prediction.
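
    For readers unfamiliar with the metric, the area under the ROC curve can be computed directly from classifier scores as the probability that a randomly chosen positive interval outscores a randomly chosen negative one. The sketch below uses that rank-based definition; the function name `auc` and the toy scores are assumptions and do not come from the study's data.

```python
def auc(scores: list, labels: list) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels use 1 for a bound (positive) interval and 0 for an unbound one.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for four intervals.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(auc(scores, labels))  # 0.75
```

    An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the bar heights are read. The pairwise loop is quadratic in the number of intervals, so a sort-based computation is preferable for genome-scale data.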

    Histone modification types.

    No full text
    Each modification is clustered into the active, repressive or moderate type based on its association with active or repressed genes. Moderate marks show a dual tendency toward active and repressed genes.