31 research outputs found

    FGF4 retrogene on CFA12 is responsible for chondrodystrophy and intervertebral disc disease in dogs.

    Chondrodystrophy in dogs is defined by dysplastic, shortened long bones and premature degeneration and calcification of intervertebral discs. Independent genome-wide association analyses for skeletal dysplasia (short limbs) within a single breed (P_Bonferroni = 0.01) and intervertebral disc disease (IVDD) across breeds (P_Bonferroni = 4.0 × 10⁻¹⁰) both identified a significant association to the same region on CFA12. Whole genome sequencing identified a highly expressed FGF4 retrogene within this shared region. The FGF4 retrogene segregated with limb length and had an odds ratio of 51.23 (95% CI = 46.69, 56.20) for IVDD. Long bone length in dogs is a unique example of multiple disease-causing retrocopies of the same parental gene in a mammalian species. FGF signaling abnormalities have been associated with skeletal dysplasia in humans, and our findings present opportunities for both selective elimination of a medically and financially devastating disease in dogs and further understanding of the ever-growing complexity of retrogene biology.

    IN-AIS-MACA: Integrated Artificial Immune System based Multiple Attractor Cellular Automata For Human Protein Coding and Promoter Prediction of 252bp Length DNA Sequence

    Gene prediction involves both protein-coding and promoter prediction. There is a need for integrated algorithms that can predict both kinds of region quickly. To date, only separate algorithms have been available for these two problems. We have developed a novel classifier, IN-AIS-MACA, which can predict both regions in genomic DNA sequences of length 252 bp with 93.5% accuracy and a total prediction time of 1031 ms. This classifier should encourage the development of more integrated classifiers of this kind.

    Generalizations of Markov model to characterize biological sequences

    BACKGROUND: The currently used kth-order Markov models estimate the probability of generating a single nucleotide conditional upon the immediately preceding (gap = 0) k units. However, this neither takes into account the joint dependency of multiple neighboring nucleotides, nor does it consider the long range dependency with gap > 0. RESULT: We describe a configurable tool to explore generalizations of the standard Markov model. We evaluated whether the sequence classification accuracy can be improved by using an alternative set of model parameters. The evaluation was done on four classes of biological sequences – CpG-poor promoters, all promoters, exons and nucleosome positioning sequences. Using di- and tri-nucleotide as the model unit significantly improved the sequence classification accuracy relative to the standard single nucleotide model. In the case of nucleosome positioning sequences, optimal accuracy was achieved at a gap length of 4. Furthermore in the plot of classification accuracy versus the gap, a periodicity of 10–11 bps was observed which might indicate structural preferences in the nucleosome positioning sequence. The tool is implemented in Java and is available for download at . CONCLUSION: Markov modeling is an important component of many sequence analysis tools. We have extended the standard Markov model to incorporate joint and long range dependencies between the sequence elements. The proposed generalizations of the Markov model are likely to improve the overall accuracy of sequence analysis tools.
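The gap idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' Java tool: the function names, the add-one pseudocount, and the restriction to a single-nucleotide model unit are all assumptions made for illustration. With `gap=0` it reduces to the standard kth-order model; with `gap>0` the k-nucleotide context is taken from further upstream.

```python
from collections import defaultdict
import math

ALPHABET = "ACGT"

def train_gapped_markov(seqs, k=1, gap=0, pseudocount=1.0):
    """Estimate P(x_i | the k nucleotides that end gap+1 positions before i).

    Hypothetical helper: returns a function prob(context, symbol).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(k + gap, len(seq)):
            context = seq[i - gap - k : i - gap]
            counts[context][seq[i]] += 1.0

    def prob(context, symbol):
        row = counts.get(context, {})
        # Pseudocounts keep unseen transitions from having zero probability.
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(symbol, 0.0) + pseudocount) / total

    return prob

def log_likelihood(prob, seq, k=1, gap=0):
    """Sum of log conditional probabilities over all scorable positions."""
    return sum(
        math.log(prob(seq[i - gap - k : i - gap], seq[i]))
        for i in range(k + gap, len(seq))
    )
```

A sequence could then be classified by training one model per class and assigning it to the class whose model gives the higher log-likelihood; extending the model unit to di- or tri-nucleotides, as the abstract describes, is left out of this sketch.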

    Recognition of DNA Splice Junction via Machine Learning Approaches

    Successful recognition of splice junction sites of human DNA sequences was achieved via three machine learning approaches. Both unsupervised (Kohonen's Self-Organizing Map, KSOM) and supervised (Back-propagation Neural Network, BNN; and Support Vector Machine, SVM) machine learning techniques were used for the classification of sequences from the testing set into one of three categories: transition from exon to intron, transition from intron to exon, and no transition. The dataset used in this study comprises 1,424 DNA sequences obtained from the National Center for Biotechnology Information (NCBI). Performance of the machine learning approaches was assessed by constructing learning models from the 1,000 sequences of the training set and evaluating them on the 424 sequences of the testing set, which are unknown to the learning model. Each sequence is a window 32 nucleotides long, comprising the regions from -15 to +15 nucleotides around the dinucleotide splice site. Since the nucleotides (A, C, G, and T) are represented by a four-digit binary code (e.g. 0001, 0010, 0100, and 1000), the number of descriptors increased from 32 to 128. The performance of the machine learning techniques in order of decreasing accuracy is SVM > BNN > KSOM, suggesting that SVM is a robust method for identifying unknown splice sites. Although KSOM gave lower prediction accuracy than the two supervised methods, it is notable that it was able to make such predictions based only on knowledge of the input, whereas the supervised methods require that the output be known during training. It is expected that the Support Vector Machine method can provide a powerful computational tool for predicting the splice junction sites of uncharacterized DNA.
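The jump from 32 to 128 descriptors follows from the four-digit code per nucleotide, as a small sketch shows. The specific assignment of codes to bases below is an assumption (the abstract lists the codes only as an example), and `encode_window` is a hypothetical name.

```python
# Assumed mapping; the abstract gives these codes only as an example.
CODES = {
    "A": (0, 0, 0, 1),
    "C": (0, 0, 1, 0),
    "G": (0, 1, 0, 0),
    "T": (1, 0, 0, 0),
}

def encode_window(seq):
    """Flatten a nucleotide window into a binary feature vector (4 bits per base)."""
    features = []
    for base in seq.upper():
        features.extend(CODES[base])  # raises KeyError on non-ACGT symbols
    return features
```

A 32-nucleotide window therefore yields a vector of length 128, which could be passed to a classifier such as an SVM.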

    GC-compositional strand bias around transcription start sites in plants and fungi

    BACKGROUND: A GC-compositional strand bias or GC-skew (=(C-G)/(C+G)), where C and G denote the numbers of cytosine and guanine residues, was recently reported near the transcription start sites (TSS) of Arabidopsis genes. However, it is unclear whether other eukaryotic species have equally prominent GC-skews, and the biological meaning of this trait remains unknown. RESULTS: Our study confirmed a significant GC-skew (C > G) in the TSS of Oryza sativa (rice) genes. The full-length cDNAs and genomic sequences from Arabidopsis and rice were compared using statistical analyses. Despite marked differences in the G+C content around the TSS in the two plants, the degrees of bias were almost identical. Although slight GC-skew peaks, including opposite skews (C < G), were detected around the TSS of genes in human and Drosophila, they were qualitatively and quantitatively different from those identified in plants. However, plant-like GC-skew in regions upstream of the translation initiation sites (TIS) in some fungi was identified following analyses of the expressed sequence tags and/or genomic sequences from other species. On the basis of our dataset, we estimated that >70 and 68% of Arabidopsis and rice genes, respectively, had a strong GC-skew (>0.33) in a 100-bp window (that is, the number of C residues was more than double the number of G residues in a +/-100-bp window around the TSS). The mean GC-skew value in the TSS of highly-expressed genes in Arabidopsis was significantly greater than that of genes with low expression levels. Many of the GC-skew peaks were preferentially located near the TSS, so we examined the potential value of GC-skew as an index for TSS identification. Our results confirm that the GC-skew can be used to assist the TSS prediction in plant genomes. CONCLUSION: The GC-skew (C > G) around the TSS is strictly conserved between monocot and eudicot plants (i.e. angiosperms in general), and a similar skew has been observed in some fungi. Highly-expressed Arabidopsis genes had overall a more marked GC-skew in the TSS compared to genes with low expression levels. We therefore propose that the GC-skew around the TSS in some plants and fungi is related to transcription. It might be caused by mutations during transcription initiation or the frequent use of transcription factor-binding sites having a strand preference. In addition, GC-skew is a good candidate index for TSS prediction in plant genomes, where there is a lack of correlation among CpG islands and genes.
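The skew definition in this abstract, (C-G)/(C+G), is simple to compute, and the link between the 0.33 threshold and "more than double" follows from (C-G)/(C+G) > 1/3 being equivalent to C > 2G. The sketch below is illustrative only: the function names, the 0-based TSS coordinate, and the choice to return 0.0 for windows with no C or G are assumptions, not details from the paper.

```python
def gc_skew(seq):
    """GC-skew as defined in the abstract: (C - G) / (C + G)."""
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    if c + g == 0:
        return 0.0  # assumption: treat windows without C or G as unskewed
    return (c - g) / (c + g)

def skew_around_tss(genome, tss, flank=100):
    """Skew in the window from tss - flank up to (not including) tss + flank.

    `tss` is assumed to be a 0-based index into `genome`.
    """
    start = max(0, tss - flank)
    return gc_skew(genome[start : tss + flank])
```

For example, a window with 2 C and 1 G residues has a skew of exactly 1/3, the boundary case for the "strong" skew described above.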

    Determining promoter location based on DNA structure first-principles calculations

    A new method is presented that predicts promoter regions from atomistic molecular dynamics simulations of small oligonucleotides, without requiring information on sequence conservation or features.

    Genome-wide analysis of core promoter elements from conserved human and mouse orthologous pairs

    BACKGROUND: The canonical core promoter elements consist of the TATA box, initiator (Inr), downstream core promoter element (DPE), TFIIB recognition element (BRE) and the newly-discovered motif 10 element (MTE). The motifs for these core promoter elements are highly degenerate, which tends to lead to a high false discovery rate when attempting to detect them in promoter sequences. RESULTS: In this study, we have performed the first analysis of these core promoter elements in orthologous mouse and human promoters with experimentally-supported transcription start sites. We have identified these various elements using a combination of positional weight matrices (PWMs) and the degree of conservation of orthologous mouse and human sequences – a procedure that significantly reduces the false positive rate of motif discovery. Our analysis of 9,010 orthologous mouse-human promoter pairs revealed two combinations of three-way synergistic effects, TATA-Inr-MTE and BRE-Inr-MTE. The former has previously been putatively identified in human, but the latter represents a novel synergistic relationship. CONCLUSION: Our results demonstrate that DNA sequence conservation can greatly improve the identification of functional core promoter elements in the human genome. The data also underscore the importance of synergistic occurrence of two or more core promoter elements. Furthermore, the sequence data and results presented here can help build better computational models for predicting the transcription start sites in the promoter regions, which remains one of the most challenging problems.
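One way to picture combining PWM scoring with conservation is sketched below. This is not the authors' procedure: the toy TATA-like matrix, the uniform background, the threshold, and the requirement that a hit occur at the same position in two pre-aligned, gap-free sequences are all assumptions chosen to keep the example small.

```python
import math

BACKGROUND = 0.25  # assumed uniform nucleotide background

# Toy 4-column matrix favouring "TATA"; the probabilities are invented.
TOY_TATA = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]

def log_odds_pwm(prob_columns):
    """Convert per-column base probabilities to log2 odds against the background."""
    return [
        {base: math.log2(col[base] / BACKGROUND) for base in "ACGT"}
        for col in prob_columns
    ]

def scan(seq, pwm, threshold):
    """Return start positions whose summed log-odds score reaches the threshold."""
    width = len(pwm)
    hits = set()
    for start in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[start + j]] for j in range(width))
        if score >= threshold:
            hits.add(start)
    return hits

def conserved_hits(human, mouse, pwm, threshold):
    """Keep only hits found at the same position in both aligned sequences."""
    return scan(human, pwm, threshold) & scan(mouse, pwm, threshold)
```

Requiring a hit in both orthologs discards matches that a degenerate motif produces by chance in only one species, which is the intuition behind the reduced false positive rate reported above.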

    DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models

    Effective DNA embedding remains crucial in genomic analysis, particularly in scenarios lacking labeled data for model fine-tuning, despite the significant advancements in genome foundation models. A prime example is metagenomics binning, a critical process in microbiome research that aims to group DNA sequences by their species from a complex mixture of DNA sequences derived from potentially thousands of distinct, often uncharacterized species. To fill the lack of effective DNA embedding models, we introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings. To encourage effective embeddings to error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C²LR) strategy. Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance. It outperforms the top baseline's performance in 10-shot species classification with just a 2-shot training while doubling the Adjusted Rand Index (ARI) in species clustering and substantially increasing the number of correctly identified species in metagenomics binning. The code, data, and pre-trained model are publicly available at https://github.com/Zhihan1996/DNABERT_S

    MetaProm: a neural network based meta-predictor for alternative human promoter prediction

    BACKGROUND: De novo eukaryotic promoter prediction is important for discovering novel genes and understanding gene regulation. In spite of the great advances made in the past decade, recent studies revealed that the overall performances of the current promoter prediction programs (PPPs) are still poor, and predictions made by individual PPPs do not overlap each other. Furthermore, most PPPs are trained and tested on the most-upstream promoters; their performances on alternative promoters have not been assessed. RESULTS: In this paper, we evaluate the performances of current major promoter prediction programs (i.e., PSPA, FirstEF, McPromoter, DragonGSF, DragonPF, and FProm) using 42,536 distinct human gene promoters on a genome-wide scale, and with emphasis on alternative promoters. We describe an artificial neural network (ANN) based meta-predictor program that integrates predictions from the current PPPs and the predicted promoters' relation to CpG islands. Our specific analysis of recently discovered alternative promoters reveals that although only 41% of the 3' most promoters overlap a CpG island, 74% of 5' most promoters overlap a CpG island. CONCLUSION: Our assessment of six PPPs on 1.06 × 10⁹ bps of human genome sequence reveals the specific strengths and weaknesses of individual PPPs. Our meta-predictor outperforms any individual PPP in sensitivity and specificity. Furthermore, we discovered that the 5' alternative promoters are more likely to be associated with a CpG island.