Position weight matrix model as a tool for the study of regulatory elements distribution across the DNA sequence

Abstract

Ab initio prediction of DNA regulatory regions known as transcription factor binding sites (TFBS) is a major challenge for modern bioinformatics. Although the currently available methods are not perfect, they are fairly reliable and can be used to search for new potential protein-DNA interaction sites. The biggest problem of ab initio approaches is the very high false positive rate of predicted sites, which results mainly from the fact that TFBS are very short and highly degenerate. Because of that, they can occur by chance every few hundred bases, which makes computational prediction extremely difficult if one aims to reduce the false positive rate while keeping the highest possible sensitivity for biologically meaningful sequence regions. In this work we present a new application that can be used to predict TFBS regions in very large datasets based on position weight matrix (PWM) models, using one of the most popular prediction methods. The application was used to estimate the concentration of TFBS in a set of nearly 2.2 thousand unique sequences of human gene promoter regions. The study revealed that the concentration of TFBS is constant at distances greater than 1 kbp from the transcription initiation site but decreases rapidly closer to the transcription initiation site. The decreasing TFBS concentration in the vicinity of genes might result from evolutionary selection, which keeps only the sites responsible for interactions with proteins that are part of a specific regulatory mechanism leading to cell survival.
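To make the PWM idea concrete, the following is a minimal sketch of how a sequence might be scanned with a position weight matrix and how hits could be counted per distance bin. It is not the application described in the abstract, whose scoring method is not specified here. The sketch assumes log-odds scoring against a uniform background, a pseudocount of 1, and a fixed score threshold; the function names `build_pwm`, `scan`, and `site_density` are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length example sites.

    Assumption: uniform background frequency and a simple additive pseudocount.
    """
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows containing ambiguous symbols such as N.
        if any(base not in BASES for base in window):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def site_density(hits, seq_length, bin_size):
    """Count predicted sites per bin of bin_size bases along the sequence."""
    n_bins = (seq_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start, _score in hits:
        counts[start // bin_size] += 1
    return counts
```

Because short, degenerate motifs match by chance, the threshold directly trades sensitivity against the false positive rate: lowering it recovers weaker true sites but also admits more random matches. Binning hits by position, as in `site_density`, is one simple way to compare predicted site concentration at different distances from a reference point such as the transcription initiation site.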
