A Vocabulon Study of E. coli Regulatory Sites with Feedback to Expression Array Analysis

Abstract

The identification of binding sites for regulatory proteins in the upstream region of genes is an important step towards understanding transcription regulation. In recent years, novel experimental techniques, such as gene expression arrays, and the availability of entire genome sequences have opened the possibility of more detailed investigations in this domain. Traditionally, the reconstruction of the profile of a binding site and the localization of all its occurrences in a sequence are treated as separate problems. The first is tackled using a small group of sequences known or suspected to contain the binding site, with neither its position nor its pattern known. One successful approach to this reconstruction problem is based on a probabilistic model of the sequence, represented as a concatenation of background and motif stochastic words. Maximum likelihood or maximum a posteriori estimates are obtained with EM or Gibbs sampler algorithms [13, 14]. The second problem is approached by considering one or multiple sequences of variable length; the pattern characterizing the motif is assumed known. Possible locations are identified on the basis of scoring functions that highlight the similarity of the motif to portions of the sequence. Cutoff values for such similarity scores are hard to determine: ad hoc solutions or estimates from a training set are often adopted [17, 18]. Typically these techniques are used to scan one sequence of interest against a database of known binding sites. While there are historical and practical reasons to consider these two problems as separate, the current post-genomic era, in which we are confronted with an abundance of sequence data, calls for a different approach. Consider the problem, tackled in [18], of identifying all the binding sites of the known regulatory proteins in the genome of E. coli.
While formally similar to blasting a small sequence of interest against a database of known regulatory proteins, this genome-wide search differs substantially. On the one hand, as one scans through the genome for binding sites of LexA, to take one example, and finds a substantial number of them, it seems appropriate to use the information in the identified locations to update the current pattern description. On the other hand, given that the output is not a small number of sites that can be investigated further, but a large collection of them, the significance cutoff should be based on proper probabilistic statements. To address these issues, one needs a probability model for the entire genome sequence that leads to the evaluation of a posteriori probabilities of the appearance of a binding site at any given location, and whose parameters can be estimated from data. At the same time, given the scale of the problem, the model should be suitable for rapid computation. In an attempt to address this need we introduce here the Vocabulon model. Section 2 describes the probability model we employ, its differences from others in the literature, and its current implementation. We then present the results of multiple investigations on the E. coli sequence. Given that genome-wide information on the location of binding sites is not available, we used results of gene expression array experiments to corroborate our results, arguing in favor of a novel perspective in array analysis.
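
To make the idea of a posteriori site probabilities concrete, the following is a minimal sketch of a toy "concatenation of words" model. It is not the Vocabulon implementation. We assume, purely for illustration, that each word is either a single background letter (drawn from `background`) or one motif word of fixed length (drawn from a position-specific matrix `pwm`), and that a motif word is chosen with probability `q`. All function and parameter names are hypothetical. Forward and backward recursions over word boundaries give the likelihood of the sequence, and from them the posterior probability that a motif word starts at each position.

```python
from typing import Dict, List

Column = Dict[str, float]  # letter -> probability at one motif position


def word_prob(word: str, pwm: List[Column]) -> float:
    """Probability of `word` under a product-multinomial motif model."""
    p = 1.0
    for column, base in zip(pwm, word):
        p *= column[base]
    return p


def motif_posteriors(seq: str, pwm: List[Column],
                     background: Column, q: float) -> List[float]:
    """Posterior probability that a motif word starts at each position.

    Toy assumption: the sequence is a concatenation of independent words,
    each a motif (probability q) or one background letter (probability 1-q).
    Plain floats are used, so this is only suitable for short sequences;
    a genome-scale version would need scaling or log-space arithmetic.
    """
    n, w = len(seq), len(pwm)
    # m[i]: probability that the motif emits seq[i:i+w] (0 if it does not fit)
    m = [word_prob(seq[i:i + w], pwm) if i + w <= n else 0.0 for i in range(n)]

    # fwd[i]: probability of seq[:i] with a word boundary at i
    fwd = [0.0] * (n + 1)
    fwd[0] = 1.0
    for i in range(1, n + 1):
        fwd[i] = (1 - q) * background[seq[i - 1]] * fwd[i - 1]
        if i >= w:
            fwd[i] += q * m[i - w] * fwd[i - w]

    # bwd[i]: probability of seq[i:] given a word boundary at i
    bwd = [0.0] * (n + 1)
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        bwd[i] = (1 - q) * background[seq[i]] * bwd[i + 1]
        if i + w <= n:
            bwd[i] += q * m[i] * bwd[i + w]

    total = fwd[n]  # likelihood of the whole sequence
    return [fwd[i] * q * m[i] * bwd[i + w] / total if i + w <= n else 0.0
            for i in range(n)]


# Illustrative inputs (made up, not taken from any real binding site).
consensus = "TGCA"
pwm = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in consensus]
background = {b: 0.25 for b in "ACGT"}
posteriors = motif_posteriors("AAAATGCAAAAA", pwm, background, q=0.05)
best_start = max(range(len(posteriors)), key=posteriors.__getitem__)
```

Unlike a raw similarity score, each value returned here is a probability under the stated model, so a reporting threshold (say, posterior above 0.5) has a direct interpretation. In the same spirit, such posteriors could serve as weights for re-estimating the motif pattern from the sites found, which is the kind of feedback the text argues for.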
