We discuss a simple and powerful approach for the ab initio identification of
cis-regulatory motifs involved in transcriptional regulation. The method we
present integrates several elements: human-mouse comparison, statistical
analysis of genomic sequences and the concept of coregulation. We apply it to a
complete scan of the human genome. Using the catalogue of conserved upstream
sequences collected in the CORG database, we construct sets of genes sharing the
same overrepresented motif (short DNA sequence) in their upstream regions both
in human and in mouse. We perform this construction for all possible motifs
from 5 to 8 nucleotides in length and then filter the resulting sets looking
for two types of evidence of coregulation: first, we analyze the Gene Ontology
annotation of the genes in the set, searching for statistically significant
common annotations; second, we analyze the expression profiles of the genes in
the set as measured by microarray experiments, searching for evidence of
coexpression. The sets which pass one or both filters are conjectured to
contain a significant fraction of coregulated genes, and the upstream motifs
characterizing the sets are thus good candidates to be the binding sites of the
transcription factors (TFs) involved in such regulation. In this way we find various known motifs and
also some new candidate binding sites.

Comment: 22 pages, 2 figures. Supplementary material available from the
author
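
The set-construction step described above can be illustrated with a minimal sketch. The abstract does not specify the overrepresentation statistic, so everything below is an assumption made for illustration only: a binomial tail test against a background frequency pooled over all upstream sequences, the cutoff `threshold`, the `orthologs` mapping, and all function names are hypothetical. The Gene Ontology and expression filters are not shown.

```python
from itertools import product
from math import comb


def all_motifs(min_len=5, max_len=8):
    """Yield every DNA word of length min_len..max_len."""
    for k in range(min_len, max_len + 1):
        for letters in product("ACGT", repeat=k):
            yield "".join(letters)


def count_occurrences(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def genes_with_overrepresented_motif(upstream, motif, threshold=0.01):
    """Return genes whose upstream sequence is enriched for motif.

    upstream maps gene name -> upstream sequence. The background
    frequency is estimated by pooling all sequences (an assumption,
    not necessarily the null model used in the paper).
    """
    positions = {g: len(s) - len(motif) + 1 for g, s in upstream.items()}
    total_positions = sum(n for n in positions.values() if n > 0)
    hits = {g: count_occurrences(s, motif) for g, s in upstream.items()}
    total_hits = sum(hits.values())
    if total_positions == 0 or total_hits == 0:
        return set()
    p = total_hits / total_positions
    return {
        g
        for g, n in positions.items()
        if n > 0 and hits[g] > 0 and binomial_tail(n, hits[g], p) < threshold
    }


def conserved_motif_set(human, mouse, orthologs, motif, threshold=0.01):
    """Human genes enriched for motif whose mouse ortholog is enriched too."""
    human_set = genes_with_overrepresented_motif(human, motif, threshold)
    mouse_set = genes_with_overrepresented_motif(mouse, motif, threshold)
    return {h for h in human_set if orthologs.get(h) in mouse_set}
```

In a full scan, `conserved_motif_set` would be called once per word produced by `all_motifs`, and each resulting gene set would then be passed to the annotation and coexpression filters.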