Introduction

Several studies have focused on the transcriptional regulation of gene expression in S. cerevisiae. Lee et al. [1] performed genome-wide location analysis to investigate how transcription factors (TFs) bind to the promoter regions of target genes across the S. cerevisiae genome. They observed nearly 4000 interactions between 106 TFs and thousands of target genes. The number of TFs acting on a gene varies, and its distribution follows a scale-free power law [2]. The number of genes regulated by a TF also varies, from a few to hundreds. Together these interactions make up a transcriptional regulatory network [3-6] with typical motifs such as autoregulation, feed-forward loop, multi-component loop, single-input motif, multi-input motif, and regulator chain, which are also found in the transcriptional regulatory network of E. coli [7]. The results of the genome-wide location analysis suggest that there must be cooperativity among TFs, which has been confirmed by many studies [8, 9].

On the other hand, Kruglyak and Tang [10] showed that the gene expression profiles of adjacent genes in S. cerevisiae are more highly correlated than those of random gene pairs. This suggests possible positional cooperativity between nearby genes, but a detailed positional analysis of the transcriptional regulatory network has been missing. In this paper, by calculating the positional correlation among the target genes of a TF, we identify an operon-like motif structure in which several genes located side by side are co-regulated by a TF.

Positional correlation function

The number of target genes regulated by each transcription factor ranges from 1 to 275. For example, the transcription factors ABF1, SWI4, ACE2, and ARO80 regulate 275, 136, 72, and 32 genes, respectively, as shown in fig. 1. When we plot the locations of the target genes of each TF over the 16 chromosomes, the lines are dispersed randomly without any preference. This is more evident when we plot the total number of target genes on each chromosome. In fig. 2, we plot the fraction of regulated genes on each chromosome, i.e. the total number of target genes on the chromosome divided by the number of genes on that chromosome. Except for some fluctuations, the distribution is uniform over the 16 chromosomes and shows no positional cooperativity at the chromosome level.

Genes regulated by transcription factors are distributed over 16 chromosomes. The position of the lines denotes the position of target genes on the chromosomes. Experimental data is from ref. [1].

Chromosome-wide distribution of regulated genes over the 16 chromosomes. Top line: the total number of genes on the chromosome; middle line: the total number of genes regulated by all TFs; bottom line: the fraction of regulated genes, i.e. the number of regulated genes divided by the total number of genes.

To quantify the detailed positional co-regulation, we measure the positional correlation function of gene regulation by a TF A defined as:

$C_{A}(d) = \frac{\langle (g_{i,A} - \bar{g}_{A})(g_{i+d,A} - \bar{g}_{A}) \rangle_{i}}{\sqrt{\langle (g_{i,A} - \bar{g}_{A})^{2} \rangle_{i}}}$

where $g_{i,A} = 1$ if gene i is bound by TF A and $g_{i,A} = 0$ otherwise, $g_{i+d,A}$ is the same binding indicator for the gene located at distance d from gene i, $\bar{g}_{A}$ denotes the average binding rate of TF A, and $\langle \cdot \rangle_{i}$ denotes the average over genes i.
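As an illustration, the correlation function defined above can be sketched in Python. This is a minimal sketch rather than the code used for the analysis: it assumes the genes of one chromosome are stored in chromosomal order as a 0/1 array, it skips pairs that would run past the end of the array, and the function name and the toy binding vector are hypothetical. The denominator keeps the square root exactly as written in the definition.

```python
import numpy as np

def positional_correlation(g, d):
    """Positional correlation C_A(d) of a 0/1 binding vector g.

    g[i] is 1 if gene i (in chromosomal order) is bound by the TF, else 0.
    Pairs (i, i + d) that would run past the end of the array are skipped,
    which is an assumption of this sketch.
    """
    g = np.asarray(g, dtype=float)
    if not 0 <= d < g.size:
        raise ValueError("d must satisfy 0 <= d < number of genes")
    dev = g - g.mean()                    # binding indicator minus average binding rate
    norm = np.sqrt(np.mean(dev ** 2))     # denominator as written in the definition
    if norm == 0.0:
        return 0.0                        # no variation: the TF binds all genes or none
    return float(np.mean(dev[: g.size - d] * dev[d:]) / norm)

# Toy binding vector (hypothetical data): two clusters of bound genes.
g_toy = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
for d in range(1, 4):
    print(d, round(positional_correlation(g_toy, d), 3))
```

For a genome-wide value, one option is to evaluate the function per chromosome and average the results, so that the last gene of one chromosome is not paired with the first gene of the next.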

In fig. 3(a), we plot the positional correlation function for some TFs. For TFs such as ARO80, RM101, GAT3, PHD1, and GAL4, the positional correlation among target genes is rather high compared with that of MOT3 or HAP5. In fig. 3(b), we plot the correlation function averaged over all TFs. The average positional correlation is non-vanishing up to distance 2 or 3, which indicates that genes located side by side are more likely to be co-regulated. To assess the significance of the observed data, we also plot the correlation functions obtained from a simple positional correlation model, defined as follows: the target gene $g_i$ of a TF is randomly selected with probability

$p_{i} = \frac{(g_{i-1} + g_{i+1} + 1)^{\alpha}}{\sum_{j} (g_{j-1} + g_{j+1} + 1)^{\alpha}}$

where $g_i = 1$ if gene i is bound by a TF and $g_i = 0$ otherwise. In this model, if the neighboring genes $g_{i+1}$ and $g_{i-1}$ are regulated, the probability $p_i$ that gene i is regulated by the TF is higher than when no neighboring gene is regulated. Cooperativity among TFs is not taken into account. The degree of co-regulation is controlled by the exponent α. In fig. 3(b), the correlation functions for α = 0, 1, 3, 6, 7, 8 (from bottom to top) are plotted, each averaged over 500 realizations.
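The selection rule of this model can be sketched as a short simulation. The text does not spell out the sampling procedure, so the following rests on assumptions: targets are added one at a time, a gene that is already bound cannot be chosen again, and genes beyond either end of the chromosome count as unbound. The function name `simulate_targets` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_targets(n_genes, n_targets, alpha, rng):
    """Pick n_targets genes one by one with weight (g[i-1] + g[i+1] + 1) ** alpha."""
    g = np.zeros(n_genes, dtype=int)
    for _ in range(n_targets):
        padded = np.pad(g, 1)                            # zeros beyond both ends
        weights = (padded[:-2] + padded[2:] + 1.0) ** alpha
        weights[g == 1] = 0.0                            # assumption: no gene is picked twice
        i = rng.choice(n_genes, p=weights / weights.sum())
        g[i] = 1
    return g

rng = np.random.default_rng(0)
for alpha in (0.0, 7.0):
    g = simulate_targets(n_genes=300, n_targets=30, alpha=alpha, rng=rng)
    adjacent = int(np.sum(g[:-1] * g[1:]))               # number of bound neighbor pairs
    print(f"alpha={alpha}: {adjacent} adjacent bound pairs")
```

With α = 0 every unbound gene has the same weight, so the selection is uniform; larger α favors genes next to an already bound gene, which is what produces clustered targets in the model.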

(a) Positional correlation among target genes of some transcription factors. (b) Positional correlation function averaged over all TFs, compared with model results for α = 0, 1, 3, 6, 7, 8 (bottom to top, respectively). Vertical lines on the curves denote the standard deviation.

With α = 0, target genes are chosen randomly and there is no correlation between nearby genes. The positional correlation increases with increasing α, and the real data are best fitted by the model with α = 7. This high value of α suggests that the observed co-regulation of nearby genes does not arise from random selection.

Operon-like motif structure

As we have shown in the previous section, the co-regulation of nearby genes is not due to chance. In the detailed positional mapping of the transcriptional regulation shown in fig. 4, several repeating patterns are observed. Note the presence of motif structures in which n genes (n = 2, 3, 4, 5, 6) located side by side are co-regulated by the same transcription factor (fig. 5). For example, the TF MBP1 regulates six genes, YHR149C, YHR150W, YHR151C, YHR152W, YHR153C, and YHR154W, which are located on the 8th chromosome as shown in fig. 4. Another example is GAT3 (YLR462W, YLR463C, YLR464W, YLR465C, YLR466W, YLR467W). Such motifs are indicated by small boxes in fig. 4.

Genes regulated by transcription factors that have high positional correlation, and an example of a 6-gene motif regulated by MBP1. Small boxes denote examples of 6-gene motifs.

Operon-like motifs with n-genes (n=2, 3, 4, 5, and 6). One transcription factor regulates n-genes located side by side.
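The side-by-side arrangement of the example genes can be read directly from their systematic names. In the standard yeast ORF nomenclature, the second letter encodes the chromosome (A to P for chromosomes 1 to 16), the third letter the chromosome arm (L or R), the number the position counted outward from the centromere, and the final letter the strand (W or C). The sketch below, with hypothetical helper names, parses the names and checks that a gene list is consecutive on one arm. It ignores ORFs annotated later with suffixes such as "-A", so it is only an approximate adjacency test.

```python
import re

ORF_PATTERN = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])$")

def parse_orf(name):
    """Return (chromosome number, arm, ordinal, strand) for a systematic ORF name."""
    match = ORF_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a plain systematic ORF name: {name}")
    chromosome = ord(match.group(1)) - ord("A") + 1      # 'H' -> 8
    return chromosome, match.group(2), int(match.group(3)), match.group(4)

def is_side_by_side(names):
    """True if all ORFs lie on the same chromosome arm with consecutive ordinals."""
    parsed = [parse_orf(n) for n in names]
    same_arm = len({(chrom, arm) for chrom, arm, _, _ in parsed}) == 1
    ordinals = sorted(p[2] for p in parsed)
    consecutive = all(b - a == 1 for a, b in zip(ordinals, ordinals[1:]))
    return same_arm and consecutive

mbp1_motif = ["YHR149C", "YHR150W", "YHR151C", "YHR152W", "YHR153C", "YHR154W"]
print(parse_orf(mbp1_motif[0]))      # chromosome 8, right arm
print(is_side_by_side(mbp1_motif))
```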

The number of n-gene motifs is listed in the second column of Table 1. In the third and fourth columns, we list the average and the standard deviation, respectively, of the number of n-gene motifs found in 1500 random regulatory networks. The random networks are generated by randomly selecting, for each TF, as many target genes as it has in the real data. This process is repeated for all 106 transcription factors. We then use the z-score to quantify the statistical significance of the number of motifs found in the real data. The z-score is defined as:

$z_{\mathrm{score}} = \frac{N_{\mathrm{real}} - N_{\mathrm{rand}}}{SD}$

where $N_{\mathrm{real}}$ is the number of motifs observed in the real data, $N_{\mathrm{rand}}$ is the average number observed in the random data, and SD is the standard deviation of the random data. The very high z-scores listed in the fifth column of Table 1 confirm the statistical significance of the operon-like motifs in the real regulatory network.
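The counting and randomization procedure can be sketched for a single TF as follows. Two points are assumptions of this sketch and not statements from the text: an n-gene motif is counted as a maximal run of exactly n consecutive target genes, and a random network is produced by permuting the binding vector, which keeps the number of targets fixed. The function names and the toy data are hypothetical; the full analysis would repeat this for all 106 TFs and sum the counts.

```python
import numpy as np

def count_runs(g, n):
    """Number of maximal runs of exactly n consecutive 1s in the 0/1 sequence g."""
    count, run = 0, 0
    for bound in list(g) + [0]:          # the trailing 0 closes a run at the end
        if bound:
            run += 1
        else:
            if run == n:
                count += 1
            run = 0
    return count

def motif_z_score(g_real, n, n_random=1500, seed=0):
    """Return (N_real, N_rand, SD, z) for n-gene motifs of one binding vector."""
    rng = np.random.default_rng(seed)
    g_real = np.asarray(g_real)
    n_real = count_runs(g_real, n)
    # Each permutation is one random network with the same number of targets.
    random_counts = np.array(
        [count_runs(rng.permutation(g_real), n) for _ in range(n_random)]
    )
    n_rand, sd = random_counts.mean(), random_counts.std()
    z = (n_real - n_rand) / sd if sd > 0 else float("nan")
    return n_real, float(n_rand), float(sd), float(z)

# Toy binding vector (hypothetical data): four adjacent pairs among 200 genes.
g_toy = np.zeros(200, dtype=int)
for start in (10, 30, 60, 90):
    g_toy[start:start + 2] = 1
print(motif_z_score(g_toy, n=2, n_random=300))
```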

The number of operon-like motif structures observed in the real data (N real) compared with the average number in random data (N rand) and its standard deviation (SD). The high z-scores in the last column indicate the significance of N real.
Motif size n (genes)	N real	N rand	SD	z-score
2	872	16.93	27.142	31.5
3	117	0.14	0.557	209.8
4	37	0.02	0.123	300.7
5	13	0.00067	0.025	520.0
6	6	<<1	<0.001	>>520.0
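As a quick arithmetic check, the z-scores for n = 2 to 5 can be recomputed from the other three columns using the definition given above. The values are copied from the table; the row for n = 6 is left out because the table gives only bounds for it.

```python
# (N_real, N_rand, SD) per motif size n, copied from the table above.
table_rows = {
    2: (872, 16.93, 27.142),
    3: (117, 0.14, 0.557),
    4: (37, 0.02, 0.123),
    5: (13, 0.00067, 0.025),
}

z_scores = {
    n: (n_real - n_rand) / sd
    for n, (n_real, n_rand, sd) in table_rows.items()
}

for n, z in z_scores.items():
    print(f"n={n}: z = {z:.1f}")
```

Rounded to one decimal, the recomputed values agree with the z-score column (31.5, 209.8, 300.7, 520.0).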

We also used the list of binding sites available from TRANSFAC [11] in combination with the recent work of Harbison et al. [12]. A binding site is a short stretch of nucleotides on the DNA where a transcription factor binds and stimulates gene expression. For each gene in a motif, we analyze the 1000 base pairs upstream of the gene and test for the presence of the binding site of the regulating TF. For example, ACGCGt is the binding-site sequence of MBP1, and it appears in the upstream regions of two motif genes of MBP1, YHR153C and YHR154W, but not in those of YHR149C, YHR150W, YHR151C, and YHR152W. The binding site is found for about 38% of the genes in motifs (the remaining 62% lack it). This might be due to two types of transcriptional regulation: cis-regulation and trans-regulation.
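The upstream scan described above can be sketched as a simple string search. The matching details are assumptions of this sketch and are not given in the text: the site is matched exactly, case is ignored, and both strands are searched by also looking for the reverse complement. The function names and the short toy sequences (which stand in for real 1000 bp upstream regions) are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse complement of a DNA string; unknown symbols become 'N'."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))

def has_site(upstream, site):
    """True if the site occurs on either strand of the upstream sequence."""
    upstream, site = upstream.upper(), site.upper()
    return site in upstream or reverse_complement(site) in upstream

def fraction_with_site(upstream_by_gene, site):
    """Fraction of genes whose upstream sequence contains the site."""
    hits = [has_site(seq, site) for seq in upstream_by_gene.values()]
    return sum(hits) / len(hits)

# Toy upstream sequences (hypothetical, far shorter than 1000 bp).
toy_upstream = {
    "gene_1": "TTGAACGCGTAAGC",
    "gene_2": "GGGTTTAAACCCGG",
    "gene_3": "CATACGCGTTTACA",
    "gene_4": "ATATATATATATAT",
}
print(fraction_with_site(toy_upstream, "ACGCGt"))
```

Note that ACGCGT is its own reverse complement, so for this particular site the second-strand search adds nothing; it matters for non-palindromic sites.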

We note that in prokaryotes a few genes involved in the same function are often located side by side and regulated, in sequence, by one transcription factor. As this arrangement is called an operon, we call the n-gene motif structure an ‘operon-like’ motif. In prokaryotes, the genes in an operon are strongly correlated [10]. Similarly, to examine the functional correlation among genes in the operon-like motifs, we analyze their gene expression profiles. For this, we use gene expression data from several different experimental conditions [13-16], 314 data points in total (cell cycle: 77, sporulation: 9, stress response: 173, MAPK signaling: 45, and unfolded protein response: 10). In fig. 6, the correlations of gene expression profiles within the motifs and those of random gene pairs are presented. Although the fraction of highly correlated pairs among motif gene pairs is not large, highly correlated pairs do occur, especially in motifs of 4, 5, and 6 genes. This is in agreement with the work of Kruglyak and Tang [10].
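The comparison between motif gene pairs and random gene pairs can be sketched as below. The text does not name the correlation measure, so the Pearson coefficient used here is an assumption, as are the function names and the synthetic profiles, which only mimic the length (314 data points) of the combined data set.

```python
from itertools import combinations
import numpy as np

def motif_pair_correlations(expr, motif_genes):
    """Pearson correlation for every pair of genes within one motif."""
    return [
        float(np.corrcoef(expr[a], expr[b])[0, 1])
        for a, b in combinations(motif_genes, 2)
    ]

def random_pair_correlations(expr, n_pairs, rng):
    """Pearson correlation for randomly drawn pairs of distinct genes."""
    names = sorted(expr)
    values = []
    for _ in range(n_pairs):
        i, j = rng.choice(len(names), size=2, replace=False)
        values.append(float(np.corrcoef(expr[names[i]], expr[names[j]])[0, 1]))
    return values

# Synthetic profiles (hypothetical data): three co-expressed genes plus background.
rng = np.random.default_rng(0)
shared = rng.normal(size=314)
expr = {f"motif_{k}": shared + 0.1 * rng.normal(size=314) for k in range(3)}
expr.update({f"other_{k}": rng.normal(size=314) for k in range(50)})

in_motif = motif_pair_correlations(expr, ["motif_0", "motif_1", "motif_2"])
background = random_pair_correlations(expr, n_pairs=200, rng=rng)
print(np.mean(in_motif), np.mean(background))
```

Histograms of the two lists give distributions of the kind compared in fig. 6.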

Distributions of the correlation of gene expression profiles for gene pairs in n-gene motifs (red stars) and for random pairs (blue squares). With increasing n, the fraction of highly correlated gene pairs increases. Experimental data are from refs. [13-16].

What is the origin of the positional cooperativity and of the operon-like motif structure? One possibility is that they are generated through gene duplication. Gene duplication is the driving force for creating new genes in a genome: at least 50% of prokaryotic genes and over 90% of eukaryotic genes are products of gene duplication [17]. Typically, the duplicated gene is located near its original gene and is regulated by the same transcription factor, as shown in fig. 7.

Duplication of regulated gene and positional cooperativity.

One possible design principle behind the operon-like motif structure is that it can be very effective in generating a spatio-temporal pattern with gradual onset of the target genes, as in the single-input module (SIM) motif structure [1, 7].

Conclusion

Based on genome-wide transcriptional regulation network data for S. cerevisiae, we found positional cooperativity among the target genes of a transcription factor. As a consequence of this positional cooperativity, operon-like motif structures, in which several genes located side by side on a chromosome are co-regulated, were also observed. The high correlation of gene expression profiles among genes in operon-like motif structures indicates functional correlations among the motif genes. This result is consistent with that of De Hoon et al. [18], who observed highly correlated gene expression among genes in the operon structures of Bacillus subtilis, a prokaryotic organism. Furthermore, our results are consistent with several previous works on S. cerevisiae. Kruglyak and Tang [10] and Cohen et al. [19] analyzed the expression of adjacent genes and showed that the expression patterns of adjacent genes are more often highly correlated than those of randomly selected gene pairs, but their analysis is limited to two nearby genes. They also found that many functionally related genes are located very close to each other. Cohen et al. [19] showed, by examining the functional categories of 2081 adjacent gene pairs, that adjacent genes often fall into the same functional category. Cho et al. [20] also showed the existence of adjacent genes whose expression is initiated in the same phase of the cell cycle. However, they did not address the presence of common transcriptional regulators.

With this analysis, we report the presence of operon-like motif structures: several groups of adjacent genes are regulated by the same transcription factor, and they tend to have highly correlated expression and related gene functions. Identifying the functional roles of operon-like motif structures remains an open task for experiments.

Acknowledgement

We would like to thank Prof. Joo Young Yoo at POSTECH for useful discussions. This work was supported by the SBD-NCRC program at POSTECH and also by the research grant from Chungbuk National University, 2004.

References

1. T. I. Lee et al., Science 298, 799 (2002).

2. R. Albert, J Cell Sci 118, 4947 (2005).

3. A. S. Seshasayee, P. Bertone, G. M. Fraser, N. M. Luscombe, Curr Opin Microbiol 9, 511 (2006).

4. D. F. Veiga, F. F. Vicente, G. Bastos, Genet Mol Res 5, 254 (2006).

5. A. Wagner, J. Wright, Biosystems (2006).

6. C. N. Yoon, S. K. Han, H. Y. Kim, J Korean Phys Soc 44, 638 (2004).

7. S. S. Shen-Orr, R. Milo, S. Mangan, U. Alon, Nat Genet 31, 64 (2002).

8. Y. H. Chang, Y. C. Wang, B. S. Chen, Bioinformatics (2006).

9. N. K. Goto, T. Zor, M. Martinez-Yamout, H. J. Dyson, P. E. Wright, J Biol Chem 277, 43168 (2002).

10. S. Kruglyak, H. Tang, Trends Genet 16, 109 (2000).

11. V. Matys et al., Nucleic Acids Res 31, 374 (2003).

12. C. T. Harbison et al., Nature 431, 99 (2004).

13. P. T. Spellman et al., Mol Biol Cell 9, 3273 (1998).

14. S. Chu et al., Science 282, 699 (1998).

15. A. P. Gasch et al., Mol Biol Cell 11, 4241 (2000).

16. K. J. Travers et al., Cell 101, 249 (2000).

17. S. A. Teichmann, M. M. Babu, Nat Genet 36, 492 (2004).

18. M. J. De Hoon, S. Imoto, K. Kobayashi, N. Ogasawara, S. Miyano, Pac Symp Biocomput, 276 (2004).

19. B. A. Cohen, R. D. Mitra, J. D. Hughes, G. M. Church, Nat Genet 26, 183 (2000).

20. R. J. Cho et al., Mol Cell 2, 65 (1998).