MethylSeqLogo: DNA methylation smart sequence logos | BMC Bioinformatics

Here we describe the design rationale and details of the MethylSeqLogo display, presented schematically in Fig. 1.

Fig. 1 Design of MethylSeqLogo. Proportional shading of C’s and G’s indicates the methylation level of TFBS cytosines on the forward and reverse strands, respectively, while a dashed line indicates the expected level of methylation based on the background distribution. The methylation key at the lower right of the logo shows the background methylation probabilities of CG, CHG and CHH, respectively, and the four single-nucleotide background probabilities. The top track shows the relative entropy contributed by methylation in each context/strand combination, with information associated with cytosines on the reverse strand displayed downward. In the bottom track, positive height indicates the presence of under-represented dimers (typically CpG), and negative height (not seen in this example) indicates the presence of over-represented dimers. For reference, the theoretical maximum and minimum dimer relative entropy contributions achievable for the given background are also shown.

Design goals

1. Keep the advantages of sequence logos, including familiarity.

2. For methylation, clearly display:

Strand (\(+/-\)) of the binding site

Trinucleotide context (CG, CHG or CHH)

Comparison relative to a background model

Dimer enrichment/depletion in the motif

We achieve the first goal by respecting two expectations that viewers familiar with sequence logos will have: first, that the relative height of an element (e.g. “A”) within a column represents the frequency of the corresponding element; and second, that the height of a column represents an information theoretic measure (relative entropy) of the degree to which that position in the binding sites differs from background [7].

We achieve the second goal by adding several intuitive elements to the plot:

Partial shading of C’s and G’s

Dashed line indicating expected methylation level

Box at right showing background frequencies

Context-colored methylation info track at top

Dimer enrichment/depletion info track at bottom
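The two sequence-logo conventions mentioned above (column height equal to relative entropy, letter height proportional to frequency) can be illustrated with a minimal sketch. This is not MethylSeqLogo's code: the function name `logo_column`, the counts and the background values are hypothetical, and no small-sample correction is applied.

```python
from math import log2

def logo_column(counts, background):
    """Return (column_height_bits, letter_heights) for one alignment column.

    counts: nucleotide -> count observed at this position in the binding sites
    background: nucleotide -> zero order background probability
    """
    total = sum(counts.values())
    freqs = {b: n / total for b, n in counts.items()}
    # Column height: relative entropy (bits) of the column against background.
    height = sum(p * log2(p / background[b]) for b, p in freqs.items() if p > 0)
    # Letter height: the column height split in proportion to frequency.
    letters = {b: p * height for b, p in freqs.items()}
    return height, letters

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}  # hypothetical background
column = {"A": 0, "C": 6, "G": 2, "T": 0}               # hypothetical counts

h_uniform, letters_uniform = logo_column(column, uniform)
h_at_rich, letters_at_rich = logo_column(column, at_rich)
```

In this toy case the same GC-rich column is taller against the AT-rich background than against the uniform one, because it is further from what that background would produce.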

The height of the shading of C’s and G’s is proportional to the methylation level of cytosines on the forward and reverse strands, respectively.

To give users a clear picture of hyper- or hypo-methylation, we added a dashed line showing the methylation level that would be expected based on the background distribution (taking the trinucleotide context {CG, CHG, CHH} in each binding site into account).

We designed a methylation info track showing, for each position in the binding site, the contribution of each context to the methylation information, and a box at right showing the background distribution of bases and methylation used for the relative entropy computation (Fig. 1).

Column heights

This section describes how column height is determined for MethylSeqLogo’s three tracks so that the total information in a set of binding sites can be estimated by visually adding up the heights of all elements in a MethylSeqLogo display.

Column heights indicate relative entropy

Sequence logos often employ a background model fit to a set of background sequences, such as the whole genome or promoter regions. The background model is used to compute how “typical” the binding sequences are, with the idea that atypical binding site sequences should be emphasized visually (given taller column height) to reflect their statistical distance from background. For example, binding sites abundant in C and G should be emphasized more against an AT-rich background than against an AT-poor background. Quantitatively, the column heights are made proportional to the relative entropy, also known as the Kullback–Leibler directed divergence [21], and equivalent to information content [22] under a uniform background distribution.

Sequence background models

In explaining the MethylSeqLogo sequence logo and dimer information tracks, we will refer to zero order and first order Markov background models.
Zero order models generate each nucleotide of a DNA sequence independently, whereas first order models condition the nucleotide probabilities on the previous nucleotide.

Relative entropy formula

To facilitate describing the column heights of the MethylSeqLogo tracks in the following sections, we state the definition of relative entropy:

$$\begin{aligned} \text {D}( M ||B ) \,\overset{\hbox {def}}{=}\,\text {E} \left[ \; \lg \left( \frac{P[\,s|\,\text {Motif model~} M]}{P[\,s|\,\text {Background model~} B]} \right) \;\right] \end{aligned}$$

using \(\lg \) to denote \(\log _2\).

With this notation, the difference in relative entropy when employing different background models \(\textbf{B}_{1}\) versus \(\textbf{B}_{0}\) is:

$$\begin{aligned} \text {D}( M ||\textbf{B}_{1}) - \text {D}( M ||\textbf{B}_{0}) \,=\, \text {E}\big [ \lg ( P[s|\textbf{B}_{0}] ) \big ] - \text {E}\big [ \lg ( P[s|\textbf{B}_{1}] ) \big ] \end{aligned}$$

where the expectation is the average over the individual binding site sequences s in a set of binding sites.

Sequence logo track column height

Standard sequence logos typically display columns with a height proportional to relative entropy using a PWM (Position Weight Matrix) based motif model, which assigns distinct probabilities to the nucleotides {A,C,G,T} at each position but assumes independence between positions. A zero order Markov model, which also assumes positional independence, is usually employed as the background model. In this case the relative entropy of the binding sites is easily decomposed into a sum with one term for each position, and can therefore be conveniently displayed via the height of the column representing each position. MethylSeqLogo adopts these conventions for its sequence logo track.

Dimer information track

Although convenient, a zero order background model is unable to represent the striking (sometimes > 4x) depletion of CpG (relative to CpC, GpC, and GpG dinucleotides) in mammalian genomes.
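To make the CpG depletion and the relative entropy difference formula concrete, here is a minimal sketch under stated assumptions. It is not MethylSeqLogo's implementation: the function names, the toy background string, the toy sites and the add-one smoothing are all hypothetical, and the first order model is assumed to draw its first nucleotide from the zero order frequencies.

```python
from collections import Counter
from math import log2

BASES = "ACGT"

def fit_background(seqs, pseudo=1.0):
    """Fit zero order P0(b) and first order P1(b | a) with add-one smoothing."""
    base = Counter({b: pseudo for b in BASES})
    pair = Counter({(a, b): pseudo for a in BASES for b in BASES})
    for s in seqs:
        base.update(s)               # single nucleotide counts
        pair.update(zip(s, s[1:]))   # adjacent dimer counts
    n = sum(base.values())
    p0 = {b: base[b] / n for b in BASES}
    p1 = {}
    for a in BASES:
        row = sum(pair[(a, b)] for b in BASES)
        for b in BASES:
            p1[(a, b)] = pair[(a, b)] / row
    return p0, p1

def lg_p_zero(s, p0):
    return sum(log2(p0[b]) for b in s)

def lg_p_first(s, p0, p1):
    # Assumption: the first nucleotide is drawn from the zero order frequencies.
    return log2(p0[s[0]]) + sum(log2(p1[d]) for d in zip(s, s[1:]))

def dimer_terms(sites, p0, p1):
    """One term per adjacent position pair: mean of lg P0(b) - lg P1(b | a)."""
    width = len(sites[0])
    return [
        sum(log2(p0[s[k + 1]] / p1[(s[k], s[k + 1])]) for s in sites) / len(sites)
        for k in range(width - 1)
    ]

background = ["ATTGCATCCAGGTTACAGTGACCTGAATGCCATGGTACTTCAGAACTGGA"]  # toy, no CpG
sites = ["ACGT", "TCGA", "ACGA"]  # toy binding sites with a central CpG

p0, p1 = fit_background(background)
cpg_obs_over_exp = p1[("C", "G")] / p0["G"]   # below 1 means CpG is depleted
terms = dimer_terms(sites, p0, p1)
difference = sum(lg_p_zero(s, p0) - lg_p_first(s, p0, p1) for s in sites) / len(sites)
```

Under these assumptions the per-pair terms add up to the average of lg P[s|B0] - lg P[s|B1] over the sites, and the term for the CpG pair is positive because CpG is under-represented in the toy background.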
Admittedly, CpG’s are much less depleted in promoter regions, but there is still a discrepancy between the actual dimer frequencies and those predicted by a zero order model. Therefore a first order model should provide a substantially more useful measure of how statistically distinct a set of binding sites is from background.

Given the potential size of this effect and the fact that methylation occurs at CpG dimers, we decided MethylSeqLogo should display information based on a first order Markov model background. We did not want to change the sequence logo track, so instead of directly displaying relative entropy against a first order Markov model, we chose to display the difference between that relative entropy and the zero order background relative entropy in a separate track. Fortunately, this difference can easily be decomposed into a sum of terms, one for each pair of adjacent positions (see supplementary text for a mathematical derivation). Since these terms represent pairs of adjacent nucleotide positions, MethylSeqLogo displays them as vertical bars between the two positions. In theory, column heights in this track can be negative if the binding sites contain many over-represented dimers (for example, homodimers XpX may be somewhat over-represented).

Methylation track column height

Hyper- or hypo-methylation of TF binding sites (relative to a background) may help distinguish those binding sites from background. To allow users to see this effect, MethylSeqLogo presents a methylation information track above the main sequence logo track. Informally, the height of the bars in the methylation information track represents the additional surprise experienced when observing the methylation value at position i of one of the TFBSs after having observed the primary sequence, since the sequence information is already accounted for in the other tracks.
The propensity of genomic cytosines to be methylated differs strongly depending on the following base or two (i.e. the CG, CHG, or CHH trinucleotide context), so we separate these cases in our computation. For a background distribution these three cases are enough, while for binding sites, position and strand must also be considered. Thus altogether we separate the methylation data for each position in a collection of TFBSs into 6 strand-specific contexts: 3 trinucleotide contexts \(\times \) 2 strands (Supplementary Fig. 2).

Formally, let \(P_{\textsf {context}\,\vert \,i}\) denote the probability that a binding site will have a cytosine matching the given context at position i and \(P_{m\,\vert \,\textsf {context},i}\) denote the probability that such a cytosine will be methylated or not, while \(P_{m\,\vert \,\textsf {context,BG}}\) denotes the background probability of a cytosine in that given context being methylated or not. We can write the contribution of methylation information to the height of column i as a sum of one term per context.

Note that relative entropy is inherently robust to small sample estimation error in \(P_{m\,\vert \,{\textsf {context},\,i}}\) since it includes a multiplicative term \({P_{\textsf {context}\,\vert \,i}}\) in the contribution of that context to column height. Thus rare contexts cannot make large contributions to column height.
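The robustness argument can be checked numerically with a minimal sketch. It assumes that each context's term has the usual relative entropy form, \(P_{\textsf {context}\,\vert \,i} \sum _{m} P_{m\,\vert \,\textsf {context},i} \lg (P_{m\,\vert \,\textsf {context},i} / P_{m\,\vert \,\textsf {context,BG}})\) with m ranging over methylated and unmethylated; that form is an assumption consistent with the definitions above, and the function names and all probability values are hypothetical.

```python
from math import log2

def methylation_kl(p_site, p_bg):
    """Relative entropy (bits) between two methylated/unmethylated distributions,
    given the methylation probability at the site and in the background."""
    total = 0.0
    for p, q in ((p_site, p_bg), (1.0 - p_site, 1.0 - p_bg)):
        if p > 0:  # a zero-probability outcome contributes nothing
            total += p * log2(p / q)
    return total

def methylation_terms(p_context, p_meth_site, p_meth_bg):
    """Per-context contribution at one position, keyed by (context, strand).
    Background methylation depends on the trinucleotide context only."""
    return {
        key: p_context[key] * methylation_kl(p_meth_site[key], p_meth_bg[key[0]])
        for key in p_context
    }

# Hypothetical values for a single position i.
p_meth_bg = {"CG": 0.8, "CHG": 0.02, "CHH": 0.01}
p_context = {("CG", "+"): 0.6, ("CHH", "+"): 0.01}
p_meth_site = {("CG", "+"): 0.1, ("CHH", "+"): 1.0}  # CHH estimate from very few sites

meth_terms = methylation_terms(p_context, p_meth_site, p_meth_bg)
```

In this toy case the rare CHH context has an extreme methylation estimate, yet its term stays small because it is scaled by a context probability of 0.01, while the common, hypo-methylated CG context dominates the column.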
