Structural insights into i-motif DNA structures in sequences from the insulin-linked polymorphic region

Characterisation of the sequence variants within the ILPR

The polymorphism in the length of the regulatory promoter region of the insulin gene was suggested as a genetic marker for non-insulin-dependent diabetes in 1983 (ref. 23). A follow-up population study with 298 unrelated individuals revealed that the 5ʹ-flanking region of the human insulin gene is polymorphic in both nucleotide length and sequence13. Rotwein et al.13 reported 14 sequence variants, with minor changes from the predominant ILPR sequence, but three variants were limited to only 0.2% of the population. In an early study, some ILPR variants were inserted as an isolated segment into a minimal prolactin promoter-luciferase construct and co-transfected with a known insulin-related transcription factor, Pur-1, to mimic the beta-cell molecular microenvironment in non-beta cells. This showed that the overexpression of Pur-1 has different effects on gene expression in the minimal promoter system with different ILPR variants. The highest Pur-1 affinity was associated with the most prevalent ILPR sequence, providing the initial proof-of-concept that the ILPR is linked to regulating insulin expression5. Another study focused on three of the G-rich ILPR variants and was able to correlate the conformation of the G-quadruplex structure with binding affinity to insulin and insulin-like growth factor22. Although the G-rich variants from the ILPR have been investigated, studies on the C-rich sequences are limited to the most prevalent variant20,21,24,25.

Here we focused on the 11 main ILPR variants of both the C-rich (ILPRC, Table 1) and the G-rich sequences (ILPRG, Table 2) to fully understand the relationship between variant sequence, structure, and function. Each tandem repeat has two tracts of guanines/cytosines (5′-ACAGGGGTGTGGGG-3′/3′-TGTCCCCACACCCC-5′), so we designed our sequences to have two repeats, providing the minimum sequence necessary for i-motif or G-quadruplex formation.
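The design logic above (two tracts per repeat, two repeats per construct, hence four tracts) can be checked with a short Python sketch. It uses only the repeat unit quoted in the text; the helper names and the minimum tract length of three are illustrative assumptions, not part of the published analysis, and the constructs here omit the flanking nucleotides used in the actual oligonucleotides.

```python
import re

# Repeat unit quoted in the text (G-rich strand, written 5'->3').
G_REPEAT = "ACAGGGGTGTGGGG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_tracts(seq: str, base: str, min_len: int = 3) -> list[tuple[int, str]]:
    """List (1-based start, tract) for every run of `base` of at least min_len.

    min_len = 3 is an assumed threshold chosen for illustration only.
    """
    pattern = re.compile(f"{base}{{{min_len},}}")
    return [(m.start() + 1, m.group()) for m in pattern.finditer(seq)]


# Two tandem repeats, without flanks (a simplification of the real constructs).
g_construct = G_REPEAT * 2
c_construct = reverse_complement(g_construct)

g_tracts = find_tracts(g_construct, "G")
c_tracts = find_tracts(c_construct, "C")

print("G-rich:", g_construct, g_tracts)
print("C-rich:", c_construct, c_tracts)
```

Each strand yields four tracts of four bases, which is the minimum needed for an intramolecular G-quadruplex or i-motif; a single repeat would supply only two.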
We maintained flanking sequences on either side of the terminal C/G-tracts in line with the tandem repeat sequence. The general sequence for each variant was Flank-(C/G-tract)-Loop1-(C/G-tract)-Loop2-(C/G-tract)-Loop3-(C/G-tract)-Flank. Each C-rich and G-rich variant was characterised using circular dichroism (CD) to determine the overall topology, thermal difference spectroscopy (TDS) to characterise the type of structure in solution, and UV melting/annealing experiments to determine the thermal stability. For the C-rich sequences, the transitional pH was determined by CD, to allow comparison of the pH stability of the sequence variants.

Table 1 Biophysical characterisation data for the sequences of the C-rich ILPR variants

Table 2 Biophysical characterisation data for the sequences of the G-rich ILPR variants

The C-rich ILPR sequence variants were characterised in 10 mM sodium cacodylate buffer with 100 mM KCl. CD spectroscopy was performed at a range of pH values between 4 and 8 for determination of the pHT. TDS and UV melting and annealing experiments were performed at pH 5.5 so that the relative stability of all sequences could be compared side by side, including those that may not be stable at physiologically relevant pH. A summary of the sequence, the melting temperature (Tm), annealing temperature (Ta), transitional pH (pHT), and structural assignment by TDS is provided in Table 1. The corresponding representative data are provided in the supplementary information (Supplementary Figs. 1–5).

The predominant ILPR C-rich variant (1C) gave clear UV melting and annealing traces (Supplementary Fig. 1) and a Tm of 55 ± 0.2 °C (Table 1). This melting temperature is higher than that measured by others for the same sequence.
Although the previous work was performed in phosphate buffer, which displays reduced buffering capacity at elevated temperatures20, the likely cause of the different Tm values is differences in annealing procedures26. The TDS of sequence 1C showed positive peaks at 240 and 265 nm and a negative peak at 295 nm, consistent with an i-motif structure signature profile (Fig. 1A)27. Similarly, CD spectroscopy of variant 1C showed i-motif formation at acidic pH, indicated by a positive peak at 288 nm and a negative peak at 260 nm (Fig. 1B and Supplementary Fig. 2)28. As the pH increases towards pH 7, the structure unfolds: the positive peak shifts to 273 nm and the negative peak to 250 nm (Supplementary Fig. 2A). The pHT of 1C was determined to be 6.6 (Table 1 and Supplementary Fig. 2F), in line with previous experiments for this sequence24. Variant 4C demonstrated the same stability as 1C, with a pHT of 6.7 (Table 1, Supplementary Fig. 2I), but the other variants were all significantly less pH stable, with pHTs as low as 4.6 and 5.1 (11C and 10C, Table 1 and Supplementary Fig. 4). Interestingly, minor differences in the sequence made significant changes in the stability of the structures formed. For example, in 2C, a single C-to-G mutation in each of the tandem repeats results in significantly lower melting (50 ± 0.6 °C compared to 55 ± 0.2 °C) and annealing temperatures (p < 0.0001) and also a lower pHT of 5.2 (p < 0.0001). Variant 2C appears to form a hairpin-like structure at pH 5.5 in both the TDS analysis (Fig. 1A) and the CD spectra (Fig. 1B). That is, this C-to-G mutation prevents i-motif formation.

Fig. 1: Representative circular dichroism and thermal difference spectroscopy of ILPR variants examined in the cell-based reporter gene assay. A TDS of 2.5 μM ILPRC variants 1C, 2C, 4C and 10C in 10 mM sodium cacodylate, 100 mM KCl at pH 5.5. B CD spectra of 10 μM ILPRC variants 1C, 2C, 4C and 10C in 10 mM sodium cacodylate, 100 mM KCl at pH 5.5.
C TDS of 2.5 μM ILPRG variants 1G, 2G, 4G and 10G in 10 mM sodium cacodylate, 20 mM KCl at pH 7.0. D CD spectra of 10 μM ILPRG variants 1G, 2G, 4G and 10G in 10 mM sodium cacodylate, 100 mM KCl at pH 7.0. Source data for this figure are provided as a Source Data file.

Two main factors appear to decrease the stability of i-motif structure in these variants: mutation of loop nucleotides from cytosine to guanine (6C-11C, Table 1) or mutation/truncation within the C-tracts (2C, 3C, 5C, 7C and 10C, Table 1). Some sequences are affected by both these factors (2C, 7C and 10C, Table 1) and have some of the lowest pHTs overall (5.2, 5.4 and 5.1, respectively). The C-to-G mutation is critical as it both removes cytosines from the core stack of base pairs and introduces potential competing Watson/Crick complementary nucleotides, which can shift the conformational equilibrium towards hairpin formation. This is further supported by the data acquired for variants 10C and 11C, which have more guanines in the loops; neither sequence gave any transition in the UV melting/annealing experiments at 295 nm (Supplementary Fig. 1D and Supplementary Fig. 1L), but both did at 260 nm (Supplementary Fig. 1H, 1P), indicative of hairpin/duplex formation. TDS also indicated a spectrum inconsistent with i-motif and more consistent with duplex (Fig. 1A and Supplementary Fig. 5), suggesting that these sequences form only hairpins under these experimental conditions. In silico structural calculations of these sequences using M-fold29 also show clear potential for these sequences to fold into hairpins (Supplementary Fig. 6).
Given the vast differences in structures and stability of the C-rich sequences, we examined the complementary G-rich sequences, to determine any complementarity in structure formation.

The G-rich ILPR sequence variants were characterised by CD in 10 mM sodium cacodylate buffer, pH 7.0, with 100 mM of KCl, NaCl or LiCl, to reveal cation preferences typically observed in G-quadruplex forming sequences. Sequences were characterised by UV melting/annealing in analogous buffer except with 20 mM KCl (100 mM concentrations of KCl resulted in Tm values > 95 °C). A summary of the characterisation of the sequences in KCl cation conditions, comprising the melting temperature (Tm), annealing temperature (Ta), QGRS mapper score30, and structural assignment by CD and TDS, is provided in Table 2. Corresponding example data are provided in the supplementary information (Supplementary Figs. 7–9).

The predominant ILPR G-rich variant (1G) gave clear UV melting and annealing traces (Supplementary Fig. 7A) and a Tm of 76 ± 0.6 °C (Table 2). This melting temperature is similar to that measured previously (~78 °C) for the same sequence in Tris buffer31. The TDS of sequence 1G showed several positive peaks at 240, 255 and 270 nm and a negative peak at 295 nm, consistent with G-quadruplex structure27 (Fig. 1C). Variant 1G gave CD spectra with a negative peak at 245 nm, and positive peaks at 263 nm and 295 nm (Fig. 1D). This is in line with previous CD spectroscopy data, showing a mixed population of parallel and antiparallel G-quadruplexes in the presence of KCl and a shift towards antiparallel G4 formation in the presence of weaker stabilising cations NaCl and LiCl (Supplementary Fig. 8A)14,31,32.

Of the other G-rich sequence variants, 4G (Tm = 79 ± 0.6 °C) had similar thermal stability compared to 1G (76 ± 0.6 °C), showing the mutation in G-tract-Loop2 from C to T makes little difference in the stability (Supplementary Fig. 7B, F).
Variant 8G was more stable (80 ± 0.7 °C) than 1G, but does present as a potential mixture of species by CD and TDS (Supplementary Figs. 7K, 8G, and 9). The most stable of all the variants was sequence 6G (82 ± 0.7 °C). 1G, 4G, and 6G were all clearly characterised as G-quadruplexes by CD and TDS, similar to previous studies on these sequences (Supplementary Figs. 8 and 9)22,32. Other sequences formed significantly weaker DNA structures and were thermally less stable than these strong G-quadruplexes. For example, 2G has significantly lower melting (Tm = 56 ± 0 °C compared to the 1G variant with a Tm of 76 ± 0.6 °C) and annealing temperatures (p < 0.0001) and presents as a mixture of G-quadruplex and hairpin-like structures in the TDS (Fig. 1C). CD spectra show a broad weak positive peak at 300 nm and a negative peak at 245 nm (Fig. 1D), in line with formation of a weak antiparallel G-quadruplex, potentially mixed with hairpin/duplex28. The fact that there is a melting transition in the UV at 295 nm (Supplementary Fig. 7I) potentially indicates G-quadruplex DNA structure; however, a negative peak at 295 nm in the TDS may be observed for Z-DNA and Hoogsteen DNA as well as G-quadruplex and i-motif structures28. Interestingly, some of the sequences (5G, 7G, 9G, 10G, and 11G) do not show the UV melting and annealing transitions at 295 nm that would be expected for G-quadruplex structures. However, variants 7G, 9G, 10G, and 11G have clear melting and annealing transitions at 260 nm, consistent with these sequences forming hairpins/duplex-like structures (Supplementary Fig. 7)33. For example, the 10G variant is similar to 2G in the TDS signature (Fig. 1C), consistent with hairpin formation and clearly different to that of the G-quadruplexes formed by 1G, 4G, and 6G (Supplementary Fig. 8). The CD spectrum of 10G shows only a very weak positive signal at 260 nm and a negative signal at 240 nm (Fig. 1D), which is consistent with an unfolded G-rich sequence or a very weak G-quadruplex.
Notably, the hairpin-forming variants lose cation sensitivity in the CD spectra and all have a narrow dip in signal at 215 nm (Supplementary Fig. 8). M-fold predictions show clear potential for these sequences to form hairpins (Supplementary Fig. 10).

These results indicate that only some ILPR variants are capable of forming i-motif and G-quadruplex structures. Comparing the biophysical data with the G-scores from QGRS Mapper30 (Table 2) indicates that QGRS Mapper accurately predicts the most stable G-quadruplex forming sequences 1G, 4G, 6G and 8G (all score 63), but there are two sequences that score nearly as high (9G and 11G, score 62) that do not form G-quadruplexes at all. Moreover, it is important to consider that sequences such as 2G and 10G do not form stable G-quadruplex structures, but score 42, the same score as the G-quadruplex forming sequence from the widely-studied human telomere (TTAGGGTTAGGGTTAGGGTTAGGGTTA), demonstrating that loop nucleotide composition is critical.

The biophysical data for the C-rich and G-rich ILPR variants show that stable i-motifs are not exclusively formed in the complementary sequences of stable G-quadruplexes. Of the native variants, 7/11 of the C-rich sequences form stable i-motif structures, whereas only 3/11 of the G-rich sequences form clear G-quadruplex structures. This indicates that it may be easier to mutate out a G-quadruplex based on the sequence, whereas an i-motif structure is more difficult to eliminate completely.

DNA structures switch insulin reporter gene transcription

From the biophysical data, it was clear that not all native ILPR variants form i-motif or G-quadruplex structures. Importantly, the most common variants (1C and 4C) formed the most stable i-motif structures, and the complementary strands (1G and 4G) also formed stable G-quadruplex structures.
We hypothesised that the DNA sequences forming into i-motifs and G-quadruplexes in the ILPR are potentially binding elements to control transcription of insulin. To test this hypothesis, we compared four ILPR variants using a luciferase-based reporter gene assay, where the entire human insulin promoter up to the start of the ILPR was cloned upstream of the gene encoding firefly luciferase34. The resulting firefly bioluminescence is proportional to the insulin promoter activation. Due to the difficulty in cloning long lengths of the ILPR and the fact that this region of DNA is intrinsically variable between people, we included enough repeat sequences to form one i-motif or G-quadruplex. We chose the most common variants (1C/1G and 4C/4G), which formed both stable i-motif and G-quadruplex structures, and two of the variants that appeared to form hairpin structures in both C-rich and G-rich sequences (2C/2G and 10C/10G).

Functioning β-cells normally secrete insulin in response to increased blood glucose levels as part of blood glucose homoeostasis. There are many cell line models which can be used to assess levels of insulin expression in vitro. These cells retain normal regulation of glucose-induced insulin secretion, allowing the use of glucose as a positive control35. We selected the rat insulinoma-derived cell line INS-1 as a model system due to the lack of an intrinsic ILPR or analogous sequence36,37. The INS-1 cells were co-transfected with one of the ILPR vectors and a reference vector encoding renilla luciferase to allow normalisation of transfection efficiency variability between experiments. After transfection, the cells were starved overnight and were treated with either fresh low (2.8 mM) or high glucose (16.2 mM) medium to determine their respective responsiveness to glucose after four hours.
These high/low glucose treatment conditions are consistent with previous studies measuring responsiveness to glucose34,38,39.

The four ILPR variants showed no significant difference in firefly luciferase expression in the presence of low (2.8 mM) glucose levels (Fig. 2). However, in the presence of high glucose (16.2 mM), there was a significant increase in the expression of luciferase relative to the control for the 1C/G and 4C/G plasmid variants (where the underlying sequences were shown to form stable i-motif and G-quadruplex structures) and no significant change in expression for the 2C/G and 10C/G plasmid variants (characterised to form hairpin-like structures). Specifically, the plasmid containing the 1C/G ILPR variant sequence showed a twofold increase in firefly luciferase expression levels (p < 0.001) in the presence of high glucose compared to low glucose concentrations. The most prevalent ILPR sequence (1C/G) was expected to show glucose responsiveness, and we therefore considered it the positive control for this system. The plasmid containing the 4C/G ILPR variant sequence also responded to the higher glucose level with a 1.7-fold increase in gene expression compared to low glucose levels (p < 0.001). Both example ILPR variants with sequences capable of forming i-motifs and G-quadruplexes responded to changes in glucose levels in a similar fashion, but the increase was significantly higher in the most prevalent ILPR variant (1C/G, p < 0.001). Importantly, reporters encoding the two ILPR variants that did not form i-motif and G-quadruplex DNA structures (2C/G and 10C/G) showed no changes in the presence of high glucose (Fig. 2). These data indicate the potential importance of the different sequence variants in the ILPR, showing the different DNA structures they form may play a role in controlling the responsiveness to glucose.
Although the plasmid experiments do not directly assess transcription in endogenous chromatin, they do imply that only small changes in sequence can give rise to a large difference in the structure formed and also in the relative reporter expression in plasmids.

Fig. 2: Dual luciferase-reporter gene assay for glucose sensitivity in co-transfected INS-1 cells after four hours. Firefly signal is regulated by the human insulin promoter and is corrected to the reference renilla luciferase signal. The firefly to renilla ratio is normalised to luminescence signals in low glucose levels, to represent insulin expression induced by glucose. Relative insulin reporter expression was determined in four different ILPR variants (1C/G, 2C/G, 4C/G or 10C/G), measured in 12 biological repeats (n = 12), each with 2–3 technical repeats and expressed as mean ± SEM. All samples passed the D’Agostino & Pearson test for normal distribution. Statistical analysis was performed using 2-way ANOVA multiple comparisons with Holm-Šidák post hoc test. p < 0.001***, ns > 0.12. Source data for this figure are provided as a Source Data file.

Determination of an intramolecular i-motif crystal structure

Given that small differences between DNA sequences resulted in different structure formation in vitro and potential function in the reporter genes in cellulo, we were interested in the potential interactions within the loops that made certain sequence variants more stable than others. The most biologically relevant DNA structures are intramolecular, i.e., those formed from a single strand, similar to what would form in the context of genomic DNA. However, structural information on intramolecular i-motifs is particularly scant. Although there are intramolecular NMR structures for i-motif (1EL2, 1ELN)40, they are of modified fragments from the telomeres, and these modifications (necessary to enable structure determination by NMR) have been shown to alter the widths of the grooves in the structure41.
There are currently twelve intermolecular i-motif crystal structures formed from two or four separate strands, but no intramolecular topologies. The lack of intramolecular crystal structures appears to be mainly because i-motif loops are highly dynamic and difficult to resolve successfully using crystallographic methods. Intramolecular i-motif crystal structures would provide an opportunity for rational design of compounds to target these structures, and potential for drug development against these interesting biological targets, complementing drug discovery projects targeting G-quadruplex.

With this in mind, we wanted to give the best chance for successful crystallisation, so we trialled the most stable C-rich ILPR variant that formed only i-motif in our biophysical studies: (4C) TATCCCCACACCCCTATCCCCACACCCCTAT. This sequence is the second most prevalent ILPRC variant13 and lacks guanines within the sequence, which reduces the formation of intermolecular species through GC-base-pairing. To increase the chance of successful crystallisation we also designed variants of this sequence with different flanking regions: TCCCCACACCCCTATCCCCACACCCCT (4Ca) and ATCCCCACACCCCTATCCCCACACCCC (4Cb) (Supplementary Table 1). The crystallisations were performed at pH 5.5, below the pHT, where this sequence would be most stable.

Crystals of all three variants were obtained by hanging-drop methods (Supplementary Table 2), with the highest-quality diffraction data acquired with 4C. With the limited availability of i-motif structures, molecular replacement (MR) methods proved challenging. However, anomalous dispersion (AD) methods were successful in structure determination and model validation using both intrinsic and extrinsic scattering elements. Intrinsic phosphorus single-wavelength anomalous dispersion (P-SAD), where phosphorus is an integral part of the DNA backbone, provided validation of the native structural model (Supplementary Fig.
11) while the use of extrinsic bromine, combined with multiple-wavelength anomalous dispersion (Br-MAD) methods, provided anomalous scattering sufficient to generate high-quality maps for model building (Supplementary Tables 3 and 4). In the 4C-Br sequence (Supplementary Table 1), the Br-substitution located within the less flexible CC-core (Cytosine-4) provided a strong anomalous scattering contribution, while scattering for the second bromine, in loop-2 (Adenine-16), was not observed due to the flexibility of this region.

The general use of intrinsic P-anomalous scattering for structure determination of DNA/RNA motifs has proven challenging. The post analysis of our long-wavelength data revealed a limited P-anomalous scattering contribution, with only a few P-peaks of the anomalous difference map (Fanom(calc)) overlapping the modelled positions and others observed only as weak diffuse peaks; this is despite using the lower energies closest to the peak (3.9995 Å, f″ 2.3). The poor P-signal can be partly attributed to the static disorder, to the mobility of the phosphorus atoms and to the low number of unique reflections compared to anomalous scatterers42. We are currently exploring ways to optimise the P-signal for P-SAD applications.

Structural description of an intramolecular i-motif

The crystal structure formed from the ILPRC sequence 4C (8AYG) comprises two independent, inverted intramolecular i-motifs in the asymmetric unit (Fig. 3A, B). Each of these individual i-motifs is formed from four antiparallel strands held together by eight intercalated hemi-protonated cytosine-cytosine base pairs, connected by three loops (Fig. 3). The ACA-loops connect strands at the minor grooves and the middle TAT-loop at the major groove (Fig. 3C, Supplementary Table 5 and Supplementary Fig. 12). The terminal CC-base-pair is at the 3′-end, making each structure a 3′E-topology (Fig. 3B)43.

Fig. 3: Crystal structure, structural features and interactions of the ILPR 4C intramolecular i-motif 8AYG. A 4C structure coloured by nucleotide type (green: C, blue: T, yellow: A, grey: backbone, red: water molecules) and 2Fobs – Fcalc electron density map contoured at 1.5 σ level (grey). B Schematic showing the two 4C intramolecular i-motifs as arranged in the asymmetric unit and the interactions they form. Both fold into a 3′E-topology with the outer CC pair at the 3′–end (red). C Top view of one 4C i-motif and schematic showing the arrangement of the TAT and ACA-loops at the wide and narrow grooves, respectively. D Intramolecular and E Intermolecular interactions formed by the two independent intramolecular 4C i-motif strands B and A present in the asymmetric unit. Each structure and schematic is coloured based on the flank or loop position in the sequence: flank-1 (purple), loop-1 (orange), loop-2 (dark green), loop-3 (light blue), flank-2 (pink), grey (C-core). The nucleotides involved in the hydrogen-bonds shown in the boxes are coloured by atom type. All bond distances are in Å. F Structural comparison of the two i-motifs in the asymmetric unit by overlapping Strand-A (red) and Strand-B (blue). G Crystal of the 4C i-motif.

Apart from the CC-base-pairing, other interactions within each strand include mismatched base pairs like AA and TT, which could contribute to the overall stability of the folded construct (Fig. 3D, Supplementary Table 6). In strand-A, there is an AA base-pair between loop-1 and loop-3: A22 (loop-3) interacts with A10 (loop-1) via two hydrogen-bonds. The topology is further stabilised in loop-2 by T3 from the flanking region, demonstrating the importance of the flanking sequence in stabilising interactions. The flanking T3 also pairs with T17 to form a TT-base-pair that stacks on the terminal CC-base-pair (C14 and C28). T15 also forms a TA-pair (T15 and A16) via one hydrogen-bond, which then stacks on top of the TT-base-pair.
In strand-B, by contrast, the TT-base-pair is sandwiched between the terminal CC-base-pair and an additional TAT-triad consisting of T15 (loop-2), T29 (flank-2) and A8 (loop-1) from a symmetry-related copy of strand-A (Supplementary Fig. 13, Supplementary Table 7). Also, in strand-B the AA base-pair is formed between A22 (loop-3) and A8 rather than with A10 (loop-1) as in strand-A.

Strand-B is similar to strand-A (Fig. 3F), with an RMSD of 2.32 Å (when flanks are excluded, nucleotides 4 to 28). One difference is that A16 is displaced by a symmetry-related adenine, which still stacks on top of the TT-base-pair. Also, A8 displaces A10 in the interaction with A22, which allows A10 to interact with a symmetry-related thymine (Fig. 3D, Supplementary Fig. 14). When only the core is included in the calculation, the RMSD is 1.04 Å, showing the high similarity between the two cores. Differences in the torsion angles and sugar puckers of the two strands, attributed to the phosphate backbone flexibility, are shown in Supplementary Table 8 and Supplementary Fig. 15.

As there are two i-motifs in the asymmetric unit, the structure gives a view of how more than one i-motif may interact with each other like “beads-on-a-string”. There are clear interactions between flank-1 of one strand (A2) and loop-3 (A24) of the other strand (Fig. 3E). Also, there are various π-π-stacking interactions between the outer nucleotides of the TAT-flanks and an A or T in the loops, which highlight the importance of flanks in crystal packing (Supplementary Table 7). Intermolecular TA- and CC-base-pairs further contribute to the crystal packing (Supplementary Figs. 14 and 16). It is important to note that the crystallisation conditions are different to those used in the solution-based experiments, with higher concentrations of DNA and other additives to initiate nucleation and crystallisation.
Potentially, at lower concentrations of DNA, these intermolecular interactions might not be present, whereas more complex higher-order conformations could occur in solution if higher concentrations are used. Nevertheless, the crystal structure has demonstrated the potential for intermolecular interactions between intramolecular i-motifs of the 4C-variant sequence. Given that the ILPR is comprised of tandem repeats, these intermolecular interactions are potentially important when considering how ligands and nuclear proteins may interact with these structures.

No specific hydration pattern was observed at the middle of the CC core, as most of the cytosine hydrogen-bond donors and acceptors are used in the formation of the CC-pairs. Some waters at the major groove were seen hydrogen-bonded with the H of N4 of the cytosine that is not involved in the CC-base-pairing, and we observe bridging with the phosphate O-atoms. This is in agreement with some of the other intermolecular crystal structures published, e.g., 1CN0 (ref. 41), 1BQJ (ref. 41), 8DHC (ref. 44) and 8CXF (ref. 44), but no bifurcated hydrogen-bond to O2 of a cytosine partner was seen. Based on the use of the Fanom(calc) maps (Supplementary Fig. 11), we can more confidently describe these peaks as water molecules and exclude sodium or chloride ions. Although limited by resolution, water molecules observed in the loops could represent potential sites for hydrogen-bond interactions, potentially useful in future ligand design or interactions with proteins. A16, which in strand-A is base-paired with T15 and in strand-B is displaced by a symmetry-related adenine, reveals a potential binding pocket, indicating that this site may also be of interest for targeting with ligands.

Stabilising TT-base-pairs are observed in solution

Given the additional base-pairs within the crystal structure, we were interested in whether these could be observed in solution.
We performed NMR spectroscopy to examine the imino-proton region, which showed a set of peaks between 15.4 and 15.8 ppm, consistent with the presence of hemi-protonated cytosines (Supplementary Fig. 17)15. Additional imino-proton signals at 10.9 and 11.5 ppm are consistent with the presence of TT-base-pairs45. Importantly, there are no signals in the region between 12.5 and 14 ppm, where the imino-proton signals from GC- and AT-base-pairing would be expected45,46. NMR annealing experiments (from 333 to 277 K) revealed the formation of the CC-base-pairs at 319 K, followed by the TT-base-pairs at 312 K (Supplementary Fig. 18). This indicates that the structure in solution is similar to that in the crystal structure, and that the TT-base-pairs are weaker than the CC-base-pairs. A recent study looking at i-motifs using a DNA microarray containing 10,976 genomic i-motif forming sequences found that i-motifs with shorter loops (n = 1–4) had enhanced stability when the sequences had thymine residues directly flanking C-tracts47. The presence of the TT-base-pairs in both the NMR experiments and the crystal structure provides structural evidence for why this is the case.

Enhanced sampling molecular dynamics

To further explore the conformational landscape of i-motifs, we performed enhanced sampling molecular dynamics simulations. Of particular interest to us were the loop regions, which are the major contributor to the dynamics, differentiating i-motifs from each other and from other nucleic acid structures, and the influence of the flanking nucleotides on the dynamics of the loops. Markov state models (MSMs) were built to study the kinetics of conformational transitions in the loop regions (Supplementary Note 1, Supplementary Figs. 19–26 and Supplementary Table 10).
Upon creation of these models, both strands present a free energy landscape consisting of multiple metastable states that also explore the crystallographic conformations.

Given the interactions observed in the crystal structure originating from the flanking sequence, we looked at sequence 4C (TATCCCCACACCCCTATCCCCACACCCCTAT) and also an analogue with one base missing at the 3′-end, 4Cdel (TATCCCCACACCCCTATCCCCACACCCCTA). Our analysis suggests that 4Cdel is far more dynamic, with multiple interactions, compared to 4C. Upon inspection of the structures extracted from the coarse-grained models, 4Cdel featured far more unstructured conformations in loops-1 and -3, while those in 4C seemed fairly ordered. Loop-2 in both sequences was well ordered. This suggests that the slow motions in the i-motif structure arise largely from loop-2 and the flanking regions at the 5′- and 3′-ends, which govern both stabilisation and flexibility. This can be visualised in the dynamics of 4C and 4Cdel. The 4C structure is longer than 4Cdel by one base (T) at the 3′-end. This extra nucleotide leads to significantly more structural ordering via π-stacking interactions within the 3′-end, which then leads to the ordered conformations observed in loop-2. In 4Cdel, by comparison, the additional stacking is not possible, and therefore the interactions with loop-2 produce a greater number of metastable states. Since time-lagged independent components (tICs) are ordered from slowest to fastest in terms of motions, those that provide the most stability will be ranked higher than those that are faster. However, a significant number of motions may not be fully described by the number of dimensions into which the features were reduced.
This is borne out by the unstructured conformations of loops-1 and -3 in these models as opposed to the fairly ordered ones of loop-2 and the terminal regions.

The simulation data support the hypothesis that flanking sequences are important to the stability of i-motif structure, by providing opportunities for additional interactions that reduce conformational dynamics. This is not only important for consideration of sequence designs for in vitro experiments involving i-motifs, but may also play an important role in how small molecules and proteins can interact with i-motif structures, and their consequential effects in biology.

Here, we show that different sequence variants of the ILPR form different DNA structures in vitro and these have different effects on in cellulo insulin reporter expression. Importantly, not all native ILPR variants are capable of forming i-motifs and G-quadruplexes; minor changes in the sequence have been shown to give completely different structures. The crystal structure and dynamics of an intramolecular i-motif reveal that sequences within the loop regions form additional stabilising interactions. These AA-, TT- and AT-base-pairs are critical to the formation of the stable i-motif structures and reveal pockets for rational-based drug design. We also showed the importance of flanking sequence in the crystallisation of i-motif structures, through several intermolecular interactions in the crystal structure and supporting molecular dynamics. The outcomes of this work reveal the detail in the formation of stable i-motif DNA structures, with potential for rational-based drug design for compounds to target i-motifs.
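The residue numbering used in the structural description follows directly from the Flank-(C-tract)-Loop1-(C-tract)-Loop2-(C-tract)-Loop3-(C-tract)-Flank layout of the crystallised sequence. The following Python sketch splits the 4C and 4Cdel sequences into those segments with 1-based positions. It is an illustration only: the function and segment names are hypothetical, and the pattern assumes tracts of exactly four cytosines, which holds for 4C and 4Cdel but not for variants with truncated or mutated tracts.

```python
import re

# Sequences quoted in the text (5'->3').
SEQ_4C = "TATCCCCACACCCCTATCCCCACACCCCTAT"
SEQ_4CDEL = "TATCCCCACACCCCTATCCCCACACCCCTA"

SEGMENT_NAMES = [
    "flank-1", "C-tract-1", "loop-1", "C-tract-2", "loop-2",
    "C-tract-3", "loop-3", "C-tract-4", "flank-2",
]

# Assumption: every tract is exactly CCCC and loops are the shortest
# stretches separating consecutive tracts (lazy matching).
_LAYOUT = re.compile(
    r"^([ACGT]*?)(C{4})([ACGT]+?)(C{4})([ACGT]+?)(C{4})([ACGT]+?)(C{4})([ACGT]*)$"
)


def decompose(seq: str) -> list[tuple[str, int, int, str]]:
    """Return (segment name, first residue, last residue, bases), 1-based."""
    match = _LAYOUT.match(seq)
    if match is None:
        raise ValueError("sequence does not fit the four-tract layout")
    return [
        (name, match.start(i) + 1, match.end(i), match.group(i))
        for i, name in enumerate(SEGMENT_NAMES, start=1)
    ]


for name, first, last, bases in decompose(SEQ_4C):
    print(f"{name:10s} {first:2d}-{last:2d} {bases}")
```

Under these assumptions the output places A8 and A10 in loop-1, T15, A16 and T17 in loop-2, A22 and A24 in loop-3, and T29 in flank-2, matching the assignments given in the structural description. Running `decompose(SEQ_4CDEL)` shows the 3′ flank shortened from TAT to TA, the single-nucleotide difference examined in the simulations.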
