We developed ARTEM 2.0 [31], a new version of the ARTEM algorithm tailored to the nucleic acid tertiary motif search problem. This tool enables automated searches of RNA and DNA structure databases against multiple query modules. ARTEM is equipped with a rich set of user-defined parameters to specify particular 3D structure regions of interest, impose restraints on the identified motifs, and save the superimposed matches or entire nucleic acid-containing input complexes in PDB or mmCIF format. ARTEM advances beyond the limitations of the existing tools by exclusively relying on motif isostericity. It applies to any chosen reference motif, without differentiation between loop motifs and long-range motifs. We exemplified its capabilities by searching for kink-turn-like RNA tertiary motifs. Additionally, we demonstrated ARTEM’s broad applicability by searching for well-known nucleic acid structural motifs, such as G-quadruplex, GNRA tetraloop, and i-motif, as well as a unique RNA motif with four parallel base pairs recently described [32].
Kink-turn motif identification
To search for instances of the kink-turn motif, we selected the two known backbone topology variants (Fig. 1, Table 1): an instance of the kink-turn internal loop from a 23S rRNA [33] (module #1) and an instance of the k-junction variant from a TPP riboswitch [34] (module #2). We executed ARTEM for the two instances against all RNA-containing entries in the Protein Data Bank [35] (see the Methods section for details). The results included 26,564 matches (Additional file 1: Table S1). We annotated the matches with their backbone topology characteristics using urslib2 [8] to facilitate subsequent analysis.
Table 1 Selected instances of the kink-turn motif variants
We benchmarked ARTEM against DSSR [11] using the BGSU representative set of RNA structures [36], which included 2,394 PDB entries (Fig. 2, Additional file 2: Table S2; see the Methods section for details). Within this set, ARTEM identified 255 matches annotated with the canonical internal loop topology, and 412 new topological variants that could not be detected by any other tool. Among the 255 hits, one was confirmed as a false positive: a double-kink internal loop with all canonical base pairs. Although it matched the reference kink-turn at 1.976 Å RMSD, it was manually rejected. In contrast, DSSR identified 292 kink-turns, including 36 false positives. Thus, in identifying canonical kink-turn topologies, ARTEM demonstrated higher precision than DSSR (99.6% vs. 87.7%), while achieving comparable recall (76.7% vs. 77.3%).
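As a worked check, the reported precision values follow directly from the hit counts given above, taking precision as the fraction of reported hits that are true kink-turns. The recall values additionally depend on the total number of true kink-turns in the benchmark set, which is not stated in this paragraph, so they are not re-derived here.

```latex
\text{precision}_{\text{ARTEM}} = \frac{255 - 1}{255} = \frac{254}{255} \approx 99.6\%,
\qquad
\text{precision}_{\text{DSSR}} = \frac{292 - 36}{292} = \frac{256}{292} \approx 87.7\%
```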
Fig. 2 Benchmark of ARTEM and DSSR in identifying kink-turn motifs. The benchmark was performed on 2,394 PDB entries from the BGSU representative set of RNA structures. The UpSet plot was prepared using the UpSetPlot Python library [37]
We analyzed the false negatives of ARTEM and DSSR and attributed them to limitations inherent to their respective approaches. DSSR searched for internal loops with a specific angle between the stems and a tSH G-A base pair. As a result, it missed true instances featuring alternative base pairs and captured false matches lacking the characteristic arrangement of residues at positions L1, −1b, −1n, 1b, 1n, 2b, and 2n. In contrast, ARTEM searched for isosteric hits of the reference motifs matching at least 12 residues at RMSD ≤ 2.0 Å, and it required positions L1, −1b, −1n, 1b, 1n, 2b, and 2n to have a match. Consequently, ARTEM missed instances lacking some of these required positions (most frequently 1b and 2n), and those that were insufficiently isosteric with the two references.
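The ARTEM acceptance criteria described above can be sketched as a simple filter. This is only an illustration: the `Match` record and its field names are hypothetical and do not reflect ARTEM's actual output format. Only the thresholds (at least 12 matched residues, RMSD ≤ 2.0 Å) and the list of required positions are taken from the text.

```python
from dataclasses import dataclass

# Positions that, per the text, must all be matched by an accepted hit.
REQUIRED_POSITIONS = frozenset({"L1", "-1b", "-1n", "1b", "1n", "2b", "2n"})
MIN_SIZE = 12   # minimum number of matched residues
MAX_RMSD = 2.0  # maximum RMSD to the reference, in angstroms


@dataclass
class Match:
    """Hypothetical record of one hit (not ARTEM's real output format)."""
    matched_positions: frozenset  # named kink-turn positions covered by the hit
    size: int                     # number of matched residues
    rmsd: float                   # RMSD to the reference, in angstroms


def passes_filter(match: Match) -> bool:
    """Return True if the hit meets the size, RMSD, and position restraints."""
    return (
        match.size >= MIN_SIZE
        and match.rmsd <= MAX_RMSD
        and REQUIRED_POSITIONS <= match.matched_positions
    )


# A hit covering all required positions plus five others is accepted.
full_hit = Match(REQUIRED_POSITIONS | {"L1.1", "L2", "L3", "3b", "3n"}, 12, 1.976)
# A hit lacking positions 1b and 2n is rejected even with a low RMSD.
partial_hit = Match(REQUIRED_POSITIONS - {"1b", "2n"}, 12, 1.5)

print(passes_filter(full_hit), passes_filter(partial_hit))
```

The second example mirrors the false negatives discussed above: an instance missing positions 1b and 2n fails the filter regardless of how well the rest of it superimposes.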
New backbone topology variants of the kink-turn motif
Via manual inspection of the ARTEM matches with new topologies, we identified instances of two new variants of the kink-turn motif (Table 1): a five-way junction in a 23S rRNA [38] (module #3, Fig. 3) and an external loop in a mitochondrial LSU rRNA fragment [39] (module #4, Additional file 3: Fig. S1A, C). Below, we use the commonly accepted nomenclature of the kink-turn structure [5] (see the Methods section for details).
Fig. 3 A J4/5 five-way k-junction module. (A) A 3D structure of the J4/5 five-way k-junction, LSU rRNA, PDB entry 7P7T, chain A. (B) A 2D interaction scheme of the junction: the canonical stem (C-stem) in gold, the non-canonical stem (NC-stem) in green and gray, the residues of the kink in purple, and the other stems in teal. The base pair representation follows the Leontis-Westhof (LW) classification [15]. The backbone connectivity is depicted with arrows. 5′- and 3′-ends are shown accordingly. Each base is marked with its number in the chain (in bold, left) and its named position in the kink-turn (right). The 3D representation was prepared with ChimeraX [29], with key residues labeled and prominent key hydrogen bonds indicated with gray dashed lines. The 2D representation was prepared with Inkscape [30]
We mapped the identified five-way junction (module #3) to the J4/5 region of the 23S rRNA structure. The J4/5 region was previously reported to form a k-junction [19], but it was incorrectly identified as a three-way junction. The interaction network of the J4/5 region is closer to the reference module #1, with a tSH G-A base pair in position 3n-3b, but it has a non-linear matching of residues 1n and 2n, resulting in a tHH 2n-2b base pair.
The external loop (module #4) was mapped to the J99/101 region of the 23S rRNA, which had not been previously reported as a kink-turn or a k-junction. The J99/101 region forms a canonical three-way k-junction architecture in non-mitochondrial 23S rRNAs (module #9, Additional file 3: Fig. S1B, D). Its interaction network is very close to that of the reference module #2, with one substantial difference: the 2n-2b base pair is a canonical cWW base pair, which is incompatible with the tSW interaction formed between residues −1n and 2b in the reference modules.
No-kink variants of the kink-turn motif
We observed matches of several unique backbone topologies that lacked the kink while exhibiting the C-stem/NC-stem arrangement isosteric to that of the kink variants (modules #5-#8, Table 1, Fig. 4, Additional file 3: Fig. S2). Three of these modules (#5-#7) were of different junction loop architectures, all of which can be classified as A-minor junctions, whose resemblance to kink-turns was reported previously [18]. In contrast, module #8 is not a loop but a helix-helix interface, showcasing the long-range kink-turn-like motif for the first time.
Fig. 4 Selected no-kink variants #7 and #8. (A) A 3D structure of the four-way A-minor junction, group II intron, PDB entry 6ME0, chain A. (B) A 3D structure of the long-range helix-helix interface, 28S rRNA, PDB entry 7PWO, chain 1 (residue 2648 is missing in the coordinate file). The interaction schemes of (C) the four-way junction and (D) the long-range motif: the canonical stem (C-stem) in gold, the non-canonical stem (NC-stem) in green and gray, the residues matching the kink in purple, and the other residues in teal. The base pair representation follows the Leontis-Westhof (LW) classification [15]. The backbone connectivity is depicted with arrows. 5′- and 3′-ends are shown accordingly. Each base is marked with its number in the chain (in bold, left) and its named position in the kink-turn (right). The 3D representations were prepared with ChimeraX [29], with key residues labeled and prominent key hydrogen bonds indicated with gray dashed lines. The 2D representations were prepared with Inkscape [30]
The no-kink three-way junction (module #5, Additional file 3: Fig. S2A, C) and the no-kink five-way junction (module #6, Additional file 3: Fig. S2B, D) both have a pyrimidine in position 1b instead of the commonly found guanosine. While this is the only notable difference for module #6, module #5 also features a pyrimidine in position 2n and a non-linear matching of residues 1b and 2n, with 1b belonging to the non-bulge strand.
The no-kink four-way junction (module #7, Fig. 4A, C) is the closest module to the reference variants in terms of interactions, featuring the same geometric types of the four core base pairs: tSW L1-1n, tSW −1n-2b, tHS 1n-1b, and tHS 2b-2n. The long-range motif (module #8, Fig. 4B, D) is the only match that features a non-adenosine (guanosine) 1n residue. It also includes a uridine in syn conformation at position 2b that does not interact with the −1n residue. Notably, there is a striking 3D structure similarity between modules #7 and #8 (Fig. 4A,B), with ARTEM reporting a 17-residue match at 1.6 Å RMSD between the two modules. Their four RNA strands are arranged similarly but form canonical base pairs with different partners (see the top two base pairs in Fig. 4), resulting in distinct topologies: a four-way junction and a long-range helix-helix interface.
The identified kink/no-kink switch
We discovered that the J94/99 region of 23S rRNA, another ribosomal k-junction identified previously in H. marismortui [19] (module #10, Fig. 5A, C), adopts the no-kink junction topology in other bacteria (module #11, Fig. 5B, D) with a slight distortion of the core kink-turn interactions (Fig. 5D).
Fig. 5 Selected J94/99 region modules. (A) A 3D structure of the J94/99 k-junction in H. marismortui, 23S rRNA, PDB entry 3CC2, chain 0. (B) A 3D structure of the J94/99 no-kink variant, LSU rRNA, PDB entry 4V8O, chain BA. The interaction schemes of (C) the J94/99 k-junction and (D) the no-kink variant: the canonical stem (C-stem) in gold, the non-canonical stem (NC-stem) in green and gray, the residues of the kink in purple, and the other residues in teal. The base pair representation follows the Leontis-Westhof (LW) classification [15]. The backbone connectivity is depicted with arrows. 5′- and 3′-ends are shown accordingly. Each base is marked with its number in the chain (in bold, left) and its named position in the kink-turn (right). The 3D representations were prepared with ChimeraX [29], with key residues labeled and prominent key hydrogen bonds indicated with gray dashed lines. The 2D representations were prepared with Inkscape [30]
In module #11, Helix 98, absent from the H. marismortui 23S rRNA [40], is inserted between residues L1.1 and L2 that form the kink in module #10. The 1n adenosine in the no-kink variant is pushed out of its position by residue 1b and interacts with residue L1.1 instead of L1 (Fig. 5D). Also, both modules #10 and #11 formally belong to complex pseudoknotted junction loops, which may be considered yet another backbone topology variant.
Conservation analysis of ribosomal k-junctions
We analyzed the sequence conservation of the three k-junctions identified in 23S rRNAs (Additional file 3: Fig. S3). The conservation patterns differ substantially between the junction regions, with no conserved positions (defined as having an information content > 1.5 bits for a single base) shared among all three motifs. The J4/5 pattern shares three conserved positions with J94/99 (−1b = C, −1n = G, and 2b = A) and four other conserved positions with J99/101 (1n = A, 1b = G, 2n = A, and 3b = A), while the sets of conserved positions in J94/99 and J99/101 show no overlap.
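To make the conservation criterion concrete, the sketch below computes a per-base information content for a single alignment column and applies the 1.5-bit cutoff. The text does not specify how the information content was calculated, so this assumes the common sequence-logo convention (column information R = log2(4) − H, with each base contributing its frequency times R) and ignores gaps and the small-sample correction. The function names are illustrative only.

```python
import math
from collections import Counter


def per_base_information(column: str) -> dict:
    """Per-base information (bits) for one alignment column of A/C/G/U.

    Assumes the sequence-logo convention: R = log2(4) - H for the column,
    and each base contributes f_b * R. Gaps and any other symbols are skipped.
    """
    bases = [b for b in column.upper() if b in "ACGU"]
    if not bases:
        return {}
    total = len(bases)
    freqs = {b: n / total for b, n in Counter(bases).items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    column_info = 2.0 - entropy  # log2(4) = 2 bits for a four-letter alphabet
    return {b: f * column_info for b, f in freqs.items()}


def is_conserved(column: str, threshold: float = 1.5) -> bool:
    """True if any single base exceeds the information threshold (in bits)."""
    return any(bits > threshold for bits in per_base_information(column).values())


# An invariant column is conserved; a 9:1 column falls just below 1.5 bits.
print(is_conserved("AAAAAAAAAA"), is_conserved("AAAAAAAAAG"))
```

Under this convention the cutoff is strict: a base present in 90% of sequences contributes about 1.38 bits and would not count as conserved, whereas one present in 95% contributes about 1.63 bits.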
The J4/5 region exhibits the largest number of conserved residues (10), including the tHS A-G base pair in position 3b-3n, which is also characteristic of the canonical Kt7 kink-turn (module #1). The 1n residue in the J94/99 region is the least conserved 1n residue among the junctions, consistent with its distorted position in the presence of Helix 98 (module #11). Surprisingly, the 2b-2n cWW U-A base pair is highly conserved in the J99/101 region, while the J99/101 C-stem base pairs are the least conserved among the junctions.
The coordination loop in group II introns is a kink-turn
While manually inspecting the kink-turn matches obtained with ARTEM, we identified a kink-turn of the canonical internal loop architecture in a 3D structure of a group IIC intron [41] (Fig. 6A). To the best of our knowledge, occurrences of kink-turns in group II intron structures have not been reported previously. To close this gap, we selected a set of seven group II introns from the representative set of RNA structures [36] and surveyed them for kink-turns, both manually and using ARTEM, DSSR [11], and RNAMotifScanX [14] (see the Methods section for details).
Fig. 6 Group IIC intron coordination loop. (A) A 3D structure of the coordination loop kink-turn, group IIC intron, PDB entry 6ME0, chain A. (B) The interaction scheme of the kink-turn: the canonical stem (C-stem) in gold, the non-canonical stem (NC-stem) in green and gray, the residues of the kink in purple, and the other residues in teal. The base pair representation follows the Leontis-Westhof (LW) classification [15]. The backbone connectivity is depicted with arrows. 5′- and 3′-ends are shown accordingly. Each base is marked with its number in the chain (in bold, left) and its named position in the kink-turn (right). The tertiary interactions (Greek letters) and the intron regions are marked according to the conventional naming [41, 42]. (C) ARTEM benchmarking against other tools that identified at least one of the seven kink-turns in the seven representative group II intron structures. The 3D representation was prepared with ChimeraX [29], with key residues labeled and prominent key hydrogen bonds indicated with gray dashed lines. The 2D representation was prepared with Inkscape [30]. The bar chart was prepared with Matplotlib [43]
We identified seven kink-turn modules in the seven group II intron structures (Additional file 3: Table S3). Five of the seven kink-turn modules were found in the functional intron region of domain I (DI) known as the coordination loop [42], present in the structures of group IIB and group IIC introns. As implied by its name, the loop coordinates the formation of the intron’s catalytic core by interacting with the Κ extension bulge and the exon binding site 1 (EBS1) loop [42]. Additionally, the coordination loop encompasses the single-residue exon binding site 3 (EBS3). Subsequently, exon–intron recognition is facilitated through interactions with the intron binding sites IBS1 and IBS3 (Fig. 6B). The remaining two kink-turns were identified as instances of a single Domain III (DIII) kink-turn module in group IIC introns.
The kink-turn architecture is crucial for the functioning of the coordination loop (Additional file 3: Table S3). The residues L1, L1.1, and L2 interact with the Κ extension. The L3 residue forms a base wedged element (BWE) [44] with a Domain IV (DIV) loop prior to DNA integration [41]. In the group IIC intron structures, the residues L2, L3, 1b, and EBS3 are exposed to the reverse transcriptase protein. Furthermore, the unusual long-range δ′-δ base pair in position 3b-3n facilitates the stacking between EBS3 and EBS1. Notably, the two representative group IIA intron structures lack EBS3, and their coordination loops do not form a kink-turn.
Surprisingly, the coordination loop had never previously been identified as a kink-turn. Possibly, the long-range δ′-δ base pair replacing the local 3b-3n base pair and the looped-out EBS3 residue hid the motif from identification via sequence- and backbone-restrained kink-turn searches. DSSR failed to identify any of the coordination loop kink-turns and found only the two DIII kink-turn modules. RNAMotifScanX, searching with the consensus kink-turn interaction network, identified the two DIII kink-turns and a single coordination loop kink-turn. In contrast, ARTEM, searching with a single Kt7 reference module, identified four of the five coordination loop kink-turns with the 12-nt threshold on the match size. With the 11-nt threshold, ARTEM identified all five coordination loop kink-turns and one of the DIII kink-turns, along with eight false positive matches (matches of non-canonical backbone topologies). Consequently, ARTEM outperformed the existing tools, demonstrating a recall of 57% at 100% precision with the 12-nt threshold, and 86% recall at 43% precision with the 11-nt threshold (Fig. 6C, Additional file 4: Table S4).
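The precision and recall figures quoted for the two match-size thresholds can be reproduced from the counts in this paragraph. The short sketch below does that arithmetic. The helper name is illustrative, and the zero false-positive count at the 12-nt threshold is inferred from the reported 100% precision rather than stated as a count.

```python
def precision_recall(true_pos: int, false_pos: int, total_true: int):
    """Return (precision, recall) as fractions between 0 and 1."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / total_true
    return precision, recall


# Seven kink-turns in total: five coordination-loop and two DIII modules.
TOTAL_KINK_TURNS = 7

# 12-nt threshold: four coordination-loop kink-turns, no false positives
# (inferred from the reported 100% precision).
p12, r12 = precision_recall(true_pos=4, false_pos=0, total_true=TOTAL_KINK_TURNS)

# 11-nt threshold: five coordination-loop kink-turns plus one DIII kink-turn,
# along with eight false positive matches.
p11, r11 = precision_recall(true_pos=6, false_pos=8, total_true=TOTAL_KINK_TURNS)

print(f"12 nt: precision {p12:.0%}, recall {r12:.0%}")
print(f"11 nt: precision {p11:.0%}, recall {r11:.0%}")
```

Relaxing the threshold by a single nucleotide recovers two more true kink-turns but admits eight false positives, which is the trade-off shown in Fig. 6C.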
ARTEM can detect various types of RNA and DNA tertiary motifs
To demonstrate the broad applicability of ARTEM, we also used it to search for four types of nucleic acid tertiary motifs across all RNA- and DNA-containing PDB entries. This search against complete but redundant datasets aimed to verify the characteristic features of the motifs and to define optimal sequence and RMSD restraints for ARTEM use. Three of the four motifs are well-known: the parallel G-tetrad [45] (Additional file 3: Fig. S4A), GNRA tetraloop [46] (Additional file 3: Fig. S5), and i-motif [47] (Additional file 3: Fig. S6). The fourth is an unusual motif of four base pairs formed between parallel-oriented strands (Additional file 3: Fig. S7), recently identified in place of a predicted pseudoknot in the crystal structure of cap-independent translation enhancers (CITE) from Pea enation mosaic virus RNA 2 (PEMV2) [32]. We refer to this motif as the parallel-pairing motif.
Distinct RMSD distributions of all-guanine and non-all-guanine matches (Additional file 3: Fig. S4B, C) confirmed the high specificity of the parallel G-tetrad motif to guanosine bases, with only a few non-guanine matches identified with RMSD < 1.0 Å, such as an all-adenine tetrad (Additional file 3: Fig. S4D). In contrast, at RMSD from 1.0 to ~1.75 Å, we observed a notable portion of false all-guanine matches that spanned residues from several adjacent non-parallel tetrads (Additional file 3: Fig. S4E, F). Therefore, an RMSD threshold of 1.0 Å is recommended when searching for parallel tetrads with ARTEM, with additional sequence restraints if necessary. Overall, ARTEM identified tetrads with RMSD under 1.0 Å in 418 PDB entries, 417 of which are registered in the DSSR-G4DB database [48] as containing G-quadruplexes, while the remaining entry (PDB 5M1L [49]) involves non-canonical GAGA-tetrads.
Unexpectedly, among GNRA tetraloop matches identified at RMSD < 1.0 Å, we did not observe a clear separation between the RMSD distributions of GNRA and non-GNRA matches (Additional file 3: Fig. S5B). Among these matches, 6.8% were GAAG tetraloops, and 4.7% were UAAC tetraloops, which adopted the same fold but did not conform to the GNRA pattern. To account for redundancy, we analyzed non-coding RNA families containing these matches. The GAAG loop was found exclusively in ribosomal RNAs (both SSU and LSU, across seven RNA families) and was further stabilized either by RNA–protein interactions (LSU) or by a base-phosphate interaction (SSU, see Additional file 3: Fig. S5D). The UAAC loop was identified only in a lariat capping ribozyme and in bacterial SSU rRNA, where it was stabilized by a distant adenosine base in both cases (Additional file 3: Fig. S5E). As expected, non-GNRA loops require additional stabilization to form a GNRA-like fold. Interestingly, ARTEM identified a single GNRA tetraloop match formed by a DNA chain, indicating that this motif is not exclusive to RNA (Additional file 3: Fig. S5C). Overall, an RMSD threshold of 1.5 Å is recommended when searching for GNRA tetraloops with ARTEM, as matches with distorted folds appear above this threshold (Additional file 3: Fig. S5F). Out of 1,304 GNRA tetraloop matches identified by ARTEM at RMSD under 1.5 Å in the BGSU representative set, 1,140 (87.4%) were also identified by DSSR [11] or the rna_motif utility of the Rosetta package [50], while the remaining 164 matches were manually confirmed to adopt a GNRA-like conformation, with some hydrogen bonds exceeding the common thresholds (Additional file 3: Fig. S8).
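The distinction between GNRA and non-GNRA loop sequences, combined with the recommended RMSD cutoff, can be illustrated with a short post-filtering sketch. It assumes the standard IUPAC reading of the pattern (N = any nucleotide, R = A or G) and accepts T at the N position so that DNA loops can also be tested. The function names and the idea of filtering a list of (sequence, RMSD) pairs are illustrative and do not describe ARTEM's own interface.

```python
import re

# GNRA under IUPAC codes: G, any nucleotide, a purine (A or G), then A.
# T is allowed at the N position so that DNA loops can be checked as well.
GNRA_PATTERN = re.compile(r"G[ACGUT][AG]A")

# RMSD cutoff recommended in the text for GNRA tetraloop searches (angstroms).
GNRA_MAX_RMSD = 1.5


def is_gnra(loop: str) -> bool:
    """True if a four-residue loop sequence conforms to the GNRA pattern."""
    return GNRA_PATTERN.fullmatch(loop.upper()) is not None


def keep_match(loop: str, rmsd: float, require_gnra: bool = False) -> bool:
    """Keep a match below the RMSD cutoff, optionally enforcing the sequence."""
    if rmsd >= GNRA_MAX_RMSD:
        return False
    return is_gnra(loop) or not require_gnra


# GAAA and GCGA conform; GAAG and UAAC adopt the fold but not the pattern.
for loop in ("GAAA", "GCGA", "GAAG", "UAAC"):
    print(loop, is_gnra(loop))
```

With `require_gnra` left off, structurally similar loops such as GAAG and UAAC are retained, which reflects the purely geometry-based matching discussed above. Turning it on adds a sequence restraint on top of the RMSD cutoff.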
Among RNA-containing PDB entries, ARTEM identified only 12 matches for the reference RNA i-motif instance at RMSD < 2.0 Å, all being all-cytidine matches at RMSD < 0.35 Å from the same PDB entry as the reference (Additional file 3: Fig. S6C). Thus, PDB entry 1I9K is the only entry containing an RNA i-motif, confirming that i-motifs are generally unfavorable in RNA structures. Among DNA-containing PDB entries, ARTEM identified 144 all-cytidine i-motif matches at RMSD < 1.0 Å, one match involving thymine-thymine base pairs at RMSD = 1.004 Å, and four matches involving adenosine-adenosine base pairs at RMSD = 1.24 Å (Additional file 3: Fig. S6B, D, E). Overall, ARTEM identified i-motif matches with RMSD under 2.0 Å in 18 PDB entries, 17 of which were also annotated as i-motif-containing by DSSR, while the remaining entry (PDB 1C11 [51]) was flagged by DSSR as containing a 10nt segment of an i-motif, below the tool’s 12nt size threshold.
Across RNA- and DNA-containing PDB entries, ARTEM did not identify a single match for the parallel-pairing motif at RMSD < 1.0 Å. The closest match was found at RMSD = 1.679 Å and did not exhibit the parallel-pairing pattern characteristic of the reference motif (Additional file 3: Fig. S7D, E, F). Therefore, this search confirms the uniqueness of the parallel-pairing motif described in the CITE RNA from PEMV2 [32]. Additionally, we searched for the PEMV2 parallel-pairing motif in the saguaro cactus virus (SCV) CITE RNA structure, released after we constructed our search datasets (PDB entry 8T29 [52]). In this structure, the motif lacks one residue, causing ARTEM to identify only a seven-residue match at 1.19 Å RMSD (Additional file 3: Fig. S7G, H, I). These analyses demonstrate ARTEM’s unique suitability for identifying similarities between nucleic acid structural motifs, regardless of the nucleic acid type, size, interaction network, or backbone connectivity.