All mouse ES cell lines, such as IDG3.2 (Institute of Stem Cell Research, Helmholtz Zentrum München) and V6.5 (NBP1-41162, Novus Biologicals), were maintained in 1:1 neurobasal (21103049) DMEM (11320033) containing N2 (17502048) and B27 (17504044) supplements, 1% Glutamax (35050), 1% non-essential amino acids (1140050) and 0.1 mM 2-mercaptoethanol (31350-010) (all Thermo Fisher Scientific), as well as 1,000 U per ml LIF (ESGRO ESG1107, Merck), with the additional use of small-molecule inhibitors. For the 2iLIF condition, 1 mM MEK inhibitor PD0325901 (1408, Axon Medchem) and 3 mM glycogen synthase kinase 3 (GSK3) inhibitor CHIR99021 (4953/50 Tocris) were added for >2 days. Cells were passaged using Stempro-Accutase (A1110501, Thermo Fisher Scientific). Cells were grown on gelatin-coated plates that were prepared by first incubating the wells with 0.1% gelatin (EmbryoMax ES-006-B, Merck) and 1% FBS (EmbryoMax ES Cell Qualified FBS, ES-009-B, Merck) for 1 h at 37 °C followed by a rinse with PBS.
To transit from the ES cell to MEK-activated and Wnt-inhibited (LFI) or MEK–Wnt-activated state, Accutased ES cells were seeded onto gelatin-coated and FBS-coated plates in N2B27 medium supplemented with 1,000 U per ml LIF (ESGRO ESG1107, Merck), 12 ng ml−1 bFGF (100-18B, Peprotech) and either 2 μM inhibitor of Wnt production 2 (IWP2; 3533, Tocris) for LFI or 3 mM GSK3 inhibitor CHIR99021 (4953/50 Tocris) for LFW cells. To transit from naive ES to priming ES cells, Accutased naive ES cells were seeded onto gelatin-coated and FBS-coated plates in N2B27 medium supplemented with recombinant 20 ng ml−1 activin A (338-AC-050, R&D Systems) and 12 ng ml−1 bFGF (100-18B, Peprotech) for 12 h or as indicated in the figure legends (12, 24, 36 or 48 h). To determine developmental genes that were downregulated during naive-to-primed transition, publicly available data were used2. Epiblast-derived stem cells (EpiSCs) were cultured on gelatin-coated and FCS-coated plates in N2B27 medium supplemented with recombinant 12 ng ml−1 bFGF (100-18B, Peprotech), 20 ng ml−1 activin A (338-AC-050, R&D Systems) and 2 μM IWP2 (3533, Tocris) to suppress spontaneous differentiation. EpiSCs were passaged 1:4–1:10 every 3 days by triturating the colonies into small clumps using 0.5 mg ml−1 collagenase IV (Sigma). MEK–Wnt-inhibited RS cells were cultivated with 1,000 U per ml LIF, 2 μM IWP2 and 1 μM PD0325901 (manufacturers as mentioned above) and were passaged as ES cells. All cell lines in culture were tested for Mycoplasma contamination every 2–3 months.
Generation of CRISPR–Cas9 genome engineering mouse ES cell linesTo generate Lin28a-KO cells, we used two independent targeting strategies allowing us to verify the effect of LIN28A KO independent of the use of individual guide RNAs (gRNAs). The first strategy entailed the use of two gRNAs with target sites flanking the second Lin28a exon to excise it on both alleles. gRNA oligos were cloned into the SpCas9-T2A-PuroR/gRNA vector (px459) by cut ligation (gRNA sequences ACCCGTCTGGGAGATCCGG and AGGCTGGGAGTTCCGAGGTA). ES cells were transfected with an equimolar amount of each gRNA vector. Then, 2 days after transfection, cells were plated at clonal density and subjected to a transient puromycin selection (1 μg ml−1) (P8833, Sigma-Aldrich) for 48 h. Clones were picked 5–6 days later, expanded in 96-well plates and PCR-screened to identify clones with alleles harboring deletions. PCR primers (CAGCCAGGACAGGTTTCTTC and GCAGTTACTACAAACTCAAAACTG) were used to identify clones in which the Lin28a second exon was removed. Removal of the Lin28a second exon was confirmed with Sanger sequencing and loss of LIN28A expression was assessed by western blot. For CRISPR–Cas gene editing, all transfections were performed using Lipofectamine 3000 (Thermo Fisher Scientific) according to the manufacturer’s instructions.
Generation of a stable inducible LIN28A-expressing mouse ES cell lineTo generate the inducible PiggyBac system, the LIN28A–eGFP, FLAG–LIN28A WT and FLAG–LIN28A S200A constructs were synthesized as gBlocks (Integrated DNA Technologies) and inserted into the enhanced PiggyBac (ePB) vector for stable integration49. This plasmid contains a TET-on system for inducible transgene expression. Mouse ES cells were cotransfected with 4 μg of transposable vector and 1 μg of the piggyBac transposase using Lipofectamine 3000 (Thermo Fisher Scientific) according to the manufacturer’s instructions. Selection in 5 μg ml−1 blasticidin was initiated 3 days after transfection and maintained until resistant colonies became visible. For LIN28A–eGFP PiggyBac lines, individual resistant inducible clones were selected, expanded and checked for LIN28A–GFP expression, whereas, for expression of the FLAG–LIN28A WT and FLAG–LIN28A S200A transgenes evaluated with western blot, two verified clones were used for studies including QuantSeq (Fig. 2e). The LIN28A transgene was induced by adding 1 μg ml−1 Dox (Thermo Fisher Scientific).
Cell culture immunofluorescenceCells were grown in Geltrex-coated (Thermo Fisher Scientific, A1569601) eight-well chamber slides (80826, Ibidi) and fixed with 4% PFA–DPBS solution (Thermo Fisher Scientific; 16% formaldehyde (w/v), methanol-free, 28906) for 15 min at room temperature (RT) and permeabilized using 0.3% Triton X-100–DPBS solution for 15 min at RT. Primary and secondary antibodies were diluted as per the manufacturer’s recommended concentrations in 10% FBS and 0.3% Triton X-100–DPBS and incubated for 1 h at RT using the following antibodies: anti-Tbx3 (sc-17871, Santa Cruz), anti-LIN28A (A177, Cell Signaling and ab63740, Abcam), anti-Klf4 (AF3158, R&D Systems), anti-Nanog (8822, Cell Signaling), anti-Sox2 (2748S, Cell Signaling), anti-KLF2 (Mab5466, R&D Systems), donkey anti-rabbit immunoglobulin G (IgG) 555 (A31572, Invitrogen), donkey anti-goat IgG 488 (A11055, Invitrogen) and donkey anti-goat IgG 633 (Invitrogen, A21082). The samples were washed with DAPI (50 μg ml−1) solution and imaged using a Nikon Ti2 spinning disk confocal microscope with the parameters described in the next section.
Super-resolution cell imagingSteady-state super-resolution imaging of live ES cells and immunofluorescence images were acquired using an immunoOlympus IX83 microscope equipped with a VT-iSIM super-resolution imaging system (Visitech International) and Prime BSI Express scientific complementary metal–oxide–semiconductor camera (Teledyne Photometrics) with an Olympus ×150 (1.45 numerical aperture) TIRF Apochromat oil immersion objective. The imaging system was equipped with a motorized stage with piezo Z (ASI) and environmental chamber maintained cells at 37 °C with 5% CO2 (Okolab). Images were acquired with the 405-nm, 488-nm, 561-nm and 640-nm laser lines, generating an image size of 1,541 × 1,066 with 1 × 1 binning and a pixel size of 43 nm. The microscope was controlled with Micro-Manager version 2.0 gamma software50. ER-Tracker (E12353, Thermo Fisher Scientific) was used for staining of the cytoplasm in live cells. Deconvolution was performed using Huygens (Scientific Volume Imaging).
Flow cytometryFor flow cytometry experiments, single-cell suspensions were made using Accutase (A6964, Sigma-Aldrich) for 5 min in 37 °C or enzyme-free cell dissociation buffer (13151014, Gibco) for 30 min at 37 °C, washed with 5% FBS (EmbryoMax ES Cell Qualified FBS, ES-009-B, Merck) in PBS and incubated with fluorophore-conjugated antibodies against anti-CD117 (c-Kit) antibody (105820, Biolegen) or SSEA4 (MC813-70, Thermo Fisher Scientific) for 30–60 min on ice. Cells were centrifuged and resuspended. Then, 10,000 cells were analyzed using an LSR Fortessa cytometer (BD Biosciences, BD FACSDiva Version 9.2.). Cell debris were excluded by forward and side scatter gating. FlowJo was used for data analysis.
Western blottingCells were trypsinized and lysed using radioimmunoprecipitation assay (RIPA) buffer containing phosphatase (Sigma-Aldrich, 4906837001) and protease (Merck, 539134) inhibitors. After the addition of 4× LDS loading buffer (Thermo Fisher Scientific, NP0007) with 2-mercaptoethanol (Sigma-Aldrich, M3148), samples were heated to 95 °C for 5 min. Samples were run on NuPAGE 4–12% with Bis-Tris (Thermo Fisher Scientific, NP0321PK2) and blotted using the Power Blotter XL System (Thermo Fisher Scientific, PB0013). Following three 5-min washes with PBS-T, membranes were blocked with 5% milk powder (T145.1, Carl Roth) in TBS-T. Membranes were then incubated overnight at 4 °C with 5% milk powder in PBS-T containing the primary antibodies at the following dilutions: 1:1,000 anti-LIN28A (A177, Cell Signaling and AF3757, R&D Systems); 1:4,000 anti-H3 (Abcam, ab1791), 1:2,000 anti-glyceraldehyde 3-phosphate dehydrogenase (anti-GAPDH; Cell Signaling, 2118S) and 1:1,000 Klf4 (AF3158, R&D Systems). After three 5-min PBS-T washes, the membrane was incubated with horseradish peroxidase-conjugated goat anti-rabbit IgM (sc-2030, Santa Cruz) or goat anti-mouse IgG (Dianova, 115-035-003) in 5% milk powder in PBS-T. Following four 10-min washes with PBS-T, the membrane was incubated for 1 min with Clarity Western enhanced chemiluminescence substrate (170-5060, Bio-Rad Laboratories) and imaged with an Amersham Imager 600 (GE, 29-0834-61).
Cell fractionationA total of ~107 cells were resuspended in 500 μl of buffer S (10 mM HEPES pH 7.9, 10 mM KCl, 1.5 mM MgCl2, 0.34 M sucrose, 10% glycerol, 0.1% Triton X-100, 1 mM DTT and proteinase inhibitor (04693116001, Roche)). The cells were incubated for 5 min on ice. Nuclei were collected into a pellet by low-speed centrifugation (4 min, 1,300g) at 4 °C. The supernatant containing the cytoplasm fraction was further clarified by high-speed centrifugation (15 min, 20,000g, 4 °C) to remove cell debris and insoluble aggregates. To further purify the nuclei, they were washed three times in buffer S and spun down (1,700g for 5 min) at 4 °C to get a soluble nuclear fraction.
RNA preparation, reverse transcription–qPCR assays and RNA sequencingTotal RNA was prepared from cell pellets using an miRNeasy Micro Kit (Qiagen, 217084) according to the manufacturer’s instructions. The miR let-7 levels were analyzed using the miScript protocol (Qiagen, 218193) using primers amplifying miR let-7 (Qiagen, MS00005866; let-7g Qiagen, MS00010983). For QuantSeq 3′ mRNA sequencing (mRNA-seq), 0.5 μg of total RNA was used per library prepared using the Lexogen QuantSeq-FWD kit (Lexogen GmbH, 0.15) according to the manufacturer’s instructions. All libraries were evaluated on an Agilent 2200 TapeStation using the High-Sensitivity D1000 ScreenTape (Agilent, 5067-5585) and DNA concentration was measured using a Promega Quantifluor double-stranded DNA system (Promega, E2670). Samples were sequenced using HiSeq4000 to generate 100-nt single-end reads.
Thiol-linked alkylation for the metabolic sequencing of RNA (SLAMseq) was generated according to the published protocol8. Naive or priming ES cells were treated with 100 μM 4-thiouridine (s4U; Sigma, T4509-25MG) for 24 h of pulse labeling. During the last 15 h of pulse labeling, fresh s4U-containing medium was exchanged every 3 h to enhance s4U incorporation into the RNA species. Before initiation of the chase labeling with 10 mM uridine-containing (Sigma, U3750-25G) growth medium (100× excess of uridine compared with initial s4U concentrations), the medium was removed from the cells, which were washed twice with chase-labeling medium. The chase labeling was initiated upon washing off the pulse-labeling medium and, at the indicated time points, the medium was removed and cells were lysed directly in Qiazol (Qiagen, 79306), followed by RNA preparation.
QuantSeq 3′ mRNA-seq processing and analysisReads obtained with QuantSeq 3′ mRNA-seq were quantified using Salmon51 with a transcriptome built from GENCODE M22 transcripts. To monitor the expression level of the Lin28a–GFP transgene, the Lin28a–GFP sequence was included with the decoy sequences derived from the whole genome sequence52. The command used to quantify RNA expression with Salmon was ‘salmon quant -i -l A -r --threads --validateMappings --gcBias --seqBias -o ’.
Next, differential expression analysis was performed using DESeq2 (ref. 53), with effect size shrinkage implemented using the apeglm package54. To find genes that were differentially regulated upon Dox induction of LIN28A–GFP (Fig. 1g,h), we performed differential expression analysis comparing iLIN28A–GFP inductions following 6 h of Dox induction using Dox-inducible iLIN28A–GFP PS cells built in the LIN28A-KO background with and without 6 h of Dox induction. For iLIN28A–GFP overexpression analysis, the developmentally downregulated genes were defined as genes at least twofold downregulation at the 24-h and 48-h time points in previously published data2 (n = 931) and a list of naive pluripotency genes was derived from the same study.
To find genes that were differentially regulated upon phosphorylation of LIN28A, we performed differential expression analysis comparing LIN28A-KO cells, cultured in the presence of FGF2, with or without the induction of LIN28A WT (Fig. 2e, left) or LIN28A S200A (Fig. 2e, right). Reads were obtained with QuantSeq 3′ mRNA-seq, and quantified with Salmon as described above. Differentially expressed genes were determined with DESeq2 (ref. 53), with effect size shrinkage implemented using the apeglm package54, by applying criteria for an adjusted P value < 0.05 and fold change ≥ 1.5. The control group of genes was defined by an adjusted P value ≥ 0.05 and |log2(fold change)| < 0.5. For the subsequent analysis of 3′UTR features, we filtered the gene groups to contain only protein-coding genes with a minimum 3′UTR length of 100 nt and an expression level ≥ 5 TPM (transcripts per million reads) in LIN28A-KO or WT in the FGF2-treated condition. This resulted in 1,183 downregulated genes, 989 upregulated genes and 2,703 control genes. Representative 3′UTRs for each gene were annotated on the basis of the most abundant mRNA isoform in cells expressing LIN28A WT cultured in 2iL medium. Expression levels were estimated from TPM values obtained with Salmon by taking a mean TPM value for each transcript across all replicate experiments.
RNA stability prediction methodsTo calculate mRNA T>C conversion, the pipeline was enclosed in Snakemake (version 5.3.0)55. For analysis of mRNA 3′-end sequencing data, reads were demultiplexed with Cutadapt 1.18, prohibiting mismatches in the barcode56. Then, sequencing read quality was assessed (-q 20) and barcode-trimmed (--cut 12), discarding short reads (--length 16). Poly(A) stretches (>4) at the 3′-end were removed up to 20 tandems without indels or mismatches. Trimmed reads were aligned with Slamdunk map (version 0.3.3)57 to the reference mouse genome (mm10) using local alignment and allowing up to 100 multimapping reads (-n 100). Aligned reads were subject to a filtering step, retaining only alignments with a minimum identity of 95% and a minimum of 50% of the read bases mapped. Reads ambiguously mapping to more than one 3′UTR were discarded (https://github.com/AmeresLab/UTRannotation). Single-nucleotide polymorphisms (SNPs) were called using VarScan (version 2.4.1) with default parameters and establishing a tenfold coverage cutoff and a variant fraction cutoff of 0.8 (-c 10 -f 0.8). SNPs overlapping T>C conversions needed to present a base quality Phred score of at least >26. T>C conversions were counted with rolling windows and normalized by the T content, coverage of each position and average per possible transcript 3′UTR. Lastly, the T>C conversion quality was assessed by Slamdunk alleyoop (version 0.3.3)57.
To calculate half-life, T>C conversions were background-subtracted and normalized to the chase S4U treatment. T>C conversions were fit to an exponential decay nonlinear function using the R minpack.lm 1.2 (ref. 58). Only 3′UTR half-lives that correlated to a model R2 > 0.6 in all experimental conditions were retrained to measure the mRNA stability of 6,276 transcripts.
Preprocessing and feature computation for classification of developmentally destabilized transcriptsInitially, 3′UTRs were filtered for R2 > 0.6 and a delta half-life was computed (naive versus primed and LIN28-KO versus WT). All retained 3′UTRs were sorted according to their delta half-life. The top and bottom 1,000 3′UTRs with the largest half-life changes were assigned to a specific condition (for example, naive and primed).
Next, we computed a feature matrix for these 3′UTRs. Features were distinguished into seven classes and computed by the gradient boosting machine (GBBM) as described below.
Feature class
Feature description
Number of features
Simple features
G+C content, dinucleotide frequency, 3′UTR width and average PhastCon score
19
miCLIP-seq
Average m6A miCLIP-seq scores measured in different ES cell conditions
6
ES cell iCLIP-seq
Average CLIP-seq scores for various RBPs measured in mouse ES cells
62
TAIL-seq9
poly(A) length and number of A, monoC, monoG, monoT, oligoC, oligoG and oligoT from mouse ES cells
8
TargetSCAN
Binding predictions for different miRNAs
11
DeepBind11
Average binding prediction of 50-bp tilled 3′UTRs
105
POSTAR2 (ref. 10)
CLIP-seq peak scores from various RBPs and conditions
54
All these features were assembled in one matrix, filtered for zero variance features and split into a training set (75%) and a hold-out test set (25%). The training set was used for subsequent model training (one model per set and one model for all features together).
For Fig. 1c–e, we compared the univariate feature importance, as measured by the AUROC for each CLIP or POSTAR2 feature. The AUROCs were computed per feature to distinguish WT versus KO and naive versus primed 3′UTRs. For this analysis, we used the varImp function from the R package caret.
GBM training and model benchmarkingHyperparameter search and model training were performed with the R packages caret and GBM59. Models were trained for a binary classification task to distinguish WT versus LIN28-KO or naive versus primed specific 3′UTRs. Gradient-boosted trees were trained on the features of the training set using fivefold cross-validation repeated five times while performing hyperparameter optimization using grid search. Furthermore, the hyperparameter search was performed as a grid search over the parameters listed below with the aim to maximize the AUROC.
Parameter
Value
Interaction depth
1, 3, 6
Number of trees
1, 10, 20, 30, 40, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500
Shrinkage
0.001, 0.005, 0.01, 0.05, 0.1
Minimum number of observations in terminal node
10
The resulting GBM models were evaluated on the hold-out test set using the AUROC and MCC as performance measures.
The feature importance for each GBM model was computed as the relative influence and the reduction in model performance by shuffling a particular feature. For this analysis, we used the associated GBM R package functions ‘relative.influence’ and ‘permutation.test.gbm’.
Prediction of pLIN28A-dependent decay from RNA sequenceThe longest 5% of pLIN28A upregulated and downregulated 3′UTR sequences were excluded and the remaining 2,060 sequences were N-padded to ensure a uniform length of 3,425 nt across all inputs. Each sequence was then encoded using a one-hot encoding scheme, with the bases A, C, G, T and N represented as binary vectors [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1] and [0, 0, 0, 0], respectively. The 3′UTR regulation classes were further transformed into a binary class matrix, where the upregulated and downregulated classes were designated as [1, 0] and [0, 1], respectively.
The sequences were split into fixed training, validation and testing sets (66:17:17 split), preserving class distribution and using fixed sampling. A hybrid model combining a one-dimensional (1D) CNN and RNN was used23 through TensorFlow 2.12.0. The architecture included an initial 1D convolutional layer (Conv1D) with L1 and L2 regularization and three stacked convolutional blocks each containing an L1-regularized Conv1D layer, followed by a dropout layer, with max-pooling transformation. This was followed by a gated recurrent unit (GRU) and two dense fully connected layers, with the latter using the softmax activation, thereby producing predictive probabilities for each class. The GRU was oriented to process the data in reverse order, such that the nucleotide sequence was still most recently considered by the network, as opposed to the N-padded terminus. Layer normalization and a rectified linear unit (ReLu) activation function were applied following each Conv1D layer, with batch normalization and the ReLu activation function succeeding the GRU layer, dense layer and final softmax output (Extended Data Fig. 4a). Dropout and L1 and L2 regularization were incorporated as the main tools to reduce overfitting of the limited data.
The regularization coefficients (~0.89 for L2, ~7.53 × 104 for L1), dropout probability (~0.34), number of Conv1D blocks (3), Conv1D kernel number (5) and number of neurons (64) were set as per the optimization results from Optuna version 3.1.0 (ref. 60), optimizing for best validation set accuracy. The Adam optimizer was employed with the learning rate set to 5 × 10−5, beta_2 set to 0.98 and clipnorm set to 0.5, to additionally prevent overfitting. The loss was computed using the categorical cross-entropy. Training was performed on batches of 64 examples from the training set with early stopping, monitoring validation loss with a patience of 25 epochs and restoring the best weights upon stopping. The model’s final performance was evaluated on the test set on the basis of accuracy, AUROC score, precision, recall and F1-score metrics. The code for model training and subsequent feature importance analyses is available from GitHub (https://github.com/ulelab/LIN28A_RNPreassembly_bioinformatics).
SHAP importance analysis and characterization of 3′UTR features predictive of pLIN28A-dependent decayThe impact of individual features in the 3′UTRs on the predictions of the neural network was assessed using SHAP version 0.35.0 (ref. 24) with a gradient-based explainer. This process generates the SHAP values at the nucleotide level, which quantify the contribution of each hypothetical feature (in this case, each of the four nucleotides) to each prediction class (upregulated versus downregulated). For each evaluated sequence, this results in an array in which each row represents one nucleotide within a sequence and the four columns represent importance scores for each of the four nucleotides (that is, model features). All sequences were used for the SHAP analysis of each class.
Following our analysis with SHAP, we used the tool TF-MoDISco Lite version 2.0.0 to identify important sequence motifs in all 3′UTRs for the downregulated prediction class25 (Extended Data Fig. 4e). TF-MoDISco works by first extracting short spans of nucleotide sequence (that is, seqlets) with high absolute importance scores and then dividing them into positive (that is, the seqlet positively contributes to the prediction) and negative (that is, the seqlet negatively contributes to the prediction) metaclusters. For each metacluster, a similarity matrix is calculated and used for the identification of underlying sequence motifs by Leiden clustering. We ran TF-MoDISco Lite with the maximum number of seqlets per metacluster, the number of Leiden clusterings to perform with different random seeds and the window length surrounding the peak center for motif discovery set to 1,000, 10 and 150 nt, respectively.
To further understand the important trimer motifs in model prediction, we removed the N-padding and summarized the importance of individual trimers across 3′UTR sequences. By applying a sliding window of 3 nt using a step of 1 nt to each scored nucleotide sequence, we computed the mean importance score for each possible trimer at each sequence position. Next, each sequence was grouped into 100 bins and the scores across all nucleotide positions within one bin were summed and normalized by bin length. Finally, the mean importance score of each trimer was computed within each bin (across all evaluated sequences) and summarized in the metaprofile for the downregulated prediction class (Fig. 3a). Specific trimers with high positive or negative importance scores for the downregulated prediction class were highlighted in color against a background of all possible trimers.
To quantify the contributions of individual nucleotides at relative positions within the 3′UTRs on model predictions (Extended Data Fig. 4c), we normalized the nucleotide-level importance scores for the downregulated prediction class across all transcripts to their respective 3′UTR lengths. The normalized scores were subsequently split into 100 bins according to their relative position in the 3′UTR and scores within each bin were summed. Lastly, the mean score in each bin was calculated across all evaluated 3′UTRs and these summarized scores were visualized as a heat map (Extended Data Fig. 4c). Within this heat map, each cell represents the summarized importance score of a specific nucleotide within a particular bin.
iCLIPWe used iCLIP to identify a repertoire of RBP binding sites in PS cells following the iiCLIP protocol61 with some modifications. Cells were ultraviolet (UV) cross-linked on ice and then lysed in RIPA and CLIP lysis buffer. Then, 0.4 U of RNaseI and 4 U of Turbo DNase were added per 1 ml of cell lysate at 1 mg ml−1 protein concentration for RNA fragmentation. Negative controls (no UV) were prepared. Antibodies (anti-GFP polyclonal antibody, Thermo Fisher Scientific, A6455; anti-LIN28A polyclonal antibody, Cell Signaling, 3978; anti-PABPC1, Abcam, ab21060 and Proteintech, 10970-1-AP; PABPC4, Proteintech, 14960-1-AP) were coupled to magnetic protein G beads used to isolate protein–RNA complexes and RNA was ligated to a preadenylated infrared-labeled IRL3 adaptor62 with the following sequence:
/5rApp/AG ATC GGA AGA GCG GTT CAG AAA AAA AAA AAA /iAzideN/AA AAA AAA AAA A/3Bio/
The complexes were then size-separated by SDS–PAGE, blotted onto nitrocellulose and visualized by Odyssey scanning. For the multiplexed sample, one replicate was run in parallel with the IRL3 adaptor to allow quality control of the RNP complex on the membrane and to help with cutting of the bands. RNA was released from the membrane by proteinase K digestion and recovered by precipitation. Complementary DNA (cDNA) was synthesized with Superscript IV reverse transcriptase (Life Technologies) and AMPure XP bead purification (Beckman-Coulter) and then circularized using Circligase II (Epicentre) followed by AMPure XP bead purification. After PCR amplification, libraries were size-selected with AMPure beads (if necessary by gel purification) and quality-controlled for sequencing. Libraries were sequenced as single-end 100-bp reads on an Illumina HiSeq 4000.
iCLIP data analysisSequencing data produced by iCLIP experiments were processed on the iMaps Genialis server (https://imaps.genialis.com/). Briefly, experimental barcodes were removed with Cutadapt56 and sequencing reads were aligned with STAR63 to the mouse genome build (GRCm38.p5 GENCODE version 15 annotation) allowing two mismatches. Unique molecular identifiers (UMIs) were used to distinguish and remove the PCR duplicates. The nucleotide preceding each sequencing read was assigned as the cross-link event.
iCLIP data for LIN28A WT (in 2iL-treated and FGF2-treated cells), LIN28A S200A (in FGF2-treated cells) and PABPC1 and PABPC4 (in LIN28A-KO cells with and without LIN28A overexpression) were analyzed on the iMaps Goodwright server (https://imaps.goodwright.com/). First, reads were demultiplexed using Ultraplex64, using default settings, and barcodes were removed using Cutadapt version 3.4 (ref. 65). Reads were then premapped to ribosomal RNA, transfer RNA sequences referred to as small RNA (smRNA) and smRNA with Bowtie66. Next, reads that did not map with Bowtie were aligned with STAR version 2.7.9a (ref. 63) using the mouse genome build (GRCm39 GENCODE M28 annotation). This was followed by the removal of PCR duplicates using UMI-tools67 and identification of cross-link events, as described above. Peaks of the cross-linking signal were identified with Clippy version 1.4.1 (ref. 68) and used together with the cross-link sites to find enriched motifs with positionally enriched k-mer analysis (PEKA) version 1.0.0 (ref. 48) using the default settings. For Clippy and PEKA, the GENCODE primary assembly annotation M28 was filtered to retain only entries with a transcript support level of 1 or 2, in genes where such transcripts were available, and used to produce a segmentation file with the get_segments function from the iCount tool69. All files generated by running the analysis pipeline are available from the iMaps Goodwright webserver for analysis of CLIP data (see https://imaps.goodwright.com/collections/882635250203/ and https://imaps.goodwright.com/collections/340215254997/ for LIN28A and PABPC1/4 iCLIPs, respectively) and on the updated Flow webserver (see https://app.flow.bio/projects/882635250203/ and https://app.flow.bio/projects/340215254997/ for LIN28A and PABPC1/4 iCLIPs, respectively). We archived key data analyzed in this study, specifically, cross-link sites, peaks and motif enrichments from PEKA, on Zenodo (https://doi.org/10.5281/zenodo.10054231)70. The code and settings used in the iCLIP analysis pipeline (release v0.30) can be viewed at https://github.com/goodwright/imaps-nf, and are also archived on Zenodo (https://doi.org/10.5281/zenodo.10054231)70.
Identification of motif groups from CLIP dataFor the identification of enriched motif groups, we obtained PEKA results in 3′UTRs (files ending in *5mer_distribution_UTR3.tsv, accessible from https://imaps.goodwright.com/collections/882635250203/) for all LIN28A WT (in 2iL-treated and FGF2-treated cells) and LIN28A S200A (in FGF2-treated cells) iCLIPs, except for LIN28A-S200A_ESC_LIF-CHIR-FGF0220626_MM_2, which was excluded from subsequent analyses because of low read coverage. First, we combined enriched k-mers (P < 0.05) from all samples into one group of unique k-mers (n = 93), which we then clustered on the basis of their sequence similarity and their ranking in PEKA. To achieve this, we computed Euclidean distances between k-mer ranks in PEKA and Jaccard distances between k-mer sequences. To obtain sequence distances, each k-mer sequence was first converted into a list of all possible substrings with length less than k (for example, a trimer ‘UGA’ would be converted into ‘U’, ‘G’, ‘A’, ‘UG’ and ‘GA’). Then, Jaccard similarity was calculated on sets of substrings for each pair of k-mers. The Jaccard similarity is a quotient of the number of shared substrings between two k-mers and the number of all substrings in the union. Finally, Jaccard similarities were converted into distances by subtracting the similarity values from 1. Next, we applied standard scaling to each of the resulting distance matrices to account for differences in variance between the two metrics, followed by min–max scaling. Normalized matrices were combined with Pythagorean addition and used to cluster k-mers using ‘scipy.hierarchy’, with correlation to compute distances and the UPGMA (unweighted pair group method with arithmetic mean) algorithm to perform clustering. Next, the dendrogram was plotted with ‘scipy.hierarchy.dendrogram’ and the tree was cut into four clusters (Extended Data Fig. 5a) to obtain the following motif groups: (1) an AGGG motif group, representative of the ZnF domain (Fig. 3d); (2) the UGGG motifs; (3) the GAU motif group, bound by the coding sequence (CDS) (Fig. 3e); and (4) the AUU motif group (Fig. 3f). Finally, we combined AGGG and UGGG into one motif group (WGG), as they exhibited similar k-mer sequences and enrichment patterns in PEKA (Extended Data Fig. 5a). The k-mer logos for resulting motif groups were plotted as described in the literature48. The source code for k-mer clustering and for generation of k-mer logos is available on Zenodo71.
For the quantification of significantly enriched A-rich and U-rich 5-mers at 3′UTR cross-link sites of PABPC1 and PABPC4 (Extended Data Fig. 8d), we obtained enriched 5-mers for each sample (P < 0.05) from files ending in *5mer_distribution_UTR3.tsv (accessible from https://imaps.goodwright.com/collections/340215254997/). Then we defined U-rich and A-rich k-mers as those with three or more Us or As, respectively, and computed the proportion of k-mers in each of these groups relative to all significantly enriched k-mers.
Metaprofiles of motif coverage around cross-link sitesWe used the AGGG, UGGG, GAU and AUU motif groups to plot metaprofiles of motif coverage around cross-link sites in Fig. 3d–f with a script described in the literature48, which is available on Zenodo72. The script visualizes the coverage of a k-mer group around cross-link sites within specified transcriptomic regions. The coverage is expressed as a percentage of cDNAs that have a k-mer from the group aligned to a specific position relative to the cross-link site. For this study, we used cross-link sites in the 3′UTR regions as input and computed motif coverage in 601-nt-long regions, centered on the cross-link sites. The cDNA scores of input cross-links were capped at 20 to avoid any excessive contributions of a few cross-link sites with disproportionately high cDNA scores. Finally, coverage was smoothed with a rolling mean using a triangular window of 20 nt and plotted.
Motif-based binding-site assignmentWe defined LIN28A binding sites corresponding to WGG, GAU and AUU motif groups using the approach of motif-based binding-site assignment, which was adopted from the approach used previously by Hallegger et al.73. First, cross-link sites from all LIN28A WT (in 2iL-treated and FGF2-treated cells) and LIN28A S200A (in FGF2-treated cells) iCLIP samples, except for LIN28A-S200A_ESC_LIF-CHIR-FGF0220626_MM_2, were merged by summing up cDNA counts at overlapping positions. Then, binding sites were assigned separately for each motif group where k-mers from a given group were located at relevant positions in the range of ±20 nt around cross-link sites. Relevant positions for each k-mer were determined by combining relevant positions identified by PEKA (prtxn, available in files ending in *5mer_distribution_UTR3.tsv) from all LIN28A iCLIP samples. The script for motif-based binding-site assignment is available on GitHub74. Motif-based binding sites of LIN28A are shown as auxiliary tracks in Extended Data Fig. 7a,d,e.
Visualization of nucleotide distribution around PAS in downregulated and upregulated or control genesFor the generation of the line plot of nucleotide composition 100 nt upstream to 20 nt downstream of the PAS for downregulated and upregulated or control transcripts (Extended Data Fig. 4d), we iterated over the 3′UTR sequences of each group and extracted the location of the terminal canonical PAS (‘AATAAA’) for each sequence; sequences without the canonical PAS or with the PAS located less than 20 nt from the 3′UTR termini were removed. The number of sequences excluded and retained for the downregulated group was 310 and 873, respectively. Meanwhile, for the upregulated or control group, there were 1,154 excluded sequences and 2,539 kept sequences. For the remaining valid sequences, we summed the instances of each nucleotide at every position within the window of 100 bp downstream and 20 bp upstream of the PAS. These nucleotide counts were subsequently normalized at each location to represent their percentage of the total number of considered sequences. The scores were smoothed by a rolling mean with a window of 5 nt and plotted as a separate plot for each transcript group.
Visualization of iCLIP data in 3′UTRs of naive genesThe visualization of iCLIP data within the 3′UTRs of Tfcp2l1, Zfp281 and Esrrb (Extended Data Fig. 7a,d,e) was performed with a clipplotr tool75, using the normalization of iCLIP cDNA counts at a given cross-link site by the experimental library size. Normalized counts were smoothed with a rolling mean across a window of 50 nt and plotted across the regions of interest. Auxiliary tracks were added below the cross-linking signal to represent LIN28A binding sites corresponding to WGG, GAU and AUU motif groups, the location of the AAA trimer and the location of the canonical PAS ‘AAUAAA’.
Estimation of gene-level expressionExpression levels for each gene were obtained by summing up TPM values of transcripts with the same stable Ensembl gene identifier and then averaging these sums across replicate 3′-end sequencing experiments. Transcript TPM values were obtained from 3′-end sequencing data with Salmon, as described in ‘QuantSeq 3′ mRNA-seq processing and analysis’. These gene-level expression estimates are referred to in text as the gene-level TPM.
iCLIP cross-linking profiles for 200 downregulated 3′UTRsFor the generation of iCLIP cross-linking profiles for LIN28A WT (i
Comments (0)