HNC gene expression studies were primarily accessed from GEO and ArrayExpress. Relevant studies were identified using the search terms “Cancer” in combination with the terms “Head and neck,” “Oral,” “Laryngeal,” “Oropharyngeal,” and “Hypopharyngeal,” and by reviewing all datasets that were retrieved by these searches. For GEO searches, datasets were restricted to those with a minimum of ten samples. We identified additional datasets by searching the reports that were associated with these datasets as well as additional review articles, until we were unable to identify any additional suitable datasets. Clinical data was accessed from the metadata that accompanied each dataset within databases, as well as from relevant reports. Where data that was needed to perform the survival and LNM meta-analysis was incompletely reported, authors and journals were contacted to request these data.
All clinical metadata related to survival (Any survival measure) and LNM was retrieved. Also retrieved, where available, were data indicating tumor grade. Other variables that were retrieved included demographic information (patient age, sex, and reported ancestry (race or ethnicity)), clinicopathological variables (tumor subsite, HPV status, measure of HPV status), details of the patient study (country or sample collection), and data pertaining to HNC-related risk habits (smoking and alcohol consumption status and intensity measures). For the TCGA study, HPV status data was obtained from a publication that applied VirusScan [28] to detect HPV RNA within raw RNA sequencing reads, representing the most complete source of HPV status data in terms of patient numbers. To spot-check the accuracy of clinical data, patient sex was inferred based on the ratio of expression of the XIST and RPS4Y1 genes and compared with clinical annotation of sex. This resulted in exclusion of two studies that had inconsistent clinical data.
The curated data compendium included a combined total of 2074 primary HNCs derived from 29 studies (Additional file 2: Table S1). Meta-analyses were performed to identify genes associated with patient survival, LNM status, and tumor grade, applied to the subset of HNCs that were annotated for each variable: These included 1638 HNCs (across 16 cohorts) with survival outcome data, 1449 HNCs (20 cohorts) with LNM status data, and 1139 HNCs (13 cohorts) with tumor grade data.
Processing gene expression data (meta-analysis datasets)Gene expression datasets that were generated using Affymetrix arrays (N = 21) were processed as follows. To ensure accurate annotation of microarray probes, raw data (.CEL files) were accessed and processed using the “affy” R package in combination with platform-specific custom CDF files that were accessed from Brainarray (http://brainarray.mbni.med.umich.edu/). Expression datasets were normalized using the mas5 algorithm. Samples were next restricted to primary tumors, followed by quantile normalization of the expression data. Probe-level data was next summarized to gene-level data using the WGCNA package [29], using the default “maxmean” method for probe filtering. For each gene, this method selects the probe with the maximum mean expression across all samples as a representative measure of the gene. Summarized gene data were log2 transformed and converted to standard gene expression scores. For each gene, standard gene expression scores were calculated for each patient sample by subtracting the mean expression of the gene and dividing by the standard deviation. Statistical pipelines that were used to perform meta-analyses were applied to standard scores.
Eight datasets were generated using non-Affymetrix microarrays (Microarrays that were manufactured by Agilent, Illumina, and the German Cancer Research Center). These datasets were downloaded from GEO as series matrix files using the GEOquery R package. These datasets were preprocessed as follows: Gene names were converted to Entrez IDs using array annotation “Platform” files that accompanied each dataset. Where Entrez IDs were not included in the annotation file, gene names were converted to Entrez IDs using biomaRt [30]. Datasets were restricted to primary tumors and were filtered to remove samples with missing data for 10% or greater of genes, and to remove genes that had missing data for 10% or greater of samples. Datasets were then quantile normalized. For genes with multiple probes, the WGCNA package was used to identify the probe with the maximum mean expression across samples, which was selected to be a representative measure for each gene. Datasets were then log2 transformed if not already in log2 space and converted to standard gene expression scores as described for Affymetrix-based datasets.
Preprocessed TCGA bulk RNA-Seq data (gene-level HTSeq counts) were downloaded from TCGAbiolinks [31]. TCGA data was processed for meta-analyses using an approach that was consistent with array-based datasets: The dataset was restricted to primary tumor samples and then quantile normalized. Gene names were converted from Ensembl IDs to Entrez IDs using biomaRt [32]. Ensembl ID-level data was summarized to Entrez gene-level data using the WGCNA package “CollapseRows” function. The default “maxmean” method was used to select features with higher expression where Entrez IDs matched multiple Ensembl IDs. The datasets were then log2 transformed and converted to standard gene expression scores as described for Affymetrix-based datasets.
For applications other than meta-analyses, TCGA RNA-Seq data was processed using an alternative normalization approach in order to process primary HNC and tumor-adjacent normal samples in parallel, as quantile normalization assumes similar data distributions across samples [33]. HTSeq counts were converted to standard scores such that expression data for each HNC sample had a mean of zero and standard deviation of 1. Standard scores were then log2 transformed and batch corrected (correcting for sample plate) using COMBAT [34]. Gene names were converted from Ensembl IDs to Entrez IDs using biomaRt [30]. Ensembl ID-level data was summarized to Entrez gene-level data using the WGCNA package “CollapseRows” function [29]. The default “maxmean” method was used to select features with higher expression where Entrez IDs matched multiple Ensembl IDs.
Meta-analysis of genes associated with survivalThis meta-analysis included all datasets that had at least 20 primary HNCs with survival and gene expression data (N = 16 studies with a combined total of 1638 HNCs). Clinical data pertaining to all measures of survival was accessed for each study, and survival time was converted to months. Survival analysis was performed using overall survival (OS) where possible, and other survival measures (progression-free survival, or distant metastasis-free survival) where OS was not reported (Table 1). For each dataset separately, Cox regression models were used to calculate z-scores for association of each gene with survival. For genes that were represented in two or more datasets (N = 23,558), Liptak’s weighted meta-z test [35, 36] was used to combine z-scores for each dataset into a single “meta-z-score,” a summary statistic that indicates the association of gene with survival across studies. Liptak’s meta-z test was applied with weights set to the square roots of dataset sample sizes. Genes were considered to be significantly adversely associated with survival (anti-survival) if they had a meta-z-score of 3.09 or greater (i.e., P < 0.001) and favorably prognostic (pro-survival) if they had a meta-z-score of − 3.09 or less.
Table 1 Gene expression studies that were included in meta-analyses to identify survival and lymph node metastasis-associated genesMeta-analysis of genes associated with LNMThis meta-analysis included all datasets that had at least five LNM + and five LNM0 primary HNCs (N = 20 studies with a combined total of 1449 patient primary). LNM data was accessed from reports or metadata files as either LNM status (presence or absence of LNM) or was converted from a continuous measure of LNM burden (LNM stage, ratio, or number of LMs). For each gene that was available in at least half of the studies, the following statistics were calculated for each study separately: The standardized mean difference in expression between LNM + and LNM0 primary HNCs, the standard deviation of expression in each of these groups, and the number of samples in each group. Next, we used random effects models [58] to calculate meta-z-scores and effect size summary statistics for the association of each gene with LNM status across studies, based on the combined standardized differences and standard deviations, weighted by study sample size. Genes were considered to be positively associated with LNM (Pro-LNM) if they had a meta-z-score of 3.09 or greater (i.e., P < 0.001) and negatively associated with LNM (Anti-LNM) if they had a meta-z-score of − 3.09 or less.
Meta-analysis to identify genes associated with tumor gradeA meta-analysis was performed to identify genes that were associated with tumor grade (i.e., level of differentiation), where grade was reported either using a numeric grading system of the level of differentiation upon histological analysis (well, moderate, poor). This meta-analysis consisted of 13 studies with a combined total of 1139 primary HNCs. In each study separately, linear regression was applied to test the association of each gene with grade or differentiation level, treating grade, and differentiation level as ordinal variables. For genes that were represented in two or more datasets (N = 25,058 genes), Liptak’s weighted meta-z test was used to combine z-scores for each dataset into a single “meta-z-score,” a summary statistic that indicates the association of gene with grade across studies. Liptak’s meta-z test was applied with weights set to the square roots of dataset sample sizes. Genes were considered to be positively associated with grade (pro-grade) if they had a meta-z-score of 3.09 or greater (i.e., P < 0.001) and negatively associated with grade (anti-grade) if they had a meta-z-score of − 3.09 or less.
Testing the independence of prognostic gene signatures from HPV statusRegression models were used to test the association of survival gene signatures with survival, adjusted for HPV status, and to test the association of LNM gene signatures with LNM status, adjusted for HPV status. Expression scores were calculated for each prognostic gene signature (i.e., set of prognostic genes) as the mean of expression (standardized gene expression scores) of all genes within the signature. Each patient (primary HNC) was thereby assigned an expression score for each prognostic signature. Survival gene signatures included all genes that were negatively (anti-survival) and positively (pro-survival) associated with survival, as well as genes within survival gene clusters (S1-6). Cox regression models were used to test for association of each survival gene expression score with survival, adjusting for HPV status, in all studies (N = 4) that had at least ten patients with complete data for survival, HPV status, and gene expression. LNM gene signatures included all genes that were negatively (anti-LNM) and positively (pro-LNM) associated with LNM, and genes within each LNM gene cluster (L1-6). Logistic regression models were used to test for association of each LNM gene expression score with LNM status, adjusting for HPV status, in all studies (N = 6) that had at least ten patients with complete data for LNM status, HPV status, and gene expression. For each prognostic (Survival or LNM) gene signature, an HPV-adjusted meta-z-score was calculated using Liptak’s weighted meta-z test to combine z-scores across studies, weighted by study sample size. Additional analyses performed to investigate effects of HPV status and other potential effect-modifier on gene-survival and gene-LNM associations are described in Additional file 1: Supplementary Methods & Results.
Gene set enrichment analysisGSEA was applied to all genes that were analyzed as part of the survival and LNM gene meta-analyses, to identify curated genes that were most significantly associated with each outcome, from a database of 18,993 curated gene sets. GSEA was applied to survival and LNM-associated genes using the “fgsea” R package (bioRxiv. https://doi.org/10.1101/060012). For consistency between survival and LNM-associated genes, genes were ranked by meta-z-scores, as this summary statistic was available for both. Curated gene sets were accessed from the Molecular Signatures Database (MSigDB) [59]. Selected for analysis were all gene sets in the “C1,” “C2,” “C5,” “C6,” and “H” gene categories, except for gene sets in the “CGP” (chemical and genetic perturbations) subcategory (N = 18,993 gene sets). CGP subcategory gene sets as well as gene sets in other categories (C3, C4, C7, and C8) were excluded due to the sparsity of their annotation, which makes them difficult to interpret. Gene sets with fewer than fifteen or more than 500 gene sets were removed to exclude enrichment that are less statistically and biologically meaningful. GSEA scores and p-values were calculated for the p53-DREAM target gene set (A set of 201 p53-DREAM target genes accessed from [60]) by adding this gene set to the list of MSigDB gene sets before performing GSEA.
Gene set overrepresentation analysisGSOA was applied to identify gene sets that most significantly overlapped with survival and LNM-gene clusters, from a database of 18,437 curated gene sets. GSOA was performed using the msigdbr package, which was used to access gene sets from MSigDB, in combination with the clusterProfiler package, which was used to perform hypergeometric tests. Selected for analysis were all MSigDB gene sets that were used GSEA (See: Gene set enrichment analysis). When applying GSOA to genes derived from each meta-analysis, the background gene list used for GSOA consisted of all genes considered in the meta-analysis, i.e., for which meta-z-scores were calculated. These represented genes for which data was available in a sufficient number of studies to be considered.
Unsupervised clustering of prognostic genes based on co-expression in HNC populationsPhenograph [61], an unsupervised clustering method, was used to cluster survival and LNM-associated genes (That were previously identified by meta-analyses) based on their co-expression within HNC bulk transcriptional data. Phenograph was selected over other unsupervised clustering methods to avoid the step of selecting the number of gene clusters (K) a priori, which is required for other methods and introduces bias. For prognostic genes that were associated with each outcome (survival and LNM), Phenograph was applied to a combined matrix of uniformly processed gene expression profiles from twenty studies. Since Phenograph-based clustering does not tolerate missing data, the gene expression matrices were generated using an approach that maximized the number of prognostic genes and HNC that had complete data. To achieve this, for genes associated with each clinical outcome, studies were restricted to HNCs that included data for at least 80% of genes, and data from these studies was combined into a single matrix. Clustering was then applied to genes that had complete (non-missing) data for all samples within the combined matrix. For survival-associated genes, this approach yielded a combined matrix of 1642 HNCs derived from 20 studies, which had complete data for 958/1212 (79%) of all survival-associated genes. For LNM-associated genes, the combined matrix also included 1642 primary HNCs that were derived from twenty patient studies, which included complete data for 742/877 (85%) of the LNM-associated genes. Phenograph was then applied to the combined matrix of gene expression profiles in order to identify co-expressed gene clusters in the sets of survival and LNM-associated genes. All genes that were represented in the combined gene expression matrices were included in the resulting gene clusters.
Analysis of periodic expression of LNM-associated genes based on expression in synchronously dividing cellsData that was previously published by Dominguez et al. [62] was used to investigate cell cycle phase-specific expression of LNM-associated genes. These data were generated by applying bulk RNA-Seq to map transcription in synchronously dividing cells (HeLa) that were collected at fourteen timepoints over the course of two mitotic cycles. Normalized gene expression data in the form of fragments per kilobase of transcript per million mapped reads (FKPM) was accessed from the Dominguez et al. report [62], as was data indicating the cell cycle phase within which each gene was expressed.
Processing the Stanford scRNA-Seq datasetThis dataset was described in our recent report [63] and is accessible from GEO (Accession number: GSE140042). For the current study, analysis was restricted to primary HNCs that were processed using enzymatic digestion, for consistency with the Puram dataset. Cell Ranger was used to align RNA-Seq reads to the latest GENCODE human transcriptome (Genome build hg38) and to quantify RNA counts. Sparse data matrices were then loaded into a Seurat object, which was filtered to remove genes that were present in ten cells or fewer. Low-quality and dying cells were removed by excluding cells with a unique feature count of fewer than 200 (N = 72 cells) as well as cells with a mitochondrial genome fraction of 0.4 or greater. Potential doublets were removed by excluding cells with a unique feature count of greater than 4000 (n = 444 cells). Integration was then performed by splitting the dataset into separate Seurat objects, with each object containing all the cells that derived from one HNC sample. Gene expression counts for each cell were normalized using regularized negative binomial regression, and variable genes (N = 2000) were found for each sample using the “vst” method [64]. Samples were then integrated into a single gene expression object by finding integration anchors using the “FindIntegrationAnchors” and “IntegrateData” commands. The combined genes were then scaled and centered using linear models. This integration approach removed sample batch effects such that cells clustered by cell type rather than by sample. Unsupervised clustering was applied to the integrated Seurat object in order to identify cell clusters using nearest neighbor modularity optimization [65]. PCA performed with 50 principal components (PCs) and elbow plots were then used to select the appropriate number of PCs. Unsupervised clustering was then applied to cells. To identify the appropriate number of cell clusters (i.e., the appropriate level of granularity), cell clustering was performed at multiple resolutions ranging from 0.3 to 1 in increments of 0.1. The optimal resolution was identified based on visualization of the resulting cell clusters using principal component analysis (PCA) and Manifold Approximation and Projection (UMAP). This approach was used to select the number of clusters that separated the major cell types into different clusters and that separated cell types into subclusters (cell subtypes) only where separate subclusters were clearly visible based on PCA and UMAP visualization. Cell clusters were manually annotated with their cell type and subtype by visualizing their expression of known cell type marker genes. Also visualized were gene expression scores of cell type marker gene signatures that were accessed from PangloaDB [66]. Gene expression scores were calculated for each signature as the mean expression (scaled normalized counts) of all genes in the signature, which indicated the expression of the signature in each cell. Cell type assignments were confirmed by applying Seurat to transfer cluster labels from the preexisting Puram HNC scRNA-Seq dataset [67]. This approach used a model that was trained on primary HNCs from the Puram dataset, for which cell types had been previously annotated, to classify cell clusters in the new Stanford scRNA-Seq dataset. Where multiple cell clusters of the same cell type were observed (such as classic fibroblasts and myofibroblasts), cell subtypes were manually annotated by visualizing gene expression signatures for cell subtypes.
Processing the Puram scRNA-Seq datasetThe Puram scRNA-Seq dataset was accessed from GEO (Accession number: GSE103322) as a preprocessed series matrix file. The dataset was then loaded into a Seurat object and was split into separate sample objects, with each object containing all of the cells that derived from one sample. Samples were restricted to primary HNCs with a minimum of 200 cells (N = 9 samples). The Puram dataset was subsequently processed using Seurat, as described for the Stanford scRNA-Seq dataset (See “Processing the Stanford scRNA-Seq dataset”). Cell type labels that were previously assigned by Puram et al. [67] were accessed from the GEO metadata. The validity of these cell type assignments was confirmed by UMAP-based visualizing expression of cell type marker genes and signatures.
Cells were labeled in a way that was consistent between the Puram and Stanford scRNA-Seq datasets, in order to facilitate comparison between these datasets. For this reason, cells that were labeled as myocytes in the Puram dataset were excluded from all analysis, as cells expressing myocyte markers were not observed in the Stanford dataset. Moreover, while macrophages or dendritic cells were labeled by Puram et al., these cells are labeled as myeloid cells in the current study, as we found that in both the Puram and Stanford scRNA-Seq datasets, myeloid lineage cells clearly separated from other cell types but expressed markers of macrophages, dendritic cells, and monocytes. This is consistent with emerging evidence that cells of the mononuclear phagocyte system (macrophages, dendritic cells, and monocytes) do not represent discrete cell types but have overlapping transcriptional profiles and functions [68, 69].
Prediction of additional cell phenotypes/states in scRNA-Seq datasetsCell cycle phase was inferred using Seurat, based on expression of cell phase-specific marker genes. CytoTRACE [70] was applied to the raw count gene expression matrix for all epithelial cells, as per user protocol. CytoTRACE was applied to malignant cells only, according to the user manual recommendation that CytoTRACE be applied separately to cells of different lineages. Epithelial to mesenchymal (EMT) score was calculated as the sum of normalized counts for mesenchymal genes (VIM, CDH2, FOXC2, SNAI1, SNAI2, TWIST1, GSC, FN1, ITGB6, MMP2, MMP3, MMP9, and SOX10) minus the sum of normalized counts of epithelial genes (CDH1, DSP, and TJP1), as previously described [71, 72].
Bulk transcriptional profiling of flow-sorted cellsBulk RNA-Seq was used to profile transcriptomes of distinct cell populations that were isolated from primary HNCs using fluorescence activated cell sorting (FACS):
Patient samplesPrimary tumor tissue samples were collected between March 2017 and April 2018 from patients (n = 15) undergoing surgical resection of HNSCC (including squamous cell carcinoma of the oral cavity, oropharynx, and larynx) at the Stanford Hospital, Stanford, CA, after informed consent. Inclusion criteria included a diagnosis of HNSCC and age ≥ 18 years. Fresh tumor tissue specimens, with clinical annotation, were collected at the time of extirpative surgery and freshly frozen within 6 h after surgical resection. This study was performed in compliance with ethical regulations outlined in a Stanford Institutional Review Board (IRB)-approved protocol (protocol no. 11402). Details of patient clinicopathologic features are provided in Additional file 2: Table S2.
Sample preparation for florescence-activated cell sorting (FACS)FACS sample preparation included obtaining tumor tissue from consented patients within 4 h after tumor tissue removal. Tumor tissue was placed on ice in 50 μl tissue digestion media, DMEM-F12 + with magnesium and calcium (Corning Cellgro, Manassas, VA), 1%FBS (heat inactivated), 10 units/ml Penicillin-10ug/ml Streptomycin (Gibco, Grand Island, NY), and 25 mM hepes (Gibco, Grand Island, NY). Tumor tissue was thoroughly diced with a sterile scalpel and placed in a gentleMACS C-tube (Miltenyi Biotec, Sunnyvale, CA) containing 1.5 ml of tissue digestion media. Tissue was mechanically digested on the GentleMACS dissociator five times under the human tumor tissue program h_tumor_01. Two milliliters of tissue digestion media and 0.5 ml of 3000U/ml collagenase/1000U/ml hyaluronidase (StemCell Technologies, Vancouver, BC) were added to the C-tube after mechanical digestion. The tissue in the C-tube was incubated at 37° C rotating for 1 h, then filtered with a 40-μm nylon cell strainer (Falcon, Corning, NY) into a 14-ml tube containing 14 ml tissue digestion media and centrifuged at 4 °C for 10 min at 514RCF. The enzymatically digested cell pellet was resuspended in 1–4 ml ACK lysis buffer (Gibco, Grand Island, NY) depending on the pellet size and number of red blood cells, for 2 min on ice. Cells were filtered as before, washed with 14 ml of FACS buffer (phosphate buffered saline) without calcium or magnesium (Corning, Manassas, VA), 2%FBS heat inactivated, 10 units/ml Penicillin-10ug/ml Streptomycin (Gibco, Grand Island, NY), and 1 mM Ultra-pure EDTA (Invitrogen, Carlsbad, CA), and centrifuged at 4 °C for 10 min at 514RCF. Cells were resuspended in FACS buffer, counted on a hemacytometer and washed one more time with FACS buffer. Cells were kept in FACS buffer on ice until flow cytometry staining.
Flow cytometry staining and sortingCells were incubated in the dark on ice for 30 min with fluorescent markers (Additional file 2: Table S3), at the manufacturers’ recommended concentration, except for DAPI, which was added after the last wash. Cells were washed three times with FACS buffer and sorted on a BD Aria II SORP using the BD FACSDIVA v8.0.1 software into 4 groups, CD3 + CD19 + CD45 + CD68 + (leukocytes), unstained (malignant cells), FAP + (fibroblasts), and CD31 + or CD140a + (endothelial cells) in tissue digestion media containing 30% FBS. Cell sorts had an average efficiency of 86.8% on Purity precision sorting, rerunning sorted samples to test for purity was not performed due to the need for enough RNA to sequence. Cells were spun at 4 °C for 10 min at 514RCF and resuspended in RNAlater stabilization solution (Invitrogen, Carlsbad, CA) at the recommended manufacturer’s concentration and stored at 4 °C for less than a week before RNA extraction.
Flow cytometry gatingCells were analyzed using FlowJo V. 10.6.1 and first gated on single cell size using FSC width and height and cell granularity using SSC width and height (Additional file 1: Figure S1). Live cells were gated using the DAPI stain. From the live cell gate, the leukocyte group in FITC and endothelial group in PE were used to separate out CD3 + CD31 leukocyte cells from CD3 − CD31 endothelial cells. Leukocyte and endothelial negative populations were used to gate further for fibroblasts in APC and the malignant unstained (leukocyte, endothelial, and fibroblast negative) group.
Bulk RNA sequencing of flow-sorted cellsRNA was extracted from sorted cells within a week of cell sorting. After washing in PBS, cell pellets were used to prepare RNA with the RNAeasy + micro kit with column removal of genomic RNA. RNA samples were quality controlled using the Agilent 2100 Bioanalyzer system. Library preparation was performed using the SMARTer Stranded Total RNA-seq v2 Pico input mammalian kit (Clontech) at the Stanford Protein And Nucleic acid (PAN) facility. Bulk RNA sequencing was performed using the Illumina Hiseq4000 System, inputting 500 pg–5 ng of total RNA per sample and pooling 8–12 samples into each sequencing lane. Sequencing was performed at the Stanford Center for Genomics and Personalized Medicine (SCGPM) facility. This dataset is accessible from Gene Expression Omnibus (Accession number: GSE113839).
Preprocessing and analyzing the Stanford flow-sorted cell bulk RNA-Seq dataTrim Galore! was used to perform adaptor trimming and filtering of raw reads. Kallisto [73] was used to align reads to the GENCODE 34 human transcriptome (Genome build hg38). MultiQC was used to perform quality control of RNA-Seq samples based on the output of Trim Galore! and Kallisto. Transcript-level counts were summarized to gene level using tximport [74]. DESeq2 [75] was used to convert gene-level count data (The output of tximport [74]) to a “DESeqDataSeq” object and to normalize the RNA-Seq counts by dividing them by estimated size factors. These normalized RNA-Seq counts were used to identify the cell type that featured highest expression of each prognostic gene, representing the cell type with the maximum mean normalized count value. Normalized counts were log2-transformed prior to data visualization.
Estimating cell fractions within the Stanford flow-sorted cell bulk RNA-Seq datasetCIBERSORTx [76] was applied to gene-level transcripts per million (TPM) data in combination with a signature matrix derived from the Puram scRNA-Seq dataset. This signature matrix was derived from HNC scRNA-Seq data, ensuring that the gene signatures used to infer cell fractions were representative of cell types found within HNC tumor microenvironments.
Preprocessing and analyzing the Huang bulk RNA-Seq datasetRaw fastq files were accessed from the European Genome Phenome Archive (Dataset ID: EGAD00001004489) and were processed as described for the Stanford bulk RNA-Seq dataset (see “Preprocessing and analyzing the Stanford flow-sorted cell bulk RNA-Seq data”). Wilcoxon rank sum tests were used to test for differences in mean expression (DESeq2-normalized counts) of anti-LNM and pro-LNM genes between primary HNCs and patient-matched LNMs.
Testing association of gene expression with TP53 mutation statusAssociation of pro-LNM cluster 4 genes with somatic TP53 mutations was established based of differential expression analysis within the TCGA [16] and Wichmann [19] bulk gene expression datasets, as well as the Puram scRNA-Seq dataset. Within each bulk gene expression dataset, Wilcoxon rank sum tests were used to test for differences in mean gene expression (normalized counts) of cluster L4 genes between subgroups of HNCs that were stratified based on their TP53 mutation and HPV status. Cluster L4 gene expression was compared between p53 proficient HNCs (HPV − ve and TP53wt) and two separate groups of p53-deficient HNCs, including TP53mut (HPV − ve) HNCs, and HPV + ve HNCs. For the TCGA study, TP53 mutation data was accessed from the MC3 Public MAF file [77]. For the Wichmann study, TP53 mutation data was accessed from GEO metadata (Accession: GSE65858). Excluded from the analysis were Wichmann study HNCs that were annotated as having “non-disruptive” TP53 mutations and that were HPV negative, due to the ambiguity of their p53 proficiency. In the Puram scRNA-Seq dataset, multiple linear regression was used to test for association of mean cluster L4 gene expression (Normalized counts) with TP53 mutation status (the mutation status of the overall tumor), adjusted for cell cycle phase (estimated by Seurat), within malignant cells. TP53 mutation status (as indicated by targeted or whole exome sequencing) were accessed from the Puram et al. report.
Analysis of expression of LNM-associated genes in oral premalignant lesionsOPL data was accessed from GEO (Accession: GSE26549) [78]. This dataset included gene expression array data for 86 OPL (oral leukoplakia) biopsies that were annotated with follow-up (oral cancer-free survival) information, 84 of which were also were annotated for histology. Raw expression array. CEL files (Affymetrix Human Gene 1.0 ST Array) were processed using the “affy” R package in combination with a platform-specific custom CDF file that was accessed from Brainarray (http://brainarray.mbni.med.umich.edu/). Expression data were normalized using the mas5 algorithm. Probe-level data was next summarized to gene-level data using the WGCNA package [29], using the default “maxmean” method for probe filtering, and summarized gene data were log2 transformed. Subsequent statistical analyses were applied to log2-transformed data.
Data analysis softwareData analysis was performed using R versions 3.6.1 and 4.1.0. Bulk RNA-Seq reads were trimmed and filtered using Trim Galore! (Version 0.6.0) and were quality controlled using MultiQC v1.9 within Python 2.7.5. Bulk RNA-Seq reads were pseudoaligned using Kallisto (linux-v0.46.0). Aligned bulk RNA-Seq reads were converted to gene-level estimates using Tximport 1.14.2. Gene-level bulk RNA-Seq counts were normalized using DESeq2 (1.26.0). Single-cell RNA-Seq reads (10x Genomics) were processed and aligned using Cell Ranger (6.1.2). Subsequent processing and analysis of single-cell RNA-Seq data was performed using Seurat (4.1.0). Flow cytometry data were analyzed using FlowJo Cytometry Analysis Software (BD Biosciences). Other programs and tools are indicated in the relevant “Methods” and “Results” sections.
Comments (0)