The National Cancer Database (NCDB) is one of the largest cancer registries in the world providing real-world data on the management and outcomes of patients with malignancies in the USA.14 It has been established and curated jointly by the American Cancer Society and the Commission on Cancer of the American College of Surgeons. It is a hospital-based database that prospectively captures data from patients who received cancer care at a Commission on Cancer-accredited hospital. The database covers more than 1500 facilities (approximately 30% of all hospitals in the USA) and approximately 70% of all newly diagnosed cancer cases in the USA.
Dedicated registrars extract and collect patient data from medical charts.15 Multiple electronic automated checks alert data registrars to missing data fields and internally inconsistent data to enhance data integrity. In addition, frequent auditing of the data ensures quality control and improvement. For patients who receive care at multiple facilities, only the most complete record is maintained in the database. Data are de-identified and individual patients and providers cannot be identified. However, each hospital is assigned a unique identifier. To protect the privacy of patients, adult files include only those aged >18 years. Furthermore, any reporting of groups of fewer than 10 patients must be masked in order to protect confidentiality. For example, if a sample contains only eight patients with a certain diagnosis, it could not be published as ‘8’ but could be reported as ‘<10.’
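The small-cell masking rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `mask_small_count`, the example table, and the choice to leave zero counts unmasked are assumptions made here, not NCDB requirements, and the data use agreement remains the authority on reporting.

```python
def mask_small_count(count: int, threshold: int = 10) -> str:
    """Return a publishable string for a cell count.

    Counts greater than zero but below `threshold` are reported as
    '<threshold' instead of the exact number. Leaving zero unmasked
    is an assumption of this sketch.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if 0 < count < threshold:
        return f"<{threshold}"
    return str(count)


# Hypothetical frequency table, invented for illustration only.
table = {"Diagnosis A": 8, "Diagnosis B": 152, "Diagnosis C": 10}
masked = {label: mask_small_count(n) for label, n in table.items()}
print(masked)
```

The threshold is a parameter so that the same helper could be reused for data sources with a different minimum cell size.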
Data Included
All major gynecologic malignancy cases are collected in the NCDB in separate dataset files (uterus, ovary, vagina, vulva, cervix) that are renewed each year. The most up-to-date dataset includes cases diagnosed between 2004 and 2021. Data prior to 2004 have been collected but are not available in the public use files. The database includes basic demographic variables (patient age, race, type of insurance, median household income at postal code of residence), type of treatment facility, pathology variables (histology, tumor grade, number and status of regional lymph nodes examined), details of surgery performed (type of surgery, performance of lymphadenectomy, mode of surgery, inpatient hospital stay, peri-operative mortality) as well as receipt of chemotherapy and radiotherapy, follow-up, and patient status. A detailed variable dictionary is available online for prospective researchers.16
Access, Management, and Storage
Researchers can submit their proposals through a dedicated website after obtaining a letter of support from the chairman of the cancer committee at their institution. A data analyst is required and must be located at the facility of the primary investigator. Data are free of cost; however, they are available only for researchers affiliated with one of the approximately 1500 participating Commission on Cancer institutions and thus are not available for researchers outside of the USA. The NCDB participant user files are de-identified and Health Insurance Portability and Accountability Act compliant; however, local institutional review board review of the proposed research is required.
Following review of the research plan and approval of the proposed application, the researcher can sign the data sharing agreement and receive the data file requested in a format that can be downloaded and processed by all major statistical software programs. Data should be securely maintained until the proposed research plan is completed while only the primary and co-investigators listed in the proposal can have access to the data. Access or distribution of the data to outside parties not listed in the original agreement is prohibited. The American College of Surgeons and the Commission on Cancer do not verify, and are not responsible for, the analytical or statistical methodology employed in each study, or the conclusions drawn from these data.
Database Highlights
The NCDB is one of the most comprehensive cancer databases with a mechanism that permits continuous collection of high-quality data. It provides a unique platform to investigate patterns of care of patients with gynecologic malignancies in the USA and to provide hypothesis-generating data. The large number of patients included across multiple sites and histologic subtypes is a major advantage, while the relatively large number of variables collected permits meaningful analyses.
Database Limitations
The major limitation of the NCDB, similar to other observational databases, is patient selection bias since it includes only patients managed at Commission on Cancer-accredited hospitals that are required to demonstrate compliance with certain quality metrics of cancer-care delivery. The NCDB is hospital-based and thus not designed to represent the total US population. Other limitations include the lack of genomic or imaging data and of information on tumor recurrence and cause of death, precluding calculation of recurrence, disease-free survival, or cancer-specific survival. However, time from cancer diagnosis to last contact or patient death is provided and permits calculation of overall survival. Capture of patients’ functional status and comorbidity profile is limited since only the Charlson-Deyo Comorbidity Index is provided, while other functional status indexes such as the Eastern Cooperative Oncology Group performance status are not available. In addition, details on chemotherapy administration (specific agents used, dose, and number of cycles) and specific details on the surgical procedures performed are lacking. Accuracy of staging information assigned to each case may also vary since over-coding has been described in other tumor sites, and care should be taken to avoid staging misclassifications when analyzing data for certain sites such as the cervix.17 Similarly, accuracy of information related to radiation therapy administration may be limited.18 Lastly, for certain variables, depending on disease site, there is a high prevalence of missing data that could potentially bias conclusions.19
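As an illustration of how overall survival can be derived from the two fields mentioned above (time from diagnosis to last contact or death, and vital status), the following is a minimal Kaplan-Meier sketch in plain Python. The input layout and the example values are hypothetical and do not correspond to actual NCDB variable names or data; in practice a validated statistical package would normally be used.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct death time.

    times  : months from diagnosis to last contact or death
    events : 1 if the patient died, 0 if censored (alive at last contact)
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical follow-up data for five patients (illustration only).
months = [6, 12, 12, 20, 30]
died = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(months, died):
    print(f"{t:>3} months: S(t) = {s:.3f}")
```

Patients censored at the same time as a death are counted as still at risk for that death, which is the usual convention.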
Surveillance, Epidemiology, and End Results (SEER) Program
The Surveillance, Epidemiology, and End Results (SEER) Program was created following the National Cancer Act and is maintained by the National Cancer Institute. SEER is administered and funded by the Division of Cancer Control and Population Sciences at the National Cancer Institute. SEER currently collects and publishes data on cancer incidence, prevalence, and survival from population-based cancer registries covering approximately one-third of the US population. It provides incidence, overall survival, and mortality data for histopathologic cancer subtypes.20 21
The SEER Database has been rigorously studied and validated. It is considered to be the gold standard of cancer registries and has become the standard for data quality among international cancer registries. SEER regions are included based on their capacity to maintain a high-quality cancer registry and having populations representative of the total US population.22 Initially, seven registries were included, and incrementally were expanded to the current 22 registries. The SEER population tends to have an over-representation of people born in foreign countries and those residing in urban areas. Furthermore, there is a deliberate over-sampling of specific racial and ethnic minority groups, in order to improve the diversity of the SEER population and allow for evaluation of differences by race/ethnicity.23 The population covered by SEER is representative of the general US population in regard to measures of poverty and education.
Data Included
SEER data are now available for the years 1975–2019. Data collected include all primary invasive cancers and in situ carcinomas. SEER includes demographic information, such as age at diagnosis, date of diagnosis, gender, race/ethnicity, and county of residence. Additionally, it includes primary tumor information, such as stage, and pathologic findings. Cancer data are updated annually to capture vital status, survival time, and cause of death. Vital status is confirmed by linkage to the National Death Index and with supplemental data on date of last known contact obtained by medical record abstraction. Tumor recurrence data are currently not collected and therefore progression-free survival, correlates of local, regional, and distant control, and the effectiveness of salvage therapy cannot be assessed. For some non-gynecologic cancers, biomarkers are also included in the most recent years. Although some information about upfront treatment is included, several important classes of data regarding treatment, including detailed chemotherapy (ie, specific drugs used and dosing) and radiotherapy information, are unavailable.
A recent study comparing SEER with SEER–Medicare data reported an overall sensitivity of 80% for SEER radiotherapy data and 68% for SEER chemotherapy data.24 Details of surgical management during the first course of treatment are extracted from health records. However, specific details about the surgery are unavailable (for example, whether a hysterectomy was laparoscopic or open). Additionally, the intent of the surgeon is not captured, and therefore it is uncertain whether a patient underwent surgery with palliative or curative intent.23
Access, Management, and Storage
Access to the SEER website (www.seer.cancer.gov) is unrestricted. SEER data are free to use, but those who want to use them must obtain permission from SEER. This is accomplished by submitting a request on the SEER website and signing a data-use agreement.
SEER data can be accessed through SEER online software, or through SEER*Stat and then downloaded in tabular form to allow for additional analyses. The online software includes SEER*Stat, SEER*Prep, Joinpoint, and the Health Disparities Calculator. Use of all databases and software is free, and access is obtained by completing the online application form. SEER*Stat is a versatile tool for making queries of SEER data. It is particularly useful for easily calculating age-adjusted rates. The Cancer Statistics Review option provides summaries of all cancers and site-specific cancers. The summaries include 5-year survival data, relative survival as compared with the general population, and survival by tumor stage and race/ethnicity.
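For readers unfamiliar with age adjustment, the idea behind an age-adjusted rate can be sketched as direct standardization: age-specific crude rates are weighted by the share of each age group in a standard population. The snippet below is a conceptual illustration only, not SEER*Stat's implementation; the function name, age groups, and all counts are invented.

```python
def age_adjusted_rate(cases, population, standard_population, per=100_000):
    """Directly standardized rate per `per` persons.

    All three arguments are dicts keyed by the same age groups.
    """
    total_standard = sum(standard_population.values())
    rate = 0.0
    for group, std_pop in standard_population.items():
        crude = cases[group] / population[group]       # age-specific rate
        rate += crude * (std_pop / total_standard)      # weight by standard share
    return rate * per


# Hypothetical counts (illustration only).
cases = {"<50": 10, "50-64": 40, "65+": 90}
population = {"<50": 100_000, "50-64": 50_000, "65+": 30_000}
standard = {"<50": 600_000, "50-64": 250_000, "65+": 150_000}
print(round(age_adjusted_rate(cases, population, standard), 1))
```

Using the same standard population for every registry or year is what makes the resulting rates comparable despite differing age structures.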
In addition to cancer datasets, other datasets in the SEER program are the standard population data for SEER areas, US mortality data, and US population data linked to a Census Tract Socioeconomic Status (SES) Index or to county attributes. The specialized Census Tract-level SES and Rurality Database (2006–2018) has five census tract-level attributes: two SES quintiles, two rurality variables, and persistent poverty. These data can be used for matched analyses with SEER cancer data.
Database Highlights
The SEER Database is an excellent, increasingly used resource for clinical cancer investigation. Unlike the NCDB, SEER is a population-based database. Population-based databases like SEER are important for understanding the implications of pathology diagnoses across demographic groups, geographic regions, and time, and they provide unique insights into the practice of oncology in the USA that are not attainable from other sources.25 26
SEER is useful in the analysis of national demographic differences and regional trends in the diagnosis, treatment, and outcome of all cancers.23 27 28 Furthermore, comparative effectiveness studies regarding the benefits of cancer treatment can be done, with the understanding that inherent limitations in the SEER Database should be acknowledged.24 Details of the development of prognostic models based on initial demographic and clinicopathologic variables have also been published.29 Because of its ability to provide large sample sizes, SEER also allows the opportunity to analyze rare diseases and risk of second malignancies.
Database Limitations
While the SEER Database is a valuable tool for clinical cancer research, several limitations should be considered when analyzing and interpreting results from SEER. There are minimal data about non-cancer information, such as comorbidities, limiting the ability to adjust for some potential confounders. Many limitations revolve around under-reported and incomplete data regarding adjuvant therapy (specific chemotherapy agents and dosing, type of radiation therapy, fractionation, and dosing), unrecorded variables, variations in data reporting, migration of patients in and out of SEER registry areas, unmeasured confounding, and selection bias.23 30 Finally, caution should be taken when using treatment information owing to the limited granularity of treatment agents, modalities, and dosing in SEER.24 Detailed information about the type, dose, and duration of chemotherapy and radiation therapy is not within the purview of SEER data collection. Knowledge gaps pertaining to treatment and follow-up may arise when individuals move between SEER and non-SEER regions, potentially introducing bias into conclusions regarding cancer behavior. Notably, SEER does not capture data related to risk-reduction procedures (eg, risk-reducing salpingo-oophorectomy) or organ removal for non-cancer indications (eg, hysterectomy). Nevertheless, it is possible to address this limitation by leveraging supplementary data sources discussed in this review.
SEER–Medicare Database
The SEER–Medicare linked data resource is a national population-based data source of Medicare beneficiaries with cancer. This database combines Medicare claims information with the SEER cancer registry, such that each patient has full cancer-specific information linked with all their Medicare-billed medical care. The SEER–Medicare database is owned by the National Cancer Institute and Centers for Medicare and Medicaid Services. It is directly overseen by the Information Management System (National Cancer Institute’s information contractor).
Data Included
The SEER–Medicare database specifically reports on Medicare patients who have been diagnosed with cancer. The SEER cancer registry collects cancer incidence data from population-based cancer registries covering approximately 50% of the US population. The paired Medicare data capture all insurance claims reimbursed by fee-for-service Medicare, but not any that are reimbursed by private payers and third-party payers. This database does not capture any treatment received before a patient was enrolled on Medicare, after a patient discontinues Medicare, or during lapses in Medicare coverage. Currently, the SEER–Medicare database contains information on cancers diagnosed between 1999 and 2019.
SEER–Medicare includes information on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, first course of treatment, vital status (death date), and cause of death. It captures treatments received in real-world practice, including the type of hysterectomy and surgical staging based on Current Procedural Terminology (CPT) codes (laparoscopic, open, cytoreductive surgery, radical hysterectomy, lymph node dissection, lymph node biopsy, type of colectomy), chemotherapy infusions through Healthcare Common Procedure Coding System infusion and drug codes (J codes, NDC codes), and radiation therapies through CPT codes (external beam, intensity-modulated radiation therapy, brachytherapy).
SEER–Medicare also captures oral cancer-directed therapies (PARP (poly-(ADP-ribose)-polymerase) inhibitors, mTOR (mammalian target of rapamycin) inhibitors) and other prescriptions. SEER–Medicare also provides reimbursement costs for all Medicare claims, allowing for calculation of global and specific costs of care. It does not, however, capture any out-of-pocket reimbursements, or payments made by individual institutions or individuals to any organizations outside of Medicare. It also captures care rendered by individual providers (who can be tracked through a designated de-identified provider variable), as well as hospital site and type where a service was rendered. It provides patient demographics, though items such as zip code require a special additional application. Health provider-based demographics can also be cross-linked to an American Medical Association-Physicians file, via an additional application and fee submitted to the American Medical Association. Dates are available for billed claims for all treatments. Researchers receive data for all Medicare claims for each patient during the years of interest, including Medicare claims unrelated to cancer care.
Access, Management, and Storage
SEER–Medicare data are not publicly available. Access requires a data use agreement application for a specific research question. The purpose of the data use agreement is to ensure confidentiality of patients and providers in the SEER registry. The purchase fee for a comprehensive cohort of gynecologic cancers may range from US$10 000 to US$15 000, depending on which files are requested. Application review and approval may take several months. The SEER–Medicare application instructions and fee for obtaining SEER–Medicare data, as well as a comprehensive summary of database files, data dictionaries explaining the variables included, and available support services, are available online.31
The database itself is stored by the US government and provided to the researcher in a secure password-protected fashion after data use agreement approval. The principal investigator must provide a detailed storage and protection plan for the data (storage on a secure password-protected server). Data must be stored at the institution of the principal investigator listed on the agreement and may not be transferred to or shared with other institutions. Data use agreements are valid for 5 years from the date of approval and are not typically extendable. At the conclusion of the agreement, all data including analytic files must be destroyed.
The data are stored in separate files including the Cancer File, and multiple Medicare claims files organized by inpatient, outpatient, drug, home health agency, durable medical equipment, hospice, hospital, census tract, and zip-code characteristics. Patients are identified by a unique identifier across these files to permit linking and merging of the data for analysis. Thus, extensive computer programming and coding is required to collate and merge information to answer a particular research question. The data files can be analyzed using a statistical software program such as SAS, Stata, R, or other similar computing capabilities.
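The kind of linkage described above, where a cancer record is joined to its claims through a shared patient identifier, can be sketched as follows. This is a toy example under stated assumptions: the field names (`patient_id`, `code`, `paid`), the placeholder billing codes, and all values are hypothetical and do not reflect the actual SEER–Medicare file layouts, which are documented in the official data dictionaries.

```python
from collections import defaultdict

# Hypothetical extracts from a cancer file and a claims file.
cancer_file = [
    {"patient_id": "A1", "site": "uterus", "dx_year": 2015},
    {"patient_id": "B2", "site": "ovary", "dx_year": 2016},
]
outpatient_claims = [
    {"patient_id": "A1", "code": "CODE_X", "paid": 120.0},
    {"patient_id": "A1", "code": "CODE_Y", "paid": 80.0},
    {"patient_id": "C3", "code": "CODE_X", "paid": 50.0},  # no cancer record
]


def link_claims(cancer_rows, claim_rows, key="patient_id"):
    """Attach claim counts and total payments to each cancer record."""
    claims_by_patient = defaultdict(list)
    for claim in claim_rows:
        claims_by_patient[claim[key]].append(claim)

    linked = []
    for row in cancer_rows:
        claims = claims_by_patient.get(row[key], [])
        linked.append({
            **row,
            "n_claims": len(claims),
            "total_paid": sum(c["paid"] for c in claims),
        })
    return linked


for record in link_claims(cancer_file, outpatient_claims):
    print(record)
```

Indexing the claims by identifier once, rather than scanning the claims file for every patient, matters in practice because claims files are typically far larger than the cancer file.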
Database Highlights
SEER–Medicare is particularly useful for identifying and validating actual treatments received by patients immediately preceding or following a cancer diagnosis across the USA. For example, for endometrial cancer, one can determine the chemotherapy drug infused (paclitaxel, carboplatin, cisplatin, or other) during adjuvant therapy and whether it was given concurrently or sequentially, with or without radiation.32 It allows researchers to address research questions in rarer but more aggressive subtypes of gynecologic cancers, such as uterine carcinosarcoma.33 It also permits capture of rare outcome events, such as venous thromboembolism or stroke, that might be difficult to study using smaller institutional or local databases.34 It allows for evaluation of outcomes related to cancer and treatment-based factors, such as overall survival and cancer-specific survival. It also permits analyses regarding healthcare use, national and regional patterns of care, and cost of care rendered.35
Database Limitations
The SEER–Medicare database has a population of patients with cancer that is predominantly over 65 years of age because SEER–Medicare specifically reports on fee-for-service Medicare patients who have been diagnosed with cancer. Nearly all Medicare enrollees are 65 years or older because adults in the USA may start enrollment at age 65 according to federal regulations. A small subset of adults who are younger than 65 may also enroll in Medicare if they qualify on the basis of a disability, end-stage renal disease, or amyotrophic lateral sclerosis. Thus, SEER–Medicare is less useful for cancers that primarily occur in younger populations. The SEER–Medicare database does not include patients enrolled in Medicare Advantage plans. As enrollment in Medicare Advantage plans increased to over 50% in 2023,36 a smaller proportion of patients are included in SEER–Medicare.
SEER–Medicare does not specifically capture measures regarding functional status or quality of life. It also does not directly capture metastases occurring after diagnosis. Thus recurrence, progression, and derived outcomes, such as disease-free survival, are not available. It does not capture intent of treatment, physician treatment recommendations, or patient preferences related to treatment. Pathology-specific limitations include lack of surgical margin status and lymphovascular space invasion. It also does not include laboratory or imaging test results, body mass index, or American Society of Anesthesiologists score. SEER–Medicare does not directly record treatment-related toxicities, although acute events can be identified if a clinician has billed for them or if the patient received treatment for them. It does not necessarily specify the start date of chronic conditions (since these could have started years before a patient enrolled on Medicare). SEER–Medicare provides a thorough explanation of the limitations of its dataset.37
To protect confidentiality, any reporting of groups of fewer than 11 patients must be masked. For example, if a sample contains only eight patients with a certain diagnosis, it could not be published as ‘8’ but could be reported as ‘<11.’ When working with specific gynecologic cancer populations, some of the resulting cell counts may frequently be quite small, and care must be taken to report these in a manner that preserves confidentiality. SEER–Medicare provides guidance on appropriate reporting of data for small cell counts.38
MarketScan Database
Merative MarketScan is a widely used commercial source of healthcare data composed of insurance claims for individuals enrolled in employer-sponsored commercial health insurance plans and supplemental Medicare plans, as well as Medicaid beneficiaries from 11 states.39 MarketScan data include claims for more than 270 million unique patients. Following a 2022 reorganization, MarketScan data are owned and licensed by Merative, a standalone healthcare data and analytics company that was previously a division of IBM.39 Merative provides services to employers, health plans, and state Medicaid agencies who agree to contribute data to MarketScan. These data are standardized and de-identified and can be licensed for commercial or academic research.
Data Included
MarketScan covers all cancer types, and includes records dating back to 1995.39 Data are updated regularly with an approximate 12–24 month delay to data availability. As of 2023, the most recent year available was 2022. MarketScan is one of the most contemporaneous large data sources in the USA. MarketScan data include claims submitted by hospitals, outpatient clinics, pharmacies, and other healthcare facilities. Claims data include International Classification of Diseases (ICD) and Current Procedural Terminology (CPT) codes as well as charges, reimbursement, and out-of-pocket costs.40 MarketScan includes health plan enrollment and demographic data, claims, and benefit plan information. For some populations, data on disability and workplace absence, inpatient drug administration, dental care, and laboratory results are also available through linkages to specialized databases. Custom linkages are also possible.
Access, Management, and Storage
To conduct research using MarketScan, investigators and their institutions must work with Merative (previously IBM Watson Health and Truven) to obtain a license. Costs can range up to US$50 000 and may be prohibitive for researchers without extramural funding or significant institutional support. Some academic institutions may provide MarketScan access to faculty through an existing license. Researchers desiring to access these data should contact Merative through their website.41 Licenses are finite and must be renewed after expiration. Furthermore, querying and analyzing the data may require an experienced analyst or programmer.
Database Highlights
MarketScan data are useful for describing trends and patterns of use of cancer-related care, measuring the financial burden of such care, and investigating clinical outcomes that are amenable to measurement in claims data.
Database Limitations
The availability of some demographic characteristics, such as race and ethnicity and socioeconomic variables, is limited.42 Furthermore, while cancer care can be identified through diagnosis and procedure codes, histologic information, stage, recurrence, and cause of death are difficult to ascertain reliably from claims data, limiting the usefulness of MarketScan for studies focused on long-term cancer outcomes.43 Changes in a patient’s insurance status limit subsequent follow-up. Like other sources of coded data, claims in MarketScan were affected by the transition to the 10th revision of the ICD in 2015. Finally, MarketScan data are drawn from a convenience sample that over-represents employees of large employers and their dependents, leading to the systematic under-representation of individuals employed in small- and medium-sized companies.39
Healthcare Cost and Utilization Project (HCUP) Database
The Healthcare Cost and Utilization Project (HCUP) is the United States health service data platform that is supported by the Agency for Healthcare Research and Quality (AHRQ), one of the twelve federal agencies within the United States Department of Health and Human Services.44 The AHRQ’s central mission is to promote patient safety and quality improvement.
Launched more than three decades ago in 1988 and developed through federal, state, and industry partnership, HCUP is the largest all-payer encounter-based healthcare database in the USA. HCUP includes a group of several databases capturing national and state-level data. The national-level databases include the National Inpatient Sample (NIS), the Nationwide Ambulatory Surgery Sample (NASS), the Nationwide Readmissions Database, the Nationwide Emergency Department Sample, and the Kids’ Inpatient Database. The state-level databases include the State Inpatient Databases, the State Ambulatory Surgery and Services Databases, and the State Emergency Department Databases. This article specifically focuses on NIS and NASS owing to their relevance to surgical care in gynecologic oncology.
NIS is HCUP’s national-level inpatient database.45 It approximates a stratified sample of 20% of discharges in each center from all the HCUP-participating hospitals across 48 states and the District of Columbia. Every year the dataset captures more than 7 million inpatient admissions. In 2020, a total of 4580 hospitals participated in the HCUP program. When weighted for national survey estimates, NIS covered more than 97% of the US population.
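Because the inpatient sample is a stratified sample rather than a full census, national estimates are obtained by weighting each sampled discharge. A minimal sketch of that idea is shown below; the field name `weight`, the example records, and the uniform weight of 5 (what a simple 20% sample would imply) are assumptions for illustration, and the actual weights and variance estimation methods are described in the HCUP documentation.

```python
def national_estimate(records, condition):
    """Sum the discharge weights of sampled records that meet `condition`."""
    return sum(r["weight"] for r in records if condition(r))


# Hypothetical sampled discharges (illustration only).
sample = [
    {"procedure": "hysterectomy", "weight": 5.0},
    {"procedure": "hysterectomy", "weight": 5.0},
    {"procedure": "other", "weight": 5.0},
]

estimate = national_estimate(sample, lambda r: r["procedure"] == "hysterectomy")
print(f"Estimated national discharges: {estimate:.0f}")
```

This sketch produces a point estimate only; standard errors for survey data require methods that account for the stratified design.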
NASS is HCUP’s national-level outpatient database that was started in 2016.46 It approximates a stratified sample of 67% of ambulatory surgery encounters in each hospital-owned center every year. In 2020, nearly 2900 hospital-owned centers across 35 states and the District of Columbia participated in the HCUP data-capturing mechanism, collecting 7.8 million outpatient surgical encounters in that year. For research on gynecologic malignancies, the strength of these databases lies in the evaluation of surgical care and outcomes. These databases have been used widely in gynecologic oncology, including in a number of studies examining trends, patterns of care, and outcomes.
Data Included
Key data elements of the NIS program are patient demographic data, hospital parameters, length of stay, total charges, and mortality data during the admission. In addition, the program captures a maximum of 40 diagnoses and 25 procedures for the index admission in each encounter. These diagnoses and procedures are recorded via WHO’s ICD codes. As of the fourth quarter of 2015, the program transitioned to the 10th revision (ICD-10): clinical modification (ICD-10-CM) for diagnostic codes and procedure coding system (ICD-10-PCS) for procedure codes. The ICD-10-PCS data include time to the index procedure. HCUP also provides its own codes for clinical categories (diagnosis-related group).
Unlike the NIS program, NASS captures the surgical procedure data via the American Medical Association’s CPT codes. The diagnosis codes in NASS follow the ICD-10-CM. NASS also provides HCUP’s own clinical categorical schema (clinical classifications software-services and procedures).47
Access, Management, and Storage
Both NIS and NASS are publicly available, de-identified databases and are available through the HCUP central distributor.48 Investigators take a review course covering the outline and content of the data platform, data handling and storage, and reporting guidelines. Each dataset covers one full calendar year from January to December, and the cost of the dataset varies across the program.
Database Highlights
First, these databases can provide temporal national trends of patient demographics, surgical practice, and peri-operative outcomes (for example, temporal trends of hysterectomy route after the LACC trial report).49 Second, these databases are useful for assessing the characteristics in certain areas, such as (i) uncommon clinical entities (pregnancy and malignancy,50 pelvic organ dysfunction surgical treatment in the setting of gynecologic malignancy),51 (ii) diseases that are not captured in other tumor registries (cervical and endometrial pre-malignancies),52 53 (iii) outpatient surgical practices (same-day hysterectomy, operative hysteroscopy), and (iv) rare surgical procedures (pelvic exenteration,54 cesarean radical hysterectomy).55 Third, these databases may be useful to examine peri-operative morbidity and mortality. Unique information in these databases to assess peri-operative outcomes includes patient factors (obesity, medical comorbidity, and frailty) and hospital parameters (bed capacity, location, and teaching status). Failure-to-rescue (mortality risk after peri-operative morbidity)56 is an important quality metric of peri-operative care, and the NIS program can be used to assess this outcome. The impact of a national-level healthcare crisis can also be assessed (COVID-19 inpatient case fatality in patients with cancer).57
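Failure-to-rescue, as defined above (mortality among patients who experienced peri-operative morbidity), reduces to a simple proportion once complications and in-hospital deaths have been flagged. The sketch below assumes hypothetical boolean fields `complication` and `died`; how complications are identified from diagnosis and procedure codes is study-specific and is not shown here.

```python
def failure_to_rescue_rate(admissions):
    """Deaths among admissions with a complication, as a proportion.

    Returns None when no admission had a complication.
    """
    with_complication = [a for a in admissions if a["complication"]]
    if not with_complication:
        return None
    deaths = sum(1 for a in with_complication if a["died"])
    return deaths / len(with_complication)


# Hypothetical admissions (illustration only).
admissions = [
    {"complication": True, "died": False},
    {"complication": True, "died": True},
    {"complication": False, "died": False},
    {"complication": False, "died": False},
]
print(failure_to_rescue_rate(admissions))
```

Deaths without a recorded complication are excluded from the numerator in this sketch; some definitions treat them differently, so the definition used should be stated explicitly in any analysis.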
Database Limitations
Important limitations of these databases include lack of information on cancer stage, histology, and post-discharge data, including long-term oncologic outcome (recurrence and death) and adjuvant therapy. Despite these limitations, the HCUP databases are valuable research tools for investigators for whom surgical care and peri-operative management are of interest. The data available through these platforms allow for a broad understanding of national outcomes that has proved valuable for clinical practice in a number of scenarios.
Premier Healthcare Database (PHD)
The Premier Healthcare Database (PHD) is a large, all-payer database that captures data from hospitals across the USA. The PHD originated in 1997 and the data source has evolved over time. Currently, PHD collects data annually from over 700 hospitals, with the exact number of participating hospitals varying each year. The hospitals sampled are from across the USA and include non-profit and non-governmental teaching and community hospitals in both urban and rural settings.58
Data Included
PHD captures both inpatient and outpatient encounters at participating hospitals and, as of 2020, contained over 231 million unique patients. PHD includes comprehensive inpatient data for all participating hospitals. Each year, more than 10 million inpatient visits, approximately 25% of inpatient admissions in the USA, are captured in the PHD. It also captures hospital-based outpatient visits, including from ambulatory surgery centers, emergency departments, and alternative sites of care. Each year, more than 93 million outpatient visits are logged in the PHD.58
Within the PHD, patients are assigned a unique identifier and can be tracked across visits within a given hospital. Each PHD encounter records clinical and demographic data, physician and hospital data, and service data from the encounter. Clinicodemographic data that are available include patient sex, race and ethnicity, payer, and age. PHD obtains ICD codes, including both ICD-9 and ICD-10 codes, for both diagnoses and procedures, and CPT codes and Healthcare Common Procedure Coding System codes. Admission and discharge diagnoses can be discerned and the disposition of the patients at discharge is available.58
Each encounter includes data elements characterizing the treating facility, including region of the country in which the hospital is located, hospital bed size, hospital teaching status, and whether the hospital is in an urban or rural location. Comparative analyses have demonstrated that the make-up of hospitals in PHD is roughly similar to that in survey data provided by the American Hospital Association, although there are fewer smaller hospitals in PHD. In addition to hospital characteristics, the specialties of the admitting and attending physicians are available in PHD.
The PHD captures charge coding, which allows identification of medications, tests, services, and devices that were used during a given patient encounter. These data in turn are used to determine costs associated with each encounter.58 Details of chemotherapy treatment and radiation therapy are obtained if they are administered at a hospital-owned facility.
Access, Management, and Storage
The PHD is a proprietary database that must be licensed from the company for use.59 All data are de-identified. Analysis can be performed using standard statistical software.
Database Highlights
The PHD has been used widely in numerous studies in healthcare. The comprehensive capture of inpatient data and hospital-based outpatient care is well suited to studies examining safety and quality.60 61 Similarly, the PHD has been widely used to explore real-world comparative effectiveness of treatments.62 63
The PHD also has a number of unique features that have been leveraged. Only a limited number of data sources capture prescription drug use in hospitalized patients. The pharmacologic data within PHD can be used to examine inpatient drug prescribing. Likewise, inclusion of devices and services used by patients is unique and has been leveraged to study device use where an ICD or CPT billing code may not yet exist.64 65 Finally, unlike many data sources that capture charges, the PHD obtains the actual cost to a hospital to deliver a service. Some hospitals report actual cost data, whereas others report cost-to-charge ratios.66
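The distinction between reported costs and cost-to-charge ratios can be sketched as follows. This is a minimal illustration that assumes the conventional relationship (estimated cost = charge × cost-to-charge ratio); the field names `reported_cost`, `charge`, and `cost_to_charge_ratio` are hypothetical and are not actual PHD variable names.

```python
from typing import Mapping, Optional


def estimate_cost(charge: float, cost_to_charge_ratio: float) -> float:
    """Estimate cost from a billed charge and a hospital cost-to-charge ratio."""
    if charge < 0:
        raise ValueError("charge must be non-negative")
    if cost_to_charge_ratio <= 0:
        raise ValueError("cost-to-charge ratio must be positive")
    return charge * cost_to_charge_ratio


def encounter_cost(encounter: Mapping[str, Optional[float]]) -> float:
    """Use the reported cost when present; otherwise fall back to charge x ratio.

    The keys used here are hypothetical placeholders for illustration only.
    """
    reported = encounter.get("reported_cost")
    if reported is not None:
        return reported
    return estimate_cost(encounter["charge"], encounter["cost_to_charge_ratio"])


# Made-up encounters: one hospital reports cost directly, the other a ratio.
direct = {"reported_cost": 1200.0}
derived = {"reported_cost": None, "charge": 1000.0, "cost_to_charge_ratio": 0.4}

print(encounter_cost(direct), encounter_cost(derived))
```

Keeping the two sources behind a single function makes it easy to flag, in a sensitivity analysis, which encounters relied on an estimated rather than a reported cost.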
Database Limitations
The PHD has several limitations. First, while an individual patient’s encounters within a given hospital can be linked, encounters of the same patient at different hospitals cannot be linked. Second, although the PHD records hospital-based outpatient treatments, such as emergency department visits, other outpatient care, such as physician office visits and outpatient prescription drug use, is generally not captured. As such, the PHD is often best suited to studies examining episodic, hospital-based care as opposed to longitudinal care that spans inpatient and outpatient settings. Finally, like many administrative databases, the PHD lacks granular disease characteristics (eg, histopathologic type and staging) that may be available in tumor registries that are linked to disease-specific data.
National Surgical Quality Improvement Program (NSQIP)
After the historic success of the Veterans Affairs surgical database in reducing surgical morbidity and mortality, the American College of Surgeons (ACS) developed the NSQIP database to replicate that success in the private sector.67 The NSQIP now operates in more than 700 hospitals across the USA in 49 of 50 states. The exception is Michigan, where Blue Cross Blue Shield of Michigan funds a similar program, the Michigan Surgical Quality Collaborative.
These programs are unique compared with other datasets discussed in this article for several reasons. First, these programs deploy abstractors within the participating hospitals to abstract patient-level data of a subset of patients. Both programs deploy sampling methodology to capture a representative subset of patients. Details of their sampling methods have been described.68 Second, the abstractors (or nurses in some cases) call the patients at 30 days to determine surgical morbidity and mortality. Finally, both programs report the hospital-level de-identified information to compare participating hospitals and encourage best practices to reduce surgical morbidity and mortality. Given that Michigan Surgical Quality Collaborative data are not widely available to anyone outside of Michigan, the remainder of this section will focus on NSQIP data only.
Access, Management, and Storage
The NSQIP dataset resides with the ACS. Participating organizations can obtain the dataset from the ACS website.69 The full dataset is available for 2005 to 2020. In addition, in 2014 the ACS created a special procedure-targeted dataset for hysterectomies. The application process for the NSQIP dataset is relatively simple, provided that the requester’s institution participates in the NSQIP and the requester uses an institutional email address. Once the application is completed and the applicant’s institutional leadership approves the request, the dataset is available for download free of charge. Only participating institutions can use the data.
Data management and storage requirements for NSQIP are not as clearly specified on the ACS website as those for the Medicare data or the NCDB datasets. However, given that these datasets are compliant with the Health Insurance Portability and Accountability Act, we recommend that users store them on a secure drive with encrypted access to prevent unauthorized use. The ACS website does not specify any limit on how long institutions can keep the data. However, if the user moves from an institution enrolled in NSQIP to one that is not enrolled in the program, that individual loses access to the data.
Data Included
The primary dataset includes several variables on pre-operative workup, diagnosis codes, type of surgery performed, operative time, length of stay, and 30-day outcome measures, such as surgical complications, readmissions, and death. Basic pre-operative comorbidity information is also available in the NSQIP dataset.
The NSQIP hysterectomy datasets contain additional information specific to hysterectomy procedures, such as the subspecialty of the surgeon, the size of the uterus, and whether or not cystoscopy was performed. More details of these procedure-specific variables are available on the ACS website in the participant user file guides.70 Unfortunately, no official linkage of NSQIP to any other dataset is available. Moreover, attempts to join NSQIP and NCDB to study the impact of short-term (30-day) outcomes on long-term survival outcomes have failed in the past.71 No other linkage has been widely used for this dataset.
Database Highlights
The NSQIP database is unusual in that it is not administrative. In other words, the data are not collected from insurance claims (unlike Medicare or the NIS datasets). The fact that every sampled patient is followed up for 30 days to ascertain outcomes makes the validity of this dataset far superior to that of many administrative datasets. However, in recent years, individual departments have tried to validate the NSQIP data against their internal registries and found that the dataset has several inconsistencies, especially in the disease-specific participant user files.72 Nevertheless, given the paucity of large registries linking surgical outcomes in the USA, NSQIP is the best available dataset for monitoring the quality of participating institutions. A large sample size (close to 900 000 cases in the 2020 file) gives the dataset adequate statistical power to study rare events and complications.
Database Limitations
Like any sizeable national dataset, NSQIP has several limitations. The NSQIP dataset focuses solely on the 30-day outcomes of a procedure. The appropriateness of surgery cannot be assessed with this dataset, as no variables capture the indication for surgery or the alternatives offered. Moreover, the dataset does not include any patient-reported outcomes to elucidate the success of surgery. For example, a patient undergoing a hysterectomy for pelvic pain might not have experienced any complications, but there might not have been any improvement in her pain as a result of surgery. Neither NSQIP nor any other large dataset available in the USA can answer this question.
NSQIP variables frequently change, evolve, or get dropped completely from the dataset; researchers should carefully refer to the yearly participant user file guides before linking multiple years for trend analysis. NSQIP does not provide hospital-level data. Therefore, researchers seeking answers about the relationships between procedure volumes and outcomes should use a different data source. Finally, one of the most significant limitations of the ACS NSQIP dataset is that no prospective study has shown that participation in NSQIP translates to better patient outcomes at those hospitals. A 2015 quasi-experimental study showed that surgical quality improvement and Medicare payments were unrelated to participation in NSQIP. In other words, hospitals not participating in NSQIP performed just as well as those participating in this program.73
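Because variables can appear or disappear between annual files, a simple check of variable availability before pooling years may be helpful. The sketch below is one possible approach, not a prescribed workflow; the variable names (`case_id`, `operative_time`, `readmission`, `uterine_weight`) and the years shown are hypothetical placeholders, and the participant user file guides remain the authoritative source for variable definitions.

```python
from typing import Dict, List, Set


def common_variables(columns_by_year: Dict[int, Set[str]]) -> Set[str]:
    """Return the variables present in every annual file."""
    column_sets = list(columns_by_year.values())
    if not column_sets:
        return set()
    return set.intersection(*column_sets)


def missing_by_year(columns_by_year: Dict[int, Set[str]]) -> Dict[int, List[str]]:
    """For each year, list variables that exist in some other year but not this one."""
    all_variables: Set[str] = set().union(*columns_by_year.values())
    return {
        year: sorted(all_variables - columns)
        for year, columns in columns_by_year.items()
    }


# Hypothetical column inventories for three annual files.
inventory = {
    2012: {"case_id", "operative_time", "uterine_weight"},
    2015: {"case_id", "operative_time", "readmission", "uterine_weight"},
    2018: {"case_id", "operative_time", "readmission"},
}

shared = common_variables(inventory)
gaps = missing_by_year(inventory)
print(sorted(shared))
print(gaps)
```

A trend analysis would then be restricted to the shared variables, or to the years in which the variable of interest is available. Matching names alone do not guarantee that a variable's definition or coding stayed the same across years.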