Validity Evidence for Procedure-specific Competence Assessment Tools in Orthopaedic Surgery: A Scoping Review

Changes in orthopaedic surgery residency training brought on by work-hour restrictions and reduced surgical caseloads have led programs to incorporate new techniques for evaluating residents.1–3 In the age of competency-based medical education, the assessment of technical skills relies on frequent evaluations by multiple observers over time and is shifting from subjective toward objective assessment.1

Current objective assessment tools can be classified as global rating scales, procedure-specific tools, or hybrid scales.4 Global rating scales are generic tools that can be used to assess performance for multiple different procedures, whereas procedure-specific tools can best address the specificity required for competency-based medical education and generate specific feedback for trainees.4 Hybrid scales combine task-specific checklists with global rating scales and enjoy the benefits of both but as a result take longer to complete.4

Although numerous assessment tools in orthopaedic surgery have been developed,4 the validity evidence supporting these tools is lacking.3,4 Other surgical specialties, including general surgery and cardiothoracic and vascular surgery, have used a validity framework based on content, response process, internal structure, relation to other variables, and consequences to critically appraise assessment tools, with good interrater reliability.5–10 Although other orthopaedic surgery assessment tools have been previously evaluated in the literature,3,4,11 no review studies have specifically examined procedure-specific tools. The purpose of this study was to systematically review the literature on procedure-specific assessment tools in orthopaedic surgery, to assess validity evidence and educational utility for each tool, and to appraise the methodology of the identified studies. We hypothesized that few procedure-specific assessment tools are supported by robust validity evidence.

Methods

This study adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR).12 The PRISMA-ScR checklist is available in Supplemental Figure 1, https://links.lww.com/JG9/A313. A detailed description of the search methodology has been reported elsewhere.9

Search Strategy, Study Selection, and Data Extraction

A health sciences librarian conducted a systematic search in May 2023 of the following eight databases: Ovid MEDLINE, Ovid EMBASE, Ovid PsycINFO, Ovid Health and Psychosocial Instruments, SCOPUS, ProQuest Dissertations and Theses Global, Cochrane Library, and PROSPERO. The search combined the concepts of 'validation,' 'competence,' and 'surgeons,' and no limits were applied. Results were managed with the Covidence systematic review software. Reference lists of included studies were hand-searched for additional studies. At least two independent reviewers conducted initial title and abstract screening, and two reviewers (Y.L., R.C.) screened full-text articles. All conflicts were resolved by consensus. The inclusion criterion was assessment of validity evidence for a procedure-specific orthopaedic surgery competency assessment instrument. Exclusion criteria were assessment of global rating scales (unless modified to be procedure-specific) or bedside procedures (eg, joint aspiration, closed reduction of a fracture, and physical examination), non-English-language studies, and conference abstracts and theses. Two reviewers (Y.L., R.C.) extracted information on each assessment tool using a Microsoft Excel (Microsoft Corp) template created by the authors at the beginning of the study (Appendix 1; https://links.lww.com/JG9/A314).

Validity Evidence, Methodological Rigor, and Educational Utility Assessment

Two independent reviewers (Y.L., R.C.) assessed validity evidence, methodological rigor, and educational utility for each study; disagreements were resolved by consensus. Validity evidence was scored using the five domains of the framework of Ghaderi et al8 (content, response process, internal structure, relation to other variables, and consequences), with a maximum score of 15. Methodological rigor was assessed using the eight items of the Medical Education Research Study Quality Instrument (MERSQI) framework, which address study design, sampling, type of data, validity evidence, data analysis, and outcomes, with a maximum score of 18.13 Educational utility was assessed using the four domains of the Accreditation Council for Graduate Medical Education (ACGME) educational utility framework (ease of use, resources required, ease of interpretation, and educational impact).14
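
Both summary scores are simple sums of independently rated items. The sketch below is illustrative only; the function and variable names are ours and are not part of either published instrument:

```python
# Illustrative tally of a validity-evidence score under the framework of
# Ghaderi et al: five domains, each rated 0-3, summed to a maximum of 15.
# Names are ours, for exposition only.
VALIDITY_DOMAINS = (
    "content",
    "response_process",
    "internal_structure",
    "relation_to_other_variables",
    "consequences",
)

def validity_total(ratings: dict) -> int:
    """Sum the five domain ratings (each 0-3); maximum possible is 15."""
    assert all(0 <= ratings[d] <= 3 for d in VALIDITY_DOMAINS)
    return sum(ratings[d] for d in VALIDITY_DOMAINS)

# Example using the Talbot 2015 row of Table 2: 3 + 1 + 2 + 2 + 1 = 9 of 15.
sopat = {
    "content": 3,
    "response_process": 1,
    "internal_structure": 2,
    "relation_to_other_variables": 2,
    "consequences": 1,
}
print(validity_total(sopat))  # 9
```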

Results

The database search identified 4,450 studies. After 1,894 duplicates were removed, 2,556 studies underwent title and abstract screening, which excluded 2,266 studies. Full texts of 290 studies were reviewed, and 17 studies met inclusion criteria (Figure 1). Review of the reference lists of these 17 studies identified another five studies meeting inclusion criteria, for a total of 22 studies included in the analysis.
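
The flow counts reconcile exactly; a minimal sketch checking the arithmetic (variable names are ours):

```python
# Consistency check of the study-flow counts reported above.
identified = 4450
after_deduplication = identified - 1894        # 2,556 titles/abstracts screened
after_screening = after_deduplication - 2266   # 290 full-text reviews
included = 17 + 5                              # search hits + reference lists

assert after_deduplication == 2556
assert after_screening == 290
assert included == 22
```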

Figure 1: Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram for study screening and inclusion. Reproduced with permission from Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al: The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021;372:n71. doi: 10.1136/bmj.n71.

Study and Assessment Tool Characteristics

We identified 22 studies using 24 procedure-specific surgical assessment tools (Table 1).15–36 These tools assessed a variety of orthopaedic surgery procedures, including diagnostic knee arthroscopy and partial meniscectomy,21,31,36 arthroscopic hip labral repair,26 diagnostic shoulder arthroscopy,21,29 arthroscopic rotator cuff repair and labral repair,15,16,19,22–25,27 open surgical approaches to the shoulder (deltopectoral, lateral deltoid-splitting, and posterior),28 shoulder arthroplasty,20 arthroscopic hamstring anterior cruciate ligament reconstruction,30 open carpal tunnel release,34,35 trigger finger release,34 percutaneous transforaminal endoscopic diskectomy,17 and fracture fixation.18,21,32–34 All tools included a checklist of critical steps that were graded categorically. All tools except the Arthroscopic Bankart Metric and the percutaneous transforaminal endoscopic diskectomy 10-step checklist were hybrid instruments that also included a global rating scale.17,22–25 All but five studies assessed tools in a simulation environment only; of those five, three assessed live operations17,21,29 and two assessed arthroscopic video recordings of operations.15,16 Study participants included residents, fellows, and fellowship-trained attendings. Twenty-one tools were designed to evaluate resident performance, two were designed to distinguish between novice and experienced orthopaedic surgeons,15,16,22–25 and one was designed to evaluate spine surgeons learning a new technique.17 Only two studies, covering four different tools, specified that the tool was intended for formative assessment21,29; the remaining studies did not state whether the tool was meant for formative or summative assessment.

Table 1 - Studies Assessing Procedure-specific Surgical Assessment Tools in Orthopaedic Surgery

| Author | Year | Procedure | Setting | No. of Assessment Tools | Study Participants | Target Population | Formative/Summative |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Demirel | 2022 | Arthroscopic rotator cuff repair | Operating room (video recordings) | 1 | 2 novice surgeons and 2 expert surgeons | Expert vs. novice surgeons | Not stated |
| Demirel | 2017 | Arthroscopic rotator cuff repair | Operating room (video recordings) | 0 (a) | Expert surgeons (number not specified) | Surgeons | Not stated |
| Gadjradj | 2022 | Percutaneous transforaminal endoscopic diskectomy | Operating room | 1 | Spine surgeons | Surgeons | Not stated |
| Hoyt | 2022 | Long bone open reduction and internal fixation | Simulation (animal model) | 1 | 20 residents and attendings | Residents | Not stated |
| Hauschild | 2021 | Arthroscopic Bankart repair | Simulation (cadaver) | 1 | 38 residents | Residents | Not stated |
| Lohre | 2020 | Reverse shoulder arthroplasty | Simulation (cadaver) | 1 | 18 senior residents | Residents | Not stated |
| Wagner | 2019 | Shoulder arthroscopy, knee arthroscopy, ankle open reduction and internal fixation | Operating room | 3 | 8 residents in one study phase and 22 residents in a subsequent phase | Residents | Formative |
| Gallagher | 2018 | Arthroscopic Bankart repair | Simulation (video recordings of cadaver) | 1 | 44 senior residents | Experienced vs. novice surgeons | Not stated |
| Angelo | 2015(23) | Arthroscopic Bankart repair | Simulation (video recordings of cadaver) | 0 (b) | None | Experienced vs. novice surgeons | Not stated |
| Angelo | 2015(24) | Arthroscopic Bankart repair | Simulation (video recordings of cadaver) | 0 (b) | 12 senior residents and 10 shoulder surgeons | Experienced vs. novice surgeons | Not stated |
| Angelo | 2015(25) | Arthroscopic Bankart repair | Simulation (video recordings of dry model) | 0 (b) | 7 senior residents and 12 shoulder surgeons | Experienced vs. novice surgeons | Not stated |
| Phillips | 2017 | Arthroscopic hip labral repair | Simulation (dry model) | 1 | 37 residents, 5 sports medicine fellows, 5 attendings | Residents | Not stated |
| Dwyer | 2017 | Arthroscopic rotator cuff repair and labral repair | Simulation (dry model) | 2 | Rotator cuff repair: 39 residents, 7 sports medicine fellows, 5 sports medicine fellowship-trained attendings; labral repair: 35 residents, 6 sports medicine fellows, 5 sports medicine fellowship-trained attendings | Residents | Not stated |
| Bernard | 2016 | 3 open surgical approaches to shoulder (deltopectoral, lateral deltoid-splitting, posterior) | Simulation (cadaver) | 3 | 23 residents | Residents | Not stated |
| Talbot | 2015 | Diagnostic shoulder arthroscopy | Operating room | 1 | 6 residents | Residents | Formative |
| Dwyer | 2015 | Arthroscopic hamstring anterior cruciate ligament reconstruction | Simulation (dry model) | 1 | 40 residents | Residents | Not stated |
| Cannon | 2014 | Diagnostic knee arthroscopy | Simulation (virtual simulator) | 1 | 48 postgraduate year (PGY)-3 residents | Residents | Not stated |
| LeBlanc | 2013 | Ulnar fracture fixation | Simulation (virtual simulator and Sawbones) | 1 | 22 residents | Residents | Not stated |
| Yehyawi | 2013 | Complex tibial plafond articular fracture surgery | Simulation (dry model) | 1 | 12 residents | Residents | Not stated |
| Van Heest | 2012 | Trigger finger release, open carpal tunnel release, and distal radius fracture fixation | Simulation (cadaver) | 3 | 27 residents | Residents | Not stated |
| Van Heest | 2009 | Carpal tunnel release | Simulation (cadaver) | 0 (c) | 26 residents and 2 hand fellows | Residents | Not stated |
| Insel | 2009 | Diagnostic knee arthroscopy and partial meniscectomy | Simulation (cadaver) | 1 | 59 residents, 3 sports medicine fellows, 6 sports medicine fellowship-trained attendings | Residents | Not stated |

(a) This study evaluated the same tool as the other Demirel study.

(b) These studies all evaluated the same tool as the Gallagher study.

(c) This study evaluated one of the same tools as the other Van Heest study.


Validity Evidence Assessment (Framework of Ghaderi et al)

Validity evidence was low across all studies, ranging from 1 to 9 of a maximum score of 15 (Table 2).

Table 2 - Detailed Validity Evidence for Procedure-specific Surgical Assessment Tools

| Tool | Article(s) | Content (Max 3) | Response Process (Max 3) | Internal Structure (Max 3) | Relation to Other Variables (Max 3) | Consequences (Max 3) | Total Score (Max 15) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Arthroscopic rotator cuff repair metrics | Demirel 2017 and 2022 | 2 | 1 | 1 | 1 | 0 | 5 |
| Percutaneous transforaminal endoscopic diskectomy 10-step checklist | Gadjradj 2022 | 1 | 0 | 0 | 0 | 0 | 1 |
| OSATS checklist for long bone ORIF | Hoyt 2022 | 1 | 0 | 1 | 1 | 1 | 4 |
| Procedure-specific checklist for arthroscopic Bankart repair | Hauschild 2021 | 1 | 0 | 0 | 0 | 0 | 1 |
| OSATS checklist for reverse shoulder arthroplasty | Lohre 2020 | 1 | 1 | 0 | 0 | 0 | 2 |
| Task-specific checklist for shoulder arthroscopy | Wagner 2019 | 2 | 2 | 1 | 1 | 2 | 8 |
| Task-specific checklist for knee arthroscopy | Wagner 2019 | 2 | 2 | 1 | 1 | 2 | 8 |
| Task-specific checklist for ankle ORIF | Wagner 2019 | 2 | 2 | 1 | 1 | 2 | 8 |
| Arthroscopic Bankart Metric | Gallagher 2018; Angelo 2015(23), 2015(24), 2015(25) | 3 | 1 | 2 | 1 | 0 | 7 |
| Task-specific checklist for arthroscopic hip labral repair | Phillips 2017 | 2 | 0 | 1 | 1 | 0 | 4 |
| Task-specific checklist for arthroscopic rotator cuff repair | Dwyer 2017 | 2 | 1 | 2 | 3 | 0 | 8 |
| Task-specific checklist for arthroscopic labral repair | Dwyer 2017 | 2 | 1 | 2 | 3 | 0 | 8 |
| OSATS checklist for deltopectoral approach to shoulder | Bernard 2016 | 2 | 0 | 2 | 3 | 0 | 7 |
| OSATS checklist for lateral deltoid-splitting approach to shoulder | Bernard 2016 | 2 | 0 | 2 | 3 | 0 | 7 |
| OSATS checklist for posterior approach to shoulder | Bernard 2016 | 2 | 0 | 2 | 3 | 0 | 7 |
| Shoulder Objective Practical Assessment Tool for diagnostic shoulder arthroscopy | Talbot 2015 | 3 | 1 | 2 | 2 | 1 | 9 |
| Task-specific checklist for arthroscopic anterior cruciate ligament reconstruction | Dwyer 2015 | 2 | 1 | 2 | 2 | 0 | 7 |
| Procedural checklist for diagnostic knee arthroscopy | Cannon 2014 | 3 | 2 | 1 | 2 | 0 | 8 |
| OSATS checklist for ulnar fracture fixation | LeBlanc 2013 | 1 | 1 | 1 | 1 | 1 | 5 |
| Procedure-specific checklist for complex tibial plafond articular fracture surgery | Yehyawi 2013 | 1 | 0 | 0 | 1 | 0 | 2 |
| OSATS checklist for carpal tunnel release | Van Heest 2012 and 2009 | 1 | 0 | 2 | 2 | 1 | 6 |
| OSATS checklist for trigger finger release | Van Heest 2012 | 1 | 0 | 1 | 2 | 1 | 5 |
| OSATS checklist for distal radius fixation | Van Heest 2012 | 1 | 0 | 1 | 2 | 1 | 5 |
| Basic Arthroscopic Knee Skill Scoring System checklist for diagnostic knee arthroscopy and partial meniscectomy | Insel 2009 | 2 | 0 | 0 | 2 | 0 | 4 |

OSATS = Objective Structured Assessment of Technical Skills; ORIF = open reduction and internal fixation
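
Each total in Table 2 is the sum of its five domain scores and can be checked mechanically; a minimal sketch over three transcribed rows (the dictionary and abbreviated labels are ours):

```python
# Recompute Table 2 totals from the five domain scores:
# (content, response process, internal structure,
#  relation to other variables, consequences), each rated 0-3.
rows = {
    "SOPAT, diagnostic shoulder arthroscopy (Talbot 2015)": (3, 1, 2, 2, 1),
    "Arthroscopic Bankart Metric (Gallagher/Angelo)":       (3, 1, 2, 1, 0),
    "PTED 10-step checklist (Gadjradj 2022)":               (1, 0, 0, 0, 0),
}
for tool, domains in rows.items():
    print(f"{tool}: {sum(domains)}/15")  # 9/15, 7/15, 1/15
```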

Overall, tools scored highest in the content validity domain: three tools scored 3 (12.5%), 12 scored 2 (50.0%), and nine scored 1 (37.5%). A list of items was available for all but one tool.19 All but five tools were developed by content experts; the remaining five did not specify who developed the tool.15–20 Fourteen tools (58.3%) were revised using a modified Delphi technique.

Tools scored poorly in the response process domain: four tools scored 2 (16.7%), eight scored 1 (33.3%), and 12 scored 0 (50.0%). Sources of evidence in this domain included rater training (4/24, 16.7%), pilot testing (7/24, 29.2%), participant familiarity with the tool (3/24, 12.5%), and qualitative analysis of thought processes (1/24, 4.2%).

The internal structure domain scores were moderate, with nine tools scoring 2 (37.5%), 10 tools scoring 1 (41.7%), and five tools scoring 0 (20.8%). Most tools (19/24, 79.2%) were assessed by intertest reliability. Other forms of evidence presented included measures of interrater reliability (16/24, 66.7%), intrarater reliability (1/24, 4.2%), internal consistency (14/24, 58.3%), and item analysis (2/24, 8.3%).

Tools scored better in the relation to other variables domain, with five tools scoring the maximum of 3 (20.8%), seven scoring 2 (29.2%), nine scoring 1 (37.5%), and three scoring 0 (12.5%). Most tools were correlated with postgraduate level of training (18/24, 75.0%) and a global rating scale (12/24, 50.0%). Other variables correlated with the tools included pass/fail assessments (6/24, 25.0%), self-reported number of times the assessed procedure had previously been performed (6/24, 25.0%), number of months spent in relevant subspecialty rotations (3/24, 12.5%), novice or expert status (1/24, 4.2%), a knowledge test (1/24, 4.2%), and other specialized measures (visualization scale, probing scale, and Precision Score; each 1/24, 4.2%).

Tools scored very poorly in the consequences domain, with three tools scoring 2 (12.5%), six tools scoring 1 (25.0%), and 15 tools scoring 0 (62.5%). Only one tool (4.2%) provided a cut score well supported by data, and only six tools (25.0%) demonstrated support from users for their educational utility and value as determined by postsurvey data.

Methodological Quality (Medical Education Research Study Quality Instrument Framework)

Methodological quality of the studies was moderate, with scores ranging from 5.5 to 16.5 of a maximum score of 18. Most studies scored 11.5 (6/22, 27.3%) or 12.5 (9/22, 40.9%) (Table 3). One study (4.5%), designed to assess face and content validity of the Arthroscopic Bankart Metric, scored 5.5 because it did not assess implementation of the tool.23 Most studies (20/22, 90.9%) lost points for study design because they were single-group cross-sectional studies, and all studies lost points for outcomes because none assessed a change in physician behavior or in patient or healthcare outcomes after use of the tool.

Table 3 - Medical Education Research Study Quality Instrument (MERSQI) Scores

| Study | Year | Study Design (Max 3) | Sampling: Institutions (Max 1.5) | Sampling: Response Rate (Max 1.5) | Types of Data (Max 3) | Validity Evidence (Max 3) | Data Sophistication (Max 2) | Data Analysis (Max 1) | Outcomes (Max 3) | Total Score (Max 18) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Demirel | 2022 | 1 | Not specified | n/a | 3 | 1 | 2 | 1 | 1.5 | 9.5 |
| Demirel | 2017 | 1 | Not specified | n/a | 3 | 1 | 2 | 1 | 1.5 | 9.5 |
| Gadjradj | 2022 | 1 | 1.5 | 1.5 | 3 | 0 | 2 | 1 | 1.5 | 11.5 |
| Hoyt | 2022 | 1 | 1 | 0.5 | 3 | 0 | 2 | 1 | 1.5 | 10 |
| Hauschild | 2021 | 2 | 0.5 | 1.5 | 3 | 0 | 2 | 1 | 1.5 | 11.5 |
| Lohre | 2020 | 3 | 1.5 | 0.5 | 3 | 0 | 2 | 1 | 1.5 | 12.5 |
| Wagner | 2019 | 1 | 0.5 | 0.5 | 3 | 2 | 2 | 1 | 1.5 | 11.5 |
| Gallagher | 2018 | 1 | 1.5 | 0.5 | 3 | 1 | … | … | … | … |
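
Likewise, each MERSQI total is the sum of its eight item scores; a one-line check against the Lohre 2020 row (values transcribed from Table 3):

```python
# MERSQI item scores for Lohre 2020: study design, sampling (institutions,
# response rate), types of data, validity evidence, data sophistication,
# data analysis, and outcomes.
items = (3, 1.5, 0.5, 3, 0, 2, 1, 1.5)
assert sum(items) == 12.5  # matches the reported total of 12.5 of 18
```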
