Use of human predictive patch test (HPPT) data for the classification of skin sensitization hazard and potency

The HPPT database

As noted above, the work presented here utilized a database of 2277 test results for 1366 unique test substances, which is described in detail in another paper from this group (Strickland et al. 2023). This database contains data generated with two major HPPT designs (the HMT and the HRIPT), both of which are explicitly mentioned in the GHS text. Unless noted otherwise, the evaluations presented in this review are based on this full database.

Importantly, we introduced a “Relative Reliability Score” [RRS, cf. OECD (2021b) and Strickland et al. (2023)] to assess the reliability of individual study results from “highly reliable” (RRS = 1) to “not reliable” (RRS = 5). We only included study results with an RRS of 1–4 (2255 of the total of 2277 results in the database) in the assessment reported here. The reasons for assigning an RRS = 5 are explained in detail in Strickland et al. (2023).

The reference substance list for the OECD DA project initially consisted of the “Cosmetics Europe” reference list of 128 substances, described in detail in Hoffmann et al. (2018). In the course of the OECD work, some of these substances were removed because of variable or ill-defined composition, while other substances were added, mainly to broaden the set of LLNA-negative reference chemicals. The final OECD DA reference list contains 196 reference substances with in vitro data [Direct Peptide Reactivity Assay (DPRA, OECD TG 442C), KeratinoSens (OECD TG 442D), human Cell Line Activation Assay (h-CLAT, OECD TG 442E)] and varying degrees of coverage by LLNA and HPPT data (OECD 2021a, d). In this review, the subset of the full HPPT database associated with these chemicals is referred to as the “OECD DA reference list”.

Modified classification criteria for single HPPT results

To overcome the limitations of the current GHS criteria regarding the sub-categorization based on single HPPT results, we developed a more sophisticated system of classification to answer the following two questions:

1. If a positive test result (≥ 1 sensitized individual) is obtained at an induction DSA > 500 µg/cm2, (how) can the likelihood of a positive outcome at ≤ 500 µg/cm2 be determined?

2. Is there a DSA or test concentration < 100% at or above which a negative test result (no sensitized individual) can be accepted as such without confirmation that this DSA or concentration represented the highest achievable value, and (how) can this concentration be determined?

DSA1+

To address Question 1 above, we needed to estimate whether the minimum induction DSA causing a positive test result (i.e., the DSA sensitizing exactly one test subject) is less than, equal to or greater than 500 µg/cm2. We started by introducing the “DSA1+” (i.e., the hypothetical DSA that sensitizes exactly one test subject). If test data at different DSAs were available, the number of sensitized individuals could be plotted versus the DSA and a benchmark dose could be derived. However, in practice this is not possible for most HPPT data, which are usually generated using only one test concentration (i.e., only one dose–response data point is available). In this situation, one approach to estimate the DSA1+ is by linear extrapolation of the induction DSA causing the number of positive responses observed in the test to the hypothetical DSA resulting in exactly one positively tested individual, i.e., to calculate the DSA1+ as DSA1+  = DSA/(number of sensitized individuals).

For example, if five individuals tested positive at an induction DSA of 300 µg/cm2 in a given test, the DSA1+ hypothetically resulting in only one sensitized individual would be one fifth of that DSA, i.e., 60 µg/cm2. Note that this is not meant to imply that the dose–response is actually linear; it is merely an approximation in the absence of relevant dose–response information, similar to the linear inter-/extrapolation used to determine EC3 values from LLNA dose–response data.

The DSA1+ can then be used for classification under the GHS in the same way as the DSA:

If DSA1+ ≤ 500 µg/cm2, the test result leads to classification as Skin Sens. 1A.

If DSA1+  > 500 µg/cm2, then classification as Skin Sens. 1B is appropriate.
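The extrapolation and the two classification rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the text; the function names `dsa1_plus` and `ghs_subcategory` are hypothetical and are not taken from the supplementary R script.

```python
def dsa1_plus(dsa_ug_cm2: float, n_sensitized: int) -> float:
    """Linearly extrapolate the induction DSA to the hypothetical dose
    that would sensitize exactly one test subject (DSA1+)."""
    if n_sensitized < 1:
        raise ValueError("DSA1+ is only defined for positive test results")
    return dsa_ug_cm2 / n_sensitized


def ghs_subcategory(dsa1p_ug_cm2: float, cutoff_ug_cm2: float = 500.0) -> str:
    """Apply the GHS 1A/1B cut-off of 500 ug/cm2 to a DSA1+ value."""
    return "1A" if dsa1p_ug_cm2 <= cutoff_ug_cm2 else "1B"


# Worked example from the text: five sensitized individuals at 300 ug/cm2
example = dsa1_plus(300, 5)
print(example, ghs_subcategory(example))  # 60.0 1A
```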

As an example, Fig. 1 shows two substances, one (x) causing exactly one sensitized individual at a DSA (which is also the DSA1+) slightly below the 500 µg/cm2 cut-off, and consequently sub-categorized as Skin Sens. 1A. The other substance (o), tested at a slightly higher DSA above 500 µg/cm2, would be sub-categorized as Skin Sens. 1B under the current GHS criteria, even though many more individuals (six) were sensitized. To compare the potency of both substances, the DSA for the second substance is converted to the DSA1+ by extrapolation, which now clearly falls into the 1A range.

Fig. 1 Comparison of classification results based on positive test results for two different substances (x and o), using the current GHS dose descriptor (DSA) and the newly proposed descriptor DSA1+ (see text for details)

To differentiate classification outcomes obtained in this way from "standard" GHS classifications, we will call them "extrapolated" classifications.

DSA05

As an alternative to the DSA1+, a dose descriptor called "DSA05" (i.e., the induction DSA resulting in 5% of the test panel being sensitized) has been proposed in the literature (Griem et al. 2003). This parameter is obtained by a linear extrapolation approach in much the same way as the DSA1+, i.e., DSA05 = (DSA/% incidence) × 5%.
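As a minimal sketch of this formula, the following Python function computes the DSA05 from the induction DSA, the number of sensitized individuals and the panel size. The function name `dsa05` and the example panel figures are illustrative assumptions, not values from the database.

```python
def dsa05(dsa_ug_cm2: float, n_sensitized: int, panel_size: int) -> float:
    """DSA05 = (DSA / % incidence) x 5%, by linear extrapolation."""
    if n_sensitized < 1 or n_sensitized > panel_size:
        raise ValueError("need 1 <= n_sensitized <= panel_size")
    incidence_percent = 100.0 * n_sensitized / panel_size
    return dsa_ug_cm2 / incidence_percent * 5.0


# Hypothetical example: 1 of 25 subjects (4%) sensitized at 400 ug/cm2
print(dsa05(400, 1, 25))  # 500.0
```

Unlike the DSA1+, the result depends on panel size: the same single responder in a panel of 100 (1% incidence) would give a DSA05 five times the tested DSA.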

The current GHS criteria for HPPT data do not use a percent incidence cut-off or threshold (such as 5% sensitized). Instead, classification of the test substance as a skin sensitizer results from the occurrence of one or more sensitized members of the test panel. It is obvious that this may relate to very different sensitization rates depending on panel size [e.g., 1/25 or 4% in the case of the standard HMT designs (Kligman 1966; Kligman and Epstein 1975) and from 1/50 (2%) to 1/200 (0.5%) in the standard HRIPT designs (Draize 1959; Griffith 1969; Jordan and King 1977; Marzulli and Maibach 1973; Marzulli and Maibach 1980; Politano and Api 2008; Shelanski and Shelanski 1953; Voss 1958)].

Thus, the GHS convention of choosing an absolute incidence threshold limits the comparability of test results obtained with test panels of different sizes. Two substances can be considered equipotent if, under identical test conditions, the same dose of both substances leads to the same magnitude and incidence of effect in a given population (Chiu and Slob 2015). Although magnitude of effect is usually not reported with HPPT results, the DSA05 could be a better choice than the DSA for the direct comparison of relative skin sensitization potencies of two substances, as it allows for comparing doses that caused the same incidence of sensitization in humans under comparable test designs, regardless of panel size. In this review, therefore, we compare the classification outcomes based on both DSA1+ and DSA05.

Limit for acceptance of negative results

To address Question 2 above, we also wanted to estimate a minimum concentration or dose, above which it would be unlikely that a test result would be falsely regarded as negative. As the OECD working group analyzed both HPPT and LLNA reference data in parallel, we wanted to conduct both assessments as consistently as possible. Therefore, we chose an approach suitable for both data types, which was based on the test concentration, and not the DSA (normally not available for LLNA test results).

In a first step, we determined for each positive result in the database the hypothetical concentration leading to exactly one sensitized individual (CONC1+) by linear extrapolation from the observed number of sensitized individuals at the test concentration applied (CONC). In analogy to the approach chosen to calculate the DSA1+ above, CONC1+ is calculated as CONC1+ = CONC/(number of sensitized individuals). We then analyzed the distribution of CONC1+ values for all positive test results in our HPPT database for which a CONC1+ value could be calculated (n = 592 of 605 positive test results in total; see footnote 2). The median CONC1+ was determined to be 1.3%, the 95th percentile 10% and the 99th percentile 25%.

These numbers suggest that if a negative test result were obtained at a test concentration > 25%, the substance could still be a sensitizer, but its potency would be lower than that of 99% of the substances in the database with a positive result. Based on these findings, we decided to use 25% (the 99th percentile) as the cut-off (i.e., minimum test concentration) for test results to be accepted as negative.
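The derivation of such a cut-off can be illustrated as follows. The sketch assumes a nearest-rank percentile, which is only one of several common conventions (the text does not state which one was used), and the input pairs are hypothetical values, not data from the HPPT database.

```python
import math


def conc1_plus(conc_percent: float, n_sensitized: int) -> float:
    """CONC1+ = CONC / (number of sensitized individuals)."""
    if n_sensitized < 1:
        raise ValueError("CONC1+ is only defined for positive test results")
    return conc_percent / n_sensitized


def nearest_rank_percentile(values, p: float) -> float:
    """Nearest-rank percentile (an assumed convention, see lead-in)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical (test concentration in %, number sensitized) pairs
hypothetical_positives = [(10, 5), (2, 1), (20, 2), (100, 4), (5, 10)]
conc1p_values = [conc1_plus(c, n) for c, n in hypothetical_positives]

print(nearest_rank_percentile(conc1p_values, 50))  # 2.0
print(nearest_rank_percentile(conc1p_values, 99))  # 25.0
```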

Borderline or ambiguous classifications

Given the variability and uncertainty associated with the HPPT data as well as the corresponding ambiguity in classification outcome, we defined a borderline range near the 1A/1B cut-off of 500 µg/cm2. For substances with a DSA1+ in this borderline range, there is a higher likelihood of incorrect sub-categorization than for those with DSA1+ values at a greater distance from the cut-off. Since variability and uncertainty around the HPPT data cannot be reliably quantified [cf. OECD (2021b)], the width of the borderline range can only be chosen somewhat arbitrarily. Still, this allows for a more uniform, transparent, and reproducible classification mechanism and is therefore preferable to a subjective, "expert judgment"-based case-by-case approach.

For DSA1+ and DSA05, we chose a borderline range of ± 25% around the 500 µg/cm2 cut-off (i.e., from 375 to 625 µg/cm2). Negative results with a test concentration < 25% were considered ambiguous (see next section for details).

Modified classification approach

Based on the considerations above, we applied the following modified classification approach:

For negative test results with CONC ≥ 25%, the classification outcome was NC (not classified), regardless of the induction DSA value. This means that according to the criteria of the GHS, this test result does not call for a classification of the test substance as a skin sensitizer; it does not, however, mean that this test result proves that the substance is not a sensitizer.

Negative test results with CONC < 25% obtained at an induction DSA > 625 µg/cm2 (i.e., above the upper boundary of the borderline range around the cut-off between sub-categories 1A and 1B) were assigned the ambiguous classification outcome NC/1B. NC/1B indicates that it is not possible to decide between two GHS classification outcomes (NC or 1B) with sufficient certainty. However, for test results with this outcome, the likelihood that sub-categorization as 1A would be appropriate is considered very low (whereas 1B cannot be excluded).

In the case of a negative test result obtained with CONC < 25% and an induction DSA ≤ 375 µg/cm2 (i.e., below the lower boundary of the borderline range around the 500 µg/cm2 cut-off), or for which CONC was not available, the ambiguous classification outcome “NC/1” was assigned. This effectively means that the test result cannot provide any decisive information on whether the substance is a skin sensitizer or not. Therefore, such results were excluded from the overall classification process.

For positive test results, the extrapolated classification was 1B if DSA1+ > 625 µg/cm2, and 1A if DSA1+ ≤ 375 µg/cm2.

Positive test results with 500 µg/cm2 < DSA1+ ≤ 625 µg/cm2 received an extrapolated classification as “1B+”. These test results were interpreted as borderline 1B, showing a moderate sensitization potential (1B), but with some (non-quantifiable) likelihood of underclassification (i.e., of assigning a less strict sub-category than appropriate).

For some of the positive test results, a DSA1+ value was not available. To such cases, we assigned the classification outcome “1” (see footnote 3), in order to reflect that a reliable GHS sub-categorization (1A or 1B) was not possible.

Positive test results with 375 µg/cm2 < DSA1+ ≤ 500 µg/cm2 were classified as “1A−”. These test results were interpreted as borderline 1A, showing a strong sensitization potential (1A), but with some (non-quantifiable) likelihood of overclassification (i.e., of assigning a stricter sub-category than necessary).

Figure 2 shows schematic representations of the current GHS approach (a) and the modified approach for obtaining extrapolated classifications (b).

Fig. 2 Schematic representation of the logic of the classification process for individual HPPT results according to (a) the current GHS criteria and (b) the modified approach discussed herein for generating “extrapolated classifications”. na = not available
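The rules listed above for individual test results can be sketched as a single Python function. This is a simplified reading of the text, not the authors' supplementary script: the function name and argument names are hypothetical, "1A-" is written with an ASCII hyphen, and negative results with CONC < 25% and a DSA between 375 and 625 µg/cm2 (or no DSA) return `None` because the rules as listed in the text do not cover them.

```python
from typing import Optional

CUTOFF = 500.0                # GHS 1A/1B cut-off in ug/cm2
LOWER, UPPER = 375.0, 625.0   # borderline range, +/- 25% around the cut-off
CONC_LIMIT = 25.0             # minimum test concentration (%) for a negative


def extrapolated_class(n_sensitized: int,
                       dsa: Optional[float] = None,
                       conc: Optional[float] = None) -> Optional[str]:
    """Extrapolated classification outcome for one HPPT result."""
    if n_sensitized == 0:                      # negative test result
        if conc is None:
            return "NC/1"
        if conc >= CONC_LIMIT:
            return "NC"
        if dsa is not None and dsa > UPPER:
            return "NC/1B"
        if dsa is not None and dsa <= LOWER:
            return "NC/1"
        return None  # assumption: case not covered by the rules in the text
    if dsa is None:                            # positive, no DSA1+ available
        return "1"
    dsa1p = dsa / n_sensitized
    if dsa1p > UPPER:
        return "1B"
    if dsa1p > CUTOFF:
        return "1B+"
    if dsa1p > LOWER:
        return "1A-"
    return "1A"


print(extrapolated_class(6, dsa=600))          # 1A (DSA1+ = 100 ug/cm2)
print(extrapolated_class(0, dsa=1000, conc=10))  # NC/1B
```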

Combination of multiple HPPT results into a WoE classification result

For a considerable number of substances in the HPPT database, multiple HPPT results were available. In cases in which these results were not fully concordant for a given substance (i.e., where they pointed at different classification outcomes), the overall classification outcome had to be determined using a WoE approach. We applied and compared three different WoE approaches for this purpose. Examples of their application are also provided in OECD (2021b).

The three approaches differed from each other with respect to some variation of underlying rules and assumptions. In this way, we were able to perform a sensitivity analysis regarding the influence of those variations on the overall classification outcome and, hence, the robustness of the overall WoE conclusion.

WoE score method

In this approach, we first determined the extrapolated classification outcomes for the individual test results based on Fig. 2b. Next, each outcome received a numerical score based on the scheme in Table 1.

Table 1 Numerical scores assigned to extrapolated classification outcomes from individual test results

The scores assigned to each extrapolated classification outcome were chosen intuitively (but also—to a degree—arbitrarily) to reflect a possible way in which a risk assessor could combine multiple HPPT results in a WoE assessment. Test results with the ambiguous outcome NC/1 did not receive a numerical score, nor did we consider them for the overall WoE classification.

We then added up the individual scores from all tests and divided the sum by the number of test results to obtain an overall WoE score for each substance, which was rounded to two decimal places.
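The averaging step can be sketched as below. The scores of Table 1 are not reproduced in this text, so the mapping `HYPOTHETICAL_SCORES` contains placeholder values chosen purely for illustration; only the procedure (drop NC/1, average, round) follows the description above.

```python
from typing import Dict, List, Optional

# Placeholder scores for illustration only; NOT the values of Table 1
HYPOTHETICAL_SCORES: Dict[str, float] = {
    "NC": 0, "NC/1B": 1, "1B": 2, "1B+": 3, "1": 3, "1A-": 4, "1A": 5,
}


def woe_score(outcomes: List[str],
              scores: Dict[str, float]) -> Optional[float]:
    """Mean score over all test results of a substance, ignoring NC/1."""
    scored = [scores[o] for o in outcomes if o != "NC/1"]
    if not scored:
        return None  # nothing left to score
    return round(sum(scored) / len(scored), 2)


print(woe_score(["1A", "1B", "NC/1"], HYPOTHETICAL_SCORES))  # 3.5
```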

The “median-like location parameter” (MLLP) method

In addition to the WoE score method, we applied the MLLP approach described for the analysis of LLNA data by Hoffmann and co-workers:

“This parameter was defined as the median for substances with repeat studies with an EC3 in more than 50% of the repeats. For substances with at least 50% negative repeat studies, i.e. no EC3 value was available, the parameter was defined as the modified median. The first step in deriving the modified median was to review the negative studies in detail: when the maximum concentration tested in a given study was lower than the median EC3 of the positive studies for the same chemical, the respective negative study was excluded, because it was considered a limited validity as tested concentrations were too low. From the remaining negative and all positive studies, the median was used as a location parameter (modified median). In the case of 50% of repeat studies being negative and 50% being positive, the highest EC3 value was defined as the modified median.” (Hoffmann et al. 2018).

For adaptation to the extrapolated HPPT-based classification outcomes, however, we needed to further interpret their approach as follows:

Test results with the ambiguous classification outcomes NC/1 or NC/1B were considered negatives in this WoE approach, but only if the database for the chemical under consideration also included studies with a positive outcome (1A, 1B, or 1) and the DSA applied in the NC/1B study was greater than or equal to the median DSA1+ of the positive studies (for which a DSA1+ was available). This rule also implies that if there are only NC/1B outcomes, no MLLP is available.

If 50% or more of the study results remaining after the previous step were positive, the substance was considered a sensitizer; otherwise (i.e., if more than 50% were negative), the overall reference classification was NC.

For GHS sub-categorization, test results with a positive, but ambiguous classification outcome 1 were excluded (in addition to the NC/1 and NC/1B studies with test concentrations that were too low, as described above). Then, the MLLP of the remaining study results was calculated. In the case where a substance had an even number of HPPT results factoring into the WoE assessment, and if the median fell between two test results with DSA1+ values, we calculated the MLLP as the average of those two values. If it fell between the highest negative study and the lowest test result with a DSA1+ value, that DSA1+ value was the MLLP.

If the available individual test outcomes were only NC or NC/1B, the overall MLLP was NC.

The “median sensitization potency estimate” (MSPE) method

The MLLP approach as published in Hoffmann et al. (2018) and further interpreted by us has certain weaknesses. In some cases, it produces less strict WoE results than would seem appropriate or intuitive, compared to how different test results might be brought together in a WoE assessment by a regulator tasked with classifying the respective substance for skin sensitization.

To address these weaknesses, we further modified the MLLP approach as follows to obtain a "Median Sensitization Potency Estimate" (MSPE):

As for the WoE score method, we began by excluding all NC/1 test results from the assessment, since they do not add any relevant information (instead they add noise to the median determination).

Positive test results with the outcome 1 (i.e., without an available DSA1+ value) were included when determining the position of the median.

All test outcomes, whether numerical (DSA1+ values) or categorical (1, NC/1B), were called “Sensitization Potency Estimates” [SPEs, in analogy to the “Acute Toxicity Estimate” (ATE) values of the GHS] and the median test result was therefore called the “Median Sensitization Potency Estimate” (MSPE).

All SPE values were finally arranged in the following order of ascending potency:

NC → NC/1B (see footnote 4) → Numerical SPE (DSA1+) values > 500 µg/cm2 in descending order → 1 → Numerical SPE (DSA1+) values ≤ 500 µg/cm2 in descending order.
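This ordering can be expressed as a sort key, as in the following sketch. It covers only the arrangement of SPE values, not the special rules for choosing the MSPE; the representation of SPEs as either strings ("NC", "NC/1B", "1") or floats (DSA1+ in µg/cm2) and the name `spe_sort_key` are assumptions made for illustration.

```python
from typing import Tuple, Union

SPE = Union[str, float]


def spe_sort_key(spe: SPE) -> Tuple[int, float]:
    """Sort key placing SPE values in order of ascending potency."""
    if isinstance(spe, str):
        rank = {"NC": 0, "NC/1B": 1, "1": 3}[spe]
        return (rank, 0.0)
    # Numerical SPEs: higher DSA1+ means lower potency, so sort descending
    return (2, -spe) if spe > 500 else (4, -spe)


spes = ["1", 60.0, "NC", 800.0, "NC/1B", 550.0, 450.0]
print(sorted(spes, key=spe_sort_key))
# ['NC', 'NC/1B', 800.0, 550.0, '1', 450.0, 60.0]
```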

The value of the MSPE was then determined as follows:

If there were one or more positive results in addition to one or more NC/1B results, but there was no clear NC result, the median DSA1+ of the positive results with numerical values was taken as the MSPE. However, when the number of 1A (including 1A−) study results equaled that of the 1B (including 1B+) results, the MSPE was 1.

If there were one or more NC results and all other test outcomes were NC/1B, the MSPE was NC.

In all other cases (i.e., those in which clear positives and negatives were both present and where the median fell between a numerical and a non-numerical result), the numerical result was taken as the MSPE.

Determination of the overall classification outcome

The outcome from the individual WoE methods was translated into three different overall classification modes, as shown in Table 2:

GHSBIN: Binary classification scheme [i.e., 1 (sensitizer) or NC (not classified)];

GHSSUB: Sub-categorization scheme including the two GHS skin sensitization sub-categories [i.e., 1A (strong sensitizers) and 1B (other skin sensitizers), as well as NC (not classified)];

GHSBORDER: Same as GHSSUB, with the two additional ambiguous classification outcomes (1 and NC/1B). Again, neither 1, nor NC/1B are potency sub-categories; they characterize a limited data situation, where the uncertainty in the assignment of the test substance to a GHS sub-category (or the outcome NC) is high.

Table 2 Translation of WoE score, MLLP and MSPE values into overall GHS reference classifications (na = not applicable)

It is noted that for the actual classification/sub-categorization of the OECD DA reference data, we only used GHSBIN and GHSSUB directly. GHSBORDER was only considered as additional information pointing out that some GHSBIN or GHSSUB classification outcomes were associated with higher uncertainty than others. The GHSBORDER information may be important when weighing the outcome from HPPTs against those from other data sources on skin sensitization, such as the LLNA.

If all three WoE approaches agreed, the overall outcome was considered robust and was used for further evaluation. The same held if one of the three approaches (WoE score, MLLP or MSPE) did not provide a result, but the outcomes based on the remaining two agreed with each other. If only one of the approaches returned a result, the classification outcome from that approach was used.
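A minimal sketch of this agreement rule is given below. A missing result is represented by `None`, and the function returns the agreed outcome together with a flag indicating that the approaches disagree and further rules or expert judgment are needed; both the function name and this return convention are hypothetical.

```python
from typing import Optional, Tuple


def combine_woe_calls(woe_score_call: Optional[str],
                      mllp_call: Optional[str],
                      mspe_call: Optional[str]) -> Tuple[Optional[str], bool]:
    """Return (overall outcome, needs_further_review)."""
    available = [c for c in (woe_score_call, mllp_call, mspe_call)
                 if c is not None]
    if not available:
        return None, False           # no approach returned a result
    if len(set(available)) == 1:
        return available[0], False   # all available approaches agree
    return None, True                # disagreement between approaches


print(combine_woe_calls("1B", None, "1B"))  # ('1B', False)
print(combine_woe_calls("1A", "1B", "1A"))  # (None, True)
```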

Where the three WoE approaches disagreed regarding the GHSSUB outcome (i.e., both 1A and 1B outcomes were present), the overall outcome for the respective substance was decided by rule-guided expert judgment on a case-by-case basis. For eight of the 196 OECD DA reference substances, this decision was made following a detailed discussion in the OECD expert group [for details, see Table 17 in OECD (2021b)]. For two (DSA05), or five (DSA1+, including the two substances for DSA05) additional substances in the full database, the decision was made according to the following rules:

1. If the available data for that substance contained one or more positive HPPT results with a DSA ≤ 500 µg/cm2 (i.e., a clear 1A outcome according to the current GHS criteria), the outcome was set to 1A.

2. If an overall GHSSUB classification outcome could be obtained using DSA05, but not using DSA1+, the DSA05 outcome was also used for DSA1+.

3. If neither rule 1 nor rule 2 was applicable, the overall outcome was decided on a case-by-case basis.

Discordant GHSBORDER outcomes from the individual WoE approaches (i.e., MLLP, MSPE or WoE score) were first examined to determine whether a consensus approach could be applied. The rationale for this was that the ambiguous GHSBORDER outcomes 1 and NC/1B obtained via the individual WoE approaches do not, on their own, allow for a clear decision on the overall classification outcome. However, such results may still support—or at least be compatible (i.e., not in contradiction) with—the outcome of other WoE approaches. For example, the outcome 1 is compatible with the outcomes 1A, 1B, or NC/1B, but incompatible with the outcome NC. Likewise, the ambiguous outcome NC/1B is compatible with the outcomes 1, 1B and NC, but not with the outcome 1A.

As a consequence of these considerations, if the outcome from one of the available WoE approaches (WoE score, MLLP, MSPE) was 1A or 1, while another was NC, or if one was 1A and another NC/1B, the results were considered incompatible, and an expert call was required in analogy to the procedure for GHSSUB above. In all other cases, the consensus approach shown in detail in Table 3 was applied.
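The incompatibility test described in this paragraph can be written compactly as a set check. The sketch below encodes only the three incompatible pairs named in the text (1A with NC, 1 with NC, 1A with NC/1B); the consensus scheme of Table 3 is not reproduced, and the function name is a hypothetical choice.

```python
from typing import Iterable

# Pairs of GHSBORDER outcomes named in the text as incompatible
INCOMPATIBLE_PAIRS = (
    frozenset({"1A", "NC"}),
    frozenset({"1", "NC"}),
    frozenset({"1A", "NC/1B"}),
)


def needs_expert_call(outcomes: Iterable[str]) -> bool:
    """True if the outcomes of the WoE approaches include an incompatible pair."""
    present = set(outcomes)
    return any(pair <= present for pair in INCOMPATIBLE_PAIRS)


print(needs_expert_call(["1A", "NC/1B", "1B"]))  # True
print(needs_expert_call(["1", "1B", "NC/1B"]))   # False
```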

Table 3 Overview of the decision scheme applied, where possible, to obtain an overall GHSBORDER consensus outcome in case of disagreement between the available individual outcomes (▪ = present, empty = absent) based on WoE score, MLLP, and/or MSPE. In all other cases of disagreement, the overall classification was decided based on expert judgment

It is noted that this rule-based system was developed after publication of OECD (2021b) and therefore some of the overall GHSBORDER outcomes given here may differ from those provided in Table 17 of that publication.

Originally, overall classifications were determined only for the OECD DA reference substances. For the present review, we coded the above rules into a script (supplementary files “HPPT-classification.R” and “HPPT-classification.nb”) using the statistical software R v.4.2.2 (R Core Team 2022). The corresponding overall classifications were then calculated for the full HPPT database of 1366 substances based on 2255 test results with RRS < 5 and can be found in the supplementary file “HPPT-classification.xslx” (tabs “DSA1+” and “DSA05”).

Reproducibility of HPPT-based WoE classifications

If we could perform an infinite number of HPPTs with a given substance, this would allow for a determination of the “true” HPPT sensitization potency of that substance and, on that basis, its “true” HPPT-based classification or sub-categorization according to the UN GHS. Reproducibility of that classification or sub-categorization could then be measured by a statistical evaluation of all individual HPPT results against the “true” HPPT-based classification. Unfortunately, this is not possible, and reproducibility must instead be estimated from a limited number of test results, using the WoE-based overall classification described above as a surrogate for the “true” classification. Reproducibility can then be understood as the likelihood that the classification outcome derived from an individual HPPT result matches the outcome from the WoE assessment of all results available for the substance in question.

As in OECD (2021b), we therefore determined the reproducibility of the overall classification result in the following way:

GHSBIN: Reproducibility was calculated as the fraction of all individual HPPT results yielding an unambiguous classification result (1 or NC) for a given chemical that correctly predicted the WoE call based on the three approaches. Studies resulting in an SPE of NC/1 or NC/1B were excluded from this evaluation, since for them GHSBIN was not applicable.

GHSSUB: Reproducibility was calculated as the fraction of all HPPT results yielding an unambiguous classification result (1A, 1B, or NC) for a given chemical that correctly predicted the WoE call based on the three approaches. Studies resulting in an SPE of 1 or NC/1 were excluded from this evaluation, since for them GHSSUB was not applicable. For the same reason, studies resulting in NC/1B were omitted if the overall classification was 1B or NC. They were, however, counted as incorrect predictions if the overall classification was 1A.
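Both reproducibility measures can be sketched as follows. The sketch assumes that borderline individual outcomes (1A−, 1B+) have already been mapped to 1A and 1B before the calculation, which the text does not state explicitly, and the function names are hypothetical.

```python
from typing import List, Optional


def reproducibility_bin(individual: List[str], overall: str) -> Optional[float]:
    """GHSBIN: share of unambiguous results (1 vs NC) matching the WoE call."""
    usable = [r for r in individual if r not in ("NC/1", "NC/1B")]
    if not usable:
        return None
    binary = ["NC" if r == "NC" else "1" for r in usable]
    return sum(b == overall for b in binary) / len(binary)


def reproducibility_sub(individual: List[str], overall: str) -> Optional[float]:
    """GHSSUB: share of results (1A, 1B, NC) matching the WoE call."""
    correct = total = 0
    for r in individual:
        if r in ("1", "NC/1"):
            continue                 # GHSSUB not applicable
        if r == "NC/1B":
            if overall == "1A":
                total += 1           # counted as an incorrect prediction
            continue                 # omitted when overall is 1B or NC
        total += 1
        correct += r == overall
    return correct / total if total else None


print(reproducibility_sub(["1A", "1A", "1B", "NC/1B", "1"], "1A"))  # 0.5
```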

WoE assessment of HPPT- and LLNA-based reference classifications

In the OECD DA project, LLNA-based reference classifications were established using an approach analogous to the one described here for the HPPT data (OECD 2021c). Although outside the scope of the OECD DA project, we compared both classifications, where available for the same substance. It is important to note that we only performed a WoE assessment of the HPPT and LLNA data available to us in the frame of the OECD project. A true overall WoE assessment for the substances in question would need to consider all available data relevant for skin sensitization classification including non-HPPT human data, data from tests in guinea pigs and data obtained from in vitro tests, DAs or in silico models.

HPPT- and LLNA-based results were considered concordant, if they were identical or at least not in contradiction to each other:

The outcome NC/1B was considered concordant with the outcomes 1, 1B and NC.

The outcome 1 was considered concordant with the outcomes 1A and 1B.
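These concordance rules amount to a symmetric lookup, sketched below under the assumption that each classification is passed as a plain string; the names `CONCORDANT_WITH` and `concordant` are illustrative.

```python
# Outcomes treated as concordant with the two ambiguous outcomes
CONCORDANT_WITH = {
    "NC/1B": {"1", "1B", "NC"},
    "1": {"1A", "1B"},
}


def concordant(hppt_call: str, llna_call: str) -> bool:
    """True if the two calls are identical or not in contradiction."""
    return (
        hppt_call == llna_call
        or llna_call in CONCORDANT_WITH.get(hppt_call, set())
        or hppt_call in CONCORDANT_WITH.get(llna_call, set())
    )


print(concordant("NC/1B", "NC"))  # True
print(concordant("1A", "NC/1B"))  # False
```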

Discordant results were resolved by applying the decision logic shown in Fig. 3, which, again, aims to represent regulatory practice as closely as possible.

Fig. 3 Decision scheme for obtaining an overall classification based on all available LLNA and HPPT data, in cases where the three individual classifications based on LLNA, DSA1+ and DSA05 did not fully agree

In short, the basis for this stepwise scheme was as follows:

The presence of one or more clear 1A results in the LLNA database led to classification of the substance as 1A.

A positive HPPT result with a DSA ≤ 500 µg/cm2 in the absence of a clear LLNA 1A result led to classification of the substance as 1A.

If these rules did not apply, and there was disagreement between DSA1+ and DSA05, classification was decided according to the LLNA result. This can also be seen as a majority vote, since in these cases the LLNA agreed with either DSA1+ or DSA05.

For substances not classifiable by any of the three preceding steps, all LLNA and HPPT results (except for those resulting in the outcome NC/1) were combined to determine the overall MSPE according to the rules given in the section “The ‘median sensitization potency estimate’ (MSPE) method”. This was done in parallel using DSA1+ and DSA05 values, with the resulting sub-categorization accepted only if the two MSPE results obtained in this way agreed with each other.

If sub-categorization was still not possible, determination of the MSPE was repeated in parallel based on DSA1+ and DSA05 as described above, using only test results with unambiguous outcomes (1A, 1B, NC).

Finally, if still no sub-categorization was obtained, the LLNA-based classification was compared to the HPPT-based classification determined using the current GHS scheme rather than the extrapolated classification outcomes introduced in this manuscript. The stricter of the two classifications (including subclassification) was then applied as the overall WoE outcome. For example, if the LLNA-based classification was Skin Sens. 1B and the HPPT-based classification was Skin Sens. 1A using DSA1+ or DSA05, but Skin Sens. 1B using the DSA according to the current GHS classification scheme, the overall WoE classification was 1B.

It is noted that the above rules pertain to the WoE-based determination of GHSBIN and GHSSUB. In all cases, GHSBORDER was chosen to reflect the disagreement between LLNA, DSA1+ and/or DSA05.
