Distributions of variables that model phenomena linked to vulnerability exploitation generally have heavy right tails, and therefore there is a tendency to believe that power laws are adequate models. Good references on this subject are Clauset et al. [25] and Stumpf and Porter [26]. For a detailed discussion of heavy-tailed models for vulnerability data, and more specifically for vulnerability lifecycle variables, see Brilhante et al. [10]; for mixture models, namely the hyperexponential, consult Feldmann and Whitt [27]. The lognormal law has also been used in this context because it mimics the linear signature of power laws (see Mitzenmacher [28]). Moreover, the Generalized Pareto (GP) model and, more generally, models with slowly varying tails have been studied as well. For a comprehensive view of the topic, see Brilhante et al. [10].
In statistical analysis, asymptotic results can be useful to support model choices. For example, the extremal limit theorem and extreme value theory (EVT) provide the necessary tools when heavy-tailed models are reasonable candidates to fit a data set. Therefore, Section 3.1 gives a brief account of EVT in the traditional IID framework, and also in the geometrically thinned scheme, since healthcare data breaches may be underreported due to reputational concerns. On this subject, it is worth mentioning that, in contrast with the fast rate of convergence in the central limit theorem for central order statistics, the rate of convergence in the extremal limit theorem for extreme order statistics can be very slow, a well-known fact since Fisher and Tippett’s [29] remark on the penultimate behavior of normalized maxima. This suggests that it is legitimate to look for alternative fits to the traditionally used heavy-tailed models. The results outlined in Section 3.1 are simply intended to indicate that the GEV (more specifically the Fréchet) and loglogistic models can provide useful fits to the data. Readers less familiar with these concepts can skip the more technical details in Section 3.1 and retain only the features of the models, since they are used in Section 5.
3.1 Extreme value theory
The bulk of central order statistics and the central limit theorem play a key role in the majority of the decisions that are based on data analysis. On the other hand, extreme order statistics and the extremal limit theorem are crucial in the analysis of outstanding risks, namely in the analysis of large claims in insurance (see Gomes and Pestana [30], Embrechts et al. [31], and Beirlant et al. [32]).
Fréchet [33] transposed Lévy’s [34] stability theory for sums to maxima, obtaining the Fréchet distribution, with the standardized form
$$\begin{aligned} \Phi _\alpha (x)= \exp \left( -x^{-\alpha } \right) \,\mathbb {I}_{(0,\infty )} (x), \;\alpha >0. \end{aligned}$$
Soon after, Fisher and Tippett [29] established the initial form of the extremal limit theorem obtaining the (later called) Gumbel and max-Weibull types, respectively,
$$\begin{aligned} \Lambda (x)= \exp \left( -\textrm{e}^{-x}\right) \,\mathbb {I}_{\mathbb {R}} (x), \end{aligned}$$
and
$$\begin{aligned} \Psi _\alpha (x) = \left\{ \begin{array}{ll} \exp \left[ -(-x)^\alpha \right] & , \; x<0 \\ 1 & , \; x \ge 0 \end{array} \right. , \; \alpha >0. \end{aligned}$$
Gnedenko [35] demonstrated that the Fréchet, Gumbel and max-Weibull distributions are the only possible limit types for normalized maxima \(M_n=\max \{X_1,\ldots ,X_n\}\) of an IID sequence of random variables \(\left( X_n\right) _{n\in \mathbb {N}}\) with cumulative distribution function F, i.e., they are the only solutions of the stability equation \(F^n(A_n x+B_n)=F(x)\), for appropriate attraction coefficients \(A_n >0\) and \(B_n\in \mathbb {R}\). He also fully characterized the domains of attraction of the Fréchet and max-Weibull types, while the domain of attraction of the Gumbel type was characterized by de Haan [36]. Therefore, Gnedenko and de Haan’s definitive form of the extremal limit theorem provides the adequate framework for the asymptotic choice of extremal models for \(M_n\) when \(n\rightarrow \infty \).
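The stability equation can be checked numerically for the three standardized types. The sketch below is only illustrative: the function names are hypothetical, and the attraction coefficients used (\(A_n=n^{1/\alpha }, B_n=0\) for the Fréchet type, \(A_n=1, B_n=\ln n\) for the Gumbel type, \(A_n=n^{-1/\alpha }, B_n=0\) for the max-Weibull type) are the usual textbook choices, assumed here rather than taken from the text.

```python
import math

def frechet_cdf(x, alpha):
    # Standard Frechet-alpha CDF, supported on (0, infinity)
    return math.exp(-x ** (-alpha)) if x > 0 else 0.0

def gumbel_cdf(x):
    # Standard Gumbel CDF, supported on the real line
    return math.exp(-math.exp(-x))

def max_weibull_cdf(x, alpha):
    # Standard max-Weibull-alpha CDF, equal to 1 for x >= 0
    return math.exp(-((-x) ** alpha)) if x < 0 else 1.0

def stability_gap(cdf, a_n, b_n, n, xs):
    # Largest |F^n(a_n x + b_n) - F(x)| over the grid xs
    return max(abs(cdf(a_n * x + b_n) ** n - cdf(x)) for x in xs)

n, alpha = 50, 2.0                      # illustrative values
xs_pos = [0.1 * k for k in range(1, 60)]
xs_neg = [-x for x in xs_pos]

gap_frechet = stability_gap(lambda x: frechet_cdf(x, alpha),
                            n ** (1.0 / alpha), 0.0, n, xs_pos)
gap_gumbel = stability_gap(gumbel_cdf, 1.0, math.log(n), n, xs_neg + xs_pos)
gap_weibull = stability_gap(lambda x: max_weibull_cdf(x, alpha),
                            n ** (-1.0 / alpha), 0.0, n, xs_neg)

print(gap_frechet, gap_gumbel, gap_weibull)
```

All three gaps should be at the level of floating-point rounding, which is what max-stability means in practice: the maximum of n IID copies has the same type as a single term, up to a change of location and scale.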
On the other hand, von Mises [37] and Jenkinson [38] unified into a single expression the standardized stable limit distribution of normalized maxima, called the general extreme value (GEV) distribution, with cumulative distribution function
$$\begin{aligned} G_\xi (x) = \exp \left[ -(1+\xi x)^{-1/\xi } \right] \, \mathbb {I}_{\{x:\,1+\xi x>0\}} (x), \, \xi \in \mathbb {R}. \end{aligned}$$
(1)
If, in Eq. 1, the shape parameter \(\xi >0\), the Fréchet-\(\alpha \) type, with \(\alpha =1/\xi \), is obtained, and if \(\xi <0\), the max-Weibull-\(\alpha \) type, with \(\alpha =-1/\xi \), is obtained. When \(\xi \rightarrow 0\), \(G_\xi \) defined in Eq. 1 converges to the standard Gumbel distribution, i.e., \(G_0(x) = \exp \left( -\textrm{e}^{-x}\right) \) for any real x. More details on EVT can be found in Gomes and Guillou [39], among other review papers.
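The role of the shape parameter in Eq. 1 can be illustrated with a small numerical sketch. The function `gev_cdf` below is a hypothetical helper, not code from the paper; as a design choice it returns the natural CDF values outside the support (0 below the lower endpoint when \(\xi >0\), 1 above the upper endpoint when \(\xi <0\)), and it treats \(\xi =0\) as the Gumbel limit.

```python
import math

def gev_cdf(x, xi):
    # Standardized GEV CDF of Eq. 1; xi = 0 is read as the Gumbel limit
    if xi == 0.0:
        return math.exp(-math.exp(-x))
    t = 1.0 + xi * x
    if t <= 0.0:
        # Outside the support: 0 on the left (xi > 0), 1 on the right (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

xs = [-0.5 + 0.25 * k for k in range(9)]   # grid from -0.5 to 1.5

# xi -> 0: G_xi approaches the Gumbel distribution G_0
gumbel_gap = max(abs(gev_cdf(x, 1e-6) - gev_cdf(x, 0.0)) for x in xs)

# xi > 0: G_xi(x) = Phi_alpha(1 + xi x) with alpha = 1/xi (Frechet type)
xi_pos = 0.5
frechet_gap = max(
    abs(gev_cdf(x, xi_pos) - math.exp(-(1.0 + xi_pos * x) ** (-1.0 / xi_pos)))
    for x in xs
)

# xi < 0: G_xi(x) = Psi_alpha(-(1 + xi x)) with alpha = -1/xi (max-Weibull type)
xi_neg = -0.5
weibull_gap = max(
    abs(gev_cdf(x, xi_neg) - math.exp(-(1.0 + xi_neg * x) ** (-1.0 / xi_neg)))
    for x in xs
)

print(gumbel_gap, frechet_gap, weibull_gap)
```

The Fréchet and max-Weibull types are therefore recovered up to a change of location and scale, and the Gumbel gap shrinks as the illustrative value \(10^{-6}\) is pushed closer to zero.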
Fig. 3 Power law linear signature in a log-log plot
Table 5 ML parameter estimates and goodness-of-fit results for the OCR penalties data in Table 1
A relevant issue regarding healthcare vulnerability and extreme losses due to malicious breaches is the possibility that some data will not be reported. Rachev and Resnick [17] developed a straightforward theory under the plausible assumption of geometric thinning of the full sequence of IID random variables \(\left( X_n\right) _{n\in \mathbb {N}}\), i.e., that each original term of the sequence is reported with probability \(\theta \) or discarded with probability \(1-\theta \), independently of any other.
Table 6 ML parameter estimates and goodness-of-fit results for the yearly maxima of the OCR penalties data in Table 2
If the IID sequence \(\left( X_n\right) _{n\in \mathbb {N}}\) is Geometric(\(\theta \))-thinned, \(0<\theta <1\), the possible geo-max-stable distributions \( ^g\! G_\xi \) satisfy the relationship \( ^g\! G_\xi (x)=\frac{1}{1-\ln G_\xi (x)}\) (Rachev and Resnick [17]), and therefore the general (standardized) geo-max-stable distribution is
$$\begin{aligned} ^g\! G_\xi (x) = \frac{1}{1-\ln G_\xi (x)}=\frac{1}{1+(1+\xi x)^{-1/\xi }}\,,\quad 1+\xi x>0\, . \end{aligned}$$
(2)
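As a sanity check on Eq. 2, the two expressions for the geo-max-stable distribution can be compared numerically: one obtained by transforming the GEV of Eq. 1 through \(1/(1-\ln G_\xi )\), the other from the closed form. This is a minimal sketch with hypothetical function names; the grid is kept away from the far left tail because the route through \(G_\xi \) underflows to zero there in floating point.

```python
import math

def gev_cdf(x, xi):
    # Standardized GEV CDF (Eq. 1); xi = 0 is read as the Gumbel limit
    if xi == 0.0:
        return math.exp(-math.exp(-x))
    t = 1.0 + xi * x
    if t <= 0.0:
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def geo_gev_from_gev(x, xi):
    # Geo-max-stable CDF through the relationship 1 / (1 - ln G_xi(x))
    g = gev_cdf(x, xi)
    if g == 0.0:
        return 0.0
    return 1.0 / (1.0 - math.log(g))

def geo_gev_closed_form(x, xi):
    # Closed form of Eq. 2; xi = 0 is read as the logistic limit
    if xi == 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    t = 1.0 + xi * x
    if t <= 0.0:
        return 0.0 if xi > 0 else 1.0
    return 1.0 / (1.0 + t ** (-1.0 / xi))

xs = [-1.5 + 0.25 * k for k in range(27)]   # grid from -1.5 to 5.0
gaps = {
    xi: max(abs(geo_gev_from_gev(x, xi) - geo_gev_closed_form(x, xi)) for x in xs)
    for xi in (-0.5, 0.0, 0.5)              # illustrative shape values
}
print(gaps)
```

For numerical work the closed form is the safer choice, since it avoids taking the logarithm of a quantity that may already have underflowed.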
Using \(\alpha =\frac{1}{|\xi |}\) in Eq. 2, \( ^g\! G_\xi \) can be split into the following three families:
Loglogistic distributions, whose natural logarithm follows the logistic distribution, from the classical max-stable Fréchet-\(\alpha \) distribution,
$$\begin{aligned} ^g \Phi _\alpha (x) =\frac{1}{1+x^{-\alpha }}\, \mathbb {I}_{(0,\infty )}(x),~\alpha >0\,; \end{aligned}$$
Logistic distribution, from the classical max-stable Gumbel distribution,
$$\begin{aligned} ^g\! \Lambda (x)=\frac{1}{1+\textrm{e}^{-x}}\, \mathbb {I}_{\mathbb {R}}(x) \,; \end{aligned}$$
Backward loglogistic distributions, from the max-stable Weibull-\(\alpha \) distribution,
$$\begin{aligned} ^g \Psi _\alpha (x) =\left\{ \begin{array}{ll} \frac{1}{1+(-x)^{\alpha }} & , \; x<0 \\ 1 & , \; x\ge 0 \end{array} \right. , \quad \alpha >0\, . \end{aligned}$$
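To give some intuition for the term geo-max-stable, the sketch below checks that the loglogistic and logistic laws keep their type when the maximum is taken over a geometric number of IID terms. The setup is an assumption made for illustration, not a construction quoted from Rachev and Resnick [17]: N is taken to be Geometric(\(\theta \)) on \(\{1,2,\ldots \}\) and independent of the sequence, so that the maximum of N terms with common distribution function F has distribution function \(\theta F/(1-(1-\theta )F)\). The function names and parameter values are hypothetical.

```python
import math

def loglogistic_cdf(x, alpha):
    # Standard loglogistic CDF, supported on (0, infinity)
    return 1.0 / (1.0 + x ** (-alpha)) if x > 0 else 0.0

def logistic_cdf(x):
    # Standard logistic CDF
    return 1.0 / (1.0 + math.exp(-x))

def geometric_max_cdf(f_value, theta):
    # CDF of max(X_1, ..., X_N), N ~ Geometric(theta) on {1, 2, ...},
    # evaluated at a point where the common CDF equals f_value
    return theta * f_value / (1.0 - (1.0 - theta) * f_value)

theta, alpha = 0.3, 1.5                 # illustrative values
xs_pos = [0.1 * k for k in range(1, 80)]
xs_all = [-4.0 + 0.1 * k for k in range(81)]

# Loglogistic: P(M_N <= theta**(-1/alpha) * x) equals the loglogistic CDF at x
scale = theta ** (-1.0 / alpha)
gap_loglogistic = max(
    abs(geometric_max_cdf(loglogistic_cdf(scale * x, alpha), theta)
        - loglogistic_cdf(x, alpha))
    for x in xs_pos
)

# Logistic: P(M_N <= x - ln(theta)) equals the logistic CDF at x
shift = -math.log(theta)
gap_logistic = max(
    abs(geometric_max_cdf(logistic_cdf(x + shift), theta) - logistic_cdf(x))
    for x in xs_all
)

print(gap_loglogistic, gap_logistic)
```

Both gaps should be at rounding level: after a rescaling (loglogistic) or a shift (logistic) that depends only on \(\theta \), the geometric maximum has the same distribution as a single term.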
Notice that in the limit distributions (1) and (2) (and in their particular cases) a location parameter \(\lambda \in \mathbb {R}\) and a scale parameter \(\delta >0\) can be considered. Moreover, the characterizations of the domains of attraction of geo-max-stable thinned laws are similar to the characterizations of the corresponding max-stable laws.
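A location-scale version of the geo-max-stable distribution can be sketched by replacing x with \((x-\lambda )/\delta \). The code below is a hypothetical helper pair, not the fitting procedure used in the paper: `loc` and `scale` stand for \(\lambda \) and \(\delta \), and the quantile function is obtained by inverting the closed form, which gives \(z=\left[ (p/(1-p))^{\xi }-1\right] /\xi \) for \(\xi \ne 0\) and \(z=\ln (p/(1-p))\) for \(\xi =0\).

```python
import math

def geo_gev_cdf(x, xi, loc=0.0, scale=1.0):
    # Geo-max-stable CDF with location `loc` and scale `scale` > 0
    z = (x - loc) / scale
    if xi == 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        return 0.0 if xi > 0 else 1.0
    return 1.0 / (1.0 + t ** (-1.0 / xi))

def geo_gev_quantile(p, xi, loc=0.0, scale=1.0):
    # Inverse of geo_gev_cdf for 0 < p < 1
    odds = p / (1.0 - p)
    z = math.log(odds) if xi == 0.0 else (odds ** xi - 1.0) / xi
    return loc + scale * z

loc, scale = 2.0, 3.0                   # illustrative values
roundtrip_gap = max(
    abs(geo_gev_cdf(geo_gev_quantile(p, xi, loc, scale), xi, loc, scale) - p)
    for xi in (-0.5, 0.0, 0.5)
    for p in (0.05, 0.25, 0.5, 0.75, 0.95)
)
median = geo_gev_quantile(0.5, 0.5, loc, scale)
print(roundtrip_gap, median)
```

Since the odds equal 1 at \(p=0.5\), the median of the standardized law is 0 for every \(\xi \), so the median of the location-scale version is simply \(\lambda \). The quantile function also allows simulation by inversion of uniform random numbers.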
Table 7 ML parameter estimates and goodness-of-fit results for the Attorneys General HIPAA fines data in Table 3