Next, we describe the clustering component, the sub-components it includes, and the choices behind each of them. The clustering is integrated into a web tool described in [20].
Distance calculation

K-means is the most common clustering algorithm. It minimizes the squared Euclidean distances between the data points (patient locations) and their nearest centroid. Squared distance is widely used even though it does not inherently correspond to any real-world geographic phenomenon, although some attempts have been made to establish such a connection. For example, Zhou and Li [21] modeled the cost of disaster losses as a quadratic function of the distance from the emergency facility to the disaster location.
Another common approach is to maximize the number of locations that are within a given distance from their nearest facility. Church [22] defined this as the maximal covering location problem. It is also referred to as threshold distance [23], where the distance cost is 0 if the distance to the centroid is less than a predefined threshold value and 1 otherwise. Fränti et al. [24] minimized the number of myocardial infarction patients at risk by defining at-risk individuals based on their proximity to the nearest percutaneous coronary intervention (PCI)-capable hospital, specifically considering whether a patient resides beyond a predetermined time limit.
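The threshold-distance criterion described above can be sketched in a few lines. The function and variable names below are illustrative, not taken from the tool itself:

```python
import numpy as np

def covered_count(patients, stations, threshold):
    """Count patients whose nearest station lies within `threshold`.

    patients: (n, 2) array of locations, stations: (k, 2) array.
    Equivalent to minimizing the number of uncovered (cost = 1) patients.
    """
    # Pairwise Euclidean distances, shape (n, k)
    d = np.linalg.norm(patients[:, None, :] - stations[None, :, :], axis=2)
    # A patient is covered if the distance to the nearest station is small enough
    return int((d.min(axis=1) <= threshold).sum())
```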
Absolute (non-squared) distance is the most widely used and the natural choice for location-based applications. It differs only slightly from its squared variant in the Euclidean case. The optimal centroid location of a cluster is its geometric center (mean) in the case of squared distance but its median in the case of absolute distance. The mean (or median) can be calculated for each coordinate independently. The median is also known as the spatial median [25], and the corresponding k-means variant is known as p-median [26, 27].
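The difference between the two centroid definitions can be illustrated with a small sketch (the function name and interface are ours):

```python
import numpy as np

def centroid(points, metric="squared"):
    """Representative point of a cluster, computed per coordinate.

    The mean minimizes the sum of squared distances (k-means);
    the coordinate-wise median minimizes the sum of absolute
    distances (p-median).
    """
    if metric == "squared":
        return points.mean(axis=0)
    return np.median(points, axis=0)
```

Note how a single outlier pulls the mean but leaves the median unchanged, which is the usual argument for the median's robustness.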
Simple Euclidean distance, however, has been shown to cause bias in facility optimization [24]. For this reason, travel distance (or time) is recommended. Calculating the shortest distance via the road network is straightforward but computationally expensive. A single shortest-path calculation is fast, but facility optimization requires millions of such calculations. The station locations also change dynamically during the optimization process, which prevents the use of a pre-calculated distance matrix. As a consequence, the optimization can take days.
Boscoe [28] showed that Euclidean distance is an adequate approximation of travel distance in the United States when multiplied by a constant factor of 1.4, calling this factor the detour index. It virtually equals the diagonal of a unit square (√2 ≈ 1.41), corresponding to the Manhattan distance. However, our data is mostly from areas where lakes and rivers make the road network more complex, and a simple Manhattan distance would not be accurate. In city areas, the factor can be smaller than 1.4 (see Fig. 1), but in areas containing rivers and lakes, it can be much larger. In general, the local differences are large.
Fig. 1 Example of the detour index between Euclidean and road network distances from the patient location (Huvilakatu) to the nearest health station (Suvikatu station)
Mariescu-Istodor and Fränti [29] adapted this detour index locally by constructing a so-called overhead graph. The graph was built in a pre-processing step, using the patient locations in the data and the road network to identify traffic points. These points constitute the nodes of the graph. The overhead ratio was then calculated for all pairs of nodes as the ratio of their true travel distance to the corresponding Euclidean distance. The larger the overhead value, the longer the detour. The values are stored in the graph (represented by a matrix) and used in the optimization process.
During the optimization, when we need to estimate the travel distance between a patient's location and a given health station, we first calculate their Euclidean distance. We then find the graph node nearest to each of the two locations and retrieve the stored overhead ratio between these two nodes. The Euclidean distance is multiplied by this ratio to obtain an estimate of the travel distance. Travel times are derived directly from the travel distances using the average speed of each road segment. Dynamic effects such as rush hours are not considered, and the optimization merely minimizes the average.
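The lookup described above can be sketched as follows, assuming the overhead ratios are stored in a dense matrix indexed by graph node; all names are illustrative:

```python
import numpy as np

def estimate_travel_distance(p, s, nodes, overhead):
    """Estimate road distance as Euclidean distance x stored overhead ratio.

    p, s:     patient and station coordinates
    nodes:    (m, 2) array of overhead-graph node coordinates
    overhead: (m, m) matrix of precomputed travel/Euclidean ratios
    """
    p, s = np.asarray(p, float), np.asarray(s, float)
    euclid = np.linalg.norm(p - s)
    # Nearest graph node to each endpoint
    i = np.argmin(np.linalg.norm(nodes - p, axis=1))
    j = np.argmin(np.linalg.norm(nodes - s, axis=1))
    # One table lookup and one multiplication per distance query
    return euclid * overhead[i, j]
```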
This process is extremely efficient, requiring only a single table lookup and one multiplication (overhead ratio × Euclidean distance). With our data, this reduces the processing time from two weeks to about 15 min for 10,000 iterations of the clustering algorithm. The huge speed-up comes at the cost of a minor inaccuracy in the distance estimation, about 2% according to [29], and an additional 512 × 512 entries (0.25 MB) of memory for the lookup table. The process with two sample graphs is shown in Fig. 2.
Fig. 2 Overhead graph with 256 nodes constructed for the North Karelia region and an example of its use
Optimization function

K-means is by far the most common clustering algorithm and would be the most obvious choice here as well. However, it minimizes the sum of squared Euclidean distances, and it is unclear whether geographical distances should be squared in this application. A more common choice is to calculate the sum of Euclidean distances as such (without squaring). However, neither of these satisfies our needs, as we are also interested in the patients' travel costs. We therefore also consider the sum of the travel costs as the optimization function.
The three optimization functions can be written mathematically as follows:
$$f_{1}=\sum_{p\in P} d\!\left(p,\ \underset{h\in HS}{\operatorname{argmin}}\ d\left(h,p\right)\right)$$
$$f_{2}=\sum_{p\in P} {\left(\underset{h\in HS}{\min}\ d\left(h,p\right)\right)}^{2}$$
$$f_{3}=\sum_{p\in P} cost\!\left(p,\ \underset{h\in HS}{\operatorname{argmin}}\ d\left(h,p\right)\right)$$
where P denotes the set of patient locations and HS the set of health stations.
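A minimal sketch of evaluating the three optimization functions for a candidate set of stations, assuming each patient visits the station nearest by distance; `cost` is a placeholder travel-cost function, not the actual model:

```python
import numpy as np

def objectives(patients, stations, cost):
    """Evaluate the three optimization functions for one candidate solution.

    patients: (n, 2) array, stations: (k, 2) array,
    cost(h, p): travel cost from patient p to station h (placeholder).
    """
    d = np.linalg.norm(patients[:, None, :] - stations[None, :, :], axis=2)
    nearest = d.argmin(axis=1)   # index of nearest station per patient
    dmin = d.min(axis=1)
    f1 = dmin.sum()              # sum of distances (p-median style)
    f2 = (dmin ** 2).sum()       # sum of squared distances (k-means)
    f3 = sum(cost(stations[j], p) for j, p in zip(nearest, patients))
    return f1, f2, f3
```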
For estimating the travel costs, we adopt the cost model presented in [30], tailored for the local region. The model assumes that a patient uses the bus when the distance to the nearest bus stop is less than 200 m; otherwise, the patient's own car is used. An exception is patients living within 1 km of the health station, who are expected to walk at 0 € cost. People aged 80 years or older are assumed to use taxis.
The model's key elements are summarized in Table 1. Our model is slightly simplified; instead of considering different zones for bus fares, we apply a flat rate of 5.1 € per bus trip. The model of Leminen et al. [30] also considers the cost of the patient's time during travel, based on the average hourly gross wage in the respective zip code area. We have omitted this component to streamline the optimization process.
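The resulting decision rules can be sketched as follows. Note that the per-kilometer car and taxi rates below are illustrative placeholders, not values from Leminen et al. [30]:

```python
def travel_cost(age, dist_to_station_km, dist_to_bus_stop_m,
                car_rate=0.25, taxi_rate=1.5):
    """Simplified travel-cost rules as described in the text.

    car_rate and taxi_rate (EUR/km) are illustrative placeholders.
    """
    if age >= 80:
        return taxi_rate * dist_to_station_km      # taxi for 80+
    if dist_to_station_km < 1.0:
        return 0.0                                 # walking, free
    if dist_to_bus_stop_m < 200:
        return 5.1                                 # flat bus fare per trip
    return car_rate * dist_to_station_km           # own car otherwise
```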
Table 1 Travel cost model as presented in Leminen et al. [30]

The ideal goal is to maximize accessibility, but it is not clear how to measure it. We use distance, travel time, and travel costs. Travel time is derived directly from travel distance based on the average speeds of the road segments used and is the most obvious measure. However, people also tend to minimize costs in case of non-urgent visits, so the travel cost is also a relevant indicator of accessibility.
Clustering algorithm

K-means and its variants, like p-median, would be the most obvious choices for the clustering algorithm, but they can be inaccurate (see Fig. 3). It was shown in [31] that k-means performs worse when the number of clusters is high and the clusters are well separated. Part of these problems can be compensated for by repeating the algorithm multiple times [32], at the cost of increased processing time, or by better initialization using methods like Maxmin [33] or its variant k-means++ [34]. Although k-means works well for most data, neither of these alternatives was able to cluster all benchmark datasets correctly [32].
Fig. 3 Incorrectly detected clusters (+ sign signifies too many centroids, − sign too few) occur in k-means when there are many clusters and some of them are well separated. K-means may fail to find all clusters correctly even with seemingly easy datasets (named A2, S2, Unbalance) because the algorithm is incapable of moving centroids across empty areas (deserts). Repeating the algorithm compensates for this only partly and relies too much on luck
P-median [26, 27] suffers from the same problems as k-means. It uses the median instead of the mean as the cluster centroid, which is potentially more robust to noise, but recent results have shown the medoid performing poorly in the context of averaging GPS trajectories [35]. Moreover, our data consists of patients' real home locations, which is much less noisy than the output of some medical measurement processes.
The accuracy of k-means is usually sufficient when it is applied as a data-processing component within a more complex pattern recognition system. Kinnunen et al. [36] reported that the choice of algorithm had a negligible effect on overall speaker recognition performance as long as a reasonably good algorithm was chosen (including repeated k-means). However, clustering is the core component of our analysis, and high accuracy is required to avoid any bias caused by the algorithm. For this reason, we have selected a more robust algorithm.
Many potentially good clustering algorithms exist, including Ward's agglomerative clustering [37], its enhanced variant called iterative shrinking [38], the splitting algorithm [39], global k-means [40], and evolutionary algorithms, of which the genetic algorithm (GA) [41] and the self-adaptive genetic algorithm (SAGA) [42] have been shown to be the most accurate in terms of minimizing the clustering objective function.
Among the many good choices, we select random swap [43]. It performs virtually as well as the more complex genetic algorithms while having the benefit of a straightforward implementation; see the pseudo-codes of random swap and the genetic algorithm in Fig. 4. Its simplicity is important because it allows easier adaptation of the algorithm to different distance functions such as travel cost. Several implementations in different programming languages are also publicly available, including a version supporting parallel processing [44].
Fig. 4 Pseudo-codes of two good clustering algorithms. We selected random swap due to its simplicity
The random swap algorithm works as follows. It is a wrapper around k-means, starting from random initial locations for the stations. In each iteration, it selects a random station, relocates it to a new (random) location, and then runs two k-means iterations to reallocate the patients to their nearest station and re-optimize the station locations. If the new solution improves the cost function, it is kept; otherwise, the previous solution is restored. While seemingly naïve, this trial-and-error approach is effective because the trial swaps can be implemented efficiently.
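A minimal sketch of this loop, using the sum of squared distances as the cost function and random data points as trial locations (a common implementation choice; the actual tool also supports distance and travel-cost objectives):

```python
import numpy as np

def kmeans_iters(X, C, iters=2):
    # Two standard k-means iterations: assign points, then recompute centroids.
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
        for j in range(len(C)):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)
    return C

def random_swap(X, k, T=5000, seed=0):
    """Random swap clustering (a sketch of [43]): relocate one random
    centroid to a random data point, fine-tune with two k-means
    iterations, and keep the result only if the cost improves."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()

    def cost(C):
        d = np.linalg.norm(X[:, None] - C[None], axis=2).min(axis=1)
        return (d ** 2).sum()

    best = cost(C)
    for _ in range(T):
        trial = C.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]  # the swap
        trial = kmeans_iters(X, trial)
        c = cost(trial)
        if c < best:            # accept only improving swaps
            C, best = trial, c
    return C, best
```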
Here, we use T = 5000 trial iterations, which is the original recommendation. The algorithm is not particularly sensitive to this parameter, and the exact value is not critical in offline optimization, as we can simply run the algorithm as long as we want, e.g., 100,000 iterations overnight just to be sure. A theory of how to set this value more precisely can be found in [43]. A visual animation of the algorithm is available at https://cs.uef.fi/ml/software/ and a tutorial at https://www.youtube.com/@pasifranti541.
Figure 5 shows the benefit of random swap over k-means. The average squared distance (MSE) is 10% smaller for data with many clusters (Bridge). Repeating k-means 100 times improves its accuracy but does not reach the same level. For data with 100 separate clusters (Birch1), the difference is 15%: k-means locates 7 clusters wrongly, and even repeating it 2500 times only reduces this to 3, which is still too many when high accuracy is required.
Fig. 5 Effect of the optimization by the random swap algorithm versus k-means
For the centroid, we use the arithmetic mean of all patient locations in the cluster. This is not the optimal choice for minimizing travel cost, but it is the best we can think of, and it can be calculated fast. A better location could theoretically be found by a local search around the current location, but re-calculating all the distances would be time-consuming. Instead, we fine-tune the centroid location to the nearest existing building. While new buildings can (and probably should) be constructed, this at least prevents centroids from being located on lakes or in inaccessible places without any infrastructure nearby.
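This fine-tuning step can be sketched as a nearest-neighbor snap (names are illustrative; `buildings` would come from a building registry):

```python
import numpy as np

def snap_to_buildings(centroids, buildings):
    """Move each centroid to the nearest existing building.

    centroids: (k, 2) array of cluster means,
    buildings: (m, 2) array of building coordinates.
    Prevents stations from landing on lakes or inaccessible areas.
    """
    d = np.linalg.norm(centroids[:, None, :] - buildings[None, :, :], axis=2)
    return buildings[d.argmin(axis=1)]
```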