J. Imaging, Vol. 8, Pages 324: Detecting Audio Adversarial Examples in Automatic Speech Recognition Systems Using Decision Boundary Patterns

5.1. Target Models and Data Sets

As with similar research in the ASR domain [9,10,39,44], we used DeepSpeech [1] as one of the target models for our experiments. DeepSpeech 0.8.2 (implemented by Mozilla, https://github.com/mozilla/DeepSpeech (accessed on 1 November 2020)), the latest release at the time of writing, was used in this research. It should be noted that DeepSpeech 0.1, which was used in previous studies [9,39], has been superseded by newer versions. In addition, we used DeepSpeech2 [2], an improved version of DeepSpeech that employs an end-to-end architecture; specifically, we used DeepSpeech2 V2 (implemented and released by Sean Naren, https://github.com/SeanNaren/deepspeech.pytorch (accessed on 1 November 2020)).

LibriSpeech [58] was employed as the data set because DeepSpeech and DeepSpeech2 both provide models pre-trained on LibriSpeech. In the experiments, we used audio from the test-clean and dev-clean data sets. For targeted AEs, one of the target phrases "power off", "turn on airplane mode", "visit danger dot com", "call malicious number", and "turn off lights" was selected at random to mimic malicious voice commands. The generation of an untargeted AE was deemed successful if the edit distance between the transcript and the ground truth was larger than 40% of the length of the ground truth.

In previous work, Carlini and Wagner [9] generated audio AEs using the first 100 test instances of the Mozilla Common Voice data set [59]. Most of this audio was short, between 1 and 8 seconds in duration. Carlini and Wagner [9] empirically observed that generating targeted AEs became easier as the source phrase grew longer, and harder as the target phrase grew longer. Since our target phrases were relatively short, we used audio below 5 seconds in duration to balance the difficulty of generating targeted audio AEs.
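For illustration, the untargeted success criterion can be expressed as a short sketch. We read "larger than 40% of the ground truth" as 40% of the ground-truth length, and the Levenshtein package is our choice for computing the edit distance, not one specified above:

```python
import Levenshtein  # pip install python-Levenshtein

def untargeted_success(transcript: str, ground_truth: str) -> bool:
    # Successful if the edit distance exceeds 40% of the ground-truth length.
    return Levenshtein.distance(transcript, ground_truth) > 0.4 * len(ground_truth)
```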

All experiments in this paper were performed with an Intel i7-8750H CPU and an Nvidia GeForce GTX 1060 graphics card. For each of DeepSpeech and DeepSpeech2, using randomly selected audio from the respective test-clean data set, we generated 150 targeted AEs, 150 untargeted AEs using FGSM, and 150 untargeted AEs using our proposed method. For simplicity, in the remainder of this paper, we refer to untargeted AEs generated using our proposed method as untargeted AEs and to untargeted AEs generated using FGSM as FGSM AEs. To obtain a balanced data set, we also extracted 150 correctly transcribed and 150 incorrectly transcribed audio samples from the test-clean data set of each model. In addition, we generated 150 noisy audio samples by applying Gaussian noise with a standard deviation of 0.01.
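The noisy audio can be produced as in the following minimal sketch, assuming waveforms normalized to [-1, 1]; the clipping step is our assumption and is not described above:

```python
import numpy as np

def add_gaussian_noise(audio: np.ndarray, std: float = 0.01) -> np.ndarray:
    """Apply zero-mean Gaussian noise with the given standard deviation."""
    noisy = audio + np.random.normal(0.0, std, size=audio.shape)
    return np.clip(noisy, -1.0, 1.0)  # keep samples in the valid range
```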

To generate targeted AEs, we ran 350 epochs for DeepSpeech and 300 epochs for DeepSpeech2 to suppress noise during the second stage, since we observed that it is easier for DeepSpeech2 to suppress the noise without destroying the adversarial perturbations. Noise suppression was successful for all targeted AEs against DeepSpeech2. However, some AEs against DeepSpeech failed to reduce the noise within the 350 epochs. As such, we individually fine-tuned these noisy AEs by running extra epochs until the masking loss ($l_\theta(\cdot)$ in Equation (5)) fell below a specific threshold. The smaller the masking loss, the smaller the distortion caused by the perturbations. We set the threshold to the masking loss calculated using the −20 dB distortion set published by [9] (https://nicholas.carlini.com/code/audio_adversarial_examples (accessed on 1 November 2020)).

The masking losses of our AEs were compared with the −20 dB, −35 dB, and −50 dB distortion sets published by [9] and the first set of the imperceptible adversarial examples published by [36] (http://cseweb.ucsd.edu/~yaq007/imperceptible-robust-adv.html (accessed on 1 November 2020)). Figure 1 and Figure 2 show these results. Smaller dB values mean lower distortion (see the sketch of this measure at the end of this subsection). Carlini and Wagner [9] reported that the distortion of 95% of their targeted AEs ranged between −15 dB and −45 dB. Thus, the resulting distortion in our targeted AEs is comparable with the results of related work. It should be mentioned that we can further lower the masking loss by running more epochs, at the cost of a longer generation time. We have made examples of AEs generated in this work available at https://drive.google.com/drive/folders/1Ffed7xHmP5oKCuypEgJxQ80p35-vSIBm?usp=sharing (accessed on 20 October 2022).

Table 3 shows a comparison of the time taken to generate the audio AEs. FGSM was the fastest approach, but it had the lowest success rate. On average, it took 2.4 and 7.0 min to generate targeted audio AEs for DeepSpeech and DeepSpeech2, respectively. On the other hand, our proposed method required an average of 4.4 and 4.9 min to generate untargeted audio AEs. While we generated AEs one at a time, the generation process can be accelerated by generating multiple AEs in parallel. As a loose comparison, Carlini and Wagner [9] reported that their approach took about one hour to generate a single targeted audio AE on commodity hardware, while Zeng et al. [27] reported a time of 18 min on an 18-core CPU with dual graphics cards. While we cannot conclude that our generation process is statistically faster, as the source audio and target phrases were different, intuitively, our method should speed up the generation of AEs because we do not limit the max-norm of the perturbations.
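The dB distortion measure referenced above follows Carlini and Wagner [9], who quantify a perturbation $\delta$ relative to the original audio $x$ as $dB_x(\delta) = dB(\delta) - dB(x)$, with $dB(x) = 20 \log_{10}(\max_i |x_i|)$. A minimal sketch (the function names are ours):

```python
import numpy as np

def db(x: np.ndarray) -> float:
    # Peak level in decibels: 20 * log10(max_i |x_i|).
    return 20.0 * np.log10(np.max(np.abs(x)))

def distortion_db(original: np.ndarray, adversarial: np.ndarray) -> float:
    # Relative loudness of the perturbation: dB_x(delta) = dB(delta) - dB(x).
    # More negative values correspond to quieter, less perceptible perturbations.
    delta = adversarial - original
    return db(delta) - db(original)
```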
5.2. Visualizing Decision Boundaries

As described in Section 3, the proposed method represents the decision boundaries of ASR models using heat maps of loss-function values and normalized edit distances. The matrices $M_{loss}$ and $M_{edit}$ were calculated for correctly transcribed benign audio, targeted audio AEs, and untargeted audio AEs. It was empirically observed that good results could be produced using a 128×128 matrix and a step size s of 0.07. Figure 3 shows examples of the resulting heat maps.

In the heat maps shown in Figure 3, the horizontal axis represents the direction of the gradient of the loss function with respect to the input audio, while the vertical axis represents a random direction perpendicular to the gradient. The heat maps were generated by modifying the input audio along these two directions and recording the changes. The center of each heat map represents the unmodified audio. In the experiments, we set y in Equation (1) to the transcript of the unmodified audio, because we wanted to measure the changes in loss values and transcripts when modifying the audio. For example, y is set to the target phrase of a targeted audio AE or to the incorrect transcript of an untargeted audio AE.
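A minimal sketch of this heat-map construction follows. The helper functions (loss_fn, loss_grad, transcribe, edit_distance) are hypothetical stand-ins for the model-specific operations, and the Gram-Schmidt step for obtaining a perpendicular random direction is our assumption:

```python
import numpy as np

def boundary_heatmaps(audio, y, loss_fn, loss_grad, transcribe, edit_distance,
                      size=128, step=0.07):
    """Build M_loss and M_edit around an input audio sample.

    Hypothetical helpers:
      loss_fn(x, y)       -> scalar loss of the model for audio x and text y
      loss_grad(x, y)     -> gradient of the loss with respect to x
      transcribe(x)       -> model transcript of audio x
      edit_distance(a, b) -> edit distance between two strings
    """
    # Horizontal direction: normalized gradient of the loss w.r.t. the audio.
    g = loss_grad(audio, y)
    d_grad = g / np.linalg.norm(g)

    # Vertical direction: a random direction made perpendicular to the
    # gradient (Gram-Schmidt step; this detail is our assumption).
    r = np.random.randn(*audio.shape)
    r -= np.dot(r, d_grad) * d_grad
    d_rand = r / np.linalg.norm(r)

    m_loss = np.zeros((size, size))
    m_edit = np.zeros((size, size))
    offsets = (np.arange(size) - size // 2) * step  # center: unmodified audio
    for i, a in enumerate(offsets):        # rows: random direction
        for j, b in enumerate(offsets):    # columns: gradient direction
            x = audio + a * d_rand + b * d_grad
            m_loss[i, j] = loss_fn(x, y)
            m_edit[i, j] = edit_distance(transcribe(x), y) / max(len(y), 1)
    return m_loss, m_edit
```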

It is evident from the resulting patterns that changes in loss function values and normalized edit distances are correlated. This aligns with the intuition that the loss value returned by an ASR model should increase as the difference between the transcript and y increases, and vice versa. Furthermore, we can see that when a targeted audio AE is modified slightly, the resulting loss function value and normalized edit distance change significantly. This is true for both DeepSpeech and DeepSpeech2 and is consistent with our observation that the adversarial perturbations in the generated targeted audio AEs are not robust. The significant changes in loss function values and normalized edit distances when AEs are modified are an indication of the non-robust nature of adversarial perturbations.

In contrast, when audio is slightly modified, the changes in loss function values and normalized edit distances for correctly transcribed benign audio are significantly smaller than those for targeted audio AEs. This implies that correctly transcribed benign audio is much more robust against perturbations than targeted audio AEs, which is consistent with our observation that some benign audio could still be correctly transcribed even when a large amount of noise was present. Another observation is that slightly modifying untargeted audio AEs also results in large changes in loss function values and normalized edit distances. While this change appears to be less severe than for targeted audio AEs, the resulting patterns still differ from those of correctly transcribed benign audio.

5.3. Dimensionality Reduction

Based on the different patterns in loss function values and normalized edit distances for targeted audio AEs, untargeted audio AEs, and benign audio shown in Section 5.2, it is logical to consider the possibility of differentiating audio AEs from benign audio based on differences in their patterns. Thus, we extracted features from the audio and projected them into 2D space with PCA and t-SNE, following the method described in Section 3. It should be noted that if the features of audio AEs and benign audio can clearly be differentiated in 2D space, this indicates that they can also be separated in the original higher-dimensional space.

In the experiment, benign audio was grouped into correctly and incorrectly transcribed audio to investigate whether there was a difference between them. In addition, noisy audio was also included. The features were normalized using their mean values and standard deviations before being projected into 2D space. These results are shown in Figure 4.
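For illustration, the projection step can be sketched as follows; the feature file name is hypothetical, and scikit-learn's default t-SNE parameters are an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# One feature vector per audio sample, extracted as described in Section 3.
features = np.load("audio_features.npy")  # hypothetical file name

# Normalize each feature using its mean and standard deviation.
scaled = StandardScaler().fit_transform(features)

# Project the normalized features into 2D space.
pca_2d = PCA(n_components=2).fit_transform(scaled)
tsne_2d = TSNE(n_components=2).fit_transform(scaled)
```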

The PCA projection results were almost the same for DeepSpeech and DeepSpeech2. Correctly and incorrectly transcribed audio clustered around the origin, while the other audio types spread away from it. The correctly and incorrectly transcribed audio almost overlapped, indicating that there is little difference between their features. As previously discussed, the changes in loss function values and normalized edit distances for correctly transcribed benign audio are small, which explains why correctly and incorrectly transcribed audio cluster around the origin. In contrast, targeted audio AEs are far away from the origin. This is because small modifications result in significant changes for targeted audio AEs, as discussed in the previous section. Untargeted audio AEs, FGSM audio AEs, and noisy audio all spread slightly away from the origin in the same direction. This implies that the features of these three audio types are similar.

Compared with the PCA results, the t-SNE projection was better at visualizing relationships between the data samples. In Figure 4, the t-SNE projection again shows similar results for DeepSpeech and DeepSpeech2. Three clusters, excluding noisy audio, can be identified as follows: targeted audio AEs are clearly grouped in the first cluster; the second cluster mainly contains correctly and incorrectly transcribed benign audio; and the third cluster consists of untargeted audio AEs and FGSM AEs, i.e., untargeted attacks. The results of the t-SNE projection are promising, since the various audio types are clustered according to their categories. An interesting observation is that incorrectly transcribed audio does not overlap with untargeted audio AEs or FGSM AEs, although all of them lead to incorrect transcriptions. A potential explanation is that incorrectly transcribed audio from the test-clean data set does not cause errors as severe as those caused by untargeted audio AEs and FGSM AEs. In addition, noisy audio appears in both the second cluster (benign audio) and the third cluster (untargeted attacks). This may be because some noisy audio is like benign audio in that it can be transcribed correctly or with little error, while other noisy audio behaves like untargeted attacks, which lead to significant errors in transcriptions. Upon closer inspection, the untargeted AEs and FGSM AEs are separate from each other in the case of DeepSpeech2, but the same is not true for DeepSpeech.

5.4. Anomaly Detection

The visualization results presented in the previous subsection indicate the possibility of detecting audio AEs based on their features. Hence, instead of training a classifier on benign audio and audio AEs, we experimented with using anomaly detection to detect audio AEs. In practice, audio AEs generated by adversaries are unlikely to have been seen before, which makes anomaly detection appropriate for defending against previously unknown attacks.

In the experiments, we used audio from the dev-clean data set to train an anomaly detection model. This model was then used to detect audio AEs generated using the test-clean data set. In particular, audio features from dev-clean were extracted using the method described in Section 3. These features were used to train an EllipticEnvelope model implemented by scikit-learn [60], which detects outliers in a Gaussian-distributed data set. We used the default parameters so that our experimental results can serve as a lower bound for anomaly detection. We report the true positives (TP), false positives (FP), true negatives (TN), false negatives (FN), and detection rate (DR) for each category of benign audio and audio AEs, together with the overall precision (Pre), recall (Rec), and accuracy (Acc). Specifically, $\text{precision} = \frac{TP}{TP+FP}$, $\text{recall} = \frac{TP}{TP+FN}$, and $\text{accuracy} = \frac{TP+TN}{TP+FP+TN+FN}$. For audio AEs, $\text{DR} = \frac{TP}{TP+FP}$; for benign audio, $\text{DR} = \frac{TN}{TN+FN}$.

Table 4 presents the experimental anomaly detection results for DeepSpeech and DeepSpeech2. Overall, the detection results are similar for both ASR models. As expected, targeted AEs are easily detected, at detection rates of 100%. This is in line with the observation that targeted AEs can clearly be separated from other audio types in lower-dimensional space. It is reasonable that the detection rates of untargeted AEs were lower than those of targeted AEs, since some untargeted AEs were mixed with benign audio in the PCA projection, as previously shown in Figure 4. The detection rates of FGSM AEs were surprisingly lower than those of untargeted audio AEs, although these two types of AEs were clustered together in the t-SNE projection. This indicates that the simple anomaly detection model that was used is too basic for detecting FGSM AEs. In addition to benign audio, noisy audio could also be correctly identified at high detection rates. This was not as expected, since some of the noisy audio was mixed with untargeted AEs and FGSM AEs in low-dimensional space. This suggests that noisy audio is actually clustered with benign audio in the original higher-dimensional space, even though the 2D projection did not show this.

In a study by Samizade et al. [44], white-box and black-box targeted audio AEs were generated against DeepSpeech, and a neural network was trained on white-box targeted audio AEs to detect black-box targeted audio AEs and vice versa. Our detection accuracies for the two ASR models of 87.44% and 82.22% are overall higher than their reported results of 82.07% and 48.76%, respectively. While this may not be a fair comparison, as they used a different approach, we mainly want to emphasize that the detection of previously unknown audio AEs is a challenging task. It is anticipated that the detection results can be improved by extracting more sophisticated features and utilizing a more advanced anomaly detection method.
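For illustration, the detection pipeline can be sketched as follows. The file names are hypothetical, and standardizing the features before fitting is our assumption; the EllipticEnvelope model is used with its scikit-learn defaults, as in the experiments:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: one row per audio sample, extracted as
# described in Section 3.
train_features = np.load("dev_clean_features.npy")  # benign training audio
test_features = np.load("test_features.npy")        # benign audio and AEs

# Fit the outlier detector on benign dev-clean features only.
scaler = StandardScaler().fit(train_features)
detector = EllipticEnvelope().fit(scaler.transform(train_features))

# predict() returns +1 for inliers (benign-like) and -1 for outliers,
# which we treat as detected adversarial examples.
pred = detector.predict(scaler.transform(test_features))
is_adversarial = pred == -1
```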
