Adversarial Attack Study Notes (2): FGSM

Study notes on the most classic white-box attack, the Fast Gradient Sign Method (FGSM). (Photo by Alex Chumak on Unsplash)

I. Abstract

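Many machine learning models, including neural networks, consistently misclassify one particular kind of input: adversarial examples. Such examples are usually produced by adding a small perturbation to a normal training input. Early explanations of this phenomenon focused on nonlinearity and overfitting. Goodfellow et al. offer a different explanation: the primary cause is the linear nature of neural networks. Based on this view they propose a simple and fast method for generating adversarial examples, FGSM (Fast Gradient Sign Method).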

II. Related Work

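Box-constrained L-BFGS can reliably find adversarial examples.
Some adversarial examples are indistinguishable from the original examples to the human eye.
The same adversarial example works against classifiers with different architectures or different parameters.
Shallow softmax regression models are also vulnerable to adversarial examples.
Training on adversarial examples can regularize the model, but it requires expensive constrained optimization in the inner loop, which made it impractical.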

III. The Linear Explanation of Adversarial Examples

Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples.

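Goodfellow et al. argue that linear behavior in high-dimensional spaces is enough to produce adversarial examples, and explain why as follows.

For a typical 8-bit image, each pixel takes an integer value from 0 to 255, so any information below $1/255$ of the dynamic range is discarded.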

Because the precision of the features is limited, it is not rational for the classifier to respond differently to an input x than to an adversarial input $\tilde{x}=x+\eta$ if every element of the perturbation $\eta$ is smaller than the precision of the features.

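So, because feature precision is limited, the classifier should not respond differently to $\tilde{x}$ and $x$ when every element of the perturbation $\eta$ is smaller than that precision, that is, when $\lVert\eta\rVert_\infty < \epsilon$.

Consider the dot product of a weight vector $\omega$ and an adversarial example $\tilde{x}$:

$\omega^T\tilde{x} = \omega^Tx + \omega^T\eta.$

Under the max norm constraint, choosing $\eta = \epsilon\,\text{sign}(\omega)$ maximizes the perturbation term: $\omega^T\eta = \epsilon\lVert\omega\rVert_1$, the L-1 norm of the weight vector scaled by $\epsilon$. If $\omega$ has dimension $n$ and the average magnitude of its elements is $m$, then $\lVert\omega\rVert_1 = mn$ and the activation grows by $\epsilon mn$.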

Since $\lVert\eta\lVert_\infty$ does not grow with the dimensionality of the problem but the change in activation caused by perturbation by $\eta$ can grow linearly with $n$, then for high dimensional problems, we can make many infinitesimal changes to the input that add up to one large change to the output.

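Since $\lVert\eta\rVert_\infty$ does not change with the dimension $n$ while the change in activation caused by $\eta$ grows linearly with $n$, in high-dimensional problems many tiny changes to the input add up to one large change in the output.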
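The scaling argument above can be checked numerically. The sketch below is only an illustration under assumed values: the weights are random Gaussian draws and `epsilon = 0.01` is a made-up budget, neither taken from the paper.

```python
import numpy as np

def activation_shift(w, epsilon):
    """Change in w^T x caused by the max-norm perturbation eta = epsilon * sign(w)."""
    eta = epsilon * np.sign(w)
    return float(w @ eta)

rng = np.random.default_rng(0)
epsilon = 0.01  # hypothetical per-element budget
for n in (10, 1_000, 100_000):
    w = rng.normal(size=n)          # stand-in weight vector
    m = np.abs(w).mean()            # average magnitude of the weights
    # The shift equals epsilon * ||w||_1 = epsilon * m * n.
    print(n, activation_shift(w, epsilon), epsilon * m * n)
```

Each element of the perturbation stays at magnitude `epsilon`, yet the printed shift grows roughly in proportion to `n`.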

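IV. Linear Perturbation of Non-linear Models

The linear explanation of adversarial examples suggests a fast way to generate them.

LSTMs, ReLUs, and maxout networks are all intentionally designed to behave in very linear ways, and even nonlinear models such as sigmoid networks are tuned to spend most of their time in the non-saturating, more linear regime so that they are easier to optimize. The hypothesis is that neural networks are linear enough that they cannot resist adversarial perturbations.

Let $\boldsymbol{\theta}$ be the parameters of the model, $\boldsymbol{x}$ the input, $y$ the target associated with $\boldsymbol{x}$, and $J(\boldsymbol{\theta}, \boldsymbol{x}, y)$ the cost function used to train the network. Linearizing the cost function around the current value of $\boldsymbol{\theta}$ gives the optimal max-norm constrained perturbation $\eta = \epsilon\, \text{sign}(\nabla_x J(\boldsymbol{\theta}, \boldsymbol{x}, y)).$

This is the FGSM method for quickly generating adversarial examples.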
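To see why the sign of the gradient is the optimal choice under a max-norm budget, here is a small sketch. It assumes a random vector `grad` as a stand-in for $\nabla_x J$ (no real model is involved) and compares the first-order increase in cost from the FGSM perturbation with the best of many random perturbations inside the same $\epsilon$-box.

```python
import numpy as np

def fgsm_perturbation(grad, epsilon):
    """eta = epsilon * sign(grad): the maximizer of grad . eta subject to ||eta||_inf <= epsilon."""
    return epsilon * np.sign(grad)

rng = np.random.default_rng(1)
grad = rng.normal(size=50)   # hypothetical stand-in for nabla_x J at some input
epsilon = 0.1                # hypothetical budget

eta_fgsm = fgsm_perturbation(grad, epsilon)
# Linearized cost increase J(x + eta) - J(x) ~ grad . eta
fgsm_gain = float(grad @ eta_fgsm)
best_random = max(
    float(grad @ rng.uniform(-epsilon, epsilon, size=grad.shape))
    for _ in range(1000)
)
print("FGSM gain:", fgsm_gain)
print("best of 1000 random perturbations:", best_random)
```

Under the linear approximation, no perturbation inside the box can beat `fgsm_gain`, which equals `epsilon` times the L-1 norm of the gradient.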

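In the paper's example of FGSM, the adversarial example obtained by adding the perturbation $\eta = \epsilon\,\text{sign}(\nabla_x J(\boldsymbol{\theta}, \boldsymbol{x}, y))$ fools the model: the original "panda" is misclassified as "gibbon" with 99.3% confidence.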

The fact that these simple, cheap algorithms are able to generate misclassified examples serves as evidence in favor of our interpretation of adversarial examples as a result of linearity.

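The fact that such a simple, cheap algorithm can mislead the model supports Goodfellow et al.'s interpretation of adversarial examples as a consequence of linearity.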

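V. Adversarial Training of Linear Models vs. Weight Decay

1*. Logistic Regression

The sigmoid function

Logistic regression is a simple linear model used mainly for binary classification.

Consider an input vector $\boldsymbol{x} = (x_0,x_1,\dots,x_n)$ to be classified and a binary label $y \in \{0,1\}$. For classification we want a function $f(\boldsymbol{x})$ that maps $\boldsymbol{x}$ to the label $y$. The natural first choice is a linear function $\boldsymbol{\omega}^T \boldsymbol{x} + b$, but its range is unbounded, which is too wide and does not separate the two classes sharply. We would rather have a function that maps most inputs close to one of two values. This is where the sigmoid function comes in: $\sigma(z) = \frac{1}{1+e^{-z}}$, whose graph is an S-shaped curve rising from 0 to 1.

The sigmoid maps most inputs to values near 0 or 1, which is why logistic regression uses it to classify samples.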
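As a quick illustration of how the sigmoid squashes inputs toward 0 or 1, here is a minimal sketch in plain Python. The function name `sigmoid` and the sample inputs are chosen for illustration only.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-10, -2, 0, 2, 10):
    print(z, round(sigmoid(z), 5))
```

Inputs a few units away from 0 already land very close to 0 or 1, and `sigmoid(z) + sigmoid(-z)` is always 1, which is what lets $\sigma(-z)$ serve as the probability of the other class.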

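Deriving the logistic regression cost function: maximum likelihood estimation

For a sample whose true label is $y=1$, the probability that the model assigns to $y=1$ should be as large as possible. This is the maximum likelihood principle, and it is what drives training: the model parameters $\boldsymbol{\omega}$ are adjusted repeatedly to maximize $P(y=y^{true}|\boldsymbol{x})$.

How is logistic regression trained?

Composing the linear function with the sigmoid gives

$f(\boldsymbol{x}) = \frac{1}{1+e^{-(\boldsymbol{\omega}^T\boldsymbol{x}+b)} }.$

This is interpreted as the probability that $\boldsymbol{x}$ belongs to class $y=1$:

$P(y=1|\boldsymbol{x}) = \sigma(\boldsymbol{\omega}^T\boldsymbol{x}+b).$

Correspondingly, the probability that $\boldsymbol{x}$ belongs to class $y=0$ is

$P(y=0|\boldsymbol{x}) = 1-P(y=1|\boldsymbol{x}) = \sigma(-(\boldsymbol{\omega}^T\boldsymbol{x}+b)).$

Next we derive the cost function of logistic regression.

Given a training set $\boldsymbol{X}$ with labels $\boldsymbol{Y}$, for each pair $(\boldsymbol{x}_i, y_i^{true}) \in (\boldsymbol{X}, \boldsymbol{Y})$ compute the probability

$P(y_i=y_i^{true}|\boldsymbol{x}_i) = \begin{cases} \sigma(-(\boldsymbol{\omega}^T\boldsymbol{x}_i+b)), & y_i^{true}=0 \\ \sigma(\boldsymbol{\omega}^T\boldsymbol{x}_i+b), & y_i^{true}=1.\end{cases}$

The probability that all samples are classified correctly is

$P=\prod_i P(y_i=y_i^{true}|\boldsymbol{x}_i).$

Because $y\in\{0,1\}$, this can be written compactly as

$P=\prod_i \sigma^{1-y_i^{true}}(-(\boldsymbol{\omega}^T\boldsymbol{x}_i+b))\, \sigma^{y_i^{true}}(\boldsymbol{\omega}^T\boldsymbol{x}_i+b).$

By the maximum likelihood principle, training should keep adjusting $\boldsymbol{\omega}$ so that $P$ is maximized. For convenience, take the logarithm of $P$:

$\log P = \sum_i [-(1-y_i^{true})\log(1+e^{\boldsymbol{\omega}^T\boldsymbol{x}_i+b})-y_i^{true}\log(1+e^{-(\boldsymbol{\omega}^T\boldsymbol{x}_i+b)})].$

Optimization procedures usually minimize a cost function, so we negate the expression above. We also divide by the number of training samples $m$, so that the scale of the cost and of its gradient does not grow with the size of the dataset, which helps guard against exploding gradients during training. For logistic regression with labels $y\in \{0,1\}$, the cost function is therefore defined as

$C(\boldsymbol{x}, \boldsymbol{\omega}) = \frac{1}{m} \sum_i [(1-y_i^{true})\log(1+e^{\boldsymbol{\omega}^T\boldsymbol{x}_i+b})+y_i^{true}\log(1+e^{-(\boldsymbol{\omega}^T\boldsymbol{x}_i+b)})].$

One way to understand the gradient concern: each probability satisfies $P \in [0,1]$, so $-\log P$ ranges from $+\infty$ down to 0. On $(0,1)$ the curve $y = -\log (x)$ has a very large slope near 0, so a sample whose initial probability is close to 0 contributes a large gradient, and summing many such terms without averaging makes this worse.

Let $\zeta(x) = \log(1+e^x)$ be the softplus function. The cost can then be written concisely as

$C(\boldsymbol{x}, \boldsymbol{\omega}) = \frac{1}{m} \sum_i [(1-y_i^{true})\zeta(\boldsymbol{\omega}^T\boldsymbol{x}_i+b)+y_i^{true}\zeta(-(\boldsymbol{\omega}^T\boldsymbol{x}_i+b))].$

This is only the special case for labels $y \in \{0,1\}$. The more general form of the logistic regression cost is $C =-\frac{1}{m} \log \prod_i P(y_i=y_i^{true}|\boldsymbol{x}_i)$, also known as the cross-entropy loss. If the labels are $y\in \{-1,1\}$ and $P(y=1|\boldsymbol{x}) = \sigma(\boldsymbol{\omega}^T\boldsymbol{x}+b)$, the cost becomes even more compact: $C(\boldsymbol{x},\boldsymbol{\omega}) = \frac{1}{m} \sum_i \zeta(-y_i(\boldsymbol{\omega}^T\boldsymbol{x}_i+b)).$

This is the form the authors use in this part of the paper.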
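The claim that the $\{0,1\}$ and $\{-1,1\}$ forms of the cost are the same quantity can be checked numerically. The sketch below uses random data and parameters (all hypothetical) and NumPy's `logaddexp` for a stable softplus.

```python
import numpy as np

def softplus(z):
    """zeta(z) = log(1 + exp(z)), computed stably."""
    return np.logaddexp(0.0, z)

def cost_01(w, b, X, y01):
    """Cost for labels in {0, 1}."""
    z = X @ w + b
    return float(np.mean((1 - y01) * softplus(z) + y01 * softplus(-z)))

def cost_pm1(w, b, X, ypm):
    """Cost for labels in {-1, +1}."""
    z = X @ w + b
    return float(np.mean(softplus(-ypm * z)))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # hypothetical inputs
w = rng.normal(size=3)               # hypothetical weights
b = 0.1
y01 = rng.integers(0, 2, size=8)     # labels in {0, 1}
ypm = 2 * y01 - 1                    # the same labels mapped to {-1, +1}

print(cost_01(w, b, X, y01), cost_pm1(w, b, X, ypm))
```

Both functions return the same value, because $y=0$ maps to $y=-1$ and $\zeta(-(-1)z) = \zeta(z)$.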

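2*. Weight Decay

Weight decay is a regularization technique that suppresses overfitting.

It is generally assumed that the larger a model's weights, the more complex the model. In the figure from the original post, the blue curve overfits: the model is relatively complex, and such models tend to have larger weights.

Conversely, the smaller the weights, the lower the model's complexity.

Weight decay keeps the weights as small as possible during training and thereby reduces complexity.

The usual approach adds a norm of the model weights to the original cost function (the L-1 norm $\lVert\omega\rVert_1$ or the squared L-2 norm $\lVert\omega\rVert_2^2$). Taking the L-2 case, with $C_0$ the original cost function: $C = C_0 + \frac{\lambda}{2}\lVert\omega\rVert_2^2$

During training, the norm term shrinks as $C$ is minimized, which effectively reduces the weights. $\lambda$ controls how fast the norm shrinks: the larger $\lambda$, the faster. If $\lambda$ is too large, however, the original cost contributes too little, gradient descent mostly shrinks the norm term, and the fit gets worse.

Weight decay usually favors the L-2 norm $\lVert\omega\rVert_2 = \sqrt{\omega^T\omega}$ over the L-1 norm $\lVert\omega\rVert_1=\sum_i |\omega_i|$, because an L-1 constraint drives some of the weights to 0, as the figure in the original post illustrated (L-1 constraint on the left, L-2 constraint on the right). For that reason the L-1 norm is commonly used to produce sparse weight vectors (matrices) and for feature selection.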
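The different behavior of the two penalties can be sketched with a single update step. This is an illustration under assumptions, not something from the paper: the L-2 step is plain gradient descent on $C_0 + \frac{\lambda}{2}\lVert\omega\rVert_2^2$, while for L-1 the sketch uses soft-thresholding (a proximal step), which is one common way to handle the non-differentiable L-1 term. The learning rate, $\lambda$, and the weights are made up.

```python
import numpy as np

def l2_decay_step(w, grad_c0, lr, lam):
    """Gradient step on C0 + (lam / 2) * ||w||_2^2; the penalty gradient is lam * w."""
    return w - lr * (grad_c0 + lam * w)

def l1_prox_step(w, grad_c0, lr, lam):
    """Gradient step on C0 followed by soft-thresholding for the lam * ||w||_1 term."""
    v = w - lr * grad_c0
    return np.sign(v) * np.maximum(np.abs(v) - lr * lam, 0.0)

w = np.array([1.0, -0.5, 0.02])   # hypothetical weights
no_data_grad = np.zeros_like(w)   # isolate the effect of the penalty itself

print(l2_decay_step(w, no_data_grad, lr=0.1, lam=1.0))  # every weight shrinks by the same factor
print(l1_prox_step(w, no_data_grad, lr=0.1, lam=1.0))   # the smallest weight becomes exactly 0
```

The L-2 step scales all weights proportionally and never zeroes one out, while the L-1 step subtracts a fixed amount and clips small weights to zero, which is where the sparsity comes from.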

Adversarial training of linear models

In the paper the authors attack logistic regression with FGSM.

They use a logistic regression model with labels $y\in \{-1,1\}$ and $P(y=1|\boldsymbol{x}) = \sigma(\boldsymbol{\omega}^T\boldsymbol{x}+b)$, whose training cost is

$\Bbb{E}_{\boldsymbol{x},y\sim p_{data} }\zeta(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b))$

First compute the gradient: $\begin{aligned} \nabla_x J(\boldsymbol{\theta},\boldsymbol{x},y) &= \nabla_x \zeta(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b)) \\ &= -y\boldsymbol{\omega}\, \sigma(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b)). \end{aligned}$

The perturbation is then $\begin{aligned} \eta &= \epsilon\, \text{sign}(\nabla_x J (\boldsymbol{\theta},\boldsymbol{x},y)) \\ &= \epsilon\, \text{sign} (-y\boldsymbol{\omega}\, \sigma(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b))) \\ &= \epsilon\, \text{sign}(-y\boldsymbol{\omega}), \end{aligned}$ where the last step uses $\sigma(\cdot) > 0$.

Substituting $\tilde{x} = x + \eta$ into the cost, and using $\boldsymbol{\omega}^T\text{sign}(\boldsymbol{\omega}) = \lVert\boldsymbol{\omega}\rVert_1$ and $y^2 = 1$, gives

$\Bbb{E}_{\boldsymbol{x},y\sim p_{data}} \zeta(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b)+\epsilon \lVert \boldsymbol{\omega} \rVert_1)$

In the original paper the authors give $\text{sign}(\nabla_xJ)=-\text{sign}(\boldsymbol{\omega})$ and $\Bbb{E}_{\boldsymbol{x},y\sim p_{data}} \zeta(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b-\epsilon \lVert \boldsymbol{\omega} \rVert_1))$. However, I could not find any reading under which $y$ drops out of the expression for $\eta$. After also consulting other people's notes, I believe the paper may contain a mistake here, so I followed my own derivation and arrived at the expression above.
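The derivation in these notes can be checked numerically. The sketch below uses made-up weights and inputs: it builds the FGSM perturbation from the analytic gradient and compares the cost at $x+\eta$ with the closed form $\zeta(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b)+\epsilon\lVert\boldsymbol{\omega}\rVert_1)$ for both labels.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def grad_x(w, b, x, y):
    """nabla_x softplus(-y (w.x + b)) = -y * w * sigmoid(-y (w.x + b))."""
    s = 1.0 / (1.0 + np.exp(y * (w @ x + b)))   # sigmoid(-y z)
    return -y * w * s

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # hypothetical weights
b = 0.3
eps = 0.1                # hypothetical perturbation budget

for y in (-1, 1):
    x = rng.normal(size=5)
    eta = eps * np.sign(grad_x(w, b, x, y))
    direct = float(softplus(-y * (w @ (x + eta) + b)))
    closed = float(softplus(-y * (w @ x + b) + eps * np.abs(w).sum()))
    print(y, direct, closed)
```

For both $y=-1$ and $y=1$ the two numbers agree, and `eta` equals `-y * eps * sign(w)`, so the label does not drop out of the perturbation.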

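Adversarial examples obtained by training with this cost were shown in a figure in the original post.

Comparison with L-1 regularization

The cost we obtained: $\zeta(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b)+\epsilon \lVert \boldsymbol{\omega} \rVert_1)$

The cost with L-1 regularization: $\zeta(-y(\boldsymbol{\omega}^T\boldsymbol{x}+b))+\lambda \lVert \boldsymbol{\omega} \rVert_1$

The two expressions look very similar, but in the adversarial cost the $\lVert\boldsymbol{\omega}\rVert_1$ term sits inside the softplus, added to the model's activation, whereas in regularization it is added to the cost as a whole.

↑ I really could not follow this part…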
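One way to make the difference concrete: the paper notes that because the penalty enters through the activation, it can start to disappear once the model predicts confidently enough for $\zeta$ to saturate, while an additive L-1 penalty never does. The sketch below uses a made-up value `0.5` for both $\epsilon\lVert\boldsymbol{\omega}\rVert_1$ and $\lambda\lVert\boldsymbol{\omega}\rVert_1$ and varies the margin $y(\boldsymbol{\omega}^T\boldsymbol{x}+b)$.

```python
import math

def softplus(z):
    """Stable log(1 + exp(z)) for a scalar."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

penalty = 0.5   # hypothetical value of epsilon * ||w||_1 and of lambda * ||w||_1

for margin in (0.0, 2.0, 5.0, 10.0):            # margin = y (w.x + b)
    clean_cost = softplus(-margin)
    adversarial_extra = softplus(-margin + penalty) - clean_cost
    l1_extra = penalty                          # additive, independent of the margin
    print(margin, round(adversarial_extra, 5), l1_extra)
```

As the margin grows, the extra cost from the adversarial term shrinks toward zero, whereas the L-1 term stays constant. This is consistent with the paper's remark that L-1 weight decay is more pessimistic than adversarial training in the underfitting regime.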

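Adversarial training of deep networks

The adversarial training objective

Szegedy et al. showed that training on a mixture of adversarial and clean examples regularizes a neural network to some extent. Training on adversarial examples differs from other forms of data augmentation: the inputs it uses are ones that are unlikely to occur naturally in the real world. At the time, this procedure had not been shown to improve on dropout. The authors suggest this may be because it was difficult to experiment extensively with expensive adversarial examples based on L-BFGS.

The authors therefore build an objective function based on FGSM that acts as an effective regularizer:

$\tilde{J}(\boldsymbol{\theta},\boldsymbol{x},y)=\alpha J(\boldsymbol{\theta},\boldsymbol{x},y)+(1-\alpha)J(\boldsymbol{\theta},\boldsymbol{x}+\epsilon\text{sign}(\nabla_xJ(\boldsymbol{\theta},\boldsymbol{x},y)),y)$

In their experiments the authors set $\alpha = 0.5$ and trained a maxout network that was also regularized with dropout. With adversarial training the error rate dropped from 0.94% to 0.84%.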
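The objective $\tilde{J}$ is easy to write down for a model whose input gradient is known in closed form. The sketch below applies it to the logistic regression model from the previous section rather than to a maxout network, with random data and parameters (all hypothetical). It only evaluates the objective and does not run a training loop.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def loss(w, b, X, y):
    """Mean softplus loss J for labels y in {-1, +1}."""
    return float(np.mean(softplus(-y * (X @ w + b))))

def input_grad(w, b, X, y):
    """Per-example gradient of the loss with respect to each input row."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w + b)))   # sigmoid(-y z)
    return (-y * s)[:, None] * w[None, :]

def adversarial_objective(w, b, X, y, epsilon, alpha=0.5):
    """alpha * J(x) + (1 - alpha) * J(x + epsilon * sign(grad_x J))."""
    X_adv = X + epsilon * np.sign(input_grad(w, b, X, y))
    return alpha * loss(w, b, X, y) + (1 - alpha) * loss(w, b, X_adv, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))           # hypothetical inputs
y = rng.choice([-1, 1], size=16)       # hypothetical labels
w = rng.normal(size=4)
b = 0.0

print(loss(w, b, X, y), adversarial_objective(w, b, X, y, epsilon=0.1))
```

With `epsilon=0` the objective reduces to the plain loss. For any positive `epsilon` it is strictly larger, since every example is moved in the direction that increases its own loss.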

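The authors observed that the model was not reaching zero error rate on adversarial examples from the training set. They addressed this with two changes:

Increase the number of units in the hidden layers to make the model larger
Use early stopping during training to choose the number of training epochs

After retraining with this criterion they obtained an average error rate of 0.782%, the best result reported at the time on the permutation-invariant version of MNIST.

Properties of adversarial training

The adversarially trained model also became resistant to adversarial examples: the model without adversarial training had an error rate of 89.4% on adversarial examples, while the adversarially trained model had an error rate of 17.9%.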

Adversarial examples are transferable between the two models but with the adversarially trained model showing greater robustness. Adversarial examples generated via the original model yield an error rate of 19.6% on the adversarially trained model, while adversarial examples generated via the new model yield an error rate of 40.9% on the original model.

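Adversarial examples also transfer between the two models.

The adversarially trained model is the more robust of the two. Adversarial examples generated from the original model cause a 19.6% error rate on the adversarially trained model, while adversarial examples generated from the new model cause a 40.9% error rate on the original model. (Is this a fair comparison?)

When the adversarially trained network does misclassify an adversarial example, its predictions are still highly confident, 81.4% on average.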

We also found that the weights of the learned model changed significantly, with the weights of the adversarially trained model being significantly more localized and interpretable.

The authors also found that the weights changed significantly after adversarial training: they became more localized and more interpretable.

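Interpreting adversarial training

Adversarial training can be seen as minimizing the worst-case error when the data is perturbed. It can also be read as minimizing an upper bound on the expected cost under random perturbations drawn from the uniform distribution $U(-\epsilon,\epsilon)$. It can further be seen as a form of active learning, in which the model can request labels for new points (adversarial examples), and those labels are copied from nearby points (clean examples).

My own reading, with reference to the adversarial training objective:
$\tilde{J}(\boldsymbol{\theta},\boldsymbol{x},y)=\alpha J(\boldsymbol{\theta},\boldsymbol{x},y)+(1-\alpha)J(\boldsymbol{\theta},\boldsymbol{x}+\epsilon\text{sign}(\nabla_xJ(\boldsymbol{\theta},\boldsymbol{x},y)),y)$
The first term is the clean example and the second is the adversarial example. They share the same label, and since the perturbation is small, they lie close to each other in the high-dimensional space.

We could also train on all points (or a sample of points) within the $\epsilon$ max-norm box around each training example, to make the model insensitive to changes in its features that are smaller than $\epsilon$. However, training with zero-mean, zero-covariance noise is very inefficient at preventing adversarial examples, because the expected dot product between any reference vector and such a noise vector is 0. In most cases, then, noise does not produce an input that is harder for the model (which I take to mean: more adversarial).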

We can think of adversarial training as doing hard example mining among the set of noisy inputs, in order to train more efficiently by considering only those noisy points that strongly resist classification. As control experiments, we trained training a maxout network with noise based on randomly adding $\pm \epsilon$ to each pixel, or adding noise in $U(-\epsilon,\epsilon)$ to each pixel. These obtained an error rate of 86.2% with confidence 97.3% and an error rate of 90.4% with a confidence of 97.8% respectively on fast gradient sign adversarial examples.

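The authors trained two maxout networks, one with random $\pm \epsilon$ noise added to each pixel of the input image and one with $U(-\epsilon,\epsilon)$ noise added to each pixel. On FGSM adversarial examples these obtained an error rate of 86.2% with 97.3% confidence and an error rate of 90.4% with 97.8% confidence, respectively. Training with random noise therefore does not make the model more resistant to adversarial examples. Adversarial training can be seen as hard example mining: compared with random noise, it picks from the many noisy inputs those that most strongly resist classification and trains on them, which is why it works better.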
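The gap between random noise and the FGSM direction can be illustrated with a linear score. In this sketch `w` is a random stand-in for a weight or gradient direction with MNIST-sized dimension 784, and `epsilon = 0.25` matches the value the notes quote for the MNIST experiments. Everything else is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, epsilon = 784, 0.25
w = rng.normal(size=n)   # hypothetical direction (weights or input gradient)

# 2000 random perturbations of each kind, all inside the same epsilon-box.
random_sign_noise = epsilon * rng.choice([-1.0, 1.0], size=(2000, n))
uniform_noise = rng.uniform(-epsilon, epsilon, size=(2000, n))
fgsm_shift = float(epsilon * np.abs(w).sum())   # shift from eta = epsilon * sign(w)

print("mean shift, +/-eps noise:   ", float((random_sign_noise @ w).mean()))
print("mean shift, uniform noise:  ", float((uniform_noise @ w).mean()))
print("largest shift, +/-eps noise:", float((random_sign_noise @ w).max()))
print("FGSM shift:                 ", fgsm_shift)
```

The random perturbations shift the score by an amount that averages out near zero, and even the largest of 2000 draws stays far below the FGSM shift. This is the sense in which random noise rarely produces a hard example.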

Attempting to compensate for the shortcomings of the sign function

Because the derivative of the sign function is zero or undefined everywhere, gradient descent on the adversarial objective function based on the fast gradient sign method does not allow the model to anticipate how the adversary will react to changes in the parameters. If we instead adversarial examples based on small rotations or addition of the scaled gradient, then the perturbation process is itself differentiable and the learning can take the reaction of the adversary into account. However, we did not find nearly as powerful of a regularizing result from this process, perhaps because these kinds of adversarial examples are not as difficult to solve.

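The derivative of the sign function is either zero or undefined everywhere. Gradient descent on the FGSM-based objective therefore cannot anticipate how the adversary will react to changes in the parameters (no gradient information flows through the perturbation). The authors instead tried adversarial examples based on small rotations or on addition of the scaled gradient, so that the perturbation process is itself differentiable and learning can take the adversary's reaction into account. This did not give nearly as strong a regularizing effect, perhaps because these kinds of adversarial examples are not as difficult to solve.

Perturbing the input layer or the hidden layers: discussion and explanation

Szegedy et al. reported that adversarial perturbations give the best regularization when applied to the hidden layers. That result was obtained on a sigmoidal network.

In the authors' FGSM experiments, networks whose hidden units have unbounded activations simply respond by making their hidden activations very large, so it is usually better to perturb the original input.

On saturating models such as the Rust model, perturbing the input performed comparably to perturbing the hidden layers. Perturbations based on rotating the hidden layers solve the problem of unbounded activations growing. The authors successfully trained a maxout network with rotational perturbations of the hidden layers, but this did not give a regularization effect as strong as perturbing the input layer directly.

The authors' explanation is that adversarial training is only clearly useful when the model has the capacity to learn to resist adversarial examples, which is only clearly the case when a universal approximator theorem applies. Because the last layer of a neural network (the linear-sigmoid or linear-softmax layer) is not a universal approximator of functions of the final hidden layer, applying adversarial perturbations to the final hidden layer is likely to cause underfitting. The authors observed exactly this: none of their best results with hidden-layer perturbations involved perturbing the final hidden layer.

I did not fully understand this part. My reading is that adversarial training is a strong regularizer suited to models with high learning capacity (which overfit easily), whereas the part of the network from the last hidden layer to the output has little capacity and does not overfit easily, so applying adversarial perturbations there leads to underfitting.

Different kinds of model capacity

Many people assume that low-capacity models cannot make many different confident predictions, but this is not correct in general. Some low-capacity models do behave this way, for example a shallow RBF network with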

$P(y=1 \vert \boldsymbol{x})=\exp ((\boldsymbol{x}-\mu)^T\beta(\boldsymbol{x}-\mu))$

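Such a network can only confidently predict that the positive class is present in the vicinity of $\mu$. Elsewhere it either cannot predict or predicts with low confidence.

RBF networks are, however, naturally immune to adversarial examples in the sense that they have low confidence when they are fooled. A shallow RBF network with no hidden layers has an error rate of 55.4% on adversarial examples generated from MNIST with FGSM ( $\epsilon = 0.25$ ), but its confidence on the mistaken examples is only 1.2%, whereas its confidence on clean examples is 60.6%. We cannot expect a low-capacity model to get the right answer at every point in space, but it does respond with low confidence at the points it does not "understand".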
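The locality of an RBF unit can be seen directly from the formula. In the sketch below, `mu` and `beta` are made-up values. `beta` is assumed to be negative definite (here $-0.5$ times the identity) so that the output stays in $(0, 1]$, an assumption the formula above needs in order to be read as a probability.

```python
import numpy as np

def rbf_confidence(x, mu, beta):
    """P(y = 1 | x) = exp((x - mu)^T beta (x - mu)) for a single RBF unit."""
    d = x - mu
    return float(np.exp(d @ beta @ d))

mu = np.zeros(2)             # hypothetical center
beta = -0.5 * np.eye(2)      # assumed negative definite

for radius in (0.0, 1.0, 3.0):
    x = np.array([radius, 0.0])
    print(radius, rbf_confidence(x, mu, beta))
```

The confidence is 1 at the center and decays quickly with distance, so any input pushed far from `mu`, adversarial or not, receives a low-confidence prediction.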

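A linear unit responds to every input along a certain direction, so it achieves high recall, but it responds too strongly in unfamiliar situations, so its precision is low. An RBF unit responds only to a specific point in space, so it achieves high precision at the cost of recall. Motivated by this, the authors explored models with quadratic units, including deep RBF networks. This turned out to be a difficult task: every model with enough quadratic inhibition to resist adversarial perturbation had a high training set error when trained with SGD.

Recall: how many of the actual positives were predicted correctly (finding them all)

Precision: how many of the samples predicted positive are truly positive (getting them right)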
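The two definitions can be turned into a few lines of plain Python. The label lists below are invented to mimic the contrast in the text: an "eager" predictor that fires broadly, like a linear unit, and a "picky" one that fires rarely, like an RBF unit.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
eager = [1, 1, 1, 1, 1, 0]   # predicts positive broadly
picky = [1, 0, 0, 0, 0, 0]   # predicts positive only when very sure

print("eager:", precision_recall(y_true, eager))
print("picky:", precision_recall(y_true, picky))
```

The eager predictor finds every positive but also raises false alarms, so its recall is high and its precision is lower. The picky predictor is the reverse.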

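Why do adversarial examples generalize

The transferability of adversarial examples, and the fact that the wrong label assigned to an adversarial example by one model is also assigned with high confidence by another model (different architecture, different training set), are hard to explain under the traditional nonlinearity and overfitting hypothesis. That hypothesis holds that adversarial examples are distributed in space like the rational numbers: common, but occurring only at very precise locations.

Under the authors' linear view, adversarial examples occur in broad subspaces: the direction of the perturbation only needs to have a positive dot product with the gradient of the cost function, and $\epsilon$ only needs to be large enough.

To verify this, the authors swept over a wide range of $\epsilon$ values and plotted the unnormalized log probabilities of the 10 MNIST classes against $\epsilon$.

The experiment shows that adversarial examples occur in contiguous regions of the one-dimensional subspace defined by FGSM. This explains why adversarial examples are abundant and why an example misclassified by one classifier has a fairly high prior probability of being misclassified by another.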
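A rough sketch of this kind of sweep, under heavy assumptions: instead of the paper's maxout network it uses a randomly initialized linear softmax classifier (`W`, `b`), a random vector in place of an MNIST image, and treats the clean prediction as the label. It also skips clipping to the valid pixel range. It only shows the mechanics of tracing logits along the FGSM direction.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.05 * rng.normal(size=(10, 784))   # hypothetical linear classifier weights
b = np.zeros(10)
x = rng.uniform(0.0, 1.0, size=784)     # stand-in for a flattened image

def logits(v):
    """Unnormalized log probabilities (the arguments to the softmax)."""
    return W @ v + b

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

label = int(np.argmax(logits(x)))       # treat the clean prediction as the label
# Gradient of the cross-entropy loss w.r.t. the input of a linear softmax model.
grad = W.T @ (softmax(logits(x)) - np.eye(10)[label])
direction = np.sign(grad)

for eps in np.linspace(-0.2, 0.2, 9):
    z = logits(x + eps * direction)
    print(f"eps={eps:+.2f}  predicted={int(np.argmax(z))}  label logit={z[label]:.2f}")
```

For a linear model every logit is an exactly linear function of `eps`, so the predicted class can only change at a few crossing points and stays constant over whole intervals in between. This is the simplest version of the contiguous regions described above.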

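I find this hard to follow. The adversarial examples used here are so heavily perturbed that even a human clearly cannot classify them, so how does this show that adversarial examples are contiguous? (Can such severely distorted inputs even be called adversarial examples?) And how does this figure explain why an example misclassified by one classifier has a fairly high prior probability of being misclassified by another?

To explain why many models assign adversarial examples to the same class, the authors hypothesize that neural networks trained with current methods all resemble the linear classifier learned on the same training set. This reference classifier learns approximately the same classification weights when trained on different subsets of the training set, simply because machine learning algorithms are able to generalize. The stability of the underlying classification weights in turn results in the stability of adversarial examples.

My reading: current neural networks all strive to approximate a hypothetical perfectly correct classifier, the true human classification criterion, and the "same training set" is the one made up of everything in the real world. Any training set used in practice comes from the real world and is a subset of it. By training on these subsets, networks all converge toward this "perfectly correct classifier", and therefore always share similar characteristics (that is, they learn similar classification weights).

This is how I understand the "generalization ability" of machine learning algorithms.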

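Summary

Adversarial examples can be explained as a property of high-dimensional dot products. They result from models being too linear, not too nonlinear.
The generalization of adversarial examples across different models can be explained by adversarial perturbations being highly aligned with a model's weight vectors, and by different models learning similar functions when trained to perform the same task.
The direction of the perturbation is what matters most. Space is not full of pockets of adversarial examples that finely tile the reals like the rational numbers.
Because direction matters most, adversarial perturbations generalize across different clean examples.
The paper introduces a family of fast methods for generating adversarial examples.
The paper shows that adversarial training can result in regularization, even further regularization than dropout.
Control experiments with simpler but less efficient regularizers (including L-1 weight decay and adding noise) failed to reproduce this effect.
Models that are easy to optimize are easy to perturb.
Linear models lack the capacity to resist adversarial perturbation. Only structures with a hidden layer (where the universal approximator theorem applies) can be trained to resist it.
RBF networks are resistant to adversarial examples.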
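Models trained to model the input distribution are not resistant to adversarial examples. (I first guessed this referred to the $U(-\epsilon,\epsilon)$ and $\pm \epsilon$ noise experiments, but the paper's point here concerns generatively trained models such as the MP-DBM.)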
Ensembles are not resistant to adversarial examples. (I did not understand this one.)

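Code example: an FGSM attack on MNIST

The code is based on https://github.com/HAI4831/fgsm_tutorial , with some consolidation and changes.

The pretrained model file is lenet_mnist_model.pth, which the original post offers as a download.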

# https://github.com/HAI4831/fgsm_tutorial
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import os
from torchvision import datasets, transforms
import numpy as np
import matplotlib.pyplot as plt

epsilons = [0, .05, .1, .15, .2, .25, .3]
pretrained_model = "lenet_mnist_model.pth"
use_cuda = True
# Set random seed for reproducibility
torch.manual_seed(42)


# LeNet Model definition
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


# MNIST Test dataset and dataloader declaration
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('/data',
                   train=False,
                   download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,)),
                   ])
                   ),
    batch_size=1, shuffle=True)

# Define what device we are using
print("CUDA Available: ", torch.cuda.is_available())
device = torch.device("cuda" if use_cuda and torch.cuda.is_available() else "cpu")

# Initialize the network
model = Net().to(device)

# Load the pretrained model
model.load_state_dict(torch.load(pretrained_model, map_location=device))

# Set the model in evaluation mode. In this case this is for the Dropout layers
model.eval()


# FGSM attack code
def fgsm_attack(image, epsilon, data_grad):
    # Collect the element-wise sign of the data gradient
    sign_data_grad = data_grad.sign()
    # Create the perturbed image by adjusting each pixel of the input image
    perturbed_image = image + epsilon * sign_data_grad
    # Adding clipping to maintain [0,1] range
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    # Return the perturbed image
    return perturbed_image


# restores the tensors to their original scale
def denorm(batch, mean=[0.1307], std=[0.3081]):
    """
    Convert a batch of tensors to their original scale.

    Args:
        batch (torch.Tensor): Batch of normalized tensors.
        mean (torch.Tensor or list): Mean used for normalization.
        std (torch.Tensor or list): Standard deviation used for normalization.

    Returns:
        torch.Tensor: batch of tensors without normalization applied to them.
    """
    if isinstance(mean, list):
        mean = torch.tensor(mean).to(device)
    if isinstance(std, list):
        std = torch.tensor(std).to(device)

    return batch * std.view(1, -1, 1, 1) + mean.view(1, -1, 1, 1)


def test(model, device, test_loader, epsilon):
    # Accuracy counter
    correct = 0
    adv_examples = []

    # Loop over all examples in test set
    for data, target in test_loader:

        # Send the data and label to the device
        data, target = data.to(device), target.to(device)

        # Set requires_grad attribute of tensor. Important for Attack
        data.requires_grad = True

        # Forward pass the data through the model
        output = model(data)
        init_pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability

        # If the initial prediction is wrong, don't bother attacking, just move on
        if init_pred.item() != target.item():
            continue

        # Calculate the loss
        loss = F.nll_loss(output, target)

        # Zero all existing gradients
        model.zero_grad()

        # Calculate gradients of model in backward pass
        loss.backward()

        # Collect ``datagrad``
        data_grad = data.grad.data

        # Restore the data to its original scale
        data_denorm = denorm(data)

        # Call FGSM Attack
        perturbed_data = fgsm_attack(data_denorm, epsilon, data_grad)

        # Reapply normalization
        perturbed_data_normalized = transforms.Normalize((0.1307,), (0.3081,))(perturbed_data.squeeze(dim=0))
        perturbed_data_normalized = perturbed_data_normalized.unsqueeze(dim=0)
        # Re-classify the perturbed image
        output = model(perturbed_data_normalized)

        # Check for success
        final_pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
        if final_pred.item() == target.item():
            correct += 1
            # Special case for saving 0 epsilon examples
            if epsilon == 0 and len(adv_examples) < 5:
                adv_ex = perturbed_data.squeeze().detach().cpu().numpy()
                adv_examples.append((init_pred.item(), final_pred.item(), adv_ex))
        else:
            # Save some adv examples for visualization later
            if len(adv_examples) < 5:
                adv_ex = perturbed_data.squeeze().detach().cpu().numpy()
                adv_examples.append((init_pred.item(), final_pred.item(), adv_ex))

    # Calculate final accuracy for this epsilon
    final_acc = correct / float(len(test_loader))
    print(f"Epsilon: {epsilon}\tTest Accuracy = {correct} / {len(test_loader)} = {final_acc}")

    # Return the accuracy and an adversarial example
    return final_acc, adv_examples


accuracies = []
examples = []

# Run test for each epsilon
for eps in epsilons:
    acc, ex = test(model, device, test_loader, eps)
    accuracies.append(acc)
    examples.append(ex)

plt.figure(figsize=(5, 5))
plt.plot(epsilons, accuracies, "*-")
plt.yticks(np.arange(0, 1.1, step=0.1))
plt.xticks(np.arange(0, .35, step=0.05))
plt.title("Accuracy vs Epsilon")
plt.xlabel("Epsilon")
plt.ylabel("Accuracy")
plt.show()

# Plot several examples of adversarial samples at each epsilon
cnt = 0
plt.figure(figsize=(8, 10))
for i in range(len(epsilons)):
    for j in range(len(examples[i])):
        cnt += 1
        plt.subplot(len(epsilons), len(examples[0]), cnt)
        plt.xticks([], [])
        plt.yticks([], [])
        if j == 0:
            plt.ylabel(f"Eps: {epsilons[i]}", fontsize=14)
        orig, adv, ex = examples[i][j]
        plt.title(f"{orig} -> {adv}")
        plt.imshow(ex, cmap="gray")
plt.tight_layout()
plt.show()

This article is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International license. Please credit the source when reposting.