Dive into Deep Learning (《动手学深度学习》): A Review of Core Concepts

Preface

This article systematically reviews the core content of Dive into Deep Learning (《动手学深度学习》), covering key concepts, mathematical formulas, and practical techniques. It is intended as a quick-reference handbook for reviewing deep learning.

1. Deep Learning Fundamentals

1.1 Perceptrons and Neurons

Output of a single neuron:

$$y = \sigma(w^T x + b)$$

where:

  • $x$ is the input vector
  • $w$ is the weight vector
  • $b$ is the bias
  • $\sigma$ is the activation function

1.2 Common Activation Functions

Sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$

Tanh:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Derivative: $\tanh'(x) = 1 - \tanh^2(x)$

ReLU (most widely used):

$$\text{ReLU}(x) = \max(0, x)$$

Leaky ReLU:

$$\text{LeakyReLU}(x) = \max(\alpha x, x), \quad \alpha \in (0, 1)$$
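
A minimal NumPy sketch of these four activations (illustrative code, not from the book):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x), relu(x), leaky_relu(x))
```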

2. Forward Propagation and Backpropagation

2.1 Forward Propagation

For an $L$-layer neural network:

$$\begin{aligned} z^{[l]} &= W^{[l]} a^{[l-1]} + b^{[l]} \\ a^{[l]} &= \sigma^{[l]}(z^{[l]}) \end{aligned}$$

where $a^{[0]} = x$ (the input layer).
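
A minimal NumPy sketch of this layer-by-layer forward pass, assuming ReLU hidden layers and a linear output layer (the helper names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, params):
    """a^{[l]} = sigma(W^{[l]} a^{[l-1]} + b^{[l]}) for each layer.
    `params` is a list of (W, b) pairs; ReLU on hidden layers, identity on the output."""
    a = x
    for l, (W, b) in enumerate(params):
        z = W @ a + b
        a = relu(z) if l < len(params) - 1 else z
    return a

# Toy 2-layer network: 3 -> 4 -> 2
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
print(forward(rng.normal(size=3), params))
```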

2.2 Loss Functions

Mean squared error (MSE):

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Cross-entropy loss (classification):

$$L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})$$

Binary cross-entropy:

$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$
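
A small NumPy sketch of the multi-class cross-entropy above, assuming the model already outputs probabilities (illustrative; frameworks usually fuse the softmax and the log for numerical stability):

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Multi-class cross-entropy; y_true is one-hot (n, C), y_prob holds
    predicted probabilities (n, C). eps guards against log(0)."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

y_true = np.array([[1, 0, 0], [0, 1, 0]])              # two samples, three classes
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_true, y_prob))                   # ~0.29
```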

2.3 Backpropagation

The core chain rule:

$$\frac{\partial L}{\partial W^{[l]}} = \frac{\partial L}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial W^{[l]}}$$

Gradient computation:

$$\begin{aligned} \delta^{[l]} &= \frac{\partial L}{\partial z^{[l]}} = \frac{\partial L}{\partial a^{[l]}} \odot \sigma'^{[l]}(z^{[l]}) \\ \frac{\partial L}{\partial W^{[l]}} &= \delta^{[l]} (a^{[l-1]})^T \\ \frac{\partial L}{\partial b^{[l]}} &= \delta^{[l]} \end{aligned}$$
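
To make the delta equations concrete, here is a hand-worked backward pass for a toy two-layer network with a sigmoid hidden layer and MSE loss (a sketch for illustration; in practice frameworks compute these gradients automatically via autograd):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))
y = rng.normal(size=(2, 1))
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y_hat = W2 @ a1 + b2                       # linear output layer
loss = np.mean((y - y_hat) ** 2)

# Backward pass (chain rule)
delta2 = 2 * (y_hat - y) / y.size          # dL/dz2 for MSE with a linear output
dW2, db2 = delta2 @ a1.T, delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dL/dz1 = (W2^T delta2) ⊙ sigma'(z1)
dW1, db1 = delta1 @ x.T, delta1
print(loss, dW1.shape, dW2.shape)
```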

3. Optimization Algorithms

3.1 Batch Gradient Descent (BGD)

$$W := W - \eta \nabla_W L$$

  • Pros: stable
  • Cons: expensive, since each update uses the full dataset

3.2 Stochastic Gradient Descent (SGD)

$$W := W - \eta \nabla_W L_i$$

  • Pros: fast updates
  • Cons: noisy, unstable

3.3 Mini-batch Gradient Descent

$$W := W - \eta \frac{1}{m} \sum_{i=1}^{m} \nabla_W L_i$$

The most commonly used compromise between the two.

3.4 Momentum

$$\begin{aligned} v_t &= \beta v_{t-1} + \eta \nabla_W L \\ W &:= W - v_t \end{aligned}$$

Typical value: $\beta = 0.9$

3.5 AdaGrad

$$\begin{aligned} s_t &= s_{t-1} + (\nabla_W L)^2 \\ W &:= W - \frac{\eta}{\sqrt{s_t + \epsilon}} \nabla_W L \end{aligned}$$

3.6 RMSProp

$$\begin{aligned} s_t &= \beta s_{t-1} + (1-\beta)(\nabla_W L)^2 \\ W &:= W - \frac{\eta}{\sqrt{s_t + \epsilon}} \nabla_W L \end{aligned}$$

3.7 Adam (most widely used)

Combines momentum and RMSProp:

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1) \nabla_W L \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) (\nabla_W L)^2 \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ W &:= W - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \end{aligned}$$

Typical hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
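
A from-scratch sketch of one Adam step following the equations above, assuming the gradient is supplied by the caller (the toy objective $f(w) = \|w\|^2$ is only for demonstration):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy run: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)  # approaches 0
```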

4. Regularization Techniques

4.1 L2 Regularization (Weight Decay)

$$L_{\text{total}} = L_{\text{original}} + \frac{\lambda}{2n} \sum_{l} \|W^{[l]}\|_F^2$$

The gradient update becomes:

$$W := W - \eta \nabla_W L - \eta \frac{\lambda}{n} W$$

4.2 L1 Regularization

$$L_{\text{total}} = L_{\text{original}} + \frac{\lambda}{n} \sum_{l} \|W^{[l]}\|_1$$

Encourages sparse weights.

4.3 Dropout

Training: drop each neuron with probability $p$.

$$h_{\text{train}} = \frac{1}{1-p} \cdot \text{mask}(h)$$

Testing: use the full network (no dropout).
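
A minimal sketch of the inverted-dropout scheme described above (function name and arguments are illustrative):

```python
import numpy as np

def dropout_forward(h, p, training=True):
    """Zero each unit with probability p and rescale by 1/(1-p) during training;
    return h unchanged at test time."""
    if not training or p == 0.0:
        return h
    mask = (np.random.rand(*h.shape) > p).astype(h.dtype)
    return mask * h / (1.0 - p)

h = np.ones((2, 5))
print(dropout_forward(h, p=0.5))                   # ~half the entries 0, the rest 2
print(dropout_forward(h, p=0.5, training=False))   # unchanged
```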

4.4 Batch Normalization

$$\begin{aligned} \mu_B &= \frac{1}{m} \sum_{i=1}^{m} x_i \\ \sigma_B^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \\ \hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\ y_i &= \gamma \hat{x}_i + \beta \end{aligned}$$

where $\gamma$ and $\beta$ are learnable parameters.
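
A training-time sketch of batch normalization over the batch dimension, following the four equations above (inference would use running estimates of the mean and variance, which are omitted here):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(8, 4) * 3.0 + 5.0          # batch of 8, 4 features
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```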

5. Convolutional Neural Networks (CNN)

5.1 The Convolution Operation

Two-dimensional convolution (as implemented in deep learning frameworks, this is actually cross-correlation):

$$(X * K)_{i,j} = \sum_{a} \sum_{b} X_{i+a, j+b} \cdot K_{a,b}$$

5.2 Output Size

$$\text{Output Size} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1$$

where:

  • $n$: input size
  • $p$: padding
  • $k$: kernel size
  • $s$: stride
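
A small helper implementing this formula (illustrative, not from the book):

```python
def conv_output_size(n, k, p=0, s=1):
    """Output size of one spatial dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# 32x32 input, 3x3 kernel, padding 1, stride 1 -> 32 (shape-preserving)
print(conv_output_size(32, k=3, p=1, s=1))
# 224x224 input, 7x7 kernel, padding 3, stride 2 -> 112
print(conv_output_size(224, k=7, p=3, s=2))
```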

5.3 Pooling Layers

Max pooling:

$$y_{i,j} = \max_{(a,b) \in R_{i,j}} x_{a,b}$$

Average pooling:

$$y_{i,j} = \frac{1}{|R_{i,j}|} \sum_{(a,b) \in R_{i,j}} x_{a,b}$$

5.4 Classic CNN Architectures

LeNet-5: Conv → Pool → Conv → Pool → FC → FC

AlexNet: deeper, uses ReLU and Dropout

VGG: stacks of 3×3 convolutions

ResNet: residual connections, $H(x) = F(x) + x$

Inception: parallel multi-scale convolutions

6. Recurrent Neural Networks (RNN)

6.1 Vanilla RNN

$$\begin{aligned} h_t &= \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \\ y_t &= W_{hy} h_t + b_y \end{aligned}$$

Problem: vanishing/exploding gradients.

6.2 LSTM (Long Short-Term Memory)

Forget gate:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Input gate:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

Candidate memory cell:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Memory cell update:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Output gate:

$$\begin{aligned} o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\ h_t &= o_t \odot \tanh(C_t) \end{aligned}$$
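
A single-step LSTM cell sketch following the gate equations above, assuming one weight matrix that maps the concatenated $[h_{t-1}, x_t]$ to all four gates (a common but not the only parameterization):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W has shape (4H, H + D) and stacks the four gates."""
    hx = np.concatenate([h_prev, x_t])
    H = h_prev.size
    gates = W @ hx + b
    f = sigmoid(gates[0:H])                # forget gate
    i = sigmoid(gates[H:2*H])              # input gate
    c_tilde = np.tanh(gates[2*H:3*H])      # candidate cell
    o = sigmoid(gates[3*H:4*H])            # output gate
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 3, 2
W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):        # a length-5 input sequence
    h, c = lstm_cell(x_t, h, c, W, b)
print(h)
```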

6.3 GRU (Gated Recurrent Unit)

A simplified variant of the LSTM.

Update gate:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

Reset gate:

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

Candidate hidden state:

$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$$

Final hidden state:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

7. Attention Mechanisms and Transformers

7.1 Attention

Additive attention:

$$\text{score}(h_t, \bar{h}_s) = v^T \tanh(W_1 h_t + W_2 \bar{h}_s)$$

Dot-product attention (more common):

$$\text{score}(h_t, \bar{h}_s) = h_t^T \bar{h}_s$$

Attention weights:

$$\alpha_{ts} = \frac{\exp(\text{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\text{score}(h_t, \bar{h}_{s'}))}$$

Context vector:

$$c_t = \sum_s \alpha_{ts} \bar{h}_s$$

7.2 Self-Attention

Queries, keys, values:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

The scaling factor $\sqrt{d_k}$ keeps the dot products from growing too large, which would push the softmax into regions with vanishingly small gradients.
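
A NumPy sketch of scaled dot-product self-attention over a toy sequence (the random projection matrices and names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, computed row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # 4 tokens, d_model = 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, attn.sum(axis=-1))                 # (4, 8); each row of weights sums to 1
```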

7.3 Multi-Head Attention

$$\begin{aligned} \text{head}_i &= \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \\ \text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \end{aligned}$$

7.4 Positional Encoding

$$\begin{aligned} PE_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d}}\right) \\ PE_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d}}\right) \end{aligned}$$
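
A sketch of the sinusoidal positional encoding above, assuming an even model dimension $d$:

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d // 2)[None, :]                   # (1, d/2)
    angle = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d=16)
print(pe.shape)                                      # (50, 16)
```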

8. Generative Models

8.1 Autoencoders

$$\begin{aligned} z &= f_\theta(x) \quad \text{(encoder)} \\ \hat{x} &= g_\phi(z) \quad \text{(decoder)} \\ L &= \|x - \hat{x}\|^2 \end{aligned}$$

8.2 Variational Autoencoders (VAE)

Encoder outputs: $\mu(x), \sigma(x)$

Reparameterization trick:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Loss function:

$$L = \underbrace{\|x - \hat{x}\|^2}_{\text{reconstruction}} + \underbrace{D_{KL}(q(z|x) \,\|\, p(z))}_{\text{KL divergence}}$$
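
A PyTorch sketch of the reparameterization trick and the VAE loss, assuming (as is common in practice, though not stated above) that the encoder predicts $\log \sigma^2$ rather than $\sigma$, so the closed-form Gaussian KL term can be used:

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); sampling stays differentiable
    with respect to mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction term + closed-form KL(q(z|x) || N(0, I))."""
    recon = torch.sum((x - x_hat) ** 2)
    kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())
    return recon + kl

mu, logvar = torch.zeros(4, 8), torch.zeros(4, 8)   # toy encoder outputs
z = reparameterize(mu, logvar)
print(z.shape)                                       # torch.Size([4, 8])
```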

8.3 Generative Adversarial Networks (GAN)

Generator: $G(z)$ produces fake samples.

Discriminator: $D(x)$ distinguishes real from fake.

Loss function (min-max game):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

9. Practical Techniques

9.1 Learning Rate Schedules

Step decay:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / T \rfloor}$$

Cosine annealing:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)$$
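
A direct implementation of the cosine annealing formula (step decay is analogous); most frameworks also ship ready-made schedulers, e.g. PyTorch's `CosineAnnealingLR`:

```python
import math

def cosine_annealing(t, T, eta_min=0.0, eta_max=0.1):
    """eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(t * pi / T))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(t * math.pi / T))

T = 100
for t in (0, 25, 50, 100):
    print(t, round(cosine_annealing(t, T), 4))   # decays from 0.1 down to 0.0
```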

9.2 Data Augmentation

  • Images: rotation, flipping, cropping, color jitter
  • Text: synonym replacement, back-translation, random insertion/deletion
  • Audio: time stretching, pitch shifting, added noise

9.3 Initialization

Xavier initialization (tanh/sigmoid):

$$W \sim U\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$

He initialization (ReLU):

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

(zero mean, variance $2/n_{\text{in}}$, i.e. standard deviation $\sqrt{2/n_{\text{in}}}$)
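
NumPy sketches of both initializers; in PyTorch the built-in equivalents are `nn.init.xavier_uniform_` and `nn.init.kaiming_normal_`:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    """Xavier/Glorot uniform: U(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out)))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out):
    """He/Kaiming normal: zero mean, standard deviation sqrt(2/n_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

print(xavier_uniform(256, 128).std(), he_normal(256, 128).std())
```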

9.4 Gradient Clipping

To prevent exploding gradients:

$$g := \begin{cases} \frac{\theta}{\|g\|} g & \text{if } \|g\| > \theta \\ g & \text{otherwise} \end{cases}$$
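
A sketch of clipping by global norm, matching the formula above; PyTorch's `torch.nn.utils.clip_grad_norm_` does the same thing for a model's parameters:

```python
import numpy as np

def clip_by_global_norm(grads, theta):
    """Rescale the whole gradient list so its global L2 norm is at most theta."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > theta:
        scale = theta / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.ones(10) * 3.0, np.ones(5) * 4.0]
clipped, norm = clip_by_global_norm(grads, theta=1.0)
print(norm, np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ~13.04 -> exactly 1.0
```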

10. Evaluation Metrics

10.1 Classification Metrics

Accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1 score:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
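
A small binary-classification helper computing all four metrics from the confusion-matrix counts (illustrative sketch):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

print(binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```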

10.2 Regression Metrics

Mean squared error (MSE):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Mean absolute error (MAE):

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

R² score:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Summary

This article has reviewed the core knowledge of deep learning, from basic neural networks to the modern Transformer architecture. A suggested review path:

  1. Fundamentals: forward propagation, backpropagation, gradient descent
  2. Optimization: Adam, regularization, Batch Normalization
  3. Architectures: CNN → RNN/LSTM → Attention → Transformer
  4. Practical skills: learning rate scheduling, data augmentation, hyperparameter tuning

Deep learning places equal weight on theory and practice; working through code implementations alongside the math is strongly recommended.
