Coursera Machine Learning Notes (Andrew Ng)

https://www.bilibili.com/video/BV164411S78V

Linear Regression and Gradient Descent

Notation

\(m\) = number of training examples, \(n\) = number of features, \(x\) = input variables / features, \(y\) = output variable / target variable

\((x, y)\) = one training example; the \(i\)-th example is \((x^{(i)},y^{(i)})\)

\(h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n\)

Let \(x_0 = 1\); then \(h_\theta(x) = \sum_{i=0}^{n}\theta_ix_i=\theta^T x\)

\(Minimize_{\theta}\ \ J(\theta) = \frac{1}{2m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\)

(least-squares linear regression)

Initialize \(\theta = \boldsymbol{0}\). Note that \(\theta, x, y\) are all vectors.

Batch Gradient Descent (uses all examples per update, loop until convergence; complexity \(O(knm)\) for \(k\) iterations):

\(\theta_i := \theta_i - \alpha\frac{\partial}{\partial\theta_i}J(\theta) = \theta_i - \frac{\alpha}{m} \sum_{j = 1}^m (h_\theta(x^{(j)}) - y^{(j)})x_i^{(j)}\)
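
A minimal NumPy sketch of this batch update, assuming a design matrix `X` whose first column is all ones (\(x_0 = 1\)); the names `X`, `y`, `alpha`, `num_iters` are illustrative:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """X: (m, n+1) design matrix with x_0 = 1; y: (m,) targets."""
    m, n = X.shape
    theta = np.zeros(n)                      # initialize theta = 0
    for _ in range(num_iters):               # k iterations -> O(k n m)
        grad = X.T @ (X @ theta - y) / m     # (1/m) * sum_j (h(x^(j)) - y^(j)) x^(j)
        theta -= alpha * grad                # simultaneous update of every theta_i
    return theta
```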

Stochastic Gradient Descent (each step uses only a single pair \((x, y)\)):

For \(j := 1\) to \(m\): \(\theta_i := \theta_i - \alpha(h_\theta(x^{(j)})-y^{(j)})x_i^{(j)}\) (for all \(i\))
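
The same rule one example at a time, as a sketch under the same assumptions; shuffling each pass is my addition, not stated in the notes:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=10):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_passes):
        for j in np.random.permutation(m):        # visit examples in random order
            err = X[j] @ theta - y[j]             # h_theta(x^(j)) - y^(j)
            theta -= alpha * err * X[j]           # update all theta_i from one example
    return theta
```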

Normal equation (complexity roughly \(O(n^3)\) for the inverse, plus \(O(mn^2)\) to form \(X^TX\)): \(\theta = (X^TX)^{-1}X^Ty\)
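
Solving the normal equation in code is a one-liner; using `np.linalg.solve` instead of an explicit inverse is a standard numerical choice, not something from the notes:

```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y, solved as a linear system rather than inverting
    return np.linalg.solve(X.T @ X, X.T @ y)
```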

Feature scaling: \(x_i := \frac{x_i - \mu_i}{s_i}\) (\(\mu_i\) is the mean of feature \(x_i\), \(s_i\) is its range or standard deviation)
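
A sketch of mean normalization using the standard deviation as \(s_i\), applied column-wise to a feature matrix (the all-ones bias column, if present, should be excluded):

```python
import numpy as np

def feature_scale(X):
    mu = X.mean(axis=0)            # mu_i: mean of each feature
    s = X.std(axis=0)              # s_i: standard deviation (np.ptp would give the range)
    return (X - mu) / s, mu, s     # keep mu, s to scale new inputs identically
```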

Logistic Regression

Binary classification

\(h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}\). If the original squared-error cost function were kept, \(J(\theta)\) would be non-convex and gradient descent could converge to a local rather than global minimum.

Original cost function: \(Cost(h_\theta(x), y) = \frac{1}{2} (h_\theta(x)-y)^2\)

Redefine the cost function:

\(Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & if\ y = 1 \\ -\log(1 - h_\theta(x)) & if\ y = 0 \end{cases}\)

Substituting gives \(J(\theta) = \frac{1}{m} \sum_{i = 1}^m Cost(h_\theta(x^{(i)}),y^{(i)}) = -\frac{1}{m}\sum_{i = 1}^m [y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))]\)

\(\theta_i := \theta_i - \frac{\alpha}{m} \sum_{j = 1}^m (h_\theta(x^{(j)}) - y^{(j)})x_i^{(j)}\)

(identical in form to the linear-regression update)
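
A sketch of the logistic-regression cost and one gradient step matching the formulas above; the small `eps` inside the logs guards against \(\log 0\) and is my addition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def gradient_step(theta, X, y, alpha=0.1):
    m = len(y)
    h = sigmoid(X @ theta)
    return theta - (alpha / m) * (X.T @ (h - y))   # same form as the linear-regression step
```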

Multi-class classification (one-vs-all): fit one classifier per class, \(h_\theta^{(i)}(x) = P(y = i|x; \theta)\ \ \ (i = 1, 2, \cdots)\)

For each input \(x\), predict the class \(i\) that maximizes \(h_\theta^{(i)}(x)\).
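
A one-vs-all sketch, assuming labels \(0, 1, \cdots, K-1\): train one logistic classifier per class and predict with the largest \(h_\theta^{(i)}(x)\); all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.1, num_iters=500):
    m, n = X.shape
    Theta = np.zeros((num_classes, n))               # one theta vector per class
    for c in range(num_classes):
        yc = (y == c).astype(float)                  # relabel: class c vs. the rest
        theta = np.zeros(n)
        for _ in range(num_iters):
            h = sigmoid(X @ theta)
            theta -= (alpha / m) * (X.T @ (h - yc))  # ordinary logistic-regression step
        Theta[c] = theta
    return Theta

def predict_one_vs_all(Theta, X):
    return np.argmax(sigmoid(X @ Theta.T), axis=1)   # argmax_i h_theta^(i)(x)
```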

Regularization: to avoid overfitting, for certain high-order coefficients \(\theta_i\), add a term such as \(1000\,\theta_i^2\) to \(J(\theta)\) so that the coefficient is pushed close to zero, effectively removing that term (\(\theta_0\) is obviously not penalized).

\(J(\theta) = \frac{1}{2m} [\sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j = 1}^n \theta_j^2]\)

\(\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\)

\(\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \frac{\alpha}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)

\(\theta = \Big(X^TX + \lambda \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}\Big)^{-1}X^Ty\) (the diagonal matrix is \((n+1)\times(n+1)\))

Even when \(m \leqslant n\) (so \(X^TX\) itself may be singular), this matrix is invertible as long as \(\lambda > 0\).
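
The regularized normal equation as a sketch; the zero in the top-left corner keeps \(\theta_0\) unpenalized, as noted above:

```python
import numpy as np

def regularized_normal_equation(X, y, lam=1.0):
    L = np.eye(X.shape[1])          # (n+1) x (n+1) identity
    L[0, 0] = 0.0                   # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```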

Neural Networks

\(L\): number of layers in the network; \(s_l\): number of units in layer \(l\) (not counting the bias unit)

Binary classification: \(y = 0\ \text{or}\ 1\), \(s_L = K = 1\)

Multi-class classification: \(y \in \mathbb{R}^K\), \(s_L = K\)

\(J(\theta) = -\frac{1}{m}\sum_{i = 1}^m \sum_{k = 1}^K [y_k^{(i)} \log(h_\theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log(1 - (h_\theta(x^{(i)}))_k)] + \frac{\lambda}{2m}\sum_{l = 1}^{L - 1}\sum_{i = 1}^{s_l}\sum_{j = 1}^{s_{l+1}}(\theta_{ji}^{(l)})^2\)

\(a_j^{(l)}\): activation of node \(j\) in layer \(l\). \(z^{(l+1)} = \theta^{(l)} a^{(l)}\), \(a^{(l+1)} = g(z^{(l+1)})\), where here \(g(z) = \frac{1}{1 + e^{-z}}\)

\(cost(i) = y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)}))\)

\(\delta_j^{(l)}\): the error term of \(a_j^{(l)}\), \(\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}cost(i)\ (j \geqslant 0)\)

A somewhat involved derivation gives \(\delta^{(l)} = (\theta^{(l)})^T \delta^{(l+1)} \odot g'(z^{(l)})\), where \(g'(z^{(l)}) = a^{(l)} \odot (1 - a^{(l)})\) (\(\odot\) denotes the element-wise product), and the accumulators are updated as \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}\).

For each example \((x^{(i)}, y^{(i)})\): run forward propagation to get the activations \(a\), compute \(\delta^{(L)}\) at the output layer, then run back propagation to get \(\delta^{(L-1)}, \cdots, \delta^{(2)}\) and accumulate \(\Delta^{(1)}, \cdots, \Delta^{(L-1)}\); finally the partial derivatives of the cost function are:

\(D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m} \theta_{ij}^{(l)}\ \ \ if\ j \neq 0\)

\(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}\ \ \ if\ j = 0\)
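
A sketch of one full pass of forward and back propagation for a network with a single hidden layer (\(L = 3\)), following the steps above; `Theta1` and `Theta2` include the bias column and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Theta1, Theta2, X, Y, lam=1.0):
    """Theta1: (s2, s1+1), Theta2: (K, s2+1), X: (m, s1), Y: (m, K) one-hot labels."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # forward propagation
        a1 = np.concatenate(([1.0], X[i]))                 # add bias unit
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)                          # h_theta(x^(i))
        # output-layer error, then back propagation
        delta3 = a3 - Y[i]
        delta2 = (Theta2[:, 1:].T @ delta3) * a2[1:] * (1 - a2[1:])
        # accumulate Delta^(l) += delta^(l+1) outer a^(l)
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)
    # gradients D, leaving the bias column (j = 0) unregularized
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```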

Gradient checking: \(\frac{\partial}{\partial \theta_i}J(\theta) \approx \frac{J(\theta_1, \cdots, \theta_i+\epsilon, \cdots, \theta_n) - J(\theta_1, \cdots, \theta_i - \epsilon, \cdots, \theta_n)}{2\epsilon}\)
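
Numerical gradient checking as a sketch: perturb each parameter by \(\pm\epsilon\) and compare against the analytic (backprop) gradient; `J` stands for any cost function of a flat parameter vector:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)   # two-sided difference
    return grad

# usage: check that numerical_gradient(J, theta) is close to the backprop gradient,
# then turn the check off for actual training (it is far too slow to run every step).
```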

Random initialization: each \(\theta_{ij}^{(l)}\) is drawn uniformly at random from \([-\epsilon, \epsilon]\).

Machine learning diagnostics:

0/1 misclassification error

\(err(h_\theta(x), y) = \begin{cases} 1 & \text{if } h_\theta(x) \geqslant 0.5,\ y = 0\ \text{ or } h_\theta(x) < 0.5,\ y = 1 \\ 0 & \text{otherwise} \end{cases}\)

\(Test\ error = \frac{1}{m_{test}} \sum_{i = 1}^{m_{test}} err(h_\theta(x_{test}^{(i)}), y_{test}^{(i)})\)

Training set 60%, cross-validation set 20%, test set 20%

Choose a polynomial degree \(d\) with low error and good generalization as the final fit.

First minimize \(J_{train}(\theta)\) for each candidate degree, then pick the degree whose parameters \(\theta^{(d)}\) give the smallest \(J_{cv}(\theta)\), and finally measure its generalization error on the test set.

High bias: underfitting. Both \(J_{train}(\theta)\) and \(J_{cv}(\theta)\) are high.

High variance: overfitting. \(J_{train}(\theta)\) is low while \(J_{cv}(\theta)\) is high.

The same procedure can be used to choose the regularization parameter \(\lambda\).

Evaluation metrics for skewed classes:

Precision: \(\frac{\text{True positives}}{\#\text{ predicted positives}} = \frac{\text{True pos}}{\text{True pos + False pos}}\)

Recall: \(\frac{\text{True positives}}{\#\text{ actual positives}} = \frac{\text{True pos}}{\text{True pos + False neg}}\)

Common single-number metric: \(F_1\ \text{Score} = \frac{2PR}{P + R}\)
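
Computing precision, recall and \(F_1\) from 0/1 label vectors, as a sketch (names illustrative; empty denominators are not handled):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```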

Support Vector Machine (SVM)

Logistic regression:

\(J(\theta) = \frac{1}{m}[\sum_{i = 1}^m y^{(i)}(-\log h_\theta(x^{(i)})) + (1 - y^{(i)})(-\log (1 - h_\theta(x^{(i)})))] + \frac{\lambda}{2m}\sum_{j = 1}^n \theta_j^2\)

SVM:

\(\min_\theta C\sum_{i = 1}^m[y^{(i)}cost_1(\theta^Tx^{(i)}) + (1 - y^{(i)})cost_0(\theta^Tx^{(i)})] + \frac{1}{2}\sum_{j = 1}^n \theta_j^2\)

To predict \(y = 1\) we want \(\theta^Tx^{(i)} \geqslant 1\); to predict \(y = 0\) we want \(\theta^Tx^{(i)} \leqslant -1\).

Kernel: \(f_i = \text{similarity}(x, l^{(i)}) = \exp(-\frac{\parallel x - l^{(i)}\parallel^2}{2\sigma^2})\)

Training: \(\min_\theta C\sum_{i = 1}^m [y^{(i)}cost_1(\theta^T f^{(i)}) + (1 - y^{(i)})cost_0(\theta^Tf^{(i)})] + \frac{1}{2}\sum_{j = 1}^n \theta_j^2\), where \(n = m\) (one feature per landmark, with each training example used as a landmark).
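
A sketch of the Gaussian-kernel feature mapping with every training example used as a landmark; training is delegated to scikit-learn's `SVC`, which is my substitution for the solver left unspecified in the notes (its RBF kernel matches the similarity above with \(\gamma = 1/(2\sigma^2)\)):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_features(X, landmarks, sigma=1.0):
    # f_i = exp(-||x - l^(i)||^2 / (2 sigma^2)), one feature per landmark
    diff = X[:, None, :] - landmarks[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))

def train_gaussian_svm(X_train, y_train, C=1.0, sigma=1.0):
    # scikit-learn's rbf kernel is the same similarity with gamma = 1 / (2 sigma^2)
    return SVC(C=C, kernel="rbf", gamma=1.0 / (2 * sigma ** 2)).fit(X_train, y_train)
```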