10-Logistic-Regression


Logistic Regression

binary classification: ideal $f(x)=sign(P(+1|x)-\frac{1}{2})\in\{-1,+1\}$
‘soft’ binary classification: $f(x)=P(+1|x)\in[0,1]$ -> target function
Logistic Hypothesis $h(x)=\theta (w^Tx)$ with $\theta (s)=\frac{1}{1+e^{-s}}$
Logistic regression uses $h(x)=\frac{1}{1+exp(-w^Tx)}$ to approximate the target function $f(x)=P(+1|x)$.
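A minimal sketch of this hypothesis in Python (NumPy assumed; the names `theta` and `h` are illustrative, not from the notes):

```python
import numpy as np

def theta(s):
    """Logistic (sigmoid) function: theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def h(w, x):
    """Logistic hypothesis h(x) = theta(w^T x), an estimate of P(+1|x)."""
    return theta(np.dot(w, x))
```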
Error Function: the output of logistic regression is a probability $P(y|x)$. To define an error measure we introduce the likelihood: we view the data $D$ as being generated by the target function $f(x)$ with a certain probability, namely (for an example $D$ where $y_1=+1$ and $y_2=\dots=y_N=-1$):

$likelihood(f)=p(x_1)f(x_1)\times p(x_2)(1-f(x_2))\times\dots\times p(x_N)(1-f(x_N))$

Since the data was indeed generated by $f$, this probability of producing $D$ should be large. Accordingly, if $h\approx f$, then $likelihood(h)\approx likelihood(f)$, so $likelihood(h)$ should also be large.

$likelihood(h)=p(x_1)h(x_1)\times p(x_2)(1-h(x_2))\times\dots\times p(x_N)(1-h(x_N))$
$g=\mathop{argmax}\limits_h\ likelihood(h)$ with $h(x)=\theta(w^Tx)$ and $1-h(x)=h(-x)$
Ignoring the constant factors $p(x_n)$, we get:
$likelihood(logistic\ h)\propto \prod\limits_{n=1}^Nh(y_nx_n)$
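Concretely, since $1-h(x_n)=h(-x_n)$, each factor with $y_n=-1$ equals $h(-x_n)=h(y_nx_n)$, and each factor with $y_n=+1$ is already $h(x_n)=h(y_nx_n)$; dropping the constants $p(x_n)$ leaves exactly $\prod\limits_{n=1}^Nh(y_nx_n)$.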
Taking the logarithm and substituting $h(x)=\theta(w^Tx)$, the maximization becomes:
$\mathop{max}\limits_w\ ln\prod\limits_{n=1}^{N}\theta(y_nw^Tx_n)$
Adding a minus sign to turn it into a minimization, turning the log of the product into a sum (scaled by $\frac{1}{N}$), and substituting $\theta(s)=\frac{1}{1+e^{-s}}$ gives:
$\mathop{min}\limits_w\frac{1}{N}\sum_{n=1}^Nln(1+exp(-y_nw^Tx_n))$
$err(w,x,y)=ln(1+exp(-yw^Tx))$ -> the cross-entropy error
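A short sketch of the resulting in-sample error in Python (the function name `cross_entropy_error` is hypothetical; `X` is assumed to be the $N\times d$ input matrix with the bias already included as a constant feature, and `y` the $\pm1$ label vector):

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
    scores = y * (X @ w)                          # y_n * w^T x_n for each example
    return np.mean(np.log(1.0 + np.exp(-scores)))
```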


Gradient of Logistic Regression

minimizing $E_{in}(w)=\frac{1}{N}\sum\limits_{n=1}^Nln(1+exp(-y_nw^Tx_n))$
derive $\bigtriangledown E_{in}(w)=\frac{1}{N}\sum\limits_{n=1}^N\theta(-y_nw^Tx_n)(-y_nx_n)$
We would like $\bigtriangledown E_{in}(w)=0$. One way this could happen is if every $\theta(\cdot)$ term were 0, i.e., $y_nw^Tx_n\gg 0$ for all $n$, which requires $D$ to be linearly separable; with noise this is clearly not achievable. In general the equation is nonlinear in $w$ and has no closed-form solution.
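A corresponding sketch of the gradient in Python (the name `gradient` is illustrative; same assumed `X`, `y` layout as above):

```python
import numpy as np

def gradient(w, X, y):
    """grad E_in(w) = (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    scores = y * (X @ w)                        # y_n * w^T x_n
    weights = 1.0 / (1.0 + np.exp(scores))      # theta(-y_n w^T x_n)
    return -(X * (weights * y)[:, None]).mean(axis=0)
```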


Gradient Descent

Similar to the earlier PLA, we look for an iterative optimization scheme: $w_{t+1}\leftarrow w_t+\eta v$ with a direction $v$ of unit length and a positive step size $\eta$.

One possible (greedy) approach: for a given $\eta$, find the unit-length $v$ that minimizes $E_{in}(w_t+\eta v)$ after one update. However, this is no easier than solving for a zero gradient directly.

local approximation by a linear formula makes the problem easier:
$\qquad E_{in}(w_t+\eta v)\approx E_{in}(w_t)+\eta v^T\bigtriangledown E_{in}(w_t)$
if $\eta$ really small (Taylor expansion)

an approximate greedy approach for some given small $\eta$ :
$\qquad \min\limits_{\left | v \right |=1}v^T\bigtriangledown E_{in}(w_t)$
optimal $v$ : the opposite direction of $\bigtriangledown E_{in}(w_t)$, since $v^T\bigtriangledown E_{in}(w_t)\geq -\left | \bigtriangledown E_{in}(w_t) \right |$ for any unit-length $v$ (Cauchy–Schwarz), with equality exactly when $v$ points opposite the gradient:
$\qquad v=-\frac{\bigtriangledown E_{in}(w_t)}{\left | \bigtriangledown E_{in}(w_t) \right |}$
gradient descent: for small $\eta$, $w_{t+1}\leftarrow w_t-\eta \frac{\bigtriangledown E_{in}(w_t)}{\left | \bigtriangledown E_{in}(w_t) \right |}$

(gradient descent : a simple & popular optimization tool.)
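As a sketch, one normalized descent step under these definitions (reusing the hypothetical `gradient` from above; the default `eta` is an arbitrary illustrative value):

```python
import numpy as np

def gd_step_normalized(w, X, y, eta=0.1):
    """One step of w <- w - eta * grad E_in(w) / ||grad E_in(w)||."""
    g = gradient(w, X, y)                 # gradient sketch defined earlier
    return w - eta * g / np.linalg.norm(g)
```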

About $\eta$ : the raw step size had better be monotonic in $\left | \bigtriangledown E_{in}(w_t) \right |$, i.e., learn faster when the gradient is large and more slowly when it is small. Equivalently, we can absorb the factor $\frac{\eta}{\left | \bigtriangledown E_{in}(w_t) \right |}$ into a new fixed learning rate $\eta$ and update with the unnormalized gradient.

Logistic Regression Algorithm
initialize $w_0$
For t=0,1,…
compute $\bigtriangledown E_{in}(w_t)=\frac{1}{N}\sum\limits_{n=1}^N\theta(-y_nw_t^Tx_n)(-y_nx_n)$
update by $\eta$, $w_{t+1}\leftarrow w_t-\eta {\bigtriangledown E_{in}(w_t)}$
…until ${\bigtriangledown E_{in}(w_t)}=0$ or enough iterations
return last $w_{t+1}$ as $g$.
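
Putting the pieces together, a minimal sketch of the whole algorithm with a fixed learning rate (reusing the hypothetical `gradient` from above; `eta`, `max_iter`, and `tol` are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def logistic_regression_gd(X, y, eta=0.1, max_iter=1000, tol=1e-6):
    """Logistic regression by fixed-learning-rate gradient descent (sketch)."""
    w = np.zeros(X.shape[1])               # initialize w_0
    for _ in range(max_iter):              # for t = 0, 1, ...
        g = gradient(w, X, y)              # compute grad E_in(w_t), as sketched earlier
        if np.linalg.norm(g) < tol:        # stop when the gradient is (near) zero
            break
        w = w - eta * g                    # update by eta
    return w                               # return the last w as g
```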