02-Learning-to-Answer-Yes-or-No


Perceptron Hypothesis Set

  • For $x=(x_1,x_2,…,x_d)$ ‘features of customer’, compute a weighted ‘score’ and:
    $\quad \quad \quad$ approve credit if $\quad \sum_{i=1}^{d}{w_ix_i} > threshold$
    $\quad \quad \quad \quad \ $ deny credit if $\quad \sum_{i=1}^{d}{w_ix_i} < threshold$
  • $y$: {+1(good), -1(bad)}; the linear formulas $h \in H$ are:
    $\quad \quad \quad$ $h(x)=sign((\sum_{i=1}^{d}{w_ix_i})-threshold)$
  • Vector form, with the shorthand $w_0=-threshold$ and constant coordinate $x_0=+1$:
    $
    \begin{eqnarray}
    h(x) &=& sign((\sum_{i=1}^{d}{w_ix_i})-threshold)\\
    &=&sign((\sum_{i=1}^{d}{w_ix_i})+(-threshold)·(+1))\\
    &=&sign(\sum_{i=0}^{d}{w_ix_i})\\
    &=&sign(w^Tx)
    \end{eqnarray}
    $
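A minimal sketch of the vector form in Python (the helper name `perceptron_h`, the weight vector, and the sample features are all made-up values for illustration):

```python
import numpy as np

def perceptron_h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x), with x_0 = +1 prepended
    so that w_0 plays the role of -threshold."""
    x = np.concatenate(([1.0], x))          # add the constant coordinate x_0 = +1
    return 1 if np.dot(w, x) > 0 else -1    # sign(w^T x)

# hypothetical example: threshold 0.5, two customer features
w = np.array([-0.5, 0.3, 0.8])              # (w_0 = -threshold, w_1, w_2)
print(perceptron_h(w, np.array([0.9, 0.4])))  # score 0.3*0.9 + 0.8*0.4 - 0.5 = 0.09 > 0, so +1
```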

Perceptrons in $R^2$: $h(x)=sign(w_0+w_1x_1+w_2x_2)$

  • customer features $x$: points on the plane (or points in $R^d$)
  • labels $y$: ○(+1), ×(-1)
  • hypotheses $h$: lines (or hyperplanes in $R^d$)
  • different lines classify customers differently (see the sketch below)
    perceptrons $\Leftrightarrow$ linear (binary) classifiers
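A tiny sketch of the last point, with two made-up weight vectors: the same customer lands on opposite sides of two different lines.

```python
import numpy as np

def h(w, x1, x2):
    # h(x) = sign(w_0 + w_1*x1 + w_2*x2)
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else -1

customer = (0.5, 0.5)                   # a made-up point in R^2
line_a = np.array([-0.2, 1.0, -1.0])    # boundary x1 - x2 = 0.2
line_b = np.array([-0.8, 1.0, 1.0])     # boundary x1 + x2 = 0.8

print(h(line_a, *customer), h(line_b, *customer))  # -1 1: same customer, different verdicts
```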

Perceptron Learning Algorithm

Select $g$ from $H$, where $H$=all possible perceptrons.

  • want: $g\approx f$ on $D$, ideally $g(x_n)=f(x_n)=y_n$
  • difficult: $H$ is of infinite size
  • idea: start from some $g_0$, and ‘correct’ its mistakes on $D$

In PLA, we can represent $g_0$ by its weight vector $w_0$.

For t = 0,1,…
$\quad$find the next mistake of $w_t$ called $(x_{n(t)}, y_{n(t)})$
$\quad\quad\quad\quad sign(w_t^Tx_{n(t)})\neq y_{n(t)}$
$\quad$(try to) correct the mistake by
$\quad\quad\quad\quad w_{t+1}\leftarrow w_t+y_{n(t)}x_{n(t)}$
… until a full cycle through the data encounters no mistakes
return last $w$ (called $w_{PLA}$) as $g$
(the ‘next’ mistake can be found by following a naive cycle $(1,…,N)$ or a precomputed random cycle)
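A minimal runnable sketch of cyclic PLA in Python (the function name `pla` and the toy dataset are assumptions for illustration; `X` already carries $x_0=+1$ as its first column):

```python
import numpy as np

def pla(X, y, max_cycles=1000):
    """Cyclic PLA. X: (N, d+1) with x_0 = 1 in column 0; y: (N,) of +/-1.
    Returns w_PLA once a full cycle finds no mistakes."""
    N, dim = X.shape
    w = np.zeros(dim)                       # start from w_0 = 0
    for _ in range(max_cycles):
        mistake_found = False
        for n in range(N):                  # naive cycle 1, ..., N
            if np.sign(X[n] @ w) != y[n]:   # sign(w_t^T x_n) != y_n: a mistake
                w = w + y[n] * X[n]         # correct it: w <- w + y_n x_n
                mistake_found = True
        if not mistake_found:               # full clean cycle: halt
            return w
    return w                                # safety cap in case D is not separable

# toy linearly separable data: y = sign(x1 - x2)
X = np.array([[1, 2.0, 1.0], [1, 0.5, 1.5], [1, 3.0, 0.5], [1, 0.2, 2.0]])
y = np.array([1, -1, 1, -1])
w = pla(X, y)
print(all(np.sign(X @ w) == y))  # True on this separable toy set
```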


Guarantee of PLA

If $D$ is linearly separable, then after ‘enough’ corrections, any PLA variant halts.

PLA Fact: $w_t$ Gets More Aligned with $w_f$

linearly separable $D\ \Leftrightarrow$ exists a perfect $w_f$ such that $y_n=sign(w_f^Tx_n)$

  • $w_f$ perfect hence every $x_n$ correctly away from line:
    $\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad y_{n(t)}w_f^Tx_{n(t)}\geqslant \mathop{min}\limits_ny_nw_f^Tx_n>0$
  • $w_f^Tw_t\uparrow$ by updating with any $(x_{n(t)},y_{n(t)})$
    $
    \begin{eqnarray}
    w_f^Tw_{t+1}&=&w_f^T(w_t+y_{n(t)}x_{n(t)})\\
    &\geq&w_f^Tw_t+\mathop{min}\limits_ny_nw_f^Tx_n\\
    &>&w_f^Tw_t+0
    \end{eqnarray}
    $

$w_t$ appears more aligned with $w_f$ after update.

PLA Fact: $w_t$ Does not Grow too Fast

$w_t$ changed only when mistake $\Leftrightarrow sign(w_t^Tx_{n(t)})\neq y_{n(t)}\Leftrightarrow y_{n(t)}w_t^Tx_{n(t)}\leqslant 0$

  • mistake ‘limits’ $|w_t|^2$ growth, even when updating with ‘longest’ $x_n$
    $
    \begin{eqnarray}
    |w_{t+1}|^2&=&|w_t+y_{n(t)}x_{n(t)}|^2\\
    &=&|w_t|^2+2y_{n(t)}w_t^Tx_{n(t)}+|y_{n(t)}x_{n(t)}|^2\\
    &\leq&|w_t|^2+0+|y_{n(t)}x_{n(t)}|^2\\
    &\leq&|w_t|^2+\mathop{max}\limits_n|y_nx_n|^2
    \end{eqnarray}
    $

Start from $w_0=0$; after $T$ mistake corrections:
$\frac {w_f^T}{|w_f|} \frac {w_T}{|w_T|} \geq \sqrt{T}·constant$
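To see where the $\sqrt{T}·constant$ comes from (and why PLA must halt), combine the two facts above. Writing $R^2=\mathop{max}\limits_n|x_n|^2$ and $\rho=\mathop{min}\limits_n \frac{y_nw_f^Tx_n}{|w_f|}$ (names introduced here just for this sketch; note $|y_nx_n|^2=|x_n|^2$ since $y_n=\pm 1$):
$
\begin{eqnarray}
w_f^Tw_T &\geq& T·\mathop{min}\limits_n y_nw_f^Tx_n = T\rho|w_f|\\
|w_T|^2 &\leq& T·\mathop{max}\limits_n |x_n|^2 = TR^2\\
\frac{w_f^Tw_T}{|w_f||w_T|} &\geq& \frac{T\rho|w_f|}{|w_f|\sqrt{T}R} = \sqrt{T}·\frac{\rho}{R}
\end{eqnarray}
$
Since the left side is a cosine and hence at most $1$, we get $\sqrt{T}·\rho/R\leq 1$, i.e. $T\leq R^2/\rho^2$: PLA halts after at most $R^2/\rho^2$ corrections.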
PLA tells us that as long as $D$ is linearly separable and we keep correcting mistakes, PLA will halt. PLA is simple to implement and works in any dimension $d$.
The problem is that we do not know in advance whether $D$ is linearly separable, and we cannot be sure how long it will take to halt (though in practice it is usually fast).


Non-Separable Data (Learning with Noisy Data)

  • assume ‘little’ noise: $y_n=f(x_n)$ usually
  • if so, $g\approx f$ on $D\Leftrightarrow y_n=g(x_n)$ usually
  • how about
    $w_g\leftarrow \mathop{argmin}\limits_w \sum_{n=1}^N [y_n\neq sign(w^Tx_n)]$
    NP-hard to solve, unfortunately.

Pocket Algorithm: modify PLA by additionally keeping the best weights seen so far ‘in the pocket’
initialize pocket weights $\hat{w}$
For t=0,1,…
$\quad\quad$find a (random) mistake of $w_t$ called $(x_{n(t)}, y_{n(t)})$
$\quad\quad$(try to) correct the mistake by
$\quad\quad\quad\quad w_{t+1}\leftarrow w_t+y_{n(t)}x_{n(t)}$
$\quad\quad$if $w_{t+1}$ makes fewer mistakes than $\hat{w}$, replace $\hat{w}$ by $w_{t+1}$
…until enough iterations
return $\hat{w}$ (called $w_{POCKET}$) as $g$
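A minimal sketch of the pocket algorithm under the same conventions as the PLA sketch above (the helper `count_mistakes` and the function name `pocket` are made up for illustration):

```python
import numpy as np

def count_mistakes(w, X, y):
    """In-sample 0/1 error count: number of n with y_n != sign(w^T x_n)."""
    return int(np.sum(np.sign(X @ w) != y))

def pocket(X, y, max_updates=100, rng=None):
    """Pocket algorithm: PLA-style corrections on random mistakes,
    keeping the best-so-far weights w_hat in the 'pocket'."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    w_hat, best = w.copy(), count_mistakes(w, X, y)
    for _ in range(max_updates):                     # ...until enough iterations
        wrong = np.flatnonzero(np.sign(X @ w) != y)  # indices of current mistakes
        if wrong.size == 0:
            return w                                 # no mistakes left: w is perfect
        n = rng.choice(wrong)                        # a (random) mistake of w_t
        w = w + y[n] * X[n]                          # (try to) correct it
        m = count_mistakes(w, X, y)
        if m < best:                                 # fewer mistakes than the pocket?
            w_hat, best = w.copy(), m                # replace the pocket weights
    return w_hat                                     # w_POCKET as g
```

Note the extra cost relative to plain PLA: every update re-evaluates the error on all of $D$ to decide whether to swap the pocket weights, which is why the pocket algorithm runs slower than PLA on separable data.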