In statistics and related fields, a similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. For example, in the context of cluster analysis, Frey and Dueck suggest defining a similarity measure $s(x,y)=-\|x-y\|_2^2$, i.e., the negative of the squared Euclidean distance.
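A minimal sketch of this measure in NumPy (the function name `similarity` is my own):

```python
import numpy as np

def similarity(x, y):
    """Frey & Dueck-style similarity: negative squared Euclidean distance.

    Close to 0 for similar points, strongly negative for dissimilar ones.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return -np.sum((x - y) ** 2)

print(similarity([1.0, 2.0], [1.0, 2.0]))  # 0.0   (identical -> maximal)
print(similarity([1.0, 2.0], [4.0, 6.0]))  # -25.0 (farther -> more negative)
```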
16-Three-Learning-Principles
Occam’s Razor
The simplest model that fits the data is also the most plausible.
Simple Model
simple hypothesis $h$: small $\Omega(h)$, specified by few parameters.
simple model $H$: small $\Omega(H)$, contains a small number of hypotheses.
small $\Omega(h)$ $\Leftarrow$ small $\Omega(H)$: a model containing only $|H|=2^\ell$ hypotheses needs just $\ell$ bits to specify any one of them, so a simple model forces each hypothesis to be simple.
simple: small hypothesis/model complexity.
15-Validation
Model Selection Problem
There are many possible models to learn, even just for binary classification (a count of this grid is sketched after the list):
$$A \in \{\text{PLA},\ \text{pocket},\ \text{linear regression},\ \text{logistic regression}\}\ \times$$
$$T \in \{100,\ 1000,\ 10000\}\ \times$$
$$\eta \in \{1,\ 0.01,\ 0.0001\}\ \times$$
$$\Phi \in \{\text{linear},\ \text{quadratic},\ \text{poly-10},\ \text{Legendre-poly-10}\}\ \times$$
$$\Omega(w) \in \{\text{L2 regularizer},\ \text{L1 regularizer},\ \text{symmetry regularizer}\}\ \times$$
$$\lambda \in \{0,\ 0.01,\ 1\}$$
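A quick way to appreciate the size of this search space is to enumerate it. A minimal sketch (the lists below just mirror the sets above; the variable names are my own):

```python
from itertools import product

algorithms = ["PLA", "pocket", "linear regression", "logistic regression"]
iterations = [100, 1000, 10000]          # T
learning_rates = [1, 0.01, 0.0001]       # eta
transforms = ["linear", "quadratic", "poly-10", "Legendre-poly-10"]  # Phi
regularizers = ["L2", "L1", "symmetry"]  # Omega(w)
lambdas = [0, 0.01, 1]

grid = list(product(algorithms, iterations, learning_rates,
                    transforms, regularizers, lambdas))
print(len(grid))  # 4 * 3 * 3 * 4 * 3 * 3 = 1296 candidate models
```

Selecting among these 1296 combinations by $E_{in}$ alone would amount to fitting the choice itself to the training data, which is why validation is needed.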
Adrein-Movies
14-Regularization
Regularization Hypothesis Set
idea: ‘step back’ from $H_{10}$ to $H_2$
E.g.
hypothesis $w$ in $H_{10}$: $w_0+w_1x+w_2x^2+\dots+w_{10}x^{10}$
hypothesis $w$ in $H_2$: $w_0+w_1x+w_2x^2$
that is, $H_2=H_{10}$ AND ‘the constraint that $w_3=w_4=\dots=w_{10}=0$’ (see the sketch below).
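A minimal NumPy sketch of this ‘step back’ (my own toy data; the constraint $w_3=\dots=w_{10}=0$ is implemented by simply dropping the columns for $x^3,\dots,x^{10}$ from the least-squares fit):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 15))
y = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)  # noisy toy target

def poly_features(x, degree):
    """Feature matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

# Hypothesis in H10: all eleven weights w0..w10 are free.
w10, *_ = np.linalg.lstsq(poly_features(x, 10), y, rcond=None)

# Hypothesis in H2 = H10 + constraint w3 = ... = w10 = 0:
# equivalent to fitting only the first three columns.
w2, *_ = np.linalg.lstsq(poly_features(x, 2), y, rcond=None)

print(len(w10), len(w2))  # 11 free parameters vs. 3
```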
Regular-Expression-Matching
Question
Implement regular expression matching with support for ‘.’ and ‘*’.
‘.’ matches any single character.
‘*’ matches zero or more of the preceding element. The matching should cover the entire input string (not partial).
The function prototype should be:
bool isMatch(const char *s, const char *p)
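A sketch of one standard approach, top-down recursion with memoization, written here in Python rather than against the C prototype (the recursion carries over directly):

```python
from functools import lru_cache

def is_match(s: str, p: str) -> bool:
    """Match the ENTIRE string s against pattern p ('.' and '*' supported)."""

    @lru_cache(maxsize=None)
    def match(i: int, j: int) -> bool:
        # Pattern exhausted: success only if the string is exhausted too.
        if j == len(p):
            return i == len(s)
        # Does the current pattern char match s[i]?
        first = i < len(s) and p[j] in (s[i], '.')
        # If a '*' follows, either skip "p[j]*" entirely or consume one char.
        if j + 1 < len(p) and p[j + 1] == '*':
            return match(i, j + 2) or (first and match(i + 1, j))
        return first and match(i + 1, j + 1)

    return match(0, 0)

assert is_match("aab", "c*a*b")                 # 'c*' matches zero c's
assert not is_match("mississippi", "mis*is*p*.")
```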
13-Hazard-of-Overfitting
Bad Generalization & Overfitting
Bad Generalization: low $E_{in}$ and high $E_{out}$.
Overfitting: $E_{in}$ getting ever lower while $E_{out}$ gets ever higher, i.e. fitting the data more than is warranted.
Causes of overfitting: excessive $d_{vc}$, noise, and limited data size $N$.
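A toy illustration of the hazard, under my own assumptions (quadratic target, Gaussian noise, small $N$): the degree-10 fit typically drives $E_{in}$ far below the degree-2 fit while its $E_{out}$ blows up.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = x**2 + rng.normal(0, 0.3, n)   # quadratic target plus noise
    return x, y

x_tr, y_tr = make_data(15)             # limited data size N
x_te, y_te = make_data(1000)           # large held-out set as a proxy for E_out

for degree in (2, 10):                 # modest vs. excessive complexity
    w = np.polyfit(x_tr, y_tr, degree)
    e_in = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: E_in = {e_in:.4f}, E_out = {e_out:.4f}")
```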
12-Nonlinear-Transformation
Quadratic Hypotheses
Sometimes, on certain datasets $D$, every line has a large $E_{in}$: the linear hypothesis set simply lacks the expressive power. Yet the data may still be circular separable, e.g. by the hypothesis $h(x)=\text{sign}(-x_1^2-x_2^2+0.6)$. How can we apply our linear learning algorithms to such circle-separable hypotheses?
The trick is to treat terms such as $x_1^2$ as new variables in their own right, say $z_1=x_1^2$, which maps the problem back to the familiar linear setting.
$\{(x_n,y_n)\}\ \text{circular separable}\ \Rightarrow\ \{(z_n,y_n)\}\ \text{linear separable}$
$x\in X\ \overset{\Phi}{\longmapsto}\ z\in Z$, with nonlinear feature transform $\Phi$
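A minimal sketch of this transform (my own toy data; for the circle above, $\Phi(x)=(1,x_1^2,x_2^2)$ already suffices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (200, 2))
y = np.sign(-X[:, 0]**2 - X[:, 1]**2 + 0.6)   # circular-separable labels

def phi(X):
    """Nonlinear feature transform: (x1, x2) -> (1, x1^2, x2^2)."""
    return np.column_stack([np.ones(len(X)), X[:, 0]**2, X[:, 1]**2])

Z = phi(X)
w = np.array([0.6, -1.0, -1.0])     # a separating weight vector in Z-space
print(np.all(np.sign(Z @ w) == y))  # True: linearly separable in Z
```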
11-Linear-Models-for-Classification
Linear Models for Binary Classification
Visualizing Error Functions ($s=w^Tx\ ;\ y\in\{-1,+1\}$; a numeric comparison follows the list)
- linear classification:
$\qquad h(x)=\text{sign}(s)\ ;\ err(h,x,y)=[h(x)\neq y]$
$\qquad err_{0/1}(s,y)=[\text{sign}(s)\neq y]=[\text{sign}(ys)\neq 1]$
- linear regression:
$\qquad h(x)=s\ ;\ err(h,x,y)=(h(x)-y)^2$
$\qquad err_{SQR}(s,y)=(s-y)^2=(ys-1)^2$
- logistic regression:
$\qquad h(x)=\theta(s)\ ;\ err(h,x,y)=-\ln h(yx)$
$\qquad err_{CE}(s,y)=\ln(1+\exp(-ys))$
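A small numeric sketch of the three error functions as functions of $ys$ (names are my own):

```python
import numpy as np

ys = np.linspace(-3, 3, 7)                    # ys = y * w^T x

err_01 = (np.sign(ys) != 1).astype(float)     # [sign(ys) != 1]
err_sqr = (ys - 1) ** 2                       # (ys - 1)^2
err_ce = np.log(1 + np.exp(-ys))              # ln(1 + exp(-ys))

for row in zip(ys, err_01, err_sqr, err_ce):
    print("ys = %5.1f   0/1 = %.0f   SQR = %5.2f   CE = %5.3f" % row)
```

Scaled to base 2, $\log_2(1+\exp(-ys))$ upper-bounds $err_{0/1}$, which is what justifies minimizing the cross-entropy error as a surrogate for classification error.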
10-Logistic-Regression
Logistic Regression
binary classification: ideal $f(x)=\text{sign}\left(P(+1|x)-\frac{1}{2}\right)\in\{-1,+1\}$
‘soft’ binary classification: $f(x)=P(+1|x)\in[0,1]$, which is the target function here
Logistic Hypothesis $h(x)=\theta (w^Tx)$ with $\theta (s)=\frac{1}{1+e^{-s}}$
Logistic regression uses $h(x)=\frac{1}{1+\exp(-w^Tx)}$ to approximate the target function $f(x)=P(+1|x)$.
Error Function
The output of logistic regression is a probability $P(y|x)$, so to define an error measure we introduce the likelihood: we view the data $D$ as generated by the target function $f(x)$ with probability
$likelihood(f)=p(x_1)f(x_1)\times p(x_2)(1-f(x_2))\times\dots\times p(x_N)(1-f(x_N))$
(for concreteness this assumes labels $y_1=+1$ and $y_2=\dots=y_N=-1$).
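A minimal sketch connecting this likelihood to the cross-entropy error $err_{CE}$ above (my own toy data; uses the symmetry $1-\theta(s)=\theta(-s)$, so maximizing the likelihood over $h(x)=\theta(w^Tx)$ is minimizing $\frac{1}{N}\sum_n \ln(1+\exp(-y_n w^T x_n))$):

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

def log_likelihood(w, X, y):
    """ln prod_n h(y_n x_n), using 1 - theta(s) = theta(-s)."""
    return np.sum(np.log(theta(y * (X @ w))))

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.log(1 + np.exp(-y * (X @ w))))

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, -1, 1, -1])
w = rng.normal(size=3)

# The two quantities agree up to sign and a 1/N scaling:
print(np.isclose(-log_likelihood(w, X, y) / len(y),
                 cross_entropy_error(w, X, y)))  # True
```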