
Summary-MLT-3

Published on 2018-07-05

Convex Optimization

Basic concepts: convex sets, convex functions, concave functions, first-order and second-order conditions for convexity, operations that preserve convexity (nonnegative weighted sums, composition with affine functions, pointwise maximum or supremum), and dual problems.

Standard form of a convex optimization problem:
\begin{align}
\min_x\qquad & f_0(x)\\
s.t.\qquad & f_i(x)\leq0,\quad i=1,…,n\\
& a_i^Tx=b_i,\quad i=1,…,p
\end{align}
where $f_0,f_1,…,f_n$ are convex functions. Any local optimum of a convex optimization problem is also a global optimum.

Optimality condition for convex optimization:
$$\nabla f_0(x^*)^T(y-x^*)\geq0,\quad\forall y\in\cal X$$
For an unconstrained convex optimization problem, this condition reduces to $\nabla f_0(x^*)=0$.

The convergence rate of an optimization algorithm is closely tied to the properties of the problem. For general convex problems, gradient descent achieves an $O(1/\sqrt T)$ convergence rate; when the function is strongly convex, a variant of gradient descent (epoch-GD) achieves $O(1/T)$; for smooth functions, Nesterov's accelerated gradient descent achieves $O(1/T^2)$; and when the function is both strongly convex and smooth, gradient descent achieves a linear convergence rate.
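As a minimal illustration of the plain gradient-descent case, here is a sketch on a smooth, strongly convex quadratic; the matrix, step size, and iteration count are arbitrary illustrative choices, not from the notes.

```python
import numpy as np

# Gradient descent on f(x) = 0.5 * x^T A x - b^T x, whose minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (illustrative)
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b                      # gradient of f at x

x = np.zeros(2)
eta = 1.0 / np.linalg.eigvalsh(A).max()   # step size 1/L for an L-smooth objective
for t in range(200):
    x = x - eta * grad(x)                 # plain gradient step

x_star = np.linalg.solve(A, b)
print("iterate:", x, "optimum:", x_star)  # linear convergence in the smooth + strongly convex case
```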

More info: Convex_optimization, Convex Optimization – Boyd and Vandenberghe

Stochastic Optimization

Sample average approximation (ERM) vs. stochastic approximation (SGD)

Full-Information Online Learning

Online convex optimization (OGD, ONS), OnlineToBatch

Bandit Online Learning

Multi-armed bandits, linear bandits, convex bandits

Summary-MLT-2

Published on 2018-07-04

Inequality

Information-Theoretic Inequalities

KL divergence $D(P||Q)$; joint entropy $H(X,Y)$, mutual information $I(X,Y)$, conditional entropy $H(X|Y)$; Han's inequality.

Basic Inequalities

Markov's inequality: for $\epsilon>0$ and a nonnegative random variable $X$,
$$Pr[X\geq\epsilon]\leq\frac{E[X]}{\epsilon}$$
Chebyshev's inequality follows directly from this inequality.

Jensen's inequality: for a convex function $f$,
$$E[f(X)]\geq f(E[X])$$

Chernoff's inequality: let $X_1,…,X_n$ be $n$ i.i.d. random variables with $X_i\in[\alpha,\beta]$, and let $\bar X=\sum_{i=1}^nX_i/n$. Then:
$$Pr[\bar X-\mu\geq\epsilon]\leq exp\left(\frac{-2n\epsilon^2}{(\beta-\alpha)^2}\right)$$
This inequality can be proved starting from Markov's inequality and using the moment-generating function.

McDiarmid's inequality: let $X_1,…,X_m$ be $m$ independent random variables taking values in $\cal X$, and let $f:\cal X^m\rightarrow\mathbb R$ be a real-valued function of $X_1,…,X_m$ such that for any $X_1,…,X_m, X_i'\in\cal X$:
$$|f(X_1,…,X_i,…,X_m)-f(X_1,…,X_i',…,X_m)|\leq c_i$$
Then for any $\epsilon>0$:
$$Pr[f(X_1,…,X_m)-E[f(X_1,…,X_m)]\geq\epsilon]\leq exp(-2\epsilon^2/\sum_{i=1}^mc_i^2)$$
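A quick numerical sanity check of the Chernoff-type bound above; the distribution (uniform on $[0,1]$), $n$, and $\epsilon$ are arbitrary illustrative choices.

```python
import numpy as np

# Compare the empirical tail Pr[ X_bar - mu >= eps ] with exp(-2 n eps^2 / (beta - alpha)^2).
rng = np.random.default_rng(0)
n, eps, trials = 50, 0.1, 100_000
alpha, beta, mu = 0.0, 1.0, 0.5

samples = rng.uniform(alpha, beta, size=(trials, n))
empirical = np.mean(samples.mean(axis=1) - mu >= eps)
bound = np.exp(-2 * n * eps**2 / (beta - alpha)**2)
print(f"empirical tail: {empirical:.4f}   bound: {bound:.4f}")  # empirical tail stays below the bound
```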


AdaBoost

Ensemble learning is a well-known family of machine learning methods: it trains a collection of learners and combines them by some rule, achieving generalization performance significantly better than that of any single learner. Common approaches include Boosting, Bagging, and random forests; the most famous representative of the Boosting family is AdaBoost. Starting from optimizing the exponential loss over a linear combination of base learners, the weight-update formula of AdaBoost can be derived directly.

Analyzing the generalization of AdaBoost from the VC-dimension perspective only tells us that, as the number of boosting rounds increases, the complexity of the hypothesis space produced by AdaBoost grows, suggesting an increasing risk of overfitting. In practice the opposite is observed: even as the number of training rounds grows, AdaBoost does not easily overfit, and it can keep reducing the generalization error even after the training error reaches zero. Explaining this phenomenon requires a new theory, the well-known margin theory. In fact, AdaBoost optimizes the margin distribution, and as the number of rounds increases the generalization error decreases.
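A compact sketch of AdaBoost with axis-aligned threshold stumps, showing the exponential-loss weight update mentioned above; the stump search and numerical clipping are my own illustrative choices, not from the notes.

```python
import numpy as np

def adaboost(X, y, rounds=20):
    """AdaBoost with threshold stumps as base learners; labels y must be in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                        # sample weight distribution
    stumps = []
    for _ in range(rounds):
        best = None
        # Pick the stump (feature, threshold, sign) with the smallest weighted error under D.
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.sign(X[:, j] - thr + 1e-12)
                    err = np.sum(D[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid division by zero in the log
        alpha = 0.5 * np.log((1 - err) / err)      # weight of this base learner
        pred = s * np.sign(X[:, j] - thr + 1e-12)
        D *= np.exp(-alpha * y * pred)             # exponential-loss weight update
        D /= D.sum()
        stumps.append((alpha, j, thr, s))
    return stumps

def predict(stumps, X):
    """Sign of the weighted vote of all stumps."""
    score = sum(a * s * np.sign(X[:, j] - thr + 1e-12) for a, j, thr, s in stumps)
    return np.sign(score)
```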


Consistency

The Bayes (optimal) classifier and the Bayes risk are, respectively:
$$g^*(x)=\mathbb I[\eta(x)>1/2]\qquad R^*=E_{x\sim\cal D_{\cal X}}[\min\{\eta(x),1-\eta(x)\}]$$

A learning method $F$ is said to be consistent (Bayes consistent, weakly consistent) if
$$E_{S_n}[R(F_{S_n})]\rightarrow R^*$$ as $n\rightarrow\infty$. Consistency means that the error rate of the learned function tends to the Bayes error rate as the sample size grows.

Consistency of Surrogate Loss Functions

For the common binary classification problem, the 0-1 loss is non-convex and discontinuous, so optimizing it directly is NP-hard and thus computationally infeasible. In practical algorithm design, the classification error loss is therefore relaxed to a convex upper bound of $\mathbb I[\cdot]$. E.g., AdaBoost optimizes the exponential loss $l(g(x),y)=exp(-yg(x))$, and SVMs use the hinge loss. To this end we introduce a function $\phi:\mathbb R\rightarrow\mathbb R$ such that $l(g(x),y)=\phi(yg(x))$. The function $\phi$ is called a surrogate loss function and is usually a continuous convex function.

Given a dataset and a surrogate loss, we can minimize the average surrogate loss on the dataset; since $\phi$ is usually a continuous convex function, standard optimization techniques apply to this surrogate objective. In essence, this can be viewed as stochastic approximation of minimizing the expected surrogate risk:
$$R_\phi(g)=E_{(X,Y)\sim\cal D}[\phi(Yg(X))]$$

Common surrogate loss functions include the following (a small numeric check appears after the list):

  • Least-squares loss (least-squares SVMs): $\phi(t)=(1-t)^2$
  • Hinge loss (SVMs): $\phi(t)=max(0, 1-t)$
  • Squared hinge loss (smooth SVMs): $\phi(t)=(max(0,1-t))^2$
  • Exponential loss (Boosting): $\phi(t)=exp(-t)$
  • Logistic loss (logistic regression): $\phi(t)=log(1+exp(-t))$
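A minimal sketch tabulating these surrogates as functions of the margin $t=yg(x)$ next to the 0-1 loss; the evaluation grid is arbitrary. (Note the logistic loss must be rescaled by $1/\ln 2$ if one wants it to dominate the 0-1 loss at $t=0$.)

```python
import numpy as np

# Surrogate losses from the list above, written as functions of the margin t = y * g(x).
surrogates = {
    "least squares": lambda t: (1 - t) ** 2,
    "hinge":         lambda t: np.maximum(0.0, 1 - t),
    "squared hinge": lambda t: np.maximum(0.0, 1 - t) ** 2,
    "exponential":   lambda t: np.exp(-t),
    "logistic":      lambda t: np.log(1 + np.exp(-t)),
}

zero_one = lambda t: (t <= 0).astype(float)   # 0-1 loss as a function of the margin

t = np.linspace(-2, 2, 9)
print("t            ", np.round(t, 2))
print("0-1          ", zero_one(t))
for name, phi in surrogates.items():
    print(f"{name:<13}", np.round(phi(t), 3))
```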

Surrogate loss consistency: let $\hat g_1,\hat g_2,…,\hat g_n,…$ be the learners obtained by optimizing the surrogate loss. If $R_\phi(\hat g_n)\rightarrow R_\phi^*\quad(n\rightarrow\infty)$ implies $R(\hat g_n)\rightarrow R^*$, the surrogate loss is said to be consistent with the 0-1 classification loss.
Note: consistency here is not just agreement at the optimum; it also includes this limiting behaviour. It can be shown that all of the loss functions above have this consistency property.

Consistency of Surrogate Losses for AUC Optimization

Analogously to the 0-1-loss-optimal function and the error rate, one can define the AUC-optimal function and risk, together with a corresponding notion of consistency. We state the conclusion directly: if $\phi:\mathbb R\rightarrow\mathbb R$ is a differentiable, non-increasing convex function satisfying $\phi'(0)<0$, then the surrogate loss $\Phi(g,X,X')=\phi(g(X)-g(X'))$ is consistent with AUC.

It follows that the exponential, logistic, and squared-hinge surrogates are consistent with AUC, while the hinge surrogate is not differentiable at $t=1$, so no conclusion can be drawn from this result.

Summary-MLT-1

Published on 2018-07-03

Computational learning theory studies learning performed through "computation"; it is the theoretical foundation of machine learning. Its goal is to analyze the intrinsic difficulty of learning tasks, to provide theoretical guarantees for learning algorithms, and to guide algorithm design based on such analyses.

Read more »

AA-09-14-PIT-Fingerprinting

Published on 2017-09-14

bases: Fundamental Theorem of Algebra/ Field/ ring of multivariate polynomials/ Chebyshev’s theorem/ Prime Number Theorem

Read more »

AA-09-07-Min-Cut-and-Max-Cut

Published on 2017-09-08

Bases: Graph Cut/ Multi-graphs/ Contraction/ ChainRule/ Induction/

Min-Cut

Min-cut problem:

  • Input: an undirected graph $G(V,E)$;
  • Output: a cut $C$ in $G$ with the smallest size $|C|$.

A canonical deterministic algorithm for this problem goes through the max-flow min-cut theorem: max-flow is the dual problem of min-cut, so once a max-flow is solved the corresponding min-cut is known. Finding a global min-cut this way takes $(n-1)\times maxFlowTime$, where $n=|V|$ is the number of vertices.

With contraction, Karger’s algorithm uses a simple idea:

  • At each step we randomly select an edge in the current multi-graph to contract until there are only two vertices left.
  • The parallel edges between these two remaining vertices must be a cut of the original graph.
  • We return this cut and hope that with good chance this gives us a minimum cut.

RandomContract(Karger 1993)
$\quad$Input: multi-graph $G(V,E)$.
$\quad$while $|V|>2$ do

  • choose an edge $uv\in E$ uniformly at random;
  • $G=Contract(G,uv)$;

$\quad$return $C=E(v_{n-1},v_n)$.

With a suitable choice of data structures (e.g., using bucket sort), each $Contract(G,e)$ operation can be implemented in $O(n)$ time, where $n = |V|$ is the number of vertices.
Thus RandomContract runs in $O(n^2)$ time on a graph with $n$ vertices.
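A straightforward Python sketch of RandomContract, representing the multigraph as an edge list and merging super-vertices with a union-find structure (my own implementation choice; it does not use the bucket-sort trick, so it is simple rather than optimal):

```python
import random

def random_contract(edges, n):
    """Karger's RandomContract on a multigraph given as a list of edges (u, v),
    with vertices labeled 0..n-1 and no self-loops.  Returns the edges of the cut."""
    parent = list(range(n))                      # union-find over merged super-vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x

    remaining = n
    while remaining > 2:
        u, v = random.choice(edges)              # a uniformly random remaining edge
        parent[find(u)] = find(v)                # contract it
        remaining -= 1
        edges = [e for e in edges if find(e[0]) != find(e[1])]  # discard self-loops
    return edges                                 # parallel edges between the two super-vertices
```

Repeating independent runs and keeping the smallest cut found gives the amplification discussed below.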

Analysis of accuracy (RandomContract):

Proposition 1:
$\quad$If $C$ is a minimum cut in a multi-graph $G$ and $e\notin C$, then $C$ is still a minimum cut in the contracted graph $G'=Contract(G,e)$.
Proposition 2:
$\quad$If $C$ is a min-cut in a multi-graph $G(V,E)$, then $|E|\geq\frac{|V||C|}{2}$.

We arbitrarily fix the input multi-graph $G$ and any particular minimum cut $C$ in $G$.

\begin{align}
p_{\text{correct}}
&=\Pr[\,\text{a minimum cut is returned by }RandomContract\,]\\
&\ge
\Pr[\,C\mbox{ is returned by }{RandomContract}\,]\\
&=
\Pr[\,e_i\not\in C\mbox{ for all }i=1,2,\ldots,n-2\,]\\
&=
\prod_{i=1}^{n-2}\Pr[e_i\not\in C\mid \forall j\lt i, e_j\not\in C]\\
&\ge
\prod_{i=1}^{n-2}\left(1-\frac{2}{n-i+1}\right)\\
&=
\prod_{k=3}^{n}\frac{k-2}{k}\\
&= \frac{2}{n(n-1)}.
\end{align}
This gives us the following theorem:
For any multigraph with $n$ vertices, the RandomContract algorithm returns a minimum cut with probability at least $\frac{2}{n(n-1)}$.
By running the algorithm independently $t=\frac{n(n-1)\ln n}{2}$ times, i.e. in $O(n^4\log n)$ total time, we can find a minimum cut with high probability ($1-\frac1n$).

A Corollary:
For any graph $G(V,E)$ of $n$ vertices, the number of distinct minimum cuts in $G$ is at most $\frac{n(n-1)}{2}$.


Fast Min-Cut

Recall the lower bound on $p$: the factor $(1-\frac2{n-i+1})$ decreases as $i$ increases, so the late contractions are the risky ones. The idea is to stop contracting once $G$ becomes small enough and handle the remaining graph more carefully.

RandomContract$(G, t)$
$\quad$Input: multi-graph $G(V,E)$, and integer $t\geq2$.
$\quad$while $|V|>t$ do

  • choose an edge $uv\in E$ uniformly at random;
  • $G=Contract(G,uv)$;

$\quad$return G.

The FastCut algorithm is recursively defined as follows.
FastCut(G)
$\quad$Input: multi-graph G(V,E);
$\quad$if $|V|\leq6$ then return a min-cut by brute force;
$\quad$else let $t=\left \lceil 1+|V|/\sqrt2 \right \rceil$:
$\qquad G_1=RandomContract(G,t);$
$\qquad G_2=RandomContract(G,t);$
$\qquad$return the smaller one of $FastCut(G_1)$ and $FastCut(G_2)$.
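A recursive Python sketch of FastCut following the pseudocode above; it reuses a contraction helper, returns only the cut size, and uses a brute-force base case over all bipartitions (all helper names and representations are my own):

```python
import itertools, math, random

def contract_to(edges, vertices, t):
    """Contract uniformly random edges until only t super-vertices remain.
    Returns the relabeled edge list and the surviving super-vertex labels."""
    parent = {v: v for v in vertices}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    remaining = len(vertices)
    while remaining > t:
        u, v = random.choice(edges)
        parent[find(u)] = find(v)
        remaining -= 1
        edges = [e for e in edges if find(e[0]) != find(e[1])]   # drop self-loops
    edges = [(find(u), find(v)) for u, v in edges]
    return edges, {find(v) for v in vertices}

def brute_force_mincut(edges, vertices):
    """Try every bipartition of a tiny vertex set; return the smallest cut size."""
    vs = list(vertices)
    best = len(edges)
    for r in range(1, len(vs)):
        for S in itertools.combinations(vs, r):
            S = set(S)
            best = min(best, sum((u in S) != (v in S) for u, v in edges))
    return best

def fast_cut(edges, vertices):
    """Karger-Stein FastCut: returns the size of a (probably) minimum cut."""
    n = len(vertices)
    if n <= 6:
        return brute_force_mincut(edges, vertices)
    t = math.ceil(1 + n / math.sqrt(2))
    results = []
    for _ in range(2):                                   # two independent contractions
        e1, v1 = contract_to(list(edges), vertices, t)
        results.append(fast_cut(e1, v1))
    return min(results)
```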

By the same analysis as in the case of RandomContract, we have

\begin{align}
&\Pr[C\text{ survives all contractions in }RandomContract(G,t)]\\
=
&\prod_{i=1}^{n-t}\Pr[C\text{ survives the }i\text{-th contraction}\mid C\text{ survives the first }(i-1)\text{ contractions}]\\
\ge
&\prod_{i=1}^{n-t}\left(1-\frac{2}{n-i+1}\right)\\
=
&\prod_{k=t+1}^{n}\frac{k-2}{k}\\
=
&\frac{t(t-1)}{n(n-1)}.
\end{align}

When $t=\left \lceil 1+|V|/\sqrt2 \right \rceil$, we have $t(t-1)\geq n(n-1)/2$, so this probability is at least $1/2$.

Analysis of accuracy (FastCut):

Let $A$ be the event that $C$ survives all contractions in $RandomContract(G,t)$;
Let $B$ be the event that the size of the min-cut is unchanged after $RandomContract(G,t)$;

\begin{align}
p(n)
&=
\Pr[\,FastCut(G)\text{ returns a min-cut in }G\,]\\
&=
\Pr[\,\text{ a min-cut of }G\text{ is returned by }FastCut(G_1)\text{ or }FastCut(G_2)\,]\\
&\ge
1-\left(1-\Pr[B\wedge FastCut(G_1)\text{ returns a min-cut in }G_1\,]\right)^2\\
&\ge
1-\left(1-\Pr[A\wedge FastCut(G_1)\text{ returns a min-cut in }G_1\,]\right)^2\\
&=
1-\left(1-\Pr[A]\Pr[ FastCut(G_1)\text{ returns a min-cut in }G_1\mid A]\right)^2\\
&\ge
1-\left(1-\frac{1}{2}p\left(\left\lceil1+n/\sqrt{2}\right\rceil\right)\right)^2,
\end{align}
The base case is $p(n)=1$ for $n\leq6$; solving this recursion gives $p(n)=\Omega(\frac1{\log n})$.

Similarly, the time complexity satisfies $T(n)=2T(\left \lceil 1+n/\sqrt2 \right \rceil)+O(n^2)$. With the base case $T(n)=O(1)$ for $n\leq6$, this gives $T(n)=O(n^2\log n)$.


Max-Cut

Max-cut problem:

  • Input: an undirected graph $G(V,E)$;
  • Output: a cut $C$ in $G$ with the largest size $|C|$.

Max-cut is a typical MAX-CSP problem: each edge is a constraint over the Boolean variables of its two endpoints, and the goal is to maximize the number of satisfied constraints.

Since the problem is NP-hard, we may compromise on the goal and allow the algorithm to not always find the optimal solution. However, we still want to guarantee that the algorithm always returns a relatively good solution on all possible instances. This notion is formally captured by approximation algorithms and the approximation ratio.

Approximation ratio
Consider a greedy algorithm that tries to maximize the current $|E(S,T)|$ at every step. Let $SOL_G$ be the size of the cut $|E(S,T)|$ returned by the greedy algorithm. It is trivial that $SOL_G\leq OPT_G$. We say the approximation ratio of the greedy algorithm is $\alpha$ for some $0<\alpha\leq1$, if $\frac{SOL_G}{OPT_G}\geq \alpha$ for every possible instance $G$ of max-cut.
GreedyMaxCut is a 0.5-approximation algorithm for the max-cut problem; a Python sketch of the greedy rule appears after the derivation below.

\begin{align}
SOL_G
&= \sum_{i=1}^n\max\left(|E(S_i, \{v_i\})|,|E(T_i, \{v_i\})|\right)\\
&\ge \frac{1}{2}\sum_{i=1}^n\left(|E(S_i, \{v_i\})|+|E(T_i, \{v_i\})|\right)\\
&=\frac{1}{2}|E|\\
&\ge\frac{1}{2}OPT_G.
\end{align}
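As promised above, a sketch of the greedy 0.5-approximation: each vertex joins whichever side cuts more of its already-placed neighbors (the adjacency-set representation and tie-breaking are my own choices):

```python
def greedy_max_cut(n, edges):
    """Greedy max-cut: place vertices 0..n-1 in order on the side that cuts
    more of the already-placed neighbors.  Returns (S, T, cut size)."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    S, T = set(), set()
    for v in range(n):
        in_S = sum(1 for u in adj[v] if u in S)     # already-placed neighbors in S
        in_T = sum(1 for u in adj[v] if u in T)     # already-placed neighbors in T
        (T if in_S >= in_T else S).add(v)           # join the side that cuts more edges
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    return S, T, cut

# Example: on the 4-cycle the greedy cut is the full edge set.
# greedy_max_cut(4, [(0, 1), (1, 2), (2, 3), (3, 0)])  ->  ({1, 3}, {0, 2}, 4)
```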

Derandomization by conditional expectation

The greedy algorithm for max-cut can be seen as a derandomization, by conditional expectation, of the naive randomized algorithm that puts each vertex into $S$ or $T$ independently and uniformly at random, which cuts half of the edges in expectation.

Derandomization by pairwise independence

Enumerate vertices as $V=\{v_1,v_2,\ldots,v_n\}$.
let $k=\lceil\log (n+1)\rceil$;
for all $\vec{x}\in\{0,1\}^k$:
$\qquad$initialize $S_{\vec{x}}=T_{\vec{x}}=\emptyset$;
$\qquad$for $i=1, 2, \ldots, n$
$\qquad\quad$if $\bigoplus_{j:\lfloor i/2^j\rfloor\bmod 2=1}x_j=1$ then $v_i$ joins $S_{\vec{x}}$;
$\qquad\quad$else $v_i$ joins $T_{\vec{x}}$;
return the $\{S_{\vec{x}},T_{\vec{x}}\}$ with the largest $|E(S_{\vec{x}},T_{\vec{x}})|$.

The algorithm has approximation ratio 1/2 and runs in polynomial time.
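A direct Python transcription of the construction above, enumerating all $2^k$ seeds and assigning each $v_i$ by the parity of the seed bits selected by the binary digits of $i$ (vertices are labeled $1,\dots,n$ here; the function name is mine):

```python
import math

def maxcut_pairwise_independent(n, edges):
    """Derandomized max-cut via pairwise independent bits:
    enumerate all 2^k seeds x in {0,1}^k with k = ceil(log2(n+1)),
    and put v_i into S exactly when the XOR of the selected seed bits is 1."""
    k = math.ceil(math.log2(n + 1))
    best = (-1, None)
    for seed in range(2 ** k):                        # all seeds x in {0,1}^k
        x = [(seed >> j) & 1 for j in range(k)]
        S = set()
        for i in range(1, n + 1):                     # vertices v_1 .. v_n
            bit = 0
            for j in range(k):
                if (i >> j) & 1:                      # floor(i / 2^j) mod 2 == 1
                    bit ^= x[j]
            if bit == 1:
                S.add(i)
        cut = sum(1 for u, v in edges if (u in S) != (v in S))
        if cut > best[0]:
            best = (cut, S)
    return best                                       # (largest cut size, the side S)
```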

203-Kernel-SVM

Published on 2017-08-30

Kernel Trick

Dual SVM: another QP, carrying valuable geometric information and with almost no dependence on $\tilde{d}$.
$q_{n,m}=y_ny_mz_n^Tz_m$, and what we need is to compute $z_n^Tz_m=\phi(x_n)^T\phi(x_m)$ faster than $O(\tilde{d})$. For example, for $\phi_2$, the transform plus inner product can be done carefully in $O(d)$ rather than $O(\tilde{d})=O(d^2)$.
2nd order polynomial transform:
$\phi_2(x)=(1,x_1,x_2,…,x_d,x_1^2,x_1x_2,…,x_1x_d,x_2x_1,x_2^2,…,x_2x_d,…,x_d^2)$

The kernel is a trick that performs the transform and the inner product at the same time. Besides computing $Q$, when solving for the optimal bias $b$ from a support vector $(x_s,y_s)$, we do not use the explicit form of $w$; to use the trick, we simply substitute the primal formula for $w$. Similarly, when testing an input $x$, we do not use the final form of $w$.
Thus in kernel SVM, $w$ is implicitly stored in the $\alpha_s$ of the SVs, and we predict with the SVs only.
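A tiny numeric check of this idea for a 2nd-order polynomial kernel: $(1+x^Tx')^2$ computed in $O(d)$ equals the explicit transform-then-inner-product in $O(d^2)$. The feature map below is a $\sqrt2$-scaled variant of the $\phi_2$ above, chosen so that the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x, xp = rng.standard_normal(d), rng.standard_normal(d)

def phi2(x):
    """Explicit O(d^2) feature map whose inner product equals (1 + x.x')^2."""
    return np.concatenate(([1.0],
                           np.sqrt(2) * x,
                           np.outer(x, x).ravel()))   # all ordered pairs x_i * x_j

k_explicit = phi2(x) @ phi2(xp)          # transform, then inner product: O(d^2)
k_trick    = (1.0 + x @ xp) ** 2         # kernel trick: O(d)
print(k_explicit, k_trick)               # the two values agree up to float error
```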


Polynomial Kernel (Polynomial SVM)

General Polynomial Kernel:
$K_Q(x,x')=(\zeta+\gamma x^Tx')^Q\ with\ \gamma>0,\ \zeta\geq0$


Gaussian Kernel (Infinite Dimensional Transform)

$K(x,x')=exp(-\gamma||x-x'||^2)\ with\ \gamma>0$
also called the Radial Basis Function (RBF) kernel.
Powerful, but may overfit; $\gamma$ needs careful selection.


Other Valid Kernels

Necessary & sufficient conditions for a valid kernel (Mercer's condition); a numeric check follows the list:

  • symmetric
  • let $k_{ij}=K(x_i, x_j)$, the matrix $K$ must always be positive semi-definite
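A quick numeric check of Mercer's condition for the Gaussian kernel on a random sample: the kernel matrix is symmetric and its eigenvalues are non-negative up to rounding. The sample and $\gamma$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))         # 20 random points in R^3
gamma = 0.5

# Gaussian (RBF) kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding, so PSD
```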

Dual-SVM

Published on 2017-08-28

The primal svm (linear svm) is solvable with quadratic programming:
$
\begin{eqnarray}
\mathop{min}\limits_{b,w}\qquad \frac{1}{2}w^Tw\qquad\quad s.t.\quad y_n(w^Tx_n+b)\geq1,\ for\ n=1,2,…,N
\end{eqnarray}
$
For linear SVM, we can throw the problem into a QP with $d+1$ (not many) unknown variables. For non-linear SVM, we can replace $x_n$ with $z_n=\Phi(x_n)$, and we get $\tilde{d}+1$ (can be infinite) unknown variables.

Read more »

201-Linear-Support-Vector-Machine

Published on 2017-08-21

Premise: Linearly Separable.
Large-Margin Separating Hyperplane $\Leftrightarrow$ tolerate more noise $\Leftrightarrow$ robustness.

Distance to Hyperplane

$$h(x)=sign(w^Tx+b)$$
want distance$(x, b, w)$, with hyperplane $w^Tx'+b=0$

We can prove that $w\perp hyperplane$, and thus the distance is the projection of $x-x'$ onto $w$:
$distance(x,b,w)=\frac{|w^T(x-x')|}{||w||}=\frac{|w^Tx+b|}{||w||}$

For a separating hyperplane, every $n$ satisfies $y_n(w^Tx_n+b)>0$, so we want:
$\max\limits_{b,w}\qquad margin(b,w)$
$subject\ to\qquad every\ y_n(w^Tx_n+b)>0,\quad margin(b,w)=\min\limits_{n=1,…,N}\frac{1}{||w||}y_n(w^Tx_n+b)$

Actually, we can scale $\min\limits_{n=1,…,N}y_n(w^Tx_n+b)$ to 1, and the problem becomes finding the maximum of $\frac{1}{||w||}$:
$\max\limits_{b,w}\frac{1}{||w||}\ subject\ to\ \min\limits_{n=1,…,N}y_n(w^Tx_n+b)=1$

Finally, we have standard large-margin hyperplane problem:
$\min\limits_{b,w}\qquad \frac{1}{2}w^Tw$
$subject\ to\qquad y_n(w^Tx_n+b)\geq1\ for\ all\ n$
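A small sketch solving this standard hard-margin QP with a generic convex-optimization library (cvxpy here, assuming it is installed; the toy linearly separable data is made up for illustration, and a dedicated QP/SVM solver would be used in practice):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
# min 1/2 ||w||^2   s.t.   y_n (w^T x_n + b) >= 1 for all n
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print("w =", w.value, " b =", b.value)
```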

Solving General SVM

Quadratic Programming. (?)

Reasons behind Large-margin Hyperplane

SVM (the large-margin hyperplane) can be seen as 'weight-decay regularization' under the constraint $E_{in}=0$.

large-margin $\Rightarrow$ fewer dichotomies $\Rightarrow$ smaller ‘VC dim.’ $\Rightarrow$ better generalization.

Benefits of Large-Margin Hyperplanes.

  • not many: good for $d_{vc}$ and generalization.
  • sophisticated: good for possibly better $E_{in}$.

a new possibility: non-linear SVM

On-Going-Studies-in-AMDR

Published on 2017-08-14

Preface

CS research can be considered research on systems, from which we derive problems. A problem can be computational or non-computational; we only talk about computational problems here. A computational problem can be polynomial or non-polynomial. For polynomial problems, we can directly design algorithmic mechanisms. For non-polynomial problems, we should consider how hard they are; even for NP-hard problems, we may have approximation algorithms.
Game theory: individuals + rules = outcomes.


Social law auctions

social law: restrictions on the actions of the agents.
problem specification:

  • given: a Kripke Structure (or other), a property (specified as a logic formula).
  • to find out: a social law that changes the system so that it satisfies the property.

When the agents are rational and their types are private, this becomes a mechanism design problem.


MD for Social Networks

How to maximize influence when the nodes are self-interested agents.
How information is diffused in a social network.
How to reliably obtain parameters such as weights and thresholds.


Network Flow Auctions

Question: how to find the minimum-cost flow in the strategic case.
Challenge: each edge can sell parts of its capacity; in the prior-free case it falls into an unknown domain.


Multicast Mechanism Design

Question: how to find the optimal multicast tree in the strategic case.
It reduces to finding a Steiner tree in the strategic case.


Cellular Traffic Offloading

Question: how the base station should optimally offload its traffic to existing alternative wireless networks.
Challenges and opportunities: how to measure the value of a purchase.

2017-08-09-Matrix

Published on 2017-08-09

Linear Equations & Linear Independent

Start from a linear system with $n$ equations and $n$ unknowns.
$
\begin{eqnarray}
x_1-2x_2+x_3&=&0\\
2x_2-8x_3&=&8\\
-4x_1+5x_2+9x_3&=&-9
\end{eqnarray}
$
The matrix form:
$A=$
$
\begin{bmatrix}
1&-2&1\\
0&2&-8\\
-4&5&9
\end{bmatrix}
$
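For reference, the system above can be solved numerically; a minimal numpy check (the right-hand side is taken from the three equations given):

```python
import numpy as np

# Coefficient matrix and right-hand side of the system above.
A = np.array([[ 1, -2,  1],
              [ 0,  2, -8],
              [-4,  5,  9]], dtype=float)
b = np.array([0, 8, -9], dtype=float)

x = np.linalg.solve(A, b)     # unique solution since A is invertible
print(x)                      # [29. 16.  3.]
```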

Read more »