2017-08-04-Similarity-Measure


In statistics and related fields, a similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. For example, in the context of cluster analysis, Frey and Dueck suggest defining a similarity measure $s(x,y)=-||x-y||_2^2$, i.e. the negative of the squared Euclidean distance.
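
As a minimal sketch of the negative squared Euclidean similarity mentioned above, the function below uses plain Python lists. The name `neg_sq_euclidean_similarity` and the sample points are illustrative assumptions, not part of any library.

```python
def neg_sq_euclidean_similarity(x, y):
    """Similarity s(x, y) = -||x - y||_2^2 for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sum of squared coordinate differences, negated so that
    # identical points get the largest possible value (zero).
    return -sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical sample points, chosen only for illustration.
s_same = neg_sq_euclidean_similarity([1.0, 2.0], [1.0, 2.0])
s_near = neg_sq_euclidean_similarity([1.0, 2.0], [1.0, 3.0])
s_far = neg_sq_euclidean_similarity([1.0, 2.0], [4.0, 6.0])
```

Identical points score zero, and the score becomes more negative as the points move apart, which matches the "large for similar, negative for dissimilar" behaviour described above.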

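The data attributes being measured may be continuous values with an infinite domain or discrete values with a finite domain, corresponding to the quantitative and qualitative cases respectively. Similarity measures for mixed-type attributes also exist.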


Minkowski Distance

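The Minkowski distance is the $p$-norm of the difference, $||u-v||_p=\left(\sum_i|u_i-v_i|^p\right)^{1/p}$, where $p\geq1$.
It is the Manhattan distance when $p=1$.
It is the Euclidean distance when $p=2$.
It is the Chebyshev distance, $\max_i|u_i-v_i|$, when $p\rightarrow\infty$.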
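
The three special cases can be checked numerically with a small pure-Python sketch. The function `minkowski` below is a hypothetical helper written for this note, and treating `math.inf` as the Chebyshev limit is an assumption of the sketch. In SciPy, `scipy.spatial.distance.pdist` accepts the metric names `'minkowski'`, `'cityblock'`, `'euclidean'` and `'chebyshev'` for the same quantities.

```python
import math


def minkowski(u, v, p):
    """Minkowski distance ||u - v||_p for p >= 1; p = math.inf gives Chebyshev."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if math.isinf(p):
        # Limit p -> infinity: the largest coordinate difference dominates.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Illustrative points only.
u = [0.0, 0.0]
v = [3.0, 4.0]
d_manhattan = minkowski(u, v, 1)         # |3| + |4|
d_euclidean = minkowski(u, v, 2)         # sqrt(3^2 + 4^2)
d_chebyshev = minkowski(u, v, math.inf)  # max(|3|, |4|)
d_large_p = minkowski(u, v, 100)         # already very close to the Chebyshev value
```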


Standardized Euclidean Distance

Because each dimension may have a different distribution (a different expectation and standard deviation), we can standardize every dimension so that its expectation is zero and its standard deviation is 1. The means cancel in the difference $u_i-v_i$, so only the variances remain:
$$\sqrt{\sum_i(u_i-v_i)^2/V[x_i]}$$
where $V[x_i]$ is the variance of the $i$-th dimension, i.e. the $i$-th component of the variance vector $V$.
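
A minimal sketch of this formula, assuming the variance vector is estimated from the data set itself as the per-dimension sample variance (dividing by $n-1$). The helper names `variances` and `seuclidean` and the toy data are assumptions for illustration. `scipy.spatial.distance.pdist` offers a `'seuclidean'` metric for the same purpose.

```python
import math


def variances(rows):
    """Per-dimension sample variance (divides by n - 1) of equal-length rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(dims)]
    return [
        sum((r[i] - means[i]) ** 2 for r in rows) / (n - 1)
        for i in range(dims)
    ]


def seuclidean(u, v, V):
    """Standardized Euclidean distance: sqrt(sum_i (u_i - v_i)^2 / V[i])."""
    return math.sqrt(sum((a - b) ** 2 / var for a, b, var in zip(u, v, V)))


# Toy data: the second dimension has a much larger scale than the first.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
V = variances(data)
d = seuclidean(data[0], data[2], V)
```

In this toy data the two dimensions differ in scale by a factor of 200, yet after dividing by the variances each contributes equally to the distance.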


Hamming Distance & Jaccard similarity coefficient


Cosine Distance & Correlation distance

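Cosine distance:
$$1-\frac{u\cdot v}{||u||_2||v||_2}$$
Correlation distance, i.e. the cosine distance after centering each vector on its own mean:
$$1-\frac{(u-\bar u)\cdot(v-\bar v)}{||(u-\bar u)||_2||(v-\bar v)||_2}$$
Furthermore, the correlation coefficient:
$$\rho_{XY}=\frac{Cov(X,Y)}{\sqrt{D(X)}\sqrt{D(Y)}}$$
where $D(\cdot)$ denotes the variance.
$\rho_{XY}=1$ indicates perfect positive linear correlation.
$\rho_{XY}=-1$ indicates perfect negative linear correlation.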
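
The two distances can be sketched in pure Python as follows. The function names and sample vectors are illustrative assumptions, and the sketch does not guard against zero-norm (or constant) vectors, for which the formulas are undefined.

```python
import math


def cosine_distance(u, v):
    """1 - (u . v) / (||u||_2 ||v||_2); undefined for zero-norm vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def correlation_distance(u, v):
    """Cosine distance of the mean-centered vectors; undefined for constant vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_distance(
        [a - mean_u for a in u],
        [b - mean_v for b in v],
    )


# Illustrative vectors only.
u = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]      # same direction as u
shifted = [11.0, 12.0, 13.0]  # u plus a constant offset
reversed_u = [3.0, 2.0, 1.0]  # perfectly negatively correlated with u

d_cos_scaled = cosine_distance(u, scaled)
d_cos_shifted = cosine_distance(u, shifted)
d_corr_shifted = correlation_distance(u, shifted)
d_corr_reversed = correlation_distance(u, reversed_u)
```

Correlation distance equals one minus the sample correlation coefficient of the two vectors, so it ranges from 0 (coefficient of 1) to 2 (coefficient of -1). Unlike cosine distance, it ignores a constant offset added to either vector.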


Mahalanobis Distance

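$$\sqrt{(u-v)S^{-1}(u-v)^T}$$
where $u$ and $v$ are row vectors and $S^{-1}$ is the inverse of the covariance matrix $S$. When $S$ is the identity matrix this reduces to the Euclidean distance, and when $S$ is diagonal it reduces to the standardized Euclidean distance.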
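
A minimal sketch of the formula, assuming the inverse covariance matrix is already available and passed in as a list of rows (the same convention as the `VI` argument of `scipy.spatial.distance.mahalanobis`). The helper below and the example matrices are assumptions for illustration.

```python
import math


def mahalanobis(u, v, S_inv):
    """sqrt((u - v) S^{-1} (u - v)^T), with S_inv given as a list of rows."""
    d = [a - b for a, b in zip(u, v)]
    n = len(d)
    # Row vector times matrix: (d S_inv)_j = sum_i d_i * S_inv[i][j]
    d_S_inv = [sum(d[i] * S_inv[i][j] for i in range(n)) for j in range(n)]
    return math.sqrt(sum(t * x for t, x in zip(d_S_inv, d)))


# Identity covariance: the result is the ordinary Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_identity = mahalanobis([0.0, 0.0], [3.0, 4.0], identity)

# Covariance S = [[2, 1], [1, 2]] has inverse (1/3) * [[2, -1], [-1, 2]].
S_inv = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]
d_along = mahalanobis([1.0, 1.0], [0.0, 0.0], S_inv)    # along the correlation
d_against = mahalanobis([1.0, -1.0], [0.0, 0.0], S_inv)  # against the correlation
```

Both offsets have the same Euclidean length, but the one that follows the positive correlation encoded in $S$ is treated as closer than the one that goes against it.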


Spearman coefficient & KL Distance


References:
[1] 机器学习中的相似性度量 (Similarity Measures in Machine Learning)
[2] scipy.spatial.distance.pdist