[Updating] sklearn (16): Nearest Neighbors

阿新 • Published: 2018-12-13

Finding the Nearest Neighbors

1、NearestNeighbors

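#Unsupervised learner for implementing neighbor searches.
sklearn.neighbors.NearestNeighbors(n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None, **kwargs)
#n_neighbors: default number of neighbors to return for each query point (used by kneighbors)
#radius: default search radius; radius_neighbors returns every point within this distance of the query point
#algorithm: search algorithm, one of {'auto', 'ball_tree', 'kd_tree', 'brute'}
#leaf_size: leaf size passed to BallTree or KDTree, i.e. the number of points below which a node is searched by brute force; it affects build speed, query speed and memory use, not the results
#metric: distance metric
#p: power parameter of the Minkowski metric (p=1 is Manhattan distance, p=2 is Euclidean distance)
#metric_params: additional keyword arguments for the metric function
#n_jobs: number of parallel jobs to run for the neighbor search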

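#methods
fit(X[, y]) #Note: the square brackets in (X[, y]) mark y as an optional argument; it may be omitted
#X: training data {array-like, sparse matrix, BallTree, KDTree}. If the estimator was created with metric='precomputed', X must be a distance matrix of shape [n_samples, n_samples] whose entries are pairwise distances

get_params(deep=True)
#deep=True: also return the parameters of contained sub-objects that are estimators
#return: dict mapping parameter names to their values {param: value}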

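kneighbors(X=None, n_neighbors=None, return_distance=True)
#X: the query point or points; if None, the neighbors of each training point are returned (a point is not counted as its own neighbor)
#n_neighbors: number of neighbors to find; defaults to the value passed to the constructor
#return_distance=True: also return the distances from each query point to its neighbors
#return: (array of distances, array of indices of the nearest points); only the indices if return_distance=False


kneighbors_graph(X=None, n_neighbors=None, mode='connectivity')
#X: the query point or points
#n_neighbors: number of neighbors to find
#mode: type of matrix returned. 'connectivity' gives a matrix of ones and zeros marking whether two points are neighbors; 'distance' gives a matrix whose nonzero entries are the distances between neighboring points
#return: sparse matrix of shape [n_queries, n_samples]; when X=None the queries are the training points, so the shape is [n_samples, n_samples]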

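radius_neighbors(X=None, radius=None, return_distance=True)
#X: the query points
#radius: search radius; all points within this distance of a query point are returned
#return_distance=True: also return the distances
#return: (array of distances, array of indices of the neighbors); each entry is itself an array, since the number of neighbors can differ per query point, and the neighbors are not necessarily sorted by distance

radius_neighbors_graph(X=None, radius=None, mode='connectivity')
#return: sparse matrix of shape [n_queries, n_samples]

set_params(**params) #keyword arguments {key: value}
#Set the parameters of this estimator.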
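
As a rough illustration of the methods listed above, here is a minimal sketch on a made-up toy dataset (the six points and the query point are assumptions chosen for the example, not from the original notes):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data (hypothetical): six points lying on a line in 2-D.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
              [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
query = np.array([[0.9, 0.0]])

nn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)

# The 2 nearest training points to the query, nearest first.
distances, indices = nn.kneighbors(query)

# Every training point within distance 1.5 of the query (order not guaranteed).
rad_dist, rad_ind = nn.radius_neighbors(query, radius=1.5)

# Sparse 0/1 matrix marking the 2 nearest training points of each row of X.
graph = nn.kneighbors_graph(X, n_neighbors=2, mode="connectivity")

print(indices, distances)
print(sorted(rad_ind[0].tolist()))
print(graph.shape)
```

Note that `radius_neighbors` returns one array per query point, because the number of neighbors inside the radius can differ from query to query.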

2、KDTree

sklearn.neighbors.KDTree(X, leaf_size=40, metric='minkowski', **kwargs)
#X: training data
#leaf_size: number of points at which the tree switches to brute-force search within a node
#metric: distance metric

#attribute
.data    # memory view: the training data

#methods
kernel_density(self, X, h[, kernel, atol, ...]) #compute the kernel density estimate at the given points X
#X: the query points
#h: bandwidth of the kernel
#kernel: the kernel function to use
#atol, rtol: stopping tolerances: abs(K_true - K_ret) < atol + rtol * K_ret, where K_true is the exact result and K_ret is the returned (approximate) result
#breadth_first=True: if True, traverse the tree breadth-first; otherwise traverse it depth-first
#return_log: return the logarithm of the result
#return: the array of (log-)density evaluations, shape = X.shape[:-1]

query(X, k=1, return_distance=True, dualtree=False, breadth_first=False)
#X: the query points
#k: number of nearest neighbors to return
#dualtree=True: use the dual-tree formalism for the query
#return: (array of distances, array of indices of the nearest points)

query_radius(self, X, r, count_only=False)
#r: radius
#count_only: if True, return only the number of points within r of each query point
#return: the count if count_only=True; otherwise the array of neighbor indices for each query point (plus the distances if return_distance=True is passed)

two_point_correlation(X, r, dualtree)
#compute the two-point correlation function
#r: radius, or array of radii
#return: for each radius r[i], the number of pairs of points whose distance is at most r[i]
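
A small sketch of `query` and `query_radius` on a KDTree follows; the four training points and the query point are invented for illustration only:

```python
import numpy as np
from sklearn.neighbors import KDTree

# Toy training data (hypothetical).
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
query = np.array([[0.9, 0.0]])

tree = KDTree(X, leaf_size=2)

# Two nearest neighbors of the query, nearest first.
dist, ind = tree.query(query, k=2)

# Number of training points within radius 1.5 of the query.
count = tree.query_radius(query, r=1.5, count_only=True)

# Indices of those points (one array per query point, unsorted by default).
ind_r = tree.query_radius(query, r=1.5)

print(ind, dist)
print(count, sorted(ind_r[0].tolist()))
```

Replacing `KDTree` with `BallTree` in this sketch should work the same way, since the two classes share their query interface.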

3、BallTree

BallTree(X, leaf_size=40, metric='minkowski', **kwargs)
#attributes and methods are the same as for KDTree

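[To be updated] kernel density (official docs: density estimation)
[To be updated] breadth-first search
[To be updated] depth-first search
[To be updated] KDTree
[To be updated] dual tree
[To be updated] BallTree
[To be updated] brute-force search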

Nearest Neighbors Classification

1、KNeighborsClassifier

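The goal of the algorithm is to find the k nearest neighbors of a new point and assign the new point the label that is most common among them (a majority vote, optionally weighted).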

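sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
#n_neighbors: number of neighbors
#weights: weight given to each neighbor at prediction time, one of {'uniform', 'distance', callable}. With weights='distance', each neighbor is weighted by the inverse of its distance, so closer neighbors count more.
#algorithm: algorithm used for the neighbor search {'auto', 'ball_tree', 'kd_tree', 'brute'}
#leaf_size: only relevant for KDTree and BallTree
#p: power parameter of the Minkowski metric
#metric: distance metric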
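
To make the voting concrete, here is a minimal sketch with invented one-dimensional data (two well-separated clusters labeled 0 and 1); the values are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): class 0 near 0..2, class 1 near 10..12.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# weights="distance": closer neighbors get a larger say in the vote.
clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

pred = clf.predict(np.array([[1.5], [10.5]]))
proba = clf.predict_proba(np.array([[1.5]]))

print(pred)   # predicted labels for the two query points
print(proba)  # per-class vote share for the first query point
```

Here all three neighbors of each query point belong to a single class, so the choice of `weights` does not change the outcome; it matters when the neighbors are mixed.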

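2、RadiusNeighborsClassifier

For data that is not uniformly sampled, RadiusNeighborsClassifier can be a better choice than k-nearest neighbors: for a given radius, a new point in a sparse neighborhood is classified from fewer neighbors, while a new point in a dense neighborhood is classified from more neighbors.

This algorithm is not well suited to high-dimensional data, where it may lose effectiveness because of the curse of dimensionality.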

sklearn.neighbors.RadiusNeighborsClassifier(radius=1.0, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', outlier_label=None, metric_params=None, n_jobs=None, **kwargs)
#outlier_label: label assigned to a new point that has no neighbors within radius (an outlier). If outlier_label=None, predicting such a point raises an error.
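
The effect of `outlier_label` can be sketched as follows. The data, the radius of 1.5, and the label -1 are assumptions picked for the example; recent scikit-learn versions may also print a warning when an outlier is encountered:

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

# Toy data (hypothetical): two clusters with a wide gap between them.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Query points with no training point within the radius get the label -1.
clf = RadiusNeighborsClassifier(radius=1.5, outlier_label=-1).fit(X, y)

# 1.0 has neighbors 0.0, 1.0, 2.0; 6.0 has none within 1.5.
pred = clf.predict(np.array([[1.0], [6.0]]))
print(pred)
```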

Nearest Neighbors Regression

1、KNeighborsRegressor

sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)

2、RadiusNeighborsRegressor

sklearn.neighbors.RadiusNeighborsRegressor(radius=1.0, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
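
Both regressors predict by averaging the targets of the neighbors they find. A minimal sketch, assuming invented data where the target equals the single feature:

```python
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

# Toy data (hypothetical): y equals x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]

# Uniform weights: the prediction is the plain mean of the neighbors' targets.
knn_reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)
rad_reg = RadiusNeighborsRegressor(radius=1.0).fit(X, y)

# For x = 1.5 both find the neighbors x = 1.0 and x = 2.0.
knn_pred = knn_reg.predict([[1.5]])
rad_pred = rad_reg.predict([[1.5]])
print(knn_pred, rad_pred)
```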

Nearest Neighbor Algorithms

brute force search

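Brute-force search is the naive approach: it computes the distances between all pairs of points in the dataset and uses them to find the k nearest neighbors of each new point. For N samples in D dimensions the cost scales as O(DN^2).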
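
The idea can be sketched in a few lines of plain Python. This is an illustrative toy, not scikit-learn's implementation; the function name `brute_force_knn` and the sample points are made up:

```python
import math

def brute_force_knn(train, query, k):
    """Return (distance, index) pairs of the k training points nearest to query."""
    # Compute the distance from the query to every training point: O(D * N).
    dists = [(math.dist(p, query), i) for i, p in enumerate(train)]
    # Sort by distance and keep the k closest.
    dists.sort()
    return dists[:k]

train = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
neighbors = brute_force_knn(train, (0.9, 0.0), k=2)
print([i for _, i in neighbors])
```

Running this for each of N query points gives the O(DN^2) total cost mentioned above.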

KDTree

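A KD tree is a binary tree.
The core idea of building a KD tree: take one feature (coordinate axis) at a time as the splitting criterion, send the points whose value is below the split value (typically the median of that feature) to the left child and the remaining points to the right child. Repeat recursively on each child until a node holds no more than leaf_size points.
KD trees suit low-dimensional data (D < 20); in higher dimensions their efficiency drops sharply because of the curse of dimensionality.
Building a KD tree costs O[DN log(N)]. For low D, finding the k nearest neighbors of a single new point costs about O[D log(N)].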
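
The construction described above can be sketched as follows. This is a simplified toy that cycles through the axes and splits at the median; scikit-learn's KDTree is implemented differently, and the dictionary layout and the name `build_kdtree` are assumptions for illustration:

```python
def build_kdtree(points, depth=0, leaf_size=1):
    """Recursively split points along one axis at a time, at the median."""
    # Stop splitting once a node is small enough.
    if len(points) <= max(leaf_size, 1):
        return {"leaf": True, "points": list(points)}
    axis = depth % len(points[0])          # cycle through the features
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                    # median position on this axis
    return {
        "leaf": False,
        "axis": axis,
        "split": pts[mid][axis],
        "left": build_kdtree(pts[:mid], depth + 1, leaf_size),
        "right": build_kdtree(pts[mid:], depth + 1, leaf_size),
    }

sample = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(sample)
print(tree["axis"], tree["split"])
```

Splitting at the median keeps the two children roughly equal in size, which is what keeps the tree depth near log(N).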

Ball tree

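A ball tree is a binary tree.
Although building a ball tree is costly (more so than building a KD tree), neighbor queries on it are efficient. Ball trees make up for the KD tree's weakness on high-dimensional data and can still perform well there.
A KD tree partitions the data along individual features into hyper-rectangular regions, whereas a ball tree partitions it using all features at once into hyperspherical regions.
A ball tree reduces the number of candidate points to search by using the triangle inequality: |x+y| <= |x| + |y|. (Let c be the distance from the target point to the center of the sibling node's ball, b the radius of that ball, and a the distance from the target point to the current nearest neighbor. If a + b < c, no point in the sibling node can be closer than the current nearest neighbor, so the node can be skipped; if a + b > c, the node may contain a closer point and must be searched.)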
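
The pruning test based on the triangle inequality can be written as a one-line check. This is a sketch of the idea only; the function name `can_prune_ball` and its arguments are hypothetical, and `radius` stands for the radius of the node's ball:

```python
import math

def can_prune_ball(target, center, radius, best_dist):
    """True if no point inside the ball can be closer to target than best_dist.

    By the triangle inequality, every point p in the ball satisfies
    dist(target, p) >= dist(target, center) - radius.
    """
    return math.dist(target, center) - radius >= best_dist

# Ball far from the target: its closest possible point is 8 away, worse than 3.
far = can_prune_ball((0.0, 0.0), (10.0, 0.0), radius=2.0, best_dist=3.0)
# Ball near the target: it could contain a point only 2 away, so search it.
near = can_prune_ball((0.0, 0.0), (4.0, 0.0), radius=2.0, best_dist=3.0)
print(far, near)
```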

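Comparison of the three algorithms above

When n_samples is below roughly 30, the brute algorithm is the most efficient.
The structure of the data can significantly affect the query time of Ball tree and KDTree. In general, sparser data with a small intrinsic dimensionality (the number of features after dimensionality reduction) is faster to query.
Effect of the number of neighbors k on the three algorithms: brute is largely unaffected, while the query time of ball tree and KDTree grows as k increases.
If there are only a few query points, brute is recommended, because building a BallTree or KDTree takes considerable time and is not worth it just to find the k nearest neighbors of a handful of new points. If there are many query points, use BallTree or KDTree.
KDTree suits lower-dimensional data (d < 10), BallTree suits higher-dimensional data. What I do not quite understand is why brute is used instead when k > N/2. If anyone passing by knows, please advise.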

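Nearest Centroid Classifier

Each class is represented by the centroid of its members. This resembles the label-update step of k-means, but it is a supervised classifier, not a clustering method.

Linear discriminant analysis assumes that all classes have identical variance;
quadratic discriminant analysis does not require the classes to share the same variance;
the Nearest Centroid Classifier performs poorly on non-convex classes and when the classes have very different variances, because it assumes equal variance in all dimensions. (Dividing each sample value by the variance of its feature puts all features on the same variance.)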

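sklearn.neighbors.NearestCentroid(metric='euclidean', shrink_threshold=None)
#shrink_threshold: threshold for shrinking the centroids. Each centroid's feature values are divided by the within-class variance of that feature and then reduced by shrink_threshold; a value that crosses zero is set to zero, which removes that feature from the classification. This puts the features on a common scale and helps suppress noisy features.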
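
A minimal usage sketch, with a small made-up dataset of two classes placed symmetrically around the origin:

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

# Toy data (hypothetical): class 1 in the lower-left, class 2 in the upper-right.
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
              [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
y = np.array([1, 1, 1, 2, 2, 2])

clf = NearestCentroid().fit(X, y)

# A new point is assigned the class of the nearest centroid.
pred = clf.predict(np.array([[-0.8, -1.0]]))
print(clf.centroids_)
print(pred)
```

With `shrink_threshold=None`, as here, the centroids are simply the per-class means of the features.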

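Breadth-first search and depth-first search

Depth-first search

Depth-first search resembles the preorder traversal of a tree and generalizes it to graphs. Steps:

1. Start from some vertex v of the graph and visit v.
2. Find the first unvisited neighbor of the vertex just visited and visit it. Take that vertex as the new current vertex and repeat this step until the vertex just visited has no unvisited neighbors.
3. Go back to the most recently visited vertex that still has unvisited neighbors, find its next unvisited neighbor, and visit it.
4. Repeat steps 2 and 3 until every vertex of the graph has been visited; the search is then complete.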

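#pseudocode
bool visited[MVNum];      //visited flags, all initialized to false
void DFS(Graph G, int v){
		cout<<v; visited[v]=true; //visit vertex v and mark it as visited
		for(int w=FirstAdjVex(G,v); w>=0; w=NextAdjVex(G,v,w))  //FirstAdjVex: the first neighbor of v; NextAdjVex: the next neighbor of v after w
				if(!visited[w]) DFS(G,w); //recurse into each unvisited neighbor
}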
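
The recursive procedure can also be sketched in runnable Python. The adjacency-list dictionary and the example graph are assumptions made for illustration:

```python
def dfs(graph, v, visited=None, order=None):
    """Depth-first traversal; returns the vertices in the order they are visited."""
    if visited is None:
        visited, order = set(), []
    visited.add(v)          # mark v as visited
    order.append(v)         # "visit" v
    for w in graph[v]:      # neighbors of v, in adjacency-list order
        if w not in visited:
            dfs(graph, w, visited, order)
    return order

# Hypothetical undirected graph: a square 0-1-3-2-0.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
dfs_order = dfs(graph, 0)
print(dfs_order)
```

From vertex 0 the search goes as deep as it can (0, 1, 3, 2) before backtracking.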

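Breadth-first search

Breadth-first search resembles the level-order traversal of a tree. Steps:

1. Start from some vertex v of the graph and visit v.
2. Visit each unvisited neighbor of v in turn.
3. Starting from each of those neighbors in turn, visit their unvisited neighbors, so that the neighbors of an earlier-visited vertex are visited before the neighbors of a later-visited vertex. Repeat step 3 until the neighbors of every visited vertex have been visited.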

void BFS(Graph G, int v){
		cout<<v; visited[v]=true; //visit vertex v and mark it as visited
		InitQueue(Q);  //create queue Q
		EnQueue(Q,v);  //enqueue vertex v
		while(!QueueEmpty(Q)){
				DeQueue(Q,u);  //dequeue the head of the queue and call it u
				for(int w=FirstAdjVex(G,u); w>=0; w=NextAdjVex(G,u,w)) //go through the neighbors of u in turn
						if(!visited[w]){
								cout<<w; //visit the neighbor
								EnQueue(Q,w);  //enqueue the neighbor
								visited[w]=true;  //mark the neighbor as visited
						}
		}
}
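
The same queue-based procedure as a runnable Python sketch; the adjacency-list dictionary and the example graph are assumptions for illustration:

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first traversal; returns the vertices in the order they are visited."""
    visited = {start}
    order = [start]
    queue = deque([start])
    while queue:
        v = queue.popleft()         # take the oldest vertex in the queue
        for w in graph[v]:          # neighbors of v, in adjacency-list order
            if w not in visited:
                visited.add(w)      # mark when enqueued, so it is queued once
                order.append(w)
                queue.append(w)
    return order

# Hypothetical undirected graph: a square 0-1-3-2-0.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
bfs_order = bfs(graph, 0)
print(bfs_order)
```

From vertex 0 the search visits both direct neighbors (1, 2) before moving one level further to 3.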
