
Using the WeightedCluster R package


1. Main purpose of the package

Clustering weighted data (mainly state sequences and weighted data) and evaluating the quality of the resulting clusters.

2. Installing the package

install.packages("WeightedCluster")
library(WeightedCluster)

3. Data input and computation

Load the mvad data set, which follows 712 individuals during the 1990s as they move from training to work.

#load the data
data(mvad)
aggMvad <- wcAggregateCases(mvad[, 17:86]) #identify identical sequences and aggregate them into weighted unique cases
uniqueMvad <- mvad[aggMvad$aggIndex, 17:86] #keep only one row per distinct sequence

#create a state sequence object and calculate the Hamming distance
mvad.seq <- seqdef(uniqueMvad, weights=aggMvad$aggWeights) #build a state sequence object with seqdef()
mvaddist <- seqdist(mvad.seq, method="HAM") #compute the Hamming distance between sequences

#cluster with hierarchical clustering
averageClust <- hclust(as.dist(mvaddist), method="average", members=aggMvad$aggWeights) #note the members argument of hclust, which carries the case weights

#display the hierarchical clustering result
clust4 <- cutree(averageClust , k=4)
seqdplot(mvad.seq, group = clust4, border=NA)

#cluster with the PAM method
pamclust4 <- wcKMedoids(mvaddist, k=4, weights=aggMvad$aggWeights) #the weights are stored in aggWeights; aggMvad$weight is NULL
#the medoid sequences can be displayed as follows
print(mvad.seq[unique(pamclust4$clustering), ], format="SPS")

#compute and display the quality of the hierarchical clustering
avgClustQual <- as.clustrange(averageClust, diss=mvaddist, weights=aggMvad$aggWeights, ncluster=10) #automatically computes several cluster quality measures (this call applies to a hierarchical clustering)
plot(avgClustQual) #plot the cluster quality measures
plot(avgClustQual, norm="zscore") #plot them as standardized scores
summary(avgClustQual, max.rank=2) #alternatively, retrieve the two best solutions according to each quality measure
plot(avgClustQual, stat=c("ASWw", "HG", "PBC", "HC")) #plot only the selected measures

#measure the quality of a given partition
clustqual4 <- wcClusterQuality(mvaddist, clust4, weights=aggMvad$aggWeights)
clustqual4$stats
sil <- wcSilhouetteObs(mvaddist, clust4, weights=aggMvad$aggWeights, measure="ASWw") #silhouette of each unique sequence
seqIplot(mvad.seq, group=clust4, sortv=sil) #index plot sorted by silhouette
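
The clustering above is computed on the aggregated (unique) sequences, so the resulting membership vector has one entry per distinct sequence, not one per individual. The sketch below shows one way to map the groups back onto the original 712 rows using the disaggIndex element returned by wcAggregateCases(). The object names agg, cl_unique and the column name clust4 are chosen here for illustration only, and the block repeats the earlier steps so that it can run on its own.

```r
library(TraMineR)         # provides mvad, seqdef() and seqdist()
library(WeightedCluster)  # provides wcAggregateCases()

data(mvad)

# Aggregate identical sequences and cluster the unique ones, as in the text
agg  <- wcAggregateCases(mvad[, 17:86])
seqs <- seqdef(mvad[agg$aggIndex, 17:86], weights = agg$aggWeights)
d    <- seqdist(seqs, method = "HAM")
hc   <- hclust(as.dist(d), method = "average", members = agg$aggWeights)
cl_unique <- cutree(hc, k = 4)   # one label per unique sequence

# disaggIndex gives, for every original row, the position of its unique sequence,
# so indexing with it expands the labels back to one per individual
mvad$clust4 <- cl_unique[agg$disaggIndex]

# Group sizes counted on individuals
table(mvad$clust4)
```

The group sizes obtained this way should match the weighted sizes of the aggregated solution, which is a quick sanity check that the weights were passed correctly.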

Supplementary notes
Notes on the hclust function:
1. The "centroid" method is meant to be used with squared Euclidean distances, e.g. hclust.centroid <- hclust(dist(cent)^2, method = "cen")

2. The "ward.D2" method is meant to be used with (unsquared) Euclidean distances.

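3. The "average" method (= UPGMA) accepts any dissimilarity; in ecology it is often paired with "bray" (= Bray-Curtis) distances.
Bray-Curtis dissimilarity is a measure used in ecology to quantify the difference in species composition between sites.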

4. About the members argument (from the hclust documentation):
If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage.
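Based on this description, the cluster tree can be rebuilt and displayed in whatever form is needed, for example by restarting it from an intermediate cut.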
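
To make the "started in the middle of the dendrogram" idea concrete, here is a minimal sketch adapted from the example in the hclust help page. It uses the built-in USArrests data rather than the sequence data from this post, and the choice of 10 clusters is arbitrary. It also shows where an object like cent in note 1 comes from: it is the matrix of cluster centroids.

```r
# Full tree: centroid linkage on squared Euclidean distances
hc <- hclust(dist(USArrests)^2, method = "cen")

# Cut the tree into 10 clusters and compute each cluster's centroid
memb <- cutree(hc, k = 10)
cent <- t(sapply(1:10, function(k) {
  colMeans(USArrests[memb == k, , drop = FALSE])
}))

# Restart from the 10 centroids; members gives the number of observations per cluster
hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))

# Compare the original tree with the tree rebuilt above the cut
opar <- par(mfrow = c(1, 2))
plot(hc,  labels = FALSE, hang = -1, main = "Original tree")
plot(hc1, labels = FALSE, hang = -1, main = "Restart from 10 clusters")
par(opar)
```

In the sequence example of this post, members plays the same role: each unique sequence is treated as a cluster whose size is its aggregation weight.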

References:
https://cran.r-project.org/web/packages/WeightedCluster/vignettes/WeightedCluster.pdf
https://cran.r-project.org/web/packages/WeightedCluster/vignettes/WeightedClusterPreview.pdf