Enjoy the pleasure in the ocean of big data

阿新 • Published: 2018-12-31

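BIRCH overview

BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
BIRCH clusters a dataset in a single scan, which minimizes I/O and makes it naturally suited to big data. It is built on a clustering feature tree (CF-tree): after the single scan, an in-memory CF-tree has been built, which can be viewed as a multi-level compression of the data.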

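Clustering features

Each CF is a triple (N, LS, SS), where N is the number of sample points in the CF, LS is the vector of per-dimension sums of those points, and SS is the sum of squares of those points over the feature dimensions.
A CF has a convenient property: it is additive, that is, CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).
A CF-tree also has several important parameters:
B: branching factor for internal nodes, the maximum number of children of each internal node
L: branching factor for leaf nodes, the maximum number of child clusters (MinClusters) of each leaf node
T: cluster diameter threshold, the maximum diameter allowed for each cluster; when adding a point would exceed it, the point starts a new cluster instead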
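The additivity of CFs described above can be sketched in a few lines of Java. This is a minimal illustration, not the code used later in this article: the class name `CfTriple` and its methods are hypothetical, and SS is assumed to be stored as a single scalar (the sum of squared norms of all points).

```java
/** Minimal clustering feature (N, LS, SS); all names here are illustrative. */
final class CfTriple {
    final int n;        // N: number of points summarised
    final double[] ls;  // LS: per-dimension linear sum of the points
    final double ss;    // SS: sum of squared norms (assumed scalar)

    CfTriple(int n, double[] ls, double ss) {
        this.n = n;
        this.ls = ls;
        this.ss = ss;
    }

    /** CF of a single point: (1, p, |p|^2). */
    static CfTriple ofPoint(double[] point) {
        double ss = 0.0;
        for (double v : point) {
            ss += v * v;
        }
        return new CfTriple(1, point.clone(), ss);
    }

    /** Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2). */
    CfTriple plus(CfTriple other) {
        double[] sum = new double[ls.length];
        for (int i = 0; i < ls.length; i++) {
            sum[i] = ls[i] + other.ls[i];
        }
        return new CfTriple(n + other.n, sum, ss + other.ss);
    }

    /** Centroid LS / N, recovered without revisiting the raw points. */
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int i = 0; i < ls.length; i++) {
            c[i] = ls[i] / n;
        }
        return c;
    }
}
```

Because merging is a component-wise sum, a parent node can keep the CF of an entire subtree and update it in constant time per insertion, which is what lets the tree act as a compressed summary of the data.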

Building the CF-tree

An example CF-tree:
[Figure: CF-tree]

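At the start, the algorithm scans the dataset, takes the first data point, and creates an empty Leaf and a MinCluster, making the MinCluster a child of the Leaf.
When a later point is inserted into the tree, it is wrapped in a MinCluster; call it CF_new. Starting from the root, use the distance D2 (a Euclidean-based distance between clusters) to find the child closest to CF_new, and descend into that subtree. The process is recursive and ends when CF_new is about to be added to a MinCluster: if that MinCluster's diameter after the addition does not exceed T, CF_new is merged into it; otherwise CF_new becomes a cluster of its own, as a sibling of that MinCluster. After the insertion, update the CF values of that node and of all its ancestors.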
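The final step of the insertion, the threshold test inside a leaf, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the article's implementation: `LeafInsertSketch` and `MiniCluster` are hypothetical names, the closest cluster is picked by centroid distance as a simplification of D2, and the diameter is computed from the CF alone using D² = (2·N·SS − 2·|LS|²) / (N·(N − 1)).

```java
import java.util.ArrayList;
import java.util.List;

final class LeafInsertSketch {
    /** Sub-cluster summarised only by its CF triple (N, LS, SS). */
    static final class MiniCluster {
        int n;
        double[] ls;
        double ss;

        MiniCluster(double[] p) {
            n = 1;
            ls = p.clone();
            ss = squaredNorm(p);
        }

        /** Diameter the cluster would have after absorbing p. */
        double diameterIfAdded(double[] p) {
            int n2 = n + 1;
            double ss2 = ss + squaredNorm(p);
            double lsNorm2 = 0.0;
            for (int i = 0; i < ls.length; i++) {
                double s = ls[i] + p[i];
                lsNorm2 += s * s;
            }
            double d2 = (2.0 * n2 * ss2 - 2.0 * lsNorm2) / ((double) n2 * (n2 - 1));
            return Math.sqrt(Math.max(d2, 0.0)); // guard against rounding below zero
        }

        void add(double[] p) {
            n++;
            for (int i = 0; i < ls.length; i++) {
                ls[i] += p[i];
            }
            ss += squaredNorm(p);
        }

        double squaredDistanceToCentroid(double[] p) {
            double d = 0.0;
            for (int i = 0; i < ls.length; i++) {
                double diff = ls[i] / n - p[i];
                d += diff * diff;
            }
            return d;
        }
    }

    static double squaredNorm(double[] p) {
        double s = 0.0;
        for (double v : p) {
            s += v * v;
        }
        return s;
    }

    /** Returns true if p started a new sub-cluster, false if it was absorbed. */
    static boolean insert(List<MiniCluster> leaf, double[] p, double t) {
        MiniCluster closest = null;
        double best = Double.POSITIVE_INFINITY;
        for (MiniCluster c : leaf) {
            double d = c.squaredDistanceToCentroid(p);
            if (d < best) {
                best = d;
                closest = c;
            }
        }
        if (closest != null && closest.diameterIfAdded(p) <= t) {
            closest.add(p);
            return false;
        }
        leaf.add(new MiniCluster(p));
        return true;
    }
}
```

With T = 2.0, inserting (0,0), then (1,0), then (10,0) into an empty leaf absorbs the second point into the first cluster and starts a second cluster for the third. A full tree would also check the leaf's child count against L after a new cluster is created.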

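After a new node is inserted, some node may have more than B (or L) children, in which case that node must split. Take a Leaf: it now holds L+1 MinClusters, so we create a new Leaf as a sibling of the original one; every newly created Leaf must also be inserted into the doubly linked list of leaves. The L+1 MinClusters are divided between the two Leaves as follows: find the two MinClusters that are farthest apart (by D2), then assign each remaining MinCluster to whichever of the two is closer. Afterwards, update the CF values of the two Leaves; the CF values of their ancestors are unchanged and need no update. The split can cascade upward recursively, because after the Leaf splits, its parent may have more than B children.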
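The split rule above (seed with the farthest pair, then assign everything else to the nearer seed) can be sketched as follows. The class `LeafSplitSketch` and its methods are hypothetical, each MinCluster is represented only by its centroid, and plain squared Euclidean distance between centroids stands in for D2.

```java
import java.util.ArrayList;
import java.util.List;

final class LeafSplitSketch {
    static double squaredDistance(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
        }
        return d;
    }

    /**
     * Splits the centroids of an overfull leaf into two groups.
     * Assumes at least two centroids are passed in.
     */
    static List<List<double[]>> split(List<double[]> centroids) {
        // Step 1: find the pair of centroids that are farthest apart.
        int seedA = 0;
        int seedB = 1;
        double farthest = -1.0;
        for (int i = 0; i < centroids.size(); i++) {
            for (int j = i + 1; j < centroids.size(); j++) {
                double d = squaredDistance(centroids.get(i), centroids.get(j));
                if (d > farthest) {
                    farthest = d;
                    seedA = i;
                    seedB = j;
                }
            }
        }

        // Step 2: every centroid joins whichever seed is closer.
        double[] a = centroids.get(seedA);
        double[] b = centroids.get(seedB);
        List<double[]> left = new ArrayList<>();
        List<double[]> right = new ArrayList<>();
        for (double[] c : centroids) {
            if (squaredDistance(c, a) <= squaredDistance(c, b)) {
                left.add(c);
            } else {
                right.add(c);
            }
        }

        List<List<double[]>> halves = new ArrayList<>();
        halves.add(left);
        halves.add(right);
        return halves;
    }
}
```

Since CFs are additive, the CFs of the two resulting leaves sum to the CF of the original leaf, which is why the ancestors' CF values do not change after the split.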

Code walkthrough

    private ClusteringFeature buildCFTree() {
        NonLeafNode rootNode = null;
        LeafNode leafNode = null;
        Cluster cluster = null;

        for (String[] record : totalDataRecords) {
            cluster = new Cluster(record);

            if (rootNode == null) {
                // Case where the CF-tree consists of a single node
                if (leafNode == null) {
                    leafNode = new LeafNode();
                }
                leafNode.addingCluster(cluster);
                if (leafNode.getParentNode() != null) {
                    rootNode = leafNode.getParentNode();
                }
            } else {
                if (rootNode.getParentNode() != null) {
                    rootNode = rootNode.getParentNode();
                }

                // Starting from the root, descend to the closest leaf node to insert into
                LeafNode temp = rootNode.findedClosestNode(cluster);
                temp.addingCluster(cluster);
            }
        }

        // Walk upward from the last cluster to the topmost node and return it as the root
        LeafNode node = cluster.getParentNode();
        NonLeafNode upNode = node.getParentNode();
        if (upNode == null) {
            return node;
        } else {
            while (upNode.getParentNode() != null) {
                upNode = upNode.getParentNode();
            }

            return upNode;
        }
    }

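As shown, insertion starts at the root, finds the closest target leaf node, and then calls that leaf's addingCluster method to add the cluster to the tree.
The addingCluster method of LeafNode: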

public void addingCluster(ClusteringFeature clusteringFeature) {
        // Update this node's clustering feature values
        directAddCluster(clusteringFeature);

        // The closest existing cluster found
        Cluster findedCluster = null;
        Cluster cluster = (Cluster) clusteringFeature;

        double disance = Double.MAX_VALUE;
        double errorDistance = 0;
        boolean needDivided = false;
        if (clusterChilds == null) {
            clusterChilds = new ArrayList<>();
            clusterChilds.add(cluster);
            cluster.setParentNode(this);
        } else {
            for (Cluster c : clusterChilds) {
                errorDistance = ClusteringFeature.computerClusterDistance(c, cluster);
                if (errorDistance < disance) {
                    // Keep the cluster with the smallest inter-cluster distance
                    disance = errorDistance;
                    findedCluster = c;
                }
            }

            ArrayList<double[]> data1 = (ArrayList<double[]>) findedCluster.getData().clone();
            ArrayList<double[]> data2 = cluster.getData();
            data1.addAll(data2);
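            // If the merged cluster's intra-cluster distance exceeds the threshold, a new cluster is needed
            if (ClusteringFeature.computerInClusterDistance(data1) > BIRCHTool.T) {
                // Above T: the incoming cluster becomes a new cluster of this leaf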
                clusterChilds.add(cluster);
                cluster.setParentNode(this);
                // A leaf may hold at most L child clusters
                if (clusterChilds.size() > BIRCHTool.L) {
                    needDivided = true;
                }
            } else {
                findedCluster.directAddCluster(cluster);
                cluster.setParentNode(this);
            }
        }

        if(needDivided){
            if(parentNode == null){
                parentNode = new NonLeafNode();
            }else{
                parentNode.getLeafChilds().remove(this);
            }

            LeafNode[] nodeArray = divideLeafNode();
            for(LeafNode n: nodeArray){
                parentNode.addingCluster(n);
            }
        }
    }

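The method first finds the closest cluster and tentatively merges the incoming cluster into it. If the merged cluster's intra-cluster distance would exceed T, the incoming cluster becomes a new cluster instead, and if the leaf then holds more than L clusters, the leaf must split. The leaf splits into two leaf nodes, and the addingCluster method of the non-leaf parent is then called to propagate the update upward through the ancestors.
The addingCluster method of NonLeafNode: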

public void addingCluster(ClusteringFeature clusteringFeature) {
        LeafNode leafNode = null;
        NonLeafNode nonLeafNode = null;
        NonLeafNode[] nonLeafNodeArrays;
        boolean neededDivide = false;
        // Update this node's clustering feature values
        directAddCluster(clusteringFeature);

        if (clusteringFeature instanceof LeafNode) {
            leafNode = (LeafNode) clusteringFeature;
        } else {
            nonLeafNode = (NonLeafNode) clusteringFeature;
        }

        if (nonLeafNode != null) {
            neededDivide = addingNeededDivide(nonLeafNode);

            if (neededDivide) {
                if (parentNode == null) {
                    parentNode = new NonLeafNode();
                } else {
                    parentNode.nonLeafChilds.remove(this);
                }

                nonLeafNodeArrays = this.nonLeafNodeDivided();
                for (NonLeafNode n1 : nonLeafNodeArrays) {
                    parentNode.addingCluster(n1);
                }
            }
        } else {
            neededDivide = addingNeededDivide(leafNode);

            if (neededDivide) {
                if (parentNode == null) {
                    parentNode = new NonLeafNode();
                } else {
                    parentNode.nonLeafChilds.remove(this);
                }

                nonLeafNodeArrays = this.leafNodeDivided();
                for (NonLeafNode n2 : nonLeafNodeArrays) {
                    parentNode.addingCluster(n2);
                }
            }
        }
    }
