Using Spark MLlib's K-Means Algorithm to Analyse Network Attack Data
阿新 · Published: 2018-11-22
package apache.spark.mlib.rdd.kmeanclustering

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

/**
 * Based on: https://medium.com/tensorist/using-k-means-to-analyse-hacking-attacks-81957c492c93
 */
object AnalysingHackingAttacks {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("AnalysingHackingAttacks")
      .master("local[2]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    /**
     * Load the data. Columns:
     *   Session_Connection_Time (how long the session lasted, in minutes)
     *   Bytes_Transferred       (megabytes transferred during the session)
     *   Kali_Trace_Used         (whether the hacker was using Kali Linux)
     *   Servers_Corrupted       (number of servers corrupted during the attack)
     *   Pages_Corrupted         (number of pages illegally accessed)
     *   Location                (location the attack came from)
     *   WPM_Typing_Speed        (estimated typing speed based on session logs)
     * Data: https://www.dropbox.com/s/g5r2dh46abx1vdr/hack_data.csv?dl=0
     */
    val data = spark.read.option("header", "true").csv("data/hack_data.csv")
    data.printSchema()
    data.show(5)

    // Cast the string columns produced by the CSV reader to proper types,
    // keeping the original column names via explicit aliases.
    data.createOrReplaceTempView("hack_data_table")
    val sqlString = "select " +
      "cast(Session_Connection_Time as Double) as Session_Connection_Time, " +
      "cast(Bytes_Transferred as Double) as Bytes_Transferred, " +
      "cast(Kali_Trace_Used as Integer) as Kali_Trace_Used, " +
      "cast(Servers_Corrupted as Double) as Servers_Corrupted, " +
      "cast(Pages_Corrupted as Double) as Pages_Corrupted, " +
      "cast(Location as String) as Location, " +
      "cast(WPM_Typing_Speed as Double) as WPM_Typing_Speed " +
      "from hack_data_table"
    println(sqlString)
    val df = spark.sql(sqlString)
    df.show(5)

    // The Location column is useless to consider because the hackers probably
    // used VPNs to hide their real locations during the attacks.
    val dropLocationDF = df.drop("Location")

    val cols = Array("Session_Connection_Time", "Bytes_Transferred", "Kali_Trace_Used",
      "Servers_Corrupted", "Pages_Corrupted", "WPM_Typing_Speed")

    /**
     * We can assemble our attributes into one column using Spark's VectorAssembler.
     * When creating a VectorAssembler object, we must specify the input columns
     * and the output column. The input columns are a list of columns that we want
     * to assemble, and the output column is just a name for the column created by
     * the assembler.
     */
    val assembler = new VectorAssembler().setInputCols(cols).setOutputCol("features")

    /**
     * Now that we've created our assembler, we can use it to transform our data.
     * Upon transformation, our data will contain all the original attributes as
     * well as the newly created attribute, called features.
     */
    val assembled_data = assembler.transform(dropLocationDF)
    println("show assembled_data top 5")
    assembled_data.show(5)

    /**
     * Feature scaling: next, we need to standardise our data. To accomplish this,
     * Spark has its own StandardScaler, which takes two arguments: the name of
     * the input column and the name of the output (scaled) column.
     */
    val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures")

    /**
     * Let's fit our scaler to the assembled dataframe and get the final cluster
     * dataset using the .transform() method. After transforming our data, the
     * dataframe will contain the newly created scaledFeatures attribute along
     * with the original attributes.
     */
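    // Aside (an assumption not spelled out in the original article): this
    // StandardScaler keeps its defaults, withStd = true and withMean = false,
    // so each feature is scaled to unit standard deviation but not centred.
    // Calling .setWithMean(true) on the scaler above would also subtract the mean.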
    val scaler_model = scaler.fit(assembled_data)
    val scaled_data = scaler_model.transform(assembled_data)
    println("show scaled_data top 5")
    scaled_data.show(5)
    scaled_data.printSchema()

    /**
     * To tackle the question of whether there were two hackers or three, we can
     * create two k-means models. One model will be initialised with two clusters
     * (k = 2), and the other will be initialised with three clusters (k = 3).
     * We also specify the column we want to pass to the model for training.
     * (A k = 5 model is built as an extra sanity check.)
     */
    val k_means_2 = new KMeans().setFeaturesCol("scaledFeatures").setK(2)
    val k_means_3 = new KMeans().setFeaturesCol("scaledFeatures").setK(3)
    val k_means_5 = new KMeans().setFeaturesCol("scaledFeatures").setK(5)

    /**
     * The idea behind this approach is that, based on the number of attacks
     * belonging to each cluster, we can figure out the number of groups involved
     * in the attacks. Let's fit our models on scaled_data.
     */
    val model_k2 = k_means_2.fit(scaled_data)
    val model_k3 = k_means_3.fit(scaled_data)
    val model_k5 = k_means_5.fit(scaled_data)

    /**
     * Was a third hacker involved? Finally, it's time to find out how many
     * hackers were involved in the attacks. Calling .transform() on a clustering
     * model adds a new column called prediction, containing the integer id of
     * the cluster to which each attack instance has been assigned. Let's look at
     * how many instances are grouped into each cluster in the case of three
     * clusters.
     */
    val model_k3_data = model_k3.transform(scaled_data)
    model_k3_data.groupBy("prediction").count().show()

    /**
     * The number of instances is not similar across the three clusters. Since
     * this goes against our background information that the hackers traded off
     * attacks, it seems unlikely that three hackers were involved. Next, let's
     * look at the instance counts in the case of two clusters.
     */
    val model_k2_data = model_k2.transform(scaled_data)
    model_k2_data.groupBy("prediction").count().show()

    /**
     * Both clusters here have exactly the same number of instances assigned to
     * them, and this perfectly aligns with the idea of hackers trading off
     * attacks. Therefore, it is highly likely that only two hackers were
     * involved in the attacks at RhinoTech.
     */

    // Try k = 5 as well.
    val model_k5_data = model_k5.transform(scaled_data)
    model_k5_data.groupBy("prediction").count().show()
  }
}
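The article settles on k = 2 by eyeballing how evenly the instances split across clusters. A more quantitative check, not used in the original post, is the silhouette score (closer to 1 is better) that Spark 2.3+ exposes through ClusteringEvaluator. A minimal sketch, assuming it is appended inside the same main method so that model_k2_data, model_k3_data and model_k5_data are still in scope:

    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    // Silhouette with squared Euclidean distance (the evaluator's defaults):
    // it weighs cohesion inside a cluster against separation from the nearest
    // other cluster, so a higher score means a better-defined clustering.
    val evaluator = new ClusteringEvaluator()
      .setFeaturesCol("scaledFeatures")
      .setPredictionCol("prediction")

    for ((k, clustered) <- Seq(2 -> model_k2_data, 3 -> model_k3_data, 5 -> model_k5_data)) {
      println(s"k = $k, silhouette = ${evaluator.evaluate(clustered)}")
    }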
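The learned centroids can also be inspected directly; a fitted KMeansModel exposes them as clusterCenters. Note that in this example they live in the scaled feature space, so the values are not in the original units:

    // One centroid Vector per cluster, in the scaled feature space.
    model_k2.clusterCenters.foreach(println)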