
Spark MLlib: Feature Extraction, Feature Transformation, and Feature Selection

Feature Extraction

  • TF-IDF

TF-IDF is widely used in text mining to reflect how important a term is to a document in a corpus. Let t denote a term, d a document, and D the corpus. The term frequency TF(t, d) is the number of times term t appears in document d; the document frequency DF(t, D) is the number of documents in D that contain t. If importance were measured by TF alone, a term that appears many times in a document but carries very little information (e.g. "a", "the", "of") would be overweighted. The inverse document frequency IDF therefore measures how much information a term provides:
IDF(t, D) = log( (|D| + 1) / (DF(t, D) + 1) )
Here |D| is the total number of documents in the corpus. Clearly, if t appears in every document, its IDF value is 0. TF-IDF is then simply the product of the two:
TFIDF(t, d, D) = TF(t, d) × IDF(t, D)
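As a quick worked example (my own toy numbers, using the smoothed formula above): in a corpus of |D| = 3 documents where t appears in DF(t, D) = 2 of them, IDF(t, D) = log(4/3) ≈ 0.288, and if TF(t, d) = 5 in some document d, then TFIDF(t, d, D) ≈ 5 × 0.288 = 1.44.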


Example:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
  • Word2Vec
    Word2Vec learns a fixed-length vector representation (embedding) for each word in the corpus, and then represents a document as the average of the vectors of the words it contains. It is commonly used for document-similarity computations.
    Example:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }

Feature Transformation

  • n-gram
    An n-gram is a contiguous sequence of n words. A classic application exploits the co-occurrence statistics of adjacent words: when an unsegmented run of pinyin, strokes, or digits (standing for letters or strokes) must be converted into a string of Chinese characters (a sentence), the sentence with the highest probability can be computed and the conversion done automatically, with no manual picking, sidestepping the ambiguity of many characters mapping to the same pinyin (or stroke or digit string). The model rests on the assumption that the Nth word depends only on the preceding N-1 words and on no other word, so the probability of the whole sentence is the product of its words' conditional probabilities (see the factorization sketched after the example below).
    Example:
import org.apache.spark.ml.feature.NGram

val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
  (2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")

val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(false)
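The Markov assumption mentioned above factorizes a sentence's probability as (in my notation):

P(w_1, ..., w_m) ≈ ∏_{i=1}^{m} P(w_i | w_{i-N+1}, ..., w_{i-1})

so a bigram model (N = 2, matching setN(2) in the example) scores a sentence by multiplying P(w_i | w_{i-1}) over consecutive word pairs.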
  • Normalization
    Spark MLlib provides the Normalizer class for normalization: each Vector is scaled to unit p-norm. The value of p can be specified; the default is 2.
    Example:
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)),
  (1, Vectors.dense(2.0, 1.0, 1.0)),
  (2, Vectors.dense(4.0, 10.0, 2.0))
)).toDF("id", "features")

// Normalize each Vector using $L^1$ norm.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)

val l1NormData = normalizer.transform(dataFrame)
println("Normalized using L^1 norm")
l1NormData.show()

// Normalize each Vector using $L^\infty$ norm.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
println("Normalized using L^inf norm")
lInfNormData.show()
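To make the numbers concrete (my own worked example for the first row): [1.0, 0.5, -1.0] has L^1 norm 1.0 + 0.5 + 1.0 = 2.5, so it normalizes to [0.4, 0.2, -0.4]; its L^inf norm is max(1.0, 0.5, 1.0) = 1.0, so under the infinity norm this row is unchanged.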

Feature Selection

Feature selection picks a subset of the available features. Machine-learning problems often come with a very large number of features, and we want to keep only the useful ones. Suppose we have a DataFrame with a column userFeatures:

 userFeatures
------------------
 [0.0, 10.0, 0.5]

Its first component is always 0, so it carries no information and should be removed. We can use the VectorSlicer class with setIndices(Array(1, 2)) to keep only the last two components:

 userFeatures     | features
------------------|-------------
 [0.0, 10.0, 0.5] | [10.0, 0.5]

Example:

import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)
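VectorSlicer takes the union of the features selected by setIndices and setNames, and orders the output with the index-selected features first, followed by the name-selected ones. For the rows above, index 1 (f2) together with the name "f3" should yield [2.3, 0.0] for both rows (my reading of this example; the exact printed formatting of show(false) may differ).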