Spark MLlib Feature Extraction, Feature Transformation, and Feature Selection
Feature Extraction
- TF-IDF
TF-IDF is widely used in text mining to reflect how important a term is. Let t be a term, d a document, and D the document corpus. The term frequency TF(t, d) is the number of times t appears in document d, and the document frequency DF(t, D) is the number of documents that contain t. Measuring importance by TF alone is misleading for a term that appears many times in a document yet carries almost no information, so importance is instead weighted by the inverse document frequency IDF:
IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
where |D| is the total number of documents. Clearly, if t appears in every document, its IDF is 0. The TF-IDF score is then:
TFIDF(t, d, D) = TF(t, d) · IDF(t, D)
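As a quick worked example (the numbers are chosen here for illustration and are not from the original post): with |D| = 3 documents and a term t appearing in DF(t, D) = 1 of them, IDF(t, D) = log(4 / 2) ≈ 0.693, so a term with TF(t, d) = 2 in document d scores TFIDF(t, d, D) ≈ 1.386.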
Example:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val sentenceData = spark.createDataFrame(Seq(
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
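The comment above notes that CountVectorizer can replace HashingTF for building term-frequency vectors. Below is a minimal sketch of that substitution, assuming the same wordsData DataFrame produced by the tokenizer; the vocabulary size of 20 is an arbitrary choice for this example.
import org.apache.spark.ml.feature.CountVectorizer
// Fit a vocabulary over the tokenized words and produce count-based term-frequency vectors.
val cvModel = new CountVectorizer()
.setInputCol("words").setOutputCol("rawFeatures").setVocabSize(20)
.fit(wordsData)
val countFeaturized = cvModel.transform(wordsData)
// The resulting "rawFeatures" column can be fed into the same IDF stage as above.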
- Word2Vec
Word2Vec is an Estimator that takes sequences of words and learns a fixed-size vector for each word; a document is then represented by the average of the vectors of its words. It is commonly used for document similarity computations.
Example:
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
"Hi I heard about Spark".split(" "),
"I wish Java could use case classes".split(" "),
"Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")
// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
.setInputCol("text")
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0)
val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
Feature Transformation
- n-gram
An n-gram is a contiguous sequence of n tokens. N-gram models exploit the co-occurrence statistics of adjacent words: for example, when converting an unsegmented sequence of pinyin, strokes, or the digits encoding them into a Chinese sentence, the model can compute the candidate sentence with the highest probability and perform the conversion automatically, sparing the user from manually resolving the many characters that share the same pinyin (or stroke/digit code). The model assumes that the n-th word depends only on the preceding n-1 words and on nothing else, so the probability of the whole sentence is the product of the conditional probabilities of its words.
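Written out explicitly (the standard n-gram factorization, not shown in the original post), this Markov assumption reads:
P(w_1, w_2, …, w_m) ≈ ∏_{i=1}^{m} P(w_i | w_{i-n+1}, …, w_{i-1})
so for the bigrams produced in the example below (n = 2), each word is conditioned only on the word immediately before it.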
Example:
import org.apache.spark.ml.feature.NGram
val wordDataFrame = spark.createDataFrame(Seq(
(0, Array("Hi", "I", "heard", "about", "Spark")),
(1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
(2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(false)
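As a small variation on the example above (not in the original post), setting n to 3 produces trigrams instead of bigrams; note that any input row with fewer than n tokens yields an empty n-gram array.
// Same input, but with trigrams instead of bigrams.
val trigram = new NGram().setN(3).setInputCol("words").setOutputCol("ngrams")
trigram.transform(wordDataFrame).select("ngrams").show(false)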
- Normalization
Spark MLlib provides the Normalizer class for normalization: it rescales each vector to unit p-norm. The value of p can be specified; the default is p = 2.
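As a quick worked example using the first row of the DataFrame below: with p = 1, the vector [1.0, 0.5, -1.0] has L^1 norm 1.0 + 0.5 + 1.0 = 2.5, so normalization rescales it to [0.4, 0.2, -0.4].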
Example:
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors
val dataFrame = spark.createDataFrame(Seq(
(0, Vectors.dense(1.0, 0.5, -1.0)),
(1, Vectors.dense(2.0, 1.0, 1.0)),
(2, Vectors.dense(4.0, 10.0, 2.0))
)).toDF("id", "features")
// Normalize each Vector using $L^1$ norm.
val normalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normFeatures")
.setP(1.0)
val l1NormData = normalizer.transform(dataFrame)
println("Normalized using L^1 norm")
l1NormData.show()
// Normalize each Vector using $L^\infty$ norm.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
println("Normalized using L^inf norm")
lInfNormData.show()
Feature Selection
Feature selection chooses a subset of features from the full feature set. Machine learning problems often involve many features, and we want to keep only the useful ones. Suppose we have a DataFrame with a column userFeatures:
userFeatures
------------------
[0.0, 10.0, 0.5]
Its first value is 0.0 and therefore carries no information, so it should be dropped. We use the VectorSlicer class with setIndices(Array(1, 2)) to select the remaining features:
userFeatures | features
------------------|-----------------------------
[0.0, 10.0, 0.5] | [10.0, 0.5]
Example:
import java.util.Arrays
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
val data = Arrays.asList(
Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
Row(Vectors.dense(-2.0, 2.3, 0.0))
)
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))
val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))
val output = slicer.transform(dataset)
output.show(false)