Converting between labels and indices in Spark ML

阿新 • Published: 2019-02-14

StringIndexer

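StringIndexer encodes a string column of labels into a column of label indices. The indices are in [0, numLabels) and are ordered by label frequency, so the most frequent label gets index 0. Unseen labels are put at index numLabels if the user chooses to keep them. If the input column is numeric, it is cast to string and the string values are indexed. When downstream pipeline components such as an Estimator or Transformer make use of this string-indexed label, you must set the component's input column to the name of this string-indexed column. In many cases, you can set the input column with setInputCol.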

Example 1: suppose we have the following DataFrame with the columns id and category:

id	category
0	a
1	b
2	c
3	a
4	a
5	c

Applying StringIndexer to this DataFrame, with category as the input column and categoryIndex as the output column, gives the following values:

id	category	categoryIndex
0	a	0.0
1	b	2.0
2	c	1.0
3	a	0.0
4	a	0.0
5	c	1.0

The label a gets index 0 because it is the most frequent, followed by c with index 1 and b with index 2.
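The frequency ordering can be illustrated without Spark. The following is a minimal plain-Scala sketch of the idea, not StringIndexer's actual implementation. The helper name frequencyIndex is hypothetical, and breaking ties alphabetically is an assumption made here only to keep the result deterministic.

```scala
// Plain-Scala model of frequency-based label indexing (illustrative only).
// `frequencyIndex` is a hypothetical helper, not part of Spark.
def frequencyIndex(labels: Seq[String]): Map[String, Double] = {
  // Count how often each label occurs.
  val counts: Map[String, Int] =
    labels.groupBy(identity).map { case (label, occurrences) => label -> occurrences.size }

  // Most frequent label first; ties broken alphabetically (an assumption of this sketch).
  val ordered: Seq[String] =
    counts.toSeq.sortBy { case (label, count) => (-count, label) }.map(_._1)

  // The position in the ordering becomes the index, stored as a Double
  // to mirror the type of the categoryIndex column.
  ordered.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap
}

val categories = Seq("a", "b", "c", "a", "a", "c")
val index = frequencyIndex(categories)

// Print the labels in index order.
index.toSeq.sortBy(_._2).foreach { case (label, i) => println(s"$label -> $i") }
```

For the six rows above, a occurs three times, c twice and b once, so the sketch assigns them the indices 0.0, 1.0 and 2.0, matching the table.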

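In addition, StringIndexer offers three strategies for handling unseen labels, that is, labels that appear when transforming a dataset but were not present when the StringIndexer was fitted:

1. Throw an exception. This is the default behavior.

2. Skip the rows that contain an unseen label.

3. Keep the unseen labels and put them in a special additional bucket at index numLabels.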

Returning to the previous example, this time we reuse the StringIndexer fitted above on the following dataset:

id	category
0	a
1	b
2	c
3	d
4	e

If you have not set how StringIndexer handles unseen labels, or have set it to "error", an exception is thrown. If you call setHandleInvalid("skip") instead, you get the following result:

id	category	categoryIndex
0	a	0.0
1	b	2.0
2	c	1.0

Note that the rows containing d and e do not appear.

If you call setHandleInvalid("keep"), you get the following result:

id	category	categoryIndex
0	a	0.0
1	b	2.0
2	c	1.0
3	d	3.0
4	e	3.0

Note that d and e are both mapped to index 3.0, which equals numLabels.
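The "skip" and "keep" strategies can be tried out as follows. This is a minimal sketch, assuming Spark 2.2 or later (where the "keep" option is available) and a local SparkSession. The names train, unseen and fitIndexer are illustrative and do not come from the Spark documentation.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("handle-invalid-sketch")
  .getOrCreate()

// Data used for fitting: only the labels a, b and c are seen.
val train = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// Data to transform: d and e were not seen during fitting.
val unseen = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"))
).toDF("id", "category")

// Hypothetical helper: fit an indexer on `train` with the given handleInvalid policy.
def fitIndexer(policy: String): StringIndexerModel =
  new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .setHandleInvalid(policy)
    .fit(train)

// "skip": rows whose label was not seen during fitting are dropped.
val skipped = fitIndexer("skip").transform(unseen)
skipped.show()

// "keep": unseen labels are mapped to the extra index numLabels.
val kept = fitIndexer("keep").transform(unseen)
kept.show()
```

Fitting a separate model per policy keeps the two results independent of each other. With the default policy, "error", the same transform would fail once Spark reaches a row containing d or e.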

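The code example is as follows:

import org.apache.spark.ml.feature.StringIndexer

// Assumes `spark` is an existing SparkSession, as in spark-shell.
val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// Index the string column "category" into the numeric column "categoryIndex".
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// fit() learns the label ordering; transform() appends the index column.
val indexed = indexer.fit(df).transform(df)
indexed.show()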

IndexToString

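Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices, and then retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.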

For example, suppose we have the following DataFrame:

id	categoryIndex
0	0.0
1	2.0
2	1.0
3	0.0
4	0.0
5	1.0

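Applying IndexToString with categoryIndex as the input column and originalCategory as the output column retrieves the original labels, which are inferred from the column's metadata:

id	originalCategory	categoryIndex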
0	a	0.0
1	b	2.0
2	c	1.0
3	a	0.0
4	a	0.0
5	c	1.0

The code example is as follows:

import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Assumes `spark` is an existing SparkSession, as in spark-shell.
val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

// Fit the indexer so that the label ordering is known.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

println(s"Transformed string column '${indexer.getInputCol}' " +
  s"to indexed column '${indexer.getOutputCol}'")
indexed.show()

// The labels are stored in the metadata of the indexed output column.
val inputColSchema = indexed.schema(indexer.getOutputCol)
println(s"StringIndexer will store labels in output column metadata: " +
  s"${Attribute.fromStructField(inputColSchema).toString}")

// Map the indices back to strings using the labels found in that metadata.
val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
val converted = converter.transform(indexed)

println(s"Transformed indexed column '${converter.getInputCol}' back to original string " +
  s"column '${converter.getOutputCol}' using labels in metadata")
converted.select("id", "categoryIndex", "originalCategory").show()
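The remark that you are free to supply your own labels can be sketched as follows. This is a minimal example, assuming a local SparkSession and an index column that carries no StringIndexer metadata. The names indices, labelConverter and restored are illustrative.

```scala
import org.apache.spark.ml.feature.IndexToString
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("index-to-string-labels-sketch")
  .getOrCreate()

// An index column built by hand, so it has no label metadata attached.
val indices = spark.createDataFrame(
  Seq((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0))
).toDF("id", "categoryIndex")

// Labels supplied explicitly: the string at position i is the label for index i.
val labelConverter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
  .setLabels(Array("a", "c", "b"))

val restored = labelConverter.transform(indices)
restored.show()
```

Because the labels array is given in index order, index 0.0 maps to a, 1.0 to c and 2.0 to b, the same mapping StringIndexer produced in the earlier example.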

This article is mainly translated and adapted from the official Spark documentation.