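Understanding and using Python's coo_matrix
1. Understanding and usage
First, the data is in ffm format (primary key:secondary key:value, where the value is 1 for discrete features), as shown below. The first column is the label; the remaining columns are x (the feature values).
For example, 2:3:1 means column 2 of the source data, index 3, value 1.
Source data test.txt (column 8 is a continuous feature that has not been discretized; all other columns are discrete features):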
1 2:3:1 3:5:1 5:7:1 7:10:1 8:14:1.2
0 1:1:1 2:4:1 6:9:1 7:10:1 8:14:2.3
1 2:3:1 3:5:1 7:11:1 8:14:1.5
1 1:2:1 5:7:1 7:12:1 8:14:2.2 9:15:1
0 3:6:1 5:8:1 7:13:1 9:16:1
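To make the format concrete, here is a minimal sketch that splits one line of test.txt into its label, primary keys, secondary keys and values. The helper name parse_ffm_line is hypothetical and is not part of the original code.

```python
def parse_ffm_line(line):
    """Split one ffm-format line into (label, fields, indices, values)."""
    tokens = line.strip().split()
    label = float(tokens[0])          # first column is the label
    fields, indices, values = [], [], []
    for token in tokens[1:]:
        # each feature is written as primary_key:secondary_key:value
        field, index, value = token.split(":")
        fields.append(int(field))
        indices.append(int(index))
        values.append(float(value))
    return label, fields, indices, values


# first line of test.txt
label, fields, indices, values = parse_ffm_line("1 2:3:1 3:5:1 5:7:1 7:10:1 8:14:1.2")
print(label, fields, indices, values)
```

Only the secondary keys (indices) and the values are needed to build the sparse matrix; the primary keys identify which source column each feature came from.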
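import numpy as np
from scipy.sparse import coo_matrix

def libsvm_2_coo(libsvm_data, shape):
    # libsvm_data yields one (column indices, values) pair per sample
    coo_rows = []
    coo_cols = []
    coo_data = []
    n = 0
    for x, d in libsvm_data:
        # repeat the row number once for every nonzero entry of this sample
        coo_rows.extend([n] * len(x))
        coo_cols.extend(x)
        coo_data.extend(d)
        n += 1
    coo_rows = np.array(coo_rows)
    coo_cols = np.array(coo_cols)
    coo_data = np.array(coo_data)
    # coo_rows: the row number n, starting from 0
    # coo_cols: the secondary keys [3 5 7 10 14 1 4 9 10 14 3 5 11 14 2 7 12 14 15 6 8 13 16]
    # coo_data: the values (1 for discrete features)
    return coo_matrix((coo_data, (coo_rows, coo_cols)), shape=shape)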
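# data = coo_matrix((coo_data, (coo_rows, coo_cols)), shape=shape)
Viewed as a dense array, the result is as follows. Indexing starts at row 0 and column 0, and the source data has no column 0, so column 0 is filled with zeros: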
[[0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1.2 0. 0.]
[0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 2.3 0. 0.]
[0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1.5 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 2.2 1. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1.]]
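As a small illustration of how the three arrays map to matrix entries, the sketch below builds a 2 x 17 coo_matrix from four hand-picked (row, column, value) triplets taken from the first two samples. The variable names are chosen for illustration only.

```python
import numpy as np
from scipy.sparse import coo_matrix

# entry k of the matrix is vals[k], stored at position (rows[k], cols[k])
rows = np.array([0, 0, 1, 1])
cols = np.array([3, 14, 1, 14])
vals = np.array([1.0, 1.2, 1.0, 2.3])

m = coo_matrix((vals, (rows, cols)), shape=(2, 17))
dense = m.toarray()   # every position not listed above is 0
print(dense)
```

Because no triplet uses column 0, that column stays all zeros in the dense view.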
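data.tocsr() gives the result below. A csr_matrix uses roughly 70% of the memory of the equivalent coo_matrix (the exact ratio depends on the dtypes and on the number of nonzeros per row), so we convert to csr_matrix: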
(0, 3) 1.0
(0, 5) 1.0
(0, 7) 1.0
(0, 10) 1.0
(0, 14) 1.2
(1, 1) 1.0
(1, 4) 1.0
(1, 9) 1.0
(1, 10) 1.0
(1, 14) 2.3
(2, 3) 1.0
(2, 5) 1.0
(2, 11) 1.0
(2, 14) 1.5
(3, 2) 1.0
(3, 7) 1.0
(3, 12) 1.0
(3, 14) 2.2
(3, 15) 1.0
(4, 6) 1.0
(4, 8) 1.0
(4, 13) 1.0
(4, 16) 1.0
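The memory difference can be checked by adding up the bytes of the arrays each format stores. The following is a rough sketch on a synthetic matrix (1000 rows with 5 nonzeros each); the helper names coo_nbytes and csr_nbytes are hypothetical, and the ratio you observe will vary with the data.

```python
import numpy as np
from scipy.sparse import coo_matrix


def coo_nbytes(m):
    # COO stores a value, a row index and a column index per nonzero
    return m.data.nbytes + m.row.nbytes + m.col.nbytes


def csr_nbytes(m):
    # CSR stores a value and a column index per nonzero, plus one pointer per row
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


n_rows, per_row, n_cols = 1000, 5, 17
rows = np.repeat(np.arange(n_rows), per_row)   # 0,0,0,0,0,1,1,...
cols = np.tile(np.arange(per_row), n_rows)     # 0,1,2,3,4,0,1,...
vals = np.ones(n_rows * per_row)

coo = coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))
csr = coo.tocsr()
ratio = csr_nbytes(csr) / coo_nbytes(coo)
print("CSR / COO memory ratio:", ratio)
```

The saving comes from replacing the per-nonzero row index with a per-row pointer, so it grows as rows hold more nonzeros.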
For reference, the function that reads the data:
INPUT_DIM = 17  # number of columns for test.txt (indices 0~16)

def read_data(file_name):
    X = []
    D = []
    y = []
    with open(file_name) as file:
        fin = file.readlines()
    for line in fin:
        X_i = []
        D_i = []
        line = line.strip().split()
        y_i = float(line[0])  # the first column is the label
        for x in line[1:]:
            # Just get categorical features
            # if x.split(':')[2] == '1.0':
            X_i.append(int(x.split(':')[1]))
            D_i.append(float(x.split(':')[2]))
        y.append(y_i)
        X.append(X_i)
        D.append(D_i)
    y = np.reshape(np.array(y), [-1])
    X = libsvm_2_coo(zip(X, D), (len(X), INPUT_DIM)).tocsr()
    return X, y

# X, y = read_data("test.txt")
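2. Problems encountered in use:
column index exceeds matrix dimensions
Solution: every column index (coo_cols above) must be strictly smaller than the number of columns given in the shape argument of coo_matrix((coo_data, (coo_rows, coo_cols)), shape=shape), which is INPUT_DIM above.
For example, the largest value in coo_cols here is 16, so INPUT_DIM must be at least 17 (columns 0~16, 17 in total). A value greater than 17 only appends all-zero columns, which does no harm but serves no purpose.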
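The error and its fix can be reproduced with a minimal sketch like the one below. It assumes the error is raised as a ValueError, and the exact message text may differ between SciPy versions. The variable input_dim stands in for INPUT_DIM.

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1])
cols = np.array([3, 16])          # largest column index is 16
vals = np.array([1.0, 1.0])

# 16 columns only cover indices 0..15, so index 16 is out of range
try:
    coo_matrix((vals, (rows, cols)), shape=(2, 16))
    raised = False
except ValueError as err:
    raised = True
    print("shape too small:", err)

# the smallest valid column count is the largest index plus one
input_dim = int(cols.max()) + 1
ok = coo_matrix((vals, (rows, cols)), shape=(2, input_dim))
print(ok.shape)
```

Deriving the column count from the data in this way avoids the error, although a fixed INPUT_DIM is still needed when training and test sets must share the same width.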