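The scipy.sparse sparse matrix library
Why sparse matrices matter in scientific computing with Python
A matrix in which zero elements far outnumber non-zero elements, and in which the non-zero elements follow no regular pattern, is called a sparse matrix.
Because a sparse matrix has few non-zero elements and many zeros, it can be stored in compressed form by keeping only the non-zero elements. For a sparse matrix A of size m x n stored as a two-dimensional array, if each element takes L bytes, the whole matrix takes m*n*L bytes. Most of that space holds zeros and is wasted. Storing only the non-zero elements greatly reduces the storage required.
In addition, for a matrix with many zero elements, storing only the non-zeros makes matrix operations more efficient: computations touch only the non-zero elements, so they are often faster. This is one of the main advantages of sparse matrices.
Python and NumPy do not create sparse matrices automatically, so they have to be built explicitly with the constructors in scipy.sparse.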
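The m*n*L estimate above can be checked with a small sketch. The 1000 x 1000 diagonal matrix is a made-up example, and the CSR format is chosen here only for illustration; the exact sparse byte count depends on the index type SciPy picks.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical example: a 1000 x 1000 matrix with 1000 non-zeros on the diagonal.
m = n = 1000
dense = np.zeros((m, n), dtype=np.float64)
dense[np.arange(m), np.arange(n)] = 1.0

sparse = csr_matrix(dense)

# Dense storage costs m * n * L bytes (L = 8 for float64).
dense_bytes = dense.nbytes
# CSR stores the non-zero values plus two index arrays.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(dense_bytes, sparse_bytes)
```

For this matrix the sparse representation is smaller by more than two orders of magnitude.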
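Common storage formats for sparse matrices
There are many ways to store a sparse matrix, but most use the same basic technique: store all the non-zero elements of the matrix in a linear array, and provide auxiliary arrays that describe where those non-zero elements sit in the original matrix.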
1. Coordinate Format (COO)
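The coordinate format is the simplest sparse format. It uses three arrays of equal length, values, rows, and columns (called data, row, and col in scipy.sparse), to describe the non-zero elements. Its main advantages are flexibility and simplicity: only the non-zero elements and their coordinates are stored. COO does not directly support element access, insertion, deletion, or slicing, so once a COO matrix has been built it is usually converted to another format before further computation.
- values: real or complex data holding the non-zero elements of the matrix, in any order.
- rows: the row index of each element.
- columns: the column index of each element.
Parameter: nnz, the number of non-zero elements in the matrix; all three arrays have length nnz.
2. Diagonal Storage Format (DIA)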
If the non-zero elements of a sparse matrix lie on a small number of diagonals, so that most diagonals contain only zeros, the diagonal storage format (DIA) reduces the amount of information needed to locate the non-zero elements. This format is particularly useful for matrices that arise from finite element or finite difference discretizations.
The Intel MKL diagonal storage format is specified by two arrays, values and distance, and two parameters: ndiag, the number of non-empty diagonals, and lval, the declared leading dimension in the calling (sub)programs.
- values
A real or complex two-dimensional array dimensioned lval by ndiag. Each column contains the non-zero elements of one diagonal of A. The key point of the storage is that each element in values retains the row number it had in the original matrix. To achieve this, diagonals in the lower triangular part of the matrix are padded from the top, and those in the upper triangular part are padded from the bottom. The magnitude of distance(i) is the number of elements to be padded for diagonal i.
- distance
An integer array of dimension ndiag. Element i of distance is the distance between the i-th diagonal and the main diagonal. The distance is positive if the diagonal is above the main diagonal, negative if it is below, and zero for the main diagonal.
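As an illustration, here is a minimal sketch of diagonal storage using scipy.sparse.dia_matrix. Note that SciPy's layout is not the same as the MKL one described above: SciPy keeps each diagonal as a row of the data array, aligned by column index, and calls the distance array offsets. The matrix values are made up.

```python
import numpy as np
from scipy.sparse import dia_matrix

# Three diagonals of a 4 x 4 tridiagonal matrix (values are made up).
offsets = np.array([-1, 0, 1])
data = np.array([
    [1, 2, 3, 0],    # sub-diagonal (offset -1); the last slot is unused
    [4, 5, 6, 7],    # main diagonal (offset 0)
    [0, 8, 9, 10],   # super-diagonal (offset +1); the first slot is unused
])

A = dia_matrix((data, offsets), shape=(4, 4))
dense = A.toarray()
print(dense)
```

Entry data[k, j] ends up in column j of the diagonal with offset offsets[k], which is why the padding sits at the end of the sub-diagonal row and at the start of the super-diagonal row.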
3. Compressed Sparse Row Format (CSR)
The Intel MKL compressed sparse row (CSR) format is specified by four arrays: values, columns, pointerB, and pointerE. They describe the values, row positions, and column positions of the non-zero elements of a sparse matrix A.
- values
A real or complex array that contains the non-zero elements of A, stored row by row (row-major order).
- columns
Element i of the integer array columns is the number of the column in A that contains the i-th value in the values array.
- pointerB
Element j of this integer array gives the index of the element in the values array that is the first non-zero element in row j of A. Note that this index is equal to pointerB(j) - pointerB(1) + 1.
- pointerE
An integer array such that pointerE(j) - pointerB(1) is the index of the element in the values array that is the last non-zero element in row j of A.
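The four arrays can be related to SciPy's CSR attributes with a short sketch. The mapping below is an interpretation, not something stated in the source: SciPy stores a single zero-based indptr array, whose first m entries play the role of pointerB and whose last m entries play the role of pointerE. The matrix is a made-up example.

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 6],
]))

values = A.data             # non-zero values, row by row
columns = A.indices         # column index of each value
pointerB = A.indptr[:-1]    # start of each row in values (zero-based)
pointerE = A.indptr[1:]     # one past the end of each row in values

# Walk row 2 using the pointers.
row = 2
row_values = values[pointerB[row]:pointerE[row]]
row_columns = columns[pointerB[row]:pointerE[row]]
print(row_values, row_columns)
```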
4. Compressed Sparse Column Format (CSC)
The compressed sparse column format (CSC) is similar to the CSR format, but compresses along columns instead of rows. In other words, the CSC format of a matrix A is identical to the CSR format of the transpose of A.
The CSC format is specified by four arrays: values, rows, pointerB, and pointerE, whose meanings mirror those of CSR.
- values
A real or complex array that contains the non-zero elements of A, stored column by column (column-major order).
- rows
Element i of the integer array rows is the number of the row in A that contains the i-th value in the values array.
- pointerB
Element j of this integer array gives the index of the element in the values array that is the first non-zero element in column j of A. Note that this index is equal to pointerB(j) - pointerB(1) + 1.
- pointerE
An integer array such that pointerE(j) - pointerB(1) is the index of the element in the values array that is the last non-zero element in column j of A.
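The statement that CSC of A equals CSR of the transpose of A can be checked directly in SciPy. This is a small sketch on a made-up matrix; it compares the three underlying arrays of the two representations.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

dense = np.array([
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 6],
])
csc = csc_matrix(dense)
csr_of_transpose = csr_matrix(dense.T)

# The value, index, and pointer arrays should coincide.
same_values = np.array_equal(csc.data, csr_of_transpose.data)
same_indices = np.array_equal(csc.indices, csr_of_transpose.indices)
same_pointers = np.array_equal(csc.indptr, csr_of_transpose.indptr)
print(same_values, same_indices, same_pointers)
```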
5. Skyline Storage Format
The skyline storage format is important for the direct sparse solvers, and it is well suited for Cholesky or LU decomposition when no pivoting is required.
The skyline storage format accepted in Intel MKL can store only a triangular matrix or the triangular part of a matrix. This format is specified by two arrays: values and pointers.
- values
A scalar array. For a lower triangular matrix it contains the set of elements from each row of the matrix starting from the first non-zero element to and including the diagonal element. For an upper triangular matrix it contains the set of elements from each column of the matrix starting with the first non-zero element down to and including the diagonal element. Encountered zero elements are included in the sets.
- pointers
An integer array with dimension (m+1), where m is the number of rows for the lower triangle (columns for the upper triangle). pointers(i) - pointers(1) + 1 gives the index of the element in values that is the first non-zero element in row (column) i. The value of pointers(m+1) is set to nnz + pointers(1), where nnz is the number of elements in the array values.
6. Block Compressed Sparse Row Format (BSR)
Suppose an original matrix A is partitioned into 2 x 2 blocks (block_size = 2), giving a block matrix E. The zero-based BSR representation is then:
values = (1 0 2 1 6 7 8 2 1 4 5 1 4 3 0 0 7 2 0 0)
columns = (0 1 1 1 2)
pointerB = (0 2 3)
pointerE = (2 3 5)
The Intel MKL block compressed sparse row (BSR) format is specified by four arrays: values, columns, pointerB, and pointerE.
- values
A real array that contains the elements of the non-zero blocks of the sparse matrix. The elements are stored block by block in row-major order. A non-zero block is a block that contains at least one non-zero element. All elements of non-zero blocks are stored, even if some of them are equal to zero. Within each non-zero block, elements are stored in column-major order with one-based indexing and in row-major order with zero-based indexing.
- columns
Element i of the integer array columns is the number of the column in the block matrix E that contains the i-th non-zero block.
- pointerB
Element j of this integer array gives the index of the element in the columns array that is the first non-zero block in row j of the block matrix.
- pointerE
Element j of this integer array gives the index of the element in the columns array that contains the last non-zero block in row j of the block matrix, plus 1.
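The zero-based arrays from the example can be fed to scipy.sparse.bsr_matrix to recover the dense matrix. This is a sketch under two assumptions: the 20 values form five 2 x 2 blocks stored row-major inside each block (as the zero-based convention above says), and the matrix is 6 x 6, which is what three block rows and block column indices up to 2 imply.

```python
import numpy as np
from scipy.sparse import bsr_matrix

values = np.array([1, 0, 2, 1, 6, 7, 8, 2, 1, 4,
                   5, 1, 4, 3, 0, 0, 7, 2, 0, 0])
blocks = values.reshape(5, 2, 2)      # five 2 x 2 blocks, row-major inside each block
columns = np.array([0, 1, 1, 1, 2])
pointerB = np.array([0, 2, 3])
pointerE = np.array([2, 3, 5])

# SciPy merges pointerB and pointerE into a single indptr array.
indptr = np.append(pointerB, pointerE[-1])

A = bsr_matrix((blocks, columns, indptr), shape=(6, 6))
dense = A.toarray()
print(dense)
```

The explicit zeros inside the last two blocks are stored, which is the price BSR pays for keeping whole blocks.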
7. ELLPACK (ELL)
8. Hybrid (HYB)
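A combination of the ELL and COO formats.
dok_matrix
A dictionary-of-keys based sparse matrix.
lil_matrix
A row-based linked-list sparse matrix, suited to building up the sparsity structure incrementally.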
Pros and cons of the different sparse formats, and usage notes
The storage formats map to the following classes in the sparse module:
bsr_matrix(arg1[, shape, dtype, copy, blocksize]) Block Sparse Row matrix
coo_matrix(arg1[, shape, dtype, copy]) A sparse matrix in COOrdinate format.
csc_matrix(arg1[, shape, dtype, copy]) Compressed Sparse Column matrix
csr_matrix(arg1[, shape, dtype, copy]) Compressed Sparse Row matrix
dia_matrix(arg1[, shape, dtype, copy]) Sparse matrix with DIAgonal storage
dok_matrix(arg1[, shape, dtype, copy]) Dictionary Of Keys based sparse matrix.
lil_matrix(arg1[, shape, dtype, copy]) Row-based linked list sparse matrix
Overview of the scipy sparse formats and their trade-offs
The scipy.sparse library provides several formats for representing sparse matrices, each suited to different uses. Sparse matrices can also be used in arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power.
bsr_matrix(arg1, shape=None, dtype=None, copy=False, blocksize=None) Block Sparse Row matrix:
The Block Sparse Row (BSR) format is very similar to the Compressed Sparse Row (CSR) format, but BSR is better suited to sparse matrices with dense sub-matrices. Block matrices often arise in vector-valued finite element discretizations. In such cases, BSR is considerably more efficient than CSR and CSC for many sparse arithmetic operations.
csc_matrix(arg1, shape=None, dtype=None, copy=False) Compressed Sparse Column matrix (CSC):
Advantages of the CSC format
•efficient arithmetic operations CSC + CSC, CSC * CSC, etc.
•efficient column slicing
•fast matrix vector products (CSR, BSR may be faster!)
Disadvantages of the CSC format
•slow row slicing operations (consider CSR)
•changes to the sparsity structure are expensive (consider LIL or DOK)
csr_matrix(arg1, shape=None, dtype=None, copy=False) Compressed Sparse Row matrix (CSR):
Advantages of the CSR format
•efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
•efficient row slicing
•fast matrix vector products
Disadvantages of the CSR format
•slow column slicing operations (consider CSC)
•changes to the sparsity structure are expensive (consider LIL or DOK)
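A short sketch of the operations each format favours: row slicing on CSR, column slicing on CSC, and arithmetic between two matrices of the same format. The matrix is made up, and the sketch shows only which calls to use, not their relative speed.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

dense = np.array([
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 6],
])
csr = csr_matrix(dense)
csc = csc_matrix(dense)

row_1 = csr[1, :]        # row slice: the natural access pattern for CSR
col_2 = csc[:, 2]        # column slice: the natural access pattern for CSC

total = csr + csr        # element-wise sum, result stays sparse
product = csr.dot(csr)   # matrix product, result stays sparse
print(row_1.toarray(), col_2.toarray())
```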
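coo_matrix(arg1, shape=None, dtype=None, copy=False) Coordinate format (COO):
A sparse matrix in coordinate form. It uses three arrays of equal length, row, col, and data, to store the non-zero elements: row holds each element's row index, col its column index, and data its value.
coo_matrix does not support element access, insertion, or deletion. Once created, it is normally converted to another format before indexing or heavy computation.
Advantages of the COO format
•facilitates fast conversion among sparse formats
•permits duplicate entries (see example)
•very fast conversion to and from CSR/CSC formats
Disadvantages of the COO format
•does not directly support:
–arithmetic operations
–slicing
The COO format is commonly used for reading and writing sparse matrices in files; Matrix Market, for example, uses COO.
Most commonly used methods:
tocsc() | Return a copy of this matrix in Compressed Sparse Column format |
tocsr() | Return a copy of this matrix in Compressed Sparse Row format |
todense([order, out]) | Return a dense matrix representation of this matrix |
Many sparse data sets are stored in files in this form. For example, a CSV file might have three columns: "user ID, item ID, rating". After reading the data with numpy.loadtxt or pandas.read_csv, coo_matrix quickly turns it into a sparse matrix in which each row corresponds to a user, each column to an item, and each element is the user's rating of the item.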
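The CSV workflow can be sketched as follows. The file content is hypothetical and is read from an in-memory string so the example is self-contained; the IDs are assumed to be zero-based integers that can be used directly as row and column indices.

```python
import io
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical CSV content: user ID, item ID, rating (zero-based IDs).
csv_text = """0,0,5
0,2,3
1,1,4
3,2,1
"""
triples = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int)
users, items, ratings = triples[:, 0], triples[:, 1], triples[:, 2]

n_users = int(users.max()) + 1
n_items = int(items.max()) + 1
ratings_coo = coo_matrix((ratings, (users, items)), shape=(n_users, n_items))

# Convert to CSR before slicing or arithmetic.
ratings_csr = ratings_coo.tocsr()
user_0 = ratings_csr[0].toarray()
print(ratings_coo.toarray())
```

With real data the IDs usually need to be mapped to consecutive integers first.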
dia_matrix(arg1, shape=None, dtype=None, copy=False) Sparse matrix with DIAgonal storage
dok_matrix(arg1, shape=None, dtype=None, copy=False) Dictionary Of Keys based sparse matrix.
dok_matrix is built on a Python dict and stores only the non-zero elements: each key is a (row, column) tuple, and the corresponding value is the element at that position. The dictionary format is therefore well suited to adding, deleting, and accessing individual elements. It is typically used to add non-zero elements gradually and is then converted to a format that supports fast arithmetic.
This is an efficient structure for constructing sparse matrices incrementally. It allows efficient O(1) access to individual elements. Duplicates are not allowed. Once constructed, it can be converted efficiently to a coo_matrix.
lil_matrix(arg1, shape=None, dtype=None, copy=False) Row-based linked list sparse matrix
This is an efficient structure for constructing sparse matrices incrementally.
lil_matrix stores the non-zero elements in two lists of lists: data holds the non-zero values of each row, and rows holds the column indices of those values. This format is also well suited to adding elements one at a time, and it gives fast access to row data.
Advantages of the LIL format
•supports flexible slicing
•changes to the matrix sparsity structure are efficient
Disadvantages of the LIL format
•arithmetic operations LIL + LIL are slow (consider CSR or CSC)
•slow column slicing (consider CSC)
•slow matrix vector products (consider CSR or CSC)
Intended Usage
•LIL is a convenient format for constructing sparse matrices
•once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
•consider using the COO format when constructing large matrices
Note: dok_matrix and lil_matrix are suited to adding elements incrementally.
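A minimal sketch of incremental construction with both formats, followed by conversion to CSR for computation. Shapes and values are made up.

```python
from scipy.sparse import dok_matrix, lil_matrix

# Build a 3 x 4 matrix one element at a time with DOK.
dok = dok_matrix((3, 4), dtype=float)
dok[0, 1] = 2.0
dok[2, 3] = 5.0
dok[0, 1] = 7.0          # overwrites the earlier value; no duplicates are kept

# Build a matrix of the same shape with LIL.
lil = lil_matrix((3, 4), dtype=float)
lil[0, 1] = 7.0
lil[1, 0] = 1.0
lil[1, 3] = 3.0

# Convert to CSR once construction is finished.
dok_csr = dok.tocsr()
lil_csr = lil.tocsr()
print(dok_csr.toarray())
print(lil.rows[1], lil.data[1])   # column indices and values stored for row 1
```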
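Summary
2. Compared with DIA and ELL, the COO and CSR formats are more flexible and easier to work with;
3. ELL is fast, while COO is flexible; the HYB format, which combines the two, is a good sparse matrix representation;
4. According to Nathan Bell's work:
In the CSR format, the average number of bytes used per non-zero entry (Bytes per Nonzero Entry) is the most stable (about 8.5 for float and about 12.5 for double),
while for the DIA format it depends strongly on the type of matrix. DIA suits sparse matrices from structured meshes (about 4.05 for float and 8.10 for double).
For unstructured meshes and random matrices, DIA uses more than ten times as many bytes as CSR;
5. In some linear algebra libraries, the COO format is commonly used for reading and writing sparse matrices in files (Matrix Market uses COO, for example), while CSR is commonly used for sparse computation once the data has been read.
[scipy-ref-0.14.0 - Sparse matrices (scipy.sparse)]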
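Constructor signatures, parameters, attributes, and methods of scipy.sparse matrices
Constructor parameters:
arg1: a dense matrix, another sparse matrix, or a tuple of arrays such as (data, (row, col));
shape=(M, N): the shape of the sparse matrix to create; if omitted, it is inferred from the index arrays;
dtype: the element type of the sparse matrix; defaults to 'd' when an empty matrix is created;
copy: bool, whether to copy the input data; default False.
BSR has one extra parameter, blocksize: the block size, which must evenly divide the matrix shape (M, N). If it is not specified, a heuristic is used to pick a suitable block size.
The storage formats map to the following constructors in the sparse module:
Coordinate format (COO):
coo_matrix(arg1[, shape, dtype, copy])
Diagonal storage format (DIA):
dia_matrix(arg1[, shape, dtype, copy])
Compressed sparse row format (CSR):
csr_matrix(arg1[, shape, dtype, copy])
Compressed sparse column format (CSC):
csc_matrix(arg1[, shape, dtype, copy])
Block compressed sparse row format (BSR):
bsr_matrix(arg1[, shape, dtype, copy, blocksize])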
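The constructor parameters can be exercised with a small sketch. The matrices are made up; csr_matrix and bsr_matrix are used only as representatives of the constructors listed above.

```python
import numpy as np
from scipy.sparse import csr_matrix, bsr_matrix

# Empty 3 x 4 matrix: only a shape is given, so dtype falls back to the default 'd'.
empty = csr_matrix((3, 4))

# From a dense array, with an explicit dtype.
dense = np.array([[1, 0], [0, 2]])
as_float32 = csr_matrix(dense, dtype=np.float32)

# From another sparse matrix, asking for an independent copy.
copied = csr_matrix(as_float32, copy=True)
copied[0, 0] = 9          # changes the copy only

# BSR with an explicit block size; (2, 2) divides the (4, 4) shape evenly.
blocky = bsr_matrix(np.kron(np.eye(2), np.ones((2, 2))), blocksize=(2, 2))
print(empty.dtype, as_float32.dtype, blocky.blocksize)
```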
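Common attributes of sparse matrices
dtype data type of the matrix
shape (2-tuple) shape of the matrix
ndim (int) number of dimensions
nnz number of stored non-zero elements
data data array of the matrix
row COO only, row indices of the matrix
col COO only, column indices of the matrix
has_sorted_indices CSR, CSC, and BSR: whether the indices are sorted
indices CSR, CSC, and BSR: index array of the format
indptr CSR, CSC, and BSR: index pointer array of the format
blocksize BSR only, block size of the matrix
and so on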
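Common methods of sparse matrices
asformat(format) return the matrix in the given sparse format
astype(t) return the matrix with the given element type
diagonal() return the main diagonal of the matrix
dot(other) ordinary dot product
getcol(j) return a copy of column j of the matrix, as an (m x 1) sparse matrix (column vector)
getrow(i) return a copy of row i of the matrix, as a (1 x n) sparse matrix (row vector)
max([axis]) maximum element of the matrix along the given axis
nonzero() indices of the non-zero elements
todense([order, out]) return a dense matrix representation of the sparse matrix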
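The attributes and methods listed above can be tried on a small CSR matrix. This is an illustrative sketch on made-up data; it uses the *_matrix classes, on which getrow and getcol are available.

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 6],
]))

# Attributes
info = (A.shape, A.ndim, A.nnz)

# Methods
as_coo = A.asformat("coo")          # same matrix, COO storage
as_float = A.astype(np.float64)     # same matrix, different element type
diag = A.diagonal()                 # main diagonal as a 1-D array
Av = A.dot(np.array([1, 1, 1]))     # sparse matrix times dense vector
col_1 = A.getcol(1)                 # (3 x 1) sparse column
row_2 = A.getrow(2)                 # (1 x 3) sparse row
largest = A.max()
rows, cols = A.nonzero()            # indices of the non-zero entries
print(info, diag, Av, largest)
```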
Functions in the sparse module for building sparse matrices
eye(m[, n, k, dtype, format])
Sparse matrix with ones on diagonal
identity(n[, dtype, format])
Identity matrix in sparse format
kron(A, B[, format])
kronecker product of sparse matrices A and B
kronsum(A, B[, format])
kronecker sum of sparse matrices A and B
diags(diagonals[, offsets, shape, format, dtype])
Construct a sparse matrix from diagonals.
spdiags(data, diags, m, n[, format])
Return a sparse matrix from diagonals.
block_diag(mats[, format, dtype])
Build a block diagonal sparse matrix from provided matrices.
tril(A[, k, format])
Return the lower triangular portion of a matrix in sparse format
triu(A[, k, format])
Return the upper triangular portion of a matrix in sparse format
bmat(blocks[, format, dtype])
Build a sparse matrix from sparse sub-blocks
hstack(blocks[, format, dtype])
Stack sparse matrices horizontally (column wise)
vstack(blocks[, format, dtype])
Stack sparse matrices vertically (row wise)
rand(m, n[, density, format, dtype, ...])
Generate a sparse matrix of the given shape and density with uniformly distributed values.
random(m, n[, density, format, dtype, ...])
Generate a sparse matrix of the given shape and density with randomly distributed values.
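A few of the construction functions above, sketched on small made-up inputs. The random generators are left out so that the results are deterministic.

```python
import numpy as np
from scipy.sparse import eye, identity, diags, kron, block_diag, hstack, vstack, tril

I3 = identity(3, format="csr")
shifted = eye(3, 3, k=1, format="csr")           # ones on the first super-diagonal

# Tridiagonal matrix built from three diagonals.
T = diags([[-1, -1], [2, 2, 2], [-1, -1]], offsets=[-1, 0, 1], format="csr")

K = kron(I3, np.array([[1, 2]]), format="csr")   # shape (3, 6)
B = block_diag([I3, T], format="csr")            # shape (6, 6)
H = hstack([I3, T], format="csr")                # shape (3, 6)
V = vstack([I3, T], format="csr")                # shape (6, 3)
L = tril(T, format="csr")                        # lower triangle of T
print(T.toarray())
```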
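Working with sparse matrices
Creating and inspecting sparse matrices
Using coo_matrix as an example:
1. Convert a dense matrix directly into a sparse matrix:
from numpy import array
from scipy.sparse import coo_matrix
A = coo_matrix([[1,2],[3,4]])
print(A)
(0, 0) 1
(0, 1) 2
(1, 0) 3
(1, 1) 4
2. Build the matrix from the arrays that the storage format requires:
row = array([0,0,0,0,1,3,1])
col = array([0,0,0,2,1,3,1])
data = array([1,1,1,8,1,1,1])
matrix = coo_matrix((data, (row,col)), shape=(4,4))
print(matrix)
print(matrix.todense())
(0, 0) 1
(0, 0) 1
(0, 0) 1
(0, 2) 8
(1, 1) 1
(3, 3) 1
(1, 1) 1
[[3 0 8 0]
 [0 2 0 0]
 [0 0 0 0]
 [0 0 0 1]]
Entries that share the same (row, col) position are summed on conversion, which is why position (0, 0) holds 3 and position (1, 1) holds 2.
Note: csr_matrix always returns a two-dimensional sparse matrix, never a one-dimensional vector. Even csr_matrix([2,3]) returns a matrix.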
Size of a sparse matrix
csr = csr_matrix([[1, 5], [4, 0], [1, 3]])
print(csr.todense())  # todense() returns <class 'numpy.matrixlib.defmatrix.matrix'>
print(csr.shape)
print(csr.shape[1])
[[1 5]
 [4 0]
 [1 3]]
(3, 2)
2
Indexing and slicing a sparse matrix
print(csr)
(0, 0) 1
(0, 1) 5
(1, 0) 4
(2, 0) 1
(2, 1) 3
print(csr[0])  # <class 'scipy.sparse.csr.csr_matrix'>
(0, 0) 1
(0, 1) 5
print(csr[1,1])
0
print(csr[0,0])
1
for c in csr:  # each iteration yields one row of csr
    print(type(c))  # <class 'scipy.sparse.csr.csr_matrix'>
    print(c)
    break
(0, 0) 1
(0, 1) 5
csr_mat = csr_matrix([1, 5, 0])
print(csr_mat.todense())
# csr_mat.nonzero() returns a tuple (row_indices, col_indices), so zip the two arrays
for row, col in zip(*csr_mat.nonzero()):
    print(row, col, csr_mat[row, col])
[[1 5 0]]
0 0 1
0 1 5
Stacking sparse matrices horizontally or vertically
from scipy.sparse import csr_matrix, vstack
csr = csr_matrix([[1, 5, 5], [4, 0, 6], [1, 3, 7]])
print(csr.todense())
[[1 5 5]
 [4 0 6]
 [1 3 7]]
csr2 = csr_matrix([[3, 0, 9]])
print(csr2.todense())
[[3 0 9]]
print(vstack([csr, csr2]).todense())
[[1 5 5]
 [4 0 6]
 [1 3 7]
 [3 0 9]]
Note: blocks whose shapes are incompatible cannot be stacked, and all elements of the resulting matrix share a single data type.
The diags function creates a sparse diagonal matrix.
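Reading from a sparse matrix
Elements can be read by index, as with an ordinary matrix. getrow(i) and getcol(j) read a specific row or column, and nonzero() returns the positions of the non-zero elements. Most sparse formats (apparently all except COO) support slicing, for example CSC and CSR. Arithmetic operations such as addition, subtraction, multiplication, and division are also supported and are fast.
Selecting specific columns of a matrix
sub = matrix.getcol(1)  # 'coo_matrix' object does not support indexing, so matrix[1] cannot be used
print(sub)
(1, 0) 2
sub = matrix.todense()[:,[1,2]]  # select the columns from the dense matrix
print(sub)
[[0 8]
 [2 0]
 [0 0]
 [0 0]]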
Dot products with sparse matrices
A = csr_matrix([[1, 2, 0], [0, 0, 3]])
print(A.todense())
[[1 2 0]
 [0 0 3]]
v = A.T
print(v.todense())
[[1 0]
 [2 0]
 [0 3]]
d = A.dot(v)
print(d)
(0, 0) 5
(1, 1) 9
import datetime
from numpy import array
from scipy.sparse import lil_matrix
A = lil_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
v = array([1, 0, -1])
s = datetime.datetime.now()
for i in range(100000):
    d = A.dot(v)  # here v is an ndarray
print(datetime.datetime.now() - s)
Measured time by format: bsr: 0:00:01.666072, coo: 1.04, csc: 0.93, csr: 0.90, dia: 1.06, dok: 1.57, lil: 11.37. CSR is therefore recommended for computing dot products.
csr_mat1 = csr_matrix([1, 2, 0])
csr_mat2 = csr_matrix([1, 0, -1])
similar = (csr_mat1.dot(csr_mat2.transpose()))  # here csr_mat2 is also a csr_matrix
print(type(similar))
print(similar)
print(similar[0, 0])
<class 'scipy.sparse.csr.csr_matrix'>
(0, 0) 1
1
Reading and saving scipy sparse matrices in files
mmwrite(target, a[, comment, field, precision]) Writes the sparse or dense array a to a Matrix Market formatted file.
mmread(source) Reads the contents of a Matrix Market file 'filename' into a matrix. The result is a <class 'scipy.sparse.coo.coo_matrix'>.
mminfo(source) Queries the contents of the Matrix Market file 'filename' to extract size and storage.
from numpy import random
from scipy.io import mmwrite, mmread, mminfo
from scipy.sparse import csr_matrix
def save_csr_mat(item_item_sparse_mat_filename=r'.\datasets\lastfm-dataset-1K\item_item_csr_mat.mtx'):
    random.seed(10)
    raw_user_item_mat = random.randint(0, 6, (3, 2))
    d = csr_matrix(raw_user_item_mat)
    print(d.todense())
    print(d)
    mmwrite(item_item_sparse_mat_filename, d)
    print("item_item_sparse_mat_file information: ")
    print(mminfo(item_item_sparse_mat_filename))
    k = mmread(item_item_sparse_mat_filename)
    print(k.todense())
[[1 5]
 [4 0]
 [1 3]]
(0, 0) 1
(0, 1) 5
(1, 0) 4
(2, 0) 1
(2, 1) 3
item_item_sparse_mat_file information:
(3, 2, 5, 'coordinate', 'integer', 'general')
[[1 5]
 [4 0]
 [1 3]]
Contents of the saved file:
%%MatrixMarket matrix coordinate integer general
%
3 2 5
1 1 1
1 2 5
2 1 4
3 1 1
3 2 3
Note: the saved file should have the extension .mtx.
[scipy-ref-0.14.0 - Matrix Market files]
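A self-contained round trip through a Matrix Market file can be sketched as follows. The file name example.mtx and the use of a temporary directory are choices made for this sketch only.

```python
import os
import tempfile
import numpy as np
from scipy.io import mmwrite, mmread, mminfo
from scipy.sparse import csr_matrix

original = csr_matrix(np.array([[1, 5], [4, 0], [1, 3]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.mtx")   # hypothetical file name
    mmwrite(path, original)
    info = mminfo(path)          # (rows, cols, entries, format, field, symmetry)
    loaded = mmread(path)        # read back in coordinate (COO) form

round_trip_ok = np.array_equal(loaded.toarray(), original.toarray())
print(info, round_trip_ok)
```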
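A relatively memory-efficient way to store sparse matrices in Python
Simulating a sparse matrix with a Python dictionary
{In my view this approach is mainly useful for sparse lookup and storage, not for computation}
To support fast slicing of data[i, ...] or data[..., j], the data for a given i or j has to be stored together. To hold very large data sets, part of the data also has to live on disk, with memory acting as a buffer.
The solution here is fairly simple: store the data in a dict-like object. For a given i (say 9527), its data is stored under dict['i9527'], so fetching data[9527, ...] only requires reading dict['i9527']. dict['i9527'] is originally a dict object that maps each j to its value; to save memory, that dict is stored as a binary string. Another advantage of a dict-like store is that the backend can be swapped freely: an in-memory dict, any kind of DBM, or even Tokyo Cabinet. [blogread.cn/it/article/1229]
[How to Think Like a Computer Scientist covers dictionaries and how to use a dictionary to simulate a sparse matrix.]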
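The row-keyed scheme described above might look like the following sketch. The class name RowStore, its methods, and the choice of pickle for the binary string are all assumptions made for illustration; the source does not specify an implementation.

```python
import pickle

class RowStore:
    """Hypothetical row-keyed sparse store; not part of scipy."""

    def __init__(self, backend=None):
        # Any dict-like object works here, for example a dbm or shelve handle.
        self.backend = {} if backend is None else backend

    def set_row(self, i, row):
        # row maps column index j -> value; it is stored as a binary string.
        self.backend["i%d" % i] = pickle.dumps(row, protocol=pickle.HIGHEST_PROTOCOL)

    def get_row(self, i):
        raw = self.backend.get("i%d" % i)
        return {} if raw is None else pickle.loads(raw)

    def get(self, i, j, default=0):
        # Missing entries read as zero, as in a sparse matrix.
        return self.get_row(i).get(j, default)

store = RowStore()
store.set_row(9527, {3: 1.5, 42: 2.0})
row = store.get_row(9527)
missing = store.get(9527, 7)
print(row, missing)
```

Fetching a whole row costs one lookup plus one deserialization, which is the trade-off this layout makes in exchange for compact storage.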
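Sparse data in sklearn feature extraction
Loading features from dicts: the class DictVectorizer converts feature arrays represented as lists of standard Python dict objects into the NumPy/SciPy representation used by scikit-learn estimators. Although not particularly fast to process, Python dicts are convenient to use, are sparse (absent features need not be stored), and store feature names in addition to values.
Feature hashing: the class FeatureHasher is a fast, low-memory vectorizer that uses a technique known as feature hashing, or the "hashing trick". Instead of building a hash table of the features encountered in training, as the vectorizers do, instances of FeatureHasher apply a hash function to the features to determine their column index in the sample matrix directly. The result is higher speed and lower memory usage at the expense of inspectability: the hasher does not remember what the input features looked like and has no inverse_transform method.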
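The idea behind feature hashing can be sketched without scikit-learn. This is a simplified illustration, not the FeatureHasher implementation: the function hash_features, the use of SHA-256, and the default of 16 columns are assumptions, and the sign trick that real implementations use to reduce collision bias is left out.

```python
import hashlib
from scipy.sparse import csr_matrix

def hash_features(samples, n_columns=16):
    """Map each dict of {feature name: value} to a row of a CSR matrix."""
    data, indices, indptr = [], [], [0]
    for sample in samples:
        for name, value in sample.items():
            # The column comes straight from a hash; no vocabulary is stored.
            digest = hashlib.sha256(name.encode("utf-8")).digest()
            column = int.from_bytes(digest[:4], "little") % n_columns
            indices.append(column)
            data.append(float(value))
        indptr.append(len(indices))
    matrix = csr_matrix((data, indices, indptr), shape=(len(samples), n_columns))
    matrix.sum_duplicates()   # features that collide within a row are added together
    return matrix

samples = [{"city=Paris": 1, "temperature": 12.0}, {"city=London": 1}]
X = hash_features(samples)
print(X.shape, X.nnz)
```

Because only column indices are kept, there is no way to map a column back to a feature name, which matches the missing inverse_transform noted above.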
[sklearn feature extraction]
from: http://blog.csdn.net/pipisorry/article/details/41762945
http://blog.sina.com.cn/s/blog_6a90ae320101aavg.html