Implementing Scene Recognition with a Bag of Words
A while ago, for the Computer Vision: Algorithms and Applications course at Stanford University, I did a small assignment: scene recognition with a bag of words. My notes are organized below.
1. The Bag-of-Words Model
The model was first proposed by Josef Sivic et al., building on models from natural language processing. It is widely used in document classification, where the frequency of each word serves as a feature for the classifier. By analogy with an article composed of many textual words, if an image can be represented as a collection of visual words, then the techniques developed for text retrieval can be carried over to image retrieval and classification.
A simple example illustrates how the bag of words is used in text processing:
Consider the following two simple documents:
Based on these two documents, we build a dictionary as follows:
This dictionary contains 10 distinct words. Using their indexes, each of the two documents can be represented as a 10-entry vector:
In plain terms:
The bag-of-words model represents a document as a vector whose dimensionality equals the number of words in the dictionary. In the example above, the i-th element of the vector counts how many times the i-th dictionary word appears in the document. So the BoW model can be seen as a simple document representation based on a word-frequency histogram.
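The construction above can be sketched in a few lines of Python. The post's own two documents appear only in the figures, so the classic two-sentence example is used here as a stand-in; it happens to yield exactly 10 distinct words, matching the dictionary described above.

```python
from collections import Counter

# Two toy documents (a hypothetical stand-in for the documents in the figures).
doc1 = "John likes to watch movies Mary likes movies too"
doc2 = "John also likes to watch football games"

# Build the dictionary: every distinct word, in order of first appearance.
dictionary = []
for word in (doc1 + " " + doc2).split():
    if word not in dictionary:
        dictionary.append(word)

def bow_vector(doc, dictionary):
    """Count how often each dictionary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[w] for w in dictionary]

print(len(dictionary))               # 10 distinct words
print(bow_vector(doc1, dictionary))  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
print(bow_vector(doc2, dictionary))  # [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```

Each vector entry is a word count, so the representation is exactly a word-frequency histogram over the dictionary.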
2. The Bag-of-Words Model in Computer Vision
One of the simplest algorithms for category recognition is the bag-of-words method (also called bag of features, or bag of keypoints). A typical processing pipeline for a bag-of-words recognition system is shown in the figure below:
In other words, we first extract SIFT features from the training image set to form a collection of local feature descriptors, cluster them into groups of visual words, and finally build a visual dictionary. Concretely, the process of generating an image's visual-word histogram is illustrated below:
As the upper half of the figure shows, the bag-of-words model turns an image into a numeric vector in the following steps:
1. Feature extraction: use the SIFT algorithm to extract features from the image sets of the different categories (the training images). These feature vectors, our visual vocabulary, describe locally invariant feature points in the images.
2. Gather all the feature vectors together and use the K-means algorithm to merge visually similar words, building a vocabulary of K words.
3. Count how many times each vocabulary word occurs in an image; the image can then be represented as a K-dimensional numeric vector, i.e. a K-bin histogram.
To perform scene recognition or classification, we then relate the test images to the training images in this representation.
3. Implementing Scene Recognition
My scene recognition implementation calls functions from a computer vision library, the VLFeat 0.9.17 binary package. It is configured in MATLAB as follows:
1) First, what to prepare:
2) Installation
1. Extract the downloaded binary package somewhere, e.g. the D:\ drive.
2. Copy the extracted vlfeat folder into the toolbox folder of the MATLAB installation directory.
3. Open MATLAB and type edit startup.m to create the startup file startup.m.
4. Add the following to startup.m (note: if you installed vlfeat somewhere else, change "D:\" below to your install path):
5. Save and close startup.m, then restart MATLAB; the installation is complete. (After a successful install, do not delete the extracted vlfeat folder: vl_setup merely adds the path of the vlfeat toolbox to MATLAB's path so that MATLAB can use it.)
6. To verify that vlfeat is configured correctly, enter the following command in MATLAB; if you see the output below, the configuration succeeded!
In this Computer Vision: Algorithms and Applications assignment, the data consists of training and test image sets for different scenes, and scene recognition is implemented in two ways:
1. Tiny images and nearest-neighbor classification (accuracy 18%~25%)
The tiny image feature is one of the simplest image representations. Following the assignment, I simply resize each image to a fixed resolution (16x16). The method works somewhat better if the tiny image matrix is made zero mean and unit length. Even so, it is not a very good representation: it discards high-frequency image content and is not invariant to image scale.
For nearest-neighbor classification, I simply measure the Euclidean distance between a test image's feature and each training image's feature, find the nearest one, and assign its label to the test image. Nearest-neighbor classification has many advantages: no training, simple, easy to understand and implement, no parameters to estimate. But it is sensitive to noise, and as the feature dimensionality grows it cannot learn to ignore irrelevant dimensions.
The MATLAB code is as follows:
function image_feats = get_tiny_images(image_paths)
% image_paths is an N x 1 cell array of strings where each string is an
% image path on the file system.
% image_feats is an N x d matrix of resized and then vectorized tiny
% images. E.g. if the images are resized to 16x16, d would equal 256.
% You can either resize the images to square while ignoring their aspect
% ratio or you can crop the center square portion out of each image.
% Making the tiny images zero mean and unit length (normalizing them)
% will increase performance modestly.
N = size(image_paths, 1);
d = 256;                          % 16 x 16 tiny images
image_feats = zeros(N, d);
for i = 1:N
    image = imread(image_paths{i});
    image = imresize(image, [16, 16]);
    image = double(reshape(image, 1, d)); % vectorize; double avoids uint8 clipping
    image = image - mean(image);          % make the tiny image zero mean
    image = image / norm(image);          % normalize to unit length
    image_feats(i, :) = image;
end
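The zero-mean, unit-length normalization applied to each tiny image above can be sketched in Python with plain lists (a minimal illustration; the resize and image I/O steps are omitted):

```python
import math

def normalize_tiny_image(pixels):
    """Zero-mean, unit-length normalization of a flattened tiny image.

    `pixels` is a flat list of grayscale values (e.g. 256 values for a
    16x16 tiny image), as produced by resizing and vectorizing an image.
    """
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    length = math.sqrt(sum(c * c for c in centered))
    if length == 0:          # guard against a constant image
        return centered
    return [c / length for c in centered]

feat = normalize_tiny_image([10, 20, 30, 40])
print(feat)  # sums to 0, squared entries sum to 1
```

Centering removes overall brightness and the unit-length scaling removes contrast, which is why this normalization modestly helps the nearest-neighbor comparison.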
The nearest-neighbor classification code is as follows:
function predicted_categories = nearest_neighbor_classify(train_image_feats, train_labels, test_image_feats)
% train_image_feats is an N x d matrix, where d is the dimensionality of
% the feature representation.
% train_labels is an N x 1 cell array, where each entry is a string
% indicating the ground truth category for each training image.
% test_image_feats is an M x d matrix, where d is the dimensionality of
% the feature representation.
%{
Useful functions:
  D = vl_alldist2(X,Y)
    http://www.vlfeat.org/matlab/vl_alldist2.html
    returns the pairwise distance matrix D of the columns of X and Y, i.e.
    D(i,j) = sum((X(:,i) - Y(:,j)).^2)
    Note that vl_feat represents points as columns while this code (and
    MATLAB in general) represents points as rows, so you probably want to
    use the transpose operator '.
    vl_alldist2 supports different distance metrics which can influence
    performance significantly. The default distance, L2, is fine for
    images; CHI2 tends to work well for histograms.
  [Y,I] = min(X) if you're only doing 1 nearest neighbor, or
  [Y,I] = sort(X) if you're going to reason about many nearest neighbors.
%}
M = size(test_image_feats, 1);
predicted_categories = cell(M, 1);
% All pairwise squared L2 distances between test and train features (M x N).
D = vl_alldist2(test_image_feats', train_image_feats');
for i = 1:M
    [~, I] = min(D(i, :));           % index of the nearest training image
    predicted_categories{i, 1} = train_labels{I};
end
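The same 1-nearest-neighbor rule, stripped of the VLFeat call, can be sketched in Python with plain lists (hypothetical toy features, not the assignment's data):

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Label each test feature with the label of its nearest training feature."""
    predictions = []
    for t in test_feats:
        dists = [squared_l2(t, f) for f in train_feats]
        nearest = dists.index(min(dists))
        predictions.append(train_labels[nearest])
    return predictions

train_feats = [[0.0, 0.0], [1.0, 1.0]]
train_labels = ["forest", "city"]
print(nearest_neighbor_classify(train_feats, train_labels,
                                [[0.1, 0.2], [0.9, 0.8]]))
# → ['forest', 'city']
```

As the text notes, no training or parameter estimation is involved: all the work happens at prediction time, one distance computation per training example.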
The results are as follows:
2. SIFT features and nearest-neighbor classification (accuracy 50%~60%)
The code that uses the vlfeat library for SIFT feature detection is as follows:
(1) Building the visual vocabulary from the training image set
function vocab = build_vocabulary( image_paths, vocab_size )
% The inputs are images, a N x 1 cell array of image paths, and the size
% of the vocabulary.
%{
Useful functions:
  [centers, assignments] = vl_kmeans(X, K)
    http://www.vlfeat.org/matlab/vl_kmeans.html
    X is a d x M matrix of sampled SIFT features, where M is the number of
    features sampled. M should be pretty large! Make sure the matrix is of
    type single to be safe, e.g. single(matrix).
    K is the number of clusters desired (vocab_size).
    centers is a d x K matrix of cluster centroids. This is your vocabulary.
%}
N = size(image_paths, 1);
image_sampledSIFT = [];
for i = 1:4:N   % sample every 4th training image to keep M manageable
    img = single(imread(image_paths{i}));
    [locations, SIFT_features] = vl_dsift(img, 'STEP', 10);
    image_sampledSIFT = [image_sampledSIFT single(SIFT_features)];
end
[vocab, assignments] = vl_kmeans(image_sampledSIFT, vocab_size);
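The clustering step that vl_kmeans performs can be sketched with a few iterations of Lloyd's algorithm, shown here as a minimal pure-Python stand-in (real SIFT descriptors are 128-dimensional; 1-D toy points are used to keep the sketch short):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A few Lloyd iterations: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Empty clusters keep their old center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious 1-D clusters, around 0 and around 10.
points = [0.1, 0.2, -0.1, 9.9, 10.1, 10.0]
print(sorted(kmeans(points, 2)))  # one center near 0, one near 10
```

The resulting centers play the role of the visual vocabulary: each cluster centroid is one "visual word".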
(2) Extracting bag-of-SIFT features
function image_feats = get_bags_of_sifts(image_paths)
% image_paths is an N x 1 cell array of strings where each string is an
% image path on the file system.
%{
Useful functions:
  [locations, SIFT_features] = vl_dsift(img)
    http://www.vlfeat.org/matlab/vl_dsift.html
    locations is a 2 x n list of locations, which can be used for extra
    credit if you are constructing a "spatial pyramid".
    SIFT_features is a 128 x n matrix of SIFT features.
  D = vl_alldist2(X,Y)
    http://www.vlfeat.org/matlab/vl_alldist2.html
    returns the pairwise distance matrix D of the columns of X and Y, i.e.
    D(i,j) = sum((X(:,i) - Y(:,j)).^2)
%}
load('vocab.mat')
fprintf('vocab loaded\n')
vocab_size = size(vocab, 2);          % vocab is 128 x vocab_size
N = size(image_paths, 1);
image_feats = zeros(N, vocab_size);
for i = 1:N
    img = single(imread(image_paths{i}));
    [locations, SIFT_features] = vl_dsift(img, 'STEP', 10);
    % Assign each SIFT feature to its nearest visual word.
    D = vl_alldist2(vocab, single(SIFT_features));  % vocab_size x n
    [~, I] = min(D);
    histogram = zeros(vocab_size, 1);
    for j = 1:size(I, 2)              % one vote per feature
        histogram(I(j)) = histogram(I(j)) + 1;
    end
    histogram = histogram / norm(histogram);        % normalize the histogram
    image_feats(i, :) = histogram';
end
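The voting step above, one vote per descriptor for its nearest vocabulary word followed by L2 normalization, reduces to the following pure-Python sketch (1-D toy descriptors stand in for 128-dimensional SIFT features):

```python
import math

def bag_of_words_histogram(features, vocab):
    """One vote per feature for its nearest visual word, then L2-normalize."""
    hist = [0.0] * len(vocab)
    for f in features:
        nearest = min(range(len(vocab)), key=lambda k: (f - vocab[k]) ** 2)
        hist[nearest] += 1.0
    length = math.sqrt(sum(h * h for h in hist))
    return [h / length for h in hist] if length > 0 else hist

vocab = [0.0, 5.0, 10.0]               # 3 "visual words"
features = [0.1, 0.3, 4.8, 9.7, 10.2]  # 5 "descriptors" from one image
print(bag_of_words_histogram(features, vocab))
```

Normalizing the histogram makes images with different numbers of detected features comparable under the Euclidean distance used by the nearest-neighbor classifier.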
Classification again uses the nearest-neighbor classifier.
The results are as follows: