A Detailed Code Walkthrough of TensorFlow-Based Object Detection
This post walks through a Windows demo of Faster R-CNN built on TensorFlow. It is organized in two parts: importing the training data, and building the network architecture. The Git address of the source code is at the end of the article; you can download it and follow along.
1. Notes on the Runtime Environment
First, a list of major pitfalls. If even one of the conditions below is not met, the program will not run:
Windows 10 Home:
Python 3.5 + Windows + Visual Studio 2015 + CUDA 9.1
I stepped into several of these pits myself; I hope later users of this version of the demo can avoid them:
① Python 3.6 cannot build this program, because the author built it against 3.5.
② If your machine runs Windows 10 Home, do not install Python 3.5 through Anaconda; install Python 3.5 directly. Windows 10 Home cannot host an Anaconda3 + Python 3.5 environment, since Anaconda3 defaults to Python 3.6 or 2.7.
③ No Visual Studio version other than 2015 provides the C++ toolchain needed to compile the required Python extensions. (Don't ask me why; I don't know either.)
Windows 10 Enterprise:
Anaconda3 + Python 3.5 + CUDA 9.1
① Anaconda builds matched to each Python version can be downloaded from the Tsinghua Python mirror (easy to find via Baidu).
② If Anaconda provides your Python 3.5, no Visual Studio environment is needed. Conversely, if you install Python directly rather than through Anaconda, you must install Visual Studio 2015.
That's it for the pitfalls. With all of the above in place, the program should run once you build it following the README.
My IDE is JetBrains PyCharm.
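Before compiling, a quick sanity check of the interpreter and GPU setup can save a lot of pain. This little script is my own addition, not part of the demo:

import sys
import tensorflow as tf

print(sys.version)                  # should start with 3.5
print(tf.__version__)               # the demo targets the TF 1.x line
print(tf.test.is_gpu_available())   # True if CUDA is wired up correctly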
2. Importing the Training Data
Let's start with the data-import path.
Object detection is both a regression and a classification task, so the imported data must carry each object's location as well as its class. In this program, that information lives under:
...\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\Annotations
The image annotations are read from XML files.
Images and their annotation XML files correspond one to one; the training images live under:
...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\JPEGImages
Now, back to train.py:
Notice that the main function of train.py is just two lines:
train = Train()
train.train()
The first line imports the training dataset; the second performs the actual training. Starting with the first:
Step into Train(), and then into the VGG16 initialization, which lands in network.py:
self._feat_stride = [16, ]
self._feat_compress = [1. / 16., ]
self._batch_size = batch_size
self._predictions = {}
self._losses = {}
self._anchor_targets = {}
self._proposal_targets = {}
self._layers = {}
self._act_summaries = []
self._score_summaries = {}
self._train_summaries = []
self._event_summaries = {}
self._variables_to_fix = {}
It begins by setting a few parameters, such as feat_stride, which maps the anchors (discussed later) back to regions of the original image.
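To make the mapping concrete (my own illustration, not repo code): with a stride of 16, a feature-map cell (x, y) corresponds stride-wise to the 16×16 patch of the input image starting at (16x, 16y):

feat_stride = 16
fx, fy = 10, 7                       # a position on the conv5 feature map
x0, y0 = fx * feat_stride, fy * feat_stride
print((x0, y0), (x0 + feat_stride - 1, y0 + feat_stride - 1))
# (160, 112) (175, 127) -- the input patch this cell maps back to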
Back in train.py, the next line of interest is:
self.imdb, self.roidb = combined_roidb("voc_2007_trainval")
This line reads all of the training image information into the variable roidb. Step into combined_roidb():
def get_roidb(imdb_name):
    imdb = get_imdb(imdb_name)
    print('Loaded dataset `{:s}` for training'.format(imdb.name))
    imdb.set_proposal_method("gt")
    print('Set proposal method: {:s}'.format("gt"))
    roidb = get_training_roidb(imdb)
    return roidb
This code loads a roidb by dataset name and returns it.
Note this line in the same function:
roidbs = [get_roidb(s) for s in imdb_names.split('+')]
It means the data may come from several sources: multiple dataset names are joined with '+', and split() separates them again when they are needed.
In practice this program uses a single dataset, so the next line is:
roidb = roidbs[0]
Since there is known to be exactly one dataset, taking index 0 suffices; if you later add data sources, adjust this accordingly.
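For example (my own illustration), a combined name would be handled like this:

imdb_names = 'voc_2007_trainval+voc_2012_trainval'   # hypothetical combined name
print(imdb_names.split('+'))
# ['voc_2007_trainval', 'voc_2012_trainval'] -- one roidb is built per name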
So how is a dataset actually constructed? Step into get_imdb():
One more jump lands us in factory.py:
# Set up voc_<year>_<split>
for year in ['2007', '2012']:
    for split in ['train', 'val', 'trainval', 'test']:
        name = 'voc_{}_{}'.format(year, split)
        __sets[name] = (lambda split=split, year=year: pascal_voc(split, year))

# Set up coco_2014_<split>
for year in ['2014']:
    for split in ['train', 'val', 'minival', 'valminusminival', 'trainval']:
        name = 'coco_{}_{}'.format(year, split)
        __sets[name] = (lambda split=split, year=year: coco(split, year))

# Set up coco_2015_<split>
for year in ['2015']:
    for split in ['test', 'test-dev']:
        name = 'coco_{}_{}'.format(year, split)
        __sets[name] = (lambda split=split, year=year: coco(split, year))
There are three loops; presumably the internal formats of PASCAL VOC and COCO differ across years, hence the separate handling.
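Two details are worth noting here: __sets maps each dataset name to a zero-argument factory, and the split=split, year=year defaults freeze the loop variables at definition time. A minimal, self-contained illustration of that default-argument trick (the toy values are mine):

__sets = {}
for year in ['2007', '2012']:
    for split in ['train', 'trainval']:
        name = 'voc_{}_{}'.format(year, split)
        # split=split, year=year pins the current loop values; a plain
        # `lambda: (split, year)` would see only the final loop values.
        __sets[name] = (lambda split=split, year=year: (split, year))

print(__sets['voc_2007_trainval']())   # ('trainval', '2007')

get_imdb() then simply looks the name up and calls the stored factory.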
Starting with the PASCAL VOC dataset, step into imdb.__init__; the code below is in imdb.py:
def __init__(self, name, classes=None):
    self._name = name
    self._num_classes = 0
    if not classes:
        self._classes = []
    else:
        self._classes = classes
    self._image_index = []
    self._obj_proposer = 'gt'
    self._roidb = None
    self._roidb_handler = self.default_roidb
    # Use this dict for storing dataset specific config options
    self.config = {}
imdb.py performs a series of operations on the loaded data:
The initializer stores the dataset name, zeroes the class count, and sets up the class index labels. The proposal method name is 'gt'; roidb, the result we ultimately want, starts out as None. The class also installs a roidb handler that performs further processing, which we will get to shortly.
Now return to pascal_voc.py and continue past the base initialization:
self._year = year
self._image_set = image_set
This sets the dataset year and the image set to use, which determines where the annotations come from. Here we use only val and train, i.e. the trainval split: the training data together with its ground truth:
The PASCAL VOC image-set files live under:
...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\ImageSets\Main
Open the trainval file and take a look:
000005
000007
000009
000012
000016
000017
000019
000020
000021
000023
000024
000026
000030
The file continues in this form, roughly five thousand lines in all, naming the five thousand examples to be used. These IDs in trainval are the handles of the data we are about to train on: each one identifies an image and its corresponding XML annotation.
Next, the paths for reading the related information are set:
self._devkit_path = self._get_default_path() if devkit_path is None \
    else devkit_path
self._data_path = os.path.join(self._devkit_path, 'VOC' + self._year)
Further down come the classification categories: 21 classes in total, twenty foreground classes plus one background. Each class name is then bound to a fixed integer index, which simplifies everything that follows:
self._class_to_ind = dict(list(zip(self.classes, list(range(self.num_classes)))))
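Concretely, the resulting dict looks like this (a trimmed illustration of my own):

classes = ('__background__', 'aeroplane', 'bicycle', 'bird')   # first 4 of the 21
class_to_ind = dict(list(zip(classes, list(range(len(classes))))))
print(class_to_ind)
# {'__background__': 0, 'aeroplane': 1, 'bicycle': 2, 'bird': 3} (key order may vary)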
In fact, of all the PASCAL VOC files, this program probably only ever touches that one trainval.txt. Next the data is loaded: the IDs listed in the ImageSets file are read from _data_path one line at a time via x.strip(), and returned as image_index:
def _load_image_set_index(self):
    """
    Load the indexes listed in this dataset's image set file.
    """
    # Example path to image set file:
    # self._devkit_path + /VOCdevkit2007/VOC2007/ImageSets/Main/val.txt
    image_set_file = os.path.join(self._data_path, 'ImageSets', 'Main',
                                  self._image_set + '.txt')
    assert os.path.exists(image_set_file), \
        'Path does not exist: {}'.format(image_set_file)
    with open(image_set_file) as f:
        image_index = [x.strip() for x in f.readlines()]
    return image_index
That covers the PASCAL VOC loading path; COCO works the same way, so I won't repeat it. Back to train.py:
The call set_proposal_method("gt") declares that the proposals to load are the ground-truth boxes.
The next line is rather interesting:
roidb = get_training_roidb(imdb)
Step into this function:
def get_training_roidb(imdb):
    """Returns a roidb (Region of Interest database) for use in training."""
    if True:
        print('Appending horizontally-flipped training examples...')
        imdb.append_flipped_images()
        print('done')
    print('Preparing training data...')
    rdl_roidb.prepare_roidb(imdb)
    print('done')
    return imdb.roidb
Here every image is flipped horizontally, i.e. mirrored. The original dataset has 5,000 images; after flipping there are 10,000.
Let's look at the flipping in detail, in imdb.py:
def append_flipped_images(self):
    num_images = self.num_images
    widths = self._get_widths()
    for i in range(num_images):
        boxes = self.roidb[i]['boxes'].copy()
        oldx1 = boxes[:, 0].copy()
        oldx2 = boxes[:, 2].copy()
        boxes[:, 0] = widths[i] - oldx2 - 1
        boxes[:, 2] = widths[i] - oldx1 - 1
        assert (boxes[:, 2] >= boxes[:, 0]).all()
        entry = {'boxes': boxes,
                 'gt_overlaps': self.roidb[i]['gt_overlaps'],
                 'gt_classes': self.roidb[i]['gt_classes'],
                 'flipped': True}
        self.roidb.append(entry)
    self._image_index = self._image_index * 2
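Only the x-coordinates change: a box (x1, x2) becomes (W - x2 - 1, W - x1 - 1). A quick numeric check of my own:

W = 500                        # image width
oldx1, oldx2 = 100, 200        # original box x-extent
newx1, newx2 = W - oldx2 - 1, W - oldx1 - 1
print(newx1, newx2)            # 299 399 -- same width, mirrored about the centre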
At this point the data is essentially loaded. A few other details deserve mention, for example this in pascal_voc.py:
def gt_roidb(self):
    """
    Return the database of ground-truth regions of interest.
    This function loads/saves from/to a cache file to speed up future calls.
    """
    cache_file = os.path.join(self.cache_path, self.name + '_gt_roidb.pkl')
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as fid:
            try:
                roidb = pickle.load(fid)
            except:
                roidb = pickle.load(fid, encoding='bytes')
        print('{} gt roidb loaded from {}'.format(self.name, cache_file))
        return roidb
This function saves the loaded annotations to a pickle file; on later runs, if the cache exists, the data is read straight from the pickle rather than parsed again.
...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\cache
This is the cache directory; try deleting it and see what happens.
In the code, the cache directory and file name are fixed; if the cache file exists, the cached roidb is loaded and returned, otherwise everything is parsed from scratch. So by now we know which dataset is selected, which data is loaded, where it sits on disk, and in what form it arrives. One important question remains: how exactly do the labels get read in? The XML files are parsed with a parser:
def _load_pascal_annotation(self, index):
    """
    Load image and bounding boxes info from XML file in the PASCAL VOC
    format.
    """
    filename = os.path.join(self._data_path, 'Annotations', index + '.xml')
    tree = ET.parse(filename)
    objs = tree.findall('object')
    if not self.config['use_diff']:
        # Exclude the samples labeled as difficult
        non_diff_objs = [
            obj for obj in objs if int(obj.find('difficult').text) == 0]
        # if len(non_diff_objs) != len(objs):
        #     print 'Removed {} difficult objects'.format(
        #         len(objs) - len(non_diff_objs))
        objs = non_diff_objs
    num_objs = len(objs)

    boxes = np.zeros((num_objs, 4), dtype=np.uint16)
    gt_classes = np.zeros((num_objs), dtype=np.int32)
    overlaps = np.zeros((num_objs, self.num_classes), dtype=np.float32)
    # "Seg" area for pascal is just the box area
    seg_areas = np.zeros((num_objs), dtype=np.float32)

    # Load object bounding boxes into a data frame.
    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        # Make pixel indexes 0-based
        x1 = float(bbox.find('xmin').text) - 1
        y1 = float(bbox.find('ymin').text) - 1
        x2 = float(bbox.find('xmax').text) - 1
        y2 = float(bbox.find('ymax').text) - 1
        cls = self._class_to_ind[obj.find('name').text.lower().strip()]
        boxes[ix, :] = [x1, y1, x2, y2]
        gt_classes[ix] = cls
        overlaps[ix, cls] = 1.0
        seg_areas[ix] = (x2 - x1 + 1) * (y2 - y1 + 1)

    overlaps = scipy.sparse.csr_matrix(overlaps)

    return {'boxes': boxes,
            'gt_classes': gt_classes,
            'gt_overlaps': overlaps,
            'flipped': False,
            'seg_areas': seg_areas}
boxes = np.zeros((num_objs, 4), dtype=np.uint16)
Here boxes holds the regression boxes; each box is two corner coordinates, so n objects give an n×4 array.
gt_classes = np.zeros((num_objs), dtype=np.int32)
gt_classes stores one class index per object. overlaps is a one-hot encoding over the classes, and seg_areas holds each box's area, which is not used yet.
Then comes the loop, which iterates over the n objects in a single image.
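For instance (toy numbers of my own), an image with two objects of classes 3 and 1, in a 4-class setting, would produce:

import numpy as np

num_objs, num_classes = 2, 4
overlaps = np.zeros((num_objs, num_classes), dtype=np.float32)
overlaps[0, 3] = 1.0           # object 0 belongs to class 3
overlaps[1, 1] = 1.0           # object 1 belongs to class 1
print(overlaps)
# [[0. 0. 0. 1.]
#  [0. 1. 0. 0.]] -- one-hot rows; a ground-truth box fully overlaps its own class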
Now the flipping is done, the data volume is doubled, and all the required fields have been extracted.
One more line remains:
rdl_roidb.prepare_roidb(imdb)
Jump once more, to the prepare_roidb function in roidb.py:
def prepare_roidb(imdb):
    """Enrich the imdb's roidb by adding some derived quantities that
    are useful for training. This function precomputes the maximum
    overlap, taken over ground-truth boxes, between each ROI and
    each ground-truth box. The class with maximum overlap is also
    recorded.
    """
    roidb = imdb.roidb
    if not (imdb.name.startswith('coco')):
        sizes = [PIL.Image.open(imdb.image_path_at(i)).size
                 for i in range(imdb.num_images)]
    for i in range(len(imdb.image_index)):
        roidb[i]['image'] = imdb.image_path_at(i)
        if not (imdb.name.startswith('coco')):
            roidb[i]['width'] = sizes[i][0]
            roidb[i]['height'] = sizes[i][1]
        # need gt_overlaps as a dense array for argmax
        gt_overlaps = roidb[i]['gt_overlaps'].toarray()
        # max overlap with gt over classes (columns)
        max_overlaps = gt_overlaps.max(axis=1)
        # gt class that had the max overlap
        max_classes = gt_overlaps.argmax(axis=1)
        roidb[i]['max_classes'] = max_classes
        roidb[i]['max_overlaps'] = max_overlaps
        # sanity checks
        # max overlap of 0 => class should be zero (background)
        zero_inds = np.where(max_overlaps == 0)[0]
        assert all(max_classes[zero_inds] == 0)
        # max overlap > 0 => class should not be zero (must be a fg class)
        nonzero_inds = np.where(max_overlaps > 0)[0]
        assert all(max_classes[nonzero_inds] != 0)
So what does this accomplish?
It consolidates everything into roidb and returns it: each entry gets its image path, image width and height, the maximum overlap of each ROI with the ground truth, and the class achieving that maximum.
self.data_layer = RoIDataLayer(self.roidb, self.imdb.num_classes)
self.output_dir = cfg.get_output_dir(self.imdb, 'default')
Finally, output_dir sets the default directory for saved outputs.
The data layer receives the fully processed roidb along with the class count, and shuffles the data.
3. Building the Network Architecture
Now let's recap how the Faster R-CNN network is assembled:
Figure 1
① Build the conv layers, a fully convolutional network, implemented in this TensorFlow code as a VGG16 backbone.
② The feature map produced by ①'s repeated convolution and pooling operations is fed into the RPN (Region Proposal Network).
③ A 3×3 sliding window moves (left to right) across the feature map from ②. Each window centre is an anchor point, which maps back to the original image; combining three scales with three aspect ratios yields 9 candidate boxes per anchor point. Let k be the total number of boxes over all anchor points.
④ The k boxes from ③ feed two parallel operations: classification and regression. The classification branch separates foreground from background, a binary task yielding 2k scores; boxes judged to be background need no further classing, while the regression branch refines each box's geometry, yielding 4k coordinates, where each box's four values are its centre (x, y) and its height and width (h, w).
⑤ After classification and regression comes box filtering, the main job of the proposal layer. The filtering proceeds as follows: first, compare each box with the image's ground truth and keep those with IoU > 0.7, discarding the rest; second, apply NMS (non-maximum suppression), using the foreground probability from the binary classification to keep the top n boxes by score; third, remove boxes that cross the image boundary; fourth, once more keep the top m boxes by score.
⑥ The surviving boxes go through RoI pooling, followed by a fully connected network with one classification task and one regression task; the classification here is 21-way, twenty foreground classes plus one background. That completes the pipeline.
That concludes the recap.
Now into the code proper:
① Most of the network-construction code lives in vgg16.py. In the main function, the first call, Train(), covers reading in the data, and the second, train(), covers the training process. We take the training side first. The core is line 85:
layers = self.net.create_architecture(sess, "TRAIN", self.imdb.num_classes, tag='default')
create_architecture() builds the entire network. Step into it.
② After a series of convolution and deconvolution parameters are specified, the core is line 295:
rois, cls_prob, bbox_pred = self.build_network(sess, training)
rois are the boxes handed to the RoI pooling layer; cls_prob is the classification score out of the final fully connected layers; bbox_pred holds the bounding-box regression outputs of the 21-way head. Continue into build_network():
③ Jump to the function of the same name at line 18 of vgg16.py. Let's study this function carefully:

def build_network(self, sess, is_training=True):
    with tf.variable_scope('vgg_16', 'vgg_16'):
        # select initializer
        if cfg.FLAGS.initializer == "truncated":
            initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.01)
            initializer_bbox = tf.truncated_normal_initializer(mean=0.0, stddev=0.001)
        else:
            initializer = tf.random_normal_initializer(mean=0.0, stddev=0.01)
            initializer_bbox = tf.random_normal_initializer(mean=0.0, stddev=0.001)

        # Build head
        net = self.build_head(is_training)

        # Build rpn
        rpn_cls_prob, rpn_bbox_pred, rpn_cls_score, rpn_cls_score_reshape = self.build_rpn(net, is_training, initializer)

        # Build proposals
        rois = self.build_proposals(is_training, rpn_cls_prob, rpn_bbox_pred, rpn_cls_score)

        # Build predictions
        cls_score, cls_prob, bbox_pred = self.build_predictions(net, rois, is_training, initializer, initializer_bbox)

        self._predictions["rpn_cls_score"] = rpn_cls_score
        self._predictions["rpn_cls_score_reshape"] = rpn_cls_score_reshape
        self._predictions["rpn_cls_prob"] = rpn_cls_prob
        self._predictions["rpn_bbox_pred"] = rpn_bbox_pred
        self._predictions["cls_score"] = cls_score
        self._predictions["cls_prob"] = cls_prob
        self._predictions["bbox_pred"] = bbox_pred
        self._predictions["rois"] = rois

        self._score_summaries.update(self._predictions)

        return rois, cls_prob, bbox_pred
④ The function splits into build_head, build_rpn, build_proposals, and build_predictions, matching exactly the fully convolutional backbone, the RPN layer, the proposal layer, and the final fully connected head we just recapped. With the skeleton in place, let's analyze each part step by step:
⑤ Building the fully convolutional backbone (build_head). In this demo it consists of five blocks, each a group of convolutions followed by one pooling operation, except the last block, which has convolutions only and no pooling.
# Main network
# Layer 1
net = slim.repeat(self._image, 2, slim.conv2d, 64, [3, 3], trainable=False, scope='conv1')
net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool1')
# Layer 2
net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], trainable=False, scope='conv2')
net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool2')
# Layer 3
net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], trainable=is_training, scope='conv3')
net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool3')
# Layer 4
net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], trainable=is_training, scope='conv4')
net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool4')
# Layer 5
net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], trainable=is_training, scope='conv5')
As the code shows, the author uses slim's conv2d for the convolutions (the traditional route would be conv2d from the nn module) and max_pool2d for pooling, with a 2×2 window. The convolutions preserve spatial size, while each pooling layer halves it, so after four pooling layers the feature map is 1/16 the size of the input.
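As a concrete check (my own arithmetic), an input resized to 600×800 comes out of conv5 at:

import math

h, w = 600, 800                 # a typical resized input
for _ in range(4):              # pool1..pool4, each with stride 2
    h, w = math.ceil(h / 2), math.ceil(w / 2)
print(h, w)                     # 38 50 -- consistent with a feat_stride of 16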
⑥ Building the RPN (build_rpn). _anchor_component() is the function that generates the anchor boxes. Stepping in: height and width are the feature-map dimensions, obtained by dividing the image size by the stride; tf.py_func() then calls generate_anchors_pre, inside which generate_anchors() produces the 9 base boxes and the computed shifts place them at every position. Because these positions must map back to the original image, feat_stride, the scale factor between the input image and the feature map, is 16 here.
The related code in network.py (reached from vgg16.py):
def _anchor_component(self):
    with tf.variable_scope('ANCHOR_' + 'default'):
        # just to get the shape right
        height = tf.to_int32(tf.ceil(self._im_info[0, 0] / np.float32(self._feat_stride[0])))
        width = tf.to_int32(tf.ceil(self._im_info[0, 1] / np.float32(self._feat_stride[0])))
        anchors, anchor_length = tf.py_func(generate_anchors_pre,
                                            [height, width,
                                             self._feat_stride, self._anchor_scales, self._anchor_ratios],
                                            [tf.float32, tf.int32], name="generate_anchors")
        anchors.set_shape([None, 4])
        anchor_length.set_shape([])
        self._anchors = anchors
        self._anchor_length = anchor_length
The related code in snippets.py:
def generate_anchors_pre(height, width, feat_stride, anchor_scales=(8, 16, 32), anchor_ratios=(0.5, 1, 2)):
    """ A wrapper function to generate anchors given different scales
    Also return the number of anchors in variable 'length'
    """
    anchors = generate_anchors(ratios=np.array(anchor_ratios), scales=np.array(anchor_scales))
    A = anchors.shape[0]
    shift_x = np.arange(0, width) * feat_stride
    shift_y = np.arange(0, height) * feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(), shift_x.ravel(), shift_y.ravel())).transpose()
    K = shifts.shape[0]
    # width changes faster, so here it is H, W, C
    anchors = anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2))
    anchors = anchors.reshape((K * A, 4)).astype(np.float32, copy=False)
    length = np.int32(anchors.shape[0])
    return anchors, length
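To get a feel for the sizes involved (my own numbers): a 38×50 feature map gives K = 38 × 50 = 1900 shift positions, and with A = 9 base anchors that is 17,100 anchors in total:

import numpy as np

feat_stride, A = 16, 9          # stride and anchors per position
height, width = 38, 50          # conv5 feature-map size
shift_x = np.arange(0, width) * feat_stride
shift_y = np.arange(0, height) * feat_stride
K = len(shift_x) * len(shift_y)
print(K, K * A)                 # 1900 17100 -- anchors before any filtering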
Back in build_rpn() in vgg16.py, after the anchors are generated: the feature map first passes through a 3×3 convolution, then a 1×1 convolution performs the foreground/background classification, producing the score map rpn_cls_score_reshape. A softmax turns the scores into probabilities, rpn_cls_prob_reshape, which a reshape brings back to the standard layout as rpn_cls_prob.
The binary classification and the box regression run in parallel, so a second 1×1 convolution is applied to the same feature map, producing an output of width 4×k, i.e. _num_anchors × 4.
Finally, the outputs of the classification and regression branches are returned, and the RPN layer is complete.
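build_rpn itself is not quoted above, so here is a minimal sketch of its two parallel 1×1 heads in the same slim style; the names and exact arguments are my reconstruction, not the repo's verbatim code:

import tensorflow.contrib.slim as slim

# Assumes `net` is the conv5 feature map from build_head and
# self._num_anchors == 9; details may differ from the repo.
rpn = slim.conv2d(net, 512, [3, 3], trainable=is_training,
                  weights_initializer=initializer, scope='rpn_conv/3x3')
# classification head: 2 scores (bg/fg) per anchor -> 2k channels
rpn_cls_score = slim.conv2d(rpn, self._num_anchors * 2, [1, 1],
                            padding='VALID', activation_fn=None,
                            scope='rpn_cls_score')
# regression head: 4 box deltas per anchor -> 4k channels
rpn_bbox_pred = slim.conv2d(rpn, self._num_anchors * 4, [1, 1],
                            padding='VALID', activation_fn=None,
                            scope='rpn_bbox_pred')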
⑦ Building the proposal layer (build_proposals).
def build_proposals(self, is_training, rpn_cls_prob, rpn_bbox_pred, rpn_cls_score):
    if is_training:
        rois, roi_scores = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
        rpn_labels = self._anchor_target_layer(rpn_cls_score, "anchor")
        # Try to have a deterministic order for the computing graph, for reproducibility
        with tf.control_dependencies([rpn_labels]):
            rois, _ = self._proposal_target_layer(rois, roi_scores, "rpn_rois")
    else:
        if cfg.FLAGS.test_mode == 'nms':
            rois, _ = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
        elif cfg.FLAGS.test_mode == 'top':
            rois, _ = self._proposal_top_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
        else:
            raise NotImplementedError
    return rois
Still in build_proposals in vgg16.py; now jump into the _proposal_layer function, in network.py:
def _proposal_layer(self, rpn_cls_prob, rpn_bbox_pred, name):
    with tf.variable_scope(name):
        rois, rpn_scores = tf.py_func(proposal_layer,
                                      [rpn_cls_prob, rpn_bbox_pred, self._im_info, self._mode,
                                       self._feat_stride, self._anchors, self._num_anchors],
                                      [tf.float32, tf.float32])
        rois.set_shape([None, 5])
        rpn_scores.set_shape([None, 1])
    return rois, rpn_scores
The core here is the proposal_layer function handed to tf.py_func(); continue into proposal_layer.py:
def proposal_layer(rpn_cls_prob, rpn_bbox_pred, im_info, cfg_key, _feat_stride, anchors, num_anchors):
    """A simplified version compared to fast/er RCNN
    For details please see the technical report
    """
    if type(cfg_key) == bytes:
        cfg_key = cfg_key.decode('utf-8')
    if cfg_key == "TRAIN":
        pre_nms_topN = cfg.FLAGS.rpn_train_pre_nms_top_n
        post_nms_topN = cfg.FLAGS.rpn_train_post_nms_top_n
        nms_thresh = cfg.FLAGS.rpn_train_nms_thresh
    else:
        pre_nms_topN = cfg.FLAGS.rpn_test_pre_nms_top_n
        post_nms_topN = cfg.FLAGS.rpn_test_post_nms_top_n
        nms_thresh = cfg.FLAGS.rpn_test_nms_thresh

    im_info = im_info[0]
    # Get the scores and bounding boxes
    scores = rpn_cls_prob[:, :, :, num_anchors:]
    rpn_bbox_pred = rpn_bbox_pred.reshape((-1, 4))
    scores = scores.reshape((-1, 1))
    proposals = bbox_transform_inv(anchors, rpn_bbox_pred)
    proposals = clip_boxes(proposals, im_info[:2])

    # Pick the top region proposals
    order = scores.ravel().argsort()[::-1]
    if pre_nms_topN > 0:
        order = order[:pre_nms_topN]
    proposals = proposals[order, :]
    scores = scores[order]

    # Non-maximal suppression
    keep = nms(np.hstack((proposals, scores)), nms_thresh)

    # Pick the top region proposals after NMS
    if post_nms_topN > 0:
        keep = keep[:post_nms_topN]
    proposals = proposals[keep, :]
    scores = scores[keep]

    # Only support single image as input
    batch_inds = np.zeros((proposals.shape[0], 1), dtype=np.float32)
    blob = np.hstack((batch_inds, proposals.astype(np.float32, copy=False)))
    return blob, scores
Let's recap what proposal_layer does. Its job is to filter the boxes down to a suitable set, narrowing the detection range. As step ⑤ of the recap described: first, keep the candidate boxes with over 70% overlap with the ground truth and discard the rest; second, apply NMS to keep the top n candidates by binary-classification score (i.e. foreground probability); third, discard out-of-boundary boxes and then select the top boxes by score once more. Now let's follow those steps through the code:
It starts by setting the parameters: since two top-N selections happen, there are two settings, pre_nms_topN and post_nms_topN. bbox_transform_inv() adjusts box positions and sizes toward the ground truth using the predicted deltas. Step into bbox_transform_inv():
def bbox_transform_inv(boxes, deltas):
    if boxes.shape[0] == 0:
        return np.zeros((0, deltas.shape[1]), dtype=deltas.dtype)

    boxes = boxes.astype(deltas.dtype, copy=False)
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]
    dy = deltas[:, 1::4]
    dw = deltas[:, 2::4]
    dh = deltas[:, 3::4]

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    # x1
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    # y1
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    # x2
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    # y2
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h

    return pred_boxes
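A quick usage check with toy numbers of my own (assuming the bbox_transform_inv above is in scope): zero deltas keep the box centre in place, and dw = log(2) doubles the width:

import numpy as np

boxes = np.array([[0., 0., 99., 99.]])            # one 100x100 anchor
deltas = np.array([[0., 0., np.log(2.), 0.]])     # keep centre, double width
print(bbox_transform_inv(boxes, deltas))
# [[-50.   0. 150. 100.]] -- centre preserved, width doubled (the edges land
# on -50/150 because of the +1 inclusive-pixel width convention)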
After that, the code clips the boxes to the image boundary via clip_boxes(), removing out-of-range extents, and takes the top pre_nms_topN boxes by score. nms() then returns the indices to keep, and the post-NMS top-N selection produces the surviving set:
# Non-maximal suppression
keep = nms(np.hstack((proposals, scores)), nms_thresh)

# Pick the top region proposals after NMS
if post_nms_topN > 0:
    keep = keep[:post_nms_topN]
proposals = proposals[keep, :]
scores = scores[keep]
Finally, the remaining boxes are returned: these are the proposals that survive the proposal layer.
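The nms() used here is one of the repo's compiled utilities and is not quoted in this post. For reference, a pure-NumPy version of the greedy IoU-based NMS it performs looks roughly like this (my sketch, not the repo's implementation):

import numpy as np

def nms_sketch(dets, thresh):
    """dets: (N, 5) array of [x1, y1, x2, y2, score]; returns kept indices."""
    x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
    scores = dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]        # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                    # keep the current best box
        # IoU of the kept box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes overlapping the kept one by more than the threshold
        order = order[np.where(iou <= thresh)[0] + 1]
    return keep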
Next comes picking out the anchors whose IoU with the ground truth exceeds 70%, which happens in _anchor_target_layer():
def _anchor_target_layer(self, rpn_cls_score, name):
    with tf.variable_scope(name):
        rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights = tf.py_func(
            anchor_target_layer,
            [rpn_cls_score, self._gt_boxes, self._im_info, self._feat_stride, self._anchors, self._num_anchors],
            [tf.float32, tf.float32, tf.float32, tf.float32])
Then step into anchor_target_layer.py and look at the related code:
def anchor_target_layer(rpn_cls_score, gt_boxes, im_info, _feat_stride, all_anchors, num_anchors):
    """Same as the anchor target layer in original Fast/er RCNN """
    A = num_anchors
    total_anchors = all_anchors.shape[0]
    K = total_anchors / num_anchors
    im_info = im_info[0]

    # allow boxes to sit over the edge by a small amount
    _allowed_border = 0

    # map of shape (..., H, W)
    height, width = rpn_cls_score.shape[1:3]

    # only keep anchors inside the image
    inds_inside = np.where(
        (all_anchors[:, 0] >= -_allowed_border) &
        (all_anchors[:, 1] >= -_allowed_border) &
        (all_anchors[:, 2] < im_info[1] + _allowed_border) &  # width
        (all_anchors[:, 3] < im_info[0] + _allowed_border)    # height
    )[0]

    # keep only inside anchors
    anchors = all_anchors[inds_inside, :]

    # label: 1 is positive, 0 is negative, -1 is dont care
    labels = np.empty((len(inds_inside),), dtype=np.float32)
    labels.fill(-1)

    # overlaps between the anchors and the gt boxes
    # overlaps (ex, gt)
    overlaps = bbox_overlaps(
        np.ascontiguousarray(anchors, dtype=np.float),
        np.ascontiguousarray(gt_boxes, dtype=np.float))
    argmax_overlaps = overlaps.argmax(axis=1)
    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]
    gt_argmax_overlaps = overlaps.argmax(axis=0)
    gt_max_overlaps = overlaps[gt_argmax_overlaps,
                               np.arange(overlaps.shape[1])]
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

    if not cfg.FLAGS.rpn_clobber_positives:
        # assign bg labels first so that positive labels can clobber them
        # first set the negatives
        labels[max_overlaps < cfg.FLAGS.rpn_negative_overlap] = 0

    # fg label: for each gt, anchor with highest overlap
    labels[gt_argmax_overlaps] = 1

    # fg label: above threshold IOU
    labels[max_overlaps >= cfg.FLAGS.rpn_positive_overlap] = 1

    if cfg.FLAGS.rpn_clobber_positives:
        # assign bg labels last so that negative labels can clobber positives
        labels[max_overlaps < cfg.FLAGS.rpn_negative_overlap] = 0

    # subsample positive labels if we have too many
    num_fg = int(cfg.FLAGS.rpn_fg_fraction * cfg.FLAGS.rpn_batchsize)
    fg_inds = np.where(labels == 1)[0]
    if len(fg_inds) > num_fg:
        disable_inds = npr.choice(
            fg_inds, size=(len(fg_inds) - num_fg), replace=False)
        labels[disable_inds] = -1

    # subsample negative labels if we have too many
    num_bg = cfg.FLAGS.rpn_batchsize - np.sum(labels == 1)
    bg_inds = np.where(labels == 0)[0]
    if len(bg_inds) > num_bg:
        disable_inds = npr.choice(
            bg_inds, size=(len(bg_inds) - num_bg), replace=False)
        labels[disable_inds] = -1

    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])

    bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # only the positive ones have regression targets
    bbox_inside_weights[labels == 1, :] = np.array(cfg.FLAGS2["bbox_inside_weights"])

    bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    if cfg.FLAGS.rpn_positive_weight < 0:
        # uniform weighting of examples (given non-uniform sampling)
        num_examples = np.sum(labels >= 0)
        positive_weights = np.ones((1, 4)) * 1.0 / num_examples
        negative_weights = np.ones((1, 4)) * 1.0 / num_examples
    else:
        assert ((cfg.FLAGS.rpn_positive_weight > 0) &
                (cfg.FLAGS.rpn_positive_weight < 1))
        positive_weights = (cfg.FLAGS.rpn_positive_weight /
                            np.sum(labels == 1))
        negative_weights = ((1.0 - cfg.FLAGS.rpn_positive_weight) /
                            np.sum(labels == 0))
    bbox_outside_weights[labels == 1, :] = positive_weights
    bbox_outside_weights[labels == 0, :] = negative_weights

    # map up to original set of anchors
    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
    bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, fill=0)
    bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, fill=0)

    # labels
    labels = labels.reshape((1, height, width, A)).transpose(0, 3, 1, 2)
    labels = labels.reshape((1, 1, A * height, width))
    rpn_labels = labels

    # bbox_targets
    bbox_targets = bbox_targets \
        .reshape((1, height, width, A * 4))
    rpn_bbox_targets = bbox_targets

    # bbox_inside_weights
    bbox_inside_weights = bbox_inside_weights \
        .reshape((1, height, width, A * 4))
    rpn_bbox_inside_weights = bbox_inside_weights

    # bbox_outside_weights
    bbox_outside_weights = bbox_outside_weights \
        .reshape((1, height, width, A * 4))
    rpn_bbox_outside_weights = bbox_outside_weights

    return rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights
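bbox_overlaps, which the function above relies on, is a compiled Cython utility in the repo. For reference, a pure-NumPy equivalent of the pairwise IoU it computes would look roughly like this (my sketch):

import numpy as np

def bbox_overlaps_sketch(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (K, 4) gt boxes -> (N, K)."""
    # box areas under the inclusive-pixel convention used throughout the repo
    a_areas = (anchors[:, 2] - anchors[:, 0] + 1) * (anchors[:, 3] - anchors[:, 1] + 1)
    g_areas = (gt_boxes[:, 2] - gt_boxes[:, 0] + 1) * (gt_boxes[:, 3] - gt_boxes[:, 1] + 1)
    # intersection extents, broadcast over an (N, K) grid
    ix1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    iy1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    ix2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    iy2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    iw = np.maximum(0.0, ix2 - ix1 + 1)
    ih = np.maximum(0.0, iy2 - iy1 + 1)
    inter = iw * ih
    return inter / (a_areas[:, None] + g_areas[None, :] - inter)

With these overlaps in hand, the RPN labels and regression targets above are fully determined.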
Source code for this article (GitHub):
https://github.com/dBeker/Faster-RCNN-TensorFlow-Python3.5