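TensorFlow Study Notes (7): YOLO v1 Study Notes
1. Network structure
All convolutions here are effectively 'SAME' convolutions (the two stride-2 convolutions use explicit padding followed by 'VALID', which has the same effect), so a stride-1 convolution does not change the width or height of the output feature map. The feature map size changes only through strided convolutions and pooling, and both preserve the relative position between feature map elements and patches of the input image. A 448x448 image therefore ends up, after the convolution and downsampling layers, as a 7x7 feature map in which each cell corresponds to a patch of 448/7 = 64 pixels per side. This is equivalent to dividing the input image into 7x7 patches, which links the image to the output feature map and is what the paper means by dividing the image into an SxS grid. Because the network input is scaled up from 224x224 to 448x448, each cell corresponds to a 32x32 patch when measured at the 224x224 scale.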
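The size arithmetic above can be checked with a small sketch. It assumes that only the six stride-2 layers (two strided convolutions and four max-pooling layers) change the spatial size, and that each behaves like a 'SAME' layer; `conv_output_size` is a hypothetical helper, not part of the project.

```python
def conv_output_size(size, stride):
    """Output width/height of a 'SAME' convolution or pooling layer."""
    return -(-size // stride)  # integer ceil(size / stride)


# Strides of the downsampling layers in order (assumed from the network code):
# conv (s=2), pool, pool, pool, pool, conv (s=2)
strides = [2, 2, 2, 2, 2, 2]

size = 448
sizes = [size]
for s in strides:
    size = conv_output_size(size, s)
    sizes.append(size)

cell_pixels = 448 // sizes[-1]
print(sizes)        # [448, 224, 112, 56, 28, 14, 7]
print(cell_pixels)  # 64
```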
The code that builds the network:
# Assumes: import numpy as np; import tensorflow as tf; slim = tf.contrib.slim (TF 1.x)
def build_network(self, images, keep_prob=0.5, is_training=True, scope='yolo'):
    with tf.variable_scope(scope):
        with slim.arg_scope([slim.conv2d, slim.fully_connected],
                            activation_fn=leaky_relu(self.alpha),
                            weights_initializer=tf.truncated_normal_initializer(0.0, 0.01),
                            weights_regularizer=slim.l2_regularizer(0.0005)):
            # Pad height and width by 3 on each side (dimension order: batch, height, width, channels)
            net = tf.pad(images, np.array([[0, 0], [3, 3], [3, 3], [0, 0]]), name='pad_1')
            # With the explicit padding above, this 'VALID' conv behaves like a 'SAME' conv
            # c = 64, f = 7, s = 2 ==> 64 x 224 x 224
            net = slim.conv2d(net, 64, 7, 2, padding='VALID', scope='conv_2')
            # max-pooling: f = 2, s = 2, p = 'SAME' ==> 64 x 112 x 112
            net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_3')
            # c = 192, f = 3, s = 1 ==> 192 x 112 x 112
            net = slim.conv2d(net, 192, 3, scope='conv_4')
            # max-pooling: f = 2, s = 2, p = 'SAME' ==> 192 x 56 x 56
            net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_5')
            # The default padding is 'SAME'
            # c = 128, f = 1, s = 1 ==> 128 x 56 x 56
            net = slim.conv2d(net, 128, 1, scope='conv_6')
            # c = 256, f = 3, s = 1 ==> 256 x 56 x 56
            net = slim.conv2d(net, 256, 3, scope='conv_7')
            # c = 256, f = 1, s = 1 ==> 256 x 56 x 56
            net = slim.conv2d(net, 256, 1, scope='conv_8')
            # c = 512, f = 3, s = 1 ==> 512 x 56 x 56
            net = slim.conv2d(net, 512, 3, scope='conv_9')
            # max-pooling: f = 2, s = 2 ==> 512 x 28 x 28
            net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_10')
            # c = 256, f = 1, s = 1 ==> 256 x 28 x 28
            net = slim.conv2d(net, 256, 1, scope='conv_11')
            # c = 512, f = 3, s = 1 ==> 512 x 28 x 28
            net = slim.conv2d(net, 512, 3, scope='conv_12')
            # c = 256, f = 1, s = 1 ==> 256 x 28 x 28
            net = slim.conv2d(net, 256, 1, scope='conv_13')
            # c = 512, f = 3, s = 1 ==> 512 x 28 x 28
            net = slim.conv2d(net, 512, 3, scope='conv_14')
            # c = 256, f = 1, s = 1 ==> 256 x 28 x 28
            net = slim.conv2d(net, 256, 1, scope='conv_15')
            # c = 512, f = 3, s = 1 ==> 512 x 28 x 28
            net = slim.conv2d(net, 512, 3, scope='conv_16')
            # c = 256, f = 1, s = 1 ==> 256 x 28 x 28
            net = slim.conv2d(net, 256, 1, scope='conv_17')
            # c = 512, f = 3, s = 1 ==> 512 x 28 x 28
            net = slim.conv2d(net, 512, 3, scope='conv_18')
            # c = 512, f = 1, s = 1 ==> 512 x 28 x 28
            net = slim.conv2d(net, 512, 1, scope='conv_19')
            # c = 1024, f = 3, s = 1 ==> 1024 x 28 x 28
            net = slim.conv2d(net, 1024, 3, scope='conv_20')
            # max-pooling: f = 2, s = 2 ==> 1024 x 14 x 14
            net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_21')
            # c = 512, f = 1, s = 1 ==> 512 x 14 x 14
            net = slim.conv2d(net, 512, 1, scope='conv_22')
            # c = 1024, f = 3, s = 1 ==> 1024 x 14 x 14
            net = slim.conv2d(net, 1024, 3, scope='conv_23')
            # c = 512, f = 1, s = 1 ==> 512 x 14 x 14
            net = slim.conv2d(net, 512, 1, scope='conv_24')
            # c = 1024, f = 3, s = 1 ==> 1024 x 14 x 14
            net = slim.conv2d(net, 1024, 3, scope='conv_25')
            # c = 1024, f = 3, s = 1 ==> 1024 x 14 x 14
            net = slim.conv2d(net, 1024, 3, scope='conv_26')
            # Explicit padding so that the next 'VALID' conv behaves like a 'SAME' conv
            net = tf.pad(net, np.array([[0, 0], [1, 1], [1, 1], [0, 0]]), name='pad_27')
            # c = 1024, f = 3, s = 2 ==> 1024 x 7 x 7
            net = slim.conv2d(net, 1024, 3, 2, padding='VALID', scope='conv_28')
            # c = 1024, f = 3, s = 1 ==> 1024 x 7 x 7
            net = slim.conv2d(net, 1024, 3, scope='conv_29')
            # c = 1024, f = 3, s = 1 ==> 1024 x 7 x 7
            net = slim.conv2d(net, 1024, 3, scope='conv_30')
            # Transpose from NHWC to NCHW before flattening: 7 x 7 x 1024 ==> 1024 x 7 x 7
            net = tf.transpose(net, [0, 3, 1, 2], name='trans_31')
            net = slim.flatten(net, scope='flat_32')
            net = slim.fully_connected(net, 512, scope='fc_33')
            net = slim.fully_connected(net, 4096, scope='fc_34')
            net = slim.dropout(net, keep_prob=keep_prob, is_training=is_training, scope='dropout_35')
            net = slim.fully_connected(net, self.output_size, activation_fn=None, scope='fc_36')
    return net
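2. The 7x7x30 output
In YOLO's final 7x7x30 output, 7x7 is the size of the final feature map; each position corresponds to one cell of the input image, as shown in Figure 2.
Figure 2. Information for the first box of a cell (from deepsystems.io)
Each cell corresponds to a 1x30 vector. The first 10 entries hold position and confidence information: each cell predicts two boxes, and each box is described by (x, y, w, h, c). The first box of the current cell is shown in Figure 2 and the second box in Figure 3.
Figure 3. Information for the second box of a cell (from deepsystems.io)
The last 20 entries encode class information. Multiplying these 20 class probabilities by the confidence of a box gives that box its own 20-dimensional class score vector, so every box of every cell has one, as shown in Figures 4 and 5.
Figure 4. Class scores of box 1 (from deepsystems.io)
Figure 5. Class scores of box 2 (from deepsystems.io)
The 7x7 cells therefore yield 49x2 = 98 class score vectors of size 20x1, as shown in Figure 6:
Figure 6. Class scores for the 7x7 cells (from deepsystems.io)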
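A minimal sketch of how one cell vector turns into class scores, assuming the per-cell layout described above ((x, y, w, h, c) for each of the two boxes, then 20 class probabilities). The helper `class_scores_for_cell` and the toy values are hypothetical. Note that the detection code later in these notes stores the network output in a different order (all class probabilities first, then all confidences, then all boxes).

```python
def class_scores_for_cell(cell_vector, num_boxes=2, num_classes=20):
    """Split one cell vector into its boxes and per-box class scores.

    Assumed layout: [x, y, w, h, c] for each box, followed by the class
    probabilities shared by all boxes of the cell.
    """
    boxes = [cell_vector[5 * b: 5 * b + 5] for b in range(num_boxes)]
    class_probs = cell_vector[5 * num_boxes: 5 * num_boxes + num_classes]
    # Class score of a box = box confidence * class probability
    scores = [[box[4] * p for p in class_probs] for box in boxes]
    return boxes, scores


# Toy cell: box 1 is confident (c = 0.8), box 2 is not (c = 0.1); class 3 is likely
cell = [0.5, 0.5, 0.2, 0.3, 0.8,
        0.4, 0.6, 0.1, 0.1, 0.1] + [0.0] * 20
cell[10 + 3] = 0.9

boxes, scores = class_scores_for_cell(cell)

# Repeat the toy cell over a 7x7 grid and collect every score vector
grid = [[cell for _ in range(7)] for _ in range(7)]
all_scores = [s for row in grid for c in row for s in class_scores_for_cell(c)[1]]
print(len(all_scores), len(all_scores[0]))  # 98 20
```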
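3. Detection
As shown in Figure 7, detection has four steps. First, the score of each box decides whether the box contains an object: any score below threshold1 (0.2) is set to zero. Next, the boxes are sorted by score in descending order. Then NMS (non-maximum suppression) filters the boxes further. Finally, the boxes whose score is still greater than 0 are displayed as the detection result.
Figure 7. YOLO detection pipeline (from deepsystems.io)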
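The thresholding, sorting and NMS steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's implementation: boxes are (x_center, y_center, w, h), the IoU threshold of 0.5 is an assumed value, and `filter_boxes` is a hypothetical helper that ignores class labels.

```python
def iou(a, b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    overlap_w = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    overlap_h = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(overlap_w, 0.0) * max(overlap_h, 0.0)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def filter_boxes(boxes, scores, score_threshold=0.2, iou_threshold=0.5):
    # Step 1: drop boxes whose score is below the score threshold
    kept = [(s, b) for s, b in zip(scores, boxes) if s >= score_threshold]
    # Step 2: sort by score, highest first
    kept.sort(key=lambda sb: sb[0], reverse=True)
    # Step 3: greedy NMS, keep a box only if it does not overlap a kept box too much
    result = []
    for score, box in kept:
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in result):
            result.append((score, box))
    return result


boxes = [(100, 100, 50, 50), (105, 100, 50, 50), (300, 300, 40, 40), (200, 200, 30, 30)]
scores = [0.9, 0.8, 0.6, 0.1]
result = filter_boxes(boxes, scores)
print([s for s, _ in result])  # [0.9, 0.6]
```

The second box is suppressed because it overlaps the first one heavily, and the last box is removed by the score threshold.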
Code:
import argparse
import os

# YOLONet and Detector are defined elsewhere in the project


def main():
    parser = argparse.ArgumentParser()
    # File name of the trained weights
    parser.add_argument('--weights', default="YOLO_v.ckpt-10750", type=str)  # YOLO_small.ckpt
    # Directory that holds the trained weights
    parser.add_argument('--weight_dir', default='output', type=str)
    parser.add_argument('--data_dir', default="data", type=str)
    parser.add_argument('--gpu', default='', type=str)
    args = parser.parse_args()
    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu  # GPU to use
    yolo = YOLONet(False)  # network structure
    # Weight path + weight file name
    weight_file = os.path.join(args.data_dir, args.weight_dir, args.weights)
    detector = Detector(yolo, weight_file)  # load the trained detector
    # Detect Image
    imname = './test/1.jpg'
    detector.image_detector(imname)


if __name__ == '__main__':
    main()
This is the code that runs the detector. The two key calls are:
detector = Detector(yolo, weight_file)
detector.image_detector(imname)
The main body of the detector:
import colorsys

import cv2
import numpy as np
import tensorflow as tf

# cfg is the project's configuration module (CLASSES, IMAGE_SIZE, ...)


class Detector(object):

    def __init__(self, net, weight_file):
        self.net = net
        self.weights_file = weight_file
        self.classes = cfg.CLASSES
        self.num_class = len(self.classes)
        self.image_size = cfg.IMAGE_SIZE
        self.cell_size = cfg.CELL_SIZE
        self.boxes_per_cell = cfg.BOXES_PER_CELL
        self.threshold = cfg.THRESHOLD  # score threshold
        self.iou_threshold = cfg.IOU_THRESHOLD  # IoU threshold
        # End of the class-probability section of the output vector
        self.boundary1 = self.cell_size * self.cell_size * self.num_class
        # End of the confidence section (each cell has boxes_per_cell boxes)
        self.boundary2 = self.boundary1 + self.cell_size * self.cell_size * self.boxes_per_cell
        self.sess = tf.Session()  # create the session
        self.sess.run(tf.global_variables_initializer())  # initialize the variables
        print('Restoring weights from: ' + self.weights_file)
        self.saver = tf.train.Saver()
        # Restore the trained weights from the weight file
        self.saver.restore(self.sess, self.weights_file)

    def draw_result(self, img, result):
        colors = self.random_colors(len(result))
        for i in range(len(result)):
            # result[i] = [class name, x_center, y_center, w, h, score]
            x = int(result[i][1])
            y = int(result[i][2])
            w = int(result[i][3] / 2)  # half width
            h = int(result[i][4] / 2)  # half height
            color = tuple([rgb * 255 for rgb in colors[i]])
            cv2.rectangle(img, (x - w, y - h), (x + w, y + h), color, 3)
            cv2.putText(img, result[i][0], (x - w - 3, y - h - 15), cv2.FONT_HERSHEY_SIMPLEX, 2, color, 2)
            print(result[i][0], ': %.2f%%' % (result[i][5] * 100))

    def detect(self, img):
        img_h, img_w, _ = img.shape  # image height and width
        inputs = cv2.resize(img, (self.image_size, self.image_size))  # resize to 448x448
        inputs = cv2.cvtColor(inputs, cv2.COLOR_BGR2RGB).astype(np.float32)
        inputs = (inputs / 255.0) * 2.0 - 1.0  # normalize pixel values to [-1, 1]
        inputs = np.reshape(inputs, (1, self.image_size, self.image_size, 3))
        # Feed the image to the network and get its output
        result = self.detect_from_cvmat(inputs)[0]
        # Map the detections back to the original image
        for i in range(len(result)):
            result[i][1] *= (1.0 * img_w / self.image_size)  # rescale the box to the original image size
            result[i][2] *= (1.0 * img_h / self.image_size)
            result[i][3] *= (1.0 * img_w / self.image_size)
            result[i][4] *= (1.0 * img_h / self.image_size)
        return result

    # Run detection on OpenCV Mat data
    def detect_from_cvmat(self, inputs):
        # Network output
        net_output = self.sess.run(self.net.logits, feed_dict={self.net.images: inputs})
        results = []
        for i in range(net_output.shape[0]):  # one output per image in the batch
            results.append(self.interpret_output(net_output[i]))  # decode, threshold and apply NMS
        return results

    def interpret_output(self, output):
        probs = np.zeros((self.cell_size, self.cell_size, self.boxes_per_cell, self.num_class))
        # Class probabilities, output[0:boundary1]: cell_size x cell_size x num_class
        class_probs = np.reshape(output[0:self.boundary1], (self.cell_size, self.cell_size, self.num_class))
        # Box confidences: cell_size x cell_size x boxes_per_cell
        scales = np.reshape(output[self.boundary1:self.boundary2], (self.cell_size, self.cell_size, self.boxes_per_cell))
        # cell_size x cell_size x boxes_per_cell x 4: the four coordinates of each bounding box
        boxes = np.reshape(output[self.boundary2:], (self.cell_size, self.cell_size, self.boxes_per_cell, 4))
        # Build the cell offsets in two steps:
        # step 1: reshape 14 x 7 -> 2 x 7 x 7
        # step 2: transpose 2 x 7 x 7 -> 7 x 7 x 2
        offset = np.transpose(np.reshape(np.array([np.arange(self.cell_size)] * self.cell_size * self.boxes_per_cell),
                                         [self.boxes_per_cell, self.cell_size, self.cell_size]), (1, 2, 0))  # 7 x 7 x 2
        boxes[:, :, :, 0] += offset  # x: add the column index of the cell
        boxes[:, :, :, 1] += np.transpose(offset, (1, 0, 2))  # y: add the row index of the cell
        boxes[:, :, :, :2] = 1.0 * boxes[:, :, :, 0:2] / self.cell_size
        boxes[:, :, :, 2:] = np.square(boxes[:, :, :, 2:])  # the network predicts sqrt(w) and sqrt(h)
        boxes *= self.image_size
        # Class score of each box = class probability * box confidence
        for i in range(self.boxes_per_cell):
            for j in range(self.num_class):
                probs[:, :, i, j] = np.multiply(class_probs[:, :, j], scales[:, :, i])
        filter_mat_probs = np.array(probs >= self.threshold, dtype='bool')
        filter_mat_boxes = np.nonzero(filter_mat_probs)  # entries at or above the score threshold
        boxes_filtered = boxes[filter_mat_boxes[0], filter_mat_boxes[1], filter_mat_boxes[2]]
        probs_filtered = probs[filter_mat_probs]
        classes_num_filtered = np.argmax(filter_mat_probs, axis=3)[filter_mat_boxes[0], filter_mat_boxes[1], filter_mat_boxes[2]]
        argsort = np.array(np.argsort(probs_filtered))[::-1]  # sort by score, highest first
        boxes_filtered = boxes_filtered[argsort]  # reorder the boxes accordingly
        probs_filtered = probs_filtered[argsort]  # reorder the scores accordingly
        classes_num_filtered = classes_num_filtered[argsort]
        for i in range(len(boxes_filtered)):
            if probs_filtered[i] == 0:
                continue
            for j in range(i + 1, len(boxes_filtered)):  # compute the IoU, then apply NMS
                if self.iou(boxes_filtered[i], boxes_filtered[j]) > self.iou_threshold:
                    probs_filtered[j] = 0.0
        filter_iou = np.array(probs_filtered > 0.0, dtype='bool')  # entries whose score is still > 0
        boxes_filtered = boxes_filtered[filter_iou]  # boxes
        probs_filtered = probs_filtered[filter_iou]  # scores
        classes_num_filtered = classes_num_filtered[filter_iou]  # classes of the remaining boxes
        result = []
        for i in range(len(boxes_filtered)):  # return the class and position of each remaining box
            result.append([self.classes[classes_num_filtered[i]], boxes_filtered[i][0], boxes_filtered[i][1],
                           boxes_filtered[i][2], boxes_filtered[i][3], probs_filtered[i]])
        return result

    # Compute the intersection over union (IoU) of two (x_center, y_center, w, h) boxes
    def iou(self, box1, box2):
        tb = min(box1[0] + 0.5 * box1[2], box2[0] + 0.5 * box2[2]) - \
            max(box1[0] - 0.5 * box1[2], box2[0] - 0.5 * box2[2])
        lr = min(box1[1] + 0.5 * box1[3], box2[1] + 0.5 * box2[3]) - \
            max(box1[1] - 0.5 * box1[3], box2[1] - 0.5 * box2[3])
        if tb < 0 or lr < 0:
            intersection = 0
        else:
            intersection = tb * lr
        return intersection / (box1[2] * box1[3] + box2[2] * box2[3] - intersection)

    def random_colors(self, N, bright=True):
        brightness = 1.0 if bright else 0.7
        hsv = [(i / N, 1, brightness) for i in range(N)]
        colors = list(map(lambda c: colorsys.hsv_to_rgb(*c), hsv))
        np.random.shuffle(colors)
        return colors

    # Video detection
    def camera_detector(self, cap, wait=30):
        while True:
            ret, frame = cap.read()
            if not ret:  # stop when no frame could be read
                break
            result = self.detect(frame)
            self.draw_result(frame, result)
            cv2.imshow('Camera', frame)
            # Call waitKey once per frame; press 'q' to quit
            if cv2.waitKey(wait) & 0xFF == ord('q'):
                break
        cap.release()
        cv2.destroyAllWindows()

    # Image detection
    def image_detector(self, imname, wait=0):
        image = cv2.imread(imname)
        result = self.detect(image)
        self.draw_result(image, result)
        cv2.imshow('Image', image)
        cv2.waitKey(wait)
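To summarize: first, the image is fed directly through the trained network to get an output; then the score threshold filters out low-scoring boxes; next, NMS filters the boxes further; finally, the results are scaled back to the original image size and displayed.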
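The coordinate decoding done in interpret_output and the rescaling done in detect can be illustrated on a single box. This is a sketch under assumptions: `decode_box` and `rescale_box` are hypothetical helpers, and the raw prediction is taken to be (x, y, sqrt(w), sqrt(h)) with x and y relative to the cell, which is how the code above treats it.

```python
def decode_box(pred, col, row, cell_size=7, image_size=448):
    """Convert one raw prediction of the cell at (row, col) into
    (x_center, y_center, w, h) in network-input pixels."""
    x, y, sqrt_w, sqrt_h = pred
    x_center = (x + col) / cell_size * image_size  # add the column offset
    y_center = (y + row) / cell_size * image_size  # add the row offset
    w = sqrt_w ** 2 * image_size                   # undo the square root
    h = sqrt_h ** 2 * image_size
    return [x_center, y_center, w, h]


def rescale_box(box, img_w, img_h, image_size=448):
    """Map a box from the network input size back to the original image."""
    sx = img_w / image_size
    sy = img_h / image_size
    return [box[0] * sx, box[1] * sy, box[2] * sx, box[3] * sy]


# A box predicted at the middle of the cell in row 2, column 3
box = decode_box((0.5, 0.5, 0.5, 0.5), col=3, row=2)
# Map it back to an original image that is 896 pixels wide and 448 pixels high
original = rescale_box(box, img_w=896, img_h=448)
print(box)
print(original)
```

In this example the decoded box is centered near (224, 160) with a size of about 112x112 in the 448x448 input, and its x coordinate and width double after rescaling to the wider image.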
4. Training
(1) Data processing:
This code lives in utils\pascal_voc.py and has two main parts, data and labels:
# Training batches
def next_batches(self, gt_labels, batch_size):
    # n x h x w x c
    images = np.zeros((batch_size, self.image_size, self.image_size, 3))
    # n x cell_size x cell_size x (num_class + 5): the label holds only one box per cell
    labels = np.zeros((batch_size, self.cell_size, self.cell_size, self.num_class + 5))
    count = 0
    while count < batch_size:
        # File name of the current sample
        imname = gt_labels[self.cursor]['imname']
        # Mirror flag
        flipped = gt_labels[self.cursor]['flipped']
        # Read the sample: resize -> BGR to RGB -> normalize -> optional mirror
        images[count, :, :, :] = self.image_read(imname, flipped)
        # Get the label
        labels[count, :, :, :] = gt_labels[self.cursor]['label']
        # Move on to the next sample
        count += 1
        self.cursor += 1
        # When the end of the dataset is reached,
        # shuffle the samples and start a new epoch
        if self.cursor >= len(gt_labels):
            np.random.shuffle(gt_labels)
            self.cursor = 0
            self.epoch += 1
    return images, labels
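Image reading is done in image_read; the labels come from gt_labels.
Reading an image involves four steps: resizing, converting the channel order from BGR to RGB, normalizing the pixel values, and mirroring.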
# Read a sample image through the OpenCV interface
def image_read(self, imname, flipped=False):
    image = cv2.imread(imname)
    # Resize so that all samples have the same size
    image = cv2.resize(image, (self.image_size, self.image_size))
    # Convert the channel order from BGR to RGB
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
    # Normalize the pixel values to [-1, 1]
    image = (image / 255.0) * 2.0 - 1.0
    # Mirror (horizontal flip)
    if flipped:
        image = image[:, ::-1, :]
    return image
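The normalization and mirroring steps of image_read can be checked on tiny inputs without OpenCV. This is only an illustration; `normalize_pixel` and `mirror` are hypothetical helpers that work on plain Python lists rather than image arrays.

```python
def normalize_pixel(value):
    """Map a pixel value from [0, 255] to [-1, 1]."""
    return (value / 255.0) * 2.0 - 1.0


def mirror(image):
    """Flip an image (given as a list of rows) left to right."""
    return [row[::-1] for row in image]


normalized = [normalize_pixel(v) for v in (0, 255)]
flipped = mirror([[1, 2, 3], [4, 5, 6]])
print(normalized)  # [-1.0, 1.0]
print(flipped)     # [[3, 2, 1], [6, 5, 4]]
```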
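Reading the labels involves: getting the paths of the training images, then reading the bounding box information from the annotation files and encoding it.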
# Load the sample labels
def load_labels(self, model):
    # Training
    if model == 'train':
        # Path of the sample data
        self.devkil_path = os.path.join(cfg.PASCAL_PATH, 'VOCdevkit')
        self.data_path = os.path.join(self.devkil_path, 'VOC2007')
        txtname = os.path.join(self.data_path, 'ImageSets', 'Main', 'trainval.txt')
    # Testing
    if model == 'test':
        self.devkil_path = os.path.join(cfg.PASCAL_PATH, 'VOCdevkit')
        self.data_path = os.path.join(self.devkil_path, 'VOC2007')
        txtname = os.path.join(self.data_path, 'ImageSets', 'Main', 'test.txt')
    # Read the sample names
    with open(txtname, 'r') as f:
        self.image_index = [x.strip() for x in f.readlines()]
    gt_labels = []
    for index in self.image_index:
        # Read and encode the bounding boxes; num is the number of objects in the sample
        label, num = self.load_pascal_annotation(index)
        if num == 0:
            continue
        imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
        gt_labels.append({'imname': imname, 'label': label, 'flipped': False})
    return gt_labels
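The bounding box information is read from the annotation file in load_pascal_annotation. First, the sample image and the bounding boxes of the objects in its annotation file are read. Next, the boxes are rescaled by the ratio between the actual image size and the size fed to the network (448x448). Finally, each bounding box is encoded according to where the object's center falls in the grid: the cell that contains the object's center is responsible for detecting that object. Each sample therefore corresponds to a 7x7x(num_class + 5) matrix.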
# Read the annotation of a sample
# Assumes: import xml.etree.ElementTree as ET
def load_pascal_annotation(self, index):
    imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
    # Read the sample image
    im = cv2.imread(imname)
    # Ratios that map original coordinates to the network input size
    h_ratio = 1.0 * self.image_size / im.shape[0]
    w_ratio = 1.0 * self.image_size / im.shape[1]
    # Sample label: cell_size x cell_size x (num_class + 5)
    # Each cell holds (num_class + 5) values:
    # 1 confidence + 4 coordinates + num_class class entries,
    # i.e. whether the cell contains an object, where it is, and which class it is
    label = np.zeros((self.cell_size, self.cell_size, self.num_class + 5))
    # Annotation file of the sample
    filename = os.path.join(self.data_path, 'Annotations', index + '.xml')
    # Parse the XML file
    tree = ET.parse(filename)
    # Get the object elements
    objs = tree.findall('object')
    for obj in objs:
        # Get the bndbox child element of the object (the bounding box)
        bbox = obj.find('bndbox')
        # Map the bbox from the original image to the network input size
        x1 = max(min((float(bbox.find('xmin').text)) * w_ratio, self.image_size), 0)
        y1 = max(min((float(bbox.find('ymin').text)) * h_ratio, self.image_size), 0)
        x2 = max(min((float(bbox.find('xmax').text)) * w_ratio, self.image_size), 0)
        y2 = max(min((float(bbox.find('ymax').text)) * h_ratio, self.image_size), 0)
        # Look up the index of the class name
        cls_ind = self.class_to_ind[obj.find('name').text.lower().strip()]
        # Center position, width and height
        boxes = [(x2 + x1) / 2.0, (y2 + y1) / 2.0, x2 - x1, y2 - y1]
        # Grid cell (in the cell_size x cell_size grid) that contains the box center
        x_ind = int(boxes[0] * self.cell_size / self.image_size)
        y_ind = int(boxes[1] * self.cell_size / self.image_size)
        # If the cell is already marked, an object is already assigned to it: skip
        if label[y_ind, x_ind, 0] == 1:
            continue
        # Mark the current cell
        label[y_ind, x_ind, 0] = 1  # confidence
        label[y_ind, x_ind, 1:5] = boxes  # coordinates
        label[y_ind, x_ind, 5 + cls_ind] = 1  # class
    return label, len(objs)
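The encoding in load_pascal_annotation can be traced for a single box with a small sketch. The helper `encode_box`, the 500x375 image size and the box coordinates are made up for illustration, and the clamping to the image bounds is left out for brevity.

```python
def encode_box(xmin, ymin, xmax, ymax, img_w, img_h, image_size=448, cell_size=7):
    """Rescale a ground-truth box to the network input size and find the
    grid cell that contains its center."""
    w_ratio = image_size / img_w
    h_ratio = image_size / img_h
    x1, x2 = xmin * w_ratio, xmax * w_ratio
    y1, y2 = ymin * h_ratio, ymax * h_ratio
    # Center position, width and height in network-input pixels
    box = [(x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1]
    # Column and row of the cell that contains the center
    x_ind = int(box[0] * cell_size / image_size)
    y_ind = int(box[1] * cell_size / image_size)
    return x_ind, y_ind, box


# Hypothetical 500x375 image with one object from (100, 75) to (300, 225)
x_ind, y_ind, box = encode_box(100, 75, 300, 225, img_w=500, img_h=375)
print(x_ind, y_ind)  # 2 2
print(box)
```

The label entry for this object would then be written to label[y_ind, x_ind], with the row index first, as in the code above.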
(2) Loss function
This part is in yolo\yolo_net.py:
if is_training:
    self.labels = tf.placeholder(tf.float32, [None, self.cell_size, self.cell_size, 5 + self.num_class])
    self.loss_layer(self.logits, self.labels)
    self.total_loss = tf.losses.get_total_loss()
    tf.summary.scalar('total_loss', self.total_loss)
The loss itself is defined in loss_layer:
# Define the loss layer
def loss_layer(self, predicts, labels, scope='loss_layer'):
    with tf.variable_scope(scope):
        # Class probabilities
        # tf.reshape(tensor, shape, name=None): reshape tensor to the given shape
        # boundary1 = cell_size x cell_size x num_class
        # N x (cell_size x cell_size x num_class) -> [N, cell_size, cell_size, num_class]
        predict_classes = tf.reshape(predicts[:, :self.boundary1], [self.batch_size, self.cell_size, self.cell_size, self.num_class])
        # Box confidence
        # [N, cell_size, cell_size, boxes_per_cell]
        predict_scales = tf.reshape(predicts[:, self.boundary1:self.boundary2], [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell])
        # Box coordinates (x, y, sqrt(w), sqrt(h)), with x and y relative to the cell
        # [N, cell_size, cell_size, boxes_per_cell, 4]
        predict_boxes = tf.reshape(predicts[:, self.boundary2:], [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell, 4])
        # Response (1 if the cell contains an object): batch_size x cell_size x cell_size x 1
        # [N, cell_size, cell_size, 1]
        response = tf.reshape(labels[:, :, :, 0], [self.batch_size, self.cell_size, self.cell_size, 1])
        # [N, cell_size, cell_size, 1, 4]
        boxes = tf.reshape(labels[:, :, :, 1:5], [self.batch_size, self.cell_size, self.cell_size, 1, 4])
        # tf.tile repeats a tensor:
        # tf.tile(raw, multiples=[a, b, c, d]) repeats raw a times along dimension 0,
        # b times along dimension 1, c times along dimension 2 and d times along dimension 3
        # [N, cell_size, cell_size, boxes_per_cell, 4]
        boxes = tf.tile(boxes, [1, 1, 1, self.boxes_per_cell, 1]) / self.image_size
        # labels has shape [N, cell_size, cell_size, 5 + num_class]
        # labels[:, :, :, 5:] is the class encoding
        classes = labels[:, :, :, 5:]
        # Constant of shape [cell_size, cell_size, boxes_per_cell]
        offset = tf.constant(self.offset, dtype=tf.float32)
        # [1, cell_size, cell_size, boxes_per_cell]
        offset = tf.reshape(offset, [1, self.cell_size, self.cell_size, self.boxes_per_cell])
        # [N, cell_size, cell_size, boxes_per_cell]
        offset = tf.tile(offset, [self.batch_size, 1, 1, 1])
        # Stacked shape: [4, N, cell_size, cell_size, boxes_per_cell]
        predict_boxes_tran = tf.stack([1. * (predict_boxes[:, :, :, :, 0] + offset) / self.cell_size,
                                       1. * (predict_boxes[:, :, :, :, 1] + tf.transpose(offset, (0, 2, 1, 3))) / self.cell_size,
                                       tf.square(predict_boxes[:, :, :, :, 2]),  # square sqrt(w) to recover w
                                       tf.square(predict_boxes[:, :, :, :, 3])])  # square sqrt(h) to recover h
        # Target shape: [batch_size, 7, 7, 2, 4]
        # tf.transpose(input, perm) permutes the dimensions of the input tensor
        # [4, N, cell_size, cell_size, boxes_per_cell] -> [N, cell_size, cell_size, boxes_per_cell, 4]
        predict_boxes_tran = tf.transpose(predict_boxes_tran, [1, 2, 3, 4, 0])
        # Compute the IoU (intersection over union) between predicted and ground-truth boxes
        iou_predict_truth = self.calc_iou(predict_boxes_tran, boxes)
        # calculate I tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
        # Take the maximum of iou_predict_truth along dimension 3 (the boxes of a cell);
        # the box with the highest IoU is responsible for the object
        object_mask = tf.reduce_max(iou_predict_truth, 3, keep_dims=True)
        object_mask = tf.cast((iou_predict_truth >= object_mask), tf.float32) * response
        # calculate no_I tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
        noobject_mask = tf.ones_like(object_mask, dtype=tf.float32) - object_mask
        # Convert the ground-truth boxes to the same parameterization as the predictions:
        # cell-relative x and y, and the square roots of w and h (the reason is given in the paper)
        # Stacked shape: (4, batch_size, 7, 7, 2)
        boxes_tran = tf.stack([1. * boxes[:, :, :, :, 0] * self.cell_size - offset,
                               1. * boxes[:, :, :, :, 1] * self.cell_size - tf.transpose(offset, (0, 2, 1, 3)),
                               tf.sqrt(boxes[:, :, :, :, 2]),
                               tf.sqrt(boxes[:, :, :, :, 3])])
        # (4, batch_size, 7, 7, 2) -> (batch_size, 7, 7, 2, 4)
        boxes_tran = tf.transpose(boxes_tran, [1, 2, 3, 4, 0])
        # class_loss: classification loss
        class_delta = response * (predict_classes - classes)
        class_loss = tf.reduce_mean(tf.reduce_sum(tf.square(class_delta), axis=[1, 2, 3]), name='class_loss') * self.class_scale
        # object_loss: confidence loss for boxes that are responsible for an object
        object_delta = object_mask * (predict_scales - iou_predict_truth)
        object_loss = tf.reduce_mean(tf.reduce_sum(tf.square(object_delta), axis=[1, 2, 3]), name='object_loss') * self.object_scale
        # noobject_loss: confidence loss for boxes without an object
        noobject_delta = noobject_mask * predict_scales
        noobject_loss = tf.reduce_mean(tf.reduce_sum(tf.square(noobject_delta), axis=[1, 2, 3]), name='noobject_loss') * self.noobject_scale
        # coord_loss: coordinate loss; coord_mask has shape (batch_size, 7, 7, 2, 1)
        coord_mask = tf.expand_dims(object_mask, 4)
        # Shape: (batch_size, 7, 7, 2, 4)
        boxes_delta = coord_mask * (predict_boxes - boxes_tran)
        coord_loss = tf.reduce_mean(tf.reduce_sum(tf.square(boxes_delta), axis=[1, 2, 3, 4]), name='coord_loss') * self.coord_scale
        # Register all the losses
        tf.losses.add_loss(class_loss)
        tf.losses.add_loss(object_loss)
        tf.losses.add_loss(noobject_loss)
        tf.losses.add_loss(coord_loss)
        # Log each loss to the summary
        tf.summary.scalar('class_loss', class_loss)
        tf.summary.scalar('object_loss', object_loss)
        tf.summary.scalar('noobject_loss', noobject_loss)
        tf.summary.scalar('coord_loss', coord_loss)
        tf.summary.histogram('boxes_delta_x', boxes_delta[:, :, :, :, 0])
        tf.summary.histogram('boxes_delta_y', boxes_delta[:, :, :, :, 1])
        tf.summary.histogram('boxes_delta_w', boxes_delta[:, :, :, :, 2])
        tf.summary.histogram('boxes_delta_h', boxes_delta[:, :, :, :, 3])
        tf.summary.histogram('iou', iou_predict_truth)
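How object_mask and noobject_mask select the boxes that enter the two confidence losses can be traced for a single cell in plain Python. This is a simplified sketch, not the project's code: `responsible_mask` and `confidence_losses` are hypothetical helpers, the loss weights are left out, and the IoU values are supplied by hand.

```python
def responsible_mask(ious, has_object):
    """Masks for the boxes of one cell.

    The box with the highest IoU is responsible for the object, but only
    if the cell actually contains an object (has_object is 1.0 or 0.0).
    """
    best = max(ious)
    object_mask = [float(iou >= best) * has_object for iou in ious]
    noobject_mask = [1.0 - m for m in object_mask]
    return object_mask, noobject_mask


def confidence_losses(confs, ious, has_object):
    """Unweighted object and no-object confidence losses of one cell."""
    object_mask, noobject_mask = responsible_mask(ious, has_object)
    # Responsible boxes should predict a confidence equal to their IoU
    object_loss = sum((m * (c - i)) ** 2 for m, c, i in zip(object_mask, confs, ious))
    # All other boxes should predict a confidence of zero
    noobject_loss = sum((m * c) ** 2 for m, c in zip(noobject_mask, confs))
    return object_loss, noobject_loss


masks = responsible_mask([0.3, 0.7], has_object=1.0)
print(masks)  # ([0.0, 1.0], [1.0, 0.0])
losses = confidence_losses(confs=[0.2, 0.9], ious=[0.3, 0.7], has_object=1.0)
print(losses)
```

Here the second box has the higher IoU, so it contributes (0.9 - 0.7)^2 to the object loss, while the first box contributes 0.2^2 to the no-object loss.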
This corresponds to the loss function defined in the YOLO v1 paper.
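For reference, the loss from the YOLO v1 paper can be written as follows, where S is the grid size, B the number of boxes per cell, hatted symbols are predictions, and the indicator with superscript obj is 1 when box j of cell i is responsible for an object. Roughly, the five terms correspond to coord_loss (first two lines), object_loss, noobject_loss and class_loss in the code above. The code additionally applies its own object_scale and class_scale weights and averages over the batch, so it is not a term-by-term copy of the formula.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The paper sets \lambda_{\text{coord}} = 5 and \lambda_{\text{noobj}} = 0.5.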
These are only my own study notes. If anything here is wrong, corrections are welcome. Thank you!
References:
Andrew Ng's deeplearning.ai course