[TensorFlow tf Muck-Digging Log] Note 5: YOLOv3 Implemented in TensorFlow


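The author's coding skills are so formidable that even the framework he uses appears to be one he wrote himself, so no versions for other frameworks have been released so far. I implemented YOLOv3 according to my own understanding of the author's YOLOv3 paper and all the earlier YOLO papers. (If I have misunderstood the papers, please point it out; I would be very grateful.)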



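YOLO is short for "you only look once". As the name suggests, the whole prediction process looks at the image only once. This differs from earlier object localization and recognition work, which typically looked at the image once to localize the objects and a second time to classify them. Intuitively, then, YOLO should be more efficient than those approaches, since it saves one pass. In practice YOLO is indeed astonishingly fast: the author achieved real-time object localization and recognition on a Titan GPU, recognizing video at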




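  • reader.py:
    • Holds the methods that read the dataset and the paths of the dataset's label files. The mini_batch grouping is done while the file names are being read
  • train.py:
    • Combines the various tools to train the network
  • eval.py:
    • Runs the trained YOLOv3
  • utils folder:
    • extract_labels.py:
      • labels_normalizer() reads the labels from the label paths passed in and converts them into the format our network needs
    • get_loss.py:
      • Brings together the three losses described by the YOLOv3 author and computes the batch loss
    • IOU.py:
      • IOU_calculator() computes the IOU from the predicted values and the target's label values
    • net.py:
      • Implements the core of YOLOv3, DarkNet-53
    • read_config.py:
      • Reads the parameters from the config file
    • select_things.py:
      • As the name suggests, its methods make selections, for example the size of a YOLOv3 scale and the check_point file that belongs to that scale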

Utils




From it I extract object_name, bdbox, xmin, ymin, xmax, ymax.

import numpy as np    # used by labels_normalizer() further down
from xml.dom.minidom import parse

def xml_extractor( dir ):
    DOMTree = parse( dir )
    collection = DOMTree.documentElement    # root node of the XML file
    file_name_xml = collection.getElementsByTagName( 'filename' )[0]
    objects_xml = collection.getElementsByTagName( 'object' )
    size_xml = collection.getElementsByTagName( 'size' )

    file_name = file_name_xml.childNodes[0].data

    for size in size_xml:
        width = size.getElementsByTagName( 'width' )[0]
        height = size.getElementsByTagName( 'height' )[0]

        width = width.childNodes[0].data
        height = height.childNodes[0].data

    objects = []
    for object_xml in objects_xml:
        object_name = object_xml.getElementsByTagName( 'name' )[0]
        bdbox = object_xml.getElementsByTagName( 'bndbox' )[0]
        xmin = bdbox.getElementsByTagName( 'xmin' )[0]
        ymin = bdbox.getElementsByTagName( 'ymin' )[0]
        xmax = bdbox.getElementsByTagName( 'xmax' )[0]
        ymax = bdbox.getElementsByTagName( 'ymax' )[0]

        object = ( object_name.childNodes[0].data,
                   xmin.childNodes[0].data,
                   ymin.childNodes[0].data,
                   xmax.childNodes[0].data,
                   ymax.childNodes[0].data )

        objects.append( object )

    return file_name, width, height, objects

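Only in labels_normalizer() do I actually convert the extracted labels into arrays. One thing to note: when creating the new array, be sure to add 1e-8 (a number close to 0), which prevents a zero denominator, and therefore a nan output, when the IOU is computed later. The VOC dataset annotates the corner coordinates of each target's diagonal, whereas we need the coordinates of the target's center and the width and height of the corresponding bounding box, so a little calculation is required.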
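The corner-to-center conversion described above can be checked in isolation. The following is a minimal sketch; the function name `corners_to_center` and the `x_ratio` / `y_ratio` parameters are assumptions made for illustration and are not part of the original code.

```python
def corners_to_center(xmin, ymin, xmax, ymax, x_ratio=1.0, y_ratio=1.0):
    """Convert a VOC corner box to (center_x, center_y, width, height).

    x_ratio and y_ratio rescale from the original image size to the
    network input size (target_size / original_size).
    """
    center_x = (xmin + xmax) / 2 * x_ratio
    center_y = (ymin + ymax) / 2 * y_ratio
    box_width = (xmax - xmin) * x_ratio
    box_height = (ymax - ymin) * y_ratio
    return center_x, center_y, box_width, box_height


# A 500x375 image resized to 416x416 (sizes chosen only for illustration).
box = corners_to_center(100, 50, 300, 250, x_ratio=416 / 500, y_ratio=416 / 375)
print(box)
```

With both ratios left at 1.0 the box (100, 50, 300, 250) becomes a 200 by 200 box centered at (200, 150).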


def labels_normalizer( batches_filenames, target_width, target_height, layerout_width, layerout_height ):

    class_map = {
        'person' : 5,
        'bird' : 6,
        'cat' : 7,
        'cow' : 8,
        'dog' : 9,
        'horse' : 10,
        'sheep' : 11,
        'aeroplane' : 12,
        'bicycle' : 13,
        'boat' : 14,
        'bus' : 15,
        'car' : 16,
        'motorbike' : 17,
        'train' : 18,
        'bottle' : 19,
        'chair' : 20,
        'diningtable' : 21,
        'pottedplant': 22,
        'sofa' : 23,
        'tvmonitor' : 24
    }

    height_width = []
    batches_labels = []
    for batch_filenames in batches_filenames:
        batch_labels = []
        for filename in batch_filenames:
            _, width, height, objects = xml_extractor( filename )
            width_preprotion = target_width / int( width )
            height_preprotion = target_height / int( height )
            label = np.add( np.zeros( [int( layerout_height ), int( layerout_width ), 255] ), 1e-8 )    # add 1e-8 so that a later IOU computation on this data never divides by zero and returns nan
            for object in objects:
                class_label = class_map[object[0]]
                xmin = float( object[1] )
                ymin = float( object[2] )
                xmax = float( object[3] )
                ymax = float( object[4] )
                x = ( 1.0 * xmax + xmin ) / 2 * width_preprotion    # x of the target's center
                y = ( 1.0 * ymax + ymin ) / 2 * height_preprotion    # y of the target's center
                bdbox_width = ( 1.0 * xmax - xmin ) * width_preprotion    # width of the target's bounding box
                bdbox_height = ( 1.0 * ymax - ymin ) * height_preprotion    # height of the target's bounding box
                flag_width = int( target_width ) / layerout_width    # number of input-image pixels one box covers horizontally
                flag_height = int( target_height ) / layerout_height    # number of input-image pixels one box covers vertically
                box_x = x // flag_width    # x index of the box that x belongs to
                box_y = y // flag_height    # y index of the box that y belongs to
                if box_x == layerout_width:    # a point on the right edge of the last box belongs to the last box (it would otherwise go to a box that does not exist)
                    box_x -= 1
                if box_y == layerout_height:    # a point on the bottom edge of the lowest box belongs to the lowest box (it would otherwise go to a box that does not exist)
                    box_y -= 1
                for i in range( 3 ):    # each box predicts 3 bdboxes
                    label[int( box_y ), int( box_x ), i * 25] = x    # point x
                    label[int( box_y ), int( box_x ), i * 25 + 1] = y    # point y
                    label[int( box_y ), int( box_x ), i * 25 + 2] = bdbox_width    # bdbox width
                    label[int( box_y ), int( box_x ), i * 25 + 3] = bdbox_height    # bdbox height
                    label[int( box_y ), int( box_x ), i * 25 + 4] = 1    # objectness
                    label[int( box_y ), int( box_x ), i * 25 + int( class_label )] = 0.9    # class label

            batch_labels.append( label )

        batches_labels.append( batch_labels )

    # batches_labels = np.array( batches_labels )

    return batches_labels
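The step in labels_normalizer() that decides which box (grid cell) owns a target can be tried on its own. This is only a sketch: `assign_cell` is a hypothetical helper, not a function from the repository, and it clamps with `min` where the code above uses two `if` statements.

```python
def assign_cell(center_x, center_y, image_width, image_height, grid_width, grid_height):
    """Return the (row, column) of the grid cell that contains the box center."""
    cell_width = image_width / grid_width      # input pixels per cell, horizontally
    cell_height = image_height / grid_height   # input pixels per cell, vertically
    column = int(center_x // cell_width)
    row = int(center_y // cell_height)
    # A center lying exactly on the right or bottom edge would index one
    # past the last cell, so clamp it back into the grid.
    column = min(column, grid_width - 1)
    row = min(row, grid_height - 1)
    return row, column


print(assign_cell(200, 150, 416, 416, 13, 13))
```

For a 416x416 input and a 13x13 grid each cell covers 32 pixels, so a center at (200, 150) lands in row 4, column 6.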



def calculate_loss( batch_inputs, batch_labels ):
    batch_loss = 0
    # for batch in range( batch_inputs.shape[0] ):
    for image_num in range( batch_inputs.shape[0] ):
        for y in range( batch_inputs.shape[1] ):
            for x in range( batch_inputs.shape[2] ):
                for i in range( 3 ):
                    pretect_x = batch_inputs[image_num][y][x][i * 25]
                    pretect_y = batch_inputs[image_num][y][x][i * 25 + 1]
                    pretect_width = batch_inputs[image_num][y][x][i * 25 + 2]
                    pretect_height = batch_inputs[image_num][y][x][i * 25 + 3]
                    pretect_objectness = batch_inputs[image_num][y][x][i * 25 + 4]
                    pretect_class = batch_inputs[image_num][y][x][i * 25 + 5 : i * 25 + 5 + 20]
                    label_x = batch_labels[image_num][y][x][i * 25]
                    label_y = batch_labels[image_num][y][x][i * 25 + 1]
                    label_width = batch_labels[image_num][y][x][i * 25 + 2]
                    label_height = batch_labels[image_num][y][x][i * 25 + 3]
                    label_objectness = batch_labels[image_num][y][x][i * 25 + 4]
                    label_class = batch_labels[image_num][y][x][i * 25 + 5 : i * 25 + 5 + 20]
                    IOU = get_IOU.IOU_calculator( tf.cast( pretect_x, tf.float32 ),
                                                  tf.cast( pretect_y, tf.float32 ),
                                                  tf.cast( pretect_width, tf.float32 ),
                                                  tf.cast( pretect_height, tf.float32 ),
                                                  tf.cast( label_x, tf.float32 ),
                                                  tf.cast( label_y, tf.float32 ),
                                                  tf.cast( label_width, tf.float32 ),
                                                  tf.cast( label_height, tf.float32 ) )
                    loss = class_loss( pretect_class,
                                       label_class ) + location_loss( pretect_x,
                                                                      pretect_y,
                                                                      pretect_width,
                                                                      pretect_height,
                                                                      label_x,
                                                                      label_y,
                                                                      label_width,
                                                                      label_height ) + objectness_loss( IOU, pretect_objectness, label_objectness )

                    batch_loss += loss
    return batch_loss
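The index arithmetic in calculate_loss() treats each cell's vector as three consecutive groups of 25 values (x, y, width, height, objectness, then 20 VOC class scores), so only the first 75 channels are read. A small sketch of that layout, where `split_cell` and the dictionary keys are illustrative names of my own:

```python
NUM_CLASSES = 20                   # PASCAL VOC
VALUES_PER_BOX = 5 + NUM_CLASSES   # x, y, width, height, objectness, classes


def split_cell(cell_vector, boxes_per_cell=3):
    """Split one grid cell's flat vector into one dictionary per predicted box."""
    boxes = []
    for i in range(boxes_per_cell):
        start = i * VALUES_PER_BOX
        x, y, width, height, objectness = cell_vector[start:start + 5]
        classes = cell_vector[start + 5:start + VALUES_PER_BOX]
        boxes.append({
            'x': x, 'y': y, 'width': width, 'height': height,
            'objectness': objectness, 'classes': classes,
        })
    return boxes


# Use the channel index as the value so the layout is easy to read off.
boxes = split_cell(list(range(75)))
print(boxes[1]['x'], boxes[1]['objectness'], boxes[1]['classes'][0])
```

Slicing whole tensors this way, instead of looping over every image, row, column and box, would also keep the TensorFlow graph much smaller, though that is a design suggestion rather than what the code above does.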


def objectness_loss( input, switch, l_switch, alpha = 0.5 ):
    '''
    Calculate the objectness loss

    :param input: input IOU
    :param switch: predicted objectness of this box
    :param l_switch: label objectness, 1 if a target is in this box, otherwise close to 0 (1e-8)
    :return: objectness_loss
    '''

    IOU_loss = tf.square( l_switch - input * switch )
    loss_max = tf.square( l_switch * 0.5 - input * switch )

    IOU_loss = tf.cond( IOU_loss < loss_max, lambda : tf.cast( 1e-8, tf.float32 ), lambda : IOU_loss )

    IOU_loss = tf.cond( l_switch < 1, lambda : IOU_loss * alpha, lambda : IOU_loss )

    return IOU_loss



def class_loss( inputs, labels ):
    classloss = tf.square( labels - inputs )
    loss_sum = tf.reduce_sum( classloss )

    return loss_sum



def location_loss( x, y, width, height, l_x, l_y, l_width, l_height, alpha = 5 ):
    point_loss = ( tf.square( l_x - x ) + tf.square( l_y - y ) ) * alpha
    size_loss = ( tf.square( tf.sqrt( l_width ) - tf.sqrt( width ) ) + tf.square( tf.sqrt( l_height ) - tf.sqrt( height ) ) ) * alpha

    location_loss = point_loss + size_loss

    return location_loss
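To see what the square roots in location_loss() do, here is a plain-Python version of the same formula that runs without TensorFlow. The name `location_loss_py` is my own, and the numbers are made up for illustration.

```python
import math


def location_loss_py(x, y, width, height, l_x, l_y, l_width, l_height, alpha=5):
    """Plain-Python counterpart of location_loss() for quick experiments."""
    point_loss = ((l_x - x) ** 2 + (l_y - y) ** 2) * alpha
    size_loss = ((math.sqrt(l_width) - math.sqrt(width)) ** 2
                 + (math.sqrt(l_height) - math.sqrt(height)) ** 2) * alpha
    return point_loss + size_loss


# The same 10-pixel width error on a small box and on a large box.
small = location_loss_py(0, 0, 10, 10, 0, 0, 20, 10)
large = location_loss_py(0, 0, 100, 10, 0, 0, 110, 10)
print(small, large)
```

Because the sizes are compared after taking the square root, the same absolute error costs more on the small box than on the large one, which matches the motivation given in the YOLO papers.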




def IOU_calculator( x, y, width, height, l_x, l_y, l_width, l_height ):
    '''
    Calculate IOU

    :param x: net predicted x
    :param y: net predicted y
    :param width: net predicted width
    :param height: net predicted height
    :param l_x: label x
    :param l_y: label y
    :param l_width: label width
    :param l_height: label height
    :return: IOU
    '''

    x_max = calculate_max( x , width / 2 )
    y_max = calculate_max( y, height / 2 )
    x_min = calculate_min( x, width / 2 )
    y_min = calculate_min( y, height / 2 )

    # the label box must be built from the label's own width and height
    l_x_max = calculate_max( l_x, l_width / 2 )
    l_y_max = calculate_max( l_y, l_height / 2 )
    l_x_min = calculate_min( l_x, l_width / 2 )
    l_y_min = calculate_min( l_y, l_height / 2 )

    '''--------Calculate the overlapping area's corner points--------'''
    xend = tf.minimum( x_max, l_x_max )
    xstart = tf.maximum( x_min, l_x_min )

    yend = tf.minimum( y_max, l_y_max )
    ystart = tf.maximum( y_min, l_y_min )

    area_width = xend - xstart
    area_height = yend - ystart

    '''--------Calculate the IOU--------'''
    area = area_width * area_height

    all_area = tf.cond( ( width * height + l_width * l_height - area ) <= 0, lambda : tf.cast( 1e-8, tf.float32 ), lambda : ( width * height + l_width * l_height - area ) )

    IOU = area / all_area

    IOU = tf.cond( area_width < 0, lambda : tf.cast( 1e-8, tf.float32 ), lambda : IOU )
    IOU = tf.cond( area_height < 0, lambda : tf.cast( 1e-8, tf.float32 ), lambda : IOU )

    return IOU
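The IOU logic can be verified with a few hand-checkable boxes using a plain-Python sketch. `iou_center` is a hypothetical helper; unlike IOU_calculator() above it returns 0.0 rather than 1e-8 when the boxes do not overlap.

```python
def iou_center(box_a, box_b):
    """IOU of two boxes given as (center_x, center_y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (negative if disjoint).
    inter_w = min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2)
    inter_h = min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union


print(iou_center((5, 5, 10, 10), (10, 5, 10, 10)))
```

Two 10x10 boxes shifted by half a width overlap in a 5x10 strip, so the intersection is 50, the union is 150 and the IOU is one third.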



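The activation function I use is Leaky ReLU. The TensorFlow version I was using did not seem to provide one (newer releases include tf.nn.leaky_relu), so I wrote my own. In essence it is a ReLU whose gradient is non-zero on the negative side of x.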

def Leaky_Relu( input, alpha = 0.01 ):
    output = tf.maximum( input, tf.multiply( input, alpha ) )

    return output
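The max( input, alpha * input ) trick works for any alpha between 0 and 1: for positive inputs the input itself is larger, and for negative inputs the scaled value is larger. A scalar sketch in plain Python (the name `leaky_relu_scalar` is mine):

```python
def leaky_relu_scalar(x, alpha=0.01):
    """Leaky ReLU for a single number, written the same way as Leaky_Relu()."""
    return max(x, alpha * x)


for value in (-2.0, 0.0, 3.0):
    print(value, leaky_relu_scalar(value))
```

Note that the trick relies on alpha being smaller than 1; with a larger alpha the two branches would swap.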

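I declared two kinds of convolution functions. One applies batch_normalization and Leaky ReLU right after the convolution and carries on. The other applies batch_normalization and Leaky ReLU after the convolution, then adds the residual network's shortcut and passes the result through Leaky ReLU once more.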

def Res_conv2d( inputs, shortcut, filters, shape, stride = ( 1, 1 ) ):
    conv = conv2d( inputs, filters, shape )
    Res = Leaky_Relu( conv + shortcut )

    return Res

def conv2d( inputs, filters, shape, stride = ( 1, 1 ) ):
    layer = tf.layers.conv2d( inputs,
                              filters,
                              shape,
                              stride,
                              padding = 'SAME',
                              kernel_initializer=tf.truncated_normal_initializer( stddev=0.01 ) )

    layer = tf.layers.batch_normalization( layer, training = True )

    layer = Leaky_Relu( layer )

    return layer


def feature_extractor( inputs ):
    layer = conv2d( inputs, 32, [3, 3] )
    layer = conv2d( layer, 64, [3, 3], ( 2, 2 ) )
    shortcut = layer

    layer = conv2d( layer, 32, [1, 1] )
    layer = Res_conv2d( layer, shortcut, 64, [3, 3] )

    layer = conv2d( layer, 128, [3, 3], ( 2, 2 ) )
    shortcut = layer

    for _ in range( 2 ):
        layer = conv2d( layer, 64, [1, 1] )
        layer = Res_conv2d( layer, shortcut, 128, [3, 3] )

    layer = conv2d( layer, 256, [3, 3], ( 2, 2 ) )
    shortcut = layer

    for _ in range( 8 ):
        layer = conv2d( layer, 128, [1, 1] )
        layer = Res_conv2d( layer, shortcut, 256, [3, 3] )
    pre_scale3 = layer

    layer = conv2d( layer, 512, [3, 3], ( 2, 2 ) )
    shortcut = layer

    for _ in range( 8 ):
        layer = conv2d( layer, 256, [1, 1] )
        layer = Res_conv2d( layer, shortcut, 512, [3, 3] )
    pre_scale2 = layer

    layer = conv2d( layer, 1024, [3, 3], ( 2, 2 ) )
    shortcut = layer

    for _ in range( 4 ):
        layer = conv2d( layer, 512, [1, 1] )
        layer = Res_conv2d( layer, shortcut, 1024, [3, 3] )
    pre_scale1 = layer

    return pre_scale1, pre_scale2, pre_scale3

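The author says that the features taken from the middle of the network for scale2 and scale3 go through a 2x operation. My understanding is to resize the output directly, as if it were an image.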

def get_layer2x( layer_final, pre_scale ):
    layer2x = tf.image.resize_images(layer_final,
                                     [2 * tf.shape(layer_final)[1], 2 * tf.shape(layer_final)[2]])
    layer2x_add = tf.concat( [layer2x, pre_scale], 3 )

    return layer2x_add
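To make the 2x step and the concatenation concrete, here is a sketch on plain Python lists. It uses nearest-neighbor repetition, which is a simplification I chose for readability; tf.image.resize_images defaults to bilinear interpolation, so the values it produces differ. Both helper names are hypothetical.

```python
def upsample2x_nearest(grid):
    """Double the height and width of a 2-D list by repeating each value."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def concat_shape(upsampled_shape, skip_shape):
    """Shape after concatenating two (height, width, channels) maps on channels."""
    assert upsampled_shape[:2] == skip_shape[:2], "spatial sizes must match"
    return upsampled_shape[:2] + (upsampled_shape[2] + skip_shape[2],)


print(upsample2x_nearest([[1, 2], [3, 4]]))
# Assuming a 416x416 input: a 13x13x256 map upsampled to 26x26x256,
# concatenated with the 26x26x512 pre_scale2 map.
print(concat_shape((26, 26, 256), (26, 26, 512)))
```

The concatenation only works because the upsampled map and the earlier feature map have the same height and width; only the channel counts add up.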


def scales( layer, pre_scale2, pre_scale3 ):
    layer_copy = layer
    layer = conv2d( layer, 512, [1, 1] )
    layer = conv2d( layer, 1024, [3, 3] )
    layer = conv2d(layer, 512, [1, 1])
    layer_final = layer
    layer = conv2d(layer, 1024, [3, 3])

    scale_1 = conv2d( layer, 255, [1, 1] )

    layer = conv2d( layer_final, 256, [1, 1] )
    layer = get_layer2x( layer, pre_scale2 )

    layer = conv2d( layer, 256, [1, 1] )
    layer= conv2d( layer, 512, [3, 3] )
    layer = conv2d( layer, 256, [1, 1] )
    layer = conv2d( layer, 512, [3, 3] )
    layer = conv2d( layer, 256, [1, 1] )
    layer_final = layer
    layer = conv2d( layer, 512, [3, 3] )
    scale_2 = conv2d( layer, 255, [1, 1] )

    layer = conv2d( layer_final, 128, [1, 1] )
    layer = get_layer2x( layer, pre_scale3 )

    for _ in range( 3 ):
        layer = conv2d( layer, 128, [1, 1] )
        layer = conv2d( layer, 256, [3, 3] )
    scale_3 = conv2d( layer, 255, [1, 1] )

    scale_1 = tf.abs( scale_1 )
    scale_2 = tf.abs( scale_2 )
    scale_3 = tf.abs( scale_3 )

    return scale_1, scale_2, scale_3
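Since feature_extractor() downsamples five times with stride 2 and each later scale is upsampled by 2, the three outputs have grids of 1/32, 1/16 and 1/8 of the input size. The sketch below assumes an input size divisible by 32; `grid_size` is a hypothetical helper and may differ from what select_things.select_scale() actually does.

```python
def grid_size(image_width, image_height, scale):
    """Grid (width, height) of the given output scale, for sizes divisible by 32."""
    strides = {1: 32, 2: 16, 3: 8}   # input pixels covered by one cell
    stride = strides[scale]
    return image_width // stride, image_height // stride


for scale in (1, 2, 3):
    print(scale, grid_size(416, 416, scale))
```

For a 416x416 input this gives 13x13, 26x26 and 52x52 grids, which are the layerout_width and layerout_height values the labels have to match.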


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument( '-c', '--conf', default = './config/eval_config.yml', help = 'the path to the eval_config file' )
    return parser.parse_args()

def main( FLAGS ):
    if not os.path.exists( FLAGS.save_dir ):
        os.makedirs( FLAGS.save_dir )

    input_image = reader.get_image( FLAGS.image_dir, FLAGS.image_width, FLAGS.image_height )
    output_image = np.copy( input_image )

    '''--------Create placeholder--------'''
    image = net.create_eval_placeholder( FLAGS.image_width, FLAGS.image_height )

    pre_scale1, pre_scale2, pre_scale3 = net.feature_extractor( image )
    scale1, scale2, scale3 = net.scales( pre_scale1, pre_scale2, pre_scale3 )

    with tf.Session() as sess:
        saver = tf.train.Saver()
        save_path = select_things.select_checkpoint( FLAGS.scale )
        last_checkpoint = tf.train.latest_checkpoint( save_path, 'checkpoint' )
        if last_checkpoint:
            saver.restore(sess, last_checkpoint)
            print( 'Success load model from: ', format( last_checkpoint ) )
        else:
            print( 'Model has not trained' )

        start_time = time.time()
        scale1, scale2, scale3 = sess.run( [scale1, scale2, scale3], feed_dict = {image: [output_image]} )

    if FLAGS.scale == 1:
        scale = scale1
    if FLAGS.scale == 2:
        scale = scale2
    if FLAGS.scale == 3:
        scale = scale3

    boxes_labels = eval_uitls.label_extractor( scale[0] )

    bdboxes = eval_uitls.get_bdboxes( boxes_labels )

    for bdbox in bdboxes:
        font = cv2.FONT_HERSHEY_SIMPLEX
        output_image = cv2.rectangle( output_image,
                                      ( int( bdbox[0] - bdbox[2] / 2 ), int( bdbox[1] - bdbox[3] / 2 ) ),
                                      ( int( bdbox[0] + bdbox[2] / 2 ), int( bdbox[1] + bdbox[3] / 2 ) ),
                                      ( 200, 0, 0 ),
                                      1 )
        # output_image = cv2.putText( output_image, bdbox[5],
        #                             ( bdbox[0] - bdbox[2] / 2, bdbox[1] - bdbox[3] / 2 ),
        #                             1.2,
        #                             (0, 255, 0),
        #                             2 )
    # output_image = np.multiply( output_image, 255 )

    generate_image = FLAGS.save_dir + '/res.jpg'
    if not os.path.exists( FLAGS.save_dir ):
        os.makedirs( FLAGS.save_dir )

    cv2.imwrite( generate_image, output_image )    # encode as JPEG; writing the raw array bytes would not produce a valid image file
    end_time = time.time()

    print( 'Use time: ', end_time - start_time )

    plt.imshow( output_image )
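The two corner points passed to cv2.rectangle() in main() are derived from the predicted center and size. A tiny sketch of that conversion, with `center_to_corners` as a hypothetical helper name:

```python
def center_to_corners(center_x, center_y, width, height):
    """Return the integer (top_left, bottom_right) points of a center-format box."""
    top_left = (int(center_x - width / 2), int(center_y - height / 2))
    bottom_right = (int(center_x + width / 2), int(center_y + height / 2))
    return top_left, bottom_right


print(center_to_corners(200, 150, 200, 200))
```

This is the inverse of the corner-to-center conversion done when the labels are prepared, apart from the truncation to whole pixels.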


import tensorflow as tf
import numpy as np
import os
import argparse
import time
import utils.read_config as read_config

from utils import net, read_config, get_loss, IOU, extract_labels, select_things
import reader

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument( '-c', '--conf', default = './config/config.yml', help = 'the path to the config file' )
    return parser.parse_args()

def main( FLAGS ):

    scale_width, scale_height = select_things.select_scale( FLAGS.scale, FLAGS.width, FLAGS.height )
    '''--------Create placeholder--------'''
    datas, labels = net.create_placeholder( FLAGS.batch_size, FLAGS.width, FLAGS.height, scale_width, scale_height )

    pre_scale1, pre_scale2, pre_scale3 = net.feature_extractor( datas )
    scale1, scale2, scale3 = net.scales( pre_scale1, pre_scale2, pre_scale3 )

    '''--------get labels_filenames and datas_filenames--------'''
    datas_filenames = reader.images( FLAGS.batch_size, FLAGS.datas_path )
    labels_fienames = reader.labels( FLAGS.batch_size, FLAGS.labels_path )
    normalize_labels = extract_labels.labels_normalizer( labels_fienames,
                                                          FLAGS.width,
                                                          FLAGS.height,
                                                          scale_width,
                                                          scale_height )

    '''---------partition the train data and val data--------'''
    train_filenames = datas_filenames[: int( len( datas_filenames ) * 0.9 )]
    train_labels = normalize_labels[: int( len( normalize_labels ) * 0.9 )]
    val_filenames = datas_filenames[int( len( datas_filenames ) * 0.9 ) :]    # the remaining 10%, no overlap with the training batches
    val_labels = normalize_labels[int( len( normalize_labels ) * 0.9 ) :]

    '''--------calculate loss--------'''
    if FLAGS.scale == 1:
        loss = get_loss.calculate_loss( scale1, labels )

    if FLAGS.scale == 2:
        loss = get_loss.calculate_loss( scale2, labels )

    if FLAGS.scale == 3:
        loss = get_loss.calculate_loss( scale3, labels )

    optimizer = tf.train.AdamOptimizer( learning_rate=FLAGS.learning_rate ).minimize( loss )

    tf.summary.scalar( 'epoch_loss', loss )

    merged = tf.summary.merge_all()

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        saver = tf.train.Saver()
        save_path = select_things.select_checkpoint( FLAGS.scale )
        last_checkpoint = tf.train.latest_checkpoint( save_path, 'checkpoint' )
        if last_checkpoint:
            saver.restore( sess, last_checkpoint )
            print( 'Reuse model' )
        else:
            sess.run( init )    # only initialize from scratch when there is no checkpoint to restore

        for epoch in range( FLAGS.epoch ):
            epoch_loss = 0.0
            for i in range( len( train_filenames ) ):
                normalize_datas = []
                for data_filename in train_filenames[i]:
                    image = reader.get_image( data_filename, FLAGS.width, FLAGS.height )
                    image = np.array( image, np.float32 )

                    normalize_datas.append( image )

                normalize_datas = np.array( normalize_datas )

                _, batch_loss = sess.run( [optimizer, loss], feed_dict = {datas: normalize_datas, labels: train_labels[i]} )

                epoch_loss += batch_loss    # accumulate over the batches ( '=+' would overwrite instead )

            if epoch % 10 == 0:
                print( 'Cost after epoch %i: %f' % ( epoch, epoch_loss ) )

            if epoch % 50 == 0:
                val_loss = 0.0
                for i in range( len( val_filenames ) ):
                    normalize_datas = []
                    for val_filename in val_filenames[i]:
                        image = reader.get_image( val_filename, FLAGS.width, FLAGS.height )
                        image = np.array( image, np.float32 )
                        image = np.divide( image, 255 )

                        normalize_datas.append( image )

                    normalize_datas = np.array( normalize_datas )

                    batch_loss = sess.run( loss, feed_dict = {datas: normalize_datas, labels: val_labels[i]} )

                    val_loss += batch_loss    # accumulate over the batches ( '=+' would overwrite instead )

                print( 'VAL_Cost after epoch %i: %f' %( epoch, val_loss ) )
                saver.save( sess, save_path, global_step = epoch )

if __name__ == '__main__':
    args = parse_args()
    FLAGS = read_config.read_config_file( args.conf )
    main( FLAGS )
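The 90/10 partition of the batches in main() is easy to get wrong with slice arithmetic, so here is a small sketch of it. `split_batches` is a hypothetical helper, shown on a list of integers standing in for batches of file names.

```python
def split_batches(batches, train_fraction=0.9):
    """Split a list of batches into (train, val) without any overlap."""
    cut = int(len(batches) * train_fraction)
    return batches[:cut], batches[cut:]


train, val = split_batches(list(range(10)))
print(len(train), len(val))
```

Using one cut index for both slices guarantees that every batch ends up in exactly one of the two sets.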