
MindInsight Training Optimization Process Visualization

Neural network training is essentially the optimization of a high-dimensional non-convex function, whose minima are generally found via gradient descent (as shown in Figure 1). A typical neural network has tens of thousands or even hundreds of thousands of parameters, so its optimization terrain is hard to display directly in three-dimensional space. With this feature, users can visualize the optimization space around a neural network's training path through direction-based dimensionality reduction and landscape computation.
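
To make the idea concrete, the sketch below shows the underlying technique in a framework-agnostic form: pick two direction vectors in parameter space, perturb the trained weights over a grid of coefficients along those directions, and evaluate the loss at every grid point. This is only an illustration; the names loss_fn and theta are hypothetical, and MindInsight's actual implementation (which derives its directions from the checkpoints saved during training) differs in detail.

import numpy as np

def loss_landscape_2d(loss_fn, theta, resolution=10, span=1.0, seed=0):
    """Evaluate loss_fn on a resolution x resolution grid around theta."""
    rng = np.random.default_rng(seed)
    # Two random directions, rescaled to the norm of theta so the
    # perturbation scale is comparable to the trained weights.
    d1 = rng.standard_normal(theta.shape)
    d2 = rng.standard_normal(theta.shape)
    d1 *= np.linalg.norm(theta) / np.linalg.norm(d1)
    d2 *= np.linalg.norm(theta) / np.linalg.norm(d2)

    coords = np.linspace(-span, span, resolution)
    grid = np.empty((resolution, resolution))
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            grid[i, j] = loss_fn(theta + a * d1 + b * d2)
    return coords, grid

# Toy check: a quadratic "loss" over a 100-dimensional parameter vector.
coords, grid = loss_landscape_2d(lambda w: float(np.sum(w ** 2)), np.ones(100))
print(grid.shape)  # (10, 10)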

Usage Steps

The procedure consists of two steps. As an example, we use LeNet on a classification task with the MNIST dataset; the sample code is as follows:

  1. Training data collection: during training, use SummaryCollector to collect the forward-network weights of the model at multiple points, together with the parameters needed for landscape drawing (such as the intervals to draw and the landscape resolution):
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as CV
import mindspore.dataset.transforms.c_transforms as C
from mindspore.dataset.vision import Inter
from mindspore import dtype as mstype
import mindspore.nn as nn

from mindspore.common.initializer import Normal
from mindspore import set_context, GRAPH_MODE
from mindspore import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor, SummaryCollector
from mindspore import Model
from mindspore.nn import Accuracy
from mindspore import set_seed

set_seed(1)

def create_dataset(data_path, batch_size=32, repeat_size=1,
                   num_parallel_workers=1):
    """
    create dataset for train or test
    """
    # define dataset
    mnist_ds = ds.MnistDataset(data_path, shuffle=False)

    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0
    rescale_nml = 1 / 0.3081
    shift_nml = -1 * 0.1307 / 0.3081

    # define map operations
    resize_op = CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR)  # Bilinear mode
    rescale_nml_op = CV.Rescale(rescale_nml, shift_nml)
    rescale_op = CV.Rescale(rescale, shift)
    hwc2chw_op = CV.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.int32)

    # apply map operations on images
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=resize_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_nml_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=hwc2chw_op, input_columns="image", num_parallel_workers=num_parallel_workers)

    # apply DatasetOps
    buffer_size = 10000
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)  # 10000 as in LeNet train script
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    mnist_ds = mnist_ds.repeat(repeat_size)

    return mnist_ds

class LeNet5(nn.Cell):
    """
    Lenet network

    Args:
        num_class (int): Number of classes. Default: 10.
        num_channel (int): Number of channels. Default: 1.

    Returns:
        Tensor, output tensor
    Examples:
    LeNet(num_class=10)

    """
    def __init__(self, num_class=10, num_channel=1, include_top=True):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid', weight_init=Normal(0.02))
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid', weight_init=Normal(0.02))
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.include_top = include_top
        if self.include_top:
            self.flatten = nn.Flatten()
            self.fc1 = nn.Dense(16 * 5 * 5, 120)
            self.fc2 = nn.Dense(120, 84)
            self.fc3 = nn.Dense(84, num_class)

    def construct(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        if not self.include_top:
            return x
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def train_lenet():
    set_context(mode=GRAPH_MODE, device_target="GPU")
    data_path = YOUR_DATA_PATH  # replace with the path to your MNIST training data
    ds_train = create_dataset(data_path)

    network = LeNet5(10)
    net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    net_opt = nn.Momentum(network.trainable_params(), 0.01, 0.9)
    time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())
    config_ck = CheckpointConfig(save_checkpoint_steps=1875, keep_checkpoint_max=10)
    ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet", config=config_ck)
    model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()})
    summary_dir = "./summary/lenet_test2"
    interval_1 = [x for x in range(1, 4)]
    interval_2 = [x for x in range(7, 11)]
    # Collect landscape information
    summary_collector = SummaryCollector(
        summary_dir,
        keep_default_action=True,
        collect_specified_data={
            'collect_landscape': {
                'landscape_size': 10,
                'unit': "epoch",
                'create_landscape': {'train': True, 'result': True},
                'num_samples': 512,
                'intervals': [interval_1, interval_2],
            }
        },
        collect_freq=1)

    print("============== Starting Training ==============")
    model.train(10, ds_train, callbacks=[time_cb, ckpoint_cb, LossMonitor(), summary_collector])

if __name__ == "__main__":
    train_lenet()

summary_dir sets the save path for the collected parameters, and summary_collector is the initialized SummaryCollector instance. Within collect_specified_data, the collect_landscape entry is a dictionary holding every parameter setting needed to draw the landscape:
    • landscape_size: the resolution of the landscape. 10 means a 10×10 grid. The higher the resolution, the finer the landscape texture, and the longer the computation takes. Default: 40.
    • unit: the interval unit for saving parameters during training, either epoch or step. When using step, dataset_sink_mode=False must be set in model.train (see the sketch after this list). Default: step.
    • create_landscape: how the landscape is drawn. Currently the training-process landscape (with training trajectory) and the training-result landscape (without trajectory) are supported. Default: {'train': True, 'result': True}.
    • num_samples: the number of dataset samples used to draw the landscape. 512 means 512 samples are used. The more samples, the more accurate the landscape, and the longer the computation takes. Default: 2048.
    • intervals: the intervals over which landscapes are drawn. Here interval_1 draws the landscape (with training trajectory) for epochs 1-3, and interval_2 for epochs 7-10, matching the ranges defined in the code.
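
For example, when parameters are collected per step rather than per epoch, data sinking must be disabled in model.train. A minimal sketch, reusing summary_dir, model, and ds_train from the training script above and assuming the interval values are then given in steps:

# Per-step collection: dataset_sink_mode must be False in model.train.
summary_collector = SummaryCollector(
    summary_dir,
    collect_specified_data={
        'collect_landscape': {'unit': "step",
                              'intervals': [list(range(1, 101))]}})  # steps 1-100, illustrative
model.train(10, ds_train, callbacks=[summary_collector], dataset_sink_mode=False)
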
  2. Landscape drawing: using the model parameters saved during training, and with the same model and dataset as in training, launch a new script that generates the landscape information with forward computation only; no retraining is required. (Drawing supports single-device or multi-device parallel computation.)
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as CV
import mindspore.dataset.transforms.c_transforms as C
from mindspore.dataset.vision import Inter
from mindspore import dtype as mstype
import mindspore.nn as nn

from mindspore.common.initializer import Normal
from mindspore import Model
from mindspore.nn import Loss
from mindspore import SummaryLandscape

def create_dataset(data_path, batch_size=32, repeat_size=1,
                   num_parallel_workers=1):
    """
    create dataset for train or test
    """
    # define dataset
    mnist_ds = ds.MnistDataset(data_path, shuffle=False)

    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0
    rescale_nml = 1 / 0.3081
    shift_nml = -1 * 0.1307 / 0.3081

    # define map operations
    resize_op = CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR)  # Bilinear mode
    rescale_nml_op = CV.Rescale(rescale_nml, shift_nml)
    rescale_op = CV.Rescale(rescale, shift)
    hwc2chw_op = CV.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.int32)

    # apply map operations on images
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=resize_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_nml_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=hwc2chw_op, input_columns="image", num_parallel_workers=num_parallel_workers)

    # apply DatasetOps
    buffer_size = 10000
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)  # 10000 as in LeNet train script
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    mnist_ds = mnist_ds.repeat(repeat_size)

    return mnist_ds

class LeNet5(nn.Cell):
    """
    Lenet network

    Args:
        num_class (int): Number of classes. Default: 10.
        num_channel (int): Number of channels. Default: 1.

    Returns:
        Tensor, output tensor
    Examples:
    LeNet(num_class=10)

    """
    def __init__(self, num_class=10, num_channel=1, include_top=True):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid', weight_init=Normal(0.02))
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid', weight_init=Normal(0.02))
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.include_top = include_top
        if self.include_top:
            self.flatten = nn.Flatten()
            self.fc1 = nn.Dense(16 * 5 * 5, 120)
            self.fc2 = nn.Dense(120, 84)
            self.fc3 = nn.Dense(84, num_class)

    def construct(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        if not self.include_top:
            return x
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def callback_fn():
    network = LeNet5(10)
    net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    metrics = {"Loss": Loss()}
    model = Model(network, net_loss, metrics=metrics)
    data_path = YOUR_DATA_PATH  # replace with the path to your MNIST data
    ds_eval = create_dataset(data_path)
    return model, network, ds_eval, metrics

if __name__ == "__main__":
    interval_1 = [x for x in range(1, 4)]
    interval_2 = [x for x in range(7, 11)]
    summary_landscape = SummaryLandscape('./summary/lenet_test2')
    # generate loss landscape
    summary_landscape.gen_landscapes_with_multi_process(
        callback_fn,
        collect_landscape={"landscape_size": 10,
                           "create_landscape": {"train": True, "result": True},
                           "num_samples": 512,
                           "intervals": [interval_1, interval_2]},
        device_ids=[1, 2])

  • callback_fn: the user must define a function callback_fn that takes no input and returns model (mindspore.Model), network (mindspore.nn.Cell), dataset (mindspore.dataset), and metrics (mindspore.nn.Metrics).
  • collect_landscape: the parameter definitions are the same as for SummaryCollector; here the user is free to modify the drawing parameters.
  • device_ids: specifies the devices used for drawing the landscape; single-machine multi-device computation is supported. A sketch of parameterizing callback_fn follows below.
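
Since callback_fn takes no arguments, any configuration it needs (such as the data path) has to come from module-level names or a closure. A minimal sketch of the closure approach, where make_callback_fn is a hypothetical helper, not part of the MindSpore API:

# Hypothetical helper: build a zero-argument callback_fn bound to a data path.
def make_callback_fn(data_path):
    def callback_fn():
        network = LeNet5(10)
        net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
        metrics = {"Loss": Loss()}
        model = Model(network, net_loss, metrics=metrics)
        ds_eval = create_dataset(data_path)
        return model, network, ds_eval, metrics
    return callback_fn

# Usage: summary_landscape.gen_landscapes_with_multi_process(
#            make_callback_fn("/path/to/MNIST"), ..., device_ids=[0])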

After drawing completes, start MindInsight:

# Start MindInsight (--summary-base-dir points to the directory that contains the summary directories)
mindinsight start --port 8000 --summary-base-dir /home/workspace/dockers/study/logs/summary
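
When you are done viewing, the MindInsight service can be stopped on the same port:

# Stop
mindinsight stop --port 8000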

Loss Function Multidimensional Analysis

The loss function multidimensional analysis describes the model's trajectory during the training process; by examining it, users can understand how the model moved through the loss landscape as training progressed.

  1. Contour map, topographic map, 3D map: three views of the same data; users can switch between them freely.
  2. Interval selection: users can display images for different intervals via "Please select an interval range".
  3. Visual display settings: by adjusting these parameters, users can view the image from different angles and with different terrain colors as well as different trajectory colors and widths. In the contour map (Figure 5) and the topographic map, the number of contour lines can be adjusted to control how dense the rendering is.
  4. Basic training information (Figure 5): shows basic information about the model, such as the network name, optimizer, learning rate (currently a fixed learning rate is displayed), dimensionality-reduction method, sampling resolution, and step/epoch.

Notes

  1. When drawing a landscape, the drawing time is directly related to the size of the model parameters, the num_samples of the dataset, and the landscape_size resolution: the larger the model, num_samples, and landscape_size, the longer it takes. For example, for a LeNet network at 40×40 resolution, one plot takes about 4 minutes with one device; with 2 devices, this drops to about 2 minutes. For a ResNet-50 network at the same resolution, one plot takes about 20 minutes with 4 devices.
  2. On the MindInsight start page, large training log files take MindInsight longer to parse; please be patient.
  3. This feature currently supports only models defined via mindspore.Model.
  4. Currently supported backends: Ascend/GPU/CPU; mode: static graph mode; platform: Linux.
  5. This feature currently supports only single-machine single-device and single-machine multi-device modes.
  6. This feature does not currently support data sinking mode when drawing landscapes.