
TensorFlow Estimator API: checkpoint save behavior during training and checkpoint skip behavior during validation

INFO:tensorflow:Create CheckpointSaverHook.
2018-01-15 16:24:33.513942: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-01-15 16:24:34.390763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:89:00.0 totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-01-15 16:24:34.390813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:89:00.0, compute capability: 6.1)
2018-01-15 16:25:58.010092: I tensorflow/core/kernels/shuffle_dataset_op.cc:110] Filling up shuffle buffer (this may take a while): 499 of 1000
2018-01-15 16:26:07.689469: I tensorflow/core/kernels/shuffle_dataset_op.cc:121] Shuffle buffer filled.
INFO:tensorflow:Saving checkpoints for 1 into /train/mymodels/model.ckpt.
INFO:tensorflow:loss = 22.2663, step = 1
......
DEBUG:tensorflow:Skipping evaluation due to same checkpoint /train/mymodels/model.ckpt-1 for step 100 as for step 50.

The execution flow is as follows:

experiment.train_and_evaluate()

# The evaluation part is implemented with a hook (ValidationMonitor)
if self._min_eval_frequency:
   self._train_monitors += [
       monitors.ValidationMonitor(
           input_fn=self._eval_input_fn,
           eval_steps=self._eval_steps,
           metrics=self._eval_metrics,
           every_n_steps=self._min_eval_frequency,
           name=eval_dir_suffix,
           hooks=self._eval_hooks)
   ]

# The training part ultimately calls estimator._train_model(); the very first training run already saves a checkpoint!!!
self.train(delay_secs=0)
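
For reference, the fields used above (_eval_input_fn, _eval_steps, _min_eval_frequency, _eval_hooks, ...) are simply the arguments that were passed to the Experiment constructor. A minimal sketch of how such an Experiment is typically assembled (my_estimator, train_input_fn and eval_input_fn are hypothetical placeholders, not taken from the run above):

from tensorflow.contrib.learn import Experiment

# my_estimator, train_input_fn and eval_input_fn are hypothetical placeholders
# for an estimator and its two input functions.
experiment = Experiment(
    estimator=my_estimator,
    train_input_fn=train_input_fn,
    eval_input_fn=eval_input_fn,
    train_steps=10000,
    eval_steps=100,
    min_eval_frequency=50)  # the ValidationMonitor above fires at most every 50 steps

experiment.train_and_evaluate()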

Training part

experiment.train(delay_secs=0) -> experiment._estimator.train -> estimator._train_model()

# estimator._train_model() code
# ...
      # 1. Add loss monitoring (via hooks)
      # Check if the user created a loss summary, and add one if they didn't.
      # We assume here that the summary is called 'loss'. If it is not, we will
      # make another one with the name 'loss' to ensure it shows up in the right
      # graph in TensorBoard.
      if not any([x.op.name == 'loss'
                  for x in ops.get_collection(ops.GraphKeys.SUMMARIES)]):
        summary.scalar('loss', estimator_spec.loss)
      ops.add_to_collection(ops.GraphKeys.LOSSES, estimator_spec.loss)
      worker_hooks.extend(hooks)
      worker_hooks.extend([
          training.NanTensorHook(estimator_spec.loss),
          training.LoggingTensorHook(
              {
                  'loss': estimator_spec.loss,
                  'step': global_step_tensor
              },
              every_n_iter=100)
      ])
      worker_hooks.extend(estimator_spec.training_hooks)

      # 2. Create a Saver if none was provided
      if not (estimator_spec.scaffold.saver or
              ops.get_collection(ops.GraphKeys.SAVERS)):
        ops.add_to_collection(
            ops.GraphKeys.SAVERS,
            training.Saver(
                sharded=True,
                max_to_keep=self._config.keep_checkpoint_max,
                keep_checkpoint_every_n_hours=(
                    self._config.keep_checkpoint_every_n_hours),
                defer_build=True,
                save_relative_paths=True))

      chief_hooks = []
      all_hooks = worker_hooks + list(estimator_spec.training_chief_hooks)
      saver_hooks = [
          h for h in all_hooks if isinstance(h, training.CheckpointSaverHook)]
      if (self._config.save_checkpoints_secs or
          self._config.save_checkpoints_steps):
        if not saver_hooks:
          # 3. CheckpointSaverHook: this is the key point for saving checkpoints
          chief_hooks = [
              training.CheckpointSaverHook(
                  self._model_dir,
                  save_secs=self._config.save_checkpoints_secs,
                  save_steps=self._config.save_checkpoints_steps,
                  scaffold=estimator_spec.scaffold)
          ]
          saver_hooks = [chief_hooks[0]]
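
Note that save_secs / save_steps come straight from self._config, i.e. the RunConfig the Estimator was constructed with, and keep_checkpoint_max feeds the default Saver created above. A minimal sketch of steering the save cadence from user code (toy model, /tmp path and step counts are made up for illustration):

import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode):
    # Toy linear model: just enough to produce a loss and a train_op.
    preds = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, predictions=preds, loss=loss, train_op=train_op)

# save_checkpoints_steps ends up as CheckpointSaverHook(save_steps=...),
# keep_checkpoint_max ends up in the Saver created by _train_model().
config = tf.estimator.RunConfig(
    model_dir="/tmp/ckpt_demo",
    save_checkpoints_steps=50,
    keep_checkpoint_max=5)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

x = np.random.rand(1024, 1).astype(np.float32)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x}, 3.0 * x + 1.0, batch_size=32, num_epochs=None, shuffle=True)
estimator.train(input_fn=train_input_fn, steps=200)
# Checkpoints are written at step 1 (first trigger), then about every 50 steps,
# plus one final save when training ends.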

CheckpointSaverHook

class CheckpointSaverHook(session_run_hook.SessionRunHook):
  """Saves checkpoints every N steps or seconds."""

  def __init__(self,
               checkpoint_dir,
               save_secs=None,
               save_steps=None,
               saver=None,
               checkpoint_basename="model.ckpt",
               scaffold=None,
               listeners=None):
    """Initializes a `CheckpointSaverHook`.

    Args:
      checkpoint_dir: `str`, base directory for the checkpoint files.
      save_secs: `int`, save every N secs.
      save_steps: `int`, save every N steps.
      saver: `Saver` object, used for saving.
      checkpoint_basename: `str`, base name for the checkpoint files.
      scaffold: `Scaffold`, use to get saver object.
      listeners: List of `CheckpointSaverListener` subclass instances.
        Used for callbacks that run immediately before or after this hook saves
        the checkpoint.

    Raises:
      ValueError: One of `save_steps` or `save_secs` should be set.
      ValueError: At most one of saver or scaffold should be set.
    """
    logging.info("Create CheckpointSaverHook.")
    if saver is not None and scaffold is not None:
      raise ValueError("You cannot provide both saver and scaffold.")
    self._saver = saver
    self._checkpoint_dir = checkpoint_dir
    self._save_path = os.path.join(checkpoint_dir, checkpoint_basename)
    self._scaffold = scaffold
    self._timer = SecondOrStepTimer(every_secs=save_secs,
                                    every_steps=save_steps)
    self._listeners = listeners or []

  def begin(self):
    self._summary_writer = SummaryWriterCache.get(self._checkpoint_dir)
    self._global_step_tensor = training_util._get_or_create_global_step_read()  # pylint: disable=protected-access
    if self._global_step_tensor is None:
      raise RuntimeError(
          "Global step should be created to use CheckpointSaverHook.")
    for l in self._listeners:
      l.begin()

  def before_run(self, run_context):  # pylint: disable=unused-argument
    if self._timer.last_triggered_step() is None:
      # We do write graph and saver_def at the first call of before_run.
      # We cannot do this in begin, since we let other hooks to change graph and
      # add variables in begin. Graph is finalized after all begin calls.
      training_util.write_graph(
          ops.get_default_graph().as_graph_def(add_shapes=True),
          self._checkpoint_dir,
          "graph.pbtxt")
      saver_def = self._get_saver().saver_def if self._get_saver() else None
      graph = ops.get_default_graph()
      meta_graph_def = meta_graph.create_meta_graph_def(
          graph_def=graph.as_graph_def(add_shapes=True),
          saver_def=saver_def)
      self._summary_writer.add_graph(graph)
      self._summary_writer.add_meta_graph(meta_graph_def)

    return SessionRunArgs(self._global_step_tensor)

  def after_run(self, run_context, run_values):
    stale_global_step = run_values.results
    # This function is key!!! It returns True both on the very first run and when it is time to write a checkpoint.
    if self._timer.should_trigger_for_step(stale_global_step+1):
      # get the real value after train op.
      global_step = run_context.session.run(self._global_step_tensor)
      if self._timer.should_trigger_for_step(global_step):
        self._timer.update_last_triggered_step(global_step)
        self._save(run_context.session, global_step)

  def end(self, session):
    last_step = session.run(self._global_step_tensor)
    if last_step != self._timer.last_triggered_step():
      self._save(session, last_step)
    for l in self._listeners:
      l.end(session, last_step)

  def _save(self, session, step):
    """Saves the latest checkpoint."""
    logging.info("Saving checkpoints for %d into %s.", step, self._save_path)

    for l in self._listeners:
      l.before_save(session, step)

    self._get_saver().save(session, self._save_path, global_step=step)
    self._summary_writer.add_session_log(
        SessionLog(
            status=SessionLog.CHECKPOINT, checkpoint_path=self._save_path),
        step)

    for l in self._listeners:
      l.after_save(session, step)

  def _get_saver(self):
    if self._saver is not None:
      return self._saver
    elif self._scaffold is not None:
      return self._scaffold.saver

    # Get saver from the SAVERS collection if present.
    collection_key = ops.GraphKeys.SAVERS
    savers = ops.get_collection(collection_key)
    if not savers:
      raise RuntimeError(
          "No items in collection {}. Please add a saver to the collection "
          "or provide a saver or scaffold.".format(collection_key))
    elif len(savers) > 1:
      raise RuntimeError(
          "More than one item in collection {}. "
          "Please indicate which one to use by passing it to the constructor.".
          format(collection_key))

    self._saver = savers[0]
    return savers[0]
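
The hook is not tied to Estimator; the same "save at the first step, then every N" behaviour can be observed by driving it directly with a MonitoredTrainingSession. A minimal sketch (directory and step counts are made up for illustration):

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)   # stand-in for a real training op

saver_hook = tf.train.CheckpointSaverHook(
    checkpoint_dir="/tmp/ckpt_hook_demo",
    save_steps=50,
    saver=tf.train.Saver())

with tf.train.MonitoredTrainingSession(hooks=[saver_hook]) as sess:
    for _ in range(120):
        sess.run(train_op)
# after_run() saves at step 1 (first trigger of its SecondOrStepTimer),
# then again around steps 51 and 101; end() writes one last checkpoint at step 120.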

SecondOrStepTimer.should_trigger_for_step

class SecondOrStepTimer(_HookTimer):
  """Timer that triggers at most once every N seconds or once every N steps.
  """

  def __init__(self, every_secs=None, every_steps=None):
    self.reset()
    self._every_secs = every_secs
    self._every_steps = every_steps

    if self._every_secs is None and self._every_steps is None:
      raise ValueError("Either every_secs or every_steps should be provided.")
    if (self._every_secs is not None) and (self._every_steps is not None):
      raise ValueError("Can not provide both every_secs and every_steps.")

    super(SecondOrStepTimer, self).__init__()

  def reset(self):
    self._last_triggered_step = None
    self._last_triggered_time = None

  def should_trigger_for_step(self, step):
    """Return true if the timer should trigger for the specified step.

    Args:
      step: Training step to trigger on.

    Returns:
      True if the difference between the current time and the time of the last
      trigger exceeds `every_secs`, or if the difference between the current
      step and the last triggered step exceeds `every_steps`. False otherwise.
    """
    # If this is the first call, always trigger
    if self._last_triggered_step is None:
      return True

    if self._last_triggered_step == step:
      return False

    if self._every_secs is not None:
      if time.time() >= self._last_triggered_time + self._every_secs:
        return True

    if self._every_steps is not None:
      if step >= self._last_triggered_step + self._every_steps:
        return True

    return False

  def update_last_triggered_step(self, step):
    current_time = time.time()
    if self._last_triggered_time is None:
      elapsed_secs = None
      elapsed_steps = None
    else:
      elapsed_secs = current_time - self._last_triggered_time
      elapsed_steps = step - self._last_triggered_step

    self._last_triggered_time = current_time
    self._last_triggered_step = step
    return (elapsed_secs, elapsed_steps)

  def last_triggered_step(self):
    return self._last_triggered_step
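
This behaviour is easy to confirm in isolation by instantiating the timer directly (imported here from the internal module the excerpt above comes from):

from tensorflow.python.training.basic_session_run_hooks import SecondOrStepTimer

timer = SecondOrStepTimer(every_steps=50)

print(timer.should_trigger_for_step(1))    # True:  _last_triggered_step is still None (first call)
timer.update_last_triggered_step(1)
print(timer.should_trigger_for_step(30))   # False: 30 < 1 + 50
print(timer.should_trigger_for_step(51))   # True:  51 >= 1 + 50
timer.update_last_triggered_step(51)
print(timer.should_trigger_for_step(51))   # False: same step as the last trigger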

Evaluation part

# experiment.train_and_evaluate()

self._train_monitors += [
    monitors.ValidationMonitor(
        input_fn=self._eval_input_fn,
        eval_steps=self._eval_steps,
        metrics=self._eval_metrics,
        every_n_steps=self._min_eval_frequency,
        name=eval_dir_suffix,
        hooks=self._eval_hooks)
]

class ValidationMonitor(EveryN):
  """Runs evaluation of a given estimator, at most every N steps.

  Note that the evaluation is done based on the saved checkpoint, which will
  usually be older than the current step.

  Can do early stopping on validation metrics if `early_stopping_rounds` is
  provided.
  """

  # ...... (omitted)

  def every_n_step_end(self, step, outputs):
    super(ValidationMonitor, self).every_n_step_end(step, outputs)

    # Check that we are not running evaluation on the same checkpoint.
    latest_path = saver_lib.latest_checkpoint(self._estimator.model_dir)
    if latest_path is None:
      logging.debug("Skipping evaluation since model has not been saved yet "
                    "at step %d.", step)
      return False
    if latest_path is not None and latest_path == self._latest_path:
      # Prevent re-evaluating the same checkpoint!!!!
      logging.debug("Skipping evaluation due to same checkpoint %s for step %d "
                    "as for step %d.", latest_path, step,
                    self._latest_path_step)
      return False
    self._latest_path = latest_path
    self._latest_path_step = step
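
Putting the two halves together explains the DEBUG line at the top of this post: the ValidationMonitor fires every min_eval_frequency steps, but it only evaluates when saver_lib.latest_checkpoint() returns a path different from the one it evaluated last time. Since the CheckpointSaverHook had only written model.ckpt-1 so far, the evaluation at step 50 used that checkpoint and the one at step 100 was skipped. A simplified standalone restatement of the skip condition (the helper name maybe_evaluate and the messages are made up for illustration):

import tensorflow as tf

_last_evaluated_path = None

def maybe_evaluate(model_dir, step):
    """Mimics the skip condition in ValidationMonitor.every_n_step_end()."""
    global _last_evaluated_path
    latest_path = tf.train.latest_checkpoint(model_dir)
    if latest_path is None:
        print("step %d: no checkpoint saved yet, skipping evaluation" % step)
        return False
    if latest_path == _last_evaluated_path:
        print("step %d: same checkpoint %s, skipping evaluation" % (step, latest_path))
        return False
    _last_evaluated_path = latest_path
    print("step %d: evaluating %s" % (step, latest_path))
    return True

In practice the skips disappear once save_checkpoints_steps (or save_checkpoints_secs) in RunConfig is at least as frequent as min_eval_frequency, so that every ValidationMonitor trigger sees a fresh checkpoint.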