
【NeurIPS】ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Paper: https://openreview.net/forum?id=_WnAQKse_uK

Code: https://github.com/Annbless/ViTAE

1. Motivation

The idea of this paper is quite simple: combine CNNs with ViT, using convolutions in the shallow layers and Transformer blocks in the deeper layers. In addition, a parallel convolution branch is added alongside the attention branch.

2. Method

The overall network architecture is shown in the figure below. It consists of three Reduction Cells (RC) and several Normal Cells (NC).
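As a rough sketch of how the stages compose, each RC downsamples the feature map before the NCs refine it at a fixed resolution. The strides below are assumptions for illustration, not values taken from this summary:

```python
def trace_resolution(size, strides):
    """Track the spatial side length through successive Reduction Cells."""
    sizes = [size]
    for s in strides:
        sizes.append(sizes[-1] // s)
    return sizes

# Hypothetical per-RC downsampling factors (4x, then 2x, then 2x)
print(trace_resolution(224, [4, 2, 2]))
```

After the three RCs the token grid no longer shrinks, so every subsequent NC operates at the final resolution.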

The RC module

Compared with the ViT Transformer block, the RC adds a pyramid reduction module: multiple dilated convolutions run in parallel at different dilation rates, and their outputs are concatenated into one. In addition, the shortcut path gains three extra convolution layers. Finally, a seq2img operation converts the token sequence back into a feature map.
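Two of these pieces are easy to make concrete. The parallel dilated branches see progressively larger contexts (the dilation rates below are assumed for illustration), and seq2img is just folding a row-major token sequence back into a 2D grid:

```python
def effective_kernel(k, d):
    # Effective receptive field of a k x k conv with dilation rate d
    return k + (k - 1) * (d - 1)

def seq2img(tokens, h, w):
    # Fold a row-major token sequence of length h*w back into an h x w grid
    assert len(tokens) == h * w
    return [tokens[r * w:(r + 1) * w] for r in range(h)]

# Hypothetical dilation rates for three parallel 3x3 branches whose
# outputs would be concatenated along the channel dimension
scales = [effective_kernel(3, d) for d in (1, 2, 3)]
print(scales)
```

The growing effective kernel sizes are what give the concatenated output its multi-scale character.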

The NC module

The only difference from the ViT Transformer block is an extra convolution branch running in parallel with the attention computation.
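Structurally, this amounts to two parallel branches sharing one residual connection, followed by the usual feed-forward block. A minimal sketch (the lambdas are toy stand-ins, not the actual sub-modules):

```python
def normal_cell(x, attention, conv, ffn):
    # Attention and conv branches run in parallel; both are summed with
    # the residual input, then the feed-forward block is applied as usual
    y = [xi + ai + ci for xi, ai, ci in zip(x, attention(x), conv(x))]
    return [yi + fi for yi, fi in zip(y, ffn(y))]

# Toy stand-ins for the real sub-modules, applied to a length-3 "sequence"
attn = lambda seq: [0.1 * v for v in seq]
conv = lambda seq: [0.2 * v for v in seq]
ffn  = lambda seq: [0.0 * v for v in seq]
out = normal_cell([1.0, 2.0, 3.0], attn, conv, ffn)
```

Because the two branches are summed rather than stacked, the conv branch adds locality without changing the block's input/output interface.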

3. Interesting takeaways

Judging from the OpenReview comments, the reviewers acknowledged the following strong points:

  • The idea of injecting multi-scale features is interesting and promising.
  • The paper is well written and easy to follow.

At the same time, the reviewers also pointed out some weaknesses:

  • The paper uses an additional conv branch together with the self-attention branch to construct the new network architecture, and it is obvious that the extra conv layers will help improve the performance of the network. The proposed network modification looks a little bit incremental and not very interesting to me.
  • There are no results on downstream object detection and segmentation tasks, even though this paper aims to introduce inductive biases about visual structure.
  • The proposed method is mainly verified on small input images. Thus, I am a little bit concerned about its memory consumption and running speed when applied on large images (as segmentation or detection typically uses large image resolutions).