【NeurIPS】ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
阿新 • Published: 2022-03-28
1. Motivation
The idea of this paper is very simple: combine CNNs and ViT, using CNN-style layers in the shallow stages and ViT transformer blocks in the deep stages. In addition, a convolution branch is added in parallel with the attention branch.
2. Method
The overall network architecture is shown in the figure below. It consists of three Reduction Cells (RC) followed by a number of Normal Cells (NC).
RC module
Compared with the Transformer block in ViT, the RC adds a pyramid reduction module: several dilated convolutions with different dilation rates run in parallel, and their outputs are concatenated into one tensor. The shortcut path also gains three extra convolutions. Finally, a seq2img operation converts the token sequence back into a feature map.
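The pyramid reduction step can be sketched as follows. This is a minimal illustration, not the official ViTAE implementation; the dilation rates (1, 2, 3, 4), the stride, and the class name are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class PyramidReduction(nn.Module):
    """Sketch of the RC's pyramid reduction: parallel dilated convolutions
    whose outputs are concatenated along the channel dimension.
    Dilation rates and stride here are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3, 4), stride=2):
        super().__init__()
        branch_ch = out_ch // len(dilations)
        # one 3x3 conv per dilation rate; padding=d keeps spatial sizes aligned
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, stride=stride,
                      padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x):
        # each branch sees a different receptive field; concat fuses the scales
        return torch.cat([b(x) for b in self.branches], dim=1)

x = torch.randn(1, 3, 224, 224)
y = PyramidReduction(3, 64)(x)
print(y.shape)  # torch.Size([1, 64, 112, 112])
```

Because each branch uses `padding=d` with `dilation=d`, all branch outputs share the same spatial size and can be concatenated directly.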
NC module
The only difference from ViT's transformer block is an extra convolution branch running in parallel with the attention computation.
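The parallel attention/convolution design can be sketched like this. Again, a minimal sketch under assumptions: the conv branch's depth, normalization, and activation are illustrative, and the class and argument names are made up for this example.

```python
import torch
import torch.nn as nn

class ParallelAttentionConv(nn.Module):
    """Sketch of the NC's attention stage: standard multi-head self-attention
    plus a parallel convolution branch that operates on the tokens reshaped
    into a feature map. The conv branch's composition is an assumption."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.SiLU(),
        )

    def forward(self, x, hw):
        h, w = hw
        a, _ = self.attn(x, x, x)                        # global context
        b, n, c = x.shape
        fmap = x.transpose(1, 2).reshape(b, c, h, w)     # seq -> img
        local = self.conv(fmap).reshape(b, c, n).transpose(1, 2)  # img -> seq
        return a + local                                 # fuse both branches

tokens = torch.randn(2, 196, 64)  # a 14x14 patch grid, dim 64
out = ParallelAttentionConv(64)(tokens, (14, 14))
print(out.shape)  # torch.Size([2, 196, 64])
```

The attention branch captures long-range dependencies while the conv branch injects locality, which is the intrinsic inductive bias the paper's title refers to.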
3. Interesting points
Judging from the OpenReview comments, the reviewers acknowledged the following strong points:
- The idea of injecting multi-scale features is interesting and promising.
- The paper is well written and easy to follow.
At the same time, the reviewers also pointed out some weaknesses:
- The paper uses an additional conv branch together with the self-attention branch to construct the new network architecture; it is obvious that the extra conv layers will help improve the performance of the network. The proposed network modification looks a bit incremental and not very interesting to me.
- There are no results on downstream object detection and segmentation tasks, even though this paper aims to introduce an inductive bias about visual structure.
- The proposed method is mainly verified on small input images. Thus, I am a little concerned about its memory consumption and running speed when applied to large images (as segmentation and detection typically use large image resolutions).