
AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

https://arxiv.org/pdf/2010.11929.pdf

---------------------------------------------------------

2021-08-30

The Transformer lacks the CNN's inductive biases of translation invariance and locality; pretraining on a large-scale dataset can compensate for this.
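In PyTorch (with einops), the patch embedding, which patchifies the image with a strided convolution, prepends a learnable [CLS] token, and adds a learnable position embedding, and the multi-head self-attention module can be written as follows: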

import torch
import torch.nn as nn
import torch.nn.functional as F
import einops
from einops.layers.torch import Rearrange


class PatchEmbedding(nn.Module):
    def __init__(self, in_channel: int = 3, patch_size: int = 16, emb_size: int = 768, img_size: int = 224):
        super(PatchEmbedding, self).__init__()
        self.patch_size = patch_size
        # Cut the image into non-overlapping patches with a strided conv,
        # then flatten the spatial grid into a token sequence
        self.projection = nn.Sequential(
            nn.Conv2d(in_channel, emb_size, kernel_size=patch_size, stride=patch_size),
            Rearrange("b e h w -> b (h w) e"),
        )
        self.cls_token = nn.Parameter(torch.randn(1, 1, emb_size))
        # Learnable position embedding: one row per patch plus one for the [CLS] token
        self.position = nn.Parameter(torch.randn((img_size // patch_size) ** 2 + 1, emb_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        x = self.projection(x)
        # Prepend a copy of the [CLS] token to every sequence in the batch
        cls_tokens = einops.repeat(self.cls_token, "() n e -> b n e", b=b)
        x = torch.cat([cls_tokens, x], dim=1)
        x += self.position
        return x


class MultiHeadAttention(nn.Module):
    def __init__(self, emb_size: int = 768, num_heads: int = 8, dropout: float = 0.):
        super(MultiHeadAttention, self).__init__()
        self.emb_size = emb_size
        self.num_heads = num_heads
        # Project to queries, keys and values with a single linear layer
        self.qkv = nn.Linear(emb_size, emb_size * 3)
        self.att_drop = nn.Dropout(dropout)
        self.projection = nn.Linear(emb_size, emb_size)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # (b, n, 3e) -> (3, b, heads, n, head_dim)
        qkv = einops.rearrange(self.qkv(x), "b n (h d qkv) -> (qkv) b h n d", h=self.num_heads, qkv=3)
        queries, keys, values = qkv[0], qkv[1], qkv[2]
        # Dot-product similarity between every query and every key
        energy = torch.einsum("bhqd, bhkd -> bhqk", queries, keys)
        if mask is not None:
            fill_value = torch.finfo(torch.float32).min
            energy = energy.masked_fill(~mask, fill_value)
        # Scale by sqrt(head_dim) before the softmax, as in the paper
        scaling = (self.emb_size // self.num_heads) ** 0.5
        att = F.softmax(energy / scaling, dim=-1)
        att = self.att_drop(att)
        # Weighted sum of the values, then merge the heads back together
        out = torch.einsum("bhal, bhlv -> bhav", att, values)
        out = einops.rearrange(out, "b h n d -> b n (h d)")
        out = self.projection(out)
        return out
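A quick shape check ties the two modules together (a minimal sketch; the random batch below is only illustrative): a 224×224 image with 16×16 patches gives 14×14 = 196 patch tokens, plus the [CLS] token, for 197 tokens of width 768, and attention preserves that shape.

imgs = torch.randn(2, 3, 224, 224)    # dummy batch of 2 RGB images
tokens = PatchEmbedding()(imgs)       # (2, 197, 768): 196 patches + [CLS]
out = MultiHeadAttention()(tokens)    # (2, 197, 768): shape is preserved
print(tokens.shape, out.shape)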