
Chuhui Xue_ECCV2018_Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping

Authors and Code

Keywords

Text detection, multi-oriented, FCN, $$xywh\theta$$, multi-stage, border

Method Highlights

  • Uses bootstrapping for data augmentation
  • Adds a border loss

Method Overview

This is a direct-regression method. Besides the text/non-text classification task and the regression of the distances from each pixel to the four box edges (similar to EAST), it adds a learning task for the four text borders. The final text lines are not taken directly from the predicted bounding boxes; instead they are derived from the text score map and the four border maps.
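As a reading aid, here is a minimal sketch (not the authors' code) of the prediction heads this overview implies, assuming a PyTorch FCN backbone: one text/non-text score channel, four EAST-style distance channels, and four border score channels. All names and channel counts beyond those stated above are assumptions.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Hypothetical 1x1-conv heads on top of a shared FCN feature map."""
    def __init__(self, in_channels=32):
        super().__init__()
        self.score_head = nn.Conv2d(in_channels, 1, 1)   # text/non-text score
        self.geo_head = nn.Conv2d(in_channels, 4, 1)     # distances to 4 box sides
        self.border_head = nn.Conv2d(in_channels, 4, 1)  # 4 border score maps

    def forward(self, feat):
        score = torch.sigmoid(self.score_head(feat))     # text score map in [0, 1]
        geo = torch.relu(self.geo_head(feat))            # non-negative distances
        border = torch.sigmoid(self.border_head(feat))   # border score maps in [0, 1]
        return score, geo, border
```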

Method Details

Bootstrapping-based sample augmentation

In short, it repeatedly samples local regions of each text polygon to enrich the diversity of text patches.

Fig. 2: Illustration of the bootstrapping based scene text sampling: Given an image with a text line as annotated by the green box, three example text line segments are extracted as highlighted by red boxes where the centers of the sampling windows are taken randomly along the center line of the text line (the shrunk part in yellow color). The rest text regions outside of the sampling windows are filled by inpainting.

Specific steps (a code sketch follows this list):

  • Determine the center line and shrink it by 0.1L' at both ends
  • Sample points uniformly at random along the shrunk center line (these become the window centers)
  • Determine the window size (H = 0.9H'; W is a random value from [0.2L', 2$$d_{min}$$], where $$d_{min}$$ is the minimum distance from the window center to the two short edges)
  • Inpaint (paint over with a solid color) the parts inside the original text box but outside the sampling window (inside the green box but outside the red boxes in the figure above), producing a new training image and its text ground truth
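A minimal sketch of these steps for an axis-aligned box (the paper handles rotated polygons); OpenCV's inpainting stands in for the paint-over step, and all helper names are hypothetical.

```python
import random
import cv2
import numpy as np

def sample_text_segment(img, box):
    """box = (x, y, w, h): an axis-aligned text annotation box; img is uint8 BGR."""
    x, y, w, h = box
    # Center line at mid-height, shrunk by 0.1*L' at both ends.
    cy = y + h / 2.0
    x0, x1 = x + 0.1 * w, x + w - 0.1 * w
    # Random window center on the shrunk center line.
    cx = random.uniform(x0, x1)
    # Window size: H = 0.9*H', W ~ U[0.2*L', 2*d_min], where d_min is the
    # distance from the window center to the nearer short edge.
    d_min = min(cx - x, x + w - cx)
    win_h = 0.9 * h
    win_w = random.uniform(0.2 * w, 2.0 * d_min)
    # Mask: pixels inside the original box but outside the sampling window.
    mask = np.zeros(img.shape[:2], np.uint8)
    mask[int(y):int(y + h), int(x):int(x + w)] = 255
    mask[int(cy - win_h / 2):int(cy + win_h / 2),
         int(cx - win_w / 2):int(cx + win_w / 2)] = 0
    # Inpaint the masked region to erase the remaining text.
    out = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
    new_box = (cx - win_w / 2, cy - win_h / 2, win_w, win_h)
    return out, new_box
```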

The effect of training with and without bootstrapping is shown below:

The authors argue that with this augmentation, the features in the feature map become more consistent, so the distance maps become smoother.

It can be seen that the inclusion of the augmented images helps to produce more consistent text feature maps as well as smoother geometrical distance maps (for regression of text boxes) which leads to more complete instead of broken scene text detections.

My understanding: in a long text line, the appearance of different segments can vary greatly; e.g., in Fig. 2 some parts are strongly lit while others are dim. Bootstrapping samples individual segments of a text instance, and the shorter a segment is, the more consistent its features are, so feature learning concentrates on more homogeneous regions (it effectively lowers the difficulty of the learning problem).

Fig. 3: The inclusion of augmented images improves the scene text detection: With the inclusion of the augmented images in training, more consistent text feature maps and more complete scene text detections are produced as shown in (d) and (e), as compared with those produced by the baseline model (trained using original training images only) shown in (b) and (c). The coloring in the text feature maps shows the distance information predicted by regressor (blue denotes short distances and red denotes long distance).

Adding classification of four types of border pixels

Four border/non-border classification maps are added as additional learning targets.

Fig. 4: Semantics-aware text border detection: Four text border segments are automatically extracted for each text annotation box including a pair of short-edge text border segments in yellow and red colors and a pair of long-edge text border segments in green and blue colors. The four types of text border segments are treated as four types of objects and used to train deep network models, and the trained model is capable of detecting the four types of text border segments as illustrated in Fig. 5c.

The sizes of the four borders are as follows: the top/bottom borders take a thickness of 0.2H', and the left/right borders take a width of 0.8H'. Notably, the left/right width is chosen comparatively large (on the order of the full height H'), mainly out of concern that text adhesion within the same text line is common, and unlike the top/bottom borders the short edges are not long (so a larger area helps). The top/bottom borders mainly address adhesion between neighboring text lines. What the borders learn is the transition from text to background (...the extracted text border segments capture the transition from text to background or vice versa...). A mask-extraction sketch follows.
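A minimal sketch of extracting the four border-segment masks for one axis-aligned box, using the sizes quoted above (0.2H' thickness for top/bottom, 0.8H' width for left/right); the rotated-box case and the exact definitions in the paper may differ, and all names are assumptions.

```python
import numpy as np

def border_masks(shape, box):
    """shape: (H, W) of the image; box = (x, y, w, h) axis-aligned text box."""
    x, y, w, h = box
    top = np.zeros(shape, np.uint8)
    bottom = np.zeros(shape, np.uint8)
    left = np.zeros(shape, np.uint8)
    right = np.zeros(shape, np.uint8)
    t = 0.2 * h  # long-edge (top/bottom) border thickness
    s = 0.8 * h  # short-edge (left/right) border width
    top[int(y):int(y + t), int(x):int(x + w)] = 1
    bottom[int(y + h - t):int(y + h), int(x):int(x + w)] = 1
    left[int(y):int(y + h), int(x):int(x + s)] = 1
    right[int(y):int(y + h), int(x + w - s):int(x + w)] = 1
    return top, bottom, left, right
```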

The authors argue that the short-edge borders improve accuracy because, for long words or text lines, pixels near the middle of the text are far from the two short edges, which easily introduces regression errors and leads to imprecise detections; adding a loss on border pixels helps with this. The two long-edge borders, on the other hand, are meant to separate adjacent text lines that sit close above and below each other.

The reason is that text pixels around the middle of texts are far from the text box vertices for long words or text lines which can easily introduce regression errors and lead to inaccurate localization as illustrated in Fig. 5b. At the other end, the long text border segments also help for better scene text detection performance. In particular, the long text border segments can be exploited to separate text lines when neighboring text lines are close to each other.

The effect of adding borders versus not adding them is compared below:

Fig. 5: The use of semantics-aware text borders improves scene text detection: With the identified text border semantics information as illustrated in (c), scene texts can be localized much more accurately as illustrated in (d) as compared with the detections without using the border semantics information as illustrated in (b). Green boxes give correct detections and red boxes give false detections.

Loss Function
  • Total loss: a combination of the three terms below (see the sketch after this list)

  • Classification loss: Dice coefficient loss (the confidence score of each pixel being a text pixel)

  • Regression loss: IoU loss (the distances from each pixel to the four sides of the text boundary)
  • Border loss: Dice coefficient loss (the confidence score of each pixel being a border pixel)
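A minimal sketch of the three loss terms under the form described above: Dice loss for the text and border score maps, and an EAST-style IoU loss on the four pixel-to-side distances. The weighting coefficients are assumptions, not the paper's exact values.

```python
import torch

def dice_loss(pred, gt, eps=1e-6):
    # 1 - Dice coefficient between a predicted score map and its ground truth.
    inter = (pred * gt).sum()
    return 1.0 - 2.0 * inter / (pred.sum() + gt.sum() + eps)

def iou_loss(pred, gt, eps=1e-6):
    # pred, gt: (N, 4) distances (top, bottom, left, right) at text pixels.
    area_p = (pred[:, 0] + pred[:, 1]) * (pred[:, 2] + pred[:, 3])
    area_g = (gt[:, 0] + gt[:, 1]) * (gt[:, 2] + gt[:, 3])
    inter_h = torch.min(pred[:, 0], gt[:, 0]) + torch.min(pred[:, 1], gt[:, 1])
    inter_w = torch.min(pred[:, 2], gt[:, 2]) + torch.min(pred[:, 3], gt[:, 3])
    inter = inter_h * inter_w
    union = area_p + area_g - inter
    return -torch.log((inter + eps) / (union + eps)).mean()

def total_loss(score, score_gt, geo, geo_gt, borders, borders_gt,
               lam_geo=1.0, lam_border=1.0):
    # Assumed equal weighting of the three terms.
    l_cls = dice_loss(score, score_gt)
    l_geo = iou_loss(geo, geo_gt)
    l_border = sum(dice_loss(b, g) for b, g in zip(borders, borders_gt))
    return l_cls + lam_geo * l_geo + lam_border * l_border
```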

Post-processing: text line formation algorithm
  • Binarize the five maps (1 text region map, 4 text border maps), using each map's mean_score as the threshold (see the sketch after this list)
  • Compute the overlap between the region map and the 4 border maps
  • Extract text lines (top and bottom edges)
  • Extract the left and right boundaries
  • Merge the four boundaries into a bounding box
  • NMS
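A minimal sketch of the first two steps above, assuming each map is a NumPy array: thresholding each map by its own mean score, then measuring the overlap between the binarized region map and a binarized border map. The later line-grouping steps are omitted, and all function names are hypothetical.

```python
import numpy as np

def binarize_maps(text_map, border_maps):
    # Threshold each map by its own mean score (the mean_score rule above).
    maps = [text_map] + list(border_maps)
    return [(m >= m.mean()).astype(np.uint8) for m in maps]

def border_overlap(text_bin, border_bin):
    # Fraction of border pixels that also lie inside the text region.
    inter = np.logical_and(text_bin, border_bin).sum()
    return inter / max(border_bin.sum(), 1)
```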

Experimental Results

  • ICDAR2013

  • MSRA-TD500

  • ICDAR2017-RCTW

  • ICDAR2017-MLT

Summary and Takeaways

The bootstrapping-based sample augmentation in this paper is quite interesting; although it has long been used in object detection, this is the first time it is brought into OCR. The border-loss idea is also straightforward and resembles the border learning in Yue Wu_ICCV2017_Self-Organized Text Detection With Minimal Post-Processing via Border Learning.