awesome scene text

scene text recognition scene text spotter scene text detection

IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection

multi-oriented text
Abstract
Incidental scene text detection, especially for multi-oriented text regions, is one of the most challenging tasks in many computer vision applications. Different from the common object detection task, scene text often suffers from a large variance of aspect ratio, scale, and orientation. To solve this problem, we propose a novel end-to-end scene text detector IncepText from an instance-aware segmentation perspective. We design a novel Inception-Text module and introduce deformable PSROI pooling to deal with multi-oriented text detection. Extensive experiments on ICDAR2015, RCTW-17, and MSRA-TD500 datasets demonstrate our method's superiority in terms of both effectiveness and efficiency. Our proposed method achieves 1st place result on ICDAR2015 challenge and the state-of-the-art performance on other datasets. Moreover, we have released our implementation as an OCR product which is available for public access.
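
To make the Inception-Text idea concrete, below is a minimal PyTorch sketch of a multi-branch text module that mixes square and asymmetric (1xk / kx1) kernels before fusing the branches. The branch widths and kernel sizes are illustrative assumptions, not the paper's exact configuration, and the deformable PSROI pooling stage is not shown.

    import torch
    import torch.nn as nn

    class InceptionTextModule(nn.Module):
        """Illustrative multi-branch module in the Inception-Text spirit."""
        def __init__(self, in_ch=256, branch_ch=64):
            super().__init__()
            self.branch1 = nn.Sequential(                   # 1x1 bottleneck branch
                nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True))
            self.branch2 = nn.Sequential(                   # 3x3 context branch
                nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, 3, padding=1), nn.ReLU(inplace=True))
            self.branch3 = nn.Sequential(                   # 1x5 + 5x1 branch for long, thin text
                nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, (1, 5), padding=(0, 2)), nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, (5, 1), padding=(2, 0)), nn.ReLU(inplace=True))
            self.fuse = nn.Conv2d(3 * branch_ch, in_ch, 1)  # concatenate branches and fuse back

        def forward(self, x):
            out = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
            return self.fuse(out)

    if __name__ == "__main__":
        feat = torch.randn(1, 256, 32, 32)
        print(InceptionTextModule()(feat).shape)  # torch.Size([1, 256, 32, 32])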

  • pdf
  • code

Shape Robust Text Detection with Progressive Scale Expansion Network

multi-oriented text
Abstract
The challenges of shape robust text detection lie in two aspects: 1) most existing quadrangular bounding box based detectors are difficult to locate texts with arbitrary shapes, which are hard to be enclosed perfectly in a rectangle; 2) most pixel-wise segmentation-based detectors may not separate the text instances that are very close to each other. To address these problems, we propose a novel Progressive Scale Expansion Network (PSENet), designed as a segmentation-based detector with multiple predictions for each text instance. These predictions correspond to different kernels produced by shrinking the original text instance into various scales. Consequently, the final detection can be conducted through our progressive scale expansion algorithm which gradually expands the kernels with minimal scales to the text instances with maximal and complete shapes. Due to the fact that there are large geometrical margins among these minimal kernels, our method is effective to distinguish the adjacent text instances and is robust to arbitrary shapes. The state-of-the-art results on ICDAR 2015 and ICDAR 2017 MLT benchmarks further confirm the great effectiveness of PSENet. Notably, PSENet outperforms the previous best record by absolute 6.37% on the curve text dataset SCUT-CTW1500.
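
The expansion step is easy to express in code. The NumPy/SciPy sketch below (with made-up kernel masks) seeds instance labels from the smallest kernel and grows them breadth-first into each larger kernel, so a pixel keeps the first label that reaches it; this is only an illustration of the algorithm, not the released implementation.

    from collections import deque
    import numpy as np
    from scipy.ndimage import label

    def progressive_scale_expansion(kernels):
        """kernels: list of binary masks (H, W), ordered smallest -> largest."""
        labels, _ = label(kernels[0])                  # seed instances from the smallest kernel
        for k in kernels[1:]:
            queue = deque(zip(*np.nonzero(labels)))    # frontier = currently labelled pixels
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < labels.shape[0] and 0 <= nx < labels.shape[1] \
                            and k[ny, nx] and labels[ny, nx] == 0:
                        labels[ny, nx] = labels[y, x]  # first label to arrive wins
                        queue.append((ny, nx))
        return labels

    if __name__ == "__main__":
        big = np.zeros((8, 20), dtype=np.uint8); big[1:7, 1:19] = 1     # two touching words
        small = np.zeros_like(big); small[3:5, 3:8] = 1; small[3:5, 12:17] = 1
        print(np.unique(progressive_scale_expansion([small, big])))     # [0 1 2]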

  • pdf
  • code

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

multi-oriented text, arbitrary shapes
Abstract
Driven by deep neural networks and large scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed as TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with potentially variable radius and orientation. Such geometry attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.
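
As a toy illustration of the representation itself (not the detection network), the OpenCV-based sketch below renders a text region as the union of ordered, overlapping disks along a centre line; the per-disk orientation attribute estimated by the FCN is omitted here.

    import numpy as np
    import cv2

    def disks_to_mask(disks, h, w):
        """disks: list of (cx, cy, radius); returns the binary mask of their union."""
        mask = np.zeros((h, w), dtype=np.uint8)
        for cx, cy, r in disks:
            cv2.circle(mask, (int(round(cx)), int(round(cy))), int(round(r)), 1, thickness=-1)
        return mask

    if __name__ == "__main__":
        # a gently curved "snake" of overlapping disks
        disks = [(40 + 12 * i, 40 + int(6 * np.sin(i / 2.0)), 10) for i in range(10)]
        mask = disks_to_mask(disks, 100, 200)
        print(mask.sum(), mask.shape)   # non-zero area, (100, 200)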

  • pdf

Sliding Line Point Regression for Shape Robust Scene Text Detection

multi-oriented text, arbitrary shapes
Abstract
Traditional text detection methods mostly focus on quadrangle text. In this study we propose a novel method named sliding line point regression (SLPR) in order to detect arbitrary-shape text in natural scenes. SLPR regresses multiple points on the edge of a text line and then utilizes these points to sketch the outline of the text. The proposed SLPR can be adapted to many object detection architectures such as Faster R-CNN and R-FCN. Specifically, we first use a region proposal network (RPN) to generate the smallest rectangular box enclosing the text, then isometrically regress the points on the edge of the text by using vertically and horizontally sliding lines. To make full use of information and reduce redundancy, we obtain the x-coordinate or y-coordinate of each target point from the rectangular box position and only regress the remaining y-coordinate or x-coordinate. Accordingly, we not only reduce the parameters of the system but also constrain the points, which yields more regular polygons. Our approach achieved competitive results on the traditional ICDAR2015 Incidental Scene Text benchmark and the curve text detection dataset CTW1500.
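
The coordinate-sharing trick is easy to see in a small sketch: for K vertical sliding lines the x-coordinates are fixed by the proposal box, so only the y-offsets of the top and bottom boundary points need to be regressed. This is an illustration of the idea rather than the authors' parameterisation, and the offsets below are made-up numbers.

    import numpy as np

    def points_from_vertical_slides(box, top_dy, bot_dy):
        """box: (x0, y0, x1, y1); top_dy/bot_dy: regressed y-offsets from the box top,
        one per sliding line. Returns a closed polygon (top left->right, bottom right->left)."""
        x0, y0, x1, y1 = box
        k = len(top_dy)
        xs = np.linspace(x0, x1, k)                 # known x positions of the sliding lines
        top = np.stack([xs, y0 + np.asarray(top_dy)], axis=1)
        bottom = np.stack([xs, y0 + np.asarray(bot_dy)], axis=1)
        return np.concatenate([top, bottom[::-1]])  # polygon outline of the text region

    if __name__ == "__main__":
        box = (10, 20, 110, 60)
        poly = points_from_vertical_slides(box, top_dy=[5, 2, 0, 2, 5], bot_dy=[35, 38, 40, 38, 35])
        print(poly.shape)   # (10, 2)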

  • pdf

Arbitrary-Oriented Scene Text Detection via Rotation Proposals

multi-oriented text
Abstract
This paper introduces a novel rotation-based framework for arbitrary-oriented text detection in natural scene images. We present the Rotation Region Proposal Networks (RRPN), which are designed to generate inclined proposals with text orientation angle information. The angle information is then adapted for bounding box regression to make the proposals more accurately fit into the text region in terms of the orientation. The Rotation Region-of-Interest (RRoI) pooling layer is proposed to project arbitrary-oriented proposals to a feature map for a text region classifier. The whole framework is built upon a region-proposal-based architecture, which ensures the computational efficiency of the arbitrary-oriented text detection compared with previous text detection systems. We conduct experiments using the rotation-based framework on three real-world scene text detection datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
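
A toy generator for the rotated proposals is sketched below: each feature-map location emits anchors over a few scales, aspect ratios and orientation angles, parameterised as (x, y, w, h, theta). The specific scales, ratios and angles are illustrative assumptions rather than the paper's exact settings.

    import numpy as np

    def rotated_anchors(feat_h, feat_w, stride=16,
                        scales=(8, 16, 32), ratios=(2, 5), angles_deg=(-60, -30, 0, 30, 60, 90)):
        anchors = []
        for i in range(feat_h):
            for j in range(feat_w):
                cx, cy = (j + 0.5) * stride, (i + 0.5) * stride    # anchor centre in image coords
                for s in scales:
                    for r in ratios:
                        # area (s * stride)^2, aspect ratio w:h = r
                        w, h = s * stride * np.sqrt(r), s * stride / np.sqrt(r)
                        for a in angles_deg:
                            anchors.append((cx, cy, w, h, np.deg2rad(a)))
        return np.array(anchors)

    if __name__ == "__main__":
        a = rotated_anchors(2, 3)
        print(a.shape)   # (2*3 locations * 3 scales * 2 ratios * 6 angles, 5) = (216, 5)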

  • pdf
  • code

Detecting Curve Text in the Wild: New Dataset and New Solution

multi-oriented text, arbitrary shapes
Abstract
Scene text detection has made great progress in recent years. Detection manners are evolving from axis-aligned rectangles to rotated rectangles and further to quadrangles. However, current datasets contain very little curve text, which can be widely observed in scene images such as signboards, product names and so on. To draw attention to the problem of reading curve text in the wild, in this paper we construct a curve text dataset named CTW1500, which includes over 10k text annotations in 1,500 images (1,000 for training and 500 for testing). Based on this dataset, we are the first to propose a polygon-based curve text detector (CTD) which can directly detect curve text without empirical combination. Moreover, by seamlessly integrating the recurrent transverse and longitudinal offset connection (TLOC), the proposed method is end-to-end trainable and learns the inherent connection among the position offsets. This allows the CTD to explore context information instead of predicting points independently, resulting in smoother and more accurate detection. We also propose two simple but effective post-processing methods named non-polygon suppression (NPS) and polygonal non-maximum suppression (PNMS) to further improve the detection accuracy. Furthermore, the proposed approach is designed in a universal manner and can also be trained with rectangular or quadrilateral bounding boxes without extra effort. Experimental results on CTW-1500 demonstrate that our method, with only a light backbone, outperforms state-of-the-art methods by a large margin. When evaluating only on the curve or non-curve subset, CTD + TLOC still achieves the best results.
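
Of the two post-processing steps, PNMS is the easiest to sketch: it is ordinary score-ordered NMS with polygon IoU in place of rectangle IoU. The snippet below assumes Shapely is available for the polygon geometry; the IoU threshold is just an example value.

    from shapely.geometry import Polygon

    def polygon_iou(p, q):
        a, b = Polygon(p), Polygon(q)
        if not a.is_valid or not b.is_valid:
            return 0.0
        inter = a.intersection(b).area
        union = a.area + b.area - inter
        return inter / union if union > 0 else 0.0

    def pnms(polygons, scores, iou_thresh=0.3):
        """polygons: list of point lists [(x, y), ...]; returns indices of kept polygons."""
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)
            keep.append(best)
            order = [i for i in order if polygon_iou(polygons[best], polygons[i]) < iou_thresh]
        return keep

    if __name__ == "__main__":
        polys = [[(0, 0), (10, 0), (10, 4), (0, 4)],
                 [(1, 0), (11, 0), (11, 4), (1, 4)],
                 [(0, 10), (10, 10), (10, 14), (0, 14)]]
        print(pnms(polys, scores=[0.9, 0.8, 0.7]))   # [0, 2]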

  • pdf
  • code

FOTS: Fast Oriented Text Spotting with a Unified Network

multi-oriented text
Abstract
Incidental scene text spotting is considered one of the most difficult and valuable challenges in the document analysis community. Most existing methods treat text detection and recognition as separate tasks. In this work, we propose a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network for simultaneous detection and recognition, sharing computation and visual information among the two complementary tasks. Specially, RoIRotate is introduced to share convolutional features between detection and recognition. Benefiting from convolution sharing strategy, our FOTS has little computation overhead compared to baseline text detection network, and the joint training method learns more generic features to make our method perform better than these two-stage methods. Experiments on ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 datasets demonstrate that the proposed method outperforms state-of-the-art methods significantly, which further allows us to develop the first real-time oriented text spotting system which surpasses all previous state-of-the-art results by more than 5% on ICDAR 2015 text spotting task while keeping 22.6 fps.
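
The following is a rough, OpenCV-based stand-in for what RoIRotate does geometrically: warp an oriented text box into a small, height-normalised, axis-aligned patch that a recognition branch could consume. The real layer is differentiable and operates on the shared feature maps; this sketch only shows the transform.

    import numpy as np
    import cv2

    def roi_rotate(img, center, size, angle_deg, out_h=8):
        """img: (H, W) or (H, W, C) array; center=(cx, cy), size=(w, h) and angle
        describe the oriented box. Returns an out_h-tall, aspect-preserving crop."""
        cx, cy = center
        w, h = size
        out_w = max(1, int(round(w * out_h / float(h))))          # keep the aspect ratio
        theta = np.deg2rad(angle_deg)
        # corners of the oriented box: top-left, top-right, bottom-right, bottom-left
        local = np.array([[-w / 2, -h / 2], [w / 2, -h / 2], [w / 2, h / 2], [-w / 2, h / 2]])
        rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
        src = (local @ rot.T + np.array([cx, cy])).astype(np.float32)
        dst = np.array([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]], dtype=np.float32)
        M = cv2.getPerspectiveTransform(src, dst)
        return cv2.warpPerspective(img, M, (out_w, out_h))

    if __name__ == "__main__":
        img = np.random.rand(64, 64).astype(np.float32)
        patch = roi_rotate(img, center=(32, 32), size=(40, 10), angle_deg=20)
        print(patch.shape)   # (8, 32)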

  • pdf
  • code

TextBoxes++: A Single-Shot Oriented Scene Text Detector

multi-oriented text
Abstract
Scene text detection is an important step of scene text recognition system and also a challenging problem. Different from general object detection, the main challenges of scene text detection lie on arbitrary orientations, small sizes, and significantly variant aspect ratios of text in natural images. In this paper, we present an end-to-end trainable fast scene text detector, named TextBoxes++, which detects arbitrary-oriented scene text with both high accuracy and efficiency in a single network forward pass. No post-processing other than an efficient non-maximum suppression is involved. We have evaluated the proposed TextBoxes++ on four public datasets. In all experiments, TextBoxes++ outperforms competing methods in terms of text localization accuracy and runtime. More specifically, TextBoxes++ achieves an f-measure of 0.817 at 11.6fps for 1024x1024 ICDAR 2015 Incidental text images, and an f-measure of 0.5591 at 19.8fps for 768x768 COCO-Text images. Furthermore, combined with a text recognizer, TextBoxes++ significantly outperforms the state-of-the-art approaches for word spotting and end-to-end text recognition tasks on popular benchmarks.
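
A small decoder in the spirit of the quadrilateral regression used here is sketched below: each default box predicts normalised offsets of four corner points relative to its own corners, and decoding turns them back into an image-space quadrilateral. The exact parameterisation in the paper may differ; the offsets shown are made up.

    import numpy as np

    def decode_quad(default_box, offsets):
        """default_box: (cx, cy, w, h). offsets: 8 values (dx1, dy1, ..., dx4, dy4),
        normalised by the default box width/height. Returns a (4, 2) quadrilateral."""
        cx, cy, w, h = default_box
        corners = np.array([[cx - w / 2, cy - h / 2],   # top-left
                            [cx + w / 2, cy - h / 2],   # top-right
                            [cx + w / 2, cy + h / 2],   # bottom-right
                            [cx - w / 2, cy + h / 2]])  # bottom-left
        off = np.asarray(offsets, dtype=float).reshape(4, 2) * np.array([w, h])
        return corners + off

    if __name__ == "__main__":
        quad = decode_quad((50, 50, 40, 12), [0.05, 0.1, -0.02, 0.0, 0.0, -0.1, 0.03, 0.05])
        print(quad.round(1))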

  • pdf
  • code

R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection

multi-oriented text
Abstract
In this paper, we propose a novel method called Rotational Region CNN (R2CNN) for detecting arbitrary-oriented texts in natural scene images. The framework is based on Faster R-CNN [1] architecture. First, we use the Region Proposal Network (RPN) to generate axis-aligned bounding boxes that enclose the texts with different orientations. Second, for each axis-aligned text box proposed by RPN, we extract its pooled features with different pooled sizes and the concatenated features are used to simultaneously predict the text/non-text score, axis-aligned box and inclined minimum area box. At last, we use an inclined non-maximum suppression to get the detection results. Our approach achieves competitive results on text detection benchmarks: ICDAR 2015 and ICDAR 2013.
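
The multi-size pooling step can be sketched in a few lines of PyTorch: the same RoI feature is pooled to several output sizes (e.g. 7x7, 11x3 and 3x11 to suit long horizontal or vertical text) and the flattened results are concatenated before the prediction heads. Channel counts and pool sizes below are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def multi_pooled_features(roi_feat, sizes=((7, 7), (11, 3), (3, 11))):
        """roi_feat: (N, C, H, W) features cropped for a batch of RoIs."""
        pooled = [F.adaptive_max_pool2d(roi_feat, s).flatten(1) for s in sizes]
        return torch.cat(pooled, dim=1)   # (N, C * (49 + 33 + 33))

    if __name__ == "__main__":
        feats = torch.randn(4, 64, 14, 14)
        print(multi_pooled_features(feats).shape)   # torch.Size([4, 7360])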

  • pdf
  • code-caffe
  • code-tensorflow

PixelLink: Detecting Scene Text via Instance Segmentation

multi-oriented text, arbitrary shapes
Abstract
Most state-of-the-art scene text detection algorithms are deep learning based methods that depend on bounding box regression and perform at least two kinds of predictions: text/non-text classification and location regression. Regression plays a key role in the acquisition of bounding boxes in these methods, but it is not indispensable because text/non-text prediction can also be considered as a kind of semantic segmentation that contains full location information in itself. However, text instances in scene images often lie very close to each other, making them very difficult to separate via semantic segmentation. Therefore, instance segmentation is needed to address this problem. In this paper, PixelLink, a novel scene text detection algorithm based on instance segmentation, is proposed. Text instances are first segmented out by linking pixels within the same instance together. Text bounding boxes are then extracted directly from the segmentation result without location regression. Experiments show that, compared with regression-based methods, PixelLink can achieve better or comparable performance on several benchmarks, while requiring many fewer training iterations and less training data.
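
A condensed sketch of the decoding stage is given below: positive pixels are merged into instances with union-find whenever the link prediction between two neighbouring positive pixels is also positive (only right/down links are shown for brevity); a rotated box can then be read off each component, e.g. with cv2.minAreaRect. Thresholding of the raw score and link maps is assumed to have happened already.

    import numpy as np

    def link_pixels(text_mask, link_right, link_down):
        """text_mask, link_right, link_down: boolean (H, W) maps -> instance label map."""
        h, w = text_mask.shape
        parent = np.arange(h * w)

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i

        def union(i, j):
            parent[find(i)] = find(j)

        for y in range(h):
            for x in range(w):
                if not text_mask[y, x]:
                    continue
                if x + 1 < w and text_mask[y, x + 1] and link_right[y, x]:
                    union(y * w + x, y * w + x + 1)
                if y + 1 < h and text_mask[y + 1, x] and link_down[y, x]:
                    union(y * w + x, (y + 1) * w + x)

        labels = np.zeros((h, w), dtype=np.int32)
        roots = {}
        for y in range(h):
            for x in range(w):
                if text_mask[y, x]:
                    labels[y, x] = roots.setdefault(find(y * w + x), len(roots) + 1)
        return labels

    if __name__ == "__main__":
        mask = np.zeros((5, 12), dtype=bool)
        mask[1:4, 1:5] = True                   # two words separated by a two-column gap
        mask[1:4, 7:11] = True
        links = np.ones_like(mask)
        print(link_pixels(mask, links, links).max())   # 2 instances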

  • pdf
  • code

EAST: An Efficient and Accurate Scene Text Detector

multi-oriented text
Abstract
Previous approaches for scene text detection have already achieved promising performances across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even when equipped with deep neural network models, because the overall performance is determined by the interplay of multiple stages and components in the pipelines. In this work, we propose a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes. The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps (e.g., candidate aggregation and word partitioning), with a single neural network. The simplicity of our pipeline allows concentrating efforts on designing loss functions and neural network architecture. Experiments on standard datasets including ICDAR 2015, COCO-Text and MSRA-TD500 demonstrate that the proposed algorithm significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency. On the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.7820 at 13.2fps at 720p resolution.
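
The per-pixel "RBOX" geometry of this kind of detector is simple to decode. In the sketch below each pixel carries four distances to the top/right/bottom/left edges of its text rectangle plus a rotation angle, and the rectangle is recovered by rotating about the pixel; the released EAST code uses a slightly different parameterisation, so treat this as an illustration of the idea only.

    import numpy as np

    def decode_rbox(px, py, d_top, d_right, d_bottom, d_left, angle):
        """Return the 4 corner points (4, 2) of the rotated box predicted at one pixel."""
        local = np.array([[-d_left, -d_top],
                          [ d_right, -d_top],
                          [ d_right,  d_bottom],
                          [-d_left,  d_bottom]], dtype=float)
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, -s], [s, c]])
        return local @ rot.T + np.array([px, py])

    if __name__ == "__main__":
        quad = decode_rbox(100, 50, d_top=8, d_right=40, d_bottom=8, d_left=20,
                           angle=np.deg2rad(10))
        print(quad.round(1))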

  • pdf
  • code
  • code-AdvancedEAST
  • code-pytorch

Single Shot Text Detector with Regional Attention

Abstract
We present a novel single-shot text detector that directly outputs word-level bounding boxes in a natural image. We propose an attention mechanism which roughly identifies text regions via an automatically learned attentional map. This substantially suppresses background interference in the convolutional features, which is the key to producing accurate inference of words, particularly at extremely small sizes. This results in a single model that essentially works in a coarse-to-fine manner. It departs from recent FCN-based text detectors which cascade multiple FCN models to achieve an accurate prediction. Furthermore, we develop a hierarchical inception module which efficiently aggregates multi-scale inception features. This enhances local details, and also encodes strong context information, allowing the detector to work reliably on multi-scale and multi-orientation text with single-scale images. Our text detector achieves an F-measure of 77% on the ICDAR 2015 benchmark, advancing the state-of-the-art results in [18, 28].

  • pdf
  • code

Detecting Multi-Oriented Text with Corner-based Region Proposals

Abstract
Previous approaches for scene text detection usually rely on manually defined sliding windows. In this paper, an intuitive region-based method is presented to detect multi-oriented text without any prior knowledge regarding the textual shape. We first introduce a Corner-based Region Proposal Network (CRPN) that employs corners to estimate the possible locations of text instances instead of shifting a set of default anchors. The proposals generated by CRPN are geometry adaptive, which makes our method robust to various text aspect ratios and orientations. Moreover, we design a simple embedded data augmentation module inside the region-wise subnetwork, which not only ensures the model utilizes training data more efficiently, but also learns to find the most representative instance of the input images for training. Experimental results on public benchmarks confirm that the proposed method is capable of achieving comparable performance with the state-of-the-art methods. On the ICDAR 2013 and 2015 datasets, it obtains F-measure of 0.876 and 0.845 respectively.

  • pdf
  • code

An end-to-end TextSpotter with Explicit Alignment and Attention

Abstract
Text detection and recognition in natural images have long been considered as two separate tasks that are processed sequentially. Training of two tasks in a unified framework is non-trivial due to significant differences in optimisation difficulties. In this work, we present a conceptually simple yet efficient framework that simultaneously processes the two tasks in one shot. Our main contributions are three-fold: 1) we propose a novel text-alignment layer that allows it to precisely compute convolutional features of a text instance in arbitrary orientation, which is the key to boost the performance; 2) a character attention mechanism is introduced by using character spatial information as explicit supervision, leading to large improvements in recognition; 3) two technologies, together with a new RNN branch for word recognition, are integrated seamlessly into a single model which is end-to-end trainable. This allows the two tasks to work collaboratively by sharing convolutional features, which is critical to identify challenging text instances. Our model achieves impressive results in end-to-end recognition on the ICDAR2015 dataset, significantly advancing most recent results, with improvements of F-measure from (0.54, 0.51, 0.47) to (0.82, 0.77, 0.63), by using a strong, weak and generic lexicon respectively. Thanks to joint training, our method can also serve as a good detector by achieving a new state-of-the-art detection performance on two datasets.

  • pdf
  • code
