Cracking Character CAPTCHAs with Deep Learning (repost)
A CAPTCHA is an image generated from random characters with interference pixels mixed in; the user has to type the characters by hand, which stops bots from mass-registering accounts, flooding forums, posting junk ads, and so on.
A CAPTCHA's purpose is to verify that the user is a human rather than a bot; the design goal is to be easy for people and hard for machines.
The image above shows typical character CAPTCHAs; some CAPTCHAs ask a question instead.
Let's first look at the common ways to defeat a CAPTCHA:
- Human solving farms (most paid solving jobs target the CAPTCHAs of large sites, for automated registration and the like)
- Finding a vulnerability that lets you bypass the CAPTCHA entirely
- Character recognition, which is the focus of this post
From what I found online, approaches based on Tesseract OCR, OpenCV, and similar tools all require splitting the CAPTCHA into individual characters and recognizing them one by one. Segmentation is something humans excel at; once the characters overlap, a machine has a hard time separating them.
The method in this post skips segmentation entirely and recognizes the CAPTCHA as a single whole.
Related papers:
- Multi-digit Number Recognition from Street View Imagery using Deep CNN
- CAPTCHA Recognition with Active Deep Learning
- http://matthewearl.github.io/2016/05/06/cnn-anpr/
With deep learning, training data, and plenty of compute, we can train a model that cracks a CAPTCHA scheme within a few days; the prerequisite, of course, is obtaining a large amount of training data.
Ways to obtain training data:
- Label it by hand (the work-yourself-to-death option)
- Reverse-engineer the CAPTCHA generation mechanism and automatically produce unlimited training data
- Get someone on the inside (the undercover / shameless / risk-your-neck / what-did-they-ever-do-to-you option)
Here I build my own CAPTCHA generator and then train a CNN model to crack it.
In my view the CAPTCHA mechanism is essentially obsolete: simply making CAPTCHAs harder only makes them harder for humans to read, while with CNN+RNN a machine can recognize them at least as accurately as a person. Google has realized this; they now use machine learning to detect abnormal traffic instead.
The CAPTCHA generator
```python
from captcha.image import ImageCaptcha  # pip install captcha
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import random

# Characters that can appear in the CAPTCHA; skipping Chinese characters here
number = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
            'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
ALPHABET = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
            'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


# CAPTCHAs usually ignore case; this CAPTCHA is 4 characters long
def random_captcha_text(char_set=number + alphabet + ALPHABET, captcha_size=4):
    captcha_text = []
    for i in range(captcha_size):
        c = random.choice(char_set)
        captcha_text.append(c)
    return captcha_text


# Generate the CAPTCHA image for a random text
def gen_captcha_text_and_image():
    image = ImageCaptcha()

    captcha_text = random_captcha_text()
    captcha_text = ''.join(captcha_text)

    captcha = image.generate(captcha_text)
    # image.write(captcha_text, captcha_text + '.jpg')  # write to file

    captcha_image = Image.open(captcha)
    captcha_image = np.array(captcha_image)
    return captcha_text, captcha_image


if __name__ == '__main__':
    # quick test
    text, image = gen_captcha_text_and_image()

    f = plt.figure()
    ax = f.add_subplot(111)
    ax.text(0.1, 0.9, text, ha='center', va='center', transform=ax.transAxes)
    plt.imshow(image)

    plt.show()
```
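If you would rather build a fixed dataset on disk than generate images on the fly, the `image.write` call commented out above can save labeled files. A minimal sketch (the directory layout and using the text as the filename are my own choices, not part of the original post):

```python
import os

# Minimal sketch: dump n labeled CAPTCHA images to disk; the filename doubles as
# the label. Note that duplicate texts would overwrite earlier files.
def save_captcha_dataset(n, out_dir='captcha_data'):
    os.makedirs(out_dir, exist_ok=True)
    gen = ImageCaptcha()
    for _ in range(n):
        text = ''.join(random_captcha_text())
        gen.write(text, os.path.join(out_dir, text + '.png'))
```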
The text in the upper-left corner is the label for the corresponding CAPTCHA image.
Training
```python
from gen_captcha import gen_captcha_text_and_image
from gen_captcha import number
from gen_captcha import alphabet
from gen_captcha import ALPHABET

import numpy as np
import tensorflow as tf

text, image = gen_captcha_text_and_image()
print("CAPTCHA image shape:", image.shape)  # (60, 160, 3)
# image size
IMAGE_HEIGHT = 60
IMAGE_WIDTH = 160
MAX_CAPTCHA = len(text)
print("Max CAPTCHA text length:", MAX_CAPTCHA)
# The CAPTCHA is at most 4 characters; I fix the length at 4 here, though it does
# not have to be fixed. If a CAPTCHA is shorter than 4, pad it with '_'.


# Convert a color image to grayscale (color is of little use for CAPTCHA recognition)
def convert2gray(img):
    if len(img.shape) > 2:
        gray = np.mean(img, -1)
        # The line above is the fast way; the standard conversion is:
        # r, g, b = img[:, :, 0], img[:, :, 1], img[:, :, 2]
        # gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
        return gray
    else:
        return img


"""
CNNs perform best when the image dimensions are multiples of 2.
If your image size is not, you can pad the image borders with filler pixels:
np.pad(image, ((2, 3), (2, 2)), 'constant', constant_values=(255,))
# pads 2 rows on top, 3 rows on the bottom, 2 columns on the left and 2 on the right
"""

# Text to vector
char_set = number + alphabet + ALPHABET + ['_']  # '_' pads CAPTCHAs shorter than 4
CHAR_SET_LEN = len(char_set)


def text2vec(text):
    text_len = len(text)
    if text_len > MAX_CAPTCHA:
        raise ValueError('CAPTCHA is at most 4 characters')

    vector = np.zeros(MAX_CAPTCHA * CHAR_SET_LEN)

    def char2pos(c):
        if c == '_':
            k = 62
            return k
        k = ord(c) - 48
        if k > 9:
            k = ord(c) - 55
            if k > 35:
                k = ord(c) - 61
                if k > 61:
                    raise ValueError('No Map')
        return k

    for i, c in enumerate(text):
        idx = i * CHAR_SET_LEN + char2pos(c)
        vector[idx] = 1
    return vector


# Vector back to text
def vec2text(vec):
    char_pos = vec.nonzero()[0]
    text = []
    for i, c in enumerate(char_pos):
        char_at_pos = i  # c/63
        char_idx = c % CHAR_SET_LEN
        if char_idx < 10:
            char_code = char_idx + ord('0')
        elif char_idx < 36:
            char_code = char_idx - 10 + ord('A')
        elif char_idx < 62:
            char_code = char_idx - 36 + ord('a')
        elif char_idx == 62:
            char_code = ord('_')
        else:
            raise ValueError('error')
        text.append(chr(char_code))
    return "".join(text)


"""
The vector (of size MAX_CAPTCHA*CHAR_SET_LEN) is 0/1 encoded:
every 63 entries encode one character, so both the position and the character are captured.
vec = text2vec("F5Sd")
text = vec2text(vec)
print(text)  # F5Sd
vec = text2vec("SFd5")
text = vec2text(vec)
print(text)  # SFd5
"""


# Generate one training batch
def get_next_batch(batch_size=128):
    batch_x = np.zeros([batch_size, IMAGE_HEIGHT * IMAGE_WIDTH])
    batch_y = np.zeros([batch_size, MAX_CAPTCHA * CHAR_SET_LEN])

    # Occasionally the generated image is not (60, 160, 3)
    def wrap_gen_captcha_text_and_image():
        while True:
            text, image = gen_captcha_text_and_image()
            if image.shape == (60, 160, 3):
                return text, image

    for i in range(batch_size):
        text, image = wrap_gen_captcha_text_and_image()
        image = convert2gray(image)

        batch_x[i, :] = image.flatten() / 255  # (image.flatten()-128)/128 gives zero mean
        batch_y[i, :] = text2vec(text)

    return batch_x, batch_y

####################################################################

X = tf.placeholder(tf.float32, [None, IMAGE_HEIGHT * IMAGE_WIDTH])
Y = tf.placeholder(tf.float32, [None, MAX_CAPTCHA * CHAR_SET_LEN])
keep_prob = tf.placeholder(tf.float32)  # dropout


# Define the CNN
def crack_captcha_cnn(w_alpha=0.01, b_alpha=0.1):
    x = tf.reshape(X, shape=[-1, IMAGE_HEIGHT, IMAGE_WIDTH, 1])

    # w_c1_alpha = np.sqrt(2.0/(IMAGE_HEIGHT*IMAGE_WIDTH))
    # w_c2_alpha = np.sqrt(2.0/(3*3*32))
    # w_c3_alpha = np.sqrt(2.0/(3*3*64))
    # w_d1_alpha = np.sqrt(2.0/(8*32*64))
    # out_alpha = np.sqrt(2.0/1024)

    # 3 conv layers
    w_c1 = tf.Variable(w_alpha * tf.random_normal([3, 3, 1, 32]))
    b_c1 = tf.Variable(b_alpha * tf.random_normal([32]))
    conv1 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, w_c1, strides=[1, 1, 1, 1], padding='SAME'), b_c1))
    conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv1 = tf.nn.dropout(conv1, keep_prob)

    w_c2 = tf.Variable(w_alpha * tf.random_normal([3, 3, 32, 64]))
    b_c2 = tf.Variable(b_alpha * tf.random_normal([64]))
    conv2 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv1, w_c2, strides=[1, 1, 1, 1], padding='SAME'), b_c2))
    conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv2 = tf.nn.dropout(conv2, keep_prob)

    w_c3 = tf.Variable(w_alpha * tf.random_normal([3, 3, 64, 64]))
    b_c3 = tf.Variable(b_alpha * tf.random_normal([64]))
    conv3 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv2, w_c3, strides=[1, 1, 1, 1], padding='SAME'), b_c3))
    conv3 = tf.nn.max_pool(conv3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv3 = tf.nn.dropout(conv3, keep_prob)

    # Fully connected layer; after three 2x2 max-pools, 60x160 becomes 8x20
    w_d = tf.Variable(w_alpha * tf.random_normal([8 * 20 * 64, 1024]))
    b_d = tf.Variable(b_alpha * tf.random_normal([1024]))
    dense = tf.reshape(conv3, [-1, w_d.get_shape().as_list()[0]])
    dense = tf.nn.relu(tf.add(tf.matmul(dense, w_d), b_d))
    dense = tf.nn.dropout(dense, keep_prob)

    w_out = tf.Variable(w_alpha * tf.random_normal([1024, MAX_CAPTCHA * CHAR_SET_LEN]))
    b_out = tf.Variable(b_alpha * tf.random_normal([MAX_CAPTCHA * CHAR_SET_LEN]))
    out = tf.add(tf.matmul(dense, w_out), b_out)
    # out = tf.nn.softmax(out)
    return out


# Training
def train_crack_captcha_cnn():
    output = crack_captcha_cnn()
    # loss
    # loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(output, Y))
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=output, labels=Y))
    # How does sigmoid differ from softmax for the final classification layer?
    # optimizer: to speed up training, the learning rate should start large and decay gradually
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

    predict = tf.reshape(output, [-1, MAX_CAPTCHA, CHAR_SET_LEN])
    max_idx_p = tf.argmax(predict, 2)
    max_idx_l = tf.argmax(tf.reshape(Y, [-1, MAX_CAPTCHA, CHAR_SET_LEN]), 2)
    correct_pred = tf.equal(max_idx_p, max_idx_l)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        step = 0
        while True:
            batch_x, batch_y = get_next_batch(64)
            _, loss_ = sess.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.75})
            print(step, loss_)

            # evaluate accuracy every 100 steps
            if step % 100 == 0:
                batch_x_test, batch_y_test = get_next_batch(100)
                acc = sess.run(accuracy, feed_dict={X: batch_x_test, Y: batch_y_test, keep_prob: 1.})
                print(step, acc)
                # if accuracy exceeds 50%, save the model and stop training
                if acc > 0.5:
                    saver.save(sess, "crack_capcha.model", global_step=step)
                    break

            step += 1


train_crack_captcha_cnn()
```
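On the question raised in the loss comment above: `sigmoid_cross_entropy_with_logits` treats every one of the 4*63 outputs as an independent binary label, whereas a softmax over each 63-way slot forces exactly one character per position. A minimal sketch of the per-position softmax alternative (my own variation, not what the post trains with):

```python
# Hypothetical alternative loss: one softmax per character position
# instead of 4*63 independent sigmoids.
logits = tf.reshape(output, [-1, MAX_CAPTCHA, CHAR_SET_LEN])
labels = tf.reshape(Y, [-1, MAX_CAPTCHA, CHAR_SET_LEN])
# softmax_cross_entropy_with_logits applies the softmax over the last axis,
# giving one loss value per (sample, character position).
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
```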
A CNN needs a large number of samples to train on. Since my time and resources were limited, I used only digits as the CAPTCHA character set for this test. With digits plus upper- and lower-case letters the network would have 4*62 outputs; with digits only it has 4*10.
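Restricting the character set to digits mostly means passing a digits-only set to the generator and shrinking the label encoding to match. A minimal sketch of that variant (the helper names are mine; the original `text2vec`/`vec2text` hardcode the 63-way mapping and would need the same change):

```python
# Hypothetical digits-only variant: 10 classes per position -> 4*10 outputs.
char_set = number             # ['0'..'9']
CHAR_SET_LEN = len(char_set)  # 10

def digit_text2vec(text):
    vec = np.zeros(MAX_CAPTCHA * CHAR_SET_LEN)
    for i, c in enumerate(text):
        vec[i * CHAR_SET_LEN + int(c)] = 1
    return vec

def digit_vec2text(vec):
    return ''.join(str(i % CHAR_SET_LEN) for i in vec.nonzero()[0])

# Digit-only CAPTCHA text can come from the existing generator helper:
# random_captcha_text(char_set=number, captcha_size=4)
```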
TensorBoard is a great tool: it helps both with debugging and with understanding the graph.
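A minimal sketch of how the loss/accuracy curves could be logged for TensorBoard from inside `train_crack_captcha_cnn()` (the scalar names and log directory are my own choices; this is a fragment to splice into the function, not a standalone script):

```python
# Hypothetical TensorBoard wiring inside train_crack_captcha_cnn():
tf.summary.scalar('loss', loss)          # after loss is defined
tf.summary.scalar('accuracy', accuracy)  # after accuracy is defined
merged = tf.summary.merge_all()

# inside the tf.Session() block:
writer = tf.summary.FileWriter('./logs', sess.graph)

# in the training loop, run the merged summary alongside the optimizer:
# summary, _, loss_ = sess.run([merged, optimizer, loss],
#                              feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.75})
# writer.add_summary(summary, step)
```

Then run `tensorboard --logdir=./logs` to view the curves.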
Accuracy when training finished (I stopped training once it passed 50%):
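One caveat about that stopping criterion: the `accuracy` tensor in the training code averages over all four character positions, so it is per-character accuracy. Whole-CAPTCHA accuracy (all four characters right at once) is stricter; a small sketch of how it could be tracked inside the training function, reusing the same variable names:

```python
# Hypothetical whole-CAPTCHA accuracy: a sample counts as correct only when
# every one of its MAX_CAPTCHA characters is predicted correctly.
correct_all = tf.reduce_all(tf.equal(max_idx_p, max_idx_l), axis=1)
captcha_accuracy = tf.reduce_mean(tf.cast(correct_all, tf.float32))
```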
Using the trained model to recognize CAPTCHAs:
```python
def crack_captcha(captcha_image):
    output = crack_captcha_cnn()

    saver = tf.train.Saver()
    with tf.Session() as sess:
        # restore the most recent checkpoint from the current directory
        saver.restore(sess, tf.train.latest_checkpoint('.'))

        # pick the most likely character at each of the 4 positions
        predict = tf.argmax(tf.reshape(output, [-1, MAX_CAPTCHA, CHAR_SET_LEN]), 2)
        text_list = sess.run(predict, feed_dict={X: [captcha_image], keep_prob: 1})

        # rebuild a one-hot vector from the predicted indices and decode it
        text = text_list[0].tolist()
        vector = np.zeros(MAX_CAPTCHA * CHAR_SET_LEN)
        i = 0
        for n in text:
            vector[i * CHAR_SET_LEN + n] = 1
            i += 1
        return vec2text(vector)


text, image = gen_captcha_text_and_image()
image = convert2gray(image)
image = image.flatten() / 255
predict_text = crack_captcha(image)
print("Actual: {}  Predicted: {}".format(text, predict_text))
```
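To run the model on a CAPTCHA image stored as a file rather than a freshly generated one, the same preprocessing (grayscale, flatten, scale to [0,1]) applies. A minimal sketch, assuming the file roughly matches the 160x60 training images (the helper name is mine):

```python
from PIL import Image

# Hypothetical helper: load an image file and feed it through crack_captcha().
def crack_captcha_file(path):
    img = Image.open(path).convert('RGB').resize((IMAGE_WIDTH, IMAGE_HEIGHT))
    img = convert2gray(np.array(img))
    return crack_captcha(img.flatten() / 255)
```

Note that `crack_captcha()` builds the graph and restores the checkpoint on every call, so this sketch is only meant for one-off use, just like the test snippet above.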
Loss and accuracy curves:
To become a proper coder, this panda is going to start studying the TensorFlow source code; there should be plenty to learn from it.
If you repost this article, please keep it intact and credit the author @斗大的熊貓 and the original URL: http://blog.topspeedsnail.com/archives/10858