Countering anti-scraping: cracking font encryption
阿新 • Published: 2019-01-05
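Today, while scraping rental listings from 58同城, I found that the prices appear as garbled characters in the HTML source even though they display normally on the rendered page.
Font encryption is one of the more troublesome problems in web scraping.
It usually works like this: the site replaces the default character mapping with its own and loads a custom font file as the style for the affected text. With that font applied, the page shows the digits correctly. In the page source, the same code points are drawn without the custom font, so the system's default fonts render them as gibberish.
The general solution is to find the font file and analyze the mapping it contains. The font is normally applied as a style on the elements that hold the encrypted text.
From the styles it is fairly clear that fangchan-secret is the encrypting font.
In the 58同城 source, the font file is base64-encoded and embedded in a style tag inside head. This random-looking string is regenerated on every page refresh, and the mapping inside it changes as well.
When scraping the page, the string can be extracted with a regular expression.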
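A minimal sketch of that extraction step follows. The exact markup 58同城 uses is not reproduced in this post, so the sample_html fragment, the data-URI layout, and the helper name extract_font_base64 are all assumptions for illustration; adjust the pattern to the real page source.

```python
import re
from typing import Optional

# Hypothetical page fragment: assumes the font is embedded as a base64 data URI
# inside an @font-face rule in <head>. The real markup may differ.
sample_html = """
<head><style>
@font-face{font-family:'fangchan-secret';
src:url('data:application/font-ttf;charset=utf-8;base64,AAEAAAALAIAAAwAw') format('truetype')}
</style></head>
"""

# Capture the run of base64 characters that follows "base64,"
FONT_RE = re.compile(r"base64,([A-Za-z0-9+/=]+)")

def extract_font_base64(html: str) -> Optional[str]:
    """Return the first embedded base64 payload, or None if nothing matches."""
    match = FONT_RE.search(html)
    return match.group(1) if match else None

print(extract_font_base64(sample_html))
```

Because the font changes on every refresh, the font string and the price text should be taken from the same response.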
Without refreshing the page, take the random font string from the style tag together with the garbled text that corresponds to 4300:
price_code = '餼龒麣麣' base64_srt = 'AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8Z/YQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQUNAcHAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOpxi4RfDzz1AAsIAAAAAADYVl37AAAAANhWXfsAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAACQAGAAUAAwAKAAEACAAEAAIABwAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAJAACVjwAAlY8AAAAGAACZPAAAmTwAAAAFAACaSwAAmksAAAADAACeOgAAnjoAAAAKAACeowAAnqMAAAABAACfZAAAn2QAAAAIAACfkgAAn5IAAAAEAACfpAAAn6QAAAACAACfpQAAn6UAAAAHAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZ
L+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA'
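To see why the source looks garbled, it helps to print the code points behind the price text. This small sketch only inspects the string itself and assumes nothing about the font.

```python
# Garbled price text copied from the page source (shown on the page as 4300)
price_code = '餼龒麣麣'

# Each character is an ordinary CJK code point; only the custom font draws it as a digit
code_points = [ord(ch) for ch in price_code]
for ch, cp in zip(price_code, code_points):
    print(ch, hex(cp))
```

The last two characters are identical, which matches the two trailing zeros of 4300: one code point stands for one digit.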
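First, base64-decode the string to get the binary font data:
import base64

def make_font_file(base64_string: str):
    # Decode the base64 text into the raw bytes of the font
    bin_data = base64.decodebytes(base64_string.encode())
    # Also save the font to disk
    with open('text.otf', 'wb') as f:
        f.write(bin_data)
    return bin_data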
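As a quick sanity check on the decoded bytes, the first 12 bytes of an sfnt font hold a version tag and the table count. The sketch below decodes only the first 16 characters of the string above; the helper name read_sfnt_header is made up for this example.

```python
import base64
import struct

def read_sfnt_header(font_bytes: bytes):
    """Unpack the 12-byte sfnt header: version, numTables, searchRange, entrySelector, rangeShift."""
    return struct.unpack('>IHHHH', font_bytes[:12])

# The first 16 base64 characters of the embedded font decode to exactly 12 bytes
header_bytes = base64.b64decode('AAEAAAALAIAAAwAw')
version, num_tables, search_range, entry_selector, range_shift = read_sfnt_header(header_bytes)

# 0x00010000 marks TrueType outlines; OpenType/CFF fonts start with b'OTTO' instead
print(hex(version), num_tables)
```

A version of 0x00010000 means the data holds TrueType outlines even though the code saves it as text.otf. fontTools identifies the format from the file contents, not from the extension.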
Convert the font bytes to an XML file:
from io import BytesIO
from fontTools.ttLib import TTFont

def convert_font_to_xml(bin_data):
    # TTFont expects a path or a file-like object, so wrap the bytes in BytesIO
    font = TTFont(BytesIO(bin_data))
    font.saveXML("text.xml")

# Get the mapping
font = TTFont(BytesIO(make_font_file(base64_srt)))
unilist = font.getGlyphOrder()      # list of glyph names
c = font['cmap'].tables[0].cmap     # dict: Unicode code point -> glyph name
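In the font's XML file, the section shown in the screenshot (only partly captured) tells us that glyph00000 carries no meaning, glyph00001 corresponds to 0, glyph00002 corresponds to 1, and so on.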
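The same mapping can also be read from the XML dump with the standard library. The fragment below is a hand-written, abbreviated stand-in in the layout fontTools uses for cmap tables, not the actual text.xml; its three entries are illustrative values chosen to match the 4300 example, and parse_cmap_xml is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written stand-in for the <cmap> part of text.xml (illustrative values)
sample_xml = """
<ttFont>
  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x993c" name="glyph00005"/>
      <map code="0x9ea3" name="glyph00001"/>
      <map code="0x9f92" name="glyph00004"/>
    </cmap_format_4>
  </cmap>
</ttFont>
"""

def parse_cmap_xml(xml_text: str) -> dict:
    """Collect {code point: glyph name} from every <map> element in the document."""
    root = ET.fromstring(xml_text.strip())
    return {int(m.get('code'), 16): m.get('name') for m in root.iter('map')}

cmap = parse_cmap_xml(sample_xml)
print(cmap)
```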
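Look up the Unicode code point of each garbled character from the page to get its glyph name, and from that the digit it stands for:
def get_num(string):
    ret_list = []
    for char in string:
        decode_num = ord(char)      # Unicode code point of the garbled character
        num = c[decode_num]         # glyph name, e.g. 'glyph00001'
        num = int(num[-2:]) - 1     # glyph00001 -> 0, glyph00002 -> 1, ...
        ret_list.append(num)
    return ret_list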
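The lookup can be tried without a font file by substituting a plain dictionary for the cmap. The entries below are illustrative, picked so that the garbled string from this post decodes to 4300; glyph_to_digit and decode_price are hypothetical names. The sketch also parses the whole numeric suffix of the glyph name instead of only its last two characters.

```python
import re

# Illustrative stand-in for the font's cmap: code point -> glyph name
demo_cmap = {
    ord('餼'): 'glyph00005',
    ord('龒'): 'glyph00004',
    ord('麣'): 'glyph00001',
}

def glyph_to_digit(glyph_name: str) -> int:
    """glyph00001 -> 0, glyph00002 -> 1, ... (glyph00000 carries no digit)."""
    match = re.fullmatch(r'glyph(\d+)', glyph_name)
    if match is None or int(match.group(1)) == 0:
        raise ValueError(f'{glyph_name!r} does not represent a digit')
    return int(match.group(1)) - 1

def decode_price(text: str, cmap: dict) -> str:
    """Translate each encrypted character to its digit; keep other characters unchanged."""
    return ''.join(
        str(glyph_to_digit(cmap[ord(ch)])) if ord(ch) in cmap else ch
        for ch in text
    )

print(decode_price('餼龒麣麣', demo_cmap))  # prints 4300
```

Passing unmapped characters through unchanged keeps surrounding text, such as a currency unit, intact when a whole field is decoded at once.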
The cleaned-up code:
price_code = '餼龒麣麣'
base64_srt = 'AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TL...'
import base64
from io import BytesIO
from fontTools.ttLib import TTFont

class ParseNum:
    def __init__(self, base64_srt):
        font = TTFont(BytesIO(self.make_font_file(base64_srt)))
        # unilist = font.getGlyphOrder()
        # cmap: Unicode code point -> glyph name
        self.c = font['cmap'].tables[0].cmap

    def make_font_file(self, base64_string: str):
        bin_data = base64.decodebytes(base64_string.encode())
        # Return the binary font data
        return bin_data

    def get_num(self, string):
        ret_str = ''
        for char in string:
            decode_num = ord(char)
            num = self.c[decode_num]
            num = int(num[-2:]) - 1
            ret_str += str(num)
        return ret_str

parse = ParseNum(base64_srt)
money = parse.get_num(price_code)
print(money)