Jandan.net (煎蛋網) crawler: reverse-engineering the JS that resolves img paths
阿新 • Published: 2019-01-15
The images are loaded via a JS onload event:
<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>
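As a quick sanity check, the base64 hash can be pulled straight out of this markup with lxml, using the same //*[@class="img-hash"] XPath the full crawler below relies on. A minimal sketch:

from lxml import etree

# sample markup copied from the page above
html = ('<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />'
        '<span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>')
doc = etree.HTML(html)
print(doc.xpath('//*[@class="img-hash"]/text()'))
# ['Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=']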
In the Sources panel, locate the corresponding JS function jandan_load_img.
Stepping through with the JS debugger shows that the hash Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc= is passed into the function jdugRtgCtw78dflFjGXBvN6TBHAoKvZ7xu, where a base64_decode call turns it into the img path.
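The decoding step is easy to reproduce in Python; a quick check with the hash from the sample markup (plain base64, nothing else):

import base64

img_hash = 'Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc='
# b64decode returns bytes; .decode() gives the protocol-relative URL
print(base64.b64decode(img_hash).decode())
# //wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg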
A regular expression then replaces the mw\d+ size segment in the img path with large, which points at the full-size image.
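Continuing with the path decoded above, a one-line re.sub performs the same substitution the crawler uses:

import re

img_path = 'http://wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg'
print(re.sub(r'mw\d+', 'large', img_path))
# http://wx1.sinaimg.cn/large/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg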
The crawler code is as follows:
import base64
import re
import requests
from concurrent.futures import ThreadPoolExecutor
from random import choice
from lxml import etree
from user_agent_list import USER_AGENTS

# pick a random User-Agent from the local user_agent_list module
headers = {'user-agent': choice(USER_AGENTS)}


def fetch_url(url):
    '''
    :param url: page URL
    :return: html text
    '''
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        if r.status_code in [200, 201]:
            return r.text
    except Exception as e:
        print(e)


def downloadone(url):
    html = fetch_url(url)
    if not html:  # request failed, skip this page
        return
    data = etree.HTML(html)
    img_hash_list = data.xpath('//*[@class="img-hash"]/text()')
    for img_hash in img_hash_list:
        # base64-decode the hash and prepend the scheme
        img_path = 'http:' + bytes.decode(base64.b64decode(img_hash))
        # swap the mwXXX size segment for the full-size "large" version
        img_path = re.sub(r'mw\d+', 'large', img_path)
        img_name = img_path.rsplit('/', 1)[1]
        with open('jiandan/' + img_name, 'wb') as f:
            r = requests.get(img_path)
            f.write(r.content)


def main():
    url_list = []
    for _ in range(1, 44):
        url = 'http://jandan.net/ooxx/page-{}'.format(_)
        url_list.append(url)
    # download pages concurrently with 4 worker threads
    with ThreadPoolExecutor(4) as executor:
        executor.map(downloadone, url_list)


if __name__ == '__main__':
    main()