Getting the JavaScript window object from Python
阿新 • Published: 2020-12-06
Python dependencies
    pip install PyExecJS
    pip install lxml
    pip install beautifulsoup4
    pip install requests
Node.js dependencies
Global install command:
    npm install jsdom -g
or
    yarn global add jsdom
Once jsdom is installed, the following code runs correctly in Node:
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
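With jsdom installed globally, the code above works when run directly in node. When the same script is run from Python through execjs, however, the global install alone is not enough.
execjs fails with the following error, reporting that jsdom cannot be found: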
execjs._exceptions.ProgramError: Error: Cannot find module 'jsdom'
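There are two ways to fix this:
1. Install jsdom with npm in the directory from which the Python script is run.
2. Pass the cwd argument to point at the directory that contains the module. For example, with jsdom installed globally, running npm root -g in cmd prints the global module path: C:\Users\w001\AppData\Roaming\npm\node_modules
The code can then be written like this: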
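    import execjs

    # 'your_script.js' stands for the path of the .js file to execute
    with open(r'your_script.js', 'r', encoding='utf-8') as f:
        js = f.read()

    # cwd points at the global node_modules directory so require("jsdom") resolves
    ct = execjs.compile(js, cwd=r'C:\Users\w001\AppData\Roaming\npm\node_modules')
    print(ct.call('Rohr_Opt.reload', '1'))
    # eval must be called on the compiled context, not on the source string
    print(ct.eval("window.pageData"))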
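Instead of hard-coding the path, the global module directory could be looked up at run time by calling npm root -g from Python. The following is only a sketch: the helper name global_node_modules is made up for illustration, and it assumes npm is on PATH (it returns None otherwise).

```python
import shutil
import subprocess


def global_node_modules():
    """Return the directory printed by `npm root -g`, or None if it cannot be found."""
    npm = shutil.which("npm")  # also resolves npm.cmd on Windows
    if npm is None:
        return None
    try:
        result = subprocess.run(
            [npm, "root", "-g"],
            capture_output=True,
            text=True,
            errors="replace",  # never fail on an unexpected byte in the path
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    path = result.stdout.strip()
    return path or None


node_modules = global_node_modules()
print(node_modules)
# Assumed usage with execjs (not executed here):
#     ct = execjs.compile(js, cwd=node_modules)
```

If the function returns None, falling back to the first solution (a local npm install next to the Python script) is probably the simplest option.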
A Python crawler example
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # @Author: Irving Shi

    import execjs
    import json
    import requests
    from bs4 import BeautifulSoup

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
    }


    def get_company(key):
        res = requests.get("https://aiqicha.baidu.com/s?q=" + key, headers=headers)
        soup = BeautifulSoup(res.text, features="lxml")
        # The third <script> tag holds the code that assigns window.pageData
        tag = soup.find_all("script")[2].decode_contents()
        # Prepend a jsdom bootstrap so that window and document exist in Node
        tag = """const jsdom = require("jsdom");
        const { JSDOM } = jsdom;
        const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
        window = dom.window;
        document = window.document;
        XMLHttpRequest = window.XMLHttpRequest;
        """ + tag
        js = execjs.compile(tag, cwd=r'C:\Users\Administrator\AppData\Roaming\npm\node_modules')
        res = js.eval("window.pageData").get("result").get("resultList")[0]
        return res


    res = get_company("91360000158304717T")
    # for i in res.items():
    #     print(i)
    pid = res.get("pid")
    r = requests.get("https://aiqicha.baidu.com/detail/basicAllDataAjax?pid=" + pid, headers=headers)
    data = json.loads(r.text).get("data").get("basicData")
    for i in data.items():
        print(i)
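The crawler picks the script by position (index 2), which would stop working if the page gained or lost a script tag. One possible alternative is to select the script by its content. The sketch below uses only the standard library's html.parser. The names ScriptCollector and find_script_containing are hypothetical, and the marker string "window.pageData" is taken from the example above.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last entry
        if self._in_script:
            self.scripts[-1] += data


def find_script_containing(html, marker="window.pageData"):
    """Return the body of the first <script> that contains marker, else None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for body in collector.scripts:
        if marker in body:
            return body
    return None


sample = (
    "<html><head><script src='a.js'></script><script>var x = 1;</script></head>"
    "<body><script>window.pageData = {\"result\": 1};</script></body></html>"
)
print(find_script_containing(sample))
```

In the crawler, the returned string would take the place of soup.find_all("script")[2].decode_contents(); a None result signals that the page layout has changed.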
Running JavaScript through Python's execjs can raise this error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 41: illegal multibyte sequence
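The cause is an encoding mismatch: on a Windows system whose default code page is gbk, the output of the Node subprocess is typically decoded as gbk even though it is UTF-8 (searching for the error message gives more detail). The direct fix used here is to edit subprocess.py and change the default value of the encoding parameter in the constructor __init__ of the Popen class to utf-8.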
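The mismatch can be reproduced without execjs or Node. This minimal sketch assumes only that the child process emits UTF-8 bytes; the helper try_decode is made up for illustration.

```python
# Bytes as a UTF-8 producer would emit them for one Chinese character
raw = "\u4e2d".encode("utf-8")  # three bytes: b'\xe4\xb8\xad'


def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


as_gbk = try_decode(raw, "gbk")      # gbk reads bytes in pairs; the third byte is left dangling
as_utf8 = try_decode(raw, "utf-8")   # the codec the bytes were written with

print("gbk:", ascii(as_gbk))
print("utf-8:", ascii(as_utf8))
```

Decoding succeeds only with the codec the bytes were written in, which is why the encoding used by Popen matters here.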
Before:
    _child_created = False  # Set here since __del__ checks it

    def __init__(self, args, bufsize=-1, executable=None,
                 stdin=None, stdout=None, stderr=None,
                 preexec_fn=None, close_fds=_PLATFORM_DEFAULT_CLOSE_FDS,
                 shell=False, cwd=None, env=None, universal_newlines=False,
                 startupinfo=None, creationflags=0,
                 restore_signals=True, start_new_session=False,
                 pass_fds=(), *, encoding=None, errors=None):
        """Create new Popen instance."""
        _cleanup()
        # Held while anything is calling waitpid before returncode has been
        # updated to prevent clobbering returncode if wait() or poll() are
        # called from multiple threads at once.  After acquiring the lock,
        # code must re-check self.returncode to see if another thread just
        # finished a waitpid() call.
        self._waitpid_lock = threading.Lock()
After:
    _child_created = False  # Set here since __del__ checks it

    def __init__(self, args, bufsize=-1, executable=None,
                 stdin=None, stdout=None, stderr=None,
                 preexec_fn=None, close_fds=_PLATFORM_DEFAULT_CLOSE_FDS,
                 shell=False, cwd=None, env=None, universal_newlines=False,
                 startupinfo=None, creationflags=0,
                 restore_signals=True, start_new_session=False,
                 pass_fds=(), *, encoding="utf-8", errors=None):
        """Create new Popen instance."""
        _cleanup()
        # Held while anything is calling waitpid before returncode has been
        # updated to prevent clobbering returncode if wait() or poll() are
        # called from multiple threads at once.  After acquiring the lock,
        # code must re-check self.returncode to see if another thread just
        # finished a waitpid() call.
        self._waitpid_lock = threading.Lock()
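Because this change edits Python's own source code, it is best tried in a virtual environment (venv). Note that a virtual environment usually shares the standard library, including subprocess.py, with the base interpreter, so the edit may still affect other projects. virtualenv can be installed with: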
pip install virtualenv
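A less invasive option than editing subprocess.py might be to override the default at run time, inside the script itself. This is a different technique from the one described above, shown only as a sketch. The class name Utf8Popen is hypothetical, and the approach assumes that execjs looks up Popen from the subprocess module when it is imported, so the replacement has to happen before import execjs.

```python
import subprocess
import sys


class Utf8Popen(subprocess.Popen):
    """Popen that decodes text-mode pipes as UTF-8 unless an encoding is given."""

    def __init__(self, *args, **kwargs):
        text_mode = kwargs.get("universal_newlines") or kwargs.get("text")
        # Only touch callers that asked for text; binary pipes stay binary
        if text_mode and kwargs.get("encoding") is None:
            kwargs["encoding"] = "utf-8"
        super().__init__(*args, **kwargs)


# Replace the class for this process only; the file on disk is unchanged.
subprocess.Popen = Utf8Popen
# Assumed usage (not executed here): import execjs only after this line.
#     import execjs

# Self-check: a child process writes raw UTF-8 bytes, the parent reads text.
child_code = "import sys; sys.stdout.buffer.write(b'\\xe4\\xbd\\xa0\\xe5\\xa5\\xbd')"
proc = Utf8Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)
out, _ = proc.communicate()
print(ascii(out))
```

The override lasts only as long as the running process, so other programs that use the same interpreter keep the stock behaviour.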