20170820_Fetching a site's message-board entries in real time with Python
阿新 • Published: 2017-08-20
At first the login kept failing, and the scraped page content was all mojibake; only after converting with str(info, encoding='utf-8') did things improve.
The main tools are requests and bs4. The biggest problem was that the target site uses gb2312 encoding; Python 3 handles encodings much better than Python 2, but it was still a hassle.
My first version simulated login with a cookie, but that is clumsy in practice: you have to log in to the target site yourself, then copy the cookie into the code... Laziness is
the prime mover!
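The gb2312 headache boils down to one fact: bytes produced by one codec are not generally readable by another. A small sketch with hypothetical text (not the site's data):

```python
# Why a gb2312 page can't just be treated as UTF-8: the byte sequences differ.
s = "用户名"                      # hypothetical simplified-Chinese text
gb = s.encode("gb2312")           # the bytes a gb2312 site sends and expects
assert gb.decode("gb2312") == s   # the matching codec round-trips cleanly
try:
    gb.decode("utf-8")            # the wrong codec
except UnicodeDecodeError:
    print("gb2312 bytes are not valid UTF-8")
```

Note that gb2312 only covers simplified Chinese; traditional characters will raise UnicodeEncodeError on the encode side.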
I planned to use Firefox's HttpFox to capture the target site's POST data and URL, but Firefox had auto-updated to 55.x and the add-on only supports up to 35.x. Switching to Chrome, I found that
the site submits the POST by opening a new page, and pressing F12 in that new page is too late to see the POST. After some searching I learned that DevTools can be set to open automatically for new tabs, as shown:
With that I could see exactly what data the site POSTs, and started simulating the POST with requests. But every attempt failed:
I was never actually logged in, and the site kept prompting for the username and password.
Then it hit me!
My POST data was UTF-8, while the target site reads the POST as gb2312, so it simply couldn't understand it! After encoding the username (which is Chinese!) with username.encode("gb2312"), the login went through. I then
used a requests session to keep the cookies and stay logged in, and once a minute compare the newest message id against the saved id to decide whether to scrape.
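To see why that one encode call matters, here is a stdlib-only sketch of the form body the server actually receives; the field names are copied from the login form below, the credentials are hypothetical:

```python
from urllib.parse import urlencode

# When a form value is already bytes, urlencode percent-encodes those exact
# bytes, so pre-encoding the username as gb2312 controls what the server sees.
form = {"web_top_two2$txtName": "用户名".encode("gb2312"),  # gb2312 bytes
        "web_top_two2$txtPass": "password"}
body = urlencode(form)
print(body)  # the username field is the percent-encoded gb2312 bytes
```

Had the string been passed through unencoded, the library would have defaulted to UTF-8 bytes, which is exactly what the gb2312 server could not read.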
#-*-coding:utf-8-*-  # encoding declaration
import requests, re, time, os
from bs4 import BeautifulSoup
from time import strftime

LOGIN_URL = 'http://www.3456.tv/Default.aspx'  # login URL
username = '用戶名'   # replace with the real (simplified-Chinese) username
password = 'password'
# Login form data; the Chinese username must be gb2312-encoded or login fails
DATA = {"web_top_two2$txtName": username.encode("gb2312"),
        "web_top_two2$txtPass": password,
        '__VIEWSTATE': '/wEPDwULLTEyNzc4MjM2OTBkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRh3ZWJfdG9wX3R3bzIkaW1nQnRuTG9naW6/pqbjQqV358GfYjdoiOK+Ek4VWA==',
        '__EVENTVALIDATION': '/wEWBAL3y5PLCgLHgt+5BgL3r9v/CgLX77PND5R1XxTeGn4lXvBDrb6OdRyc4Xlk',
        'web_top_two2$imgBtnLogin.x': '22',
        'web_top_two2$imgBtnLogin.y': '8'}
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}  # browser to impersonate

S = requests.Session()                                 # the session keeps the login cookies
login = S.post(LOGIN_URL, data=DATA, headers=HEADERS)  # simulated login

def getData(num):
    # fetch the raw bytes of the message list page
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    return res.content

def getLast():
    # return the id of the newest message on the list page
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')
    tb = soup.find_all('tr', style='text-align:center;')
    for tag in tb:
        see = tag.find('a', attrs={'class': 'see'})
        seestr = see['onclick']
        seenum = re.sub(r"\D", "", seestr)  # keep only the digits of the id
        break                               # the first row is the newest message
    return seenum

def isNew():
    newlastid = getLast()
    with open('lastid.txt') as txt:
        last = txt.read()
    if int(newlastid) != int(last):
        print('Current time: ' + strftime("%H-%M") + ', new messages found, fetching!')
        getNewuser()
    else:
        print('Current time: ' + strftime("%H-%M") + ', no new messages yet')

def getNewuser():
    # fetch every message newer than the saved id and dump it to a file
    url = 'http://www.3456.tv/user/list_proxyall.html'
    res = S.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')
    tb = soup.find_all('tr', style='text-align:center;')
    with open('lastid.txt') as txt:
        last = txt.read()
    userinfo = ''
    for tag in tb:
        see = tag.find('a', attrs={'class': 'see'})
        seestr = see['onclick']
        seenum = re.sub(r"\D", "", seestr)
        if int(seenum) == int(last):
            break  # everything below this row was fetched last time
        userinfo += str(seeInfo(int(seenum)), encoding="utf-8") + '\n'
    userfilename = strftime("%H-%M") + '.txt'
    with open(userfilename, 'w') as f:
        f.write(str(userinfo))
    os.system(userfilename)  # open the dump with its associated program (Windows)
    with open('lastid.txt', 'w') as f2:
        f2.write(str(getLast()))
    print('Fetch finished, current time: ' + strftime("%H-%M") + ', next run in 60 seconds')

def seeInfo(id):
    # fetch the detail page of a single message
    url = 'http://www.3456.tv/user/protel.html'
    info = {'id': id}
    res = S.get(url, params=info)  # the id goes in the query string
    return res.content

setsleep = 60  # polling interval in seconds
print('this time is today first time start?')
firststr = input('input yes or no and press enter: ')
if firststr == 'yes':
    print('Fetching...')
    lastid = getLast()
    with open('lastid.txt', 'w') as f:
        f.write(str(lastid))
    print('Current time: ' + strftime("%H:%M") + ', newest message id is ' + lastid)
    print(str(setsleep) + ' seconds until the next run')
else:
    print(str(setsleep) + ' seconds until the next run')

while 1:
    isNew()
    time.sleep(int(setsleep))