Crawling URLs by keyword with Python
阿新 • Published: 2018-10-21
Python web crawler --------- search Baidu with a keyword, then crawl the URLs of the search results
Development environment: Windows 7 + Python 3.6.3
Language: Python
IDE: PyCharm
Third-party package: lxml 4.0 must be installed. Installing a bare lxml alone caused errors for me, because the script also needs etree from lxml.
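To see why etree matters, here is a minimal sketch of the parsing step the crawler relies on: feed an HTML string to `etree.HTML` and pull out a link with XPath. The HTML fragment below is made up for illustration; it mimics the `id`-numbered result blocks on a Baidu results page.

```python
from lxml import etree

# Hypothetical HTML fragment shaped like one Baidu search result block.
html = '<html><body><div id="1"><h3><a href="http://example.com">hit</a></h3></div></body></html>'

# Parse the page and extract the first link's href via XPath,
# mirroring the selector logic used in the crawler below.
selector = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
hrefs = selector.xpath('//*[@id="1"]/h3/a[1]/@href')
print(hrefs)
```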
Enough talk, here is the code:
The crawled data is saved in TXT format; later I will try saving it to an Excel sheet or a database instead.
import requests, time
from lxml import etree


def redirect(url):
    """Follow redirects so Baidu's jump link resolves to the real URL."""
    try:
        res = requests.get(url, timeout=10)
        url = res.url
    except Exception as e:
        print('redirect failed:', e)
        time.sleep(1)
    return url


def baidu_search(wd, pn_max, save_file_name):
    url = 'http://www.baidu.com/s'
    return_set = set()

    for page in range(pn_max):
        pn = page * 10  # Baidu pages by multiples of 10
        querystring = {'wd': wd, 'pn': pn}
        headers = {
            'pragma': 'no-cache',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.8',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'cache-control': 'no-cache',
            'connection': 'keep-alive',
        }

        try:
            response = requests.get(url, headers=headers, params=querystring)
            print(response.url)
            selector = etree.HTML(response.text, parser=etree.HTMLParser(encoding='utf-8'))
        except Exception as e:
            print('page load failed:', e)
            continue

        with open(save_file_name, 'a+', encoding='utf-8') as f:
            # Each organic result on the page has a numeric id of pn+1 .. pn+10
            for i in range(1, 10):
                try:
                    context = selector.xpath('//*[@id="' + str(pn + i) + '"]/h3/a[1]/@href')
                    print(len(context), context[0])
                    real_url = redirect(context[0])
                    print('context=' + context[0])
                    print('real_url=' + real_url)
                    f.write(real_url + '\n')
                    return_set.add(real_url)
                except Exception as e:
                    print(i, return_set)
                    print('result parse failed:', e)

    return return_set


if __name__ == '__main__':
    wd = '網絡貸款'
    pn = 100
    save_file_name = 'save_url_soup.txt'
    return_set = baidu_search(wd, pn, save_file_name)
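As for the planned Excel export, one minimal sketch is to write the collected URL set to a CSV file, which Excel opens directly. This is my own assumption about the next step, not code from the original post; `urls` stands in for the set that `baidu_search` returns, and the filename `save_url.csv` is made up.

```python
import csv

# Placeholder for the set returned by baidu_search() (hypothetical data).
urls = {'http://example.com/a', 'http://example.com/b'}

# utf-8-sig adds a BOM so Excel detects the encoding correctly.
with open('save_url.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['url'])          # header row
    for u in sorted(urls):            # sort for a stable file order
        writer.writerow([u])
```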