Python Web Crawler Example
阿新 • Published: 2018-12-15
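A Python crawler is a Python script that fetches web pages and applies a specific algorithm to extract the content you need.
The following example crawls jokes from Qiushibaike (糗事百科). The code is as follows: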
import urllib.request
import re


def jokeCrawler(url):
    # Send a browser-like User-Agent so the request looks like a normal visit.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    }
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    html = response.read().decode("utf-8")
    # HTML = str(response.read())

    # Capture each joke block: from the author div up to the vote counter.
    pat = r'<div class="author clearfix">(.*?)<span class="stats-vote"><i class="number">'
    re_joke = re.compile(pat, re.S)
    divList = re_joke.findall(html)
    # print(divList)
    # print(len(divList))

    dic = {}
    for div in divList:
        # The username is the text inside <h2>...</h2>.
        re_u = re.compile(r'<h2>(.*?)</h2>', re.S)
        username = re_u.findall(div)
        # print(type(username))
        username = username[0]
        # print(username)

        # The joke text is the first <span> inside the content div.
        re_d = re.compile(r'<div class="content">\n<span>(.*?)</span>', re.S)
        duanzi = re_d.findall(div)
        # print(type(username))
        duanzi = duanzi[0]
        # print(duanzi)

        dic[username] = duanzi
    return dic
    # with open(r"D:\pythonPro\star\pacong\file\file3.html", "w", encoding='utf-8') as f:
    #     f.write(HTML)


url = "https://www.qiushibaike.com/text/page/2/"
info = jokeCrawler(url)
for k, v in info.items():
    # "說" means "says"
    print(k + "說\n" + v)
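The parsing step can also be tried without network access. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up fragment that imitates the structure the three patterns expect, and `parse_jokes` is a hypothetical helper name that is not part of the original script. Unlike the original loop, it skips a block when a pattern does not match instead of indexing `[0]` on an empty list.

```python
import re

# Hypothetical HTML fragment, invented for illustration only.
SAMPLE_HTML = (
    '<div class="author clearfix">\n'
    '<h2>\nuser_a\n</h2>\n'
    '</div>\n'
    '<div class="content">\n'
    '<span>\nfirst joke\n</span>\n'
    '</div>\n'
    '<span class="stats-vote"><i class="number">12</i></span>\n'
    '<div class="author clearfix">\n'
    '<h2>\nuser_b\n</h2>\n'
    '</div>\n'
    '<div class="content">\n'
    '<span>\nsecond joke\n</span>\n'
    '</div>\n'
    '<span class="stats-vote"><i class="number">7</i></span>\n'
)

# The same three patterns as in the crawler, compiled once up front.
BLOCK_RE = re.compile(
    r'<div class="author clearfix">(.*?)<span class="stats-vote"><i class="number">',
    re.S,
)
USER_RE = re.compile(r'<h2>(.*?)</h2>', re.S)
JOKE_RE = re.compile(r'<div class="content">\n<span>(.*?)</span>', re.S)


def parse_jokes(html):
    """Return {username: joke} for every block where both patterns match."""
    result = {}
    for block in BLOCK_RE.findall(html):
        user = USER_RE.search(block)
        joke = JOKE_RE.search(block)
        if user is None or joke is None:
            # Unexpected layout: skip the block rather than raise an error.
            continue
        result[user.group(1).strip()] = joke.group(1).strip()
    return result


jokes = parse_jokes(SAMPLE_HTML)
for name, joke in jokes.items():
    print(name + ": " + joke)
```

Because the result is a dictionary keyed by username, a second joke from the same user replaces the first one. Collecting `(username, joke)` tuples in a list would be one way to keep both.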
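In the patterns above, the expression .* matches any single character repeated any number of times and consumes as many characters as possible; this is greedy matching. The expression .*? consumes as few characters as possible while still letting the rest of the pattern match; this is minimal (non-greedy) matching. The re.S flag lets . match newline characters as well.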
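The difference between the two forms is easiest to see side by side. The sketch below uses made-up sample strings (`text` and `multiline` are assumptions for illustration, not data from the site) and also shows the effect of the `re.S` flag.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "<h2>Alice</h2><h2>Bob</h2>"

# Greedy: .* runs to the LAST </h2>, so both names end up in one match.
greedy = re.findall(r"<h2>(.*)</h2>", text)

# Non-greedy: .*? stops at the FIRST </h2>, giving one match per name.
lazy = re.findall(r"<h2>(.*?)</h2>", text)

print(greedy)  # ['Alice</h2><h2>Bob']
print(lazy)    # ['Alice', 'Bob']

# Without re.S, "." does not match a newline, so a tag whose content
# spans several lines is not found at all.
multiline = "<h2>\nCarol\n</h2>"
no_flag = re.findall(r"<h2>(.*?)</h2>", multiline)
with_flag = re.findall(r"<h2>(.*?)</h2>", multiline, re.S)

print(no_flag)    # []
print(with_flag)  # ['\nCarol\n']
```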