A Simple Baidu Tieba Scraping Example

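Approach: use a loop to fetch n pages (n can be any number) of page source, then save each page to a file.
The script has three functions:

  • get_html fetches a page
  • save_html saves the fetched page source to a file
  • main is the entry point

The code is as follows: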
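from urllib.request import urlopen, Request
from urllib.parse import urlencode


def get_html(url):
    # Send a browser-like User-Agent so the request resembles one from a regular browser.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
    }
    request = Request(url, headers=headers)
    response = urlopen(request)
    info = response.read()  # raw bytes of the page
    print(info.decode())    # echo the page source, decoded as UTF-8
    return info


def save_html(filename, html_bytes):
    # Write the bytes unchanged in binary mode, so no re-encoding is needed.
    with open(filename, "wb") as f:
        f.write(html_bytes)


def main():
    content = input("Tieba keyword to download: ")
    num = input("Number of pages to download: ")
    # The "&" before the placeholder separates ie=utf-8 from the encoded arguments.
    base_url = "http://tieba.baidu.com/f?ie=utf-8&{}"
    for pn in range(int(num)):
        args = {
            "pn": pn * 50,  # offset that advances by 50 for each page
            "kw": content
        }
        filename = "page_" + str(pn + 1) + ".html"
        url = base_url.format(urlencode(args))
        print("Downloading " + filename)
        html_bytes = get_html(url)
        save_html(filename, html_bytes)


if __name__ == '__main__':
    main()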
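
The URL-building step can be checked on its own, without touching the network. The following is a minimal sketch, not part of the original script: the helper names build_page_url and build_page_urls and the constants BASE_URL and PAGE_STEP are hypothetical, introduced here only for illustration. It assumes the same endpoint and the same step of 50 per page as the script. Putting every parameter, including ie, into the dictionary lets urlencode insert all the "&" separators and percent-encode non-ASCII keywords.

```python
from urllib.parse import urlencode

# Assumptions taken from the script above: the endpoint and the step of 50 per page.
BASE_URL = "http://tieba.baidu.com/f"
PAGE_STEP = 50


def build_page_url(keyword, page_index):
    """Hypothetical helper: build the URL for a zero-based page index."""
    args = {
        "ie": "utf-8",
        "kw": keyword,
        "pn": page_index * PAGE_STEP,
    }
    # urlencode joins the pairs with "&" and percent-encodes the keyword as UTF-8.
    return BASE_URL + "?" + urlencode(args)


def build_page_urls(keyword, num_pages):
    """Hypothetical helper: build the URLs for the first num_pages pages."""
    return [build_page_url(keyword, i) for i in range(num_pages)]


if __name__ == "__main__":
    # Print the URLs that a three-page download would request.
    for url in build_page_urls("python", 3):
        print(url)
```

Printing the URLs before downloading anything is a cheap way to confirm that the query string is well-formed; each URL could then be passed to get_html as in the script.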