A Simple Example of Scraping Baidu Tieba
阿新 • Published: 2018-11-10
Approach: use a loop to fetch n pages of HTML source (n can be any number), then save each page to its own file.
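The paging part of this approach can be tried without any network access. The sketch below only builds the URLs that the loop would request. The helper name build_page_urls is hypothetical (it is not part of the original script), and it assumes, as the script does, that the pn query parameter advances by 50 per page and that kw carries the search keyword.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/f"


def build_page_urls(keyword, pages):
    """Return one listing URL per page (hypothetical helper, for illustration)."""
    urls = []
    for page_index in range(pages):
        # urlencode percent-encodes the values and joins the pairs with "&".
        query = urlencode({
            "ie": "utf-8",
            "kw": keyword,
            "pn": page_index * 50,  # assumed step of 50, as in the script
        })
        urls.append(BASE_URL + "?" + query)
    return urls


if __name__ == "__main__":
    for url in build_page_urls("python", 3):
        print(url)
```

Putting every parameter into the dictionary passed to urlencode, instead of appending to a partly written query string, means the "&" separators are generated rather than typed by hand.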
There are three functions in total:
- get_html fetches a page
- save_html saves the fetched page source to a file
- main is the entry point
The code is as follows:
from urllib.request import urlopen, Request
from urllib.parse import urlencode


def get_html(url):
    # Send a browser-like User-Agent header with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
    }
    request = Request(url, headers=headers)
    response = urlopen(request)
    info = response.read()
    # Print the page decoded as UTF-8 (the default) for a quick check.
    print(info.decode())
    return info


def save_html(filename, html_bytes):
    # Binary mode writes the downloaded bytes unchanged.
    with open(filename, "wb") as f:
        f.write(html_bytes)
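

def main():
    content = input("Content to download: ")
    num = input("Number of pages to download: ")
    # The "&" separates ie=utf-8 from the parameters appended by format().
    base_url = "http://tieba.baidu.com/f?ie=utf-8&{}"
    for pn in range(int(num)):
        # The pn query parameter advances by 50 for each page.
        args = {
            "pn": pn * 50,
            "kw": content
        }
        filename = "page_" + str(pn + 1) + ".html"
        url = base_url.format(urlencode(args))
        print("Downloading " + filename)
        html_bytes = get_html(url)
        save_html(filename, html_bytes)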


if __name__ == '__main__':
    main()
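
The script above stops with a traceback as soon as one request fails. A minimal sketch of a more forgiving fetch step is shown below. This is a variation, not the original author's code: the name fetch_html, the 10-second timeout, and the shortened User-Agent string are all assumptions chosen for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Shortened, illustrative User-Agent (assumption, not the one from the script).
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    try:
        # The with-block closes the response once the body has been read.
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # The request could not be completed (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    except OSError as err:
        # For example, a timeout while reading the response body.
        print(f"I/O error for {url}: {err}")
    return None
```

In a loop like the one in main, a None result can be used to skip saving that page and move on to the next one. HTTPError is caught before URLError because it is a subclass of it.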