A Brief Look at Python Web Crawlers
阿新 • Published: 2017-09-30
Introduction to Python web crawlers:
Sometimes we need to copy all the images on a web page. Doing it by hand usually means right-clicking each image and choosing Save picture as ...
A Python web crawler can copy every image in one pass.
The steps are as follows:
1. Fetch the HTML of the page to be crawled.
2. Store and process the fetched HTML:
   save the raw HTML
   split it into a list of fragments
   use a regular expression to pick out the picture links
3. Save each picture to a local file based on its link.
The main difficulties:
getting familiar with urllib,
matching the image links with a regular expression (see the short example below).
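As a quick illustration of the regex part, the snippet below pulls .jpg links straight out of an HTML string with re.findall. It is only a minimal sketch; the sample string and the pattern are illustrative and are not taken from the full script, which uses a split-then-match approach instead.

import re

# Minimal sketch: the sample HTML and the pattern are illustrative only.
sample = '<img src="https://example.com/a.jpg"><img src="https://example.com/b.jpg">'
links = re.findall(r'https://[^"]+?\.jpg', sample)
print(links)  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']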
The code is as follows:
import urllib.request
import os
import re


def getHtml(url):
    # Fetch the html of the target page.
    page = urllib.request.urlopen(url)
    html = page.read()
    return html


def write(html, htmlfile):
    # Write the raw html into a file (e.g. html.txt).
    try:
        f = open(htmlfile, mode='w')
        f.writelines(str(html))
        f.close()
    except TypeError:
        print("write html file failed")


def getImg2(html, initialFile, finalFile):
    reg = '"'  # split the html string on " and write the pieces into re.txt
    imgre1 = re.compile(reg)
    imglist = re.split(imgre1, str(html))
    f1 = open(initialFile, mode='w')
    for index in imglist:
        f1.write("\n")
        f1.write(index)
    f1.close()
    reg2 = "^https.*jpg"  # keep lines that start with "https" and end in "jpg"
    imgre2 = re.compile(reg2)
    f2 = open(initialFile, mode='r')
    f3 = open(finalFile, mode='w')
    tempre = f2.readlines()
    for index in tempre:
        temp = re.match(imgre2, index)
        if temp is not None:
            f3.write(index)
    f2.close()
    f3.close()


def saveImg2(imagefile):
    # Download every image link listed in the file, one link per line.
    f_imglist2 = open(imagefile, mode='r')
    templist = f_imglist2.readlines()
    f_imglist2.close()
    x = 0
    for index in templist:
        # strip the trailing newline before handing the URL to urlretrieve
        urllib.request.urlretrieve(index.strip(), '%s.jpg' % x)
        x = x + 1


html = "https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%BA%FB%B5%FB&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111"
htmlfile = "D:\\New\\html.txt"
SplitFile = "D:\\New\\re.txt"
imgefile = "D:\\New\\imglist.txt"

html = getHtml(html)
print("get html complete!")
write(html, htmlfile)  # store the raw html (step 2)
getImg2(html, SplitFile, imgefile)
print("get Image link list complete!")
saveImg2(imgefile)
print("Save Image complete!")
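For comparison, the whole flow can also be condensed into a single function that skips the intermediate text files. This is only a sketch of the same idea (fetch, regex-match, download); the function name save_all_jpgs, the UTF-8 decode, and the output file names are assumptions, not part of the original script.

import re
import urllib.request

def save_all_jpgs(url):
    # Sketch only: fetch the page, collect every https...jpg link, download each one.
    # Assumes the page decodes as UTF-8; undecodable bytes are ignored.
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
    links = re.findall(r'https://[^"]+?\.jpg', html)
    for x, link in enumerate(links):
        urllib.request.urlretrieve(link, '%s.jpg' % x)

Writing the fragments to re.txt and the link list to imglist.txt, as the original script does, makes each step easy to inspect; the condensed version trades that visibility for fewer moving parts.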