Crawling every URL under a website with Python 3
阿新 • Published: 2020-01-17
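Getting the element information on the home page:
Target test_URL: http://www.xxx.com.cn/
Start by inspecting the page elements. The links we want to crawl sit inside the a tags, so we take the selector path that leads to them and use it to locate the information we need.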
soup = Bs4(response.text, "lxml")
urls_li = soup.select("#mainmenu_top > div > div > ul > li")
Getting the URL links on the home page:
The code that collects the URL links from the home page is as follows:
# headers and head_url are defined in the complete listing at the end
def get_first_url():
    list_href = []
    response = requests.get("http://www.xxx.com.cn", headers=headers)
    soup = Bs4(response.text, "lxml")
    # every <li> of the top navigation menu
    urls_li = soup.select("#mainmenu_top > div > div > ul > li")
    for url_li in urls_li:
        urls = url_li.select("a")
        for url in urls:
            url_href = url.get("href")
            if url_href:  # skip <a> tags that have no href
                list_href.append(head_url + url_href)
    out_url = list(set(list_href))  # drop duplicates
    for reg in out_url:
        print(reg)
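The listing above builds each link as head_url + url_href, which assumes every href is a path starting with "/". A minimal sketch of a more forgiving alternative uses urljoin from the standard library. The helper name resolve_links and the sample href values are made up for illustration and are not part of the original code:

```python
from urllib.parse import urljoin

head_url = "http://www.xxx.com.cn"  # placeholder host used throughout the article


def resolve_links(base, hrefs):
    """Resolve raw href values against base, skipping missing or empty ones."""
    resolved = []
    for href in hrefs:
        if not href:  # <a> tags without an href yield None
            continue
        # urljoin handles site-relative, page-relative and absolute hrefs alike
        resolved.append(urljoin(base + "/", href))
    return resolved


# Hypothetical href values of the kinds a real page might contain
sample = ["/news/", "about.html", "http://www.xxx.com.cn/contact", None]
links = resolve_links(head_url, sample)
for link in links:
    print(link)
```

With this approach an href that is already absolute is passed through unchanged instead of being glued onto the host a second time.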
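Iterating over the first batch of results:
Starting from the URLs collected in step two, request each page in turn, collect the URL links found on it, and filter out the ones we do not need.
The code is as follows: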
def get_next_url(urllist):
    url_list = []
    for url in urllist:
        response = requests.get(url, headers=headers)
        soup = Bs4(response.text, "lxml")
        for url2 in soup.find_all("a"):
            url2_1 = url2.get("href")
            if not url2_1:
                continue  # skip <a> tags that have no href
            if url2_1[0] == "/":
                # site-relative path: prepend the host
                url_list.append(head_url + url2_1)
            elif url2_1.startswith(head_url):
                # already an absolute URL on the target site
                # (the original compared url2_1[0:24], but the prefix is only 21 characters long)
                url_list.append(url2_1)
            # everything else (external links, mailto: and so on) is dropped
    url_list2 = set(url_list)  # drop duplicates
    for url_ in url_list2:
        res = requests.get(url_, headers=headers)
        if res.status_code == 200:
            print(url_)
    print(len(url_list2))
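The filter above decides whether a link belongs to the site by looking at the leading characters of the URL string, which would also accept a look-alike host such as http://www.xxx.com.cn.example.org. As a hedged sketch, the host can instead be compared after parsing the URL with urlparse from the standard library. is_same_site is a hypothetical helper and the candidate URLs are invented for the example:

```python
from urllib.parse import urlparse

head_url = "http://www.xxx.com.cn"  # placeholder host used throughout the article


def is_same_site(url, base=head_url):
    """Return True when url is an http(s) link whose host equals the host of base."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.netloc == urlparse(base).netloc


# Invented examples: one real match, one look-alike host, two non-HTTP links
candidates = [
    "http://www.xxx.com.cn/news/1.html",
    "http://www.xxx.com.cn.example.org/fake",
    "mailto:someone@xxx.com.cn",
    "javascript:void(0)",
]
kept = [u for u in candidates if is_same_site(u)]
print(kept)
```

Relative paths have no host, so they would need to be resolved to absolute URLs before this check is applied.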
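Recursive traversal:
To crawl every URL, get_next_url() calls itself at the end of its body, passing in the set of URLs it has just collected. On its own this call never stops, so the function also needs a record of the URLs already visited and a check for an empty result before recursing. The recursive call is: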
get_next_url(url_list2)
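A function that keeps calling itself on its own output only finishes if something remembers which pages have already been visited. The following is a minimal sketch of that idea, written as a loop over a queue rather than as recursion, so that Python's recursion limit is not a concern. The names crawl and get_links, the max_pages cap, and the in-memory site dictionary are all hypothetical stand-ins that let the example run without network access:

```python
from collections import deque


def crawl(start_urls, get_links, max_pages=1000):
    """Breadth-first traversal; get_links(url) returns the URLs found on that page."""
    visited = set()
    queue = deque(start_urls)
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # already processed, do not fetch it again
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return visited


# Toy link graph with a cycle (/a -> /b -> /a) standing in for real pages
site = {
    "/a": ["/b", "/c"],
    "/b": ["/a"],
    "/c": ["/d"],
    "/d": [],
}
found = crawl(["/a"], lambda url: site.get(url, []))
print(sorted(found))
```

In a real crawler, get_links would wrap the requests.get and soup.find_all("a") steps of get_next_url().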
The complete code is as follows:
import requests
from bs4 import BeautifulSoup as Bs4

head_url = "http://www.xxx.com.cn"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
# URLs already requested; added so that the recursion below can terminate
visited = set()


def get_first_url():
    list_href = []
    response = requests.get(head_url, headers=headers)
    soup = Bs4(response.text, "lxml")
    urls_li = soup.select("#mainmenu_top > div > div > ul > li")
    for url_li in urls_li:
        urls = url_li.select("a")
        for url in urls:
            url_href = url.get("href")
            if url_href:  # skip <a> tags that have no href
                list_href.append(head_url + url_href)
    out_url = list(set(list_href))
    return out_url


def get_next_url(urllist):
    url_list = []
    for url in urllist:
        if url in visited:
            continue
        visited.add(url)
        response = requests.get(url, headers=headers)
        soup = Bs4(response.text, "lxml")
        for url2 in soup.find_all("a"):
            url2_1 = url2.get("href")
            if not url2_1:
                continue
            if url2_1[0] == "/":
                # site-relative path: prepend the host
                url_list.append(head_url + url2_1)
            elif url2_1.startswith(head_url):
                # already an absolute URL on the target site
                url_list.append(url2_1)
    # keep only the URLs that have not been crawled yet
    url_list2 = set(url_list) - visited
    for url_ in url_list2:
        res = requests.get(url_, headers=headers)
        if res.status_code == 200:
            print(url_)
    print(len(url_list2))
    if url_list2:  # stop recursing once no new URLs turn up
        get_next_url(url_list2)


if __name__ == "__main__":
    urllist = get_first_url()
    get_next_url(urllist)