
Parsing data with XPath (scraping the names of cities nationwide)

Tags: python, xpath, html, data analysis

Target site: https://www.aqistudy.cn/historydata/

# Date: 2020/12/27 22:00
# IDE: PyCharm
# Author: Friday
# URL: https://www.aqistudy.cn/historydata/
import requests
from lxml import etree

if __name__ == "__main__":
    # Send browser-like headers so the request resembles a normal page visit
    headers = {
        'Referer': 'http://pic.netbian.com/4kmeinv/index_2.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    # Parse the HTML text into an element tree that supports XPath queries
    tree = etree.HTML(page_text)

    # Method 1: query the two groups separately
    # # Popular cities
    # host_city_list = tree.xpath('//div[@class="bottom"]/ul/li')
    # host_name_list = []
    # for li in host_city_list:
    #     host_name = li.xpath('./a/text()')[0]
    #     host_name_list.append(host_name)
    # # print(host_name_list)
    #
    # # All cities, variant 1: walk every ul, then each div inside it
    # # all_city_list = []
    # # all_city_ul_list = tree.xpath('//div[@class="bottom"]/ul')
    # # for ul in all_city_ul_list:
    # #     get_li_list = ul.xpath('./div/li')
    # #     for li in get_li_list:
    # #         name = li.xpath('./a/text()')[0]
    # #         host_name_list.append(name)
    #
    # # All cities, variant 2: select the li tags directly
    # # all_city_li = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
    # # for li in all_city_li:
    # #     name = li.xpath('./a/text()')[0]
    # #     host_name_list.append(name)
    # print(host_name_list)
    # print(len(host_name_list))

    # Method 2: select the a tags of both groups with one union (|) expression
    a_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    all_city_names = []
    for a in a_list:
        city_name = a.xpath('./text()')[0]
        all_city_names.append(city_name)
    print(all_city_names)
    print(len(all_city_names))

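Summary: Looking at the page's HTML structure, the obvious approach is to run two separate XPath queries, one for the li tags under "Popular cities" and one for those under "All cities". On closer inspection this can be optimized further: every city name we want sits inside an a tag, so a single XPath expression joined with the | union operator can select the a tags for both groups at once, and they can then be processed together.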
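
To see how the union expression behaves without hitting the network, here is a minimal offline sketch. The SAMPLE_HTML markup is hypothetical: it only imitates the ul/li and ul/div[2]/li nesting that the XPath expressions in the script assume, and the real page may differ. The helper name extract_city_names is likewise just an illustration, not part of the original script.

```python
from lxml import etree

# Hypothetical markup imitating the structure the XPath expressions assume:
# popular cities sit in ul/li, all cities sit in ul/div[2]/li.
SAMPLE_HTML = """
<div class="bottom">
  <ul>
    <li><a href="#">Beijing</a></li>
    <li><a href="#">Shanghai</a></li>
  </ul>
</div>
<div class="bottom">
  <ul>
    <div><b>A</b></div>
    <div>
      <li><a href="#">Anshan</a></li>
      <li><a href="#">Beijing</a></li>
    </div>
  </ul>
</div>
"""

# Same union expression as in Method 2 of the script.
CITY_LINK_XPATH = (
    '//div[@class="bottom"]/ul/li/a'
    ' | //div[@class="bottom"]/ul/div[2]/li/a'
)


def extract_city_names(page_text):
    """Return the text of every a tag matched by the union expression."""
    tree = etree.HTML(page_text)
    a_list = tree.xpath(CITY_LINK_XPATH)
    # Skip links without text and trim surrounding whitespace.
    return [a.text.strip() for a in a_list if a.text]


names = extract_city_names(SAMPLE_HTML)
# dict.fromkeys drops repeated names while keeping first-seen order.
unique_names = list(dict.fromkeys(names))

print(names)         # ['Beijing', 'Shanghai', 'Anshan', 'Beijing']
print(unique_names)  # ['Beijing', 'Shanghai', 'Anshan']
```

The union returns matched nodes in document order and removes duplicate nodes, not duplicate text. A city that is listed under both groups therefore shows up twice, which is why the sketch deduplicates the names afterwards.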