Parsing Data with XPath (Scraping City Names Nationwide)
阿新 • Published: 2020-12-28
Tags: python, xpath, html, data analysis
# Created: 2020/12/27 22:00
# IDE: PyCharm
# Author: Friday
# URL: https://www.aqistudy.cn/historydata/
import requests
from lxml import etree

if __name__ == "__main__":
    headers = {
        # This Referer points to a different site and is probably not needed here.
        'Referer': 'http://pic.netbian.com/4kmeinv/index_2.html',
        # The header name must be 'User-Agent'; 'user_agent' is not a recognized header.
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    # Fetch the page and parse the HTML into an element tree
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    # Method 1: query popular cities and all cities separately
    # # Popular cities
    # host_city_list = tree.xpath('//div[@class="bottom"]/ul/li')
    # host_name_list = []
    # for li in host_city_list:
    #     host_name = li.xpath('./a/text()')[0]
    #     host_name_list.append(host_name)
    # # print(host_name_list)
    #
    # # All cities, option 1: walk each ul, then the li tags inside its div children
    # # all_city_list = []
    # # all_city_ul_list = tree.xpath('//div[@class="bottom"]/ul')
    # # for ul in all_city_ul_list:
    # #     get_li_list = ul.xpath('./div/li')
    # #     for li in get_li_list:
    # #         name = li.xpath('./a/text()')[0]
    # #         host_name_list.append(name)
    # # All cities, option 2: select the li tags directly
    # # all_city_li = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
    # # for li in all_city_li:
    # #     name = li.xpath('./a/text()')[0]
    # #     host_name_list.append(name)
    # print(host_name_list)
    # print(len(host_name_list))
    # Method 2: select the a tags of both sections with a single XPath union (|)
    a_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    all_city_names = []
    for a in a_list:
        city_name = a.xpath('./text()')[0]
        all_city_names.append(city_name)
    print(all_city_names)
    print(len(all_city_names))
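Summary: Looking at the page's HTML structure, the most obvious approach is to run two separate XPath queries, one for the li tags under "Popular cities" and one for those under "All cities". On closer inspection this can be simplified: every city name we want sits inside an a tag, so a single XPath expression using the union operator (|) can select the a tags from both sections at once, and they can then be processed in one loop.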
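The union approach can be tried offline with a minimal sketch like the one below. SAMPLE_HTML is hypothetical markup that only imitates the structure described above (a list of popular cities, followed by a list whose second div child holds the li tags), and extract_city_names is an illustrative helper name, not part of the original script. The real page may be structured differently.

```python
from lxml import etree

# Hypothetical markup imitating the structure the XPath expressions expect;
# the real page may differ.
SAMPLE_HTML = """
<html><body>
<div class="bottom">
  <ul>
    <li><a href="#">Beijing</a></li>
    <li><a href="#">Shanghai</a></li>
  </ul>
</div>
<div class="bottom">
  <ul>
    <div><b>A</b></div>
    <div>
      <li><a href="#">Anshan</a></li>
      <li><a href="#">Anyang</a></li>
    </div>
  </ul>
</div>
</body></html>
"""


def extract_city_names(html_text):
    """Return the text of every matching a tag from both sections."""
    tree = etree.HTML(html_text)
    # The | operator returns the union of both node sets.
    a_list = tree.xpath(
        '//div[@class="bottom"]/ul/li/a'
        ' | //div[@class="bottom"]/ul/div[2]/li/a'
    )
    # Skip anchors with no text instead of raising IndexError.
    return [a.text.strip() for a in a_list if a.text and a.text.strip()]


city_names = extract_city_names(SAMPLE_HTML)
print(city_names)
print(len(city_names))
```

Reading a.text and skipping empty anchors is a small design choice: it avoids the IndexError that indexing [0] into an empty text() result would raise if the page contained an a tag without text.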