Crawling China's Administrative Division Codes (Excluding Hong Kong, Macao, and Taiwan)
While validating whether ID card numbers follow the rules, one of the checks is to verify that the first six digits correspond to an administrative division code that actually exists, so I went to the National Bureau of Statistics website, http://www.stats.gov.cn/, for the data. The most recent dataset, published on January 31, 2019, is at http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/index.html. The overall approach is to fetch the province, city, county/district, town, and village information level by level, using the links captured at each level to reach the next level down. Open the main page and analyze it with the browser's developer tools (F12 brings them up).
Analyzing the page structure:
a. Province information sits in <tr> elements with class="provincetr"; the fields to extract are the name and the link.
b. The city, county/district, and town pages are almost identical apart from the class name: the rows are <tr> elements with class="citytr", "countytr", and "towntr" respectively, and the fields to extract are the name, the link, and the division code. Note that some divisions, such as municipal districts (市轄區), have no lower level and need a special check. Most divisions follow the city → county/district → town order, but a few have no county/district level at all (apparently only 中山市 and 東莞市 in Guangdong and 儋州市 in Hainan) and go straight from city to town, which also needs special handling.
c. Village information sits in <tr> elements with class="villagetr"; the fields to extract are the name, the urban-rural classification code, and the division code (a small parsing sketch follows this list).
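To make the row layout in item c concrete, here is a minimal parsing sketch run against a hand-written sample row; the sample HTML and the values in it are illustrative, not taken from the site:

from bs4 import BeautifulSoup

# One illustrative village-level row; the real pages contain many such rows.
sample_html = '<table><tr class="villagetr"><td>110101001001</td><td>111</td><td>SampleVillage</td></tr></table>'
soup = BeautifulSoup(sample_html, 'lxml')
for row in soup.find_all("tr", class_="villagetr"):
    cells = row.find_all("td")
    code, class_code, name = (c.get_text() for c in cells)
    print(code, class_code, name)  # division code, urban-rural classification code, name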
All of the information above can be reached through the link captured at the previous level, so the plan is to first save one level's data to CSV and then use that data to fetch the next level. Provinces, cities, and counties/districts are few enough to fetch directly; towns and villages are far more numerous, so I added proxies, paused for a random interval during crawling, and crawled them in stages. Below is the code for fetching the province, city, and county/district information.
import urllib.request
import requests
import csv
import random
from urllib import parse
from bs4 import BeautifulSoup
import time

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
head = {'User-Agent': random.choice(user_agent_list)}


def get_province(url, file_str):
    # Parse the province page and collect (name, link) tuples such as
    # ("北京市", "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/11.html")
    province_list = []
    my_html = requests.get(url, headers=head)
    my_html.encoding = 'GB2312'
    my_soup = BeautifulSoup(my_html.text, 'lxml')
    my_tr = my_soup.find_all("tr", class_="provincetr")
    for my_td in my_tr:
        my_a = my_td.find_all("td")
        for my_href in my_a:
            my_url = parse.urljoin(url, my_href.a["href"])
            province_list.append((my_href.a.get_text(), my_url))
    with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
        write = csv.writer(f)
        for province_item in province_list:
            write.writerow([1, province_item[0], province_item[1]])


def get_info(url, class_str, file_str, *upper_name_list):
    # Fetch city, county/district or town info. url is the page to fetch, class_str is the
    # class of the <tr> rows to extract, upper_name_list is the variable-length parent path.
    if url == "":
        return
    info_list = []
    head = {'User-Agent': random.choice(user_agent_list)}
    my_html = requests.get(url, headers=head)
    time.sleep(random.random())  # pause for a random 0~1 seconds
    my_html.encoding = 'GB2312'
    my_soup = BeautifulSoup(my_html.text, 'lxml')
    my_tr = my_soup.find_all("tr", class_=class_str)  # extract the rows with class class_str
    for my_td in my_tr:
        if my_td.find("td").a:  # some rows have links, some do not
            my_href = my_td.find("td").a["href"]
            my_href = parse.urljoin(url, my_href)
            my_code = my_td.find("td").a.get_text()
            my_name = my_td.find("td").next_sibling.a.get_text()
            info_list.append((my_name, my_code, my_href))
        else:
            my_href = ""
            my_code = my_td.find("td").get_text()
            my_name = my_td.find("td").next_sibling.get_text()
            info_list.append((my_name, my_code, my_href))
    with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
        write = csv.writer(f)
        for info_item in info_list:
            write.writerow([len(upper_name_list) + 1, ] + list(upper_name_list) + [info_item[0], info_item[1], info_item[2]])
    return 1


if __name__ == '__main__':
    base_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/index.html"  # entry page
    get_province(base_url, "province_code.csv")  # fetch province info

    province_list = []  # fetch city info from the province links
    with open('province_code.csv', 'r') as f:
        read = csv.reader(f)
        for province in read:
            province_list.append(province)
    for province in province_list:
        print(province)
        city = get_info(province[2], "citytr", "city_code.csv", province[1])

    city_list = []  # fetch county/district info from the city links
    with open('city_code.csv', 'r') as f:
        read = csv.reader(f)
        for city in read:
            city_list.append(city)
    for city in city_list:
        print(city)
        town = get_info(city[4], "countytr", "county_code.csv", city[1], city[2])
The province, city, and county/district data obtained this way looks like this:
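Concretely, each row that get_province writes to province_code.csv has the form (1, province name, link); city_code.csv rows have the form (2, province, city name, division code, link); and county_code.csv rows have the form (3, province, city, county/district name, division code, link). The code comes from the first cell of each row, and the link column is left empty for divisions such as municipal districts that have no lower-level page.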
While fetching the county/district information, some requests fail partway through, for several reasons: "Max retries exceeded with url", the server refusing the connection, or my own network dropping for a moment now and then.
After adding IP proxies and switching to requests.session, a few test runs showed fewer failures when fetching the county/district data, but some requests still failed. Since there are a few dozen times more villages than counties/districts, I settled on a simple strategy: whenever a fetch fails, pause for a while and retry until it succeeds. The code looks roughly like this:
try:  # fetch the page; if it fails, wait 10 s and then fetch it again
    my_html = s.get(url, headers=head, proxies=proxy)
except:
    print("#############################################################")
    time.sleep(10)
    return get_info(url, class_str, file_str, *upper_name_list)
else:
    ……
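One thing to note is that this retry has no upper bound, so a page that can never be fetched would recurse forever. A minimal alternative sketch with a bounded retry loop; the fetch_page helper and its max_retries parameter are my own illustration, not part of the original script:

import time
import requests

def fetch_page(session, url, head, proxy, max_retries=5):
    # Try the request up to max_retries times, sleeping 10 s between attempts;
    # return None if every attempt fails so the caller can log the URL for a later pass.
    for attempt in range(max_retries):
        try:
            return session.get(url, headers=head, proxies=proxy, timeout=30)
        except requests.exceptions.RequestException:
            print("request failed, attempt %d of %d" % (attempt + 1, max_retries))
            time.sleep(10)
    return None

get_info and get_village could call such a helper and simply record the failed URL when it returns None, instead of recursing.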
The final code is as follows:
import urllib.request
import requests
import csv
import random
from urllib import parse
from bs4 import BeautifulSoup
import time

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
user_ip_list = [
    "http://171.41.80.142:9999",
    "http://171.41.80.231:9999",
    "http://112.85.172.58:9999",
    "http://111.79.199.161:9999",
    "http://110.52.235.184:9999",
    "http://110.52.235.198:9999",
    "http://122.193.244.244:9999",
    "http://223.241.78.26:8010",
    "http://110.52.235.54:9999",
    "http://116.209.53.214:9999",
    "http://112.85.130.221:9999",
    "http://60.190.250.120:8080",
    "http://183.148.151.218:9999",
    "http://183.63.101.62:53281",
    "http://112.85.164.249:9999",
]
requests.adapters.DEFAULT_RETRIES = 5  # number of reconnection attempts
head = {'User-Agent': random.choice(user_agent_list)}
proxy = {'http': random.choice(user_ip_list)}  # requests expects the proxy keyed by scheme


def get_province(url, file_str):
    # Parse the province page and collect (name, link) tuples such as
    # ("北京市", "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/11.html")
    province_list = []
    my_html = requests.get(url, headers=head)
    my_html.encoding = 'GB2312'
    my_soup = BeautifulSoup(my_html.text, 'lxml')
    my_tr = my_soup.find_all("tr", class_="provincetr")
    for my_td in my_tr:
        my_a = my_td.find_all("td")
        for my_href in my_a:
            my_url = parse.urljoin(url, my_href.a["href"])
            province_list.append((my_href.a.get_text(), my_url))
    with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
        write = csv.writer(f)
        for province_item in province_list:
            write.writerow([1, province_item[0], province_item[1]])


def get_info(url, class_str, file_str, *upper_name_list):
    # Fetch city, county/district or town info. url is the page to fetch, class_str is the
    # class of the <tr> rows to extract, upper_name_list is the variable-length parent path.
    if url == "":
        return
    info_list = []
    head = {'User-Agent': random.choice(user_agent_list)}
    proxy = {'http': random.choice(user_ip_list)}
    s = requests.session()
    s.keep_alive = False  # drop surplus connections
    try:  # fetch the page; if it fails, wait 10 s and then fetch it again
        my_html = s.get(url, headers=head, proxies=proxy)
    except:
        print("#############################################################")
        time.sleep(10)
        return get_info(url, class_str, file_str, *upper_name_list)
    else:
        my_html.encoding = 'GB2312'
        my_soup = BeautifulSoup(my_html.text, 'lxml')
        my_tr = my_soup.find_all("tr", class_=class_str)  # extract the rows with class class_str
        for my_td in my_tr:
            if my_td.find("td").a:  # some rows have links, some do not
                my_href = my_td.find("td").a["href"]
                my_href = parse.urljoin(url, my_href)
                my_code = my_td.find("td").a.get_text()
                my_name = my_td.find("td").next_sibling.a.get_text()
                info_list.append((my_name, my_code, my_href))
            else:
                my_href = ""
                my_code = my_td.find("td").get_text()
                my_name = my_td.find("td").next_sibling.get_text()
                info_list.append((my_name, my_code, my_href))
        with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
            write = csv.writer(f)
            for info_item in info_list:
                write.writerow([len(upper_name_list) + 1, ] + list(upper_name_list) + [info_item[0], info_item[1], info_item[2]])
        return 1


def get_village(url, file_str, *upper_name_list):
    # Fetch village info. The village page differs noticeably from the city/county/town
    # pages (no link column, plus an urban-rural classification code), so it gets its own function.
    if url == "":
        return
    village_list = []
    head = {'User-Agent': random.choice(user_agent_list)}
    proxy = {'http': random.choice(user_ip_list)}
    s = requests.session()
    s.keep_alive = False  # drop surplus connections
    try:  # fetch the page; if it fails, wait 10 s and then fetch it again
        my_html = s.get(url, headers=head, proxies=proxy)
    except:
        print("#############################################################")
        time.sleep(10)
        return get_village(url, file_str, *upper_name_list)
    else:
        my_html.encoding = 'GB2312'
        my_soup = BeautifulSoup(my_html.text, 'lxml')
        my_tr = my_soup.find_all("tr", class_="villagetr")  # extract the rows with class villagetr
        for my_td in my_tr:
            my_code = my_td.find("td").get_text()
            my_class_code = my_td.find("td").next_sibling.get_text()
            my_name = my_td.find("td").next_sibling.next_sibling.get_text()
            village_list.append((my_name, my_class_code, my_code))
        with open(file_str, 'a', newline='', encoding='gb2312', errors='ignore') as f:
            write = csv.writer(f)
            for village_item in village_list:
                write.writerow([len(upper_name_list) + 1, ] + list(upper_name_list) + [village_item[0], village_item[1], village_item[2]])
        return 1


if __name__ == '__main__':
    base_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/index.html"  # entry page
    get_province(base_url, "province_code.csv")  # fetch province info

    province_list = []  # fetch city info from the province links
    with open('province_code.csv', 'r') as f:
        read = csv.reader(f)
        for province in read:
            province_list.append(province)
    for province in province_list:
        print(province)
        city = get_info(province[2], "citytr", "city_code.csv", province[1])

    city_list = []  # fetch county/district info from the city links
    with open('city_code.csv', 'r') as f:
        read = csv.reader(f)
        for city in read:
            city_list.append(city)
    for city in city_list:
        print(city)
        town = get_info(city[4], "countytr", "county_code.csv", city[1], city[2])

    # Fetch town info directly from the city links: most divisions go city -> county/district -> town,
    # but a few cities (apparently only 中山市 and 東莞市 in Guangdong and 儋州市 in Hainan)
    # have no county/district level and go straight to towns.
    city_list = []
    with open('city_code.csv', 'r') as f:
        read = csv.reader(f)
        for city in read:
            city_list.append(city)
    for city in city_list:
        print(city)
        town = get_info(city[4], "towntr", "town_code_1.csv", city[1], city[2], "")

    county_list = []  # fetch town info from the county/district links
    with open('county_code.csv', 'r') as f:
        read = csv.reader(f)
        for county in read:
            county_list.append(county)
    for county in county_list:
        print(county)
        town = get_info(county[5], "towntr", "town_code_2.csv", county[1], county[2], county[3])

    town_list = []  # fetch village info from the town links
    with open('town_code_1.csv', 'r') as f:
        read = csv.reader(f)
        for town in read:
            town_list.append(town)
    with open('town_code_2.csv', 'r') as f:
        read = csv.reader(f)
        for town in read:
            town_list.append(town)
    for town in town_list:
        print(town)
        # the url is in column 7, i.e. town[6]; town[1]~town[4] are the parent division names
        village = get_village(town[6], "village_code.csv", town[1], town[2], town[3], town[4])
In the end I obtained the data for all administrative villages, roughly 640,000 records:
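Coming back to the original motivation, the first six digits of an ID card number can now be checked against the crawled codes. A minimal sketch, assuming the county_code.csv layout produced above (level, province, city, county/district name, 12-digit code, link) and that the first six digits of the 12-digit county-level code are the prefix used in ID card numbers; the load_prefixes / check_id_prefix names and the sample number are my own illustration:

import csv

def load_prefixes(path="county_code.csv"):
    # Collect the 6-digit prefixes of the county-level codes in the fifth column.
    prefixes = set()
    with open(path, 'r', encoding='gb2312', errors='ignore') as f:
        for row in csv.reader(f):
            if len(row) >= 5 and row[4].isdigit():
                prefixes.add(row[4][:6])
    return prefixes

def check_id_prefix(id_number, prefixes):
    # The first six digits of an ID card number should be an existing county-level division code.
    return id_number[:6] in prefixes

prefixes = load_prefixes()
print(check_id_prefix("110101199001011234", prefixes))  # illustrative ID number, not a real one

Note that older ID numbers can carry prefixes of divisions that have since been merged or renamed, so a miss against the current-year list is not by itself proof that a number is invalid.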