Scraping the 2020 Fortune China 500 Ranking Data: Let's See Who Made the List

Preface

This is a simple Python web-scraping demo, actually an old script from my backlog. I have not written a crawler in a long, long time, so my already shaky skills have only gotten rustier!

It scrapes the 2020 Fortune China 500 ranking data from the Fortune China website (fortunechina.com). All of the data sits directly in the page source with a fairly clean structure, so a single page request is basically enough to grab everything. A friend asked for the data to use in some data analysis.

It is a good site for beginners to practice on. The demo below is provided for reference and learning purposes only.

Using the fake_useragent library locally

This comes from a share on the 吾愛破解 (52pojie) forum (ID: jxt441621944). fake_useragent is a library I use fairly often; it is convenient, though I would not call it great, so use it as you see fit. Here I have tidied up and packaged a way to use fake_useragent's UA list locally.

A User-Agent is a string that acts as the browser's identity card. Frequently rotating the User-Agent in your request headers while crawling helps avoid triggering anti-scraping mechanisms (it works even better combined with proxy IPs).

fake_useragent is a library that returns a random User-Agent string.
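For comparison, this is roughly how the library itself is used, a minimal sketch assuming fake_useragent is installed (older versions pull their UA data over the network on first run, which is one reason a local copy of the list is handy):

from fake_useragent import UserAgent

ua = UserAgent()   # builds the UA pool (may hit the network or a local cache on first use)
print(ua.random)   # a random User-Agent string
print(ua.chrome)   # a random Chrome User-Agent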

The packaged UA list contains 250 User-Agent strings in total.

UA list file: fake_ua.txt

Python caller script: fakeua.py

import random

# load the local UA list, one User-Agent per line
with open('fake_ua.txt', 'r') as f:
    fake_ua = [fua.strip() for fua in f.readlines()]

# print a randomly chosen User-Agent
print(random.choice(fake_ua))
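Inside a crawler, the randomly chosen UA goes straight into the request headers. A minimal sketch under that assumption (the commented proxy address is only a placeholder, not something from the original post):

import random
import requests

# load the local UA list and build a request header with a random User-Agent
with open('fake_ua.txt', 'r') as f:
    fake_ua = [fua.strip() for fua in f.readlines()]

headers = {'User-Agent': random.choice(fake_ua)}
# proxies = {'http': 'http://127.0.0.1:8888'}  # placeholder proxy, pair with real proxy IPs if needed
resp = requests.get('http://www.fortunechina.com/fortune500/c/2020-07/27/content_369925.htm',
                    headers=headers, timeout=5)
print(resp.status_code)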

Python crawler demo: scraping the 2020 Fortune China 500 ranking data

import requests, random
from lxml import etree
import xlsxwriter


class Httprequest(object):
    ua_list = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    ]

    @property  # decorator that turns the method into a read-only attribute
    def random_headers(self):
        return {
            'User-Agent': random.choice(self.ua_list)
        }


class Get_data(Httprequest):
    def __init__(self):
        # China 500 list page and US 500 list page
        self.url = "http://www.fortunechina.com/fortune500/c/2020-07/27/content_369925.htm"
        self.murl = "http://www.fortunechina.com/fortune500/c/2020-05/18/content_365275.htm"

    def get_data(self):
        # the whole table is in the page source, so one request is enough
        html = requests.get(self.url, headers=self.random_headers, timeout=5).content.decode('utf-8')
        req = etree.HTML(html)
        rankings = req.xpath('//table[@class="wt-table"]/tbody/tr/td[1]/text()')
        last_rankings = req.xpath('//table[@class="wt-table"]/tbody/tr/td[2]/text()')
        companys = req.xpath('//table[@class="wt-table"]/tbody/tr/td[3]/a/text()')
        incomes = req.xpath('//table[@class="wt-table"]/tbody/tr/td[4]/text()')
        profits = req.xpath('//table[@class="wt-table"]/tbody/tr/td[5]/text()')
        data_list = []
        for ranking, last_ranking, company, income, profit in zip(
                rankings, last_rankings, companys, incomes, profits):
            data = [ranking, last_ranking, company, income, profit]
            print(data)
            data_list.append(data)
        self.write_to_xlsx(data_list)

    def write_to_xlsx(self, data_list):
        # create an Excel file for the China 500 results
        workbook = xlsxwriter.Workbook('{}_search_results.xlsx'.format("2020年《財富》中國500強排行榜"))
        worksheet = workbook.add_worksheet("2020年《財富》中國500強排行榜")
        title = ['排名', '上年排名', '公司名稱(中文)', '營業收入(百萬元)', '利潤(百萬元)']  # header row
        worksheet.write_row('A1', title)
        for index, data in enumerate(data_list):
            row = 'A' + str(index + 2)  # data starts on row 2, below the header
            worksheet.write_row(row, data)
        workbook.close()

    def get_mdata(self):
        # the US 500 table has no "last year's rank" column, hence the different column indexes
        html = requests.get(self.murl, headers=self.random_headers, timeout=5).content.decode('utf-8')
        req = etree.HTML(html)
        rankings = req.xpath('//table[@class="wt-table"]/tbody/tr/td[1]/text()')
        companys = req.xpath('//table[@class="wt-table"]/tbody/tr/td[2]/a/text()')
        incomes = req.xpath('//table[@class="wt-table"]/tbody/tr/td[3]/text()')
        profits = req.xpath('//table[@class="wt-table"]/tbody/tr/td[4]/text()')
        data_list = []
        for ranking, company, income, profit in zip(rankings, companys, incomes, profits):
            data = [ranking, company, income, profit]
            print(data)
            data_list.append(data)
        self.write_to_mxlsx(data_list)

    def write_to_mxlsx(self, data_list):
        # create an Excel file for the US 500 results
        workbook = xlsxwriter.Workbook('{}_search_results.xlsx'.format("2020年《財富》美國500強排行榜"))
        worksheet = workbook.add_worksheet("2020年《財富》美國500強排行榜")
        title = ['排名', '公司名稱(中文)', '營業收入(百萬美元)', '利潤(百萬美元)']  # header row
        worksheet.write_row('A1', title)
        for index, data in enumerate(data_list):
            row = 'A' + str(index + 2)  # data starts on row 2, below the header
            worksheet.write_row(row, data)
        workbook.close()


if __name__ == "__main__":
    spider = Get_data()
    spider.get_data()   # China 500 list
    spider.get_mdata()  # US 500 list
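Because the rankings sit in a plain static wt-table, a quick cross-check of the scraped output (not part of the original demo) is to let pandas parse the same table. A minimal sketch, assuming pandas is installed alongside lxml:

import pandas as pd
import requests

url = "http://www.fortunechina.com/fortune500/c/2020-07/27/content_369925.htm"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

html = requests.get(url, headers=headers, timeout=5).content.decode('utf-8')
# read_html returns a list of DataFrames; filter on the table's class attribute
df = pd.read_html(html, attrs={'class': 'wt-table'})[0]
print(df.shape)  # should come out to roughly 500 rows by 5 columns
df.to_excel('fortune_china_500_check.xlsx', index=False)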