Scraping web page data with Python's BeautifulSoup

阿新 • Published: 2019-01-15

1. Environment: PyCharm, Python 3.4

2. Source code walkthrough

import requests
import re
from bs4 import BeautifulSoup

# Fetch the full page content with requests.get

def getHtmlText(url):
    try:
        r = requests.get(url)
        # Raise an exception if r.status_code is not a success code
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Return an empty string so that callers can detect the failure
        return ""

# [Figure: the content of the web page]

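# Parse the page: find_all collects every a tag and returns them as a list.
# The regular expression re.findall(r"[s][zh]\d{6}", href) then pulls codes such as sh201002 or sz201001 out of each href, and every code found is appended to a list.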
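To see what the pattern does on its own, here is a minimal sketch that uses only the re module. The href values are made up for illustration and are not taken from the real page.

```python
import re

# Hypothetical href values, invented for this illustration
hrefs = [
    "http://quote.eastmoney.com/sh201002.html",
    "http://quote.eastmoney.com/sz201001.html",
    "http://quote.eastmoney.com/center/",  # no stock code in this one
]

# "s", then "z" or "h", then exactly six digits
pattern = re.compile(r"[s][zh]\d{6}")

codes = []
for href in hrefs:
    match = pattern.search(href)
    if match:  # skip links that carry no code
        codes.append(match.group())

print(codes)
```

Note that [s] is a character class with a single member, so the pattern behaves the same as s[zh]\d{6}.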

def getHtmlList(lst, list_url):
    html = getHtmlText(list_url)
    soup = BeautifulSoup(html, "html.parser")
    a = soup.find_all('a')

    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][zh]\d{6}", href)[0])
        except (KeyError, IndexError):
            # Skip links that have no href or no stock code in it
            continue
    print("lst:", lst)

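# The next function takes the codes collected by the previous function, builds a second URL from each one, and fetches the data from that page.

The scraping process is the same. The difference is the call soup.find("div", attrs={'class': 'stock-bets'}), which looks for the div tag whose class is stock-bets. find returns only the first match, so .text or get_text() can be called directly on its result to get the text inside the tag. find_all(), like soup.select(), returns every match as a list.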
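The difference between find, find_all and select can be checked on a small inline document. This is only a sketch: the HTML fragment, the company name and the numbers are invented, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration; not copied from the real site
html = """
<div class="stock-bets">
  <a class="bets-name">ExampleCo (sh201002)</a>
  <dl><dt>Open</dt><dd>10.50</dd></dl>
  <dl><dt>High</dt><dd>10.80</dd></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None if nothing matches),
# so .text can be read from it directly
first_dt = soup.find("dt")
print(first_dt.text)

# find_all and select both return a list of every match
all_dt = soup.find_all("dt")
all_dd = soup.select("div.stock-bets dd")
print([t.text for t in all_dt])
print([t.get_text() for t in all_dd])
```

Because find may return None, it is worth checking the result before reading .text when the tag is not guaranteed to exist.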

def getHtmlInfo(lst, info_url, fpath):
    for code in lst:
        # Build a fresh URL for each code instead of appending to info_url
        url = info_url + code + ".html"
        html = getHtmlText(url)
        try:
            if html == "":
                continue
            soup = BeautifulSoup(html, "html.parser")
            betsInfo = soup.find("div", attrs={'class': 'stock-bets'})

            infoDict = {}
            name = betsInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({"name": name.text.split()[0]})

            # Next, collect every dt and dd tag with find_all, which returns a list.
            # Each dt holds a key and the matching dd holds its value; store the pairs in the dict.
            keylist = betsInfo.find_all("dt")
            valuelist = betsInfo.find_all("dd")

            for key, val in zip(keylist, valuelist):
                infoDict[key.text] = val.text

            # Write the data we want to the file
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except (AttributeError, IndexError):
            # The expected tags are missing from this page, so skip it
            continue
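Pairing two parallel lists into a dictionary can be tried in isolation. The sketch below uses plain strings in place of tag text (the labels and numbers are invented) and also shows a slip that is easy to make: quoting the variable names stores the literal strings 'key' and 'val' instead of their values.

```python
# Invented sample data standing in for the text of the key and value tags
keys = ["Open", "High", "Low"]
vals = ["10.50", "10.80", "10.20"]

# Quoting the names uses the literal strings, so every pass
# overwrites the same single entry
wrong = {}
for key, val in zip(keys, vals):
    wrong.update({'key': 'val'})

# Indexing with the variables stores one entry per pair
right = {}
for key, val in zip(keys, vals):
    right[key] = val

print(wrong)
print(right)
```

zip stops at the shorter of the two lists, so a page with mismatched key and value counts does not raise an IndexError.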
# Main function: call the scraping functions defined above
def main():
    list_url = "http://quote.eastmoney.com/stocklist.html"
    info_url = "https://gupiao.baidu.com/stock/"
    output_file = './stockInfo.txt'
    slist = []
    getHtmlList(slist, list_url)
    getHtmlInfo(slist, info_url, output_file)

main()
