Scraping Weibo content with Python and selenium

阿新 • Published: 2019-02-10

(1) Programming environment

Operating system: Win 10
Programming language: Python 3.6

(2) Installing selenium

The scraper is built on selenium.
If the selenium library for Python is not installed yet, install it with:

pip install selenium

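(3) Downloading ChromeDriver

selenium needs a driver for the browser it controls. I use Google Chrome here, so first download ChromeDriver.exe and put it in the C:\Program Files (x86)\Google\Chrome\Application\ directory.
Note that any other directory works too, as long as the code is given the correct path.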
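
Since the driver may live in any directory, it can help to resolve its path before starting the browser. The following is a minimal sketch, not part of the original program: `find_chromedriver` and `DEFAULT_DIR` are hypothetical names introduced here, and the default directory is simply the one mentioned above.

```python
import os
import shutil

# Hypothetical default location, taken from the directory mentioned above
DEFAULT_DIR = r"C:\Program Files (x86)\Google\Chrome\Application"


def find_chromedriver(extra_dirs=(), name="chromedriver.exe"):
    """Return the first existing driver path, or None if nothing is found."""
    # Check the caller's directories first, then the default one
    for folder in list(extra_dirs) + [DEFAULT_DIR]:
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate):
            return candidate
    # Fall back to searching the PATH environment variable
    return shutil.which(name)


if __name__ == "__main__":
    print(find_chromedriver())
```

The returned path can then be passed to `webdriver.Chrome(...)` in place of the hard-coded string.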

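(4) Logging in to Weibo

The mobile site (m site) usually has a much simpler page structure than the PC site, so we start from the mobile site. The login page of the Weibo mobile site is https://passport.weibo.cn/signin/login
Open this address in Chrome, right-click anywhere on the page and choose "View page source". The email/phone number field has the id loginName, the password field has the id loginPassword, and the login button has the id loginAction.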

from selenium import webdriver
import time

# Global variable
# Raw string, so the backslashes in the Windows path are taken literally
driver = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")


def loginWeibo(username, password):
    driver.get('https://passport.weibo.cn/signin/login')
    time.sleep(3)    # give the login page time to load

    # Selenium 3 style locators; Selenium 4 uses driver.find_element(By.ID, ...)
    driver.find_element_by_id("loginName").send_keys(username)
    driver.find_element_by_id("loginPassword").send_keys(password)
    driver.find_element_by_id("loginAction").click()

    #driver.close()

When it runs, you can watch the automatic login and the page shown after a successful login.

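(5) Scraping Weibo content

There are two ways to get content from Weibo:
1) Apply to become a Sina developer and call the Weibo API
2) Use a crawler
The Weibo API has many restrictions. For example, it only returns a user's 10 most recent posts and cannot return the full post history.

So here we use the crawler approach.
The program is as follows: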

from selenium import webdriver
import time
import re

# Global variable
# Raw string, so the backslashes in the Windows path are taken literally
driver = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")


def loginWeibo(username, password):
    driver.get('https://passport.weibo.cn/signin/login')
    time.sleep(3)    # give the login page time to load

    driver.find_element_by_id("loginName").send_keys(username)
    driver.find_element_by_id("loginPassword").send_keys(password)
    driver.find_element_by_id("loginAction").click()

    # Only for inspecting the cookies. The value is not used below,
    # because the driver sends its cookies along automatically.
    cookies = driver.get_cookies()
    cookie_list = []
    for item in cookies:    # avoid shadowing the built-in name dict
        cookie = item['name'] + '=' + item['value']
        cookie_list.append(cookie)
    cookie = ';'.join(cookie_list)
    print(cookie)

    #driver.close()


def visitUserInfo(userId):
    driver.get('http://weibo.cn/' + userId)

    print('********************')
    print('User profile')

    # 1. User id
    print('User id: ' + userId)

    # 2. Nickname (first token of the profile header)
    strName = driver.find_element_by_xpath("//div[@class='ut']")
    strlist = strName.text.split(' ')
    nickname = strlist[0]
    print('Nickname: ' + nickname)

    # 3. Number of posts, following and followers (in page order)
    strCnt = driver.find_element_by_xpath("//div[@class='tip2']")
    pattern = r"\d+\.?\d*"      # match numbers, both integers and decimals
    cntArr = re.findall(pattern, strCnt.text)
    print(strCnt.text)
    print("Posts: " + str(cntArr[0]))
    print("Following: " + str(cntArr[1]))
    print("Followers: " + str(cntArr[2]))

    print('\n********************')
    # 4. Write the user information to a file
    #    (text mode already converts '\n' to the platform line ending)
    with open("userinfo.txt", "w", encoding="gb18030") as file:
        file.write("User ID: " + userId + '\n')
        file.write("Nickname: " + nickname + '\n')
        file.write("Posts: " + str(cntArr[0]) + '\n')
        file.write("Following: " + str(cntArr[1]) + '\n')
        file.write("Followers: " + str(cntArr[2]) + '\n')


def visitWeiboContent(userId):
    pageList = driver.find_element_by_xpath("//div[@class='pa']")
    print(pageList.text)
    pattern = r"\d+"            # match integers only
    pageArr = re.findall(pattern, pageList.text)
    totalPages = int(pageArr[1])    # total number of pages of posts
    print(totalPages)

    pageNum = 1                 # current page number
    numInCurPage = 1            # index of the post within the current page
    curNum = 0                  # running index over all posts
    contentPath = "//div[@class='c'][{0}]"
    #while(pageNum <= 3):
    while pageNum <= totalPages:
        try:
            contentUrl = "http://weibo.cn/" + userId + "?page=" + str(pageNum)
            driver.get(contentUrl)
            content = driver.find_element_by_xpath(contentPath.format(numInCurPage)).text
            #print("\n" + content)                  # post text, original or repost
            # The page footer (Simplified Chinese, as shown on weibo.cn)
            # marks the end of the posts on the current page
            if "设置:皮肤.图片.条数.隐私" not in content:
                numInCurPage += 1
                curNum += 1
                with open("weibocontent.txt", "a", encoding="gb18030") as file:
                    file.write(str(curNum) + '\n' + content + '\n\n')
            else:
                pageNum += 1                        # move on to the next page
                numInCurPage = 1                    # every page starts from post 1
                time.sleep(20)                      # wait 20 seconds, otherwise we get blocked
        except Exception as e:
            print("curNum:" + str(curNum))
            print(e)
            break                                   # stop instead of retrying the same post forever
    print("Load weibo content finished!")


if __name__ == '__main__':
    username = '******'             # your Weibo account
    password = '******'             # your password
    loginWeibo(username, password)      # log in first, otherwise the posts cannot be fetched

    time.sleep(3)
    uid = 'xywyw'                       # custom domain of the "尋醫問藥" account
    visitUserInfo(uid)                  # fetch the user's basic information
    visitWeiboContent(uid)              # fetch the posts
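
The two regular-expression steps in the program (reading the counters and the total page count) can be tried out without a browser. The following is a minimal sketch under stated assumptions: the two sample strings are hypothetical stand-ins for the text of the `tip2` and `pa` divs, and `parse_counts` and `parse_total_pages` are helper names introduced here, not part of the original program.

```python
import re

# Hypothetical sample strings; the real text comes from the 'tip2' and 'pa' divs
SAMPLE_TIP = "微博[1024] 关注[56] 粉丝[7890]"
SAMPLE_PAGER = "下页 1/25页"


def parse_counts(tip_text):
    """Return (posts, following, followers) as ints from the counter line."""
    numbers = re.findall(r"\d+", tip_text)
    if len(numbers) < 3:
        raise ValueError("expected three counters, got: " + repr(tip_text))
    posts, following, followers = (int(n) for n in numbers[:3])
    return posts, following, followers


def parse_total_pages(pager_text):
    """Return the total page count from text shaped like 'current/total'."""
    match = re.search(r"(\d+)\s*/\s*(\d+)", pager_text)
    if match is None:
        return 1  # assumption: no pager text means a single page
    return int(match.group(2))


if __name__ == "__main__":
    print(parse_counts(SAMPLE_TIP))
    print(parse_total_pages(SAMPLE_PAGER))
```

Matching the `current/total` shape explicitly, rather than taking the second number found, is a design choice that keeps the parser from picking up an unrelated digit elsewhere in the pager text.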

Result:
Two files are generated:
1) userinfo.txt, which stores the user's basic profile
2) weibocontent.txt, which stores all of the user's historical posts

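Note that the program sleeps for 20 seconds after every page. This avoids getting blocked; once blocked, the site responds with HTTP error 403.

My guess is that Sina blocks by cookie rather than by username or IP: when the 403 page appears, clearing the browser's cookies makes weibo.cn accessible again immediately. Without clearing them, access comes back after waiting about half an hour.
The 20 seconds is an empirical value from repeated trials. With 10 seconds, the 403 error appears after a little over 200 posts; with 15 seconds, it appears after a little over 500 posts.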
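
The pacing described above can be separated from the scraping itself. The following is a minimal sketch, not the original program: `crawl_pages` and the `fetch_page` callable are hypothetical names, and the defaults (20 seconds between pages, a cooldown of about half an hour after a block) are simply the values reported in this article.

```python
import time


def crawl_pages(fetch_page, total_pages, delay=20.0, cooldown=1800.0,
                max_retries=2, sleep=time.sleep):
    """Fetch pages 1..total_pages, pausing `delay` seconds between pages.

    fetch_page(page_num) is a hypothetical callable that returns True on
    success and False when the site answers with a block page (HTTP 403).
    Returns the list of page numbers that were fetched successfully.
    """
    fetched = []
    for page_num in range(1, total_pages + 1):
        for attempt in range(max_retries + 1):
            if fetch_page(page_num):
                fetched.append(page_num)
                break
            if attempt < max_retries:
                sleep(cooldown)  # blocked: wait out the ban before retrying
        else:
            break  # still blocked after all retries, so give up
        if page_num < total_pages:
            sleep(delay)  # regular pause between pages
    return fetched
```

Passing `sleep` as a parameter makes it possible to try the pacing logic with a stand-in function instead of actually waiting.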
