Scraping Baidu Tieba data with Python

阿新 • Published: 2017-07-25


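Outside of work, yours truly does have a few hobbies. Some things, once you pick them up, you cannot put down again; there is no way back to shore, and from then on you are on a road of no return. Keeping flowers, birds, fish, and insects is a hobby I have stuck with for decades.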

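I still need a day job to pay for my hobbies. During working hours I try to stay off external sites and avoid doing things unrelated to work. For a while, though, as if I had been bitten by something, I kept wanting to follow the trade posts for lovebirds and cockatiels on Baidu Tieba.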

So I wrote the following code:

import requests
from lxml import etree

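# url = r"http://tieba.baidu.com/p/5197963751"
url = r"http://tieba.baidu.com/p/5195568368"  # the thread that gets scraped
# url = r"http://tieba.baidu.com/p/5004763771"
# Guangzhou, written in Simplified Chinese as it appears in Tieba posts
keyword = "广州"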

s = requests.session()


def findgz(pageindex):
    r = s.get("{1}?pn={0}".format(pageindex, url))
    # print(r.text.encode("utf-8"))
    htmlpage = etree.HTML(r.text)

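    # Each post body sits in a div with exactly this class attribute
    divlist = htmlpage.xpath(
        "//div[@class='d_post_content j_d_post_content  clearfix']")
    print("Page {0}".format(pageindex))
    for x in divlist:
        texts = x.xpath('text()')
        # Print the whole post once if any of its text nodes contains the keyword
        if any(keyword in y for y in texts):
            for z in texts:
                print(z.replace(' ', ''))
            print('\n')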


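r = s.get(url)
tmphtml = etree.HTML(r.text)
# The href of the "尾页" (last page) link ends in "pn=<last page number>"
maxpageindex = tmphtml.xpath("//a[text()='尾页']")[0].get("href").split("=")[-1]
print("{0} pages in total".format(maxpageindex))

# Walk every page of the thread
for x in range(1, int(maxpageindex) + 1):
    findgz(x)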

The output looks like this:

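8 pages in total
Page 1
1. Baidu nickname: aiiye1234
2. Item for sale: white-faced and yellow-faced
3. Price: 400-1000
4. Contact: QQ 822616382
5. Location: Guangzhou
6. Other notes: just starting to learn to eat
7. Item pictures: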
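
One fragile spot in the script above is how it reads the page count: it splits the "last page" link's href on "=" and takes the final piece, which only works while pn is the last query parameter. Below is a minimal sketch of a more tolerant version using the standard library's urllib.parse. The helper name last_page_from_href, its default argument, and the sample href values (including the see_lz parameter) are assumptions made for illustration; they are not part of the original script.

```python
from urllib.parse import urlparse, parse_qs


def last_page_from_href(href, default=1):
    """Return the `pn` query parameter of `href` as an int.

    Falls back to `default` when `pn` is missing or is not a number.
    """
    query = parse_qs(urlparse(href).query)
    values = query.get("pn", [])
    try:
        return int(values[0])
    except (IndexError, ValueError):
        return default


# Hypothetical href values, shaped like the pagination links the script expects
print(last_page_from_href("/p/5195568368?pn=8"))
print(last_page_from_href("/p/5195568368?pn=8&see_lz=1"))
print(last_page_from_href("/p/5195568368"))
```

In the script, this could stand in for the .split("=")[-1] call. Note also that xpath(...)[0] raises IndexError when no "last page" link is found at all, which may be the case for a thread that fits on one page, so it is probably worth checking that the list is non-empty before indexing it.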
