Scraping Baidu Tieba with BeautifulSoup
阿新 • Published: 2017-07-11
Tags: web scraping, python, beautifulsoup, Baidu Tieba
BeautifulSoup is a Python module for parsing documents (a third-party library, not a built-in one). It differs from Scrapy in that Scrapy is a ready-made framework where you mostly fill in the blanks of a fixed structure, whereas with BeautifulSoup you build the wheel yourself: more work than Scrapy, but also more flexible.
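For readers new to BeautifulSoup, here is a minimal sketch of its core parse-then-find workflow (the HTML snippet and the class name are invented for illustration; it assumes Python 2 and the lxml parser, matching the full example below):

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Invented HTML snippet, just to show the parse-then-find workflow
html = '<ul><li class="thread"><a href="/p/1">hello</a></li></ul>'
soup = BeautifulSoup(html, 'lxml')
for li in soup.find_all('li', class_='thread'):
    print li.a.get_text(), li.a['href']   # -> hello /p/1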
The example below, which scrapes thread listings from Baidu Tieba, illustrates the approach.
# -*- coding:utf-8 -*-
__author__ = 'fengzhankui'
import urllib2
from bs4 import BeautifulSoup


class Item(object):
    # One scraped thread: title, first/last author and time, reply count, abstract
    title = None
    firstAuthor = None
    firstTime = None
    reNum = None
    content = None
    lastAuthor = None
    lastTime = None


class GetTiebaInfo(object):
    def __init__(self, url):
        self.url = url
        self.pageSum = 5
        self.urls = self.getUrls(self.pageSum)
        self.items = self.spider(self.urls)
        self.pipelines(self.items)

    def getUrls(self, pageSum):
        # Build one URL per page by rewriting the pn= offset (50 threads per page)
        urls = []
        pns = [str(i * 50) for i in range(pageSum)]
        ul = self.url.split('=')
        for pn in pns:
            ul[-1] = pn
            url = '='.join(ul)
            urls.append(url)
        return urls

    def spider(self, urls):
        items = []
        for url in urls:
            htmlContent = self.getResponseContent(url)
            soup = BeautifulSoup(htmlContent, 'lxml')
            tagsli = soup.find_all('li', class_=['j_thread_list', 'clearfix'])[2:]
            for tag in tagsli:
                # Pinned (sticky) threads have no abstract div, so skip them here
                if tag.find('div', attrs={'class': 'threadlist_abs threadlist_abs_onlyline '}) is None:
                    continue
                item = Item()
                item.title = tag.find('a', attrs={'class': 'j_th_tit'}).get_text().strip()
                item.firstAuthor = tag.find('span', attrs={'class': 'frs-author-name-wrap'}).a.get_text().strip()
                item.firstTime = tag.find('span', attrs={'title': u'創建時間'.encode('utf8')}).get_text().strip()
                item.reNum = tag.find('span', attrs={'title': u'回復'.encode('utf8')}).get_text().strip()
                item.content = tag.find('div', attrs={'class': 'threadlist_abs threadlist_abs_onlyline '}).get_text().strip()
                item.lastAuthor = tag.find('span', attrs={'class': 'tb_icon_author_rely j_replyer'}).a.get_text().strip()
                item.lastTime = tag.find('span', attrs={'title': u'最後回復時間'.encode('utf8')}).get_text().strip()
                items.append(item)
        return items

    def pipelines(self, items):
        # Append each thread's fields as one tab-separated line
        with open('tieba.txt', 'a') as fp:
            for item in items:
                fp.write('title:' + item.title.encode('utf8') + '\t')
                fp.write('firstAuthor:' + item.firstAuthor.encode('utf8') + '\t')
                fp.write('firstTime:' + item.firstTime.encode('utf8') + '\t')
                fp.write('reNum:' + item.reNum.encode('utf8') + '\t')
                fp.write('content:' + item.content.encode('utf8') + '\t')
                fp.write('lastAuthor:' + item.lastAuthor.encode('utf8') + '\t')
                fp.write('lastTime:' + item.lastTime.encode('utf8') + '\t')
                fp.write('\n')

    def getResponseContent(self, url):
        try:
            response = urllib2.urlopen(url.encode('utf8'))
        except:
            print 'fail'
        else:
            return response.read()


if __name__ == '__main__':
    url = u'http://tieba.baidu.com/f?kw=戰狼2&ie=utf-8&pn=50'
    GetTiebaInfo(url)
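Note that the code above is Python 2 (urllib2, the print statement, explicit .encode('utf8') calls). As a rough, untested sketch of what the fetch method would become under Python 3 (the rest of the class would need similar str/bytes adjustments):

# Python 3 sketch of getResponseContent; assumes the surrounding class is
# ported as well, since string handling elsewhere also needs updating
from urllib.request import urlopen
from urllib.parse import quote

def getResponseContent(self, url):
    try:
        # Percent-encode non-ASCII characters such as the kw=戰狼2 parameter
        response = urlopen(quote(url, safe=':/?=&'))
    except Exception:
        print('fail')
        return None
    else:
        return response.read()  # bytes; BeautifulSoup accepts bytes directly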
Code notes:
This example follows a Scrapy-like structure: define an Item class, fetch the HTML for each URL, then hand it to a third method for parsing. Every Tieba board has pinned (sticky) threads at the top, and because find_all matches class names with "in" semantics (a tag matches if it carries any of the listed classes) rather than "and" semantics (all classes required), the class names alone cannot select normal threads precisely; that is why the tag loop contains an extra condition that filters out the pinned entries.
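A quick illustration of that "in, not and" behavior (an invented <li> snippet, with class names borrowed from the Tieba markup):

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Invented snippet: three <li> tags with different class combinations
html = ('<ul><li class="j_thread_list">a</li>'
        '<li class="clearfix">b</li>'
        '<li class="j_thread_list clearfix">c</li></ul>')
soup = BeautifulSoup(html, 'lxml')

# Passing a list to class_ matches tags carrying ANY listed class ("or"),
# which is why pinned threads slip through and must be filtered in the loop
print len(soup.find_all('li', class_=['j_thread_list', 'clearfix']))   # 3

# A CSS selector, by contrast, requires ALL listed classes ("and")
print len(soup.select('li.j_thread_list.clearfix'))                    # 1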