Scraping Baidu Tieba with BeautifulSoup
阿新 • Published: 2017-07-11
Tags: web scraping, python, beautifulsoup, Baidu Tieba
BeautifulSoup is a Python module for parsing documents (a third-party library, not a built-in one). It differs from Scrapy in that Scrapy is a ready-made framework where you mostly fill in the blanks of a fixed structure, whereas with BeautifulSoup you build the wheel yourself: more work than Scrapy, but also more flexible.
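For readers new to BeautifulSoup, here is a minimal sketch of its core parse-then-find workflow (the HTML snippet and the class name are invented for illustration; it assumes Python 2 and the lxml parser, matching the full example below):

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Invented HTML snippet, just to show the parse-then-find workflow
html = '<ul><li class="thread"><a href="/p/1">hello</a></li></ul>'
soup = BeautifulSoup(html, 'lxml')
for li in soup.find_all('li', class_='thread'):
    print li.a.get_text(), li.a['href']   # -> hello /p/1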
The example below, which scrapes thread listings from Baidu Tieba, illustrates the approach.
# -*- coding:utf-8 -*-
__author__ = 'fengzhankui'
import urllib2
from bs4 import BeautifulSoup


class Item(object):
    # One scraped thread: title, first/last author and time, reply count, abstract
    title = None
    firstAuthor = None
    firstTime = None
    reNum = None
    content = None
    lastAuthor = None
    lastTime = None


class GetTiebaInfo(object):
    def __init__(self, url):
        self.url = url
        self.pageSum = 5
        self.urls = self.getUrls(self.pageSum)
        self.items = self.spider(self.urls)
        self.pipelines(self.items)

    def getUrls(self, pageSum):
        # Build one URL per page by rewriting the pn= offset (50 threads per page)
        urls = []
        pns = [str(i * 50) for i in range(pageSum)]
        ul = self.url.split('=')
        for pn in pns:
            ul[-1] = pn
            url = '='.join(ul)
            urls.append(url)
        return urls

    def spider(self, urls):
        items = []
        for url in urls:
            htmlContent = self.getResponseContent(url)
            soup = BeautifulSoup(htmlContent, 'lxml')
            tagsli = soup.find_all('li', class_=['j_thread_list', 'clearfix'])[2:]
            for tag in tagsli:
                # Pinned (sticky) threads have no abstract div, so skip them here
                if tag.find('div', attrs={'class': 'threadlist_abs threadlist_abs_onlyline '}) is None:
                    continue
                item = Item()
                item.title = tag.find('a', attrs={'class': 'j_th_tit'}).get_text().strip()
                item.firstAuthor = tag.find('span', attrs={'class': 'frs-author-name-wrap'}).a.get_text().strip()
                item.firstTime = tag.find('span', attrs={'title': u'創建時間'.encode('utf8')}).get_text().strip()
                item.reNum = tag.find('span', attrs={'title': u'回復'.encode('utf8')}).get_text().strip()
                item.content = tag.find('div', attrs={'class': 'threadlist_abs threadlist_abs_onlyline '}).get_text().strip()
                item.lastAuthor = tag.find('span', attrs={'class': 'tb_icon_author_rely j_replyer'}).a.get_text().strip()
                item.lastTime = tag.find('span', attrs={'title': u'最後回復時間'.encode('utf8')}).get_text().strip()
                items.append(item)
        return items

    def pipelines(self, items):
        # Append each thread's fields as one tab-separated line
        with open('tieba.txt', 'a') as fp:
            for item in items:
                fp.write('title:' + item.title.encode('utf8') + '\t')
                fp.write('firstAuthor:' + item.firstAuthor.encode('utf8') + '\t')
                fp.write('firstTime:' + item.firstTime.encode('utf8') + '\t')
                fp.write('reNum:' + item.reNum.encode('utf8') + '\t')
                fp.write('content:' + item.content.encode('utf8') + '\t')
                fp.write('lastAuthor:' + item.lastAuthor.encode('utf8') + '\t')
                fp.write('lastTime:' + item.lastTime.encode('utf8') + '\t')
                fp.write('\n')

    def getResponseContent(self, url):
        try:
            response = urllib2.urlopen(url.encode('utf8'))
        except:
            print 'fail'
        else:
            return response.read()


if __name__ == '__main__':
    url = u'http://tieba.baidu.com/f?kw=戰狼2&ie=utf-8&pn=50'
    GetTiebaInfo(url)
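Note that the code above is Python 2 (urllib2, the print statement, explicit .encode('utf8') calls). As a rough, untested sketch of what the fetch method would become under Python 3 (the rest of the class would need similar str/bytes adjustments):

# Python 3 sketch of getResponseContent; assumes the surrounding class is
# ported as well, since string handling elsewhere also needs updating
from urllib.request import urlopen
from urllib.parse import quote

def getResponseContent(self, url):
    try:
        # Percent-encode non-ASCII characters such as the kw=戰狼2 parameter
        response = urlopen(quote(url, safe=':/?=&'))
    except Exception:
        print('fail')
        return None
    else:
        return response.read()  # bytes; BeautifulSoup accepts bytes directly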
Code notes:
This example follows a Scrapy-like structure: define an Item class, fetch the HTML for each URL, then hand it to a third method for parsing. Every Tieba board has pinned (sticky) threads at the top, and because find_all matches class names with "in" semantics (a tag matches if it carries any of the listed classes) rather than "and" semantics (all classes required), the class names alone cannot select normal threads precisely; that is why the tag loop contains an extra condition that filters out the pinned entries.
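A quick illustration of that "in, not and" behavior (an invented <li> snippet, with class names borrowed from the Tieba markup):

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# Invented snippet: three <li> tags with different class combinations
html = ('<ul><li class="j_thread_list">a</li>'
        '<li class="clearfix">b</li>'
        '<li class="j_thread_list clearfix">c</li></ul>')
soup = BeautifulSoup(html, 'lxml')

# Passing a list to class_ matches tags carrying ANY listed class ("or"),
# which is why pinned threads slip through and must be filtered in the loop
print len(soup.find_all('li', class_=['j_thread_list', 'clearfix']))   # 3

# A CSS selector, by contrast, requires ALL listed classes ("and")
print len(soup.select('li.j_thread_list.clearfix'))                    # 1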