Scraping Qiushibaike Content with Python: A Walkthrough

阿新 • Published: 2019-01-06

Reference: http://blog.csdn.net/flyingfishmark/article/details/51251534

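Before scraping, let's look at our goals:

1. Scrape the hot posts (jokes) on Qiushibaike

2. Filter out posts that contain images

3. For each post, extract the author, the content, the number of "funny" votes, and the number of comments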

# -*- coding: utf-8 -*-
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=4872200'
user_agent = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)

    # In Python 2, text is represented internally as unicode, so unicode normally
    # serves as the intermediate form when converting between encodings: first
    # decode the string from its source encoding into unicode, then encode the
    # unicode into the target encoding.
    # decode converts a string in some other encoding into unicode; for example,
    # str1.decode('gb2312') converts the gb2312-encoded string str1 into unicode.
    # encode converts unicode into a string in some other encoding; for example,
    # str2.encode('gb2312') converts the unicode string str2 into gb2312.
    content = response.read().decode('utf-8')

    '''
    .* and .+ match any character repeated any number of times (.+ needs at least one).
    A trailing question mark makes the match non-greedy, that is, as short as possible.

    Greedy matching: match the longest string that satisfies the pattern. This is the default.
    (The snippets in this note are C#.)
    string pattern1 = @"a.*c"; // greedy match
    Regex regex = new Regex(pattern1);
    regex.Match("abcabc"); // return "abcabc"

    Non-greedy matching: match the shortest string that satisfies the pattern. It is requested with ?.

    string pattern1 = @"a.*?c"; // non-greedy match
    Regex regex = new Regex(pattern1);
    regex.Match("abcabc"); // return "abc"
    '''
    # With re.S, '.' matches any character including a newline
    # (by default '.' does not match a newline).
    pattern = re.compile('<div.*?author clearfix">.*?<a.*?<img.*?>.*?</a>.*?<a.*?>.*?<h2>(.*?)</h2>.*?content">' +
                         '(.*?)</div>.*?<div class="stats">.*?vote".*?number">(.*?)</i>.*?stats-comments".*?<a.*?number">(.*?)</i>', re.S)

    # Python supports regular expressions through the re module.
    # The usual workflow is to call re.compile() to compile the regular expression
    # string into a Pattern instance, then use that instance to process text and
    # collect the matches.

    '''
    import re
    pattern = re.compile('[a-zA-Z]')
    result = pattern.findall('as3SiOPdj#@23awe')
    print result
    # ['a', 's', 'S', 'i', 'O', 'P', 'd', 'j', 'a', 'w', 'e']
    '''

    # Each item is a tuple: (author, content, funny votes, comments).
    items = re.findall(pattern, content)
    for item in items:
        print item[0], item[1], item[2], item[3]
    # print response.read()
except urllib2.URLError as e:
    # hasattr(object, name) checks whether object has an attribute called name.
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
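
The decode/encode comments in the script describe Python 2. As a minimal sketch of the same round trip, assuming Python 3 (where `bytes` plays the role of the Python 2 byte string and `str` the role of `unicode`), the idea might look like this. The helper name `transcode` and the sample text are made up for illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes from one encoding to another, with str as the intermediate form."""
    text = data.decode(source_encoding)    # bytes -> str (Unicode)
    return text.encode(target_encoding)    # str -> bytes in the target encoding


sample = '中文段子'
gb_bytes = sample.encode('gb2312')         # stand-in for bytes read from a gb2312 page
utf8_bytes = transcode(gb_bytes, 'gb2312', 'utf-8')

# Decoding the result with the target encoding should give back the original text.
print(utf8_bytes.decode('utf-8') == sample)
```

In the script itself only the first half is needed: the page is served as UTF-8, so `response.read().decode('utf-8')` is enough to get text that `re` can search.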
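
The greedy and non-greedy note inside the script uses C# snippets. The same behaviour can be checked with Python's own `re` module, together with the effect of `re.S`. This is a small sketch assuming Python 3; the `html` string is a made-up fragment.

```python
import re

text = 'abcabc'
greedy = re.search(r'a.*c', text).group()    # .* takes as much as it can
lazy = re.search(r'a.*?c', text).group()     # .*? takes as little as it can
print(greedy, lazy)

# By default '.' stops at a newline; re.S (also spelled re.DOTALL) lets it cross lines.
html = '<h2>\nname\n</h2>'
without_flag = re.findall(r'<h2>(.*?)</h2>', html)
with_flag = re.findall(r'<h2>(.*?)</h2>', html, re.S)
print(without_flag, with_flag)
```

This is why the scraping pattern is compiled with `re.S`: the author, content, and counters of one post are spread over several lines of HTML, so every `.*?` in the pattern has to be able to cross line breaks.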
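
The script extracts the four fields of goal 3 but never performs the image filtering of goal 2. One possible way to add it, sketched here under assumptions: the pattern is the script's own with one extra capture group for whatever sits between the content block and the stats block, and a post counts as an image post when that gap contains an `<img` tag. The `POST` template, the `thumb` class, and the function name `extract_text_posts` are hypothetical and only loosely modelled on what the pattern expects. The code assumes Python 3.

```python
import re

# The script's pattern plus one extra group (the third) that captures the
# markup between the end of the content block and the start of the stats block.
PATTERN = re.compile(
    r'<div.*?author clearfix">.*?<a.*?<img.*?>.*?</a>.*?<a.*?>.*?<h2>(.*?)</h2>'
    r'.*?content">(.*?)</div>(.*?)<div class="stats">'
    r'.*?vote".*?number">(.*?)</i>.*?stats-comments".*?<a.*?number">(.*?)</i>',
    re.S)

# Hypothetical markup for one post; only the parts the pattern relies on are present.
POST = '''<div class="article block">
<div class="author clearfix">
<a href="/users/{uid}/"><img src="avatar.jpg" alt=""></a>
<a href="/users/{uid}/"><h2>{author}</h2></a>
</div>
<div class="content">
{content}
</div>
{extra}
<div class="stats">
<span class="stats-vote"><i class="number">{votes}</i> funny</span>
<span class="stats-comments"><a href="/article/{uid}"><i class="number">{comments}</i> comments</a></span>
</div>
</div>
'''


def extract_text_posts(html):
    """Return (author, content, votes, comments) tuples for posts without images."""
    posts = []
    for author, content, gap, votes, comments in PATTERN.findall(html):
        if '<img' in gap:    # assumed marker of an image post
            continue
        posts.append((author.strip(), content.strip(), int(votes), int(comments)))
    return posts


page_html = (
    POST.format(uid=1, author='Alice', content='A short joke.', extra='',
                votes=123, comments=45)
    + POST.format(uid=2, author='Bob', content='Look at this.',
                  extra='<div class="thumb"><img src="pic.jpg" alt=""></div>',
                  votes=7, comments=2)
)
print(extract_text_posts(page_html))
```

Regular expressions this long are tied to the exact page layout, so on a live page the pattern and the image marker would both need checking against the HTML actually returned.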
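
The script targets Python 2, where the HTTP client lives in `urllib2`. For readers on Python 3, a rough equivalent of the request and the `hasattr`-based error handling might look like the following sketch. It uses `urllib.request` and `urllib.error` from the standard library; the function names `build_request` and `fetch` and the `timeout` value are illustrative choices, not part of the original script.

```python
import urllib.error
import urllib.request

USER_AGENT = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'


def build_request(page):
    """Build the request for one listing page, using the script's URL scheme and header."""
    url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=4872200'
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch(page, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(page), timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError and carries a status code;
        # hasattr guards each attribute before it is printed, as in the script.
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
        return None


request = build_request(1)
print(request.full_url)
```

The block only builds the request and does not call `fetch`, so it runs without network access. Calling `fetch(1)` would perform the actual download.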
