Scraping CSDN Blog Content with Python

阿新 • Published: 2018-12-31

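First, a note: the overall structure of a crawler is reusable, but the string-matching patterns are not. Code written to scrape CSDN will not work on Cnblogs, for example, because each pattern is tailored to the HTML of the site it targets.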

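Building a crawler in Python follows a simple routine, which I summarize in three steps:

1. Use re.compile to define the string pattern to look for.

2. Open the page with page = urllib.urlopen(...) and read its content with page.read().

3. Use re.search or re.findall to extract the content that matches the pattern.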
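The three steps above can be sketched as follows. This is a minimal illustration, assuming Python 3, where urllib.urlopen has moved to urllib.request.urlopen. The names LINK_PATTERN, fetch_html, and find_matches are made up for this example, and the pattern is only a placeholder, since a real crawler needs a pattern written for the target site.

```python
import re
import urllib.request

# Step 1: compile the pattern. This placeholder captures the href of
# every plain anchor tag; it is not the CSDN-specific pattern.
LINK_PATTERN = re.compile(r'<a href="(.*?)">', re.S)


def fetch_html(url, encoding="utf-8"):
    """Step 2: open the page and read its content as text."""
    with urllib.request.urlopen(url) as page:
        return page.read().decode(encoding, errors="replace")


def find_matches(pattern, html):
    """Step 3: return every captured group that matches the pattern."""
    return pattern.findall(html)


# Made-up HTML so the example runs without network access.
sample = '<p><a href="/a/1">one</a> <a href="/a/2">two</a></p>'
print(find_matches(LINK_PATTERN, sample))
```

Against a live page, the call would look like find_matches(LINK_PATTERN, fetch_html(some_url)), where some_url is whatever page you want to scan.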

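Below is an example that finds the URLs of the articles on a CSDN blog.

First, open any blog (I picked one at random: http://blog.csdn.net/mynameishuangshuai) and view the page source. Look through it for the passages that contain an href attribute and you will find a fragment like this one (or, in a browser on Windows, you can simply click the article title with the mouse to see it).

This is what I found. Clicking the blue URI takes you to the article itself, so our goal is to extract this URI from the page.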

1. Define the string pattern with a regular expression

pattern = re.compile('<span class="link_title"><a href="(.*?)">', re.S)

2. Fetch the page's HTML

page = urllib.urlopen(pageURL)
html = page.read()

3. Find the strings that match the pattern

items = re.findall(pattern, html)

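That is all there is to a basic crawler. Most of the scraping I do follows this pattern; the only other piece is a small amount of code that writes the results to files.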
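To see why the pattern uses the non-greedy (.*?) rather than (.*), here is a small offline sketch. The HTML fragment is invented to resemble the link_title markup the pattern targets, and the article IDs 1 and 2 are placeholders, not real articles.

```python
import re

# The pattern from step 1: capture the href inside each link_title span.
pattern = re.compile(r'<span class="link_title"><a href="(.*?)">', re.S)

# Hypothetical fragment shaped like the markup the pattern expects.
sample_html = '''
<span class="link_title"><a href="/mynameishuangshuai/article/details/1">
    First post
</a></span>
<span class="link_title"><a href="/mynameishuangshuai/article/details/2">
    Second post
</a></span>
'''

# Non-greedy: each match stops at the first '">' after the href.
links = pattern.findall(sample_html)
print(links)

# Greedy variant for comparison: with re.S, '.*' also crosses newlines,
# so a single match runs from the first span to the last '">'.
greedy = re.compile(r'<span class="link_title"><a href="(.*)">', re.S)
print(len(greedy.findall(sample_html)))
```

The non-greedy version returns one URI per article, while the greedy version collapses everything into a single oversized match.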
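At present, you pass the URL of the blog you want to fetch when invoking the script, and it extracts every article by that author. For example,

python spider.py http://blog.csdn.net/myiloveuuu

extracts all articles by the user myiloveuuu.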
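Given a blog URL like the one above, the script walks the paginated list pages, which it builds by appending /article/list/ and a page number to the blog URL. A minimal sketch of that URL construction follows. The function name list_page_urls is hypothetical, and stripping a trailing slash is an addition that the original script does not perform.

```python
def list_page_urls(blog_url, page_count):
    """Build the list-page URLs for pages 1..page_count of a blog."""
    # Assumption: drop a trailing slash so the joined URL has no '//'.
    base = blog_url.rstrip('/')
    return ['{}/article/list/{}'.format(base, i)
            for i in range(1, page_count + 1)]


for url in list_page_urls('http://blog.csdn.net/myiloveuuu', 2):
    print(url)
```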

The full code is as follows:

#-*- coding:utf-8 -*-

import urllib
import re
import os
import sys

class Spider:
	def __init__(self, blogURL):
		self.num = 0
		self.blogURLHead = 'http://blog.csdn.net'
		self.siteURL = blogURL
		self.basePath = './GetContent'
		if not os.path.exists(self.basePath):
			os.mkdir(self.basePath)
	
	# Get the number of list pages in the blog
	def getPageNum(self):
		pattern = re.compile('<span>(.*?)</span><strong>1</strong>')
		page = urllib.urlopen(self.siteURL)
		html = page.read()
		content = re.findall(pattern, html)
		contentList = content[0].split()
		string = contentList[-1]
		pageNum = filter(str.isdigit, string)

		return pageNum

	# Find the URIs of all article titles on one list page
	def getAllTitleURL(self, pageURL):
		pattern = re.compile('<span class="link_title"><a href="(.*?)">', re.S)
		page = urllib.urlopen(pageURL)
		html = page.read()
		items = re.findall(pattern, html)
		return items
	
	# Get the content of one article
	def getBrief(self, pageURL):
		pattern = re.compile('<div id="article_content"(.*?)</div>', re.S)
		page = urllib.urlopen(pageURL)
		html = page.read()
		result = re.search(pattern, html)
		content = result.group(0)
		return content.strip()

	# Save an article to a file
	def saveBrief(self, content, fileName, pageURL):
		conList = pageURL.split('/')
		dirPath = conList[1]
		dirPath = self.basePath + '/' + dirPath
		if not os.path.exists(dirPath):
			os.mkdir(dirPath)
		fileName = dirPath + '/' + fileName
		# The with block closes the file once the write finishes
		with open(fileName, "w") as f:
			f.write(content)

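	# Find the title of the given article
	def findPageTitle(self, pageURI, pageURL):
		# re.escape keeps any regex metacharacters in the URI from being treated as pattern syntax
		keyInfo = "<span class=\"link_title\"><a href=\"" + re.escape(pageURI) + "\">(.*?)</a>"
		pattern = re.compile(keyInfo, re.S)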
		page = urllib.urlopen(pageURL)
		html = page.read()
		title = re.findall(pattern, html)
		titlePattern = re.compile(r'<[^>]+>', re.S)
		result = titlePattern.sub('', title[0])
		return result.strip()

	def subHtmlLabel(self, context):
		# [^>] matches any character except '>', and + means one or more of them,
		# so the pattern matches '<anything>', i.e. a whole HTML tag
		pattern = re.compile(r'<[^>]+>', re.S)
		result = pattern.sub('', context)
		return result

	def savePerPageInfo(self, pageURL):
		contents = self.getAllTitleURL(pageURL)
		for item in contents:
			self.num += 1
			perPageURL = self.blogURLHead + item
			pageTitle = self.findPageTitle(item, perPageURL)
			brief = self.getBrief(perPageURL)
			result = self.subHtmlLabel(brief)
			self.saveBrief(result, pageTitle, item)


	def savePageInfo(self):
		pageNum = self.getPageNum()
		for i in range(1, int(pageNum)+1):
			pageURL = self.siteURL + "/article/list/" + str(i)
			self.savePerPageInfo(pageURL)


spider = Spider(sys.argv[1])
spider.savePageInfo()
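The script above targets Python 2. As a rough sketch of how its text-processing helpers might look under Python 3, the block below re-implements the tag stripping from subHtmlLabel and the digit extraction from getPageNum. In Python 3, filter returns an iterator rather than a string, so the digits are joined explicitly here. The names strip_tags, extract_digits, and safe_file_name are hypothetical, and the sample pager text is assumed. safe_file_name is an addition that is not in the original script: it guards against titles containing characters such as '/' that would otherwise break the output path.

```python
import re

# Same idea as subHtmlLabel: '<', then one or more non-'>' characters, then '>'.
TAG_PATTERN = re.compile(r'<[^>]+>')


def strip_tags(html):
    """Remove every HTML tag, keeping only the text between tags."""
    return TAG_PATTERN.sub('', html)


def extract_digits(text):
    """Keep only the digit characters of text, as getPageNum does."""
    return ''.join(ch for ch in text if ch.isdigit())


def safe_file_name(title):
    """Replace characters that are unsafe in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


print(strip_tags('<p>Hello <b>world</b></p>'))
print(int(extract_digits('共8页')))  # assumed pager text, not copied from CSDN
print(safe_file_name('a/b: c?'))
```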
