Team - Douban Top 250 Crawler Project - Project Progress

阿新 • Published: 2017-10-15

Note:

Online regular expression tester: http://tool.oschina.net/regex/

Progress:

1. Fetching the HTML source

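# Fetch the HTML source of a URL (returned as bytes)
import urllib.request

def getHtml(url):
    try:
        page = urllib.request.urlopen(url)
        html = page.read()
    except Exception:
        print("failed to geturl")
        return b''  # empty bytes, so that a later .decode() still works
    else:
        return html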
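
As a hedged aside: `urlopen` called without a timeout can block for a long time, and some sites reject the default urllib User-Agent. The following is a minimal sketch of a variant, not the project's code. The names `build_request` and `get_html_with_headers`, the User-Agent string, and the 10-second timeout are all illustrative assumptions.

```python
import urllib.error
import urllib.request

# Illustrative value only (assumption); any browser-like string could be used.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"

def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def get_html_with_headers(url, timeout=10):
    """Return the page body as bytes, or b'' if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as page:
            return page.read()
    except (urllib.error.URLError, OSError) as exc:
        print("failed to get url:", exc)
        return b""
```

Returning bytes in both the success and the failure case means the caller can always call `.decode("UTF-8")` on the result.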

2. Scraping book titles

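# Use a regular expression to get the title of every book on the page
# (the surrounding newlines are not stripped yet)
import re

def getTitle(html):
    nameList = re.findall(r'<a href="https.*?".*?target="_blank">(.*?)</a>', html, re.S)
    newNameList = []
    global topnum  # module-level running counter, defined elsewhere
    for index, item in enumerate(nameList):
        if item.find("img") == -1:  # check for "img" to keep only the Chinese titles
            #item.replace('\n','')
            #item.strip()
            #item.splitlines()
            #re.sub('\r|\n', '', item)
            if topnum % 26 != 0:  # every 26th match is the extra per-page item; skip it
                #newNameList.append("Top " + str(topnum) + " " + item)
                newNameList.append(item)
            topnum += 1
    return newNameList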
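
To make the behaviour of the title regex concrete, here is a small sketch that runs the same pattern on a made-up, heavily simplified HTML fragment. The fragment and the `example.com` image host are assumptions for illustration; the real Douban markup may differ. It shows two things: the cover link and the title link both match, so the match containing "img" has to be filtered out, and because of `re.S` the capture group spans the line breaks between the tags, which is why the titles come back with surrounding whitespace.

```python
import re

# Same pattern as in getTitle
TITLE_RE = re.compile(r'<a href="https.*?".*?target="_blank">(.*?)</a>', re.S)

# Hypothetical, simplified markup for one list entry (assumption)
SAMPLE = '''
<div class="post">
  <a href="https://book.douban.com/subject/1/" target="_blank">
    <img width="100" src="https://img.example.com/1.jpg"/>
  </a>
</div>
<div class="title">
  <a href="https://book.douban.com/subject/1/" target="_blank">
    Book One
  </a>
</div>
'''

matches = TITLE_RE.findall(SAMPLE)                  # cover link and title link both match
titles = [m for m in matches if "img" not in m]     # drop the cover link match

print(len(matches), "matches,", len(titles), "title")
print(repr(titles[0]))                              # the raw capture, whitespace included
```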

3. Scraping images

# Use a regular expression to get the image link of every book on the page
def getImg(html):
    imgList = re.findall(r'img.*?width=.*?src="(http.*?)"', html, re.S)
    newImgList = []
    for index, item in enumerate(imgList):
        # discard links to scripts, stylesheets, icons and other non-cover resources
        if (item.find("js") == -1 and item.find("css") == -1
                and item.find("dale") == -1 and item.find("icon") == -1
                and item.find("png") == -1):
            newImgList.append(item)

    return newImgList
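
The filter above is a denylist: it drops any URL that contains "js", "css", "dale", "icon" or "png" anywhere in the string, so a cover URL that happened to contain one of those substrings would also be dropped. One possible alternative is an allowlist on the file extension. The sketch below assumes that cover images are served as JPEG files, which is not confirmed by the post; `is_cover_url`, `filter_covers` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: book covers are JPEG files.
COVER_EXTENSIONS = (".jpg", ".jpeg")

def is_cover_url(url):
    """Return True if the path part of the URL ends with an allowed extension."""
    path = urlparse(url).path.lower()
    return path.endswith(COVER_EXTENSIONS)

def filter_covers(urls):
    """Keep only the URLs that look like cover images."""
    return [u for u in urls if is_cover_url(u)]

candidates = [
    "https://img.example.com/view/subject/l/public/s1.jpg",
    "https://img.example.com/f/shire/app.js",
    "https://img.example.com/pics/icon/logo.png",
]
print(filter_covers(candidates))
```

Checking only the path (via `urlparse`) means a query string such as `?v=2` does not interfere with the extension test.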

4. Pagination

# Pagination: 25 items per page
for page in range(0, 450, 25):
    url = "https://www.douban.com/doulist/1264675/?start={}".format(page)
    raw = getHtml(url)
    html = raw.decode("UTF-8") if raw else ''  # only decode when something was fetched
    if html == '':
        # append (not extend) so that 'none' is stored as a single placeholder
        namesUrl.append('none')
        imgsUrl.append('none')
        scoresUrl.append('none')
        commentsUrl.append('none')
        introductionsUrl.append('none')
    else:
        namesUrl.extend(getTitle(html))
        imgsUrl.extend(getImg(html))
        scoresUrl.extend(getScore(html))
        commentsUrl.extend(getComment(html))
        introductionsUrl.extend(getDetail(html))
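
The paging loop can also be split into two small pieces, one that generates the page URLs and one that runs the parsers. This is only a sketch of one possible structure: `page_urls` and `crawl` are hypothetical names, and `fetch` and `parsers` are passed in as parameters so that the logic can be tried without network access.

```python
BASE_URL = "https://www.douban.com/doulist/1264675/?start={}"

def page_urls(total=450, page_size=25):
    """Yield one list URL per page, with start offsets 0, 25, 50, ..."""
    for start in range(0, total, page_size):
        yield BASE_URL.format(start)

def crawl(fetch, parsers, urls):
    """Run every parser on every page.

    fetch(url) should return the decoded HTML, or '' on failure.
    parsers maps a field name to a function that takes HTML and returns a list.
    Returns the collected results and the list of URLs that could not be fetched.
    """
    results = {name: [] for name in parsers}
    failed = []
    for url in urls:
        html = fetch(url)
        if not html:
            failed.append(url)   # remember the page instead of storing a placeholder
            continue
        for name, parse in parsers.items():
            results[name].extend(parse(html))
    return results, failed
```

With the functions from this post it might be called as `crawl(lambda u: getHtml(u).decode("UTF-8"), {"names": getTitle, "imgs": getImg}, page_urls())`, assuming `getHtml` returns bytes. Keeping the failed URLs in a separate list makes it possible to retry those pages later.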

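The modules above are complete for now.

Problems encountered:

1. Looking at the scraped results, I found that every page yields one extra item. It is not data I need, but it does match the regular expression, so I drop it with some simple processing. One small flaw remains: the scraped titles have newline characters before and after them, and none of the approaches I tried removed them.

2. The page contains all kinds of data that match the patterns, so the items we do not need have to be filtered out. The filter conditions are rather brute-force; I have not thought of a better approach yet.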
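
On the newline problem in item 1: judging from the commented-out attempts in `getTitle`, a likely cause is that Python strings are immutable. `item.strip()`, `item.replace(...)` and `re.sub(...)` all return a new string and leave `item` unchanged, so calling them without using the return value has no visible effect. A minimal sketch of the fix, where `clean_titles` is a hypothetical helper and the sample strings are made up:

```python
def clean_titles(raw_titles):
    """Return the titles with leading and trailing whitespace removed."""
    return [title.strip() for title in raw_titles]

raw = ["\n      Book One\n    ", "\r\n Book Two \n"]

item = raw[0]
item.strip()                 # the stripped copy is thrown away here
unchanged = (item == raw[0]) # item still carries its newlines

cleaned = clean_titles(raw)  # the stripped copies are kept this time
print(unchanged, cleaned)
```

Inside `getTitle` the same effect could be had by appending `item.strip()` instead of `item`.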
