Scraping Douban Classical Literature (Database Storage)

阿新 • Published: 2018-06-21


The code is as follows:

# coding:utf-8
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3
import random
import requests
from lxml import etree
import time
import re
import codecs
import sqlite3


class Spider:
    def __init__(self):
        # The BookInfo table (five columns) must already exist in this database.
        self.con = sqlite3.connect(r'BookInformation.db')
        self.cur = self.con.cursor()
        self.home = 'https://book.douban.com/tag/%E5%8F%A4%E5%85%B8%E6%96%87%E5%AD%A6'
        self.Referer = 'https://book.douban.com/'
        self.user_agent_list = []
        self.books_list = []
        # user_agent.txt holds a pickled list of User-Agent strings.
        with open('user_agent.txt', 'rb') as f:
            self.user_agent_list = pickle.load(f)

    def GetHeaders(self):
        # Pick a random User-Agent for every request.
        UserAgent = random.choice(self.user_agent_list)
        headers = {'Referer': self.Referer, 'User-Agent': UserAgent}
        return headers

    def SaveBook(self, info):
        # Parameterized insert; commit after every book.
        sql = 'INSERT INTO BookInfo VALUES(?,?,?,?,?)'
        info_list = (info["Name"], info["Author"], info["Rating"],
                     info["ContentIntro"], info["AuthorIntro"])
        self.cur.execute(sql, info_list)
        self.con.commit()

    def Crawl(self):
        html = requests.get(self.home, headers=self.GetHeaders()).text
        html_tree = etree.HTML(html)
        booksList = html_tree.xpath('/html/body/div[3]/div[1]/div/div[1]/div/ul/li')
        # Only the first five books on the tag page are crawled.
        for book in booksList[:5]:
            time.sleep(1)  # pause between requests
            bookUrl = book.xpath('div[2]/h2/a')[0].get('href')
            pageHtml = requests.get(bookUrl, headers=self.GetHeaders()).text
            page_tree = etree.HTML(pageHtml)
            book_info = self.GetPage(page_tree)
            print(book_info['Name'])
            self.SaveBook(book_info)
            # self.books_list.append(book_info)
            # f = codecs.open('text.txt', 'a', encoding='utf-8')
            # f.write(book_info['AuthorIntro'])
            # f.close()
            # print(book_info['AuthorIntro'])

    def GetPage(self, page_tree):
        # Extract each field independently; a field that fails to parse
        # is stored as an empty string.
        getters = (
            ('Name', self.GetName),
            ('Author', self.GetAuthor),
            ('Rating', self.GetRating),
            ('ContentIntro', self.GetContentIntro),
            ('AuthorIntro', self.GetAuthorIntro),
        )
        book_info = {}
        for key, getter in getters:
            try:
                book_info[key] = getter(page_tree)
            except Exception:
                book_info[key] = ''
        return book_info

    def GetName(self, page_tree):
        return page_tree.xpath('/html/body/div[3]/h1/span')[0].text

    def GetAuthor(self, page_tree):
        author_list = page_tree.xpath('/html/body/div[3]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div[2]/span[1]/a')
        if author_list:
            # Several authors are joined with '/'.
            result = '/'.join(author.text.strip() for author in author_list)
        else:
            # Fall back to the alternative page layout.
            result = page_tree.xpath('/html/body/div[3]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div[2]/a')[0].text.strip()
        return re.sub(r'\s+', ' ', result)

    def GetRating(self, page_tree):
        return page_tree.xpath('/html/body/div[3]/div[2]/div/div[1]/div[1]/div[1]/div[2]/div/div[2]/strong')[0].text.strip()

    def GetContentIntro(self, page_tree):
        para_div = page_tree.xpath('//*[@id="link-report"]//div[@class="intro"]')
        result = ''
        if para_div:
            # Use the last matching "intro" div.
            for para in para_div[-1].xpath('p'):
                # para.text is None for an empty <p>; treat it as ''.
                result = result + '\t' + (para.text or '') + '\n'
        return result

    def GetAuthorIntro(self, page_tree):
        para_div = page_tree.xpath('/html/body/div[3]/div[2]/div/div[1]/div[3]/div[@class="indent "]//div[@class="intro"]')
        result = ''
        if para_div:
            for para in para_div[-1].xpath('p'):
                result = result + '\t' + (para.text or '') + '\n'
        return result

    # def GetCatalogue(self, page_tree):
    #     pass
    #
    # def GetTag(self, page_tree):
    #     pass
    #
    # def GetShortCommentary(self, page_tree):
    #     pass


if __name__ == '__main__':
    s = Spider()
    s.Crawl()
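
The spider relies on two things the post does not show: a five-column BookInfo table that already exists in BookInformation.db, and a user_agent.txt file containing a pickled list of User-Agent strings. A minimal sketch for preparing both might look like the following. The column names and types, the helper names create_book_table and write_user_agents, and the sample User-Agent string are all assumptions made for illustration; the original post does not specify them.

```python
import pickle
import sqlite3


def create_book_table(db_path):
    """Create the BookInfo table if it is missing.

    The five column names and the TEXT types are assumed; the original
    post only shows that the table takes five values per row.
    """
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS BookInfo ("
            "Name TEXT, Author TEXT, Rating TEXT, "
            "ContentIntro TEXT, AuthorIntro TEXT)"
        )
        con.commit()
    finally:
        con.close()


def write_user_agents(path, user_agents):
    """Pickle a list of User-Agent strings to the given file.

    Protocol 2 is used so that both Python 2 and Python 3 can read it.
    """
    with open(path, "wb") as f:
        pickle.dump(list(user_agents), f, protocol=2)


if __name__ == "__main__":
    create_book_table("BookInformation.db")
    # Hypothetical sample entry; replace with real browser User-Agent strings.
    write_user_agents("user_agent.txt", [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
    ])
```

Running this once before the spider should leave the database and the User-Agent file in the shape the Spider constructor and SaveBook expect. CREATE TABLE IF NOT EXISTS keeps the setup safe to rerun.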
