尚矽谷 tutorial notes: crawling dushu.com with Scrapy
阿新 • Published: 2022-04-08
#read.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from readbook.items import ReadbookItem


class ReadSpider(CrawlSpider):
    name = 'read'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1188_1.html']
    # Pagination links look like /book/1188_13.html; the pattern
    # r'/book/\d+/' misses the underscore, so match the page number
    # explicitly. follow=True keeps extracting links from every page.
    rules = (
        Rule(LinkExtractor(allow=r'/book/1188_\d+\.html'),
             callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        # Covers are lazy-loaded: the real image URL sits in
        # data-original, and the alt attribute carries the book title.
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            src = img.xpath('./@data-original').extract_first()
            name = img.xpath('./@alt').extract_first()
            book = ReadbookItem(name=name, src=src)
            yield book
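The spider is normally started from the project root with: scrapy crawl read. As a minimal sketch (assuming the script sits next to scrapy.cfg so the project settings can be located), the same crawl can also be launched programmatically:

#run.py - hypothetical helper, not part of the original notes
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('read')  # the spider name defined in ReadSpider.name
process.start()        # blocks until the crawl finishes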
#settings.py
# Database settings read by MysqlPipeline. Note that Scrapy only loads
# module attributes whose names are ALL UPPERCASE into the settings
# object, so lowercase names such as DB_host would silently come back
# as None from get_project_settings().
DB_HOST = 'mysql host address'
DB_PORT = 3306          # port
DB_USER = 'root'        # username
DB_PASSWORD = '1234'
DB_NAME = 'spider01'    # database name
# The hyphenated spelling 'utf-8' is not accepted as a MySQL charset
# name; it must be written as 'utf8'.
DB_CHARSET = 'utf8'
BOT_NAME = 'readbook'
SPIDER_MODULES = ['readbook.spiders']
NEWSPIDER_MODULE = 'readbook.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'readbook (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'readbook.middlewares.ReadbookSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'readbook.middlewares.ReadbookDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # Lower numbers run first: write book.json, then insert into MySQL.
    'readbook.pipelines.ReadbookPipeline': 300,
    'readbook.pipelines.MysqlPipeline': 301,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
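The MysqlPipeline below expects the spider01 database to already contain a book table. A minimal sketch for creating it (the column sizes are an assumption, not taken from the original notes):

#create_table.py - hypothetical one-off helper; column sizes are guesses
import pymysql

conn = pymysql.connect(host='localhost', port=3306,
                       user='root', password='1234', charset='utf8')
cursor = conn.cursor()
cursor.execute('create database if not exists spider01 charset utf8')
cursor.execute('use spider01')
cursor.execute(
    'create table if not exists book('
    'id int primary key auto_increment,'
    'name varchar(128),'
    'src varchar(128))'
)
conn.close()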
#pipelines.py
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ReadbookPipeline:
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # str(item) writes the item's Python repr, which is not strict
        # JSON; see the JSON Lines variant sketched after this listing.
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
# load the project settings
from scrapy.utils.project import get_project_settings
import pymysql


class MysqlPipeline:
    def open_spider(self, spider):
        # Pull the DB_* values defined in settings.py.
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.connect()
    def connect(self):
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            db=self.name,
            charset=self.charset
        )
        self.cursor = self.conn.cursor()  # cursor used to run SQL statements
    def process_item(self, item, spider):
        # Parameterized query: quotes in a book title can no longer
        # break (or inject into) the SQL statement.
        sql = 'insert into book(name, src) values (%s, %s)'
        self.cursor.execute(sql, (item['name'], item['src']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
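As noted in ReadbookPipeline, str(item) does not produce valid JSON. A sketch of a JSON Lines alternative (one JSON object per line); the JsonLinesPipeline name and book.jsonl filename are illustrative, not from the original notes:

#JSON Lines variant - a hedged alternative, not the original code
import json

from itemadapter import ItemAdapter


class JsonLinesPipeline:
    def open_spider(self, spider):
        self.fp = open('book.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # asdict() turns the scrapy.Item into a plain dict;
        # ensure_ascii=False keeps Chinese titles readable.
        self.fp.write(json.dumps(ItemAdapter(item).asdict(),
                                 ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()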
#items.py
import scrapy


class ReadbookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # book title, taken from the <img> alt attribute
    src = scrapy.Field()   # cover image URL, taken from data-original
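ReadbookItem behaves like a dict with a fixed key set; assigning a field that was not declared raises a KeyError. A quick illustrative check (the values are made up):

from readbook.items import ReadbookItem

book = ReadbookItem(name='Example Title', src='https://example.com/cover.jpg')
print(book['name'])     # dict-style access to a declared field
# book['author'] = 'x'  # KeyError: 'author' is not a declared field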