Scraping Jin Yong's Novels with a Crawler: Revisiting the Classics

阿新 • Published: 2018-12-19

import random
import time
from multiprocessing import Process, Queue
from queue import Empty

import requests
from bs4 import BeautifulSoup as bs


class NovelDownload(Process):
    """Worker process that downloads whole books taken from a shared queue."""

    def __init__(self, name, url_queue):
        super(NovelDownload, self).__init__()
        self.name = name
        self.url_queue = url_queue

    def run(self):
        print('%s started' % self.name)
        while True:
            # Calling empty() and then get() can race with the other workers,
            # so wait briefly for an item and stop once none arrives.
            try:
                book = self.url_queue.get(timeout=1)
            except Empty:
                break
            print('%s: download started' % book[0])
            # Open the output file once per book and append chapter by chapter.
            with open('d:\\data\\金庸\\%s.txt' % book[0], 'a', encoding='utf8') as f:
                for chapter in self.getChapterList(book):
                    f.write(chapter[0] + '\n')
                    for content in self.getContent(chapter):
                        f.write(content + '\n')
            print('%s: download finished' % book[0])
        print('%s finished' % self.name)

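    @staticmethod
    def getBookList(url_queue):
        """Collect [name, url] for every book on the index page and queue them."""
        book_list = []
        url = 'http://www.jinyongwang.com/book/'
        html = NovelDownload.getHtmlText(url)
        if html is None:  # request failed; nothing to queue
            return book_list
        soup = bs(html, 'lxml')
        booklist = soup.find('div', attrs={'class': 'booklist'})
        ul = booklist.find('ul', attrs={'class': 'list'})
        lis = ul.find_all('li')
        for li in lis:
            # Strip '/book/' from the index URL to get the site root, then
            # append the href (assumed to be site-relative).
            book_url = url.rsplit('/', 2)[0] + li.find('a').get('href')
            book_name = li.find('img').get('alt')
            book_list.append([book_name, book_url])
            url_queue.put([book_name, book_url])
        return book_list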

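    def getChapterList(self, book):
        """Yield [chapter_name, chapter_url] for each chapter of a book."""
        html = NovelDownload.getHtmlText(book[1])
        if html is None:  # request failed; skip this book
            return
        soup = bs(html, 'lxml')
        # attrs is a dict mapping attribute name to the expected value.
        ul = soup.find('ul', attrs={'class': 'mlist'})
        lis = ul.find_all('li')
        for li in lis:
            chapter_url = 'http://www.jinyongwang.com' + li.find('a').get('href')
            chapter_name = li.find('a').get_text()
            yield [
                chapter_name, chapter_url
            ]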

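    def getContent(self, chapter):
        """Yield the text of each paragraph in a chapter page."""
        html = NovelDownload.getHtmlText(chapter[1])
        if html is None:  # request failed; skip this chapter
            return
        soup = bs(html, 'lxml')
        div = soup.find('div', attrs={'class': 'vcon'})
        ps = div.find_all('p')
        for p in ps:
            content = p.get_text()
            yield content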

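    @staticmethod
    def getHtmlText(url):
        """Return the page text, or None if the request fails."""
        # Random delay of up to one second between requests.
        time.sleep(random.random())
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
        try:
            # The 10-second timeout is an arbitrary choice; without one a
            # stalled connection would hang the worker indefinitely.
            r = requests.get(url, headers=headers, timeout=10)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return None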


def create_pool(url_queue):
    # One worker process per name; all workers share the same queue.
    pool_list = []
    pool_name = ['Process 1', 'Process 2', 'Process 3', 'Process 4']
    for name in pool_name:
        p = NovelDownload(name, url_queue)
        pool_list.append(p)
    return pool_list


def create_queue():
    url_queue = Queue()
    return url_queue


def main():
    url_queue = create_queue()
    NovelDownload.getBookList(url_queue)
    pool_list = create_pool(url_queue)
    for p in pool_list:
        p.start()
    for p in pool_list:
        p.join()


if __name__ == '__main__':
    temp = time.time()
    main()
    print(time.time() - temp)
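
The script above builds absolute URLs by string concatenation (`url.rsplit('/', 2)[0] + href` for books, a hard-coded site root for chapters). As a possible alternative, here is a minimal sketch using the standard library's `urllib.parse.urljoin`. The helper name `absolute_url` and the sample hrefs are hypothetical and are not taken from the site; the sketch only assumes that the pages link with either site-relative or page-relative hrefs.

```python
from urllib.parse import urljoin

# Index page used by the script above.
BASE_URL = 'http://www.jinyongwang.com/book/'


def absolute_url(href, base=BASE_URL):
    """Resolve an href against the page it was found on (hypothetical helper)."""
    return urljoin(base, href)


if __name__ == '__main__':
    # Hypothetical hrefs, for illustration only.
    print(absolute_url('/fei/'))       # site-relative: resolved against the site root
    print(absolute_url('484.html'))    # page-relative: resolved against /book/
```

One advantage of `urljoin` is that it handles site-relative, page-relative, and already-absolute hrefs uniformly, so the code would not need to change if the site switched between link styles.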
