Python: Improving the Previous Little Crawler

阿新 • Published: 2018-11-29

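The previous post built a crawler that scrapes classical poems from a poetry website (so.gushiwen.org) and writes them to a file. Today I made a few improvements.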

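1. The earlier version scraped the collapsed poems shown on the search results page. A short poem may appear there in full, but a long one is cut off. This version instead fetches the content behind each poem's link, because following the link shows the complete text.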

Takeaway: analyze the page structure yourself before you start scraping.

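2. Better fault tolerance. A poem's link may be dead; when that happens the crawler moves on to the next entry instead of raising an error and stopping.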
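
The skip-on-failure idea can be sketched on its own. This is a minimal illustration, not the crawler's actual code: the names `fetch` and `crawl` and the `opener`/`fetcher` parameters are hypothetical, added so the loop can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return the decoded page body, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError, HTTPError and socket timeouts are all OSError subclasses
        print("skipping", url, "->", exc)
        return None


def crawl(urls, handle, fetcher=fetch):
    """Call handle(html) for every URL that loads and skip the ones that fail."""
    done = 0
    for url in urls:
        html = fetcher(url)
        if html is None:
            continue  # dead link: move on instead of aborting the whole run
        handle(html)
        done += 1
    return done


# Offline demonstration with a stub fetcher; "b" plays the dead link.
pages = []
count = crawl(["a", "b", "c"], pages.append,
              fetcher=lambda url: None if url == "b" else "<p>" + url + "</p>")
print(count, pages)
```

Returning `None` from the fetch step and checking for it in the loop keeps the failure handling in one place, so a single bad link costs one skipped entry rather than the whole run.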

The code follows. (P.S. The key is analyzing the page: every page has its own structure, so find what the entries have in common.)

#-*-  coding:gbk  -*-
import urllib.request
import time,os
import numpy as np
from bs4 import BeautifulSoup

hds=[{'User-Agent': 'Mozilla/5.0 (Windows; U;Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}, \
     {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'}, \
     {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'}]


def search(value,page):

    # Build the search URL; urllib.request.quote() percent-encodes the page number and the Chinese query text
    url='http://so.gushiwen.org/search.aspx?type=author&page='+urllib.request.quote(str(page))+'&value='+urllib.request.quote(value)
    time.sleep(np.random.rand() * 2)  # wait a random 0-2 seconds before the request
    try:
        req=urllib.request.Request(url,headers=hds[page%len(hds)])  # rotate through the User-Agent list
        html_resource=urllib.request.urlopen(req)
        # print(html_resource.info())  # print the server's response headers
        plain_text=html_resource.read().decode()
        return plain_text

    except (urllib.request.HTTPError,urllib.request.URLError) as e:
        print(e)
        return None  # the request failed; callers must handle None

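def rmline(text):
    # Strips whitespace from a string. Iterating over a string yields single
    # characters, so every space and newline is dropped, not only blank lines
    data=''
    for ch in text:
        c=ch.strip()
        if len(c)!=0:
            data+=c
    return data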

def getpoemMessage(poem_id,page):
    # Fetch the full text of one poem from the link behind the collapsed entry
    url = 'http://so.gushiwen.org/shiwen/ajaxshiwen.aspx?id=' + poem_id + '&from=search'
    print(url)
    time.sleep(np.random.rand() * 1)  # wait a random 0-1 seconds before the request
    try:
        req=urllib.request.Request(url,headers=hds[page%len(hds)])
        html_resource=urllib.request.urlopen(req)
        # print(html_resource.info())  # print the server's response headers
        plain_text=html_resource.read().decode()
        return plain_text
    except (urllib.request.HTTPError,urllib.request.URLError) as e:
        print(e)
        return None  # the link is dead; the caller skips this poem

def parse_poem(html):
    if html is None:
        return  # the request failed, so skip this poem instead of crashing

    soup=BeautifulSoup(html,'lxml')
    if (soup.find('title')):
        pass  # the response contains a <title>, so it is not the expected poem fragment; skip it
    else:
        poem_title = soup.p
        poem_title_str = poem_title.a.string
        poem_author = poem_title.next_sibling.next_sibling
        poem_author_str = poem_author.string.split(u'\xa0')[0]
        # Note: GBK cannot encode u'\xa0', which is a non-breaking space
        poem_text = list(soup.div.next_sibling.next_sibling.strings)
        poem_text_str = ''
        for text in poem_text[6:-2]:
            poem_text_str += text + '\n'
        with open('poem2.txt', mode='a', encoding='utf-8') as f:
            f.write('Title: ' + rmline(poem_title_str) + '\n')
            f.write(poem_author_str + '\n')
            f.write('Content: ' + poem_text_str + '\n')
            f.write('-----------------------------------------' + '\n')




def parseHtml(html,page):
    if html is None:
        return  # the search request failed, so there is nothing to parse

    # BeautifulSoup quick reference:
    # soup=BeautifulSoup(html,'lxml')
    # title=soup.title  # the page's <title> tag
    # title_name=title.name  # the tag's name, i.e. 'title'
    # title_string=title.string  # the text inside the title
    # title_parent=soup.title.parent  # the parent of the title tag
    #
    # soup.p  # the first <p> tag
    # soup.p['class']  # the class attribute of the first <p> tag
    #
    # soup.find_all('a')  # every <a> tag
    # soup.find(id='***')  # the tag whose id is ***
    soup=BeautifulSoup(html,'lxml')
    with open('poem2.txt',mode='a',encoding='utf-8') as f:
        f.write('Page '+str(page)+'\n')

    poems=soup.find_all("div", "sons")
    if page==1:
        poems=poems[1:]  # on page 1 the first "sons" div is skipped
    for poem in poems:
        print(poem['id'])
        poem_id=poem['id'][11:]  # drop the first 11 characters of the div id to get the poem id
        parse_poem(getpoemMessage(poem_id,page))




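if __name__=="__main__":
    filename = 'poem2.txt'
    if os.path.exists(filename):
        os.remove(filename)
    for page in range(1,3):
        # range(1,3) downloads pages 1 and 2; use range(1,4) for the first three pages.
        # The author name below can be changed freely, and a poem title works too
        parseHtml(search('納蘭性德',page), page)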
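
The id handling in `parseHtml` can be tried offline. The sketch below assumes invented sample markup: the `poem_block_` id prefix and the sample divs are hypothetical, chosen only so that the prefix is 11 characters long like the slice in the code above. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Invented sample markup; the real listing page may look different.
SAMPLE_LISTING = """
<div class="sons" id="poem_block_1001"><p><a href="#">Poem A</a></p></div>
<div class="other" id="sidebar"></div>
<div class="sons" id="poem_block_1002"><p><a href="#">Poem B</a></p></div>
"""

ID_PREFIX_LEN = 11  # the crawler drops the first 11 characters of each div id


class SonsIdCollector(HTMLParser):
    """Collect the id attribute of every <div class="sons">."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "div" and "sons" in classes and attr_map.get("id"):
            self.ids.append(attr_map["id"])


def detail_urls(listing_html):
    """Turn the listing's div ids into the URLs that return each full poem."""
    collector = SonsIdCollector()
    collector.feed(listing_html)
    base = "http://so.gushiwen.org/shiwen/ajaxshiwen.aspx?id={}&from=search"
    return [base.format(div_id[ID_PREFIX_LEN:]) for div_id in collector.ids]


for url in detail_urls(SAMPLE_LISTING):
    print(url)
```

Working against a saved sample like this makes it easier to find what the entries have in common before sending any requests to the site.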

Encoding is also a big issue; it has to be handled case by case.
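
Two of the encoding cases that come up in this crawler can be sketched as small helpers. The names `decode_page` and `gbk_safe` are hypothetical, and trying UTF-8 before GBK is an assumption about what the site may return, not something the crawler above does.

```python
def decode_page(raw, encodings=("utf-8", "gbk")):
    """Try each encoding in turn; fall back to UTF-8 with replacement characters."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def gbk_safe(text):
    """Replace U+00A0 (non-breaking space), which GBK cannot encode, with a plain space."""
    return text.replace("\xa0", " ")


# The same text decodes correctly whichever of the two encodings produced the bytes.
print(decode_page("中文".encode("utf-8")))
print(decode_page("中文".encode("gbk")))

# After the replacement, the string can be written with a GBK codec.
print(gbk_safe("author\xa0name").encode("gbk"))
```

The fallback order matters: some GBK byte pairs are also valid UTF-8, so this guess can be wrong for short inputs, and an encoding declared in the response headers should take precedence when one is available.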
