Summary of Common Python Web Scraping Problems

阿新 • Published: 2019-01-06

Problem 1

Background: https://blog.csdn.net/xxzj_zz2017/article/details/79739077

No matter what I tried, I could not get this script to run successfully.

# -*- coding: utf-8 -*-
"""
Created on Thu Nov  8 08:46:45 2018
@author: zwz
"""
# Reference: https://blog.csdn.net/xxzj_zz2017/article/details/79739077
from splinter.browser import Browser
from bs4 import BeautifulSoup
import pandas as pd
import time
from PIL import Image
import snownlp
import jieba
import jieba.analyse
import numpy as np
import matplotlib.pyplot as plt  # needed for the plt.* calls at the end of the script
import re
import requests
from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator
b = Browser()
url0 = "https://book.douban.com/subject/bookid/comments/hot?p="  # replace bookid with the ID of the book whose comments you want to scrape

list_1 = []
list_2 = []
for i in range(1,3):
    url = "https://book.douban.com/subject/25862578/comments/hot?p="+str(i)
    b.visit(url)  # load the page first; otherwise page_source is not the comments page
    soup = BeautifulSoup(b.driver.page_source, "html.parser")
    comments = soup.find_all("p","comment-content")
    for item in comments:
        comment = item.get_text(strip=True)  # comment text; item.string can be None when the tag has nested children

        list_1.append(comment)
        print(list_1)
    pattern = re.compile('span class="user-stars allstar(.*?) rating"')
    p = re.findall(pattern,b.driver.page_source)
    list_2 += list(map(int,p))
    time.sleep(4)
if list_2:  # avoid ZeroDivisionError when no ratings were found
    print("Average score:",sum(list_2)//len(list_2))
pd1 = pd.DataFrame(list_1)
pd1.to_csv("comments.csv",index = False)  
pd2 = pd.DataFrame(list_2)
pd2.to_csv("result.csv",index=False)
print("yes")
comments = ""
for item in list_1:
    if not item:  # skip empty or missing comments
        continue
    item = item.strip(" ")
    nlp = snownlp.SnowNLP(item)
    # keyword extraction; the trailing space keeps keywords of adjacent comments apart
    comments += " ".join(jieba.analyse.extract_tags(item,6)) + " "
back_coloring = np.array(Image.open("jyzhd.jpg"))
word_cloud = WordCloud(font_path='simkai.ttf',background_color='white',max_words=2000,mask=back_coloring,margin=10)
word_cloud.generate(comments)
# generate color values from the background image
image_colors = ImageColorGenerator(back_coloring)
plt.figure(figsize=(8,5),dpi=160)
plt.imshow(word_cloud.recolor(color_func=image_colors))
plt.axis("off")
plt.show()
word_cloud.to_file("comments.jpg")

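Error when running the script: (screenshot not preserved)
I spent an hour on it without success, so I am setting the problem aside for now and hoping that someone else can solve it.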
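
The average in the listing above divides by len(list_2), so it only makes sense when at least one rating was scraped. If the browser never loads the comments page, or the site returns a login or block page, the regex matches nothing and the division raises ZeroDivisionError. Whether this is the error from the lost screenshot is unknown; the following is only a minimal sketch of guarding that step. The helper names extract_ratings and average_rating and the sample HTML are made up for illustration.

```python
import re

# Same pattern as in the listing: captures "50", "40", ... from the star-rating class.
RATING_PATTERN = re.compile(r'span class="user-stars allstar(.*?) rating"')


def extract_ratings(page_source):
    """Return the star ratings found in a page's HTML as a list of ints."""
    return [int(value) for value in RATING_PATTERN.findall(page_source)]


def average_rating(ratings):
    """Integer average of the ratings, or None when nothing was scraped."""
    if not ratings:
        return None
    return sum(ratings) // len(ratings)


# Hypothetical sample HTML, invented for this example.
sample = (
    '<span class="user-stars allstar50 rating" title="a"></span>'
    '<span class="user-stars allstar30 rating" title="b"></span>'
)
print(average_rating(extract_ratings(sample)))           # 40
print(average_rating(extract_ratings("<html></html>")))  # None
```

Returning None instead of raising makes an empty scrape visible to the caller, which can then check the page source before going on to the word cloud step.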

Problem 2

Background: https://mp.weixin.qq.com/s/E4EEgmQverifK5mc6W8onw
The code is here: Baidu Cloud link: https://pan.baidu.com/s/17zlP3AMNCQdvEQpU7Rx_tw (extraction code: 9z9q)

The program that fails:

# -*- coding: utf-8 -*-
"""
Created on Mon Sep 10 19:36:24 2018

@author: hzp0625
"""
from selenium import webdriver
import pandas as pd
from datetime import datetime
import numpy as np
import time
import os

os.chdir(r'D:\data_work')  # raw string so the backslashes are not treated as escape sequences
def gethtml(url):

    browser = webdriver.PhantomJS(executable_path=r"F:\Study_software\Anaconda\setup\Lib\site-packages\selenium\webdriver\phantomjs")
    browser.get(url)
    browser.implicitly_wait(10)
    return(browser)

def getComment(url):
   
    browser =  gethtml(url)
    i = 1
    AllArticle = pd.DataFrame(columns = ['id','author','comment','stars1','stars2','stars3','stars4','stars5','unlike','like'])
    print('Connected, starting to scrape data')
    while True:

        xpath1 = '//*[@id="app"]/div[2]/div[2]/div/div[1]/div/div/div[4]/div/div/ul/li[{}]'.format(i)
        try:
            target = browser.find_element_by_xpath(xpath1)
        except Exception:  # no further <li> element (or any other lookup error): stop
            print('All comments scraped')
            break
            
        author = target.find_element_by_xpath('div[1]/div[2]').text
        comment = target.find_element_by_xpath('div[2]/div').text
        stars1 = target.find_element_by_xpath('div[1]/div[3]/span/i[1]').get_attribute('class')
        stars2 = target.find_element_by_xpath('div[1]/div[3]/span/i[2]').get_attribute('class')
        stars3 = target.find_element_by_xpath('div[1]/div[3]/span/i[3]').get_attribute('class')
        stars4 = target.find_element_by_xpath('div[1]/div[3]/span/i[4]').get_attribute('class')
        stars5 = target.find_element_by_xpath('div[1]/div[3]/span/i[5]').get_attribute('class')
        date = target.find_element_by_xpath('div[1]/div[4]').text
        like = target.find_element_by_xpath('div[3]/div[1]').text
        unlike = target.find_element_by_xpath('div[3]/div[2]').text
        
        
        # values are listed in the same order as the column names below (unlike before like)
        comments = pd.DataFrame([i,author,comment,stars1,stars2,stars3,stars4,stars5,unlike,like]).T
        comments.columns = ['id','author','comment','stars1','stars2','stars3','stars4','stars5','unlike','like']
        AllArticle = pd.concat([AllArticle,comments],axis = 0)
        browser.execute_script("arguments[0].scrollIntoView();", target)
        i = i + 1
        if i%100 == 0:
            print('Scraped {} comments'.format(i))
    AllArticle = AllArticle.reset_index(drop = True)
    return AllArticle
url = 'https://www.bilibili.com/bangumi/media/md102392/?from=search&seid=8935536260089373525#short'
result = getComment(url)
#result.to_csv('工作細胞爬蟲.csv',index = False)

Screenshot of the problem: (image not preserved)

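Solution:

1. First, I installed it myself: pip install phantomjs (I did this on top of Anaconda, on 64-bit Windows).

2. I then found that the installation had not completed fully; in particular, phantomjs.exe was missing.

Finally, after searching online, I found this page: https://stackoverflow.com/questions/37903536/phantomjs-with-selenium-error-message-phantomjs-executable-needs-to-be-in-pa

It offers a good solution. I followed the link given there and downloaded the matching phantomjs.exe.

In the end I changed the line above to:

browser = webdriver.PhantomJS(executable_path=r"F:\Study_software\Anaconda\setup\Lib\site-packages\selenium\webdriver\phantomjs\phantomjs.exe")

That is, the path now points to an .exe file rather than to a folder, and with that change it worked.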
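
The fix above comes down to passing the path of the executable file itself instead of the folder that contains it. A minimal sketch of checking this before starting the browser follows; resolve_executable is a hypothetical helper written for this example (it is not part of Selenium), and the temporary folder only stands in for the real install directory.

```python
import os
import tempfile


def resolve_executable(path, exe_name="phantomjs.exe"):
    """Return a path that points at an executable file.

    If `path` is a directory, look for `exe_name` inside it.
    Raise FileNotFoundError when no such file exists.
    """
    candidate = os.path.join(path, exe_name) if os.path.isdir(path) else path
    if not os.path.isfile(candidate):
        raise FileNotFoundError("no executable file at: " + candidate)
    return candidate


# Demonstration with a temporary directory standing in for the install folder.
with tempfile.TemporaryDirectory() as folder:
    exe = os.path.join(folder, "phantomjs.exe")
    open(exe, "w").close()  # empty placeholder file, not a real binary
    resolved_from_dir = resolve_executable(folder)   # folder given: file inside is found
    resolved_from_file = resolve_executable(exe)     # file given: returned unchanged
    print(resolved_from_dir == exe, resolved_from_file == exe)
```

The returned value could then be passed as executable_path, so a wrong path fails early with a clear message instead of inside the driver.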

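Problem 3

I found that the quality of the generated word cloud depends on the resolution of the image used for it.

If you are interested, see my article: https://blog.csdn.net/weixin_38809485/article/details/83892939

Here is a site for downloading high-resolution images: https://unsplash.com/

The difference: (comparison image not preserved)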
