Python Crawler: Scraping JS-Rendered Dynamic Content with Selenium + BeautifulSoup (Xueqiu News)

阿新 • Published: 2019-01-04

Crawl target: the article content inside the red box in the figure below. (Each article's link has to be followed to get its content.)

Note: this article covers only the news-scraping part of the crawler, which is written in Python.

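At first glance, the approach looks simple:

(1) Fetch the home page and collect the URL of each article.

(2) Use the article URLs obtained in step one to fetch each article's content.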

However, a plain urllib2.urlopen() call does not return the data in the red box, because that part of the page is loaded dynamically by JavaScript.
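
The difference can be illustrated without any network access. The sketch below is written for Python 3 and uses only the standard library as a stand-in for BeautifulSoup. Both HTML snippets are hypothetical: the `app` container and the script path are made up, and only the `home__timeline__item` class name comes from the crawler script in this article. The same link extractor finds nothing in the static page that a plain HTTP client receives, and finds the article links once a browser has rendered the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical HTML as a plain HTTP client receives it: the container is
# empty and would only be filled in later by JavaScript.
STATIC_HTML = """
<html><body>
  <div id="app"></div>
  <script src="/static/app.js"></script>
</body></html>
"""

# Hypothetical HTML after a browser has executed the page's JavaScript.
RENDERED_HTML = """
<html><body>
  <div id="app">
    <div class="home__timeline__item"><h3><a href="/1234/5678">First</a></h3></div>
    <div class="home__timeline__item"><h3><a href="/1234/9012">Second</a></h3></div>
  </div>
</body></html>
"""


class ArticleLinkParser(HTMLParser):
    """Collects absolute URLs of <a> tags nested inside timeline item divs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0      # <div> nesting depth inside the current item
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth or "home__timeline__item" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            # urljoin turns the relative href into an absolute URL
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_links(html, base_url="https://xueqiu.com"):
    parser = ArticleLinkParser(base_url)
    parser.feed(html)
    return parser.links


print(extract_links(STATIC_HTML))    # no article links in the static page
print(extract_links(RENDERED_HTML))  # links appear only in the rendered DOM
```

So the parsing step is not the problem. The HTML handed to the parser has to come from something that actually executes the page's JavaScript.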

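In the end I found that the Selenium framework can be used to scrape the dynamic data. Selenium was originally a web testing tool; in a Python crawler it can drive a real browser to visit a URL. The browsers Selenium supports include Firefox, Chrome, Opera, Edge and IE. I use Firefox here.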
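
As a rough illustration of driving a browser this way, here is a minimal Python 3 sketch. It is not the script used in this article: the function names `click_load_more` and `fetch_rendered_html` are hypothetical, and it swaps the fixed `time.sleep(1)` pauses of the script below for Selenium's explicit waits (`WebDriverWait` with `expected_conditions`), which poll until the link is clickable instead of assuming one second is enough. It assumes Selenium and a Firefox driver are installed.

```python
def click_load_more(browser, link_text, clicks, timeout=10):
    """Click the link with the given text `clicks` times, waiting for it each time."""
    # Imported here so that defining the function does not require Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(browser, timeout)
    for _ in range(clicks):
        # Poll until the link is present and clickable, or raise TimeoutException
        button = wait.until(
            EC.element_to_be_clickable((By.LINK_TEXT, link_text))
        )
        button.click()


def fetch_rendered_html(url, link_text, clicks):
    """Open `url` in Firefox, click the link `clicks` times, return the rendered HTML."""
    from selenium import webdriver

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        click_load_more(browser, link_text, clicks)
        # page_source reflects the DOM after JavaScript has run
        return browser.page_source
    finally:
        # Always end the browser session, even if a wait times out
        browser.quit()
```

The returned string can be handed to BeautifulSoup exactly like a statically fetched page. An explicit wait only confirms that the link is clickable, so a short pause may still be needed if new items arrive noticeably after the click.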

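The Python crawler script follows; the comments explain the code. It targets Python 2, as the urllib2 and Queue imports and the print statements show.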

# coding=utf-8

import time
import Queue
import pymongo
import urllib2
import threading
# BeautifulSoup 3 (the Python 2 package named "BeautifulSoup")
from BeautifulSoup import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By


# Connect to the local MongoDB instance
client = pymongo.MongoClient()
# The database is named shsz_news
db = client.shsz_news
# The collection is named news
collection = db.news


# Fields stored for each article:
# title, url, author, timestamp (publish time), read (read count), content
class Article:
    title = ""
    url = ""
    author = ""
    timestamp = ""
    read = 0
    content = ""

    def __init__(self, title, url, author, timestamp, read, content):
        self.title = title
        self.url = url
        self.author = author
        self.timestamp = timestamp
        self.read = read
        self.content = content


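# Argument: how many times to click "Load more"
# Returns a queue of Article objects with their URLs filled in;
# total number of articles: 50 + 15 * num
def get_article_url(num):
    browser = webdriver.Firefox()
    browser.maximize_window()
    browser.get('http://xueqiu.com/#/cn')
    time.sleep(1)

    # Scroll to the bottom of the page 4 times; the "Load more" button
    # then appears and 50 articles are loaded
    for i in range(4):
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(1)

    # Click "Load more" num times; each click loads 15 more articles
    for i in range(num):
        # Find the "Load more" link by its on-page text and click it
        browser.find_element(By.LINK_TEXT, "載入更多").click()
        time.sleep(1)

    soup = BeautifulSoup(browser.page_source)
    # Parse the HTML and collect the article list
    article_queue = parse_html(soup)
    # End the browser session
    browser.quit()
    return article_queue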


# Parse the HTML and return a queue of Article objects
def parse_html(soup):
    article_queue = Queue.Queue()
    # findAll returns an empty list when nothing matches
    article_divs = soup.findAll('div', {'class': 'home__timeline__item'})
    for article_div in article_divs:
        # Article URL (the href is relative to the site root)
        url = article_div.h3.a['href']
        article_url = 'https://xueqiu.com' + url
        # Article title
        article_title = article_div.h3.a.string
        # Article author
        article_author = article_div.find('a', {'class': 'user-name'}).string
        # Article publish time
        article_timestamp = article_div.find('span', {'class': 'timestamp'}).string
        # Article read count
        article_read = article_div.find('div', {'class': 'read'}).string
        # Build the Article object and add it to article_queue
        article = Article(url=article_url, title=article_title, author=article_author,
                          timestamp=article_timestamp, read=article_read, content='')
        article_queue.put(article)
    return article_queue


# Worker thread that fetches article content
class GetContentThread(threading.Thread):
    def __init__(self, article_queue):
        threading.Thread.__init__(self)
        self.url_queue = article_queue

    def run(self):
        count = 0
        while True:
            # Blocks until an article is available
            article = self.url_queue.get()
            try:
                count += 1
                # Print this thread's progress every 100 articles
                if count % 100 == 0:
                    print count
                # Fetch the article page
                article_url = article.url
                request = urllib2.Request(article_url)
                request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6')
                response = urllib2.urlopen(request, timeout=10)
                chunk = response.read()
                soup = BeautifulSoup(chunk)
                # Extract the article body
                content = soup.find('div', {'class': 'detail'})
                # str() is required; otherwise the value cannot be saved to MongoDB
                article.content = str(content)
                try:
                    # Write the article to MongoDB
                    collection.save(article.__dict__)
                except Exception:
                    # Saving failed: put the article back in the queue
                    self.url_queue.put(article)
                    print "Save into MongoDB error! Let's make a comeback "
            except Exception as e:
                print 'get content wrong! ', e, '\n'
                # Append the failure details to a log file
                file1 = open('get_content_wrong.txt', 'a')
                file1.write(str(article.title) + '\n')
                file1.write(str(article.url) + '\n')
                file1.write(str(e) + '\n')
                file1.close()
                if '404' in str(e):
                    print 'URL 404 Not Found:', article.url
                # If the error message contains 'HTTP', 'URL' or 'url',
                # put the article back in the queue to retry later
                elif any(key in str(e) for key in ('HTTP', 'URL', 'url')):
                    self.url_queue.put(article)
                    print "Let's make a comeback "
            finally:
                # Call task_done() exactly once per get() so that
                # article_queue.join() can return; a requeued article
                # has already been counted again by put() above
                self.url_queue.task_done()


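def main():
    # Collect all articles and put them in the queue
    article_queue = get_article_url(150)

    # Start 10 threads that fetch each article's content and write it to MongoDB
    for i in range(10):
        gct = GetContentThread(article_queue)
        gct.daemon = True
        gct.start()

    # Wait until every task in the queue has been processed
    article_queue.join()


if __name__ == '__main__':
    main()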
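
The worker threads above depend on the queue's bookkeeping: `join()` returns only after every `get()` has been matched by one `task_done()`, and each `put()` of a retried item adds one more pending task. The following is a minimal Python 3 sketch of that pattern using only the standard library. The names `run_workers`, `handler` and `max_attempts` are hypothetical, and the retry limit is an assumption added for illustration; the script in this article retries without a limit.

```python
import queue
import threading


def run_workers(jobs, handler, num_workers=4, max_attempts=3):
    """Run handler(job) for every job; retry failures up to max_attempts times."""
    work = queue.Queue()
    for job in jobs:
        work.put((job, 1))              # (job, attempt number)

    done, failed = [], []
    lock = threading.Lock()

    def worker():
        while True:
            job, attempt = work.get()   # blocks until an item is available
            try:
                result = handler(job)
                with lock:
                    done.append((job, result))
            except Exception:
                if attempt < max_attempts:
                    # Requeue before task_done(): put() raises the pending
                    # count first, so join() cannot return too early.
                    work.put((job, attempt + 1))
                else:
                    with lock:
                        failed.append(job)
            finally:
                # Exactly one task_done() per get()
                work.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    work.join()     # returns once all items, retries included, are processed
    return done, failed
```

A handler here would play the role of "download one article and save it". Calling `task_done()` in a `finally` clause keeps the count correct on every path, including the ones that requeue the job.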
