Scraping Ajax Dynamically Loaded Web Pages

阿新 • Published: 2019-09-04

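Common anti-scraping mechanisms and how to handle them

1. Header checks: Cookie, Referer, User-Agent

Solution: copy the request headers from the browser's developer tools (F12) and pass them to requests.get() through the headers argument.

2. IP limits: the site tracks how often each IP address visits and blocks an IP that sends too many requests in a short time.

Solution:

(1) Build your own proxy IP pool, pick a proxy at random for each request, and refresh the pool often.

(2) Buy open or private proxy IPs.

(3) Slow down the crawl.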

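3. User-Agent limits: similar to IP limits

Solution: build your own User-Agent pool and pick one at random for each request.

5. Validation of query parameters or form data (salt, sign)

Solution: find the JS file, work out how the JS computes the values, and reproduce the same steps in Python.

6. Post-processing of the response content

Solution: print and inspect the response content, then extract what you need with XPath or regular expressions.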
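
The User-Agent pool, proxy pool and slower crawl rate described above can be combined in one small helper. The following is only a sketch: the pool contents, the helper names build_request_kwargs and polite_sleep, and the timeout value are assumptions made for illustration, not part of the original tutorial.

```python
import random
import time

# Hypothetical pools: replace with real User-Agent strings and proxies you control.
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
]
PROXY_POOL = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']


def build_request_kwargs(ua_pool=UA_POOL, proxy_pool=PROXY_POOL):
    """Return keyword arguments for requests.get() with a random UA and proxy."""
    proxy = random.choice(proxy_pool)
    return {
        'headers': {'User-Agent': random.choice(ua_pool)},
        # requests expects a mapping from URL scheme to proxy URL
        'proxies': {'http': proxy, 'https': proxy},
        'timeout': 10,
    }


def polite_sleep(low=1.0, high=3.0):
    """Sleep for a random interval to lower the request rate; return the delay."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


kwargs = build_request_kwargs()
print(kwargs['headers']['User-Agent'])
```

With the requests package installed and working proxies in the pool, a call would look like requests.get(url, **build_request_kwargs()) followed by polite_sleep().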

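Using regex to format headers and form data for Python

1. In PyCharm, open the Replace dialog with Ctrl + R and tick Regex.

2. Convert the copied headers or form data:

Search for: (.*): (.*)

Replace with: "$1": "$2",

3. Click Replace All.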
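
The same conversion can also be done in Python itself rather than in the editor. This is a minimal sketch, assuming the text was copied from the developer tools as one "Key: Value" pair per line; the sample values and the helper name to_dict are made up for illustration. It uses a non-greedy variant of the pattern above so that each line is split at its first ": ".

```python
import re

# Sample text as it might be copied from the developer tools (hypothetical values)
raw = """Accept: application/json
Referer: https://movie.douban.com/typerank
User-Agent: Mozilla/5.0"""


def to_dict(text):
    """Turn 'Key: Value' lines into a dict, splitting each line at the first ': '."""
    # '.' does not match newlines, so every line yields one (key, value) pair
    return dict(re.findall(r'(.*?): (.*)', text))


headers = to_dict(raw)
print(headers)
```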

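Scraping data from the Ministry of Civil Affairs website

Goal: scrape the latest administrative division codes of the People's Republic of China at county level and above

URL: http://www.mca.gov.cn/article/sj/xzqh/2019/ - Civil Affairs Data - Administrative Division Codes

Implementation steps

1. Extract the link to the latest administrative division codes from the Civil Affairs Data page

    The newest entry is listed first. Titles follow the pattern: 2019年X月中华人民共和国县以上行政区划代码 ("Administrative division codes of the PRC at county level and above, month X of 2019")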

import requests
from lxml import etree
import re

url = 'http://www.mca.gov.cn/article/sj/xzqh/2019/'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
html = requests.get(url, headers=headers).text
parse_html = etree.HTML(html)
article_list = parse_html.xpath('//a[@class="artitlelist"]')

for article in article_list:
    title = article.xpath('./@title')[0]
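    # Keep only links whose title ends with "代码" ("codes"); the site's titles are in Simplified Chinese
    if title.endswith('代码'):
        # Stop at the first match: the first one is always the latest link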
        two_link = 'http://www.mca.gov.cn' + article.xpath('./@href')[0]
        print(two_link)
        break

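2. Extract the real link from the second-level page (anti-scraping: the response embeds JS that redirects to a new page)

Request the second-level link and inspect the JS embedded in the response.
Use a regular expression to extract the real second-level link.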

# Fetch the second-level "fake" link
two_html = requests.get(two_link, headers=headers).text
# Extract the real link from the response (the address the embedded JS redirects to)
new_two_link = re.findall(r'window\.location\.href="(.*?)"', two_html, re.S)[0]
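
The extraction can be tried offline before touching the site. The sketch below runs the same pattern against a made-up response body; the sample HTML, the example.html address and the helper name extract_redirect are assumptions for illustration, and the real page may be laid out differently. Returning None when nothing matches avoids the IndexError that indexing [0] on an empty result would raise.

```python
import re

# Hypothetical response body of the "fake" page; the real page may differ.
sample_html = '''
<html><head><script>
    window.location.href="http://www.mca.gov.cn/article/sj/xzqh/2019/example.html";
</script></head></html>
'''

REDIRECT_RE = re.compile(r'window\.location\.href="(.*?)"', re.S)


def extract_redirect(html):
    """Return the JS redirect target, or None if the page has no redirect."""
    match = REDIRECT_RE.search(html)
    return match.group(1) if match else None


print(extract_redirect(sample_html))
print(extract_redirect('<html></html>'))
```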

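3. Check the database to see whether this link has already been scraped, which makes the crawler incremental

Create a version table in the database to store scraped links.
On every run, compare against the records in the version table to see whether the link has been scraped before.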

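# cursor is an open pymysql cursor
cursor.execute('select * from version')
result = cursor.fetchall()
if result and result[-1][0] == two_link:
    print('Already up to date')
else:
    # New link (or empty table): start scraping here,
    # then insert the link into the version table
    pass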
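
The incremental check can be tried without a MySQL server. The following sketch swaps in Python's built-in sqlite3 module with an in-memory database purely for demonstration; the helper name is_new_link and the example link are hypothetical, and the tutorial's own code uses pymysql.

```python
import sqlite3


def is_new_link(conn, link):
    """Return True and record the link if it has not been scraped before."""
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS version (url TEXT PRIMARY KEY)')
    cur.execute('SELECT url FROM version WHERE url = ?', (link,))
    if cur.fetchone():
        return False
    # In a real crawler, scrape first and record the link only on success.
    cur.execute('INSERT INTO version VALUES (?)', (link,))
    conn.commit()
    return True


conn = sqlite3.connect(':memory:')  # in-memory database, for demonstration only
link = 'http://www.mca.gov.cn/article/sj/xzqh/2019/example.html'  # hypothetical
print(is_new_link(conn, link))  # first run: the link is new
print(is_new_link(conn, link))  # second run: already recorded
```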

4. Code implementation

import requests
from lxml import etree
import re
import pymysql


class GovementSpider(object):
    def __init__(self):
        self.url = 'http://www.mca.gov.cn/article/sj/xzqh/2019/'
        self.headers = {'User-Agent': 'Mozilla/5.0'}
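        # Create the connection and cursor objects
        # (keyword arguments: PyMySQL 1.0+ no longer accepts positional ones)
        self.db = pymysql.connect(host='127.0.0.1', user='root', password='123456', database='govdb', charset='utf8')
        self.cursor = self.db.cursor()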

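    # Get the "fake" link
    def get_false_link(self):
        html = requests.get(url=self.url, headers=self.headers).text
        # The list page hides the real second-level URL: it is generated by a JS
        # script inside the fake page's response. The fake link opens in a browser,
        # but scraping it does not return the data we want.
        parse_html = etree.HTML(html)
        a_list = parse_html.xpath('//a[@class="artitlelist"]')
        false_link = None
        for a in a_list:
            # get(): read the value of an attribute
            title = a.get('title') or ''
            if title.endswith('代码'):  # titles on the site are in Simplified Chinese
                # Stop at the first match: the first one is always the latest link
                false_link = 'http://www.mca.gov.cn' + a.get('href')
                print('Second-level "fake" link:', false_link)
                break
        # Resolve the real link (skipped if no matching title was found)
        if false_link:
            self.incr_spider(false_link)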

    # Incremental crawl
    def incr_spider(self, false_link):
        self.cursor.execute('select url from version where url=%s', [false_link])
        # fetchall returns e.g. (('http://xxxx.html',),)
        result = self.cursor.fetchall()

        # not result: this link is not yet in the version table
        if not result:
            self.get_true_link(false_link)
            # Optional: keep only the latest record in the version table
            self.cursor.execute("delete from version")

            # Record the scraped URL in the version table
            self.cursor.execute('insert into version values(%s)', [false_link])
            self.db.commit()
        else:
            print('Data is already up to date; nothing to scrape')

    # Get the real link
    def get_true_link(self, false_link):
        # Fetch the fake link's response, then pull the real link out of it
        html = requests.get(url=false_link, headers=self.headers).text
        # Extract the real link (the address the embedded JS redirects to)
        re_bds = r'window\.location\.href="(.*?)"'
        pattern = re.compile(re_bds, re.S)
        true_link = pattern.findall(html)[0]

        self.save_data(true_link)  # scrape the data behind the real link

    # Extract the data directly with XPath
    def save_data(self, true_link):
        html = requests.get(url=true_link, headers=self.headers).text

        # Base XPath: one node per table row
        parse_html = etree.HTML(html)
        tr_list = parse_html.xpath('//tr[@height="19"]')
        for tr in tr_list:
            code = tr.xpath('./td[2]/text()')[0].strip()  # administrative division code
            name = tr.xpath('./td[3]/text()')[0].strip()  # name of the division

            print(name, code)

    # Entry point
    def main(self):
        self.get_false_link()


if __name__ == '__main__':
    spider = GovementSpider()
    spider.main()

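Scraping dynamically loaded data (Ajax)

Characteristics

Right-click -> View Page Source shows none of the actual data.
The data loads when you scroll the mouse wheel or perform some other action.

How to scrape

Press F12 to open the developer tools, select the XHR filter for asynchronously loaded requests, then trigger the page action to capture the network request.
Under XHR --> Headers --> General --> Request URL, find the URL of the JSON data.
Under XHR --> Headers --> Query String Parameters, find the query parameters.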

Example: scraping Douban movie data

Goal

Page: Douban Movies - Rankings - Drama
- https://movie.douban.com/typerank?
- type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90&action=
Goal: scrape movie titles and ratings

Capturing requests with F12 (XHR)

1. Request URL (base URL): https://movie.douban.com/j/chart/top_list?

2. Query String Parameters

# The query parameters are:
type: 11 # movie genre (11 = drama)
interval_id: 100:90
action: ''  # empty
start: 0  # index of the first movie in each batch
limit: 20 # number of movies loaded per batch

The JSON data is at:

base URL + query parameters

'https://movie.douban.com/j/chart/top_list?'+'type=11&interval_id=100%3A90&action=&start=20&limit=20'
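
The query string does not have to be assembled by hand. As a sketch, the standard library's urllib.parse.urlencode produces the same encoding (including %3A for the colon in interval_id), and range gives the start offsets for paging; the helper names build_page_url and page_starts are made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/j/chart/top_list?'


def build_page_url(start, typ='11', limit=20):
    """Build the JSON URL for one batch of movies."""
    params = {
        'type': typ,
        'interval_id': '100:90',
        'action': '',
        'start': start,
        'limit': limit,
    }
    return BASE_URL + urlencode(params)


def page_starts(total, limit=20):
    """Start offsets needed to cover `total` movies, `limit` per request."""
    return list(range(0, total, limit))


print(build_page_url(20))
print(page_starts(45))
```

When the request is sent with requests.get(url, params=...), as in the implementation that follows, requests performs this encoding itself.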

Code implementation

import requests
import time
from fake_useragent import UserAgent


class DoubanSpider(object):
    def __init__(self):
        self.base_url = 'https://movie.douban.com/j/chart/top_list?'
        self.i = 0

    def get_html(self, params):
        headers = {'User-Agent': UserAgent().random}
        res = requests.get(url=self.base_url, params=params, headers=headers)
        res.encoding = 'utf-8'
        html = res.json()  # parse the JSON response body into Python objects
        self.parse_html(html)  # hand the data straight to the parser

    def parse_html(self, html):
        # html: [{movie 1}, {movie 2}, ...]
        for one in html:
            item = {}  # a fresh dict for each movie
            item['name'] = one['title']  # movie title
            item['score'] = one['score']  # rating
            item['time'] = one['release_date']  # release date
            # Print the record
            print(item)
            self.i += 1

    # Get the total number of movies
    def get_total(self, typ):
        # Asynchronously loaded data can be found among the XHR requests
        url = 'https://movie.douban.com/j/chart/top_list_count?type={}&interval_id=100%3A90'.format(typ)
        ua = UserAgent()
        html = requests.get(url=url, headers={'User-Agent': ua.random}).json()
        total = html['total']

        return total

    def main(self):
        typ = input('Enter a genre (drama|comedy|action): ')
        typ_dict = {'drama': '11', 'comedy': '24', 'action': '5'}
        typ = typ_dict[typ]
        total = self.get_total(typ)  # total number of movies in this genre

        for page in range(0, int(total), 20):
            params = {
                'type': typ,
                'interval_id': '100:90',
                'action': '',
                'start': str(page),
                'limit': '20'}
            self.get_html(params)
            time.sleep(1)
        print('Number of movies scraped:', self.i)


if __name__ == '__main__':
    spider = DoubanSpider()
    spider.main()

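Scraping Tencent job postings (Ajax)

URL and goal

URL: search Baidu for 騰訊招聘 (Tencent Careers) and open the job listings: https://careers.tencent.com/search.html
Goal: job title, responsibilities, requirements

Requirements and analysis

The page source shows that all the data we need is loaded dynamically via Ajax.
Capture the network requests with F12 and analyze them.
First-level page: job title
Second-level page: responsibilities and requirements

First-level JSON URL (pageIndex changes; timestamp is not checked)

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn

Second-level URL (postId changes and can be taken from the first-level response)

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn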
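
The two URL templates can be filled in as sketched below. The code assumes that the timestamp parameter is a Unix time in milliseconds, which the 13-digit sample values suggest but the tutorial does not confirm; since the value is reportedly not checked, reusing the captured timestamp also works. The names now_ms, list_url and detail_url are made up for illustration.

```python
import time

ONE_URL = ('https://careers.tencent.com/tencentcareer/api/post/Query?timestamp={ts}'
           '&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId='
           '&attrId=&keyword=&pageIndex={page}&pageSize=10&language=zh-cn&area=cn')
TWO_URL = ('https://careers.tencent.com/tencentcareer/api/post/ByPostId'
           '?timestamp={ts}&postId={post_id}&language=zh-cn')


def now_ms():
    """Current Unix time in milliseconds (assumed format of the timestamp parameter)."""
    return int(time.time() * 1000)


def list_url(page):
    """First-level URL for the given page index."""
    return ONE_URL.format(ts=now_ms(), page=page)


def detail_url(post_id):
    """Second-level URL for a postId taken from the first-level response."""
    return TWO_URL.format(ts=now_ms(), post_id=post_id)


print(list_url(1))
print(detail_url('123'))  # '123' is a placeholder, not a real postId
```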

The useragents.py file

ua_list = [
  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
  'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
]

import time
import json
import random
import requests
from useragents import ua_list


class TencentSpider(object):
    def __init__(self):
        self.one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
        self.two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn'
        self.f = open('tencent.json', 'w', encoding='utf-8')  # output file; overwritten on each run so it stays valid JSON
        self.item_list = []  # collected item dicts

    # Fetch a URL and return the parsed JSON
    def get_page(self, url):
        headers = {'User-Agent': random.choice(ua_list)}
        html = requests.get(url=url, headers=headers).text
        html = json.loads(html)  # convert the JSON string into Python objects

        return html

    # Main flow: collect all the data
    def parse_page(self, one_url):
        html = self.get_page(one_url)
        for job in html['Data']['Posts']:
            item = {}  # new dict per job; reusing one dict would make every list entry identical
            item['name'] = job['RecruitPostName']  # job title
            post_id = job['PostId']  # needed to build the second-level URL
            # Build the second-level URL and fetch responsibilities and requirements
            two_url = self.two_url.format(post_id)
            item['duty'], item['require'] = self.parse_two_page(two_url)
            print(item)
            self.item_list.append(item)  # add to the overall list

    # Parse the second-level page
    def parse_two_page(self, two_url):
        html = self.get_page(two_url)
        duty = html['Data']['Responsibility']  # responsibilities
        duty = duty.replace('\r\n', '').replace('\n', '')  # strip line breaks
        require = html['Data']['Requirement']  # requirements
        require = require.replace('\r\n', '').replace('\n', '')  # strip line breaks

        return duty, require

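    # Get the total number of pages
    def get_numbers(self):
        url = self.one_url.format(1)
        html = self.get_page(url)
        # 10 postings per page; round up without adding an extra page when Count is a multiple of 10
        numbers = (int(html['Data']['Count']) + 9) // 10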

        return numbers

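    def main(self):
        number = self.get_numbers()
        # Demo: scrape at most the first 2 pages; use range(1, number + 1) for everything
        for page in range(1, min(number, 2) + 1):
            one_url = self.one_url.format(page)
            self.parse_page(one_url)

        # Save to a local JSON file with json.dump
        json.dump(self.item_list, self.f, ensure_ascii=False)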
        self.f.close()


if __name__ == '__main__':
    start = time.time()
    spider = TencentSpider()
    spider.main()
    end = time.time()
    print('Elapsed time: %.2f' % (end - start))
