
Parsing JSON embedded in page markup with a Python crawler (xpath + json)

This article covers three techniques: concurrent crawling with a pool's map method, extracting page content embedded as JSON, and parsing the page with xpath. The target thread:

http://tieba.baidu.com/p/3522395718?pn=1

Page source:

<div class="l_post j_l_post l_post_bright  " data-field="{&quot;author&quot;:{&quot;user_id&quot;:503570759,&quot;user_name&quot;:&quot;\u9893\u5e9f\u4e86\u8c01\u7684\u6e05\u7eaf&quot;,&quot;name_u&quot;:&quot;%E9%A2%93%E5%BA%9F%E4%BA%86%E8%B0%81%E7%9A%84%E6%B8%85%E7%BA%AF&amp;ie=utf-8&quot;,&quot;user_sex&quot;:2,&quot;portrait&quot;:&quot;47e1e9a293e5ba9fe4ba86e8b081e79a84e6b885e7baaf031e&quot;,&quot;is_like&quot;:1,&quot;level_id&quot;:14,&quot;level_name&quot;:&quot;\u4f20\u5947\u679c\u7c89&quot;,&quot;cur_score&quot;:20947,&quot;bawu&quot;:0,&quot;props&quot;:null},&quot;content&quot;:{&quot;post_id&quot;:62866847607,&quot;is_anonym&quot;:false,&quot;open_id&quot;:&quot;tbclient&quot;,&quot;open_type&quot;:&quot;apple&quot;,&quot;date&quot;:&quot;2015-01-11 16:39&quot;,&quot;vote_crypt&quot;:&quot;&quot;,&quot;post_no&quot;:6,&quot;type&quot;:&quot;0&quot;,&quot;comment_num&quot;:123,&quot;ptype&quot;:&quot;0&quot;,&quot;is_saveface&quot;:false,&quot;props&quot;:null,&quot;post_index&quot;:4,&quot;pb_tpoint&quot;:null}}">         

The core scraping code:

def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Match with contains(): the live class attribute is
    # "l_post j_l_post l_post_bright  " (extra token, trailing spaces),
    # so an exact @class comparison finds nothing.
    content_field = selector.xpath('//div[contains(@class, "l_post_bright")]')
    item = {}
    for each in content_field:
        # lxml has already decoded the &quot; entities, so data-field is plain JSON.
        reply_info = json.loads(each.xpath('@data-field')[0])
        author = reply_info['author']['user_name']
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/'
                             'div[@class="d_post_content j_d_post_content "]/text()')[0]
        reply_time = reply_info['content']['date']
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)
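A minimal sketch of why a loose class match matters here (the div content is a made-up stand-in, but the class string is the one shown in the page source above):

```python
from lxml import etree

# The live page's class attribute is "l_post j_l_post l_post_bright  "
# (an extra token plus trailing spaces), so an exact @class comparison
# silently matches nothing; contains() is the robust choice.
html = '<html><body><div class="l_post j_l_post l_post_bright  ">post</div></body></html>'
selector = etree.HTML(html)
exact = selector.xpath('//div[@class="l_post l_post_bright "]')
fuzzy = selector.xpath('//div[contains(@class, "l_post_bright")]')
```

Here `exact` comes back empty while `fuzzy` finds the post div.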

For pages that embed JSON like this, json.loads can parse the extracted attribute. Note that lxml decodes the &quot; entities while parsing, so the attribute value is already plain JSON and no manual replacement is needed:

reply_info = json.loads(each.xpath('@data-field')[0])
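To see the entity decoding in action without any network access, here is a dependency-free sketch using a trimmed-down data-field value (the stdlib parser decodes &quot; exactly as lxml does; the field values are invented):

```python
import json
import xml.etree.ElementTree as ET

# A simplified stand-in for one Tieba post node.
snippet = ('<div data-field="{&quot;author&quot;:{&quot;user_name&quot;:&quot;demo_user&quot;},'
           '&quot;content&quot;:{&quot;date&quot;:&quot;2015-01-11 16:39&quot;}}"></div>')
node = ET.fromstring(snippet)
# The parser has already turned &quot; back into ", so the
# attribute value is valid JSON that loads directly.
reply_info = json.loads(node.get('data-field'))
```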

The JSON on this page also nests dictionaries inside dictionaries; chaining the keys walks into the inner one:

 author = reply_info['author']['user_name']
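The same chained lookup works on any parsed JSON; a small self-contained illustration (the field values are made up, mirroring the data-field layout):

```python
import json

raw = ('{"author": {"user_name": "demo_user", "level_id": 14},'
       ' "content": {"date": "2015-01-11 16:39", "post_no": 6}}')
reply_info = json.loads(raw)

# Each key peels off one layer of the nested structure.
author = reply_info['author']['user_name']
reply_time = reply_info['content']['date']
```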

The complete code for crawling the thread's user_name, reply content, and reply time (Python 2) follows:

#-*-coding:utf8-*-
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import json
import sys
reload(sys)                      # Python 2 only: re-expose setdefaultencoding
sys.setdefaultencoding('utf-8')  # make implicit str/unicode conversions use utf-8

'''Delete content.txt before re-running: the file is opened in append mode, so output from earlier runs accumulates.'''
def towrite(contentdict):
    f.writelines(u'Reply time: ' + str(contentdict['topic_reply_time']) + '\n')
    f.writelines(u'Reply content: ' + unicode(contentdict['topic_reply_content']) + '\n')
    f.writelines(u'Replied by: ' + contentdict['user_name'] + '\n\n')

def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Match with contains(): the live class attribute is
    # "l_post j_l_post l_post_bright  ", so an exact @class comparison finds nothing.
    content_field = selector.xpath('//div[contains(@class, "l_post_bright")]')
    item = {}
    for each in content_field:
        # lxml has already decoded the &quot; entities, so data-field is plain JSON.
        reply_info = json.loads(each.xpath('@data-field')[0])
        author = reply_info['author']['user_name']
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content "]/text()')[0]
        reply_time = reply_info['content']['date']
        #print content
        #print reply_time
        #print author
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)

if __name__ == '__main__':
    pool = ThreadPool(4)
    f = open('content.txt','a')
    page = []
    for i in range(1,22):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)

    results = pool.map(spider, page)
    pool.close()
    pool.join()
    f.close()
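The Pool imported from multiprocessing.dummy is in fact a thread pool that exposes the same map interface as a process pool; the fan-out in __main__ can be sketched with a trivial worker in place of spider (fetch_length is a made-up stand-in, so no network access is needed):

```python
from multiprocessing.dummy import Pool as ThreadPool

def fetch_length(url):
    # Stand-in worker; the real spider() would call requests.get(url) here.
    return len(url)

pages = ['http://tieba.baidu.com/p/3522395718?pn=' + str(i) for i in range(1, 5)]
pool = ThreadPool(4)
# map() distributes the URLs across 4 threads, blocks until all finish,
# and returns the results in input order.
lengths = pool.map(fetch_length, pages)
pool.close()
pool.join()
```

Because map preserves input order, results line up with the page list even though the threads finish in arbitrary order.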